Morphology
Read Chapter 3 - Speech and Language Processing

Motivation
Factoring words
• Cats → CAT + N(oun) + PL(ural)
Used in:
• Traditional NLP applications
• Finding word boundaries (e.g., Latin, Chinese)
• Document retrieval (keyword retrieval)
• Text classification
• …

Morphology
Morphology is the study of how words are built up from smaller meaningful units called morphemes.
Ex: disadvantages = dis + advantage + s
2 classes:
• Inflectional morphology
• Derivational morphology

Inflectional morphology
a stem + a grammatical morpheme → a word:
• the same class as the stem
• relates to the syntax of a sentence
Example: subject-verb agreement
• He hit-s the ball
• We hit the ball
Plural and possessive markers
• Cats, cat's

Derivational morphology
a stem + a grammatical morpheme → a word:
• different class, e.g., transmit → transmission (Verb to Noun)
• Irregular meaning change

Suffix   Base Verb/Adjective/Noun   Derived form
-ation   computerize (V)            computerization (N)
-ee      appoint (V)                appointee (N)
-er      love (V)                   lover (N)
-ness    fuzzy (Adj)                fuzziness (N)
-less    clue (N)                   clueless (Adj)

Problem
Build a morphological parser to compute the morphology of words:

Input     Morphological Parsed Output
Cats      Cat + N + PL
Cat       Cat + N + SG
Cities    City + N + PL
Goose     Goose + N + SG
Geese     Goose + N + PL
Gooses    Goose + V + 3SG
Merging   Merge + V + PRES-PART
caught    (catch + V + PAST-PART) or (catch + V + PAST)

Solution 1: A large dictionary
Impractical: some languages associate a single meaning with a number of distinct surface forms (600 billion in Turkish)
• German: Leben+s+versicherung+s+gesellschaft+s+angestellter
  (life+CmpAug+insurance+CmpAug+company+CmpAug+employee)
• Chinese compounding: about 3000 ‘words’ combine to yield tens of thousands

Solution 2: Look individual morphemes up
mis + interpret + ation + s → MIS + INTERPRET + noun form + plural
Unrealistic: we might not find all the pieces in the dictionary, because of interference from the sound system (phonology)
Ex: cities → citie + s; cities → citi + es
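
The failure of Solution 2 on "cities" can be sketched in a few lines. This is only an illustration: the tiny `STEMS` and `SUFFIXES` sets and the function name `naive_segmentations` are assumptions made for the example, not part of the lecture material.

```python
# Toy morpheme dictionaries; the entries are illustrative assumptions.
STEMS = {"cat", "city"}
SUFFIXES = {"", "s", "es"}


def naive_segmentations(word):
    """Return every stem + suffix split whose two pieces are both listed."""
    word = word.lower()
    results = []
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        if stem in STEMS and suffix in SUFFIXES:
            results.append((stem, suffix))
    return results


print(naive_segmentations("cats"))    # [('cat', 's')]
print(naive_segmentations("cities"))  # [] because neither 'citie' nor 'citi' is a listed stem
```

Pure lookup finds "cat + s" but cannot relate "cities" to "city", which is the gap the spelling-change rules are meant to close.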

Basic Terminology & Motivation
• Stem: core meaning unit (morpheme) of a word
• Affixes: pieces that combine with the stem to modify its meaning and grammatical functions
  Prefix: un-, anti-, etc.
  Suffix: -ity, -ation, etc.
  Infix: Tagalog: um + hingi → humingi (borrow)

Define the problem
What knowledge do we need?
• What endings follow what roots, and in what order
  Cat/cats (inflectional)
  Dog/dogged (derivational)
• Only some endings go on some words, not others
  Do+er is OK (a class of verbs), but -er does not follow be
• Spelling change rules adjust the surface form vs. the lexicon form:
  Get+er → double the t → getter
  Fox+s → insert e → foxes
  Fly+s → insert e → flyes → Y to I → flies

How to do?
• We want to model pure concatenation
• We need to ‘remember’ that certain items can only combine with certain other items
• There’s a perfect model for this: finite-state automata

Picture of finite-state automata (fsa):
[Figure: finite-state automaton]

How: 2-level machine
[Figure: a finite-state transducer between the lexicon and the surface form. Underlying form: F L Y + S; surface form: f l i e s]

Definition of finite-state automaton (fsa)
A (deterministic) finite-state automaton (FSA) is a quintuple (Q, Σ, δ, q0, F) where
• Q is a finite set of states
• Σ is a finite set of terminal symbols, the alphabet
• q0 ∈ Q is the initial state
• F ⊆ Q is the set of final states
• δ is a function from Q × Σ into Q, the transition function
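
To make the quintuple concrete, here is a minimal sketch of a deterministic FSA in Python. The class name `DFA`, its field names, and the example machine `ends_in_b` are illustrative assumptions. One simplification is labeled in the code: a missing transition is treated as rejection, whereas the definition above makes δ a total function.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: set    # Q
    alphabet: set  # Sigma
    delta: dict    # transition function, stored as {(state, symbol): state}
    start: str     # q0
    finals: set    # F

    def accepts(self, string):
        """Run the automaton over the string and report whether it ends in F."""
        state = self.start
        for symbol in string:
            key = (state, symbol)
            if key not in self.delta:  # assumption: undefined transition means reject
                return False
            state = self.delta[key]
        return state in self.finals


# Hypothetical example over the alphabet {a, b}: strings that end in "b".
ends_in_b = DFA(
    states={"q0", "q1"},
    alphabet={"a", "b"},
    delta={
        ("q0", "a"): "q0", ("q0", "b"): "q1",
        ("q1", "a"): "q0", ("q1", "b"): "q1",
    },
    start="q0",
    finals={"q1"},
)

print(ends_in_b.accepts("aab"))  # True
print(ends_in_b.accepts("aba"))  # False
```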

Formal languages & grammars
A language is a set of strings defined over some alphabet Σ, with some properties:
L = {x ∈ Σ* | P(x)}
Suppose Σ = {a, b}. Then we can have:

Plan:
1. Build fsa to recognize different stem-endings and prefix-stems
2. Build fsa to recognize spelling changes
3. Turn these into parsers by turning the fsa’s into finite-state transducers

Using fsa’s to build recognizer for morphophonemic forms
1. Build fsa system for English inflectional morphology
2. English derivational morphology fsa
3. Use this to recognize a valid word
4. Then show how to parse by extending it to a transducer
5. Add spelling-change rules

FSA for nominal inflection
Remember, we don’t have to worry about spelling changes
2 classes of word:
• Regular: cat, table, city: add s
• Irregular: goose, mouse, sheep (memorize)
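
A minimal sketch of such a nominal-inflection fsa, working on morphemes rather than letters so that spelling changes can be ignored. The state names, the arc labels, and the irregular plural entries (geese, mice) are assumptions added for the example.

```python
# Word classes; the irregular plural forms are added for illustration.
REG_NOUN = {"cat", "table", "city"}
IRREG_SG_NOUN = {"goose", "mouse", "sheep"}
IRREG_PL_NOUN = {"geese", "mice", "sheep"}

# Arcs are labeled with word classes, not with individual words.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
FINAL_STATES = {"q1", "q2"}


def arc_labels(morpheme):
    """Return every arc label a morpheme can match (sheep matches two)."""
    labels = set()
    if morpheme in REG_NOUN:
        labels.add("reg-noun")
    if morpheme in IRREG_SG_NOUN:
        labels.add("irreg-sg-noun")
    if morpheme in IRREG_PL_NOUN:
        labels.add("irreg-pl-noun")
    if morpheme == "-s":
        labels.add("plural-s")
    return labels


def accepts(morphemes):
    """Recognize a morpheme sequence such as ['cat', '-s']."""
    states = {"q0"}
    for morpheme in morphemes:
        states = {
            TRANSITIONS[(state, label)]
            for state in states
            for label in arc_labels(morpheme)
            if (state, label) in TRANSITIONS
        }
        if not states:
            return False
    return bool(states & FINAL_STATES)


print(accepts(["cat", "-s"]))    # True
print(accepts(["goose", "-s"]))  # False: irregular nouns do not take -s
print(accepts(["geese"]))        # True
```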

Resulting fsa
[Figure: resulting fsa]

English derivational morphology
Much more complex than inflectional
Consider adjectives:
• Big, bigger, biggest
• Cool, cooler, coolest, coolly
• Clear, clearer, clearest, clearly, unclear, unclearly
• Happy, happier, happiest, happily
• Unhappy, unhappier, unhappiest, unhappily

Will this fsa work?
[Figure: fsa for the adjectives above]
NO!
• Accepts all adjectives above, but
• Also accepts unbig, realest
• Common problem: overgeneration
Solution?
• Need classes of roots that say which can occur with which suffixes

Revised picture
[Figure: revised fsa]
More English
[Figure: fsa for more English]
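
Overgeneration and the root-class fix can be sketched as follows. The analysis works on (prefix, root, suffix) triples so that spelling changes stay out of the picture. The class assignments in `ROOT_CLASSES` are assumptions chosen to match the examples above, not a claim about the full English lexicon.

```python
ROOTS = {"big", "cool", "clear", "happy", "real"}
PREFIXES = {"", "un"}
SUFFIXES = {"", "er", "est", "ly"}


def naive_accepts(prefix, root, suffix):
    """One undifferentiated root class: any prefix and suffix go with any root."""
    return prefix in PREFIXES and root in ROOTS and suffix in SUFFIXES


# Hypothetical root classes: each root lists the affixes it may combine with.
ROOT_CLASSES = {
    "big":   {"prefixes": {""},       "suffixes": {"", "er", "est"}},
    "cool":  {"prefixes": {""},       "suffixes": {"", "er", "est", "ly"}},
    "clear": {"prefixes": {"", "un"}, "suffixes": {"", "er", "est", "ly"}},
    "happy": {"prefixes": {"", "un"}, "suffixes": {"", "er", "est", "ly"}},
    "real":  {"prefixes": {"", "un"}, "suffixes": {"", "ly"}},
}


def class_accepts(prefix, root, suffix):
    """Accept only the affixes licensed by the root's class."""
    entry = ROOT_CLASSES.get(root)
    if entry is None:
        return False
    return prefix in entry["prefixes"] and suffix in entry["suffixes"]


print(naive_accepts("un", "big", ""))       # True: overgenerates 'unbig'
print(class_accepts("un", "big", ""))       # False
print(naive_accepts("", "real", "est"))     # True: overgenerates 'realest'
print(class_accepts("", "real", "est"))     # False
print(class_accepts("un", "clear", "ly"))   # True: 'unclearly'
```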

FSA at the level of individual letters
[Figure: letter-level fsa]
Aardvarks, foxs, …

From recognizer to transducer
Why: need to map (correspond) inputs and outputs (e.g., goose-geese)
A finite state transducer is a quintuple (Q, Σ, q0, F, δ):
• Q, a finite set of states
• Σ, a finite alphabet of complex symbols. Each is an input-output pair i:o, with i from alphabet I and o from alphabet O, so Σ ⊆ I × O. I and O can include the empty symbol ε or λ
• q0, a start state
• F, the set of final states, F ⊆ Q
• δ, the transition function between states
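
A minimal sketch of a transducer over input:output pairs, assuming arcs are stored as (source, input, output, target) tuples and the empty string stands for ε. The state numbers, the tags +N, +SG, +PL, and the function name `transduce` are choices made for this example. The search assumes there are no ε-input cycles.

```python
# Arcs: (source state, input symbol, output symbol, target state); "" is epsilon.
ARCS = [
    # cat / cats
    (0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),
    (3, "", "+N", 4),
    (4, "s", "+PL", 5),
    (4, "", "+SG", 5),
    # geese -> goose +N +PL: note the e:o pairs
    (0, "g", "g", 10), (10, "e", "o", 11), (11, "e", "o", 12),
    (12, "s", "s", 13), (13, "e", "e", 14),
    (14, "", "+N", 15), (15, "", "+PL", 5),
]
START = 0
FINALS = {5}


def transduce(text, arcs=ARCS, start=START, finals=FINALS):
    """Return the set of output strings the transducer pairs with text."""
    results = set()
    stack = [(start, 0, "")]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text) and state in finals:
            results.add(out)
        for src, inp, outp, dst in arcs:
            if src != state:
                continue
            if inp == "":  # epsilon input: emit output without consuming input
                stack.append((dst, pos, out + outp))
            elif pos < len(text) and text[pos] == inp:
                stack.append((dst, pos + 1, out + outp))
    return results


print(transduce("cats"))   # {'cat+N+PL'}
print(transduce("cat"))    # {'cat+N+SG'}
print(transduce("geese"))  # {'goose+N+PL'}
```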

FSA vs. FST
• An FSA defines a formal language (a set of strings)
• An FST defines a relation between sets of strings (defines a set of pairs of strings)

FSTs in morphological processing
2 operations
• Composition: if transducer T1 maps from I1 to O1 and T2 from I2 to O2, then T1 ∘ T2 maps from I1 to O2
  Useful to replace a series of transducers
• Inversion: T⁻¹ switches the input and output labels of T
  Useful to convert a parser to a generator
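
Since an FST defines a set of string pairs, both operations can be sketched directly on finite relations. This is a simplification: a practical toolkit composes and inverts the machines themselves rather than enumerating pairs. The intermediate form "cat+s" and the names `surface_to_morphemes` and `morphemes_to_tags` are assumptions made for the example.

```python
def invert(relation):
    """Swap input and output in every pair."""
    return {(out, inp) for (inp, out) in relation}


def compose(t1, t2):
    """Pair t1's inputs with t2's outputs wherever t1's output equals t2's input."""
    return {
        (in1, out2)
        for (in1, out1) in t1
        for (in2, out2) in t2
        if out1 == in2
    }


# Hypothetical two-stage analysis: surface form -> morphemes -> tagged analysis.
surface_to_morphemes = {("cats", "cat+s"), ("foxes", "fox+s")}
morphemes_to_tags = {("cat+s", "cat N PL"), ("fox+s", "fox N PL")}

parser = compose(surface_to_morphemes, morphemes_to_tags)
generator = invert(parser)

print(("cats", "cat N PL") in parser)       # True
print(("fox N PL", "foxes") in generator)   # True
```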

Automaton for singular/plural suffix, call this Tnum
Automaton for stems, call this Tstem
(cats#, cat N PL)
(geese#, goose N PL)
Tlex = Tnum ∘ Tstem

Spelling change rules

Name                                Description                                             Example
Consonant doubling (gemination, G)  1-letter consonant doubled before -ing/-ed              beg/begging
E deletion (elision, EL)            Silent e dropped before -ing, -ed                       make/making
E insertion (epenthesis, EP)        e added after -s, -z, -x, -ch, -sh before -s            fox/foxes
Y replacement (Y)                   -y changes to -ie before -s, and to -i before -ed       try/tries
I spelling (I)                      i goes to y before vowel                                lie/lying
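
The five rules can be approximated with a few string tests. This is a rough sketch, not a full account of English orthography: the function name `attach` is hypothetical, consonant doubling is restricted to one-syllable stems ending in a single vowel plus a single consonant, and only the suffixes -s, -ed, -ing are considered.

```python
import re


def attach(stem, suffix):
    """Join stem + suffix, applying the first spelling rule that matches."""
    # I spelling: ie -> y before -ing (lie + ing -> lying)
    if stem.endswith("ie") and suffix == "ing":
        return stem[:-2] + "y" + suffix
    # E deletion: silent e dropped before -ing, -ed (make + ing -> making)
    if stem.endswith("e") and suffix in ("ing", "ed"):
        return stem[:-1] + suffix
    # E insertion: e added after s, z, x, ch, sh before -s (fox + s -> foxes)
    if suffix == "s" and re.search(r"(s|z|x|ch|sh)$", stem):
        return stem + "e" + suffix
    # Y replacement: consonant + y -> ie before -s, -> i before -ed
    if re.search(r"[^aeiou]y$", stem):
        if suffix == "s":
            return stem[:-1] + "ies"
        if suffix == "ed":
            return stem[:-1] + "ied"
    # Consonant doubling (simplified): one vowel + one final consonant, -ing/-ed
    if suffix in ("ing", "ed") and re.search(r"^[^aeiou]*[aeiou][^aeiouwxy]$", stem):
        return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("beg", "ing"), ("make", "ing"), ("fox", "s"),
                     ("try", "s"), ("lie", "ing"), ("cat", "s")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```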

So another view of the situation is this (see notes2)
[Figure: two transducers in sequence. FST1 (word classes) and FST2 (spell changes) relate the underlying form F O X + S # to the surface form f o x e s #]

Fst spelling of “foxes” ↔ “FOX+S”: recognizing ‘foxes’
[Figure: trace of the recognizer on ‘foxes’. The root is always the 1st ‘class’. Arcs shown include F/f, O/o, X/x, +/e, S/s, #/#. One alternative path blocks with leftover input s; the surviving path reaches the end with the analysis Fox+s, Plural]

Two-level morphology parsing (analysis) algorithm
1. Initialize the set of paths to P = {}.
2. Read input symbols, one at a time.
3. At each symbol, generate all lexical symbols possibly corresponding to the 0 (empty) symbol.
4. Extend all paths in P by all such possible (x:0) pairs.
5. Check each new path extension against the phonological FST and lexical FSA (lexical symbols only); delete impossible path prefixes.
6. Repeat 4-5 until the max. # of consecutive 0s is reached.
7. Generate all possible lexical symbols (get from all FSTs) for the current input symbol, and form pairs.
8. Extend all paths from P using all such pairs.
9. Check all paths from P (next step in FST/FSA). Delete all outright impossible paths.
10. Repeat from 3 until end of input.
11. Collect lexical “glosses” from all surviving paths.
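
The steps above can be sketched for a toy case. Everything here is an assumption made for illustration: the lexicon holds four lexical strings, the only spelling rule is e-insertion after s, z, x before a word-final -s, at most one empty-surface pair (+:0) is tried per input symbol, and the lexical FSA check is replaced by a prefix test against `LEXICON`.

```python
LEXICON = {"fox#", "fox+s#", "cat#", "cat+s#"}  # toy lexical strings
SIBILANTS = set("szx")


def viable(prefix):
    """Stand-in for the lexical FSA: is this a prefix of some lexical string?"""
    return any(form.startswith(prefix) for form in LEXICON)


def step(state, lex, surf):
    """Toy e-insertion rule FST over lex:surf pairs; None means the path dies."""
    if state == "end":
        return None
    if state == "q2":  # e was just inserted: next pair must be s:s
        return "q4" if (lex, surf) == ("s", "s") else None
    if state == "q4":  # inserted e + s must close the word
        return "end" if (lex, surf) == ("#", "#") else None
    if lex == "#":
        return "end" if surf == "#" else None
    if lex == "+":
        if state == "q1":  # after a sibilant: +:e or +:0
            return "q2" if surf == "e" else "q3"
        return "q0" if surf == "" else None
    if surf != lex:
        return None
    if state == "q3" and lex == "s":  # + deleted after sibilant blocks -s
        return None
    return "q1" if lex in SIBILANTS else "q0"


def parse(surface):
    paths = [("", "q0")]  # (lexical string so far, rule state)
    for surf in surface:
        # Steps 3-6: try one lexical symbol paired with the empty surface symbol.
        extended = list(paths)
        for lex_str, state in paths:
            nxt = step(state, "+", "")
            if nxt is not None and viable(lex_str + "+"):
                extended.append((lex_str + "+", nxt))
        # Steps 7-9: pair the current surface symbol with each lexical candidate.
        candidates = ["e", "+"] if surf == "e" else [surf]
        paths = []
        for lex_str, state in extended:
            for lex in candidates:
                nxt = step(state, lex, surf)
                if nxt is not None and viable(lex_str + lex):
                    paths.append((lex_str + lex, nxt))
    # Step 11: collect the lexical strings of the surviving paths.
    return [lex_str for lex_str, state in paths
            if state == "end" and lex_str in LEXICON]


print(parse("foxes#"))  # ['fox+s#']
print(parse("cats#"))   # ['cat+s#']
print(parse("foxs#"))   # []
```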

Generation algorithm
• Do not use the lexicon (well, you have to put the “right” lexical strings together somehow!)
• Start with a lexical string L.
• Generate all possible pairs l:s for every symbol in L.
• Find all (hopefully only 1!) traversals through the FST which end in a final state.
• From all such traversals, print out the sequence of surface letters.
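
A minimal sketch of this generation procedure, assuming a lexical string written like "fox+s#" (with + as the morpheme boundary and # as the word end) and a toy rule FST that only models e-insertion after s, z, x before a word-final -s. The state names and the functions `candidates`, `step`, and `generate` are hypothetical.

```python
SIBILANTS = set("szx")


def candidates(lex):
    """All surface symbols a lexical symbol may pair with ('' is the empty symbol)."""
    if lex == "+":
        return ["", "e"]
    return [lex]


def step(state, lex, surf):
    """Toy e-insertion rule FST over lex:surf pairs; None means no transition."""
    if state == "end":
        return None
    if state == "q2":  # e was just inserted: next pair must be s:s
        return "q4" if (lex, surf) == ("s", "s") else None
    if state == "q4":  # inserted e + s must close the word
        return "end" if (lex, surf) == ("#", "#") else None
    if lex == "#":
        return "end" if surf == "#" else None
    if lex == "+":
        if state == "q1":  # after a sibilant: +:e or +:0
            return "q2" if surf == "e" else "q3"
        return "q0" if surf == "" else None
    if surf != lex:
        return None
    if state == "q3" and lex == "s":  # + deleted after sibilant blocks -s
        return None
    return "q1" if lex in SIBILANTS else "q0"


def generate(lexical):
    """Return the surface string of every traversal that ends in the final state."""
    results = []

    def walk(pos, state, surface):
        if pos == len(lexical):
            if state == "end":
                results.append(surface)
            return
        lex = lexical[pos]
        for surf in candidates(lex):  # all possible pairs l:s for this symbol
            nxt = step(state, lex, surf)
            if nxt is not None:
                walk(pos + 1, nxt, surface + surf)

    walk(0, "q0", "")
    return results


print(generate("fox+s#"))  # ['foxes#']
print(generate("cat+s#"))  # ['cats#']
```

In both examples exactly one traversal survives, which matches the "hopefully only 1" remark above.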