2 morphology

Morphology

Read Chapter 3 - Speech and Language Processing

Motivation

Factoring words
• Cats → CAT + N(oun) + PL(ural)

Used in:
• Traditional NLP applications
• Finding word boundaries (e.g., Latin, Chinese)
• Document retrieval (keyword retrieval)
• Text classification

Morphology

• Morphology is the study of how words are built up from smaller meaningful units called morphemes
• Ex: disadvantages = dis + advantage + s
• 2 classes:
  ◦ Inflectional morphology
  ◦ Derivational morphology

Inflectional morphology

• a stem + a grammatical morpheme → a word:
  ◦ the same class as the stem
  ◦ relates to the syntax of a sentence
• Example: subject-verb agreement
  ◦ He hit-s the ball
  ◦ We hit the ball
• Plural and possessive markers
  ◦ Cats, cat’s


Derivational morphology

• a stem + a grammatical morpheme → a word:
  ◦ different class, e.g., transmit → transmission (Verb to Noun)
  ◦ Irregular meaning change

Suffix    Base (V/Adj/N)     Derived form
-ation    computerize (V)    computerization (N)
-ee       appoint (V)        appointee (N)
-er       love (V)           lover (N)
-ness     fuzzy (Adj)        fuzziness (N)
-less     clue (N)           clueless (Adj)

Problem

Build a morphological parser to compute the morphology of words:

Input      Morphological Parsed Output
Cats       Cat + N + PL
Cat        Cat + N + SG
Cities     City + N + PL
Goose      Goose + N + SG
Geese      Goose + N + PL
Gooses     Goose + V + 3SG
Merging    Merge + V + PRES-PART
caught     (catch + V + PAST) or (catch + V + PAST-PART)

Solution 1: A large dictionary

• Impractical: some languages associate a single meaning with a number of distinct surface forms (600 billion in Turkish)
• German: Leben+s+versicherung+s+gesellschaft+s+angestellter (life+CmpAug+insurance+CmpAug+company+CmpAug+employee)
• Chinese compounding: about 3000 ‘words,’ combine to yield tens of thousands

Solution 2: Look up individual morphemes

• mis + interpret + ation + s → MIS + INTERPRET + noun form + plural
• Unrealistic: we might not find all the pieces in the dictionary, because of interference from the sound system (phonology)
  ◦ Ex: cities → citie + s; cities → citi + es
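To make the weakness of plain morpheme lookup concrete, here is a minimal sketch that splits a word into dictionary pieces by exhaustive search. The function name `segment` and the toy dictionary `MORPHEMES` are assumptions for illustration, not part of the slides.

```python
def segment(word, morphemes):
    """Return every way to split `word` into pieces that all occur in `morphemes`."""
    if not word:
        return [[]]  # one way to split the empty string: no pieces
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in morphemes:
            for rest in segment(word[i:], morphemes):
                results.append([piece] + rest)
    return results


# Hypothetical toy morpheme dictionary (illustrative only).
MORPHEMES = {"mis", "interpret", "ation", "s", "city"}

print(segment("misinterpretations", MORPHEMES))  # pure concatenation works
print(segment("cities", MORPHEMES))              # spelling change: no split found
```

The second call returns an empty list: neither "citie" nor "citi" is in the dictionary, which is exactly the interference from spelling and phonology noted above.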


Basic Terminology & Motivation

• Stem: core meaning unit (morpheme) of a word
• Affixes: pieces that combine with the stem to modify its meaning and grammatical functions
  ◦ Prefix: un-, anti-, etc.
  ◦ Suffix: -ity, -ation, etc.
  ◦ Infix: Tagalog: um + hingi → humingi (borrow)

Define the problem

What knowledge do we need?
• What endings follow what roots, and in what order
  ◦ Cat/cats (inflectional)
  ◦ Dog/dogged (derivational)
• Only some endings go on some words, not others
  ◦ Do+er OK (a class of verbs), but -er cannot follow be
• Spelling change rules adjust the surface form vs. the lexicon form:
  ◦ Get+er → double the t → getter
  ◦ Fox+s → insert e → foxes
  ◦ Fly+s → insert e → flyes → Y to I → flies

How to do?

• We want to model pure concatenation
• We need to ‘remember’ that certain items can only combine with certain other items
• There’s a perfect model for this: finite-state automata

[Figure: picture of finite-state automata (fsa)]


How: 2-level machine

[Figure: 2-level machine. A finite-state transducer, together with the lexicon, relates the surface form f l i e s to the underlying form F L Y + S]

Definition of finite-state automaton (fsa)

A (deterministic) finite-state automaton (FSA) is a quintuple (Q, Σ, δ, q0, F) where
• Q is a finite set of states
• Σ is a finite set of terminal symbols, the alphabet
• q0 ∈ Q is the initial state
• F ⊆ Q, the set of final states
• δ is a function from Q × Σ into Q, the transition function
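The quintuple can be written down almost literally as a data structure. The sketch below is one possible encoding, not a reference implementation; the class name `DFA`, its field names, and the example automaton `even_a` (strings over {a, b} with an even number of a's) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: set     # Q
    alphabet: set   # Σ
    delta: dict     # δ, stored as {(state, symbol): next_state}
    start: str      # q0
    finals: set     # F

    def accepts(self, string):
        """Run the automaton over `string`; accept if it ends in a final state."""
        state = self.start
        for symbol in string:
            key = (state, symbol)
            if key not in self.delta:
                # A missing arc is treated as an implicit dead (sink) state.
                return False
            state = self.delta[key]
        return state in self.finals


# Hypothetical example: strings over {a, b} containing an even number of a's.
even_a = DFA(
    states={"even", "odd"},
    alphabet={"a", "b"},
    delta={
        ("even", "a"): "odd", ("even", "b"): "even",
        ("odd", "a"): "even", ("odd", "b"): "odd",
    },
    start="even",
    finals={"even"},
)

print(even_a.accepts("abab"))
print(even_a.accepts("ab"))
```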

Formal languages & grammars

• A language is a set of strings defined over some alphabet Σ, with some properties:
• Suppose Σ = {a, b}. Then we can have:

  L = {x ∈ Σ* | P(x)}

Plan:

1. Build fsa to recognize different stem-endings and prefix-stems
2. Build fsa to recognize spelling changes
3. Turn these into parsers by turning the fsa’s into finite-state transducers



Using fsa’s to build recognizer for morphophonemic forms

1. Build fsa system for English inflectional morphology
2. English derivational morphology fsa
3. Use this to recognize a valid word
4. Then show how to parse by extending to a transducer
5. Add spelling-change rules

FSA for nominal inflection

• Remember, we don’t have to worry about spelling changes
• 2 classes of word:
  ◦ Regular: cat, table, city: add s
  ◦ Irregular: goose, mouse, sheep (memorize)
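A small sketch of such an fsa at the morpheme level, ignoring spelling changes as the slide says. The state names (q0, q1, q2), the arc labels, and the irregular plural forms geese and mice are assumptions added for illustration; the input is taken to be already split into morphemes.

```python
REG_NOUN = {"cat", "table", "city"}
IRREG_SG_NOUN = {"goose", "mouse", "sheep"}
IRREG_PL_NOUN = {"geese", "mice", "sheep"}  # assumed plural forms

# Arcs of the automaton: (state, arc label) -> next state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
FINALS = {"q1", "q2"}


def labels(morpheme):
    """Arc labels a morpheme can match (sheep matches two of them)."""
    found = set()
    if morpheme in REG_NOUN:
        found.add("reg-noun")
    if morpheme in IRREG_SG_NOUN:
        found.add("irreg-sg-noun")
    if morpheme in IRREG_PL_NOUN:
        found.add("irreg-pl-noun")
    if morpheme == "s":
        found.add("plural-s")
    return found


def accepts(morphemes):
    """Track every reachable state; accept if any final state is reached."""
    states = {"q0"}
    for morpheme in morphemes:
        states = {
            TRANSITIONS[(q, lab)]
            for q in states
            for lab in labels(morpheme)
            if (q, lab) in TRANSITIONS
        }
    return bool(states & FINALS)


print(accepts(["cat", "s"]))    # regular noun + plural s
print(accepts(["goose", "s"]))  # irregular nouns take no plural s
```

Note that city + s is accepted here as a morpheme sequence; turning it into the surface form cities is left to the spelling-change rules.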

Resulting fsa

[Figure: resulting fsa]

English derivational morphology

• Much more complex than inflectional
• Consider adjectives:
  ◦ Big, bigger, biggest
  ◦ Cool, cooler, coolest, coolly
  ◦ Clear, clearer, clearest, clearly, unclear, unclearly
  ◦ Happy, happier, happiest, happily
  ◦ Unhappy, unhappier, unhappiest, unhappily



Will this fsa work?

[Figure: candidate fsa for the adjectives above]

Will this fsa work? NO!

• Accepts all adjectives above, but
• Also accepts unbig, realest
• Common problem: overgeneration
• Solution?
• Need classes of roots that say which can occur with which suffixes
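The contrast between the naive automaton and one with root classes can be sketched as follows. Both checkers work on morpheme sequences of the shape (un) root (suffix). The per-root entries in `ROOTS` are assumptions read off the adjective list above (plus an assumed entry for real), and the function names are hypothetical.

```python
SUFFIXES = {"er", "est", "ly"}

# Hypothetical root classes: whether un- is allowed, and which suffixes attach.
ROOTS = {
    "big":   {"un": False, "suffixes": {"er", "est"}},
    "cool":  {"un": False, "suffixes": {"er", "est", "ly"}},
    "clear": {"un": True,  "suffixes": {"er", "est", "ly"}},
    "happy": {"un": True,  "suffixes": {"er", "est", "ly"}},
    "real":  {"un": True,  "suffixes": {"ly"}},
}


def split_shape(morphemes):
    """Match the shape (un)? root (suffix)?; return (has_un, root, suffix) or None."""
    rest = list(morphemes)
    has_un = bool(rest) and rest[0] == "un"
    if has_un:
        rest = rest[1:]
    if len(rest) == 1:
        return has_un, rest[0], None
    if len(rest) == 2:
        return has_un, rest[0], rest[1]
    return None


def naive_accepts(morphemes):
    """One adj-root class: any root combines with any affix (overgenerates)."""
    shape = split_shape(morphemes)
    if shape is None:
        return False
    _, root, suffix = shape
    return root in ROOTS and (suffix is None or suffix in SUFFIXES)


def class_accepts(morphemes):
    """Root classes: each root lists the affixes it may combine with."""
    shape = split_shape(morphemes)
    if shape is None:
        return False
    has_un, root, suffix = shape
    entry = ROOTS.get(root)
    if entry is None or (has_un and not entry["un"]):
        return False
    return suffix is None or suffix in entry["suffixes"]


for word in (["un", "big"], ["real", "est"], ["un", "happy", "est"]):
    print(word, naive_accepts(word), class_accepts(word))
```

In automaton terms, each distinct entry in `ROOTS` corresponds to a separate class of root arcs leading to a state with its own outgoing suffix arcs.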

[Figure: Revised picture]

[Figure: More English]


FSA at the level of individual letters

[Figure: letter-level fsa]
• Aardvarks, foxs, …

From recognizer to transducer

• Why: need to map (correspond) inputs and outputs (e.g., goose-geese)
• A finite state transducer is a quintuple:
  ◦ Q, a finite set of states
  ◦ Σ, a finite alphabet of complex symbols. Each is an input-output pair i:o, with i from alphabet I and o from alphabet O, so Σ ⊆ I × O. I and O can include the empty symbol ε or λ
  ◦ q0, a start state
  ◦ F, the set of final states, F ⊆ Q
  ◦ δ, the transition function between states

FSTs in morphological processing

FSA vs. FST

• An FSA defines a formal language (a set of strings)
• An FST defines a relation between sets of strings (defines a set of pairs of strings)

2 operations

• Composition: if transducer T1 maps from I1 to O1 and T2 from I2 to O2, then T1 ∘ T2 maps from I1 to O2
  ◦ Useful to replace series of transducers
• Inversion: T⁻¹ switches the input and output labels of T
  ◦ Useful to convert parser to generator
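Since an FST defines a set of string pairs, both operations can be illustrated on small finite relations, without building any automaton. This is only a sketch of the semantics; the relations `T_SPELL` and `T_LEX`, the intermediate notation fox^s#, and the function names are assumptions for illustration.

```python
def compose(t1, t2):
    """All pairs (x, z) such that (x, y) is in t1 and (y, z) is in t2 for some y."""
    return {(x, z) for (x, y1) in t1 for (y2, z) in t2 if y1 == y2}


def invert(t):
    """Switch the input and output side of every pair."""
    return {(y, x) for (x, y) in t}


# Hypothetical toy relations.
T_SPELL = {("foxes", "fox^s#"), ("cats", "cat^s#")}            # surface -> intermediate
T_LEX = {("fox^s#", "fox +N +PL"), ("cat^s#", "cat +N +PL")}   # intermediate -> lexical

parser = compose(T_SPELL, T_LEX)   # surface -> lexical, replaces the two-step series
generator = invert(parser)         # lexical -> surface

print(sorted(parser))
print(sorted(generator))
```

Real transducers represent infinite relations, so composition and inversion are computed on the automata themselves, but the resulting relation is the one shown here.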


[Figure: automaton for singular/plural suffix, call this Tnum]

[Figure: automaton for stems, call this Tstem]

• (cats#, cat N PL)
• (geese#, goose N PL)

Tlex = Tnum ∘ Tstem

Spelling change rules

Name                                  Description                                          Example
Consonant doubling (gemination, G)    1-letter consonant doubled before -ing/-ed           beg/begging
E deletion (elision, EL)              Silent e dropped before -ing, -ed                    make/making
E insertion (epenthesis, EP)          e added after -s, -z, -x, -ch, -sh before -s         fox/foxes
Y replacement (Y)                     -y changes to -ie before -s, and to -i before -ed    try/tries
I spelling (I)                        i goes to y before vowel                             lie/lying
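The rules in the table can be approximated with a few pattern checks on a stem and a suffix. This is a rough sketch, not the transducer formulation used in the slides: the function name `attach` is hypothetical, and the consonant-doubling test is a simplification that ignores stress (it would wrongly double the n in open + ing).

```python
import re


def attach(stem, suffix):
    """Join stem and suffix, applying simplified English spelling-change rules."""
    # E insertion (epenthesis): fox + s -> foxes
    if suffix == "s" and re.search(r"(s|z|x|ch|sh)$", stem):
        return stem + "es"
    # Y replacement: try + s -> tries, try + ed -> tried
    if suffix in ("s", "ed") and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + ("ies" if suffix == "s" else "ied")
    # I spelling: lie + ing -> lying
    if suffix == "ing" and stem.endswith("ie"):
        return stem[:-2] + "ying"
    # E deletion (elision): make + ing -> making
    if suffix in ("ing", "ed") and stem.endswith("e"):
        return stem[:-1] + suffix
    # Consonant doubling (gemination): beg + ing -> begging
    if suffix in ("ing", "ed") and re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem):
        return stem + stem[-1] + suffix
    # Default: pure concatenation
    return stem + suffix


for stem, suffix in [("beg", "ing"), ("make", "ing"), ("fox", "s"),
                     ("try", "s"), ("lie", "ing"), ("cat", "s")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```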


So another view of the situation is this (see notes2)

Fst spelling of “foxes” ↔ “FOX+S”: recognizing ‘foxes’

[Figure: two transducers, FST1 (word classes) and FST2 (spell changes), recognizing ‘foxes’. The root is always the 1st ‘class’. State labels shown: root, Noun, C1, C2. Arc labels shown: F/f, O/o, X/x, +/0, +/e, 0/e, S/s, #/# and f:f, o:o, x:x, +:e, e:e, s:s, #:#. Tapes: f o x e s # (surface), F O X + S # (underlying). Annotations: “Automaton blocks”, “leftover input s”, “Fox+s, Plural”, “END!”]

Two-level morphology parsing (analysis) algorithm


1. Initialize set of paths to P = {}.
2. Read input symbols, one at a time.
3. At each symbol, generate all lexical symbols possibly corresponding to the 0 (empty) symbol.
4. Prolong all paths in P by all such possible (x:0) pairs.
5. Check each new path extension against the phonological FST and lexical FSA (lexical symbols only); delete impossible path prefixes.
6. Repeat 4-5 until max. # of consecutive 0s reached.

Parsing Algorithm, cont’d

7. Generate all possible lexical symbols (get from all FSTs) for the current input symbol, form pairs.
8. Extend all paths from P using all such pairs.
9. Check all paths from P (next step in FST/FSA). Delete all outright impossible paths.
10. Repeat from 3 until end of input.
11. Collect lexical “glosses” from all surviving paths.
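A heavily simplified sketch of the path-extension idea in steps 1-11, for one rule only (e insertion). Several things here are assumptions and departures from the algorithm as stated: the toy `LEXICON` and its glosses are invented, at most one consecutive 0 pair is allowed, the lexical check is a plain prefix test instead of a lexical FSA, and the spelling rule is checked once per finished path rather than incrementally against a phonological FST.

```python
# Hypothetical lexical strings with their glosses.
LEXICON = {
    "fox": "fox +N +SG", "fox+s": "fox +N +PL",
    "cat": "cat +N +SG", "cat+s": "cat +N +PL",
}


def is_lexical_prefix(path):
    """Lexical-side check: is the lexical string so far a prefix of some entry?"""
    lexical = "".join(lex for lex, _ in path)
    return any(entry.startswith(lexical) for entry in LEXICON)


def spelling_ok(path):
    """Simplified e-insertion rule: '+' surfaces as e exactly between s/x/z and s."""
    for i, (lex, surf) in enumerate(path):
        if lex != "+":
            continue
        prev_lex = path[i - 1][0] if i > 0 else ""
        next_lex = path[i + 1][0] if i + 1 < len(path) else ""
        needs_e = prev_lex in ("s", "x", "z") and next_lex == "s"
        if surf != ("e" if needs_e else ""):
            return False
    return True


def analyze(surface):
    paths = {()}  # each path is a tuple of (lexical, surface) pairs
    for ch in surface:
        # Steps 3-6: optionally prolong each path by one (+:0) pair, then prune.
        with_zero = paths | {p + (("+", ""),) for p in paths}
        with_zero = {p for p in with_zero if is_lexical_prefix(p)}
        # Steps 7-8: pair the current input symbol with its lexical candidates.
        candidates = [(ch, ch)] + ([("+", "e")] if ch == "e" else [])
        paths = {p + (pair,) for p in with_zero for pair in candidates}
        # Step 9: delete impossible paths.
        paths = {p for p in paths if is_lexical_prefix(p)}
    # Step 11: collect glosses from surviving, complete, rule-respecting paths.
    glosses = []
    for p in paths:
        lexical = "".join(lex for lex, _ in p)
        if lexical in LEXICON and spelling_ok(p):
            glosses.append(LEXICON[lexical])
    return sorted(glosses)


print(analyze("foxes"))
print(analyze("foxs"))
```

The surface string foxs leaves no surviving path, because the boundary after x and before s must surface as e.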


Generation algorithm

• Do not use the lexicon (well, you have to put the “right” lexical strings together somehow!)
• Start with a lexical string L.
• Generate all possible pairs l:s for every symbol in L.
• Find all (hopefully only 1!) traversals through the FST which end in a final state.
• From all such traversals, print out the sequence of surface letters.
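The generation steps can be sketched with a tiny hand-built transducer for e insertion. The four states (q0 to q3), the function names, and the restriction to the sibilants s, x, z (no ch/sh, no # boundary symbol) are assumptions made to keep the example short; this is not the transducer from the slides.

```python
SIBILANTS = set("sxz")
FINAL_STATES = {"q0", "q1"}


def step(state, lex, surf):
    """Transition function of the toy e-insertion FST; None means no arc.

    q0: no relevant context      q1: just read s, x or z
    q2: read sibilant, then +:e  (the next pair must be s:s)
    q3: read sibilant, then +:0  (the next pair must not be s:s)
    """
    if lex == "+":
        if surf == "":
            return {"q0": "q0", "q1": "q3"}.get(state)
        if surf == "e":
            return "q2" if state == "q1" else None
        return None
    if lex != surf:
        return None  # ordinary letters only map to themselves
    if state == "q2" and lex != "s":
        return None
    if state == "q3" and lex == "s":
        return None
    return "q1" if lex in SIBILANTS else "q0"


def candidate_pairs(lex):
    """All l:s pairs for one lexical symbol ('' plays the role of 0)."""
    return [("+", ""), ("+", "e")] if lex == "+" else [(lex, lex)]


def generate(lexical):
    traversals = [("q0", "")]  # (current state, surface string so far)
    for lex in lexical:
        extended = []
        for state, surface in traversals:
            for l, s in candidate_pairs(lex):
                nxt = step(state, l, s)
                if nxt is not None:
                    extended.append((nxt, surface + s))
        traversals = extended
    # Keep only traversals that end in a final state.
    return sorted(surface for state, surface in traversals if state in FINAL_STATES)


print(generate("fox+s"))
print(generate("cat+s"))
```

For these inputs exactly one traversal survives, which matches the "hopefully only 1" remark above.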


