
PORTABLE LANGUAGE TECHNOLOGY:
A RESOURCE-LIGHT APPROACH TO MORPHO-SYNTACTIC TAGGING
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of The Ohio State University
By
Anna Feldman
*****
The Ohio State University
2006
Dissertation Committee:
Professor Christopher H. Brew, Advisor
Professor Brian D. Joseph
Professor W. Detmar Meurers

Approved by
Advisor, Graduate Program in Linguistics
UMI Number: 3226393

UMI Microform 3226393
Copyright 2006 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Copyright by
Anna Feldman
2006


ABSTRACT
Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus, and it is an important step in natural language processing. Morphologically tagged corpora are very useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in large corpora, and for further computational processing, such as syntactic parsing, speech recognition, stemming, and word-sense disambiguation. Despite the importance of morphological tagging, many languages lack annotated resources. This is almost inevitable, because such resources are costly to create. But, as described in this thesis, it is possible to avoid this expense.
This thesis describes a method for transferring annotation from a morphologically annotated corpus of a source language to a corpus of a related target language. Unlike unsupervised approaches, which require no annotated data at all and, as a consequence, lack precision, the approach proposed in this dissertation relies on linguistic knowledge, but avoids large-scale grammar engineering. The approach needs neither a parallel corpus nor a bilingual lexicon, and requires much less linguistic labor than the standard technology.

This dissertation describes experiments with Russian, Czech, Polish, Spanish, Portuguese, and Catalan. However, the general method proposed can be applied to any fusional language.
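As a minimal illustration of the kind of annotation described above (a sketch only: the positional-tag style follows the Czech-inspired tagsets discussed later in the thesis, but the example words and analyses are hypothetical, not output of the system):

```python
# Sketch: morpho-syntactic tags encode POS, gender, number, case, etc.
# Each tag here is a 15-character positional string; "-" marks a feature
# that does not apply to the word. Only the first five slots are decoded.
SLOT_NAMES = ["POS", "SubPOS", "Gender", "Number", "Case"]

def decode(tag):
    """Return the applicable features from the first five tag positions."""
    return {name: tag[i] for i, name in enumerate(SLOT_NAMES) if tag[i] != "-"}

# A toy tagged text: (word, tag) pairs, analyses chosen for illustration.
tagged = [
    ("kniha", "NNFS1----------"),  # noun, feminine, singular, nominative
    ("nová",  "AAFS1----1A----"),  # adjective agreeing with the noun
]

for word, tag in tagged:
    print(word, decode(tag))
```

A tagger assigns one such tag to every token; the large number of possible feature combinations is precisely what makes tagging fusional languages hard.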
To Batsheva Barenfeld, Mira Barenfeld, and Ilia Feldman who made me who I am, and
Gera and Naomi who like me this way.
ACKNOWLEDGMENTS
Even though mine is the only name on this work, many have contributed to its development and completion — those who provided insights, comments, and suggestions, and those who provided friendship, love, and support.
First I want to thank Chris Brew for (surprisingly easily) agreeing to take me as his advisee and for being a terrific advisor. Always with bright insights (mostly in the form of interrogation), always knowledgeable, always generous with his time, always with anecdotal stories and jokes, always with good advice — Chris has become an object of appreciation. It was his seminar on Corpora and Multilingual Verb Classification where I realized that Czech is rather useful for processing Russian verbs.
Special thanks go to my friend and colleague, Jirka Hana, who contributed an
incredible amount of work and ideas to this project. He developed a resource-light portable
morphological analyzer which became the basis for the cross-language system described
in this thesis. This work started as a joint project and many ideas developed in this thesis
were inspired by discussions with Jirka.
I also want to thank Detmar Meurers, another member of my dissertation committee. He was the first to introduce me to the field of Computational Linguistics and got me excited about it. Detmar gave me a lot of good advice and support over the years. He managed to keep me always in mind, pointing me to the relevant literature and tools, and making me believe that I could actually write a dissertation! Throughout the years, I took several
seminars with Detmar, and that’s where I acquired most of the skills for working on my
thesis.
I also thank Brian Joseph for his always extremely insightful comments and timely feedback. What can I say? Brian knows. I am so lucky he agreed to be on my committee.
I also want to thank people who helped me with corpora used in the experiments:
Sandra Maria Aluísio, Gemma Boleda, Toni Badia, Lukasz Debowski, Maria das Graças
Volpe Nunes, Jan Hajic, Ricardo Hasegawa, Vicente López, Lluís Padró, Carlos Rodríguez
Penagos, Adam Przepiórkowski, and Martí Quixal.
Special thanks go to Stacey Bailey for being such a nice office mate, for the hot
chocolate with marshmallows, and for being ready to proofread the entire draft of this
dissertation.
I would like to thank my parents for giving me the freedom of choice and always
trusting and supporting me. Linguistics is definitely not a profession that runs in our family.
Then, there is a long list of people who deserve a word of thanks for one or more of the following: their teaching, their willingness to discuss whatever linguistic or non-linguistic topic, their collegiality, and their friendship. These are (in alphabetical order) Luiz Amaral (a friend and an expert in Romance languages), Mary
Beckman, Ilana Bromberg, Donna Byron, Angelo Costanzo (for the Catalan-Spanish false cognates), Peter Culicover, Mike Daniels (another person who knows practically everything and is always ready to help), Eric Fosler-Lussier, Kordula De Kuthy, Markus Dickinson, Edit Doron, David Dowty, Stefan Dyła, Yakov Feldman (for making my life full of art), Zhenya Gabrilovich (there are computer scientists who can actually understand linguists — well, at least to some extent), Anna Ghazaryan (a friend and a mathematician!), Jonathan Ginzburg, Jan Hajic, Hanka Hanova (swimming, hiking, cooking, reminding me that there is life beyond Oxley), Jirka Hana (a friend and colleague, whose contribution
rates a second mention), Jim Harmon, Erhard Hinrichs (for always good advice, encouragement, and the subtaggers idea), Beth Hume, Martin Jansche, Dimitra Kolliakou, Greg Kondrak, Soyoung Kang, Chandana and Rupan Kundu (I wouldn't have finished this thesis without you, guys!), Bob Levine (for making me believe I can do it and for making me want to know physics), Xiaofei Lu (my former office mate, my current lab mate, full of good ideas and jokes), Vahagn Manukian (for friendship, math, and grill), Arantxa Martin-Lozano (my dear friend), Dennis Mehay, Vanessa Metcalf (for the devoted friendship and mental support), Marcela Michalkova, Martin Michalek, Rick Nouwen (Utrecht, Utrecht), Carl Pollard (for making me love syntax, logic, and math), Craige Roberts, Anton Rytting (for being such a friendly office mate and for always being ready to discuss Arabic vowels, Lettish dialects, and entropy), Andrea Sims (an expert in Slavic languages), Shari Speer, Soundar Srinivasan, Maya Schwekher, Nathan Vaillette, Shravan Vasishth (strict, but fair), Pauline Welby (who always has some interesting story to tell), Don Winford, Mike White, Yael Ziv, and many, many other people. Thank you all!
Last but not least, I thank Gera, who asked me not to include his name here. So,
read the dedication.
VITA
1997 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.A. English Language and Literature,
B.A. East-Asian Studies,
Hebrew University of Jerusalem, Israel
1997–1998 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Research Assistant,
Hebrew University of Jerusalem
1999 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.A. English Linguistics,
Hebrew University of Jerusalem
1999–2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Research Assistant,
The Ohio State University
2001 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marie-Curie Fellow, Utrecht Institute of
Linguistics, The Netherlands
2000–2005 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Teaching Assistant,
The Ohio State University
2005 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.A. Linguistics,
The Ohio State University
2005–present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Language Consultant,
Zi Corporation, Canada
2005–2006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Presidential Fellow,
The Ohio State University
PUBLICATIONS
1. Anna Feldman, Jirka Hana, and Chris Brew (2006). A Cross-language Approach
to Rapid Creation of New Morpho-syntactically Annotated Resources. In Proceedings
of the Fifth International Conference on Language Resources and Evaluation
(LREC), Genoa, Italy.
2. Jirka Hana, Anna Feldman, Luiz Amaral, and Chris Brew (2006). Tagging Portuguese
with a Spanish Tagger Using Cognates. In Proceedings of the Workshop
on Cross-language Knowledge Induction hosted in conjunction with the 11th Conference
of the European Chapter of the Association for Computational Linguistics
(EACL), Trento, Italy, pp. 33–40.

3. Anna Feldman, Jirka Hana, and Chris Brew (2006). Experiments in Morphological
Annotation Transfer. In Proceedings of Computational Linguistics and Intelligent
Text Processing (CICLing), A. Gelbukh (editor), Lecture Notes in Computer Science,
Mexico City, Mexico, pp. 41–50. Springer-Verlag.
4. Anna Feldman, Jirka Hana, and Chris Brew (2005). Buy One, Get One Free or
What to Do When Your Linguistic Resources are Limited. In Proceedings of the
Third International Seminar on Computer Treatment of Slavic and East-European
Languages (Slovko), Bratislava, Slovakia.
5. Jirka Hana, Anna Feldman, and Chris Brew (2004). A Resource-light Approach to
Russian Morphology: Tagging Russian Using Czech Resources. In Proceedings of
Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain,
pp. 222–229.
6. Jirka Hana and Anna Feldman (2004). Portable Language Technology: Russian via
Czech. In Proceedings of the First Midwest Computational Linguistics Colloquium,
Bloomington, Indiana.
7. Stefan Dyła and Anna Feldman (2003). On Comitative Constructions in Polish and
Russian. In Proceedings of the Fifth European Conference on Formal Description of
Slavic Languages, Leipzig, Germany.
8. Anna Feldman (2003). On S-Coordination and Plural Pronoun Constructions. In
Balkan and Slavic Linguistics, vol.2, ed. Daniel E. Collins and Andrea D. Sims, The
Ohio State University, Columbus, Ohio, USA, pp. 49–75.
9. Anna Feldman (2002). Kim and Sandy, Kim with Sandy, Just Me or Both of Us?
In Proceedings of European Summer School of Logic, Language, and Information
(ESSLLI), Trento, Italy, pp. 41–52.
10. Anna Feldman (2002). On NP-coordination. The UiL OTS 2002 Yearbook, Utrecht,
The Netherlands, pp. 39–66.
11. Anna Feldman (2001). Comitative and Plural Pronoun Constructions. In Proceedings
of the 17th Annual Meeting of the Israel Association of Theoretical Linguistics
(IATL), Jerusalem, Israel.

12. Anna Feldman (2000). Že: Codification of ’Hearer-Old’ Information. In Proceedings
of the 27th Linguistic Association of Canada and the United States (LACUS)
Forum, Houston, Texas, USA, pp. 187–202.
FIELDS OF STUDY
Major Field: Linguistics
Specialization: Computational Linguistics
TABLE OF CONTENTS
Page
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Language technology and resource-poor languages . . . . . . . . . . . . . 2
1.2 Morphological tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Reducing the annotation burden by cross-language knowledge induction . . 5
1.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Dissertation structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
I Linguistic and Computational Foundations 9
2 Language properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Slavic languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Czech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 Polish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.3 Russian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.4 Contrastive study . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Romance languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Catalan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.2.2 Portuguese . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.3 Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.4 Contrastive study . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 Tagsets and corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.1 Slavic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.1.1 Czech . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.1.2 Russian . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1.1.3 Polish . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1.2 Romance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.2.1 Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.2.2 Portuguese . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1.2.3 Catalan . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Language population and language technology . . . . . . . . . . . . . . . . 49
3.3 Tagsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Slavic tagsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1.1 Czech . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.1.2 Russian . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.1.3 Polish . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.2 Romance tagsets . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.2.1 Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3.2.2 Portuguese . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.2.3 Catalan . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.3 Tagset design and inflected languages . . . . . . . . . . . . . . . . 61
3.3.4 Why a positional tagset? . . . . . . . . . . . . . . . . . . . . . . . 65
4 Tagging techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1 Supervised methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.1 N-gram taggers/Markov models . . . . . . . . . . . . . . . . . . . 68

4.1.1.1 TnT (Brants 2000) . . . . . . . . . . . . . . . . . . . . . 71
4.1.1.2 Tagging inflected languages with MMs . . . . . . . . . . 74
4.1.2 Transformation-based error-driven learning (TBL) . . . . . . . . . 75
4.1.2.1 Tagging inflected languages with TBL . . . . . . . . . . . 77
4.1.3 Maximum Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.3.1 Tagging inflected languages with the MaxEnt-tagger . . . 80
4.1.4 Memory-based tagging (MBT) . . . . . . . . . . . . . . . . . . . . 80
4.1.4.1 Tagging inflected languages with MBT . . . . . . . . . . 81
4.1.5 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.5.1 Tagging inflected languages with decision trees . . . . . . 83
4.1.6 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1.6.1 Tagging inflected languages with neural networks . . . . . 85
4.2 Unsupervised methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.1 Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.1.1 Tagging inflectional languages with HMMs . . . . . . . . 87
4.2.2 Transformation-based learning (TBL) . . . . . . . . . . . . . . . . 87
4.3 Comparison of the tagging approaches . . . . . . . . . . . . . . . . . . . . 89
4.4 Classifier combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.1 Subsampling of training examples . . . . . . . . . . . . . . . . . . 90
4.4.2 Simple voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.2.1 Pairwise voting . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3 Stacked classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.4 Combining POS-taggers . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 A special approach to tagging highly inflected languages . . . . . . . . . . 98
4.5.1 Exponential tagger . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5.2 Other experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5 Previous resource-light approaches to NLP tasks . . . . . . . . . . . . . . . . 105
5.1 Unsupervised or minimally supervised approaches . . . . . . . . . . . . . . 106

5.1.1 Unsupervised POS tagging . . . . . . . . . . . . . . . . . . . . . . 106
5.1.2 Minimally supervised morphology learning . . . . . . . . . . . . . 107
5.1.2.1 Yarowsky and Wicentowski (2000) . . . . . . . . . . . . 109
5.2 Cross-language knowledge induction . . . . . . . . . . . . . . . . . . . . . 113
5.2.1 Cross-language knowledge transfer using parallel texts . . . . . . . 113
5.2.2 Bilingual lexicon acquisition . . . . . . . . . . . . . . . . . . . . . 114
5.2.2.1 POS tagging . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2.2.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2.2.3 Semantic classes . . . . . . . . . . . . . . . . . . . . . . 118
5.2.3 Cross-language knowledge transfer without parallel corpora . . . . 119
5.2.3.1 Word sense disambiguation (WSD) and translation lexicons . . 119
5.2.3.2 Named Entity (NE) recognition . . . . . . . . . . . . . . 120
5.2.3.3 Verb classes . . . . . . . . . . . . . . . . . . . . . . . . 122
5.2.3.4 Inducing POS taggers with a bilingual lexicon . . . . . . 124
5.2.3.5 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
II A New Resource-light Approach to Morphological Tagging of
Inflected Languages 130
6 A new resource-light approach to morpho-syntactic tagging. The set up. . . 132
6.1 Tag system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.1.1 Slavic tagsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.1.1.1 Czech tagset . . . . . . . . . . . . . . . . . . . . . . . . 135
6.1.1.2 Russian tagset . . . . . . . . . . . . . . . . . . . . . . . 135
6.1.1.3 Polish tagset . . . . . . . . . . . . . . . . . . . . . . . . 137
6.1.2 Romance tagsets . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.1.2.1 Spanish tagset . . . . . . . . . . . . . . . . . . . . . . . 138
6.1.2.2 Catalan tagset . . . . . . . . . . . . . . . . . . . . . . . 139
6.1.2.3 Portuguese tagset . . . . . . . . . . . . . . . . . . . . . . 139
6.2 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

6.2.1 Slavic corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2.1.1 Czech corpora . . . . . . . . . . . . . . . . . . . . . . . 141
6.2.1.2 Polish corpora . . . . . . . . . . . . . . . . . . . . . . . 141
6.2.1.3 Russian corpora . . . . . . . . . . . . . . . . . . . . . . 141
6.2.2 Romance corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.2.2.1 Spanish corpora . . . . . . . . . . . . . . . . . . . . . . 142
6.2.2.2 Portuguese corpora . . . . . . . . . . . . . . . . . . . . . 142
6.2.2.3 Catalan corpora . . . . . . . . . . . . . . . . . . . . . . . 142
6.3 Morphological analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.1 Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3.1.1 Russian paradigms . . . . . . . . . . . . . . . . . . . . . 146
6.3.1.2 Portuguese paradigms . . . . . . . . . . . . . . . . . . . 147
6.3.1.3 Catalan paradigms . . . . . . . . . . . . . . . . . . . . . 147
6.3.2 Closed-class words . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.3.3 Ending-based guesser . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.4 Lexicon-based analyzer . . . . . . . . . . . . . . . . . . . . . . . . 150
6.3.5 Abbreviation processor . . . . . . . . . . . . . . . . . . . . . . . . 151
6.3.6 Lexicon acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.3.7 The algorithm for lexicon acquisition . . . . . . . . . . . . . . . . 152
6.4 Quantifying language properties . . . . . . . . . . . . . . . . . . . . . . . 153
6.4.1 Tagset size, tagset coverage . . . . . . . . . . . . . . . . . . . . . . 153
6.4.2 How much training data is necessary? . . . . . . . . . . . . . . . . 156
6.4.3 Data sparsity, context, and tagset size . . . . . . . . . . . . . . . . 161
6.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7 Experiments in cross-language morphological annotation transfer . . . . . . 165
7.1 Why a Markov model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.2 Performance expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.2.1 The lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.2.2 The upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.3 The basic approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

7.4 Upper bounds of transitions and emissions . . . . . . . . . . . . . . . . . . 175
7.4.1 Upper bound — word order . . . . . . . . . . . . . . . . . . . . . 175
7.4.2 Upper bound — lexicon . . . . . . . . . . . . . . . . . . . . . . . 179
7.5 Further approximation of transitions . . . . . . . . . . . . . . . . . . . . . 181
7.5.1 “Russifications” . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.5.2 Slavic Interlingua . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.5.3 Combining two language models . . . . . . . . . . . . . . . . . . . 189
7.6 Further approximation of emissions: Cognates . . . . . . . . . . . . . . . . 189
7.6.1 Cognate detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
7.6.2 Cognate transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7.7 Dealing with data sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.7.1 Tag decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.7.2 Combining sub-taggers . . . . . . . . . . . . . . . . . . . . . . . . 199
7.8 Evaluation and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.8.1 Comparing the performance of the models on different languages . . 202
7.8.2 Alternative ways of evaluation . . . . . . . . . . . . . . . . . . . . 205
7.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
8 Summary and further work . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.1 Summary of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.2.1 Cognates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.2.2 Other morpho-syntactic features . . . . . . . . . . . . . . . . . . . 212
8.2.3 Other annotation schemes . . . . . . . . . . . . . . . . . . . . . . . 214
8.2.4 Alternative evaluation methods . . . . . . . . . . . . . . . . . . . . 214
8.2.5 Other types of knowledge induction . . . . . . . . . . . . . . . . . 214
8.2.6 Comparison with the standard approaches . . . . . . . . . . . . . . 215
8.2.7 Language closeness or size of the training data? . . . . . . . . . . . 215
8.2.8 Other inflected languages . . . . . . . . . . . . . . . . . . . . . . . 215
8.2.9 Cross-language morphology induction and active learning . . . . . 216

8.2.10 Language transfer in language acquisition . . . . . . . . . . . . . . 217
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
A Czech positional tagset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
B Detailed specifications of the Russian positional tagset . . . . . . . . . . . . . 229
C Detailed specifications of the positional tagset for Spanish, Catalan, and Portuguese . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
D Russian tagset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
E Polish tag correspondences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
F Spanish tag correspondences . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
G Portuguese tagset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
H Catalan tag correspondences . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
I Sub-taggers: complementarity rate . . . . . . . . . . . . . . . . . . . . . . . 249
J Tagging accuracy on all categories for Catalan, Portuguese, and Russian. . . 250
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Citation Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
LIST OF TABLES
Table Page
2.1 Declension Ia – an example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 I-conjugation – grabit’ ‘rob’ . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Slavic: shallow contrastive analysis . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Basic words: comparison of Russian, Czech, and Polish . . . . . . . . . . . . . 23
2.5 Romance: Shallow contrastive analysis . . . . . . . . . . . . . . . . . . . . . . 39
2.6 Germanic influence on Spanish, Portuguese, and Catalan . . . . . . . . . . . . 40
2.7 Arabic influence on Spanish, Portuguese, and Catalan . . . . . . . . . . . . . . 41
2.8 Basic words: Comparison of Spanish, Portuguese, and Catalan . . . . . . . . . 41
3.1 Language population, language technology . . . . . . . . . . . . . . . . . . . 49
3.2 Positional Tag System for Czech . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1 Overview and comparison of the Slavic tagsets . . . . . . . . . . . . . . . . . 136

6.2 Size of Slavic tagsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3 Overview and comparison of the Romance tagsets . . . . . . . . . . . . . . . . 140
6.4 Size of Romance tagsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5 Masculine nouns ending in a “hard” (non-palatalized) consonant, e.g. student
‘student’, stol ‘table’, slon ‘elephant’, etc. . . . . . . . . . . . . . . . . . . . 146
6.6 The -ar verbs in Portuguese . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.7 The -ar verbs in Catalan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.8 The corpus and detailed tagset size, n-gram counts, entropy (H), mutual
information (I), and average tag/token ambiguity: Slavic, Romance, English. . . . 154
6.9 The corpus and reduced tagset size, n-gram counts, entropy (H), mutual
information (I), and average tag/token ambiguity: Slavic, Romance, English. . . . 155
7.1 Lower bound of performance on all categories. . . . . . . . . . . . . . . . . . 168
7.2 Lower bound of performance on nouns. . . . . . . . . . . . . . . . . . . . . . 169
7.3 Lower bound of performance on verbs. . . . . . . . . . . . . . . . . . . . . . . 169
7.4 Lower bound of performance on adjectives. . . . . . . . . . . . . . . . . . . . 170
7.5 Homonymy of the -a ending in Russian. . . . . . . . . . . . . . . . . . . . . . 170
7.6 Comparison of recall and average ambiguity in morphological analysis. . . . . 171
7.7 Evaluation of the basic model on all categories. . . . . . . . . . . . . . . . . . 172
7.8 Evaluation of the basic model on nouns. . . . . . . . . . . . . . . . . . . . . . 173
7.9 Evaluation of the basic model on verbs. . . . . . . . . . . . . . . . . . . . . . 173
7.10 Evaluation of the basic model on adjectives. . . . . . . . . . . . . . . . . . . . 174
7.11 Overview and comparison of the tagsets . . . . . . . . . . . . . . . . . . . . . 176
7.12 Upper bounds of transitions for all categories compared to the basic model. . . 177
7.13 Upper bounds of transitions for nouns. . . . . . . . . . . . . . . . . . . . . . . 177
7.14 Upper bounds of transitions for verbs. . . . . . . . . . . . . . . . . . . . . . . 178
7.15 Upper bounds of transitions for adjectives. . . . . . . . . . . . . . . . . . . . . 178
7.16 Upper bounds of emissions for all categories. . . . . . . . . . . . . . . . . . . 179
7.17 Upper bounds of emissions for nouns. . . . . . . . . . . . . . . . . . . . . . . 180
7.18 Upper bounds of emissions for verbs. . . . . . . . . . . . . . . . . . . . . . . . 180

7.19 Upper bounds of emissions for adjectives. . . . . . . . . . . . . . . . . . . . . 181
7.20 Czech transitions compared with ‘russified’ transitions. Evaluation on all
categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.21 Czech transitions compared with ‘russified’ transitions. Evaluation on nouns. . 184
7.22 Czech transitions compared with ‘russified’ transitions. Evaluation on adjectives. . 185
7.23 Czech transitions compared with ‘russified’ transitions. Evaluation on verbs. . . 185
7.24 Czech, Russified, Polish, Interlingua, and Hybrid transitions for all categories. . 187
7.25 Czech, Russified, Polish, Interlingua, and Hybrid transitions for nouns. . . . . . 187
7.26 Czech, Russified, Polish, Interlingua, and Hybrid transitions for adjectives. . . . 188
7.27 Czech, Russified, Polish, Interlingua, and Hybrid transitions for verbs. . . . . . 188
7.28 Evaluation of Russian tagging of all categories with various parameters. . . . . 193
7.29 Evaluation of Russian tagging of nouns with various parameters. . . . . . . . . 194
7.30 Evaluation of Russian tagging of adjectives with various parameters. . . . . . . 194
7.31 Evaluation of Russian tagging of verbs with various parameters. . . . . . . . . 195
7.32 Evaluation of Catalan and Portuguese tagging of all categories: even vs.
cognate-approximated emissions. . . . . . . . . . . . . . . . . . . . . . . . . 195
7.33 Evaluation of Catalan and Portuguese tagging of nouns: even vs.
cognate-approximated emissions. . . . . . . . . . . . . . . . . . . . . . . . . 196
7.34 Evaluation of Catalan and Portuguese tagging of adjectives: even vs.
cognate-approximated emissions. . . . . . . . . . . . . . . . . . . . . . . . . 196
7.35 Evaluation of Catalan and Portuguese tagging of verbs: even vs.
cognate-approximated emissions. . . . . . . . . . . . . . . . . . . . . . . . . 197
7.36 Russian tagger performance trained on individual slots vs. tagger performance
trained on the full tag. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.37 Russian tagger performance trained on the combination of two features vs.
tagger performance trained on the full tag. . . . . . . . . . . . . . . . . . . . . 198
7.38 Russian tagger performance trained on the combination of three or four features
vs. tagger performance trained on the full tag. . . . . . . . . . . . . . . . . . . 198
7.39 Russian tagging accuracy of the model with cognate-approximated emissions
vs. the voted classifier (best three subtaggers) for all categories. . . . . . . . . . 200
7.40 Russian tagging accuracy of the model with cognate-approximated emissions
vs. the voted classifier (best three subtaggers) for nouns. . . . . . . . . . . . . . 200
7.41 Russian tagging accuracy of the model with cognate-approximated emissions
vs. the voted classifier (best three subtaggers) for adjectives. . . . . . . . . . . 201
7.42 Russian tagging accuracy of the model with cognate-approximated emissions
vs. the voted classifier (best three subtaggers) for verbs. . . . . . . . . . . . . . 201
7.43 A contingency table for testing the models. . . . . . . . . . . . . . . . . . . . . 203
7.44 McNemar’s χ² test results for Catalan, Portuguese, and Russian. . . . . . . . . 204
7.45 Number of feature changes needed to recreate gold standard . . . . . . . . . . 205
D.1 Sample tags for Russian nouns . . . . . . . . . . . . . . . . . . . . . . . . . . 244
E.1 A fragment of the Polish tagset . . . . . . . . . . . . . . . . . . . . . . . . . . 245
F.1 A fragment of the Spanish tagset . . . . . . . . . . . . . . . . . . . . . . . . . 246
G.1 A sample of Portuguese tags . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
H.1 A fragment of the Catalan tagset . . . . . . . . . . . . . . . . . . . . . . . . . 248
I.1 Complementarity rate of subtaggers (Brill and Wu 1998). . . . . . . . . . . . . 249
LIST OF FIGURES
Figure Page
2.1 Slavic languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Romance languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 The number of distinct tags plotted against the number of tokens for the de-
tailed tagset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2 The number of distinct tags plotted against the number of tokens for the re-
duced tagset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.3 The percentage of the tagset covered by the number of tokens for the detailed
tagset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.4 The percentage of the tagset covered by the number of tokens for the reduced
tagset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.5 The percentage of the corpus covered by the 5 most frequent tags for the de-
tailed tagset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6.6 The percentage of the corpus covered by the 5 most frequent tags for the re-
duced tagset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.7 Accession rate for the detailed tagset. . . . . . . . . . . . . . . . . . . . . . . . 162
6.8 Accession rate for the reduced tagset. . . . . . . . . . . . . . . . . . . . . . . . 163
7.1 An algorithm for combining subtaggers. . . . . . . . . . . . . . . . . . . . . . 199
7.2 Complementarity rate analysis (Brill and Wu 1998) . . . . . . . . . . . . . . . 202
7.3 McNemar’s χ² test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
J.1 Tagging accuracy on all categories. . . . . . . . . . . . . . . . . . . . . . . . . 250
J.2 Tagging accuracy on nouns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
J.3 Tagging accuracy on adjectives. . . . . . . . . . . . . . . . . . . . . . . . . . 252
J.4 Tagging accuracy on verbs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
J.5 Tagging accuracy on subPOS. . . . . . . . . . . . . . . . . . . . . . . . . . . 254
J.6 Tagging accuracy on gender. . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
J.7 Tagging accuracy on number. . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
J.8 Tagging accuracy on case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
CHAPTER 1
INTRODUCTION
The year is 1944, and World War II is near its end. A simple stroke of fate brings together
three people — a Finnish soldier who is being punished for displaying reluctance in battle,
a disgraced Soviet captain injured in a bomb attack en route to trial, and a Lapp widow
working a reindeer farm. The three discover that they have no language in common, and
they struggle to understand each other while hostilities are running high. This is the story
depicted in a Russian film, The Cuckoo (Kukushka, 2002). At the end of the movie, as
in any well-intentioned, man-made story, life wins and the barriers fall, giving mankind a
sense of hope and reconciliation.
As shown in the movie, language barriers contribute a great deal to misunderstand-
ing and miscommunication. Today’s technology is doing a tremendous job of overcoming
language barriers. For instance, by using some online machine translation systems, Inter-
net users can gain access to information from the original source language, and therefore,
ideally, form unbiased opinions. The process of learning foreign languages is also facili-
tated by technology. It is no longer a luxury to have an intelligent computer language tutor
that will detect and correct our spelling, grammar, and stylistic errors. These are just a few
examples of what language technology is capable of doing.
It is unfortunate, however, that not all languages receive equal attention. Many
languages lack even the most rudimentary technological resources.
1.1 Language technology and resource-poor languages
This thesis concerns the development of a method for morphological tagging of resource-
poor languages. “Morphological tagging” is the process of assigning POS, case, number,
gender, and other morphological information to each word in a corpus. “Resource-poor”
languages are languages for which few digital resources exist; and thus, languages whose
computerization poses unique challenges. “Resource-poor” languages are also those lan-
guages with limited financial, political, and legal support — languages that lack the global
importance of the world’s major languages.
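The feature bundles that a morpho-syntactic tagger assigns are often encoded as positional tag strings, one character per feature slot. As a minimal illustrative sketch, the function below decodes such a tag into named features; the five-slot layout and the example tag are hypothetical — real positional tagsets define their own positions and value inventories:

```python
# Illustrative sketch: decoding a positional morpho-syntactic tag into
# named features. The slot layout below is hypothetical; real tagsets
# define their own positions and values.

SLOTS = ["pos", "subpos", "gender", "number", "case"]

def decode_tag(tag):
    """Map each character of a positional tag to its feature slot."""
    if len(tag) != len(SLOTS):
        raise ValueError(f"expected a {len(SLOTS)}-character tag, got {tag!r}")
    return {slot: value for slot, value in zip(SLOTS, tag)}

# A (hypothetical) tag for a feminine singular noun in the nominative case:
features = decode_tag("NNFS1")
print(features)
# → {'pos': 'N', 'subpos': 'N', 'gender': 'F', 'number': 'S', 'case': '1'}
```

Representing tags this way makes it easy either to treat the whole string as one atomic label or to train separate classifiers per slot, since each feature can be read off independently.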
In spite of these challenges, resource-poor languages and their speakers are not
being ignored. Individuals, governments, and companies alike are busy developing tech-
nologies and tools to support such languages (e.g. ILASH 2002). They are driven by a
variety of motivations. First, there is a sincere aspiration among academics and community
activists to preserve or revitalize endangered or threatened languages — creating electronic
resources for such languages is not the solution, of course, but an important contribution
to the enterprise. Second, some governments strive to promote minority languages. Third,
there is a need by other governments to detect hostile chatter in diverse tongues. Finally,
some companies are trying to enhance their stature in emerging markets such as China and
South America. Even though the system developed in this thesis has been tested on languages
that are relatively resource-poor but not endangered (Catalan, Portuguese, and Russian),
the same method can be applied to any pair of related inflected languages, provided
one of them has an annotated corpus.
Success in natural language processing (NLP) depends crucially on good resources.
Standard tagging techniques are accurate, but they rely heavily on high-quality annotated
training data. The training data also has to be statistically representative of the data on
which the system will be tested. In order to adapt a tagger to new kinds of data, it has to