Vietnamese Language
Processing:
Issues and Challenges
Ho Tu Bao
Vietnamese
Academy of Science
and Technology
Japan Advanced
Institute of Science
and Technology
(Keynote talk at international conference IEEE RIVF 2009)
IEEE RIVF’09, 16 July 2009
Institute of Information Technology
Vietnamese Academy of Science & Technology
Japan Advanced Institute of
Science and Technology
Outline
Problems and progress in natural
language processing
Issues and challenges in Vietnamese
language processing
Our VLSP project (Vietnamese
Language and Speech Processing)
Natural language processing?
Psychological view: Understand
human language processing
Alan Turing proposed to consider the question:
“Can machines think?”
Engineering view: Build systems
to process language
More languages than you might have
thought
6912 distinct languages (230 spoken in Europe,
2197 in Asia)
We meet here today to talk about processing of Vietnamese language
and speech.
Aujourd'hui nous nous réunissons ici pour discuter le traitement de
langue et de parole vietnamienne.
Сегодня мы встречаемся здесь, чтобы говорить об обработке вьетнамского языка и
речи.
أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة
الخطاب
Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và
tiếng nói tiếng Việt.
54 ethnic groups in Vietnam
Language
groups
Mon-Khmer
Tay-Thai
Tibeto-Burman
Malayo-Polynesian
Kadai
Mong-Dao
Han
English websites and Vietnamese?
Translation and machine translation
Translate the following sentence into English
“Ông già đi nhanh quá”?
Many possible translations
1. [Ông già] [đi] [nhanh quá]
The old man walks too fast
My father walks too fast
2. [Ông già] [đi] [nhanh quá]
The old man died too fast
My father died too fast
3. [Ông] [già đi] [nhanh quá]
You get old too fast
Grandfather gets old too fast
Ambiguity of language
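The bracketings above differ in where the word boundaries fall. As a minimal sketch, assuming a tiny hand-made lexicon (hypothetical, not a resource from the talk), the following enumerates every way to group the syllables of the sentence into lexicon words. The function name `segmentations` and the `max_len` parameter are illustrative choices.

```python
def segmentations(syllables, lexicon, max_len=2):
    """Return every way to group the syllables into words found in the lexicon."""
    if not syllables:
        return [[]]
    results = []
    # Try a word made of the first 1..max_len syllables, then recurse on the rest.
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon: single syllables plus three two-syllable words.
LEXICON = {"ông", "già", "đi", "nhanh", "quá", "ông già", "già đi", "nhanh quá"}

sentence = "ông già đi nhanh quá".split()
all_segs = segmentations(sentence, LEXICON)
for seg in all_segs:
    print(" ".join(f"[{w}]" for w in seg))
```

Even this toy lexicon yields several segmentations, including [ông già] [đi] [nhanh quá] and [ông] [già đi] [nhanh quá]; choosing among them requires context that a dictionary lookup alone does not provide.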
Two approaches to machine
translation
Linguistic rule-based machine translation
Words are translated using linguistic rules about the two
languages and the correspondence (transfer) between them
(morphology, syntax, etc.)
Requires understanding natural language.
Statistical machine translation
Translations are generated using statistical learning methods
based on bilingual text corpora (statistically similar)
Requires large, high-quality bilingual text corpora.
DOMINATING!
From text to the meaning
Natural Language Processing (NLP): from text to meaning
Levels of analysis
Lexical / Morphological Analysis
Syntactic Analysis
Semantic Analysis
Discourse Analysis
Tasks
Tagging
Chunking
Grammatical Relation Finding
Named Entity Recognition
Word Sense Disambiguation
Reference Resolution
Shallow parsing example
The woman will give Mary a book
POS tagging:
The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
Chunking:
[The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
Relation finding:
[The woman] (subject) [will give] [Mary] (i-object) [a book] (object)
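The chunking step in the example can be sketched with a few hand-written rules. This is only an illustration under simple assumptions (determiners and nouns form NP chunks, modals and verbs form VP chunks, and a determiner always opens a new NP); it is not a method described in the talk, and the names `chunk` and `render` are hypothetical.

```python
NP_TAGS = {"Det", "NN", "NNP"}   # tags allowed inside a noun phrase
VP_TAGS = {"MD", "VB"}           # tags allowed inside a verb phrase


def chunk(tagged):
    """Group (word, tag) pairs into flat NP/VP chunks with simple rules."""
    chunks = []  # list of (label, [(word, tag), ...])
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VP" if tag in VP_TAGS else "O"
        starts_new = (
            not chunks
            or chunks[-1][0] != label
            or label == "O"
            or (label == "NP" and tag == "Det")  # a determiner opens a new NP
        )
        if starts_new:
            chunks.append((label, [(word, tag)]))
        else:
            chunks[-1][1].append((word, tag))
    return chunks


def render(chunks):
    """Print chunks in the bracketed notation used on the slide."""
    parts = []
    for label, tokens in chunks:
        body = " ".join(f"{w}/{t}" for w, t in tokens)
        parts.append(f"[{body}]{label}" if label != "O" else body)
    return " ".join(parts)


sentence = "The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN"
tagged = [tuple(token.rsplit("/", 1)) for token in sentence.split()]
print(render(chunk(tagged)))
```

Real chunkers are usually trained on annotated data rather than hand-written, which is one reason a treebank is a prerequisite for the tools discussed later in the talk.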
Archeology of natural language
processing
1960s: Representation transformation
Finite state machines (FSM) and augmented transition networks (ATNs)
1960s: Representation beyond the word level
lexical features, tree structures, networks
1970s: Kernel (vector) spaces
clustering, information retrieval (IR)
1980s: Standard resources and tasks
Penn Treebank, WordNet, MUC
1990s–2000s: Statistical learning
algorithms, evaluation, corpora
Diagram labels: Natural language processing; Trainable FSMs; Trainable parsers;
Information retrieval and information extraction
(Hovy, COLING 2004)
ML and statistical methods in NLP
some ML/Stat
no ML/Stat
(Pages 11-12 from Marie Claire, ECML/PKDD 2005)
Recent learning methods in NLP
NLP R&D in other countries
Large investment from the government and industry
National Institute of Standards and Technology (NIST), ATR, NICT
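USA, China, Singapore, etc.
NLP & CL organizations
ACL (Association for Computational Linguistics)
NAACL (North American Chapter of the ACL)
EACL (European Chapter of the ACL)
PACLIC (Pacific Asia Conference on Language, Information and Computation)
ICCL (International Committee on Computational Linguistics)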
Many NLP people
Rich resources and tools
Linguistic Data Consortium
Vietnamese language
The Vietnamese language was established a long time ago
Chinese characters were used for a long time
Chữ Nôm (字喃), a writing system unique to Vietnam, from the 10th century
Romanized script, Quốc Ngữ, since the beginning of the 20th century
Nam quốc sơn hà Nam đế cư
南國山河南帝居
Over Mountains and Rivers of the South, Reigns the Emperor of the South
Vietnamese language
Vietnamese is an analytic language (words are
composed of a single morpheme).
Vietnamese does not use morphological marking of
case, gender, number, and tense.
ngôn ngữ (analytic), lang-gua-ge (synthetic), 言言 (synthetic)
Trưa nay tôi ăn ba thằng tôm
Syntax conforms to Subject Verb Object word order
Cái thằng chồng em nó chẳng ra gì.
FOCUS CLASSIFIER husband I he not turn.out what
“That husband of mine, he is good for nothing.”
Vietnamese Language and Speech
Processing
Most work aims at machine translation or other tasks at the
top layers, but there is very little basic work at the lower layers
Work is done in isolation, with no inheritance: people have
to start from scratch, without sharing or collaboration,
and there are no standards.
Almost no resources and tools for VLSP
For Japanese: many tools such as ChaSen, YamCha, …
For Vietnamese: no tool to do such a simple task
VLSP national project
(KC01.01.05/06-10)
5.2007-8.2009
National project with eleven active VLSP research groups
from Ho Chi Minh City to Hanoi, with two objectives:
Building the VLSP infrastructure, especially indispensable
resources and tools for VLSP development.
Building and developing several typical VLSP products
for public end users.
Layers (top to bottom):
Pragmatics: speech, text and Web data mining
Natural language processing methods
Tools, corpora, resources
Project target products
SP1: Application-oriented systems based on Vietnamese speech recognition & synthesis
SP2: Speech recognition system with large vocabulary
SP3: English-Vietnamese translation system
SP4: IREST: Internet use support system
SP5: Vietnamese spelling checker
SP6.1: Corpora for speech recognition
SP6.2: Corpora for speech synthesis
SP6.3: Corpora for specific words
SP7.1: English-Vietnamese dictionary
SP7.2: Viet dictionary
SP7.3: Vietnamese treebank
SP7.4: E-V corpora of aligned sentences
SP8.1: Speech analysis tools
SP8.2: Vietnamese word segmentation
SP8.3: Vietnamese POS tagger
SP8.4: Vietnamese chunker
SP8.5: Vietnamese syntax analyser
To be the standard for long-term development
VLSP website: opening soon to the public
SP7.2: Viet Machine Readable
Dictionary
Study other MRDs
EDR Electronic Dictionary (Japanese EDR; Institute of Electronic Dictionary, 1980s-1990s)
FrameNet (UC Berkeley)
TCL's Computational Lexicon
Build a model of VCL (Vietnamese Computational Lexicon)
The macroscopic structure
The microscopic structure
The VCL content and structure
Tool and VCL construction
SP7.2: Viet Machine Readable
Dictionary
Microscopic structure
Morphological information
Syntactic information, e.g., two kinds of verb
Sub-V-Obj: Lợn ăn rau; Xe ăn xăng
Sub-V: Chim bay; Chó chạy
Semantic information: logic and semantic constraints, definition, context
Examples: bé ngủ; bé đang ngủ
VCL content and structure
35,000 commonly used words in modern Vietnamese
Tool for the construction
Develop a tool for building VCL with XML representation.
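To make the idea of an XML-represented lexicon entry concrete, here is a minimal sketch that builds one entry carrying morphological, syntactic and semantic information. The element and attribute names (`entry`, `morphology`, `syntax`, `frame`, `example`, `semantics`, `definition`) and the English definition text are assumptions made for illustration; the actual VCL schema is not given in the talk.

```python
import xml.etree.ElementTree as ET


def build_entry(headword, pos, frames, definition):
    """Build one lexicon entry as an XML element (hypothetical schema)."""
    entry = ET.Element("entry", {"headword": headword})

    # Morphological information: here only the number of syllables.
    morph = ET.SubElement(entry, "morphology")
    morph.set("syllables", str(len(headword.split())))

    # Syntactic information: part of speech and subcategorization frames.
    syntax = ET.SubElement(entry, "syntax", {"pos": pos})
    for pattern, examples in frames.items():
        frame = ET.SubElement(syntax, "frame", {"pattern": pattern})
        for example in examples:
            ET.SubElement(frame, "example").text = example

    # Semantic information: a free-text definition.
    semantics = ET.SubElement(entry, "semantics")
    ET.SubElement(semantics, "definition").text = definition
    return entry


entry = build_entry(
    headword="ăn",
    pos="V",
    frames={"Sub-V-Obj": ["Lợn ăn rau", "Xe ăn xăng"]},
    definition="to eat; to consume",  # illustrative gloss, not from the VCL
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping each kind of information in its own element mirrors the microscopic structure listed above and lets a tool validate or extend one layer without touching the others.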
SP7.3: Viet Treebank
A treebank or parsed corpus is a text corpus in which each sentence has been
parsed, i.e. annotated with syntactic structure.
English: Penn Treebank (4.5M words) and many others;
Chinese: Penn Chinese Treebank (507K words),
Sinica Treebank (61,087 trees, 361K words);
Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks;
Korean: Korean Treebank (5078 trees, 54K words)
Viet Treebank (7.2007-5.2009):
10,000 trees
1,000,000 morphemes
Example parse tree for “Ông già đi nhanh quá” (node labels: S, NP, VP, P, V, T)
Diagram: Viet Treebank; Viet word segmenter, Viet POS tagger, Viet chunker,
Viet syntactic parser; Viet machine translation, info extraction, etc.
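Treebank annotations are commonly stored as bracketed strings. As a minimal sketch, the code below reads one such string into a nested structure and recovers the words. The bracketing and the labels (S, NP, VP, AP, N, V, A, R), as well as the underscore joining the syllables of a multi-syllable word, are assumptions for illustration and are not the Viet Treebank tag set or format.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # word (leaf)
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


# Hypothetical annotation of one reading of the example sentence.
bracketed = "(S (NP (N Ông_già)) (VP (V đi) (AP (A nhanh) (R quá))))"
tree = parse_bracketed(bracketed)
print(tree[0], leaves(tree))
```

The leaves of the tree give the word segmentation and the pre-terminal labels give the POS tags, which is why a single treebank can supply training data for a segmenter, a tagger, a chunker and a parser.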
SP7.3: Viet Treebank
Study various existing treebanks, modern theories for
syntax and Vietnamese language
Build guidelines for word segmentation, POS, and syntax
“Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả”
(“the house is in a jumble” and “at home the door is not closed”)
“Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn”
(“She keeps her beauty” and “this painting has better color”)
Build the tools
Labeling
Agreement between labelers (95%)
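The talk does not say which agreement measure produced the 95% figure. As one simple possibility, the sketch below computes raw percent agreement between two annotators over the same tokens; the function name and the tag values are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of tokens on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same tokens")
    if not labels_a:
        raise ValueError("no tokens to compare")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Hypothetical POS labels from two annotators for a four-word sentence.
annotator_a = ["N", "V", "A", "R"]
annotator_b = ["N", "V", "A", "A"]
print(percent_agreement(annotator_a, annotator_b))  # 3 of 4 labels match
```

Raw agreement does not correct for agreement expected by chance; chance-corrected measures such as Cohen's kappa are often reported alongside it.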
SP7.4: English-Vietnamese parallel
corpus
Pairs of corresponding
sentences in English and
Vietnamese (size & quality)
Easy for many languages
(LDC:
English-French corpus of
2.8M sentences, source
from Canadian Parliament)
No publicly available
parallel corpus for
Vietnamese
Building corpora needs
time, money and human
resources (boring job)
Parallel Corpus (L1-L2)   Sentences    L1 Words     L2 Words
German-English            1,313,096    34,700,362   36,663,083
Greek-English               662,090    18,834,758   18,827,241
Spanish-English           1,304,116    37,870,751   36,429,274
Finnish-English           1,257,720    24,895,790   34,802,617
French-English            1,334,080    41,573,117   37,436,222
Italian-English           1,251,315    36,411,166   36,510,033
Dutch-English             1,326,412    36,784,168   36,690,392
Portuguese-English        1,287,757    37,342,426   36,355,907
Swedish-English           1,164,536    28,882,142   32,053,628