Vietnamese Language Processing: Issues and Challenges


Ho Tu Bao
Vietnamese Academy of Science and Technology
Japan Advanced Institute of Science and Technology

(Keynote talk at international conference IEEE RIVF 2009)

IEEE RIVF’09, 16 July 2009

Institute of Information Technology, Vietnamese Academy of Science & Technology

Outline


Problems and progress in natural language processing
Issues and challenges in Vietnamese language processing
Our VLSP project (Vietnamese Language and Speech Processing)

Natural language processing?


Psychological view: understand human language processing
Alan Turing proposed considering the question “Can machines think?”
Engineering view: build systems to process language

More languages than you might have thought
6912 distinct languages (230 spoken in Europe, 2197 in Asia)

We meet here today to talk about processing of Vietnamese language and speech.
Aujourd'hui nous nous réunissons ici pour discuter du traitement de la langue et de la parole vietnamiennes.
Сегодня мы встречаемся здесь, чтобы говорить об обработке вьетнамского языка и речи.
[Two further translations in CJK script; characters lost in extraction]
أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة الخطاب
Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt.

54 ethnic groups in Vietnam

Language groups:
  Mon-Khmer
  Tay-Thai
  Tibeto-Burman
  Malayo-Polynesian
  Kadai
  Mong-Dao
  Han

English websites and Vietnamese?


Translation and machine translation


Translate the following sentence into English: “Ông già đi nhanh quá”

Many possible translations:
1. [Ông già] [đi] [nhanh quá]
   The old man walks too fast
   My father walks too fast
2. [Ông già] [đi] [nhanh quá]
   The old man died too fast
   My father died too fast
3. [Ông] [già đi] [nhanh quá]
   You get old too fast
   Grandfather gets old too fast

Ambiguity of language
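
The bracketings on this slide differ first of all in where the word boundaries fall. As a minimal sketch of that step, the Python below enumerates every way to split the syllables into words from a lexicon. The `LEXICON` set and the `segmentations` helper are hypothetical and hand-made for this one sentence; they are not part of any VLSP tool.

```python
import unicodedata

def norm(text):
    """Lowercase and NFC-normalize so accented syllables compare reliably."""
    return unicodedata.normalize("NFC", text).lower()

def segmentations(syllables, lexicon):
    """Return every way to split the syllable list into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        word = " ".join(syllables[:end])
        if word in lexicon:
            for rest in segmentations(syllables[end:], lexicon):
                results.append([word] + rest)
    return results

# Toy lexicon (an assumption for illustration only).
LEXICON = {norm(w) for w in ["ông", "ông già", "đi", "già đi", "nhanh quá"]}

sentence = "Ông già đi nhanh quá"
result = segmentations(norm(sentence).split(), LEXICON)
for seg in result:
    print(" | ".join(seg))
```

With this toy lexicon the sketch finds the two segmentations shown on the slide. Choosing between them, and between the senses of "đi" (walk or pass away), still requires context.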

Two approaches to machine translation

Linguistic rule-based machine translation
  Words are translated using linguistic rules about the two languages and the correspondence transfer between them (morphology, syntax, etc.)
  Requires understanding natural language

Statistical machine translation
  Generates translations using statistical learning methods based on bilingual text corpora (statistically similar)
  Requires large, high-quality bilingual text corpora
  DOMINATING!
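
The slide does not say which statistical model is meant. One classical formulation from the statistical MT literature is the noisy-channel model, sketched here with assumed symbols: f is the source sentence, e a candidate translation, P(f | e) a translation model estimated from the bilingual corpus, and P(e) a language model of the target language.

```latex
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)
```

The second equality follows from Bayes' rule, because P(f) does not depend on e. Both factors are estimated from text, which is why this approach depends on large bilingual corpora.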



From text to the meaning
Natural Language Processing (NLP)

Analysis layers, from text to meaning:
  Lexical / Morphological Analysis
  Syntactic Analysis
  Semantic Analysis
  Discourse Analysis

Tasks: Tagging, Chunking, Grammatical Relation Finding, Named Entity Recognition, Word Sense Disambiguation, Reference Resolution

Shallow parsing example:
  The woman will give Mary a book
  POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
  Chunking: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
  Relation finding: [The woman] (subject) [will give] [Mary] (i-object) [a book] (object)
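
The tagging and chunking steps of the shallow parsing example can be sketched in a few lines of Python. This is an illustration under strong assumptions: the `TAGS` lookup table covers only the slide's sentence, and the chunk rules (a determiner or noun starts an NP, a modal or verb starts a VP) are hand-written, whereas real taggers and chunkers are trained on annotated corpora.

```python
# Toy tag lexicon for the slide's sentence only (an assumption, not a real tagger).
TAGS = {"The": "Det", "woman": "NN", "will": "MD", "give": "VB",
        "Mary": "NNP", "a": "Det", "book": "NN"}
NOUN_TAGS = {"NN", "NNP"}
VERB_TAGS = {"MD", "VB"}

def pos_tag(sentence):
    """Look up one tag per whitespace-separated word."""
    return [(word, TAGS[word]) for word in sentence.split()]

def chunk(tagged):
    """Group tagged words into NP and VP chunks with two hand-written rules."""
    chunks, i = [], 0
    while i < len(tagged):
        tag = tagged[i][1]
        j = i + 1
        if tag == "Det" or tag in NOUN_TAGS:
            # A determiner or noun opens an NP that absorbs following nouns.
            while j < len(tagged) and tagged[j][1] in NOUN_TAGS:
                j += 1
            label = "NP"
        elif tag in VERB_TAGS:
            # A modal or verb opens a VP that absorbs following modals and verbs.
            while j < len(tagged) and tagged[j][1] in VERB_TAGS:
                j += 1
            label = "VP"
        else:
            label = "O"  # outside any chunk
        chunks.append((label, tagged[i:j]))
        i = j
    return chunks

def show(chunks):
    """Render chunks in the bracket notation used on the slide."""
    return " ".join(
        "[" + " ".join(f"{w}/{t}" for w, t in words) + "]" + label
        for label, words in chunks
    )

chunked = show(chunk(pos_tag("The woman will give Mary a book")))
print(chunked)
```

Grammatical relation finding (subject, i-object, object) would be a further step over these chunks and is not attempted here.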


Archeology of natural language processing
(Hovy, COLING 2004)

[Timeline figure covering the 1960s to the 2000s]

Themes on the timeline:
  Representation transformation: finite state machines (FSM) and augmented transition networks (ATNs)
  Kernel (vector) spaces: clustering, information retrieval (IR)
  Standard resources and tasks: Penn Treebank, WordNet, MUC
  Statistical learning: algorithms, evaluation, corpora
  Representation beyond the word level: lexical features, tree structures, networks

Other labels in the figure: Trainable FSMs; Trainable parsers; Natural language processing; Information retrieval and information extraction

ML and statistical methods in NLP

some ML/Stat

no ML/Stat

(Pages 11-12 from Marie Claire, ECML/PKDD 2005)

Recent learning methods in NLP


NLP R&D in other countries



Large investment from government and industry
  National Institute of Standards and Technology (NIST), ATR, NICT
  USA, China, Singapore, etc.
NLP & CL organizations
  ACL (Association for Computational Linguistics)
  NAACL (North American Chapter of the ACL)
  EACL (European Chapter of the ACL)
  PACLIC (Pacific Asia Conference on Language, Information and Computation)
  ICCL (International Committee on Computational Linguistics)
Many NLP people
Rich resources and tools
  Linguistic Data Consortium

Vietnamese language


Vietnamese language was established a long time ago
Chinese characters were used for a long time
Chu Nom (字喃), a writing system unique to Vietnam, from the 10th century
Romanized script, Quốc Ngữ, since the beginning of the 20th century

Nam quốc sơn hà Nam đế cư
南國山河南帝居
“Over Mountains and Rivers of the South, Reigns the Emperor of the South”

Vietnamese language


Vietnamese is an analytic language (words are composed of a single morpheme).
  ngôn ngữ (analytic), lang-gua-ge (synthetic), 言語 (synthetic)
Vietnamese does not use morphological marking of case, gender, number, and tense.
  Trưa nay tôi ăn ba thằng tôm (“At noon today I ate three shrimp”)
Syntax conforms to Subject Verb Object word order.
  Cái    thằng       chồng    em  nó  chẳng  ra        gì.
  FOCUS  CLASSIFIER  husband  I   he  not    turn.out  what
  “That husband of mine, he is good for nothing.”

Vietnamese Language and Speech Processing

Most work aims at machine translation or other tasks at the top layers, but there is very little basic work at the lower layers
Work is done in isolation, with no inheritance: people have to start from scratch without sharing or collaborating, so there are no standards
Almost no resources and tools for VLSP
  [Example sentence in CJK script; characters lost in extraction]
  Many tools such as ChaSen, Yamcha, …
  No tool to do such a simple task for Vietnamese

VLSP national project (KC01.01.05/06-10)
5.2007-8.2009

National project with eleven active VLSP research groups from Ho Chi Minh City to Hanoi, with two objectives:
  Building VLSP infrastructure, especially indispensable resources and tools for VLSP development.
  Building and developing several typical VLSP products for public end users.

Layers in the project diagram:
  Pragmatics: speech, text and Web data mining
  Natural language processing methods
  Tools, corpora, resources

Project target products

SP1: Application-oriented systems based on Vietnamese speech recognition & synthesis
SP2: Speech recognition system with large vocabulary
SP3: English-Vietnamese translation system
SP4: IREST: Internet use support system
SP5: Vietnamese spelling checker
SP6.1: Corpora for speech recognition
SP6.2: Corpora for speech synthesis
SP6.3: Corpora for specific words
SP7.1: English-Vietnamese dictionary
SP7.2: Viet dictionary
SP7.3: Vietnamese treebank
SP7.4: E-V corpora of aligned sentences
SP8.1: Speech analysis tools
SP8.2: Vietnamese word segmentation
SP8.3: Vietnamese POS tagger
SP8.4: Vietnamese chunker
SP8.5: Vietnamese syntax analyser

To be standard for long term development

VLSP website: opening soon to the public

SP7.2: Viet Machine Readable Dictionary

Study other MRDs
  EDR Electronic Dictionary (Japanese EDR, Institute of Electronic Dictionary, 1980s-1990s)
  FrameNet (UC Berkeley)
  TCL's Computational Lexicon
Build a model of VCL (Vietnamese Computational Lexicon)
  The macroscopic structure
  The microscopic structure
  The content and VCL structure
  Tool and VCL construction

SP7.2: Viet Machine Readable Dictionary

Microscopic structure
  Morphological information
  Syntactic information, e.g., two kinds of verb
    Sub-V-Obj: Lợn ăn rau (“Pigs eat vegetables”), Xe ăn xăng (“The vehicle consumes petrol”)
    Sub-V: Chim bay (“Birds fly”), Chó chạy (“Dogs run”)
    bé ngủ (“the baby sleeps”), bé đang ngủ (“the baby is sleeping”)
  Semantic information: logic and semantic constraints, definition, context
VCL content and structure
  35,000 commonly used words in modern Vietnamese
Tool for the construction
  Develop a tool for building VCL with XML representation.
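
The slides do not show the VCL schema, so the following is only a sketch of what an XML entry carrying the three kinds of microscopic information might look like. Every element and attribute name (`entry`, `headword`, `morphology`, `syntax`, `frame`, `semantics`) is hypothetical, and the English definition is an illustrative gloss; only the verb frame and the example sentences come from the slide. It uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

def build_entry(headword, pos, frame, examples, definition):
    """Build one lexicon entry; the element names are hypothetical, not the VCL schema."""
    entry = ET.Element("entry", {"headword": headword})
    morphology = ET.SubElement(entry, "morphology")
    ET.SubElement(morphology, "pos").text = pos
    syntax = ET.SubElement(entry, "syntax")
    ET.SubElement(syntax, "frame").text = frame
    for example in examples:
        ET.SubElement(syntax, "example").text = example
    semantics = ET.SubElement(entry, "semantics")
    ET.SubElement(semantics, "definition").text = definition
    return entry

# The verb "ăn" with the Sub-V-Obj frame and the two example sentences from the slide.
entry = build_entry(
    headword="ăn",
    pos="verb",
    frame="Sub-V-Obj",
    examples=["Lợn ăn rau", "Xe ăn xăng"],
    definition="to eat; to consume",  # illustrative gloss, not from the VCL
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping morphology, syntax and semantics in separate child elements mirrors the three-part microscopic structure listed above, and lets a tool validate or extend each part independently.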

SP7.3: Viet Treebank


A treebank, or parsed corpus, is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure.

English: Penn Treebank (4.5M words) and many others
Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)
Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
Korean: Korean Treebank (5078 trees, 54K words)

Viet Treebank (7.2007-5.2009): 10,000 trees, 1,000,000 morphemes

[Figure: parse tree of “Ông già đi nhanh quá” with node labels S, NP, VP, P, V, NP, T]

[Figure: Viet Treebank feeds the Viet word segmenter, Viet POS tagger, Viet chunker and Viet syntactic parser, which in turn support Viet machine translation, info extraction, etc.]
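
Treebanks are commonly stored as bracketed strings, one tree per sentence. As a minimal sketch of what "annotated with syntactic structure" means in practice, the Python below reads one such string into nested tuples and recovers its words. The bracketing of the English sentence is an assumption made for illustration; it is not taken from the Penn Treebank or the Viet Treebank.

```python
def parse_brackets(text):
    """Parse a bracketed tree such as "(S (NP a) (VP b))" into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]                # a bare token is a leaf (a word)
        pos += 1
        return word

    return read()

def leaves(tree):
    """Return the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

# Illustrative bracketing (an assumption), not an actual treebank entry.
tree = parse_brackets("(S (NP The woman) (VP will give (NP Mary) (NP a book)))")
print(tree[0], leaves(tree))
```

Counting trees and leaves over a whole file in this format gives corpus sizes of the kind quoted on the slide.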

SP7.3: Viet Treebank


Study various existing treebanks, modern theories for syntax and Vietnamese language
Build guidelines for word segmentation, POS, and syntax
  “Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả” (“The house is in a jumble” and “At home the door is not closed”): “nhà cửa” is one word in the first sentence but spans a word boundary in the second
  “Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn” (“She keeps her beauty” and “This painting has better color”): “sắc đẹp” is one word in the first sentence but spans a word boundary in the second
Build the tools
Labeling
Agreement between labelers (95%)
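
The slide does not say how the 95% agreement between labelers was measured. One simple and common measure is raw percent agreement: the share of items to which two annotators assigned the same label. The sketch below assumes that measure, and the tag sequences are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of positions where two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical POS labels from two annotators for the same four words.
annotator_a = ["N", "V", "A", "N"]
annotator_b = ["N", "V", "N", "N"]
agreement = percent_agreement(annotator_a, annotator_b)
print(f"{agreement:.0%}")
```

Raw agreement does not correct for agreement by chance; chance-corrected measures exist, but nothing on the slide indicates which one, if any, was used.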


SP7.4: English-Vietnamese parallel corpus

Pairs of corresponding sentences in English and Vietnamese (size & quality)
Easy for many languages (LDC: English-French corpus of 2.8M sentences, sourced from the Canadian Parliament)
No publicly available parallel corpus for Vietnamese
Building corpora needs time, money and human resources (boring job)

Parallel Corpus (L1-L2)   Sentences    L1 Words     L2 Words
German-English            1,313,096    34,700,362   36,663,083
Greek-English               662,090    18,834,758   18,827,241
Spanish-English           1,304,116    37,870,751   36,429,274
Finnish-English           1,257,720    24,895,790   34,802,617
French-English            1,334,080    41,573,117   37,436,222
Italian-English           1,251,315    36,411,166   36,510,033
Dutch-English             1,326,412    36,784,168   36,690,392
Portuguese-English        1,287,757    37,342,426   36,355,907
Swedish-English           1,164,536    28,882,142   32,053,628
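
The three columns of such a table (sentences, L1 words, L2 words) can be computed directly from a list of aligned sentence pairs. The sketch below assumes whitespace tokenization and uses the English and Vietnamese versions of the sentence from the multilingual slide earlier in this talk; `corpus_stats` is a hypothetical helper name.

```python
def corpus_stats(pairs):
    """Return (sentence pairs, L1 tokens, L2 tokens) using whitespace tokenization."""
    sentences = len(pairs)
    l1_words = sum(len(l1.split()) for l1, _ in pairs)
    l2_words = sum(len(l2.split()) for _, l2 in pairs)
    return sentences, l1_words, l2_words

# One aligned English-Vietnamese pair taken from the multilingual slide.
pairs = [
    (
        "We meet here today to talk about processing of Vietnamese language and speech.",
        "Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt.",
    ),
]
stats = corpus_stats(pairs)
print(stats)
```

For this pair the result is (1, 13, 20). On the Vietnamese side whitespace separates syllables rather than words, so a meaningful word count needs a word segmenter first.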
