
Vietnamese Language Processing: Issues and Challenges


Ho Tu Bao
Vietnamese Academy of Science and Technology
Japan Advanced Institute of Science and Technology

(Keynote talk at international conference IEEE RIVF 2009)

IEEE RIVF’09, 16 July 2009


Institute of Information Technology, Vietnamese Academy of Science & Technology
Japan Advanced Institute of Science and Technology



Outline


Problems and progress in natural language processing

Issues and challenges in Vietnamese language processing

Our VLSP project (Vietnamese Language and Speech Processing)



Natural language processing?


Psychological view: Understand human language processing

Alan Turing proposed to consider the question: “Can machines think?”

Engineering view: Build systems to process language




More languages than you might have thought
6,912 distinct languages (230 spoken in Europe, 2,197 in Asia)



We meet here today to talk about processing of Vietnamese language and speech.

Aujourd'hui nous nous réunissons ici pour discuter du traitement de la langue et de la parole vietnamiennes.

Сегодня мы встречаемся здесь, чтобы говорить об обработке вьетнамского языка и речи.



[The same sentence in two further languages; the original characters were lost in extraction]



‫أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة‬

‫الخطاب‬



Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt.


54 ethnic groups in Vietnam
Language groups:
  Mon-Khmer
  Tay-Thai
  Tibeto-Burman
  Malayo-Polynesian
  Kadai
  Mong-Dao
  Han



English websites and Vietnamese?



Translation and machine translation


Translate the following sentence into English:
“Ông già đi nhanh quá”?

Many possible translations
1. [Ông già] [đi] [nhanh quá]
   - The old man walks too fast
   - My father walks too fast
2. [Ông già] [đi] [nhanh quá] (same segmentation, with “đi” read as “pass away”)
   - The old man died too fast
   - My father died too fast
3. [Ông] [già đi] [nhanh quá]
   - You get old too fast
   - Grandfather gets old too fast

Ambiguity of language
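
The segmentation side of this ambiguity can be made concrete with a short sketch. It is not taken from the talk: the function name `segmentations` and the toy lexicon are assumptions chosen for illustration, and a real segmenter would need a full dictionary plus a way to rank the candidates.

```python
def segmentations(syllables, lexicon):
    """Return every way to split `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        word = " ".join(syllables[:end])
        if word in lexicon:
            # Keep this word and segment the remaining syllables recursively.
            for rest in segmentations(syllables[end:], lexicon):
                results.append([word] + rest)
    return results

# Toy lexicon: an assumption for illustration, not a real dictionary.
LEXICON = {"ông", "già", "ông già", "đi", "già đi", "nhanh", "quá", "nhanh quá"}

sentence = "ông già đi nhanh quá".split()
candidates = segmentations(sentence, LEXICON)
for seg in candidates:
    print(" ".join(f"[{word}]" for word in seg))
```

Even this five-syllable sentence yields several candidate segmentations, and the lexical ambiguity of “đi” (walk / pass away) comes on top of that.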


Two approaches to machine translation

Linguistic rule-based machine translation
  Words are translated by using linguistic rules about the two languages and the correspondence transfer between them (morphology, syntax, etc).
  Requires understanding natural language.

Statistical machine translation
  Generate translations using statistical learning methods based on bilingual text corpora (statistically similar).
  Requires large, high-quality bilingual text corpora.
  DOMINATING!
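
As a minimal sketch of what "statistical learning from bilingual text" can look like, the block below trains IBM Model 1, a classic word-based translation model, with EM on three toy sentence pairs. The talk does not name a specific model, so the choice of IBM Model 1, the function name `train_ibm_model1`, and the toy corpus are all assumptions made for illustration.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=20):
    """Estimate t(e | f) with EM (IBM Model 1, without a NULL word)."""
    src_vocab = {f for fs, _ in pairs for f in fs}
    tgt_vocab = {e for _, es in pairs for e in es}
    # Start from a uniform translation table.
    t = {f: {e: 1.0 / len(tgt_vocab) for e in tgt_vocab} for f in src_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[f][e] for f in fs)
                for f in fs:
                    frac = t[f][e] / norm
                    count[f][e] += frac
                    total[f] += frac
        # M-step: renormalise the counts into probabilities.
        for f in src_vocab:
            for e in tgt_vocab:
                t[f][e] = count[f][e] / total[f]
    return t

# Toy Vietnamese-English corpus (illustrative only).
PAIRS = [
    ("nhà này".split(), "this house".split()),
    ("sách này".split(), "this book".split()),
    ("sách hay".split(), "good book".split()),
]

table = train_ibm_model1(PAIRS)
best = {f: max(probs, key=probs.get) for f, probs in table.items()}
print(best)
```

No rule about either language is written down; the word correspondences emerge only from co-occurrence across sentence pairs, which is why the approach depends so heavily on corpus size and quality.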





From text to the meaning
Natural Language Processing (NLP)

Stages: text → Lexical / Morphological Analysis → Syntactic Analysis → Semantic Analysis → Discourse Analysis → meaning
Tasks: Tagging; Shallow parsing (Chunking); Grammatical Relation Finding; Named Entity Recognition; Word Sense Disambiguation; Reference Resolution

Example: The woman will give Mary a book
  POS tagging:      The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
  chunking:         [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
  relation finding: [The woman] (subject) [will give] [Mary] (i-object) [a book] (object)
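
The step from POS tags to chunks in the example above can be sketched with a few hand-written rules. This is a toy rule set written for this one sentence, not the method used in the talk; the tag groupings and the names `parse_tagged` and `chunk` are assumptions.

```python
NOUN_TAGS = {"NN", "NNP"}
NP_TAGS = NOUN_TAGS | {"Det"}
VP_TAGS = {"MD", "VB"}

def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]

def chunk(tagged):
    """Group tagged words into NP / VP chunks with simple adjacency rules."""
    chunks = []
    prev_tag = None
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VP" if tag in VP_TAGS else "O"
        # A determiner, or a word following a noun, opens a fresh NP.
        starts_new_np = label == "NP" and (tag == "Det" or prev_tag in NOUN_TAGS)
        if not chunks or chunks[-1][0] != label or starts_new_np:
            chunks.append((label, []))
        chunks[-1][1].append(word)
        prev_tag = tag
    return chunks

tagged = parse_tagged("The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN")
for label, words in chunk(tagged):
    print(f"[{' '.join(words)}]{label}")
```

The rule that a noun closes its NP is what separates [Mary] from [a book]; it would wrongly split noun compounds, which is one reason practical chunkers are trained on annotated corpora rather than hand-coded.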



Archeology of natural language processing

1990s-2000s: Statistical learning
  algorithms, evaluation, corpora
1980s: Standard resources and tasks
  Penn Treebank, WordNet, MUC
1970s: Kernel (vector) spaces
  clustering, information retrieval (IR)
1960s: Representation Transformation
  Finite state machines (FSM) and Augmented transition networks (ATNs)
1960s: Representation: beyond the word level
  lexical features, tree structures, networks

Diagram labels: Natural language processing; Information retrieval and Information extraction; Trainable FSMs; Trainable parsers

(Hovy, COLING 2004)


ML and statistical methods in NLP

some ML/Stat

no ML/Stat

(Pages 11-12 from Marie Claire, ECML/PKDD 2005)


Recent learning methods in NLP



NLP R&D in other countries



Large investment from the government and industry
  National Institute of Standards and Technology (NIST), ATR, NICT
  USA, China, Singapore, etc.

NLP & CL organizations
  ACL (Association for Computational Linguistics)
  NAACL (North American Chapter of the ACL)
  EACL (European Chapter of the ACL)
  PACLIC (Pacific Asia Conference on Language, Information and Computation)
  ICCL (International Committee on Computational Linguistics)

Many NLP people

Rich resources and tools
  Linguistic Data Consortium



Vietnamese language


The Vietnamese language was established a long time ago

Chinese characters were used for a long time

Unique writing system of Vietnam, called Chữ Nôm (字喃), in the 10th century

Romanized script (Quốc Ngữ) since the beginning of the 20th century

Nam quốc sơn hà Nam đế cư
南國山河南帝居
“Over Mountains and Rivers of the South, Reigns the Emperor of the South”


Vietnamese language


Vietnamese is an analytic language (words are composed of a single morpheme).
  ngôn ngữ (analytic), lang-gua-ge (synthetic), 言語 (synthetic)

Vietnamese does not use morphological marking of case, gender, number, or tense.
  Trưa nay tôi ăn ba thằng tôm
  (word for word: “noon this I eat three CLASSIFIER shrimp”)

Syntax conforms to Subject Verb Object word order

  Cái    thằng       chồng    em  nó  chẳng  ra        gì.
  FOCUS  CLASSIFIER  husband  I   he  not    turn.out  what
  “That husband of mine, he is good for nothing.”



Vietnamese Language and Speech Processing

Most work aims at machine translation or other tasks at the top layers, but there is very little basic work at the lower layers.

Work is done in isolation, with no inheritance → people have to start from scratch without sharing or collaborating → no standards.

Almost no resources and tools for VLSP

[Example sentence; the original characters were lost in extraction]

Many tools such as ChaSen, YamCha, …

No tool to do such a simple task



VLSP national project (KC01.01.05/06-10)
5.2007-8.2009

National project with eleven active VLSP research groups from Ho Chi Minh City to Hanoi, with two objectives:
1. Building VLSP infrastructure, especially indispensable resources and tools for VLSP development.
2. Building and developing several typical VLSP products for public end users.

Diagram layers:
  Pragmatics: speech, text and Web data mining
  Natural language processing methods
  Tools, corpora, resources


Project target products

SP1    Application-oriented systems based on Vietnamese speech recognition & synthesis
SP2    Speech recognition system with large vocabulary
SP3    English-Vietnamese translation system
SP4    IREST: Internet use support system
SP5    Vietnamese spelling checker
SP6.1  Corpora for speech recognition
SP6.2  Corpora for speech synthesis
SP6.3  Corpora for specific words
SP7.1  English-Vietnamese dictionary
SP7.2  Viet dictionary
SP7.3  Vietnamese treebank
SP7.4  E-V corpora of aligned sentences
SP8.1  Speech analysis tools
SP8.2  Vietnamese word segmentation
SP8.3  Vietnamese POS tagger
SP8.4  Vietnamese chunker
SP8.5  Vietnamese syntax analyser

To be standard for long term development


VLSP website: open soon to the public



SP7.2: Viet Machine Readable Dictionary

Study other MRDs
  EDR Electronic Dictionary (Japanese EDR; Institute of Electronic Dictionary, 1980s-1990s)
  FrameNet (UC Berkeley)
  TCL's Computational Lexicon

Build a model of VCL (Vietnamese Computational Lexicon)
  The macroscopic structure
  The microscopic structure
  The content and VCL structure
  Tool and VCL construction




SP7.2: Viet Machine Readable Dictionary

Microscopic structure
  Morphological information
  Syntactic information, e.g., two kinds of verb
    Sub-V-Obj: Lợn ăn rau (“pigs eat vegetables”), Xe ăn xăng (“the vehicle consumes petrol”)
    Sub-V: Chim bay (“birds fly”), Chó chạy (“dogs run”)
  Semantic information: logic and semantic constraints, definition, context
  Examples: bé ngủ (“the baby sleeps”), bé đang ngủ (“the baby is sleeping”)

VCL content and structure
  35,000 commonly used words in modern Vietnamese

Tool for the construction
  Develop a tool for building VCL with XML representation.
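
To illustrate what an XML-encoded lexicon entry with morphological, syntactic, and semantic layers might look like, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`entry`, `morphology`, `syntax`, `semantics`, `example`, `frame`) are hypothetical; the actual VCL schema is not given in the slides.

```python
import xml.etree.ElementTree as ET

def make_entry(headword, pos, frame, examples):
    """Build one lexicon entry with three information layers (hypothetical schema)."""
    entry = ET.Element("entry", {"headword": headword})
    # Morphological layer: here just a part-of-speech attribute.
    ET.SubElement(entry, "morphology").set("pos", pos)
    # Syntactic layer: the verb frame, e.g. Sub-V-Obj or Sub-V.
    ET.SubElement(entry, "syntax").set("frame", frame)
    # Semantic layer: usage examples stand in for definitions and constraints.
    semantics = ET.SubElement(entry, "semantics")
    for text in examples:
        ET.SubElement(semantics, "example").text = text
    return entry

entry = make_entry("ăn", "V", "Sub-V-Obj", ["Lợn ăn rau", "Xe ăn xăng"])
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)

# Round-trip: parse the serialized entry back into an element tree.
parsed = ET.fromstring(xml_text)
```

Keeping each layer in its own element lets a tool read only the information it needs, for example a parser that consults just the `frame` attribute.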


SP7.3: Viet Treebank


A treebank, or parsed corpus, is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure.

[Figure: parse tree of “Ông già đi nhanh quá”, with node labels S, NP, VP, P, V, NP, T]

English: Penn Treebank (4.5M words) and many others
Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)
Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
Korean: Korean Treebank (5078 trees, 54K words)

Viet Treebank (7.2007-5.2009):
  10,000 trees
  1,000,000 morphemes

[Figure: Viet Treebank shown with the Viet word segmenter, Viet POS tagger, Viet chunker, and Viet syntactic parser, leading to Viet machine translation, info extraction, etc.]
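
Treebank annotations are commonly stored as bracketed strings. The sketch below reads one such string into a nested structure and recovers the words. The bracketing and the labels used for the example sentence are hypothetical and do not reproduce the Viet Treebank guidelines; `parse_tree` and `leaves` are illustrative names.

```python
def parse_tree(text):
    """Parse a bracketed tree such as '(S (NP a) (VP b))' into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf token
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()

def leaves(tree):
    """Return the leaf tokens of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

# Hypothetical bracketing of the example sentence, for illustration only.
tree = parse_tree("(S (NP Ông già) (VP (V đi) (ADVP nhanh quá)))")
print(tree[0], leaves(tree))
```

Tools built on a treebank (segmenters, taggers, chunkers, parsers) all start from a reader of this kind, then extract the layer of annotation they need for training.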


SP7.3: Viet Treebank


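Study various existing treebanks, modern theories of syntax, and the Vietnamese language

Build guidelines for word segmentation, POS, and syntax

  “Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả”
  (“the house is in a jumble” and “at home the door is not closed”)
  Here “nhà cửa” is one word in the first sentence, but in the second the same syllables belong to two words, “nhà” and “cửa ngõ”.

  “Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn”
  (“She keeps her beauty” and “this painting has better color”)
  Likewise, “sắc đẹp” is one word in the first sentence, but in the second the syllables split into “màu sắc” and “đẹp”.

Build the tools

Labeling

Agreement between labelers (95%)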
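
The slides do not say how the 95% agreement was measured. As one plausible reading, the sketch below computes simple observed agreement between two labelers, together with Cohen's kappa, a standard chance-corrected alternative that the talk does not mention. The label sequences are invented toy data.

```python
from collections import Counter

def observed_agreement(a, b):
    """Fraction of items on which the two labelers chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each labeler's own label distribution.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented POS labels from two labelers for the same ten tokens.
labeler_1 = ["N", "V", "N", "A", "N", "V", "N", "N", "A", "V"]
labeler_2 = ["N", "V", "N", "N", "N", "V", "N", "N", "A", "V"]

agreement = observed_agreement(labeler_1, labeler_2)
kappa = cohen_kappa(labeler_1, labeler_2)
print(f"observed agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement whenever one label dominates, so reporting both gives a fairer picture of how consistent the guidelines are.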



SP7.4: English-Vietnamese parallel corpus

Pairs of corresponding sentences in English and Vietnamese (size & quality)
Readily available for many languages (LDC: English-French corpus of 2.8M sentences, sourced from the Canadian Parliament)
No publicly available parallel corpus for Vietnamese
Building corpora needs time, money and human resources (boring job)

Parallel Corpus (L1-L2)   Sentences    L1 Words      L2 Words
German-English            1,313,096    34,700,362    36,663,083
Greek-English               662,090    18,834,758    18,827,241
Spanish-English           1,304,116    37,870,751    36,429,274
Finnish-English           1,257,720    24,895,790    34,802,617
French-English            1,334,080    41,573,117    37,436,222
Italian-English           1,251,315    36,411,166    36,510,033
Dutch-English             1,326,412    36,784,168    36,690,392
Portuguese-English        1,287,757    37,342,426    36,355,907
Swedish-English           1,164,536    28,882,142    32,053,628
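
A quick way to read these corpus figures is to derive average sentence lengths and the L1-to-L2 word ratio for each language pair. The sketch below does this for three rows copied from the table; the function name `length_profile` is an assumption. Length ratios of this kind are one of the signals that length-based sentence aligners rely on.

```python
# (sentences, L1 words, L2 words) for three rows of the table above.
CORPUS_STATS = {
    "German-English": (1_313_096, 34_700_362, 36_663_083),
    "Finnish-English": (1_257_720, 24_895_790, 34_802_617),
    "French-English": (1_334_080, 41_573_117, 37_436_222),
}

def length_profile(sentences, l1_words, l2_words):
    """Average sentence length on each side and the L1/L2 word ratio."""
    return {
        "l1_words_per_sentence": l1_words / sentences,
        "l2_words_per_sentence": l2_words / sentences,
        "l1_to_l2_ratio": l1_words / l2_words,
    }

profiles = {pair: length_profile(*stats) for pair, stats in CORPUS_STATS.items()}
for pair, profile in profiles.items():
    print(pair, {key: round(value, 2) for key, value in profile.items()})
```

The ratio differs noticeably across pairs: Finnish uses far fewer words than English for the same content, while French uses more. A Vietnamese-English corpus would have its own characteristic ratio, which also depends on whether Vietnamese is counted in syllables or in segmented words.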




