Vietnamese-English Cross Language Search Information Retrieval (CLIR) - Discovering Noun Phrases for Translation

CSC 177 Presentation
Nguyen Doan H, Ph.D


Outline
• Motivations
• Crosslingual query
• Noun phrase translation extraction
• Experiments and results
• Conclusion and next steps


Motivations – Unknown Translations
• Words outside the scope of a bilingual dictionary
– Brand names, place names, personal names
– Titles (music, books, videos)
– Terminology (science, computing, medicine, space, farming, etc.)
• Compound nouns
– Meaning might not be inferable from the individual components
– Might require expert knowledge to translate
– Might have multiple correct translations
• Applicability
– Cross-language Information Retrieval (CLIR)
– Machine Translation (MT)
– Machine-Readable Dictionary (MRD)
• Most of these words are Out-Of-Vocabulary (OOV)



Examples
Example 1: Computer terminology (phần mềm -> software)


Examples
Example 2: Personal name (ca sĩ Quang Dũng -> Singer Quang Dung)


Searching the web for translation?
• Parallel data on the Web (figure label: Vietnamese to English translation)


Searching the web for translation?
• Comparable corpus on the web:


Searching the web for translation?
• Mixed-language web pages (figure label: English translation)


Our Approach
• Extends the 2005 paper by Ying Zhang (CMU) (credit)
• Addresses issues specific to Vietnamese-English OOV translation
• Translates proper names with a pattern-recognition technique rather than phonetic similarity and string alignment
• Detects borrowed English words
• Improves translation suggestions by using contextual information


Crosslingual Query to Obtain Mixed-Language Web Pages
• Extend the source query, VS, with extended words/phrases, VEX, that tend to co-occur with it frequently:
– VS: phần mềm → ?
– VS VEX: phần mềm miễn phí
• Translate the extended words/phrases, VEX, to English, EEX:
– VEX: miễn phí → EEX: free
• Submit the source query together with the translated words/phrases to a search engine:
– VS EEX: phần mềm free
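
The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not the presenters' implementation: `SEED_DICTIONARY` and `build_crosslingual_query` are hypothetical names, the dictionary holds just the one entry used on the slide, and the query is formed by plain concatenation as in the slide's example.

```python
# Hypothetical seed dictionary: only the entry shown on the slide.
SEED_DICTIONARY = {"miễn phí": "free"}


def build_crosslingual_query(vs, vex, dictionary=SEED_DICTIONARY):
    """Combine the source query VS with EEX, the English translation of VEX."""
    eex = dictionary.get(vex)
    if eex is None:
        # Without a known translation for VEX, no crosslingual query can be built.
        raise KeyError(f"no English translation known for {vex!r}")
    return f"{vs} {eex}"


print(build_crosslingual_query("phần mềm", "miễn phí"))  # phần mềm free
```

The resulting mixed-language string is what would be submitted to the search engine in order to retrieve pages that contain both Vietnamese and English text.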


How to Find VEX?
• Find co-occurring terms in a web log
• Use co-occurring terms in the search query (in CLIR)
• Search Google with VS and select the Vietnamese words, VEX, that occur with high frequency
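
One possible way to pick VEX by frequency is sketched below. The slides do not say how co-occurring words are counted, so everything here is an assumption: the function name `candidate_extensions` is hypothetical, snippets are split on whitespace, and only the two syllables that directly follow VS are counted (many Vietnamese words have two syllables).

```python
from collections import Counter


def candidate_extensions(snippets, vs, n_syllables=2, top_k=3):
    """Count the n_syllables tokens that directly follow VS in each snippet."""
    counts = Counter()
    vs_tokens = vs.lower().split()
    k = len(vs_tokens)
    for snippet in snippets:
        tokens = snippet.lower().split()
        # Slide a window over the snippet and look for the VS token sequence.
        for i in range(len(tokens) - k - n_syllables + 1):
            if tokens[i:i + k] == vs_tokens:
                following = " ".join(tokens[i + k:i + k + n_syllables])
                counts[following] += 1
    return counts.most_common(top_k)


snippets = [
    "Tải phần mềm miễn phí",
    "phần mềm miễn phí cho máy tính",
    "phần mềm diệt virus",
]
print(candidate_extensions(snippets, "phần mềm"))
```

With these three made-up snippets, "miễn phí" follows "phần mềm" twice and is therefore the top candidate for VEX.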

Overture Search Log


Original Source Query


Crosslingual Query


Our Approach: Noun Phrase Translation Extraction
• Proper noun recognition & Transliteration
• Preprocessing
• Frequency-Distance Model
• Contextual Ordering Model & Result Ranking


Yahoo Search API - Returned XML Data

Snippet



Proper name recognition & Transliteration
• Extract and concatenate Title, Summary, and URL
• Recognize that a proper name is likely to appear with its first letters capitalized
• Compute the likelihood that the query text is a proper name:

$$P(V_S) = \frac{\text{occurrences of } V_S \text{ with capitalized first letters in the snippet text}}{\text{all occurrences of } V_S \text{ in the snippet text}}$$

• Once recognized, map Vietnamese vowels to English vowels:
– e.g. á → a, à → a, …, ũ → u, …
• Suggest a translation candidate, e.g. VN: Quang Dũng → Eng: Quang Dung
• Compute and assign a weight to the translation candidate
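
The capitalization ratio and the vowel mapping above can be sketched as follows. This is an illustration under assumptions, not the presenters' code: the function names are hypothetical, "capitalized" is taken to mean that every word of the match starts with an upper-case letter, and the accent removal uses Unicode decomposition rather than an explicit vowel table. Mapping đ to d goes beyond the vowels listed on the slide and is added here as an assumption.

```python
import re
import unicodedata


def proper_name_likelihood(vs, snippet_text):
    """P(VS): capitalized occurrences of VS divided by all occurrences of VS."""
    vs = unicodedata.normalize("NFC", vs)
    text = unicodedata.normalize("NFC", snippet_text)
    matches = re.findall(re.escape(vs), text, flags=re.IGNORECASE)
    if not matches:
        return 0.0
    capitalized = sum(
        1 for m in matches if all(word[0].isupper() for word in m.split())
    )
    return capitalized / len(matches)


def strip_diacritics(text):
    """Map accented Vietnamese letters to their plain base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: đ/Đ have no Unicode decomposition, so map them explicitly.
    return base.replace("đ", "d").replace("Đ", "D")


text = "Ca sĩ Quang Dũng ra album mới. quang dũng hát. Quang Dũng"
print(proper_name_likelihood("Quang Dũng", text))
print(strip_diacritics("Quang Dũng"))  # Quang Dung
```

In the sample text, two of the three occurrences are capitalized, so the likelihood is 2/3. A threshold on this value (not specified in the slides) would decide whether the query is treated as a proper name.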


Preprocessing (Query: Thuật toán genetic)
– Extract and concatenate Title, Summary, and URL
Thuật toán-Cấu trúc dữ liệu ... (Reserve Polish Notation – RPN), một thuật toán "kinh điển"
trong lĩnh vực trình biên dịch. ... THUẬT GIẢI DI TRUYỀN – GENETIC ALGORITHM Kỳ 2 ... ity.vnuit.edu.vn/thuattoan/index.htm

– Mark the query terms (thuật toán → ~123456789, genetic → ~987654321), normalize the text, and remove noise text
~123456789 cấu trúc dữ liệu reserve polish notation – rpn một ~123456789 kinh điển
trong lĩnh vực trình biên dịch thuẬt giẢi di truyỀn – ~987654321 algorithm kỳ 2 ity
vnuit edu vn thuattoan index htm

– Mark recognized Vietnamese text with a VNW tag
~123456789 VNW VNW VNW VNW reserve polish notation VNW rpn ~123456789 VNW

VNW trong VNW VNW VNW VNW VNW VNW di VNW VNW ~987654321 algorithm
VNW ity vnuit edu vn thuattoan index htm

– Group contiguous English words and build the word list
['~123456789', 'VNW', 'VNW', 'VNW', 'VNW', '', '', 'reserve_polish_notation', 'VNW', 'rpn',
'~123456789', 'VNW', 'VNW', 'trong', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'di',
'VNW', 'VNW', '~987654321', 'algorithm', 'VNW', 'ity', 'vnuit', 'edu', 'vn', 'thuattoan', 'index',
'htm']
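
A simplified version of this pipeline is sketched below. It is not the presenters' preprocessor and will not reproduce the slide's output exactly: the function name `preprocess` is hypothetical, a token is tagged VNW only if it contains a non-ASCII character (so unaccented syllables such as "trong" or "di" are left untouched, as they also are on the slide), and every run of untagged tokens is joined with underscores. The marker strings are the ones shown on the slide.

```python
import re
import unicodedata

VS_MARK = "~123456789"   # marker for the source query VS (from the slide)
EEX_MARK = "~987654321"  # marker for the English extension EEX (from the slide)


def preprocess(snippet, vs, eex):
    """Mark query terms, normalize, tag Vietnamese tokens, group English runs."""
    text = unicodedata.normalize("NFC", snippet).lower()
    text = text.replace(unicodedata.normalize("NFC", vs).lower(), VS_MARK)
    text = text.replace(eex.lower(), EEX_MARK)
    # Remove noise: keep only word characters and the marker prefix "~".
    text = re.sub(r"[^\w~]+", " ", text)
    tokens = [t for t in text.split() if not t.isdigit()]
    # Assumption: any token with a non-ASCII character is Vietnamese.
    tagged = ["VNW" if not t.isascii() else t for t in tokens]

    grouped, run = [], []
    for token in tagged:
        if token == "VNW" or token.startswith("~"):
            if run:  # close the current run of contiguous English words
                grouped.append("_".join(run))
                run = []
            grouped.append(token)
        else:
            run.append(token)
    if run:
        grouped.append("_".join(run))
    return grouped


print(preprocess("Thuật toán tối ưu – GENETIC algorithm search", "thuật toán", "genetic"))
```

Grouping contiguous English words keeps multi-word candidates such as "reserve_polish_notation" together as a single translation candidate in the later scoring steps.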


Frequency-Distance Model
• The model combines:
– Frequency of co-occurrence
– Distance from the candidate e to either VS or EEX within a snippet text
– Summed over all returned document summaries

$$w(e) = \sum_{s_i} \left( \sum_{V_{S_i}} \frac{1}{d(V_{S_i}, e)} + \sum_{E_{EX_i}} \frac{1}{d(E_{EX_i}, e)} \right)$$
• Example: Thuật toán genetic
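
The weight w(e) can be computed over preprocessed word lists roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the distance d is taken to be the difference in token positions within one snippet's word list (the slides do not state the unit), and the query markers are the ones used in the preprocessing slide.

```python
from collections import defaultdict

VS_MARK = "~123456789"   # marker for VS, as in the preprocessing step
EEX_MARK = "~987654321"  # marker for EEX, as in the preprocessing step


def frequency_distance_weights(word_lists):
    """Sum 1/d(anchor, e) over every VS/EEX occurrence in every snippet."""
    weights = defaultdict(float)
    skip = {VS_MARK, EEX_MARK, "VNW", ""}
    for words in word_lists:
        anchors = [i for i, w in enumerate(words) if w in (VS_MARK, EEX_MARK)]
        for j, word in enumerate(words):
            if word in skip:
                continue  # only English candidates receive a weight
            for i in anchors:
                weights[word] += 1.0 / abs(i - j)
    return dict(weights)


word_lists = [
    [VS_MARK, "VNW", EEX_MARK, "algorithm"],
    [EEX_MARK, "algorithm"],
]
print(frequency_distance_weights(word_lists))
```

In this toy input, "algorithm" receives 1/3 + 1/1 from the first snippet and 1/1 from the second, so candidates that appear often and close to the query terms accumulate the largest weights.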


Contextual Ordering Model & Result Ranking
• Estimate Closeness Probability

$$P(V_S\, e) = \frac{\mathrm{ADJ}(V_S\, e)}{\sum_{e} c(e) + \sum_{V_S} c(V_S)}$$

$$P(e\, E_{EX}) = \frac{\mathrm{ADJ}(e\, E_{EX})}{\sum_{e} c(e) + \sum_{E_{EX}} c(E_{EX})}$$

• Overall Score for each candidate

$$\mathrm{RankScore}(e) = w(e) \cdot P(V_S\, e) \cdot P(e\, E_{EX})$$
• Sort by score and present the top 5 suggestions
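
The ranking step can be sketched as below. The slides do not define ADJ or c, so this sketch assumes that ADJ(x y) is the number of times x is immediately followed by y and that c(·) is a plain occurrence count. The function names and the dictionary keys (`w`, `adj_vs_e`, `adj_e_eex`, `c_e`, `c_vs`, `c_eex`) are hypothetical, and the counts in the example are made up for illustration.

```python
def closeness(adjacent, count_a, count_b):
    """Adjacent-pair count divided by the summed counts of both items."""
    total = count_a + count_b
    return adjacent / total if total else 0.0


def rank_candidates(stats, top_n=5):
    """RankScore(e) = w(e) * P(VS e) * P(e EEX), highest score first."""
    scores = {}
    for candidate, s in stats.items():
        p_vs_e = closeness(s["adj_vs_e"], s["c_e"], s["c_vs"])
        p_e_eex = closeness(s["adj_e_eex"], s["c_e"], s["c_eex"])
        scores[candidate] = s["w"] * p_vs_e * p_e_eex
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up counts for two candidates.
stats = {
    "algorithm": {"w": 2.0, "adj_vs_e": 1, "adj_e_eex": 2,
                  "c_e": 3, "c_vs": 2, "c_eex": 1},
    "rpn": {"w": 1.0, "adj_vs_e": 0, "adj_e_eex": 0,
            "c_e": 1, "c_vs": 2, "c_eex": 1},
}
print(rank_candidates(stats))
```

Because the score is a product, a candidate that is never adjacent to VS or to EEX scores zero under this reading of the formula, regardless of its frequency-distance weight.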


Sample Program Output # 1
(dân ca -> folk or traditional music)


Sample Program Output # 2
(Quang Dũng -> Quang Dung)


Sample of Translation Results

Category | Vietnamese Phrase/Word | Vietnamese-English Web-mining Translation | Vdict (Machine Translation) | Vietdict (Online Dictionary)
Organization Name | WTO là gì? | What is world trade organization ? | What is WTO? | No definition found
Science & Tech | thuật toán di truyền | Genetic algorithms | Heredity algorism | No definition found
Location Name | Thừa Thiên Huế | Thua Thien Hue | Partial Excess Hue | No definition found
Person Name | ca sĩ Quang Dũng | Singer Quang Dung | Optical singer Dũng | N/A
Medical Term | viêm màng não | Meningitis brain infection | meningitis | No definition found
Geographical name | Đại dương Bắc Băng Dương | Arctic ocean | Đạtôi glacial ocean Boreal Yang | No definition found
Education | học vị Tiến sỹ | Phd degree | Advance academical degree sỹ | No definition found
Music | dân ca | Folk music | folk-song | folk-song
Music | nhạc hip hop | Hip hop music or Rap music | music cây hu-blông hông | No definition found
Space | phi hành gia Sally Ride | Former astronaut Sally Ride | air-man sự Phá vây cưỡi | No definition found
Plant | cây kiểng vườn Nhật | Bonsai Japanese garden | Japanese garden plant kiểng | No definition found
Farming | những nghề cá thủy sản | Aquaculture fisheries | seafood fisheries | No definition found
Laws | cư trú thường trực | permanent resident | populate permanent | No definition found
Astrological | Thuật chiêm tinh phong thủy | feng shui astrology | Geomancy astrology | Geomancy


Conclusion and Next Steps
• Contributions
– Recognizes and translates important phrases
– Translates persons, locations, and concepts
– Low implementation cost with reasonable performance

• Future work
– Experiment with a larger set of test data
– Integrate with Vietnamese-English CLIR work
– Automate the generation of extended words/phrases and the derivation of their English extended words
– Experiment with the "Refine Result" concept for search engines


