Vietnamese-English Cross Language Search
Information Retrieval (CLIR) Discovering Noun Phrases for Translation
CSC 177 Presentation
Nguyen Doan H, Ph.D
Outline
• Motivations
• Crosslingual Query
• Noun phrase translation extraction
• Experiments and results
• Conclusion and next steps
Motivations – Unknown Translations
• Words that are outside the scope of a bilingual dictionary
– Brand names, place names, personal names
– Titles (music, book, video)
– Terminology (science, computing, medicine, space, farming, etc.)
– Most of these words are Out-Of-Vocabulary (OOV)
• Compound nouns
– Meaning might not be inferable from the individual components
– Might require expert knowledge for translation
– Might have multiple correct translations
• Applicability
– Cross-language Information Retrieval (CLIR)
– Machine Translation (MT)
– Machine-Readable Dictionary (MRD)
Examples
Example 1: Computer terminology (phần mềm -> software)
Example 2: Personal name (ca sĩ Quang Dũng -> Singer Quang Dung)
Searching the web for translation?
• Parallel data on the web (figure callout: Vietnamese to English translation)
• Comparable corpora on the web
• Mixed-language web pages (figure callout: English translation)
Our Approach
• Extends Ying Zhang's (CMU) 2005 paper (credit)
• Addresses issues specific to Vietnamese-English OOV translation
• Proper name translation uses a pattern-recognition technique rather than phonetic similarity and string alignment
• Detection of borrowed English words
• Improves translation suggestions by using contextual information
Crosslingual Query to Obtain Mixed-Language Web Pages
• Extend the source query, VS, with extended words/phrases, VEX, that tend to co-occur with it frequently
– VS : phần mềm → ?
– VS VEX : phần mềm miễn phí
• Translate the extended words/phrases, VEX, to English, EEX:
– VEX : miễn phí → EEX : free
• Submit both the source query and the translated words/phrases to a search engine
– VS EEX : phần mềm free
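The query construction above can be sketched in a few lines of Python. This is only an illustration: the function name `build_crosslingual_query` and the one-entry `TOY_DICTIONARY` are hypothetical stand-ins, and the slides do not say how VEX is actually translated.

```python
def build_crosslingual_query(vs, vex, translate):
    """Combine the source query VS with the English translation EEX of VEX.

    `translate` is any callable mapping a Vietnamese phrase to English
    (hypothetical; here a plain dictionary lookup). Returns None when the
    extended phrase cannot be translated.
    """
    eex = translate(vex)
    if eex is None:
        return None
    return f"{vs} {eex}"


# Toy bilingual dictionary holding only the example from the slide.
TOY_DICTIONARY = {"miễn phí": "free"}

query = build_crosslingual_query("phần mềm", "miễn phí", TOY_DICTIONARY.get)
print(query)
```

The resulting mixed-language string is what would be submitted to the search engine in order to retrieve pages that contain both Vietnamese and English text.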
How to Find This VEX?
• Find co-occurring terms in a web log
• Use co-occurring terms in the search query (in CLIR)
• Search Google with VS and select the Vietnamese words, VEX, with high frequency
(Figure: Overture search log, showing the original source query and the crosslingual query)
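One way to picture the "high frequency" selection is to count the syllables that directly follow VS in the returned snippets. The sketch below rests on several assumptions that are not in the slides: candidates are fixed-length n-grams right after VS, tokens are split on whitespace only, and `candidate_extensions` is a hypothetical helper name.

```python
from collections import Counter


def candidate_extensions(vs, snippets, n=2, top_k=3):
    """Count the n syllables that immediately follow VS in each snippet.

    Assumption: a useful VEX tends to appear right after VS, as in
    "phần mềm miễn phí". Punctuation handling is omitted for brevity.
    """
    counts = Counter()
    vs_tokens = vs.lower().split()
    for snippet in snippets:
        tokens = snippet.lower().split()
        # Stop early enough that a full n-gram still fits after VS.
        for i in range(len(tokens) - len(vs_tokens) - n + 1):
            if tokens[i:i + len(vs_tokens)] == vs_tokens:
                start = i + len(vs_tokens)
                counts[" ".join(tokens[start:start + n])] += 1
    return counts.most_common(top_k)


snippets = [
    "Tải phần mềm miễn phí tốt nhất",
    "phần mềm miễn phí cho máy tính",
    "phần mềm diệt virus",
]
print(candidate_extensions("phần mềm", snippets))
```

A search log could be processed the same way by treating each logged query as a snippet.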
Our Approach: Noun Phrase Translation Extraction
• Proper noun recognition & Transliteration
• Preprocessing
• Frequency-Distance Model
• Contextual Ordering Model & Result Ranking
Yahoo Search API - XML Data Returned
(Figure: sample XML response; callout: Snippet)
Proper name recognition & Transliteration
• Extract and concatenate Title, Summary, and URL
• A proper name is likely to appear with its first letter capitalized
• Compute the likelihood that the query text is a proper name:
$$ P(V_S) = \frac{\text{occurrences of } V_S \text{ with first letter capitalized in the snippet text}}{\text{all occurrences of } V_S \text{ in the snippet text}} $$
• Once recognized, map Vietnamese vowels to English vowels:
– e.g. á → a, à → a, …, ũ → u, …
• Suggest a translation candidate, VN: Quang Dũng → Eng: Quang Dung
• Compute and assign a weight to the translation candidate
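The two steps above, the capitalization ratio and the vowel mapping, might look as follows in Python. This is a sketch under assumptions: the function names are hypothetical, "first letter capitalized" is read as every word of the match starting with an uppercase letter, diacritics are removed via Unicode decomposition instead of an explicit vowel table, and the đ → d mapping is an addition the slides do not mention.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Map accented Vietnamese letters to plain Latin ones (á -> a, ũ -> u)."""
    # đ/Đ do not decompose under NFD, so handle them explicitly (assumption).
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (category Mn) left behind by the decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def proper_name_likelihood(vs, snippet_text):
    """Share of occurrences of VS in which every word starts with a capital."""
    vs = unicodedata.normalize("NFC", vs)
    snippet_text = unicodedata.normalize("NFC", snippet_text)
    matches = re.findall(re.escape(vs), snippet_text, flags=re.IGNORECASE)
    if not matches:
        return 0.0
    capitalized = sum(
        1 for m in matches if all(word[0].isupper() for word in m.split())
    )
    return capitalized / len(matches)


text = "ca sĩ Quang Dũng hát; nghe nhạc quang dũng; album mới của Quang Dũng"
print(proper_name_likelihood("Quang Dũng", text))
print(strip_diacritics("Quang Dũng"))
```

A threshold on the ratio would then decide whether the transliterated form is offered as a candidate; the slides do not state which threshold or weight is used.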
Preprocessing (Query: Thuật toán genetic)
– Extract and concatenate Title, Summary, and URL
Thuật toán-Cấu trúc dữ liệu ... (Reserve Polish Notation – RPN), một thuật toán "kinh điển"
trong lĩnh vực trình biên dịch. ... THUẬT GIẢI DI TRUYỀN – GENETIC ALGORITHM Kỳ 2 ... ity.vnuit.edu.vn/thuattoan/index.htm
– Mark query, normalize text, remove noise text
~123456789 cấu trúc dữ liệu reserve polish notation – rpn một ~123456789 kinh điển
trong lĩnh vực trình biên dịch thuẬt giẢi di truyỀn – ~987654321 algorithm kỳ 2 ity
vnuit edu vn thuattoan index htm
– Mark recognized Vietnamese text with VNW tag
~123456789 VNW VNW VNW VNW reserve polish notation VNW rpn ~123456789 VNW
VNW trong VNW VNW VNW VNW VNW VNW di VNW VNW ~987654321 algorithm
VNW ity vnuit edu vn thuattoan index htm
– Group contiguous English words and build the word list
['~123456789', 'VNW', 'VNW', 'VNW', 'VNW', '', '', 'reserve_polish_notation', 'VNW', 'rpn',
'~123456789', 'VNW', 'VNW', 'trong', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'di',
'VNW', 'VNW', '~987654321', 'algorithm', 'VNW', 'ity', 'vnuit', 'edu', 'vn', 'thuattoan', 'index',
'htm']
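The preprocessing steps can be approximated with the sketch below. It is not the program that produced the output above: it assumes that a token is Vietnamese whenever it contains a non-ASCII character (which matches 'trong' and 'di' being left untagged in the example), it joins every run of adjacent ASCII tokens with underscores, and the names `preprocess` and `QUERY_MARKERS` are hypothetical. The marker strings are taken from the example.

```python
import re

# Marker strings as they appear in the example above.
QUERY_MARKERS = {"thuật toán": "~123456789", "genetic": "~987654321"}


def preprocess(text, markers):
    """Mark query terms, normalize, tag Vietnamese words, group English runs."""
    text = text.lower()
    for phrase, marker in markers.items():
        text = text.replace(phrase.lower(), marker)
    # Replace punctuation (everything except word characters and '~') by spaces.
    tokens = re.sub(r"[^\w~]+", " ", text).split()

    # Tag tokens containing non-ASCII characters as Vietnamese (assumption).
    tagged = [
        tok if tok.startswith("~") or tok.isascii() else "VNW" for tok in tokens
    ]

    # Join runs of adjacent untagged (English-looking) tokens with underscores.
    grouped, run = [], []
    for tok in tagged:
        if tok == "VNW" or tok.startswith("~"):
            if run:
                grouped.append("_".join(run))
                run = []
            grouped.append(tok)
        else:
            run.append(tok)
    if run:
        grouped.append("_".join(run))
    return grouped


print(preprocess("Thuật toán di truyền – genetic algorithm", QUERY_MARKERS))
```

Because of the simplified grouping rule, URL fragments such as "edu vn" would also be merged here, unlike in the slide's word list.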
Frequency-Distance Model
• Frequency-Distance model:
– Frequency of co-occurrence
– Distance from the candidate e to either VS or EEX within a snippet text
– Summed over all returned document summaries
$$ w(e) = \sum_{s_i} \left( \sum_{V_{S_i}} \frac{1}{d(V_{S_i}, e)} + \sum_{E_{EX_i}} \frac{1}{d(E_{EX_i}, e)} \right) $$
• Example: Thuật toán genetic
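A minimal sketch of the weight computation, assuming that each snippet has already been turned into a token list with query markers and VNW tags, and that the distance d is the difference in token positions (the slides do not define the unit of distance). The function name is hypothetical.

```python
from collections import defaultdict


def frequency_distance_weights(snippets, vs_marker, eex_marker):
    """Sum 1/d over every (anchor, candidate) pair in every snippet.

    `snippets` is a list of token lists; anchors are the positions of the
    VS and EEX markers; every other token except 'VNW' is a candidate e.
    """
    anchors_set = {vs_marker, eex_marker}
    weights = defaultdict(float)
    for tokens in snippets:
        anchors = [i for i, tok in enumerate(tokens) if tok in anchors_set]
        for j, tok in enumerate(tokens):
            if not tok or tok == "VNW" or tok in anchors_set:
                continue
            for i in anchors:
                # A candidate never sits on an anchor, so |i - j| >= 1.
                weights[tok] += 1.0 / abs(i - j)
    return dict(weights)


snips = [["~123456789", "VNW", "~987654321", "algorithm"]]
print(frequency_distance_weights(snips, "~123456789", "~987654321"))
```

Candidates that occur often and close to the query markers accumulate the largest weights, which is the intent of combining frequency and distance in one sum.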
Contextual Ordering Model & Result Ranking
• Estimate Closeness Probability
$$ P(V_S\, e) = \frac{ADJ(V_S\, e)}{\sum_{e} c(e) + \sum_{V_S} c(V_S)} $$
$$ P(e\, E_{EX}) = \frac{ADJ(e\, E_{EX})}{\sum_{e} c(e) + \sum_{E_{EX}} c(E_{EX})} $$
• Overall Score for each candidate
$$ RankScore(e) = w(e) \cdot P(V_S\, e) \cdot P(e\, E_{EX}) $$
• Sort by score and present the top 5 suggestions
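The ranking step could be sketched as below. The slides do not define ADJ or c; this sketch assumes that ADJ(x y) counts how often x is immediately followed by y in the tokenized snippets and that c(x) is the number of occurrences of x. The function name and the toy markers `~VS` and `~EEX` are hypothetical.

```python
from collections import Counter


def rank_candidates(snippets, weights, vs_marker, eex_marker, top_n=5):
    """Score each candidate as w(e) * P(VS e) * P(e EEX) and keep the top_n."""
    counts = Counter()
    adjacent = Counter()
    for tokens in snippets:
        counts.update(tokens)
        adjacent.update(zip(tokens, tokens[1:]))  # ordered neighbor pairs

    def closeness(adj_count, c_e, c_anchor):
        denom = c_e + c_anchor
        return adj_count / denom if denom else 0.0

    scores = {}
    for e, w in weights.items():
        p_vs_e = closeness(adjacent[(vs_marker, e)], counts[e], counts[vs_marker])
        p_e_eex = closeness(adjacent[(e, eex_marker)], counts[e], counts[eex_marker])
        scores[e] = w * p_vs_e * p_e_eex
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


snips = [["~VS", "software", "~EEX"], ["~VS", "program", "VNW"]]
ranked = rank_candidates(snips, {"software": 2.0, "program": 1.0}, "~VS", "~EEX")
print(ranked)
```

Because the score is a product, a candidate that never appears directly next to one of the two anchors scores zero under this reading; whether the original program smooths these probabilities is not stated.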
Sample Program Output # 1
(dân ca -> folk or traditional music)
Sample Program Output # 2
(Quang Dũng -> Quang Dung)
Sample of Translation Results
Category | Vietnamese Phrase/Word | Vietnamese-English Web-mining Translation | Vdict (Machine Translation) | Vietdict (Online Dictionary)
--- | --- | --- | --- | ---
Organization Name | WTO là gì? | What is world trade organization ? | What is WTO? | No definition found
Science & Tech | thuật toán di truyền | Genetic algorithms | Heredity algorism | No definition found
Location Name | Thừa Thiên Huế | Thua Thien Hue | Partial Excess Hue | No definition found
Person Name | ca sĩ Quang Dũng | Singer Quang Dung | Optical singer Dũng | N/A
Medical Term | viêm màng não | Meningitis brain infection | meningitis | No definition found
Geographical name | Đại dương Bắc Băng Dương | Arctic ocean | Đạtôi glacial ocean Boreal Yang | No definition found
Education | học vị Tiến sỹ | Phd degree | Advance academical degree sỹ | No definition found
Music | dân ca | Folk music | folk-song | folk-song
Music | nhạc hip hop | Hip hop music or Rap music | music cây hu-blông hông | No definition found
Space | phi hành gia Sally Ride | Former astronaut Sally Ride | air-man sự Phá vây cưỡi | No definition found
Plant | cây kiểng vườn Nhật | Bonsai Japanese garden | Japanese garden plant kiểng | No definition found
Farming | những nghề cá thủy sản | Aquaculture fisheries | seafood fisheries | No definition found
Laws | cư trú thường trực | permanent resident | populate permanent | No definition found
Astrological | Thuật chiêm tinh phong thủy | feng shui astrology | Geomancy astrology | Geomancy
Conclusion and Next Steps
• Contributions
– Recognizes and translates important phrases
– Translates persons, locations, and concepts
– Low implementation cost with reasonable performance
• Future work
– Experiment with a larger set of test data
– Integrate with Vietnamese-English CLIR work
– Automate the generation of extended words/phrases (VEX) and the derivation of their English translations (EEX)
– Experiment with a “Refine Result” concept for search engines