A Community Based Vietnamese Question Answering System


A Community-Based
Vietnamese Question Answering System
Quan Hung Tran, Nien Dinh Nguyen, Kien Duc Do, Thinh Khanh Nguyen,
Dang Hai Tran, Minh Le Nguyen, and Son Bao Pham

Abstract. Most recent Vietnamese QA systems have not yet exploited data crawled from community web services as a resource. In this paper, we use such community-based resources to build a Vietnamese question answering system named VnCQAs. Our system comprises three modules, which build the database of question-answer pairs, analyze questions, and choose the best answer, respectively. Experimental results show that our system achieves promising performance.

1 Introduction
Community web services now play an important role in helping users find the responses they need, especially in the technology domain. Users often post their queries on Yahoo! Answers, technology web forums or Facebook to seek help and experience-based advice from others. However, queries are often complex and contain multiple sub-questions, while others' replies and comments may miss valuable information or address only part of a query. For example, the question “Có nên mua Samsung Galaxy S4 không?” (Should I buy a Samsung Galaxy S4?) expects individual opinions rather than the specifications of the phone itself.
Quan Hung Tran · Nien Dinh Nguyen · Kien Duc Do · Thinh Khanh Nguyen ·
Dang Hai Tran · Son Bao Pham
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
e-mail: {quanth_55,niennd_55,kiendd_55,thinhnk_55,
dangth,sonpb}@vnu.edu.vn
Minh Le Nguyen
School of Information Science
Japan Advanced Institute of Science and Technology
e-mail:
© Springer International Publishing Switzerland 2015

V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering,
Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_10


Assuming that we have a collection of users' queries from community web services and a corresponding collection of replies and comments, building a community-based question answering (cQA) system that returns the best answer for each user's whole query is challenging. The task involves three key research problems: how to construct the database of question-answer pairs, how to analyze the questions in users' queries, and how to produce the best answer. Related research addresses question identification [5, 16, 6, 15], question similarity [2], question generation [17], question analysis [10, 9], answer summarization [3] and answer re-ranking [13].
To date, most Vietnamese QA systems have not used community web services as a resource for such research. Existing Vietnamese QA systems [8, 14, 7, 12] are usually rule- or grammar-based and rely on structured databases or crawled web pages. One exception is the system of Dang et al. [1], which uses community data: it responds to a new question by finding similar questions on Yahoo! Answers. However, that system does not return the answers to those similar questions, and its reported accuracy was not high.
In this paper, we present a community-based Vietnamese question answering system named VnCQAs. Our system addresses the issues of domain adaptability and of answering complex questions. Furthermore, VnCQAs uses machine learning techniques to obtain high accuracy. The system contains three main modules: Database Construction, Question Analysis and Answer Selection, which are responsible for building the database, analyzing questions and choosing the best answer, respectively. Figure 1 illustrates an example¹ with the input question “nên mua Ipad hay laptop” (Should I buy an iPad or a laptop?). The output includes the best available answer and related questions. Users can find the answers to the related questions by clicking the corresponding links.
The rest of the paper is organized as follows. Section 2 gives an overview of the whole system and describes its modules. Section 3 reports the experimental results. Section 4 presents the conclusion and future work.

2 System Architecture
In this section, we introduce the VnCQAs system architecture (shown in Figure 2) and briefly describe its modules. When a new question is presented to the system, the system finds the most similar questions in the database. The answers to these similar questions are called candidate answers. The candidate answers are then processed to produce the final answer.
The system has three modules: Database Construction, Question Analysis and Answer Selection.
The Database Construction module is responsible for building the database of question-answer pairs.
¹ The online demonstration is available at: http://150.65.242.39:8080/VNQA/


Fig. 1 The system user interface

Fig. 2 The system architecture

The Question Analysis module analyzes a question and extracts useful information from it, such as keywords, the question type and synonyms of its words.

The Answer Selection module processes the candidate answers and returns the final answer for the given question.


2.1 Database Construction Module
This module extracts question-answer pairs from the community data to construct the database. It works in two steps: question detection and answer detection (as shown in Figure 3).

Fig. 3 The database construction module

Our main sources of question-answer pairs are threads collected from well-known Vietnamese technology forums such as Vatgia, VnZoom and Tinhte. These sites are organized into threads, where each thread has a specific topic and is divided into posts. Typically, one of the posts presents the question and some of the other posts contain the answer. Because the data crawled from these sites have different layouts, we standardize them by parsing them into our predefined XML format for later processing.
Figure 4 shows a thread with the question: “Mấy anh cho em hỏi tại sao khi em vào My computer hay bị treo vài giây” (Why does the computer stop responding for a few seconds when I open the My Computer folder?). The only suitable answer is the last post: “cũng có thể do phần mềm AV và cũng có thể là do 1 phần mềm nào trong máy nên bạn kiểm tra lại máy xem nhé.” (Maybe it is because of the AV software, or maybe because of some other software, so you should check your computer again).
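As an illustration of the standardization step, the sketch below reads a thread from a small XML document into a list of posts. The paper's actual XML format (Figure 4) is not reproduced in this text, so the element and attribute names used here (`thread`, `title`, `post`, `author`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: the tag and attribute names are assumptions made
# for illustration, not the format actually used by VnCQAs.
SAMPLE = """<thread id="t1">
  <title>My Computer bị treo</title>
  <post author="userA">Mấy anh cho em hỏi tại sao khi em vào My computer hay bị treo vài giây</post>
  <post author="userB">cũng có thể do phần mềm AV</post>
</thread>"""


def parse_thread(xml_text):
    """Parse one standardized thread into a dict with an ordered post list."""
    root = ET.fromstring(xml_text)
    posts = []
    for position, node in enumerate(root.findall("post")):
        posts.append({
            "author": node.get("author"),
            "position": position,              # order of the post in the thread
            "text": (node.text or "").strip(),
        })
    return {
        "id": root.get("id"),
        "title": root.findtext("title", default=""),
        "posts": posts,
    }


thread = parse_thread(SAMPLE)
```

Keeping the post order and the author of each post matters because both are used later as features for answer detection.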
2.1.1 Question Detection

For question detection, we use a machine learning model to classify whether a post is a question post. The features are sequence patterns based on a generalized form of the text. For example, the sentence “Subnet Mask dùng để làm gì?” (What is Subnet Mask used for?) can be represented as the sequence “Np V E làm gì”, in which Np, V and E are the part-of-speech (POS) tags of the respective words.
We define question words as words that appear frequently in questions, e.g., giúp (help), phân biệt (distinguish), đánh giá (evaluate), làm sao (how to do), tại sao (why). These question words are kept in their original form and arranged into 18 groups:


Fig. 4 Thread XML example

• q0000: gì (what)
• q0001: nào, nào là (which)
• q0002: ai (who)
• q0003: đâu (where)
• q0004: hay (or)
• q0005: sao, vì sao, tại sao (why)
• q0006: làm sao (how to do)
• q0007: làm gì (what to do)
• q0008: ra sao, thế nào, như thế nào (how)
• q0009: bao nhiêu (how many), bao lâu (how long), bao xa (how far)
• q0010: không (not), chưa (not yet)

• q0100: giúp (help), tư vấn (advise), dạy (teach), hướng dẫn (instruct), chỉ dẫn (instruct)
• q0101: hỏi (ask), thắc mắc (worry)
• q0102: khắc phục (overcome)
• q0103: vấn đề (problem)
• q0104: cách (solution)
• q0105: so sánh (compare), đánh giá (evaluate)
• q0106: phân biệt (distinguish)
Other words in the question are replaced by their POS tags to make the sequence more general. The sequence patterns are extracted using the PrefixSpan algorithm [15]. We then select the patterns that contain question words and apply the “multiple minimum supports” method [4] to ensure the quality of the patterns.
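The generalization step can be sketched as follows. This is a minimal illustration that assumes the sentence has already been word-segmented and POS-tagged by an external tool. The function name `generalize` and the small question-word list are hypothetical, and pattern mining with PrefixSpan is not shown.

```python
def generalize(tagged_tokens, question_phrases):
    """Keep question words verbatim and replace every other word by its POS tag.

    tagged_tokens: list of (word, pos_tag) pairs from a tagger (assumed input).
    question_phrases: question words or phrases, possibly spanning several tokens.
    """
    # Try longer phrases first so that "làm gì" wins over "gì".
    phrases = sorted((p.split() for p in question_phrases), key=len, reverse=True)
    words = [word.lower() for word, _ in tagged_tokens]
    sequence, i = [], 0
    while i < len(tagged_tokens):
        for phrase in phrases:
            if words[i:i + len(phrase)] == phrase:
                sequence.append(" ".join(phrase))
                i += len(phrase)
                break
        else:
            # No question phrase starts here: emit the POS tag instead.
            sequence.append(tagged_tokens[i][1])
            i += 1
    return sequence


# The example sentence from the text; the tags of "làm" and "gì" are assumed.
tagged = [("Subnet Mask", "Np"), ("dùng", "V"), ("để", "E"), ("làm", "V"), ("gì", "P")]
pattern = generalize(tagged, ["gì", "làm gì", "tại sao", "sao"])
```

Matching the longest phrase first keeps multi-word question words such as “làm gì” from being split into a POS tag followed by “gì”.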
2.1.2 Answer Detection

After finding the questions in the previous step, we detect the corresponding answers for each question by classifying the remaining posts with an SVM model that uses the following features:
• Does the post belong to the author of the thread?
• Does the post contain a quote from the questioner?
• Does the post contain a quote from other users?
• The relative position of the post in the thread.
• The relative length of the post compared to others in the thread.
• The similarity between the post and the detected question.
• The proportion of nouns, verbs and pronouns in the post.
If a post is from the question's owner, it is unlikely to be the answer. Conversely, a post by another user that quotes the questioner is likely to be the answer.
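The features above can be computed per post roughly as in the sketch below. It rests on several assumptions: posts are plain dictionaries with hypothetical keys `author`, `text` and `quotes`, and Jaccard word overlap stands in for the similarity measure, which the paper does not specify. The POS-proportion feature is left out because it needs a tagger, and the SVM itself is not shown.

```python
def tokens(text):
    return text.lower().split()


def jaccard(a, b):
    """Word-set overlap, used here only as a stand-in similarity measure."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def post_features(thread, index, question_index=0):
    """Feature dict for thread[index], relative to the detected question post."""
    question = thread[question_index]
    post = thread[index]
    quoted = set(post.get("quotes", []))      # authors quoted in this post
    longest = max(len(tokens(p["text"])) for p in thread) or 1
    return {
        "same_author": float(post["author"] == question["author"]),
        "quotes_questioner": float(question["author"] in quoted),
        "quotes_other": float(bool(quoted - {question["author"]})),
        "relative_position": index / max(len(thread) - 1, 1),
        "relative_length": len(tokens(post["text"])) / longest,
        "question_similarity": jaccard(tokens(post["text"]), tokens(question["text"])),
    }


thread = [
    {"author": "a", "text": "tại sao máy bị treo", "quotes": []},
    {"author": "b", "text": "máy bị treo do phần mềm", "quotes": ["a"]},
    {"author": "a", "text": "cảm ơn", "quotes": ["b"]},
]
reply = post_features(thread, 1)
thanks = post_features(thread, 2)
```

In this toy thread the second post quotes the questioner and is written by another user, while the third post comes from the questioner, which is the contrast the two heuristics in the text describe.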

2.2 Question Analysis Module
The question analysis module extracts important information from a question, which is later used to find similar questions. The module performs three steps: question classification, keyword identification, and similar word identification, as presented in Figure 5. Figure 6 shows an example of the data extracted by the question analysis module from the question “Làm thế nào để tạo vùng nhớ ảo thay thế RAM” (How to create virtual memory to replace RAM).
2.2.1 Question Classification

We classify questions because a stored question whose type differs from that of the input question is unlikely to be similar to it. Moreover, the question type provides constraints for verifying the answers.
We categorize questions into three classes: Fact, Solution and Explanation, using a machine learning method with three kinds of features: unigrams, bigrams, and similarity. The unigram and bigram features are boolean values, while the similarity feature, which measures the similarity between two questions, is estimated using phrasal overlap.
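The two kinds of features can be sketched as follows. The paper does not give its exact phrasal overlap formula, so the version here is an assumption: each shared n-gram contributes n², and the total is normalized by the combined question length and squashed with tanh, which is one common formulation. The function names and the `max_n` parameter are hypothetical.

```python
import math


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def boolean_ngram_features(tokens, vocabulary):
    """1 if the unigram or bigram occurs in the question, else 0."""
    present = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    return [1 if gram in present else 0 for gram in vocabulary]


def phrasal_overlap(a, b, max_n=3):
    """Assumed formulation: longer shared phrases count quadratically more."""
    raw = 0
    for n in range(1, max_n + 1):
        shared = set(ngrams(a, n)) & set(ngrams(b, n))
        raw += len(shared) * n * n
    total = len(a) + len(b)
    return math.tanh(raw / total) if total else 0.0


q1 = "làm thế nào để tạo".split()
q2 = "làm thế nào để xóa".split()
features = boolean_ngram_features(q1, [("làm",), ("làm", "thế"), ("xóa",)])
similarity = phrasal_overlap(q1, q2)
```

The quadratic weight rewards questions that share whole phrases such as “làm thế nào” over questions that merely share scattered words.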
2.2.2 Keyword Identification

Questions that are similar usually share the same set of keywords. In addition, many questions in online forums and QA sites contain unnecessary words and phrases; removing these makes it easier to find similar questions.


Fig. 5 The question analysis module

Fig. 6 An example of analyzing question

Keyword identification aims to find the most important words in a question. We compute a score for each word appearing in the question corpus using the term frequency-inverse document frequency (tf*idf) weighting scheme, and then apply a threshold to decide whether the word is a keyword.
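A minimal sketch of tf*idf keyword selection is given below. The exact tf and idf variants and the threshold value used in VnCQAs are not stated, so the choices here (length-normalized tf, idf = log(N/df), threshold 0.2) are assumptions, and the toy corpus is invented for illustration.

```python
import math
from collections import Counter


def idf_table(corpus):
    """idf(w) = log(N / df(w)) over a corpus of tokenized questions."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))        # count each word once per question
    return {word: math.log(n_docs / count) for word, count in df.items()}


def keywords(question, idf, threshold):
    """Return the words of a question whose tf*idf score reaches the threshold."""
    tf = Counter(question)
    selected = []
    for word, count in tf.items():
        score = (count / len(question)) * idf.get(word, 0.0)
        if score >= threshold:
            selected.append(word)
    return selected


corpus = [
    ["cách", "cài", "windows"],
    ["cách", "sửa", "ram"],
    ["cách", "tạo", "ram", "ảo"],
]
idf = idf_table(corpus)
selected = keywords(["cách", "tạo", "ram", "ảo"], idf, threshold=0.2)
```

A word such as “cách” that occurs in every question gets an idf of zero and is therefore never selected, whatever the threshold.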
2.2.3 Similar Word Identification

To improve the system's ability to find similar questions, we also use a synonym dictionary to retrieve words that have the same meaning as the words in the input question.
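Synonym lookup of this kind can be sketched with a plain dictionary. The entries below are invented for illustration and are not taken from the dictionary used in VnCQAs, and `expand_with_synonyms` is a hypothetical helper name.

```python
def expand_with_synonyms(words, synonyms):
    """Return the input words followed by their synonyms, without duplicates."""
    expanded = []
    for word in words:
        for candidate in [word] + synonyms.get(word, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical dictionary entries.
SYNONYMS = {"máy tính": ["pc", "computer"]}
query_terms = expand_with_synonyms(["máy tính", "ram"], SYNONYMS)
```

The expanded term list can then be passed to the retrieval step so that a stored question using a different word for the same concept can still match.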


Fig. 7 An example of giving the answer

2.3 Answer Selection

The answer selection module is responsible for finding similar questions and their corresponding candidate answers in the database, and finally for giving the best answer. Figure 7 shows an example of how an input question is processed in this module.
As shown in Figure 8, after the input question is analyzed by the question analysis module, the extracted information is used as input for finding similar questions with Lucene. For each candidate answer corresponding to a similar question, we then apply supervised learning to estimate a classification confidence score. These scores are used to re-rank the list of candidate answers, and the candidate answer with the highest score is selected as the final answer.
We consider triplets of the form (Qnew, Qpast, A), where Qnew is the input question, Qpast is a similar question and A is a candidate answer for Qpast. A triplet is classified as satisfied if the answer A can be used to respond to the question Qnew; otherwise, it is classified as unsatisfied. We employ supervised learning with the following features:
• Text length
• Number of question marks
• Number of stopwords
• IDF statistics
• Query clarity


Fig. 8 The answer selection module

• Cosine similarity
• Topic model
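The re-ranking step can be sketched as follows. The trained classifier is not available here, so the `score` argument is a placeholder for its confidence output. In the usage example, the cosine similarity between Qnew and Qpast (one of the listed features) stands in for that confidence, which is an assumption made only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words token lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def rerank(q_new, candidates, score):
    """Sort (Qpast, A) candidates by score(Qnew, Qpast, A), best first."""
    scored = [(score(q_new, q_past, answer), q_past, answer)
              for q_past, answer in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


q_new = ["nên", "mua", "ipad", "hay", "laptop"]
candidates = [
    (["cách", "cài", "windows"], "A1"),
    (["nên", "mua", "ipad", "không"], "A2"),
]
# Stand-in for the classifier confidence: similarity of Qnew and Qpast only.
ranked = rerank(q_new, candidates, lambda qn, qp, a: cosine_similarity(qn, qp))
best_answer = ranked[0][2]
```

Passing the scoring function as a parameter keeps the ranking logic independent of whichever supervised model produces the confidence.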

3 Evaluation
3.1 Experimental Result
We evaluate our system on two tasks: finding similar questions and giving the correct answer for each test question.
We collect the community data from three well-known Vietnamese technology forums: Vatgia, VnZoom and Tinhte. For each question, we assign a score from 0 to 4 to each candidate answer, where exact answers receive a score of 4 and irrelevant answers receive a score of 0.
The evaluation data for finding similar questions consists of 1704 questions obtained from the database construction module. These questions were checked by hand to ensure that no question is similar to any other. We then paraphrased each question into 3 different versions with the same meaning. The paraphrased questions and their corresponding candidate answers are indexed in Lucene, as mentioned in Section 2.3. We use the 1704 original questions as input to test the system: if the returned question is one of the 3 paraphrases of the input question, we count it as a good result; otherwise we count it as a bad result. The accuracy for finding similar questions is presented in Table 1.


Table 1 The accuracy results for finding the similar questions

Method                                   Accuracy (%)
Cosine similarity                        80.86
Cosine similarity + Question analysis    87.61
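The evaluation protocol for finding similar questions can be sketched as follows. The retrieval function here is a toy word-overlap matcher that stands in for the Lucene index, and the three test questions are invented, so the resulting number only illustrates the computation and is unrelated to the results in Table 1.

```python
def retrieval_accuracy(test_items, retrieve):
    """Percentage of inputs whose top returned question is one of their paraphrases.

    test_items: list of (original_question, set_of_paraphrases).
    retrieve: function mapping a question to the best-matching indexed question.
    """
    good = sum(1 for original, paraphrases in test_items
               if retrieve(original) in paraphrases)
    return 100.0 * good / len(test_items)


# Toy index of paraphrased questions (invented examples).
INDEX = [
    "mua ipad hay laptop tốt hơn",
    "làm sao tạo ram ảo",
    "mạng không dây chạy ì ạch",
]


def toy_retrieve(question):
    """Stand-in for the search engine: pick the entry sharing the most words."""
    words = set(question.split())
    return max(INDEX, key=lambda entry: len(words & set(entry.split())))


test_items = [
    ("nên mua ipad hay laptop", {"mua ipad hay laptop tốt hơn"}),
    ("cách tạo ram ảo", {"làm sao tạo ram ảo"}),
    ("tại sao wifi chậm", {"mạng không dây chạy ì ạch"}),
]
accuracy = retrieval_accuracy(test_items, toy_retrieve)
```

The third toy question shares no words with its paraphrase, so a purely lexical matcher misses it. This is the kind of case that keyword and synonym analysis is intended to help with.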

From the 1704 questions, we choose 605 questions that each have an exact answer as input questions to test the performance of giving the correct answer. We consider a returned answer satisfactory if it matches the exact answer that we assigned. The accuracy for giving the correct answer is shown in Table 2.

Table 2 The accuracy result for finding the answer

Method                                   Accuracy (%)
Baseline using the default of Lucene     59.66
Our approach                             71.19

4 Conclusion
In this paper, we proposed a community-based question answering system for Vietnamese. Our system consists of three modules: database construction, question analysis and answer selection. The database construction module creates the database of question-answer pairs, in which each question is associated with its candidate answers. The question analysis module extracts useful information such as keywords, question types and synonyms. The answer selection module uses the information extracted from the input question to find similar questions in the database, and then re-ranks the corresponding candidate answers to give the best answer. Experimental results are promising: the question analysis module improves the accuracy of finding similar questions from 80.86% to 87.61%, and the answer selection module achieves an accuracy of 71.19%, which is 11.53 percentage points higher than the baseline using the default of Lucene.
In the future, we will extend the question analysis module with additional features based on the dependency tree [11]. We will also expand the database so that the system can handle a wider range of questions, and improve the answer selection module.
Acknowledgment. This work is partially supported by the Research Grant from Vietnam
National University, Hanoi No. QG.14.04.


References
[1] Son, D.T., Dung, D.T.: Apply a mapping question approach in building the question
answering system for vietnamese language. In: Proceedings of the Conference on Green
Technology and Sustainable Development (2012)
[2] Bernhard, D., Gurevych, I.: Answering learners’ questions by retrieving question paraphrases from social q&a sites. In: Proceedings of the Third Workshop on Innovative
Use of NLP for Building Educational Applications, pp. 44–52. Association for Computational Linguistics (June 2008)
[3] Chan, W., Zhou, X., Wang, W., Chua, T.-S.: Community answer summarization for
multi-sentence question with group l1 regularization. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Long Papers, vol. 1, pp.
582–591 (2012)
[4] Jindal, N., Liu, B.: Identifying comparative sentences in text documents. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 244–251 (2006)
[5] Li, B., Jin, T., Lyu, M.R., King, I., Mak, B.: Analyzing and predicting question quality
in community question answering services. In: Proceedings of the 21st International
Conference Companion on World Wide Web, WWW 2012 Companion, pp. 775–782
(2012)
[6] Li, B., Si, X., Lyu, M.R., King, I., Chang, E.Y.: Question identification on twitter. In:
Proceedings of the 20th ACM International Conference on Information and Knowledge
Management, CIKM 2011, pp. 2477–2480 (2011)
[7] Nguyen, A.K., Le, H.T.: Natural language interface construction using semantic grammars. In: Ho, T.-B., Zhou, Z.-H. (eds.) PRICAI 2008. LNCS (LNAI), vol. 5351, pp.
728–739. Springer, Heidelberg (2008)
[8] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Vietnamese Question Answering System.
In: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, KSE 2009, pp. 26–32 (2009)
[9] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Semantic Approach for Question Analysis.
In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS (LNAI), vol. 7345,
pp. 156–165. Springer, Heidelberg (2012)
[10] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: Systematic Knowledge Acquisition for Question Analysis. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 406–412 (2011)
[11] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Le Nguyen, M.: From Treebank Conversion to Automatic Dependency Parsing for Vietnamese. In: Métais, E.,
Roche, M., Teisseire, M. (eds.) NLDB 2014. LNCS, vol. 8455, pp. 196–207. Springer,
Heidelberg (2014)
[12] Nguyen, D.T., Hoang, T.D., Pham, S.B.: A vietnamese natural language interface to
database. In: Proc. of the 2012 IEEE Sixth International Conference on Semantic Computing, pp. 130–133 (2012)
[13] Surdeanu, M., Ciaramita, M., Zaragoza, H.: Learning to rank answers on large online
QA collections. In: Proceedings of ACL 2008. HLT (June 2008)
[14] Tran, V.M., Nguyen, V.D., Tran, O.T., Pham, U.T.T., Ha, T.Q.: An experimental study of
vietnamese question answering system. In: International Conference on Asian Language
Processing, IALP 2009, pp. 152–155 (December 2009)
[15] Wang, K., Chua, T.-S.: Exploiting salient patterns for question detection and question
retrieval in community-based question answering. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, pp. 1155–1163 (2010)


[16] Yang, L., Bao, S., Lin, Q., Wu, X., Han, D., Su, Z., Yu, Y.: Analyzing and predicting
not-answered questions in community-based question answering services. In: AAAI
(2011)
[17] Zhao, S., Wang, H., Li, C., Liu, T., Guan, Y.: Automatically generating questions from
queries for community-based question answering. In: Proceedings of 5th International
Joint Conference on Natural Language Processing, pp. 929–937 (2011)
