Question Analysis for a Community-Based
Vietnamese Question Answering System
Quan Hung Tran, Minh Le Nguyen, and Son Bao Pham

Quan Hung Tran · Son Bao Pham
Faculty of Information Technology, University of Engineering and Technology,
Vietnam National University, Hanoi
e-mail: {quanth_55,sonpb}@vnu.edu.vn

Minh Le Nguyen
School of Information Science, Japan Advanced Institute of Science and Technology

© Springer International Publishing Switzerland 2015
V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering,
Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_51

Abstract. This paper describes the approach for analyzing questions in our community-based Vietnamese question answering system (VnCQAs), in which we focus on two subtasks: question classification and keyword identification. The question classification employs machine learning approaches with a feature that represents a measure of similarity between two questions, while the keyword identification uses dependency-tree-based features. Experimental results are promising: the question classification obtains an accuracy of 95.7% and the keyword identification achieves an accuracy of 85.8%. Furthermore, these two subtasks improve the accuracy of finding similar questions in our VnCQAs system by 6.75 percentage points.

1 Introduction
Question answering systems usually have a module for analyzing questions in order to extract important information such as keywords, question types, or semantic constraints. In this research, we focus on two subtasks of question analysis: question classification and keyword identification. Identifying important words in a set of documents is an important task in information retrieval and question answering, with two main approaches: using corpus-based statistics for term weighting [7, 9] and employing supervised methods [3, 18, 10]. Question classification aims to
classify questions into several pre-defined classes in order to find suitable answers. Li et al. [8] proposed a two-layer taxonomy with 6 coarse classes and 50 fine-grained classes, while Bu et al. [4] introduced a six-type taxonomy. Furthermore, regarding the methods for classifying questions, some studies employed rule-based approaches [6] or machine learning algorithms [4, 1], while others combined rule-based and machine learning-based techniques [5].
Recently, several question analysis techniques have been examined for Vietnamese [11, 13, 12, 16]. However, these studies experimented on standard corpora, where spelling is generally correct. The question analysis for our VnCQAs system, on the other hand, has to deal with noisy data from community-based resources. In this paper, we propose dependency-tree-based features for finding keywords. We also introduce a new feature, called the "similarity feature", for classifying questions. To the best of our knowledge, this is the first time question analysis has been adapted to Vietnamese community data.
The rest of the paper is organized as follows: Section 2 briefly describes the architecture of our VnCQAs system. Sections 3 and 4 present the question classification and the keyword identification, respectively. Section 5 gives the experimental results, and Section 6 concludes the paper.

2 The VnCQAs System Architecture
The architecture of our question answering system [17] is shown in Figure 1. It includes three modules: Database Construction, Question Analysis, and Answer Selection. The database construction module builds the database of question-answer pairs, while the question analysis module extracts useful information such as keywords, question types, and synonyms. The answer selection module finds the questions in the database that are most similar to the input question, where each similar question corresponds to a candidate answer. The candidate answers are then processed to output the best answer.
In this paper, we focus on analyzing questions in the question analysis module with two main tasks: Question Classification and Keyword Identification. Furthermore, in this module we also use a dictionary of 6626 entities to find the synonyms in the question. Figure 2 shows an example of the question analysis module for the question: "Làm thế nào để tạo vùng nhớ ảo thay thế RAM" (How to create virtual memory to replace RAM).

3 Question Classification
3.1 Question Types
We classify questions into three types: Fact, Explanation and Solution according to
the main purpose of the questioner.



Fig. 1 The system architecture

Fig. 2 An example for the question analysis module

• Fact: The question asks only about objects and resources; the expected answer concerns general facts and/or attributes. E.g., for the question "Tấm dán màn hình từ tính là gì?" (What is a magnetic screen sticker?), the object is "Tấm dán màn hình từ tính" (the magnetic screen sticker), and the expected attribute in the answer is its definition.

• Explanation: The question requires an explanation or opinion, e.g., "Vì sao điện thoại của mình hay bị mất sóng?" (Why does my phone frequently lose signal?).



• Solution: The question asks for the solution to a problem, e.g., "Chỉ cho em cách vào facebook trên Iphone?" (How to access Facebook on an iPhone?).

3.2 Methodology
In the VnCQAs system, we use support vector machines (SVMs) to learn the classification (as shown in Figure 3) with a set of features: unigrams and bigrams. The unigram and bigram features are common in natural language processing tasks; the value of each unigram/bigram feature is a boolean indicating whether that unigram/bigram occurs in the question or not.
Furthermore, another feature we use for training the SVM model is the similarity feature, whose value represents a measure of similarity between two questions and is estimated using the phrasal overlap [2, 15]:

$$\mathrm{overlap}_{\mathrm{phrase}}(s_1, s_2) = \sum_{i=1}^{n} m_i \, i^2$$

where $n$ is the length of the longest shared phrase and $m_i$ is the number of $i$-word phrases appearing in both sentences. The similarity is then

$$\mathrm{sim}_{\mathrm{overlap,phrase}}(s_1, s_2) = \tanh\!\left(\frac{\mathrm{overlap}_{\mathrm{phrase}}(s_1, s_2)}{|s_1| + |s_2|}\right)$$

where $|s_1|$ and $|s_2|$ are the lengths (in words) of the two sentences.
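To make these features concrete, the following is a minimal sketch of how the boolean unigram/bigram features and the phrasal overlap similarity could be computed; it assumes whitespace-tokenized questions and fixed n-gram vocabularies, and is an illustration rather than the system's actual implementation.

```python
import math
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-word phrases in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def boolean_ngram_features(tokens, vocab_unigrams, vocab_bigrams):
    """Boolean features: 1 if the vocabulary unigram/bigram occurs in the question."""
    unis, bis = ngrams(tokens, 1), ngrams(tokens, 2)
    return [int(u in unis) for u in vocab_unigrams] + [int(b in bis) for b in vocab_bigrams]

def overlap_phrase(s1: List[str], s2: List[str]) -> float:
    """Every i-word phrase shared by the two sentences contributes i^2 to the score."""
    return sum(len(ngrams(s1, i) & ngrams(s2, i)) * i ** 2
               for i in range(1, min(len(s1), len(s2)) + 1))

def sim_overlap_phrase(s1: List[str], s2: List[str]) -> float:
    """Normalized phrasal overlap similarity, squashed with tanh."""
    return math.tanh(overlap_phrase(s1, s2) / (len(s1) + len(s2)))

# Two questions sharing the phrase "xóa tin nhắn trên" (delete messages on).
q1 = "cách xóa tin nhắn trên iphone".split()
q2 = "xóa tin nhắn trên điện thoại".split()
print(sim_overlap_phrase(q1, q2))   # a value close to 1 for near-duplicate questions
```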

Fig. 3 The question classification

4 Keyword Identification

This section describes keyword identification using a machine learning technique with dependency-tree-based features.



4.1 Keyword Definition
We define keywords as follows:
• The most informative words in the question: the set of keywords carries most of the information in the question (e.g., topics, main objects, and actions).
• Words that can be used to distinguish different questions: two questions that have the same set of keywords are likely to be similar.
E.g., the question "Hỏi cách xóa tin nhắn trên iphone?" (How to delete messages on iPhone?) has the keywords "cách" (how), "xóa" (delete), "tin nhắn" (messages), and "iphone". The main verb "hỏi" (ask) is not considered a keyword because it does not carry any important information and cannot be used to distinguish between questions.

4.2 Methodology
We use the dependency tree to find keywords on the premise that, to decide whether a word is informative or not, we should take into account the relationships between that word and the other words in the question; to this end, we use the Vietnamese dependency parser [14]. For each question, the dependency parser creates a tree that contains the tree structure, the relations between words, and other information such as part-of-speech tags (as shown in Figure 4 for the question "How do I delete messages on iPhone?").

Fig. 4 An example of the dependency tree

The features are then extracted from the tree and used to train the SVM model that classifies each word as a keyword or not (as presented in Figure 5). These features are grouped as follows:
• The part of speech (POS) tag of each word: POSW
• The part of speech (POS) tag of the parent word of each word in the dependency
tree: POSP
• Unigrams
• The common dependent words: CDW
The POSW feature is used because words with certain POS tags (e.g., noun, verb, and adjective) are more likely to be keywords.



Fig. 5 Dependency tree method work flow

The POSP feature helps to capture the fact that words that are children of verbs are more likely to be the object of the question. The CDW feature is employed because the children of certain words (e.g., "cách" (solution)) have a high chance of being keywords. In the implementation of the CDW feature, we use a map that stores each word together with an integer counting the number of times that word is the parent of a keyword; a manually chosen threshold is then used to identify which words have a high frequency of being the parent of keywords. A sketch of this feature extraction is given below.
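As a concrete illustration of these feature groups, the sketch below builds the POSW, POSP, unigram, and CDW features for each word of a parsed question; the Token representation, the parent-of-keyword count map, and the threshold value are assumptions made for this example, not the actual output format of the parser in [14] or the system's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Token:
    form: str              # the word itself
    pos: str               # part-of-speech tag from the dependency parser
    head: Optional[int]    # index of the parent word in the tree, None for the root

def keyword_features(sentence: List[Token],
                     vocab: List[str],
                     parent_keyword_counts: Dict[str, int],
                     cdw_threshold: int = 5) -> List[Dict[str, object]]:
    """One feature dictionary per word: POSW, POSP, unigram identity, and CDW
    (1 when the parent word is frequently seen as the parent of a keyword)."""
    feats = []
    for tok in sentence:
        parent = sentence[tok.head] if tok.head is not None else None
        feats.append({
            "POSW": tok.pos,                                         # POS of the word
            "POSP": parent.pos if parent else "ROOT",                # POS of its parent
            "unigram": tok.form if tok.form in vocab else "<UNK>",   # word identity
            "CDW": int(parent is not None and
                       parent_keyword_counts.get(parent.form, 0) >= cdw_threshold),
        })
    return feats

# "Hỏi cách xóa tin_nhắn trên iphone" with an assumed dependency parse.
question = [
    Token("Hỏi", "V", None),     # "ask" - root verb, not a keyword
    Token("cách", "N", 0),       # "how/way"
    Token("xóa", "V", 1),        # "delete"
    Token("tin_nhắn", "N", 2),   # "messages"
    Token("trên", "E", 2),       # "on"
    Token("iphone", "N", 4),
]
counts = {"cách": 12, "xóa": 9}  # hypothetical parent-of-keyword frequencies
print(keyword_features(question, ["cách", "xóa", "tin_nhắn", "iphone"], counts))
```

Each per-word feature dictionary can then be one-hot encoded and passed to the SVM that decides whether the word is a keyword.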

5 Experimental Results
5.1 Question Classification Evaluation
We use a set of 1013 manually tagged questions focusing on the technology domain with the three types mentioned above: Fact, Explanation, and Solution. The tagged questions come from the database construction module of our VnCQAs system. These questions are kept in their original form; no modifications are made. However, some of the questions in online forums are not understandable because they lack the information or context needed to interpret them; these questions are removed from the data set. The question distribution is shown in Figure 6.
To learn the SVM model, we use the LIBLINEAR and SVM-SMO implementations with a 10-fold cross-validation scheme. The experiments were conducted on a Windows PC with a Core i7 CPU and 8 GB of RAM. The highest accuracy (95.7%) is achieved with the combination of the phrasal overlap, unigram, and bigram features (as shown in Table 1).
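For reference, the sketch below reproduces the evaluation protocol with a linear SVM and 10-fold cross-validation using scikit-learn; the original experiments used the LIBLINEAR and SVM-SMO implementations, and the feature files named here are hypothetical, so this is an approximation of the setup rather than the exact one.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical files: one row per question with the concatenated boolean
# unigram/bigram features and the phrasal overlap similarity feature,
# plus the Fact / Explanation / Solution labels.
X = np.load("question_features.npy")
y = np.load("question_labels.npy")

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring="accuracy")
print("10-fold cross-validation accuracy: %.1f%%" % (100 * scores.mean()))
```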
Although the accuracy of the question classification is around 95.7%, our corpus differs from the corpora used in other published works, so it is hard to directly compare our method with other available methods on the question classification task. To make a meaningful comparison with other methods, we also investigate the TREC corpus in Vietnamese [16].


Fig. 6 The distribution of questions in the tagged data set
Table 1 The question classification accuracy on the community data

Features                             | Accuracy (SVM-LIBLINEAR) | Accuracy (SVM-SMO)
Bigram                               | 89.8%                    | 89.5%
Unigram                              | 94.8%                    | 95.2%
Phrasal overlap                      | 93.4%                    | 92.9%
Phrasal overlap + Unigram + Bigram   | 95.6%                    | 95.7%

Our obtained accuracy is comparable to that of Tran et al.'s approach [16] under the same 10-fold cross-validation scheme (as shown in Table 2).
Table 2 The question classification accuracy on the TREC data

Classes                               | Our method | Tran et al.'s method
Coarse classes classification         | 85.0%      | 86.0%
Fine-grained classes classification   | 84.9%      | 84.7%

5.2 Keyword Identification Evaluation
To test the performance of the keyword identification, we use a set of 753 words tagged in the sentences from our database (as shown in Figure 7 for the question "How do I delete messages on iPhone?").
To make a comparison, we implement a baseline that identifies keywords using the term frequency-inverse document frequency (TF-IDF) method.



Fig. 7 An example of keywords

Fig. 8 The TF-IDF method’s accuracy

The TF-IDF score of each word in a sentence is calculated, and a threshold is chosen to decide whether a word is a keyword or not. The accuracy results of the TF-IDF method are presented in Figure 8.
Our method outperforms the TF-IDF method, as can be seen from the accuracies in Table 3.
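A minimal sketch of this TF-IDF baseline is shown below, assuming whitespace-tokenized questions from the database; the threshold value is illustrative and would be tuned on the tagged data.

```python
import math
from collections import Counter
from typing import List

def tfidf_keywords(question: str, corpus: List[str], threshold: float = 0.05) -> List[str]:
    """Baseline: a word is a keyword when its TF-IDF score inside the question
    exceeds a manually chosen threshold."""
    docs = [doc.lower().split() for doc in corpus]
    df = Counter()
    for doc in docs:
        df.update(set(doc))             # document frequency of each word

    tokens = question.lower().split()
    tf = Counter(tokens)
    keywords = []
    for word in tf:
        idf = math.log(len(docs) / (1 + df.get(word, 0)))
        score = (tf[word] / len(tokens)) * idf
        if score >= threshold:
            keywords.append(word)
    return keywords

corpus = [
    "hỏi cách xóa tin nhắn trên iphone",
    "hỏi làm thế nào để tạo vùng nhớ ảo thay thế ram",
    "vì sao điện thoại của mình hay bị mất sóng",
]
# Prints the content words; the frequent verb "hỏi" (ask) is filtered out.
print(tfidf_keywords("hỏi cách xóa tin nhắn trên iphone", corpus))
```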

5.3 Question Analysis Evaluation
In this section, we evaluate the contribution of the question analysis to our VnCQAs system by measuring the improvement in finding similar questions. To evaluate the ability of the VnCQAs system to find similar questions, we use a set of 1704 questions that are checked by hand to ensure that no question is similar to any other.



Table 3 The accuracy of the keyword identification

Features                  | Accuracy (SVM-LIBLINEAR) | Accuracy (SVM-SMO)
POSW                      | 79.1%                    | 79.1%
POSW + POSP               | 81.4%                    | 81.1%
POSW + POSP + BOW         | 85.4%                    | 85.4%
POSW + POSP + BOW + CDW   | 85.8%                    | 85.4%

Each question is then paraphrased into 3 different versions; the paraphrased questions must be close in meaning to the original question. We use the 1704 original questions as input questions for testing our system: if the returned question is one of the 3 paraphrased versions, we count it as a correct result; otherwise we count it as an incorrect result. Table 4 shows the accuracy improvement in finding similar questions.

Table 4 The question analysis evaluation

Method                                  | Accuracy
Cosine similarity                       | 80.86%
Cosine similarity + Question Analysis   | 87.61%
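The sketch below illustrates this evaluation protocol for the cosine-similarity baseline; the TF-IDF vectorization and the data layout are assumptions made for the example, not the system's actual similarity computation, which additionally exploits the keywords, question types, and synonyms extracted by the question analysis module.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieval_accuracy(originals, paraphrase_groups):
    """originals[i] is an input question and paraphrase_groups[i] its three
    paraphrased versions stored in the database. A query counts as correct
    when the most similar database question is one of its own paraphrases."""
    database = [p for group in paraphrase_groups for p in group]
    owner = [i for i, group in enumerate(paraphrase_groups) for _ in group]

    vectorizer = TfidfVectorizer()
    db_matrix = vectorizer.fit_transform(database)

    correct = 0
    for i, question in enumerate(originals):
        sims = cosine_similarity(vectorizer.transform([question]), db_matrix)[0]
        if owner[int(sims.argmax())] == i:
            correct += 1
    return correct / len(originals)
```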

6 Conclusion
In this paper, we described the question analysis module of our VnCQAs system with its two subtasks: question classification and keyword identification. We classify questions into three types: Fact, Explanation, and Solution, using support vector machines (SVMs) with a set of features: unigrams, bigrams, and the similarity feature. Our classification accuracy is high even though we have to deal with noisy community data. Furthermore, on the Vietnamese TREC corpus, we obtain competitive accuracy results. For the keyword identification subtask, we used a machine learning method with dependency-tree-based features and achieved an accuracy of 85.8%, which outperforms the TF-IDF method.
In the future, we will improve the size and quality of the data sets used in both subtasks. We will also examine other methods to further improve the accuracy of question analysis.



Acknowledgment. This work is partially supported by the Research Grant from Vietnam
National University, Hanoi No. QG.14.04.

References
[1] Paliwal, M., Kumar, U.A.: Neural networks and statistical techniques: A review of applications. Expert Systems with Applications 36(1), 2–17 (2009)
[2] Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI 2003, pp. 805–810 (2003)
[3] Bendersky, M., Croft, W.B.: Discovering key concepts in verbose queries. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 491–498 (2008)
[4] Bu, F., Zhu, X., Hao, Y., Zhu, X.: Function-based question classification for general QA. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, pp. 1119–1128 (2010)
[5] Huang, Z., Thint, M., Qin, Z.: Question classification using head words and their hypernyms. In: Proceedings of the Conference on Empirical Methods in Natural Language
Processing, EMNLP 2008, pp. 927–936 (2008)
[6] Hui, Z., Liu, J., Ouyang, L.: Question classification based on an extended class sequential rule model. In: Proceedings of 5th International Joint Conference on Natural
Language Processing, Chiang Mai, Thailand, pp. 938–946. Asian Federation of Natural
Language Processing (November 2011)
[7] Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods
for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine
Intelligence 31(4), 721–735 (2009)
[8] Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International
Conference on Computational Linguistics, COLING 2002, vol. 1, pp. 1–7 (2002)
[9] Luhn, H.P.: A business intelligence system. IBM J. Res. Dev. 2(4), 314–319 (1958)
[10] Luo, X., Raghavan, H., Castelli, V., Maskey, S., Florian, R.: Finding what matters in
questions. In: Proceedings of the 2013 Conference of the North American Chapter of the

Association for Computational Linguistics: Human Language Technologies, Atlanta,
Georgia, pp. 878–887. Association for Computational Linguistics (June 2013)
[11] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Vietnamese Question Answering System.
In: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, KSE 2009, pp. 26–32 (2009)
[12] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Semantic Approach for Question Analysis.
In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS, vol. 7345, pp.
156–165. Springer, Heidelberg (2012)
[13] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: Systematic Knowledge Acquisition for Question Analysis. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 406–412 (2011)
[14] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Le Nguyen, M.: From treebank conversion to automatic dependency parsing for Vietnamese. In: Métais, E., Roche, M., Teisseire, M. (eds.) NLDB 2014. LNCS, vol. 8455, pp. 196–207. Springer, Heidelberg (2014)
[15] Ponzetto, S.P., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research 30(1), 181–212 (2007)



[16] Tran, D., Chu, C., Pham, S., Nguyen, M.: Learning based approaches for Vietnamese question classification using keywords extraction from the web. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 740–746. Asian Federation of Natural Language Processing (October 2013)
[17] Tran, Q.H., Nguyen, N.D., Do, K.D., Nguyen, T.K., Tran, D.H., Le Nguyen, M., Pham,
S.B.: A Community-based Vietnamese Question Answering System. In: Proceedings of
the 2014 International Conference on Knowledge and Systems Engineering, KSE 2014
(2014)
[18] Zhao, L., Callan, J.: Term necessity prediction. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM 2010, pp.
259–268 (2010)



