HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
──────── * ───────
UNDERGRADUATE DISSERTATION
MAJORED IN INFORMATION TECHNOLOGY
ASSESSMENT OF USER EXPERTISE IN
QUESTION-AND-ANSWER SYSTEMS:
DESIGN AND IMPLEMENTATION
Author: Phạm Tuấn Long
Software Engineering C51
SoICT, HUST
Mentors: Prof. Dr. Huỳnh Quyết Thắng
MSc. Lê Quốc
HANOI, 5-2011
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
──────── * ───────
UNDERGRADUATE GRADUATION THESIS
MAJOR: INFORMATION TECHNOLOGY
ASSESSMENT OF USER EXPERTISE
IN QUESTION-AND-ANSWER SYSTEMS:
DESIGN AND IMPLEMENTATION
Student: Phạm Tuấn Long
Class: Software Engineering - K51
Supervisors: Assoc. Prof. Dr. Huỳnh Quyết Thắng
MSc. Lê Quốc
HANOI, 5-2011
DISSERTATION TASK SHEET
1. Student information
Full-name: Phạm Tuấn Long


Phone number: (+84) 972-889-760    Email:
Class: Software Engineering - C51    Form of education: Regular
This dissertation is completed at: BkProfile group, based in Vietnam branch of Cazoodle Inc.
Execution time: From February 15th, 2011 to May 15th, 2011
2. Objectives
Design and implement an algorithm to assess expertise of users in question-and-answer
systems (Q&A systems), applied in BkProfile – a knowledge sharing community.
3. Tasks
− Research Q&A systems around the world and their methods for evaluating the expertise
of users and the quality of answers.
− Design a scalable assessment algorithm suited to the context of BkProfile.
− Implement the algorithm and test it in the real BkProfile system.
4. Author's commitment:
I, Pham Tuan Long, hereby commit that this dissertation is my own work, carried out under
the instruction of Prof. Dr. Huỳnh Quyết Thắng and MSc. Lê Quốc, and that it incorporates no
plagiarized passage. All results and findings of this dissertation are honest.
Hanoi, February 15th, 2011
Author
Phạm Tuấn Long

5. Confirmation by the main mentor of the completeness of the dissertation and
approval for the dissertation defence:
Hanoi, February 17th, 2011
Mentor
Prof. Dr. Huỳnh Quyết Thắng
GRADUATION THESIS TASK ASSIGNMENT FORM
1. Student information
Student's full name: Phạm Tuấn Long
Contact phone: 0972889760    Email:
Class: Software Engineering K51    Form of education: Regular
The graduation thesis was carried out at: BkProfile group, located at Cazoodle@Việt Nam
Duration of the thesis work: from 15/02/2011 to 25/05/2011
2. Purpose and content of the thesis
Design and implement an algorithm to evaluate the quality of answers and the expertise of users
in Q&A systems, applied specifically to the BkProfile knowledge-sharing community network.
3. Specific tasks of the thesis
− Study Q&A systems around the world and how they score user expertise and answer
quality.
− Design an evaluation algorithm that suits the conditions of the BkProfile system and can
scale as the system grows.
− Implement the algorithm and run trials at bkprofile.com
4. Student's declaration:
I, Phạm Tuấn Long, declare that this graduation thesis is my own research work, carried out
under the supervision of Assoc. Prof. Dr. Huỳnh Quyết Thắng and MSc. Lê Quốc.
The results presented in this thesis are honest and are not copied wholesale from any other
work.
Hanoi, 15 February 2011
Thesis author
Phạm Tuấn Long
5. Supervisor's confirmation of the thesis's level of completion and permission to defend:
Hanoi, 17 February 2011
Supervisor
Assoc. Prof. Dr. Huỳnh Quyết Thắng
PREFACE
To me, writing this dissertation is a valuable chance to review my academic
knowledge, present my biggest product at school, and test my ability to deal with
the real-life problems I will have to solve in the future.
This dissertation is an important module of BkProfile, a knowledge sharing web
application developed by the BkProfile team, a student team of the School of Information
and Communication Technology based in the Vietnam branch of Cazoodle Inc., under the
instruction of Prof. Dr. Thang Quyet Huynh and MSc. Quoc Le. The first version of the
module was integrated into the beta version of the web application. It connects with
other important modules of the system, namely the SOLR search engine and the
JOO-framework-based presentation layer.
The objectives of this dissertation are as follows:
• Research Q&A systems around the world and their methods for evaluating the expertise
of users.
• Design a scalable assessment algorithm suited to the context of BkProfile.
• Implement the algorithm and test it in the real BkProfile system.
Accordingly, the dissertation is organized as follows:
• Chapter 1 describes my motivation for the research, the statement of the
problem and my approach to tackling the problem.
• Chapter 2 includes a quick review of Google AardVark and Google Confucius,
as the most closely related research; Markov chains & PageRank as the foundation of my
research; and Hadoop MapReduce as the platform I use to implement the
algorithm.
• Chapter 3, the most important chapter of this dissertation, elaborates my method
for ranking users in a Q&A system.
• Chapter 4 discusses the details of my implementation of the algorithm in
Hadoop MapReduce. It includes a technique to automate MapReduce processes.
• Chapter 5 summarizes the experimental results of testing ExpertRank in a real
system named BkProfile. I also describe the population & sampling I use to
test the algorithm.
• Chapter 6 discusses the current applications and current problems of ExpertRank,
and recommends a few ways to make it better.
• Chapter 7 is the conclusion.
PREFACE
For me, writing this graduation thesis is a precious opportunity to consolidate what
I have learned, to present the largest product I have built at university, and to test
my ability to face the real-life challenges that I will have to confront in the
future.
The result of this thesis is an important component of BkProfile, a knowledge-sharing
web application developed by the BkProfile group, a group of students of the School of
Information and Communication Technology working at the Vietnam branch of Cazoodle,
under the supervision of Assoc. Prof. Dr. Huỳnh Quyết Thắng and MSc. Lê Quốc.
The first version of this module has been integrated into the beta version of BkProfile
at http://www.bkprofile.com. It is connected to other important modules of the system,
such as the SOLR search server and the presentation layer built on JOO.
The objectives of this thesis are as follows:
• Study Q&A systems around the world and their methods for evaluating the
expertise of users.
• Design an evaluation algorithm that scales to large systems and fits the
specific environment of BkProfile.
• Implement and deploy the program on the real BkProfile system.
Accordingly, the thesis is organized as follows:
• Chapter 1 describes the motivation for the research, states the problem to be studied
and presents the approach to solving it.
• Chapter 2 includes a brief analysis of the approaches of other Q&A systems, namely
Google AardVark and Google Confucius; a study of Markov chains and the PageRank
algorithm as the main foundations of this research; and finally Hadoop MapReduce,
the distributed platform I used to implement the algorithm. In this chapter I also
point out the main contribution of this research.
• Chapter 3, the most important chapter of the thesis, presents in detail my method
for ranking users in Q&A systems.
• Chapter 4 discusses the details of how I implemented the proposed algorithm on
Hadoop MapReduce, including a technique for automating the execution of Hadoop
MapReduce processes.
• Chapter 5 summarizes the experimental results of testing ExpertRank in a real
system named BkProfile. I also describe the data I used for testing.
• Chapter 6 discusses the current applications and remaining problems of
ExpertRank and presents several directions for improving the quality of the
algorithm's results.
• Chapter 7 summarizes what I have accomplished, compared with the goals set out
beforehand.
Thus, chapters 1 and 2 correspond to the problem statement and solution orientation part,
chapters 3 to 6 correspond to the achieved results part, and chapter 7 corresponds to the
conclusion part in the School's thesis-writing guidelines.
ABSTRACT
This dissertation presents ExpertRank, an iterative algorithm to evaluate the expertise of users
in a specific domain of knowledge in question-and-answer (Q&A) systems, in particular BkProfile,
a knowledge sharing community. The evaluation helps users prove their ability in their
professional profiles in the system, which is a critical motivation for a Q&A system to run
smoothly. I base my method on the Markov chain model and refer to Google's PageRank algorithm
to convert the above problem into a probabilistic model, and then use an iterative method to
solve it. The algorithm is designed on the MapReduce programming model so that it
can be applied to large-scale systems. I have tested it on an open-source platform
named Hadoop MapReduce and deployed it in the BkProfile web application.
The results of the algorithm can be encapsulated as a reliable
parameter for the evaluation systems of other large-scale knowledge sharing web applications.
Keywords: Iterative method, Markov chain, MapReduce, PageRank
ABSTRACT OF THE GRADUATION THESIS
This research introduces ExpertRank, an iterative algorithm that evaluates the quality of answers
and the expertise of users in a given field in large-scale question-and-answer systems,
specifically the BkProfile knowledge-sharing community network. Evaluating user expertise
helps users demonstrate their professional knowledge in their professional profiles in the
system, which is an important incentive driving the activity of Q&A systems. I relied on
PageRank, the web page ranking algorithm of the Google search engine, and on the Markov chain
model to turn the above problem into probabilistic models, and from them built an iterative
algorithm that evaluates both quantities at the same time. The algorithm is designed on the
MapReduce model, so it can be applied to large-scale distributed systems. I tested it on the
open-source system Hadoop MapReduce and deployed it to run stably in the BkProfile web
application. The results of the algorithm can also be packaged as a reliable parameter for the
evaluation systems of other large-scale knowledge-sharing applications.
Keywords: Iterative method, Markov chain, MapReduce, PageRank
ACKNOWLEDGEMENT
First, I am heartily thankful to Prof. Dr. Thang Quyet Huynh, whose valuable
advice and timely guidance enabled me to complete this dissertation.
I would like to thank MSc. Quoc Le for his in-depth discussions with me when I was
searching for a solution to this research problem.
Then, I want to express my appreciation and special thanks to the BkProfile team,
particularly Anh Nguyen, Vi Nguyen and Khanh Ha, who supported me in
implementing ExpertRank on Hadoop MapReduce.
Finally, I want to send many heartfelt thanks to the School of Information &
Communication Technology, under Hanoi University of Science & Technology, for the
knowledge, skills and spirit I have been learning there.
Pham Tuan Long
Table of Contents
INTRODUCTION
1.1 Motivations
1.1.1 The explosion of Q&A systems
1.1.2 The need of a good expertise evaluation formula
1.2 Statement of the problem
1.3 My approach
1.4 Dissertation structure
1.5 Summary
LITERATURE REVIEW
1.6 Analysis of Google Confucius's approach
1.7 Analysis of Google AardVark's approach
1.8 Online knowledge market
1.9 Recommendation-based evaluation system
1.10 Markov chain and its convergence condition
1.11 Google PageRank algorithm
1.12 Cloud computing
1.13 MapReduce programming structure
1.14 Hadoop MapReduce
1.15 Summary
METHODOLOGY
1.16 Recommendation channel aggregation
1.17 Random expert seeker model
1.18 ExpertRank basic formula
1.19 Examples of running ExpertRank basic formula
1.20 ExpertRank's convergence
1.20.1 Mapping from Random expert seeker model to Markov chain
1.20.2 Dangling links
1.20.3 ExpertRank traps
1.20.4 Solution
1.21 ExpertRank full formula
1.22 Limitation of the algorithm
1.23 Summary
IMPLEMENTATION
1.24 Flow of MapReduce processes
1.25 MapReduce Implementation on Apache Hadoop
1.26 Data normalization
1.27 ExpertRank transition calculation
1.28 Random visiting probability distribution
1.29 Halt condition
1.30 Final result extraction
1.31 Summary
RESULTS & FINDINGS
1.32 Population and Sampling
1.32.1 Overview of BkProfile
1.32.2 Sampling methods
1.32.3 Expected Results
1.33 Performance & convergence
1.34 Examples
1.35 Summary
DISCUSSION
1.36 Applications
1.36.1 Search engine ranking function
1.36.2 Answer quality evaluation
1.36.3 Potential answerer suggestion
1.36.4 User reliability evaluation
1.37 Limitation of current ExpertRank formula
1.38 ExpertRank's incentives
1.39 Prevention of spammers
1.40 ExpertRank's personalization
1.41 Summary
CONCLUSION
INDEX OF TABLES
Table 1: Advantages of Q&A systems in comparison with other knowledge sharing &
searching tools
Table 2: Incentives of current famous Q&A sites
INDEX OF FIGURES
Figure 1: An example of application of Markov chain
Figure 2: PageRank transfer among web pages
Figure 3: Applications of cloud computing
Figure 4: Basic model of MapReduce
Figure 5: Random expert seeker model
Figure 6: ExpertRank transfer among expert network
Figure 7: Step 1 of the example of using ExpertRank basic formula
Figure 8: Step 2 of the example of using ExpertRank basic formula
Figure 9: Mapping from Random expert seeker model to ExpertRank
Figure 10: Dangling links and outer node
Figure 11: ExpertRank trap
Figure 12: A solution for the convergence of ExpertRank
Figure 13: Flow of MapReduce processes
Figure 14: Rough data
Figure 15: Aggregation
Figure 16: Transition table
Figure 17: The convergence of ExpertRank
Figure 18: The search engine of BkProfile
Figure 19: Presentation of answers for a question in BkProfile
Figure 20: Stats in BkProfile
Figure 21: A profile in BkProfile
Figure 22: ExpertRank listings on each topic's page
ABBREVIATION
No | Abbreviation & terminology | Explanation
1 | Q&A | Question & Answer
2 | NLP | Natural Language Processing
3 | SoICT | School of Information and Communication Technology
4 | HUST | Hanoi University of Science and Technology
TERMINOLOGY
No | Terminology | Explanation
1 | Outlink | A link appearing in a website that directs users to another website
2 | Outer node | A node that does not have any outlink
3 | Dangling link | A link that directs to an outer node
4 | Search engine | A computer program which finds information on the Internet by looking for words which you have typed in
5 | Apache Lucene | A high-performance, full-featured text search engine library
6 | SOLR | A popular open-source text search engine based on Lucene technology and written in Java
7 | Hadoop | A project that develops open-source software for reliable, scalable, distributed computing
8 | MapReduce | A programming model for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes
9 | Hadoop MapReduce | An open source volunteer project under the Apache Software Foundation
10 | Cloud computing | A style of computing where massively scalable IT-related capabilities are provided as a service using Internet technologies to multiple external customers
11 | Markov chain | A mathematical system that undergoes transitions from one state to another as a chain, endowed with the Markov property: the next state depends only on the current state and not on the past
12 | PageRank | Google's algorithm to rank the importance of web pages on the Internet
13 | ExpertRank | Name of my proposed algorithm to rank users' expertise
INTRODUCTION
In this chapter, I will first present the explosion of Q&A systems and the reasons
why a Q&A system needs an algorithm to evaluate both the expertise of users and the
quality of answers, in order to show the need for this research. Then, I will clearly state
the problem I would like to solve and discuss my approach to solving it. At the end of
this chapter, I will describe the other chapters of my dissertation and how they connect
with one another.
1.1 Motivations
1.1.1 The explosion of Q&A systems
In recent years, Question and Answer systems, or Q&A systems for short, have
grown at rocket speed. In 2010, Google published research stating that 25% of
Google's top search results include at least one link to some Q&A site [1]. This is easy to
understand, as most big names in IT have entered the so-called online
knowledge market, namely Yahoo with Yahoo Answers; Google with Google Answers
and Google Confucius; and Facebook with Facebook Q&A. There are also very good
Q&A systems from smaller companies such as Quora, Stack Overflow, OSDir, and so
on. People have even built a platform, named Question2Answer, to quickly create new Q&A
sites. So far, it has helped build up to 1665 sites in 32 languages [2]. Most recently, in
2010, AardVark, a 2-year-old Q&A system developed by 20 engineers, was
acquired by Google for a price of up to $50,000,000.
Now, Q&A sites are still burgeoning thanks to their advantages in comparison with
other online knowledge sharing & searching tools such as search engines, forums and
blogs:
No | Advantage of Q&A | Search Engine | Forum | Blog
1 | Providing thorough and direct answers for complex questions | Search results are usually indirect and require users to synthesize the results to get the final answer | Too many junk and not-to-the-point answers, which make up very long threads. Answers and discussions are mixed. Users need to synthesize articles to get the final answer | Answers are often indirect since articles are written to solve the authors' problems rather than the searchers' ones
2 | Providing answers for completely new questions | Can't | Can | Can't
3 | Providing answers for heavily local questions | Usually can't | Can | Can
4 | Providing search engines with well-indexable content thanks to its consistent structure | N/A | Forums are one of the hardest kinds of sites to index due to their heterogeneous structure | Blogs are also hard to index due to their heterogeneous structure
Table 1: Advantages of Q&A systems in comparison with other knowledge sharing & searching
tools
1.1.2 The need of a good expertise evaluation formula
Through using various Q&A systems, e.g. OSDir, Google Confucius, Quora,
Google AardVark, and so on, as well as through reading analytical papers on
successful Q&A systems, I realized that there are three main requirements for a
Q&A site to succeed:
• New questions quickly get answers.
• Answers are of high quality.
• Questions & answers are organized so that they are easy to find.
Each website has its own way to meet the three criteria above. Some choose
automatic methods such as automatic answering, automatic classifying, automatic
organizing and so on, but most of them use incentives to encourage their users to
fulfil the three requirements themselves. According to [1], there are two main incentives
for people to contribute to a Q&A site: financial rewards & virtual values.
No | Site | Question routing | Incentives | Establishment Year
1 | Internet Oracle | to experts | virtual | 1989
2 | Ask.com | N/A | N/A | 1996
3 | WikiAnswers | to public | virtual | 2002
4 | Yahoo! Answers | to public | virtual | 2005
5 | Baidu Zhidao | to public & experts | virtual & $ | 2005
6 | Google Confucius | to public & experts | virtual & $ | 2007
7 | Aardvark | to friends | virtual | 2009
8 | Powerset | N/A | N/A | 2005
9 | Quora | to friends | virtual | 2009
Table 2: Incentives of current famous Q&A sites
Question routing means delivering a question from the questioner to the people who have
the strongest incentive to answer it. Question routing is critical to ensure that questions
are answered quickly and that answers are of high quality. However, it depends on the
incentives that the system provides. In turn, incentives depend on the
evaluation system of the Q&A site. If the incentive is financial, the site needs a fair
evaluation formula to determine how much it should pay for an answer from a specific
expert. If the incentive is virtual values, a fair evaluation system is also critical to make
the virtual values more reliable and more beneficial. It can be said that evaluation, i.e.
the evaluation of answers and the evaluation of experts, is the main driving force that keeps a
Q&A site running smoothly. Together with the method of organizing questions &
answers, it decides the success of a Q&A site.
However, working out a good evaluation is not an easy task due to the complicated
relationships between the elements of a Q&A site, such as answers, questions, answer
providers and so on.
I, together with my team, am creating a Vietnamese Q&A site named BkProfile
[3]. The 1000-user system is currently running quite smoothly without a proper
evaluation formula. However, we all understand that a good and fair evaluation
formula is essential to ensure the stable development of the site.
1.2 Statement of the problem
In this dissertation, I will present a method to evaluate both the expertise of users and
the quality of answers. The evaluation should have the following
characteristics:
(1) Fair & objective. Since evaluation is usually subjective, there is no way to create
a perfect evaluation formula that works in all cases. However, a fair & objective
formula is a good basis for people's own judgment.
(2) Incentive providing. Not all evaluation formulas, even those which are rational in
some cases, provide incentives for the development of a Q&A site. However,
providing incentives is critical here since it is the reason for researching the
evaluation formula.
(3) Scalable. We plan to push our system to the scale of the whole Vietnamese
knowledge market. Thus, a scalable algorithm is compulsory.
1.3 My approach
There are two natural ways to evaluate an answer or a user:
(1) Evaluate the content of answers and the contribution of users themselves, such
as: how relevant the answer is to its question, the level of its writing style,
how different the answer is from previous answers, and so on; and
how much time the user has contributed to the system, how many answers he
has provided, how good his answers were, and so on.
(2) Evaluate the relationships among answers, among users, and between answers and
users, such as how many votes a user gave to another, how many answers a user provided
to another, how good an answer was compared to other ones, and so on.
Between the two approaches, I chose the latter because current methods to evaluate
the content of a text are not good enough and, in particular, are quite subjective, which
violates the first requirement in the statement of the problem. Moreover, to simplify the
algorithm, I only evaluate the expertise of users and then use it to evaluate the quality of
answers through the interactions between users and answers, i.e. voter/voted and
answer-provider/answer.
However, the problem is still complicated, since a number of channels connect
users, such as:
• A answers B
• A votes for B
• A refers B
• A invites B
• A and B are in the same physical expertise group, like a professional chapter,
university class, technology club, etc.
• …
Thus, I solve the problem by separating the whole process into two independent
sub-processes:
• Aggregate all channels into one single channel, which I call recommendation.
• Analyze the links among all users through only the aggregated channel. The
analysis technique I use is similar to Google's PageRank algorithm, which uses the
Markov chain model, the basis of my method.
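The first sub-process can be illustrated with a minimal sketch. It assumes that every interaction is already oriented as (channel, recommender, recommended) and that each channel carries a hand-picked weight. The channel names, the weight values and the `aggregate_channels` helper are hypothetical and are not taken from BkProfile's actual implementation.

```python
from collections import defaultdict

# Hypothetical channel weights: illustrative values only, not BkProfile's.
CHANNEL_WEIGHTS = {"answers": 1.0, "votes": 2.0, "refers": 3.0, "invites": 1.5}


def aggregate_channels(events):
    """Merge (channel, recommender, recommended) events into one weighted
    recommendation graph: graph[recommender][recommended] -> strength."""
    recommendation = defaultdict(lambda: defaultdict(float))
    for channel, source, target in events:
        if source == target:
            continue  # a user cannot recommend himself
        # Unknown channels contribute nothing (weight 0.0).
        recommendation[source][target] += CHANNEL_WEIGHTS.get(channel, 0.0)
    return {s: dict(targets) for s, targets in recommendation.items()}


events = [
    ("answers", "A", "B"),
    ("votes", "A", "B"),
    ("invites", "C", "A"),
]
graph = aggregate_channels(events)
```

The result is a single weighted graph, so the second sub-process only has to analyze one kind of link regardless of how many channels feed into it.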
The method is quite fair and objective since it considers the whole network rather
than only a piece of information like a specific answer. It provides incentives for the
system since highly ranked users in recommendation-based systems have many
privileges, such as prestige, weight of influence and so forth. Finally, it is scalable,
as the algorithm can be implemented on a large-scale platform named Hadoop
MapReduce. I implemented a simple version of ExpertRank on this platform, which
gave me some encouraging results.
1.4 Dissertation structure
The dissertation is organized as follows:
Chapter 1 describes my motivation for the research, the statement of the problem
and my approach to tackling the problem.
Chapter 2 includes a quick review of Google AardVark and Google Confucius, as
the most closely related research; Markov chains & PageRank as the foundation of my
research; and Hadoop MapReduce as the platform I use to implement the algorithm.
Chapter 3, the most important chapter of this dissertation, elaborates my method for
ranking users in a Q&A system.
Chapter 4 discusses the details of my implementation of the algorithm in Hadoop
MapReduce. It includes a few techniques to automate MapReduce processes.
Chapter 5 summarizes the experimental results of testing ExpertRank in a real
system named BkProfile. I also describe the population & sampling I use to
test the algorithm.
Chapter 6 discusses the current applications and current problems of ExpertRank,
and recommends a few ways to make it better.
Chapter 7 is the conclusion.
1.5 Summary
In this chapter, I have described my motivation for the research, i.e. the explosion
of Q&A systems and their need for an algorithm to evaluate the quality of answers and
the expertise of users. I also clearly stated the problem and set out my approach to
tackling it, i.e. aggregating the relationship channels into a single channel called
recommendation and analyzing the network of recommendations to rank experts. The next
chapter is a review of related work, which explains the reasons why I chose this
approach.
LITERATURE REVIEW
Despite the rapid development of Q&A systems, careful and direct research on
them is quite limited. Among the available work, I chose the Google Confucius and Google
AardVark research papers to analyze because of their high quality and their similarity to
the system my team and I are building. Then I generalize the Q&A system to an online
knowledge market tool to take advantage of the research on other online markets,
particularly those which have been proved to work well in Vietnam. An analysis of
recommendation-based evaluation systems follows, since it is the main idea of
my approach. Then, I discuss Markov chains, which act as the basis of my research,
as well as Google's PageRank as a famous example of using a Markov chain. My
method derives many ideas from Google's PageRank. Finally, I give a brief
overview of cloud computing, the MapReduce programming model and Hadoop
MapReduce, because Hadoop MapReduce is the platform I use to implement my
algorithm.
1.6 Analysis of Google Confucius's approach
Google Confucius [1] is a very sophisticated system with 6 main components:
search integration, question labeling, question recommendation, answer quality
assessment, user ranking and NLP-based answer generation. With these 6 components,
it provides a smooth flow in which users' work is minimized. It also provides
financial incentives to motivate experts to work more in the system. The user ranking
system is quite important in helping to determine how much money to pay each user. They
used the HITS algorithm [4], applied to the questioner/answerer relationship, to rank users.
Their rationale is that users in Q&A systems are not active enough to provide adequate
social interaction like votes, improvement suggestions, and so on. However, we still believe
that if a Q&A site is constructed as a knowledge sharing community, people will provide
more social interaction, and that this interaction is valuable information to mine.
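To make the idea of ranking users from the questioner/answerer relationship concrete, here is a minimal sketch of the generic HITS iteration. It is not Confucius's actual implementation: the `hits` function, the fixed iteration count and the tiny edge list are assumptions made purely for illustration. Each edge (questioner, answerer) is read as the questioner pointing to the answerer, so answerers accumulate authority scores.

```python
import math


def hits(edges, iterations=50):
    """edges: (questioner, answerer) pairs. Returns (hub, authority) scores."""
    nodes = sorted({user for edge in edges for user in edge})
    hub = {u: 1.0 for u in nodes}
    authority = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the users whose questions u answered.
        authority = {u: 0.0 for u in nodes}
        for questioner, answerer in edges:
            authority[answerer] += hub[questioner]
        # Hub: sum of authority scores of the users who answered u's questions.
        hub = {u: 0.0 for u in nodes}
        for questioner, answerer in edges:
            hub[questioner] += authority[answerer]
        # Normalize both score vectors to unit Euclidean length.
        for scores in (authority, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for u in scores:
                scores[u] /= norm
    return hub, authority


# Hypothetical data: x answered three questioners, y answered one.
edges = [("q1", "x"), ("q2", "x"), ("q2", "y"), ("q3", "x")]
hub, authority = hits(edges)
```

In this toy network, x ends up with a higher authority score than y because more questioners point to x.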
1.7 Analysis of Google AardVark's approach
Google AardVark calls itself a social search engine, and in fact its way of
implementing a Q&A site is quite different from all others. It does not keep the answers
as a knowledge store, but considers all queries as new questions and tries to route them to
the right people. The right people are those who have good activity records in
AardVark and those who are somehow connected to the questioner. It extracts the
latter information by intensively analyzing users' social networks, such as Facebook,
Twitter and Yahoo 360. It also has a module to evaluate the quality of answers
and the expertise of users through probabilistic models, which is what I drew on when
building the formula for ExpertRank.
1.8 Online knowledge market
Online knowledge market is the common term for all systems that aim at
routing knowledge from “providers” to “seekers”. Considering a Q&A system as an
online knowledge market helps me look at its essential characteristic: it is an
online market. Researching online markets, e.g. Amazon and eBay, I realized that
there are 3 things which are very important to drive a market to success:
• The availability of goods, i.e. knowledge. In Q&A systems, availability is
created through the size and activeness of the community and through supporting systems
like the information browser, request router and search engine.
• The evaluation of goods through reviews and information adequacy. In Q&A
systems, these are the comments, the votes for answers, and the structure of an
answer.
• The evaluation of service providers. In Q&A systems, of course, it is mainly the
expertise of users. Expertise here is defined as the ability to provide good
answers. It has nothing to do with the certificates or positions people may have in
real life.
Finally, besides the 3 criteria above, through researching a few Vietnamese
online markets like vatgia.com and chodientu.vn, I see that how widely a product is used
is a very important criterion for Vietnamese buyers. This is the
foundation for my belief that an evaluation formula based on community reviews
will succeed in Vietnamese online knowledge markets.
1.9 Recommendation-based evaluation system
Recommendation-based evaluation is very common in Western culture. People
base their assessment of a person on trustworthy recommendations. For instance,
admissions officers base their admissions decisions on letters of recommendation
from their colleagues; researchers read through the titles of all articles referring to a
specific article to roughly evaluate its quality. In a world of knowledgeable people who
care much about their prestige, recommendation systems work smoothly since people
are very careful with their recommendations in order to promote or protect their image.
The Q&A site which we are building focuses on a knowledgeable community. Therefore, using
a recommendation system to evaluate people is reasonable.
As for the technique used, link analysis [5], or in other words collaborative
filtering [6], a technique to predict characteristics of objects, is not a new one. It is
discussed intensively in an active branch of computer science named Knowledge
Discovery in Databases [7]. My contribution is to apply and develop the technique for
Q&A systems, a new context with unique characteristics.
1.10 Markov chain and its convergence condition
A Markov chain is a mathematical system that undergoes transitions from one state
to another as a chain [8]. It is a random process endowed with the Markov property: the
next state depends only on the current state and not on the past.
In practice, Markov chains have many applications. In Figure 1, the transitions
between market conditions are modeled as a Markov chain. Based on this
information, the probability distribution of the conditions in a
specific period of time can be calculated by iteratively computing the new distribution
after each transition until the process converges. The question is whether a Markov chain
converges and how many convergent states it has. This is one of the main
research problems concerning Markov chains.
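The iterative calculation described above can be sketched as follows. The three market states (bull, bear, stagnant) and the transition probabilities in `P` are illustrative assumptions, not values read from Figure 1, and the function names are hypothetical.

```python
def step(distribution, transition):
    """One transition: new[j] = sum over i of distribution[i] * transition[i][j]."""
    n = len(distribution)
    return [sum(distribution[i] * transition[i][j] for i in range(n))
            for j in range(n)]


def stationary_distribution(transition, tolerance=1e-10, max_iterations=10_000):
    """Iterate from the uniform distribution until two successive
    distributions differ by less than `tolerance` (L1 distance)."""
    n = len(transition)
    current = [1.0 / n] * n
    for _ in range(max_iterations):
        following = step(current, transition)
        if sum(abs(a - b) for a, b in zip(current, following)) < tolerance:
            return following
        current = following
    raise RuntimeError("no convergence within max_iterations")


# Rows: current state; columns: next state (bull, bear, stagnant).
# Illustrative probabilities only; each row sums to 1.
P = [
    [0.90, 0.075, 0.025],
    [0.15, 0.80, 0.05],
    [0.25, 0.25, 0.50],
]
pi = stationary_distribution(P)
```

For this particular matrix the iteration settles on a distribution that no longer changes under a further transition, which is what "convergent state" means in the text.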
According to the theory of Markov chains [8], an irreducible Markov chain
converges to one state, or in other words has exactly one stationary distribution, if and
only if all of its states are positive recurrent. Here:
• A Markov chain is irreducible if it is possible to get to any state from any state.
• A state i is transient if, starting from i, there is a non-zero probability that the
chain will never return to i. A state is recurrent, or in other words persistent, if it is
not transient. A recurrent state is positive recurrent if its expected return time is finite.
In this dissertation, I map my research problem to a Markov chain in order to design a
formula that, after a finite number of iterations, leads to one and only one
convergent state, i.e. a formula that satisfies both of the above conditions.
1.11 Google PageRank algorithm
Figure 1: An example of application of Markov chain
PageRank [9] is a very famous Google algorithm that determines the importance of
a web page from the links that other pages make to it: the more pages link to a web page,
and the higher the quality of those pages, the more important it is. For
instance, because tamhonvietnam.net is linked from soict.hut.edu.vn, a trustworthy website,
the importance score of tamhonvietnam.net is increased by some value, and hence the
website becomes easier to find in the search engine.
If the inclusion of an outlink to a web page is considered a recommendation of that
page, PageRank evaluation is exactly a recommendation-based one. Moreover,
PageRank targets a large-scale problem, as it attempts to assess the whole World
Wide Web. Finally, the PageRank algorithm can be considered a Markov chain under the
random web surfer model. Therefore, I base my research on PageRank to ensure the
acceptability of some of my hypotheses.
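A minimal sketch of the PageRank iteration under the random web surfer model is given below. It is only an illustration: the damping factor of 0.85 is the commonly quoted default, spreading the rank of pages without outlinks evenly over all pages is one common convention rather than the only one, and the four-page link graph is hypothetical.

```python
def pagerank(links, damping=0.85, tolerance=1e-10, max_iterations=1000):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iterations):
        # Rank held by pages without outlinks is spread evenly (an assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes its rank on, split equally among its outlinks.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tolerance:
            return new_rank
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is linked from three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Here page "c" receives the highest score because three pages link to it, while "d", which nobody links to, keeps only the small share every page gets from the random jump.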
1.12 Cloud computing
Cloud computing is a style of computing where massively scalable IT-related
capabilities are provided as a service using Internet technologies to multiple external
customers. With the appearance of cloud computing, small companies have more chances
to access large-scale batch processing, which is becoming more and more popular
with the development of data mining techniques and the appearance of huge databases
like those of Facebook, Amazon, Zing Me and so on. Currently, there are many good
cloud computing service providers, e.g. Amazon EC2, NephoScale, Google, Microsoft,
RightScale, etc. Using their services, people do not need to buy expensive computers to
scale their systems up; instead, they rent services with their desired configurations and
use the providers' APIs to access them. Since BkProfile is a small team, using cloud
computing services is probably a wise approach to large-scale computation.
Figure 2: PageRank transfer among web pages
