
Vietnam National University of Hanoi
University of Engineering and Technology
STUDENT RESEARCH SEMINAR, 2012
Project:
An Experiment in Building Vertical Search Engine
Students:
Phạm Ngọc Quân K53CA
Phạm Lê Lợi K53CA
Bùi Hữu Điệp K53CA
Lê Đăng Đạt K53CA
Faculty: Information Technology
Supervisor: Dr. Lê Quang Hiếu
Hanoi, 2012
PROJECT SUMMARY
Project: An Experiment in Building Vertical Search Engine
Project members: Phạm Ngọc Quân, Phạm Lê Lợi, Bùi Hữu Điệp, Lê Đăng Đạt
Project supervisor: Dr. Lê Quang Hiếu, Department of Information Technology, UET
Management: University of Engineering and Technology, VNU Hanoi
Research time: 9/2011 – 3/2012
1. Motivation
When using a general search engine, users get results spanning many topics, because the indexed websites are not classified by domain. For people searching for information about a specific topic, however, the websites in that domain should be prioritized. With a general-purpose search engine, such users have to read through the search results and pick out the suitable ones themselves, which is inconvenient. The purpose of our research is an experiment in building a search engine that lets users choose the domain they are interested in, so that the returned results relate closely to the chosen domain.
2. Main content
In this project, we focus on building a search engine that works as a layer on top of popular general-purpose search engines. The vertical search engine collects search results from one or several such engines (several only if there are meaningful differences in their search methods). A classification module then decides which sites are related to the chosen domain. Finally, the results are filtered to remove off-topic webpages and returned to the user. We also add keyword suggestions, such as keyword correction and keyword expansion, so that users can get better results.
3. Research result
In our experiment, we demonstrate the idea by building a vertical search engine with the topic chosen as Football. The search engine contains a data-collecting module that gets search results from the Yahoo and Bing search engines, and a classification module that uses a Support Vector Machine classifier. The whole project was written in Java and deployed as a website using JSP.
A separate experiment is the keyword suggestion module, which was written in Python. Since its running time when integrated with JSP was unacceptable, this module was removed from the integrated search engine, but it can be tested on its own.
TABLE OF CONTENTS
I. INTRODUCTION
II. LITERATURE REVIEW
1. In the world
2. In Vietnam
3. Our research goal
III. VERTICAL SEARCH ENGINE
1. System Architecture
2. System's Features
IV. SYSTEM MODULES
1. Meta Search Engine
1.1. Introduction
1.2. Operation
1.3. Structure
2. Webpage Filter Module
2.1. Introduction
2.2. Support Vector Machine Introduction
2.3. The LIBLINEAR Library Introduction
2.4. The Webpage Filter Module
3. Keyword Suggestion Model
3.1. Introduction
3.2. Operation
3.3. Algorithms
4. Search Interface
V. EXPERIMENTAL RESULT
VI. FUTURE WORK
VII. CONCLUSION
VIII. REFERENCES
I. INTRODUCTION
The Internet is becoming more and more popular in every country, and its capacity grows every second. The Internet's complex structure and huge amount of data have long been serious obstacles for its users. This motivated the introduction of a large number of search engines, among which Google was a big success. However, a general search engine like Google, which treats data in all domains equally, becomes inconvenient when users prefer one specific domain over the others. In this situation, a vertical search engine, with the contribution of domain-specific expertise, performs better.
A vertical search engine, as distinct from a general web search engine, focuses on a
specific segment of online content. The vertical content area may be based on topicality,
media type, or genre of content. Common verticals include shopping, the automotive
industry, legal information, medical information, and travel. In contrast to general Web
search engines, which attempt to index large portions of the World Wide Web using
a web crawler, vertical search engines typically use a focused crawler that attempts to
index only Web pages that are relevant to a pre-defined topic or set of topics.

Some vertical search sites focus on individual verticals, while other sites include multiple
vertical searches within one search engine. Vertical search offers several potential
benefits over general search engines:
• Greater precision due to limited scope
• Leverage of domain knowledge, including taxonomies and ontologies
• Support for specific, unique user tasks
One kind of vertical search engine, focused on a specific topic, is domain-specific search. Domain-specific search solutions focus on one area of knowledge, creating customized search experiences that, because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers. [2]
Normally, the process of building a search engine consists of the steps below:
• Crawling: a crawler collects websites from the Internet, covering as much of it as possible and building a database of websites for searching.
• Indexing the websites.
• Query processing: the user's query is processed with natural language techniques and matched against the indexed websites to list the appropriate results.
• Ranking: the websites are ranked, and the ranked list is returned to the user.
However, these steps require massive storage as well as a sophisticated algorithm for ranking the websites. Instead of building a whole new search engine from scratch, a metasearch engine is another way to create one.
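Though our project delegates the crawling and indexing steps to existing engines, the indexing and matching idea can be illustrated with a toy inverted index; the pages and words below are illustrative placeholders:

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of page ids that contain it (inverted index)."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages containing every word of the query."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

pages = {1: "football match results", 2: "stock market results"}
index = build_index(pages)
print(sorted(search(index, "results")))          # [1, 2]
print(sorted(search(index, "football results"))) # [1]
```

A real engine would also store positions and weights per word for ranking; this sketch only shows the lookup structure.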
A metasearch engine is a search tool that sends user requests to several other search
engines and/or databases and aggregates the results into a single list or displays them
according to their source. Metasearch engines enable users to enter search criteria once
and access several search engines simultaneously. Metasearch engines operate on the
premise that the Web is too large for any one search engine to index it all and that more
comprehensive search results can be obtained by combining the results from several
search engines. This also may save the user from having to use multiple search engines
separately. [3]
In our research, we combine the technique of the domain-specific search engine with the idea of the metasearch engine, resulting in a two-level structure that takes advantage of each underlying search engine's strengths.
II. LITERATURE REVIEW
1. In the world
Search engine construction has evolved from the early days of building from scratch to
today's plethora of data APIs that make tomorrow's vertical search engines more
powerful and easier to build.
Past
• Huge expenses to build the index, find the data, maintain the process.
• Majority of time spent on building relevancy and less on design and creating a
unique experience.
Present
• Search APIs reduce the complexity of building an index.
• Vertical search engines still spend significant resources on creating unique data.
• More resources are spent on designing the best relevancy and a unique experience.
Future
• New search engines tap into huge amounts of distributed data.
• More time for developing unique approaches to presenting relevant information
and creating a unique experience.
Vertical search engines have a distinct advantage over the general search engines. They
already know what their users are interested in. A search for Jaguar in Yahoo! may return
the automobile, the Mac OS, or the animal. However, vertical search engines that
specialize in sports, autos, or animals would not have that problem. This assumption of
user interest gives vertical search engines more flexibility in creating new models of
relevancy ranking.[4]
General search engines like Google, Yahoo, and Bing are familiar to every Internet user and are considered indispensable tools. People now even go to search pages to find websites they already know, instead of typing the websites' names directly into the address bar. On the other hand, vertical search engines and metasearch engines have had very little success. Some vertical search engines, such as MedNar, PubMed, and BizNar, and some metasearch sites, such as iBoogie and InfoGrid, have been built, but none of them has become famous worldwide. Currently, the vertical search mechanism and the metasearch method on their own do not seem powerful enough to overtake the classic search engines; they should be researched further or combined together.
A good representative of vertical search is Truevert, an environmental vertical search engine that goes beyond the basic assumption of a niche user's intentions. It builds a unique natural language dictionary to enhance relevancy. A search for "CFL" on a regular search engine could return "Canadian Football League", but Truevert recognizes it as the acronym for "Compact Fluorescent Lighting", a much more relevant term for environmental concerns. [5]
Looking back, Yahoo was once the dominant player in search; it then fell to second place after the rise of Google. The success of vertical search engines may lie in the future, when convenience is appreciated even more.
2. In Vietnam
In Vietnam, general search engines are very popular: almost every website has its own search engine. However, the ideas of vertical search and metasearch have not been industrially explored.
3. Our research goal
We want to demonstrate the idea of a vertical search engine that re-filters the results of a general search engine based on a specific topic such as medicine, health, football, weather, or the economy. A simple vertical search engine should offer a user interface similar to general search engines, its speed should be acceptable, its results should be prioritized by topic, and keywords should be suggested to the users.
Beyond that, the experiment can also serve as an approach to providing a mechanism for quickly building a vertical search engine with minimal effort.
III. VERTICAL SEARCH ENGINE
1. System Architecture

The architecture of the system is layer-based: each layer represents one level of filtering. The more layers we have, the more irrelevant websites are filtered out, and thus the better the results. The layers are independent, so we can easily add or remove a layer, or replace one with a new one.
We have developed four modules: Meta Search Engine, Webpage Filter, Keyword Suggestion, and Search Interface:
Figure 1: System architecture
(The figure depicts the Search Interface on top, the Keyword Suggestion and Filter modules with their Knowledge Base in the middle, and the Meta-search Engine connected to the underlying search engines at the bottom.)
The Metasearch Engine uses the metasearch technique to query other search engines with the given keyword. It gathers all the returned results together with their scores, transforms them from HTML into plain text, and sends the result to the upper module, the Webpage Filter.
The Webpage Filter takes the results from the Metasearch Engine and refines them based on the knowledge base, using Support Vector Machine classification. All results that pass the filter are sent to the Search Interface for display.
Keyword Suggestion, independent of the Metasearch Engine, gets suggestions from other search engines and then uses Information Gain (IG) to re-rank them. The top-scoring suggestions are sent to the Search Interface for display.
The Search Interface allows the user to choose the number of pages to fetch and to enter the keyword. It then calls the Webpage Filter and Keyword Suggestion modules to get the filtered pages and the suggestions, respectively. Finally, the results are displayed to the user.
2. System’s Features

By combining the popular and efficient search engines of Bing and Yahoo with the functionality to refine the results by category, the system can offer a vertical search service that helps users find relevant information within the topic, without having to filter the search results themselves. The keyword suggestion function helps the user discover keywords within the topic, which is an upgrade over the keyword suggestion of the popular search engines.
The system also offers a way to set up a personal or topical search engine to serve an organization or company for a limited time when specific information needs to be searched, a task for which the popular search engines are not reliable.
IV. SYSTEM MODULES
1. Meta Search Engine
1.1. Introduction
The Meta Search Engine is the module that stands between the search engine and the Internet. It is a Java program that receives a keyword from the upper layer and uses the Internet as the resource to find related pages. It returns an unordered list of pages on all topics to the upper layer.
1.2. Operation
Given a keyword from the user, the Meta Search Engine performs the following tasks:
• Query multiple search engines and get multiple lists of pages.
• Merge the lists, remove duplicates, and check each page's availability.
• Transform the received pages into plain text and pass them to the upper layer.
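The merging step can be sketched as follows; the engine names and URLs are placeholder values, and the availability check is omitted:

```python
def merge_results(result_lists):
    """Merge several ranked URL lists into one deduplicated ranked list,
    keeping each URL at the best (lowest) rank any engine gave it."""
    best_rank = {}
    for results in result_lists:
        for rank, url in enumerate(results):
            if url not in best_rank or rank < best_rank[url]:
                best_rank[url] = rank
    return sorted(best_rank, key=best_rank.get)

yahoo = ["a.com", "b.com", "c.com"]   # placeholder result lists
bing = ["b.com", "d.com", "a.com"]
print(merge_results([yahoo, bing]))   # ['a.com', 'b.com', 'd.com', 'c.com']
```

Keeping the best rank per URL means pages that several engines agree on stay near the top of the merged list.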
1.3. Structure
This module consists of two smaller modules: the web downloader and the HTML-to-text converter.
Figure 2: The metasearch structure
The web downloader queries the other search engines with the keyword by requesting the appropriate URLs. From the returned HTML pages, it extracts the links to the result pages, their brief descriptions, their scores, and their availability. It then visits each page, gets its content, creates a list of HTML files, and sends them to the next module.
The text converter, after receiving the HTML list from the web downloader, reads through the HTML contents and removes JavaScript, CSS, and other framing to obtain a list of natural-language texts. That list is then sent to the classification module.
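The conversion can be sketched with Python's standard `html.parser` (our converter is written in Java, so this is an illustration of the idea rather than the actual code):

```python
# Drops <script> and <style> content and collects the remaining text.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0          # depth inside <script>/<style> elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = "<html><head><style>p{color:red}</style></head><body><p>Goal scored!</p><script>var x=1;</script></body></html>"
print(html_to_text(page))  # Goal scored!
```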
2. Webpage Filter module
2.1. Introduction
This is the main module, which decides the quality of the search results. Using previously collected data and a learning model, it decides whether a webpage belongs to the topic of the search engine (the topic we chose for demonstration is Football). The classification treats web pages as documents, using a Support Vector Machine as the classifier, with the words in the web pages as features.
2.2. Support Vector Machine Introduction
A support vector machine (SVM) is a concept in statistics and computer science for a set
of related supervised learning methods that analyze data and recognize patterns, used
for classification and regression analysis. An SVM model is a representation of the
examples as points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible. New examples are then mapped into
that same space and predicted to belong to a category based on which side of the gap they
fall on.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in
a high- or infinite-dimensional space, which can be used for classification, regression, or
other tasks. Intuitively, a good separation is achieved by the hyperplane that has the
largest distance to the nearest training data point of any class (so-called functional
margin), since in general the larger the margin the lower the generalization error of the
classifier.[6]
Figure 3: Data classified with a hyperplane by SVM

2.3. The LIBLINEAR Library Introduction
LIBLINEAR is a library for classification, built by the Machine Learning Group at
National Taiwan University.
Being a linear classifier for data with millions of instances and features, LIBLINEAR supports:
• L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and logistic regression
• L1-regularized classifiers: L2-loss linear SVM and logistic regression
Main features of LIBLINEAR:
• Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer
• Cross validation for model selection
• Probability estimates (logistic regression only)
• Weights for unbalanced data
• MATLAB/Octave, Java, Python, Ruby interfaces[7]
The SVM built into LIBLINEAR works as follows:
• The training phase: the SVM reads the training data as a list of samples that are already classified. Each sample contains features, which have indexes and values. The SVM calculates a model from the samples; the model is then used in classification.
o In our search engine, the samples are web pages already classified into specific topics. The features are words, and the values are the number of occurrences of each word in the web page.
o After the training phase, the SVM has created a model, which records the features and the weight coordinates calculated from the training data.
• The classifying phase:
o The SVM reads the samples to be classified in the same format as the training data. Based on the weights in the model and the features each sample contains, the SVM decides the class of that sample.
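The classifying phase can be illustrated with a small sketch: a linear SVM model is one weight per feature plus a bias, and the predicted class is the sign of the weighted sum. The weights and features below are made-up values, not the real trained model:

```python
def classify(weights, bias, sample):
    """sample maps feature index -> value (word count); returns +1 or -1
    depending on which side of the hyperplane the sample falls."""
    score = bias + sum(weights.get(i, 0.0) * v for i, v in sample.items())
    return 1 if score >= 0 else -1

# Hypothetical model: feature 1 = "goal", feature 2 = "stock".
weights = {1: 0.8, 2: -0.9}
bias = -0.1
football_page = {1: 5}        # "goal" appears 5 times
finance_page = {2: 4}         # "stock" appears 4 times
print(classify(weights, bias, football_page))   # 1  (football)
print(classify(weights, bias, finance_page))    # -1 (non-football)
```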
2.4. The Webpage filter module
Corresponding to the two phases of LIBLINEAR's functionality, building the filter module also takes two steps.
a. Building the knowledge base
To build the model, we need to collect data in the form of web pages that are already classified by topic (such as sport, economy, weather, etc.). For this purpose, we use news websites that classify their pages into clear topics, such as bbc.com and wikipedia.com. Football webpages are collected primarily from goal.com, one of the most reliable football news websites.
As for the quantity of web pages, we decided to collect 1000 web pages for each category (Football and Non-Football): the more web pages we collect, the more words we have for the word store used as the features of the SVM model.
The content of each website is also filtered: we keep only the body content after removing headers, footers, and other irrelevant content. Football-oriented content is prioritized.
However, the data we collected is raw. It had to be converted into a format that LIBLINEAR understands: each sample is expressed as a vector of features, along with the class that sample belongs to.
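The conversion can be sketched as follows; LIBLINEAR reads one sample per line in the form `label index:value ...` with feature indices in ascending order. The small vocabulary here is a hypothetical word-to-index map:

```python
def to_liblinear_line(label, word_counts, vocab):
    """Encode one sample as a LIBLINEAR input line: 'label idx:val idx:val ...'
    with feature indices in ascending order."""
    pairs = sorted((vocab[w], c) for w, c in word_counts.items() if w in vocab)
    return " ".join([str(label)] + ["%d:%d" % p for p in pairs])

vocab = {"goal": 1, "match": 2, "stock": 3}   # hypothetical word -> index map
print(to_liblinear_line(1, {"goal": 3, "match": 1}, vocab))   # 1 1:3 2:1
```

Words not in the vocabulary are simply dropped, since the model has no weight for them.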
b. Parameter Tuning for LIBLINEAR
An SVM model in LIBLINEAR has three parameters:
• Solver type: chooses the functionality of the SVM. Here we chose L2-regularized L2-loss support vector classification (dual).
• eps: the stopping criterion. This parameter defaults to 0.01; however, we tried other values and got better results at 0.25.
• C: the cost of constraint violation. C is usually in the range of 1 to 1000, and its value greatly affects the quality of the classification model. To find a good value of C, we did the following:
o The raw data was split into two parts: 80% of the samples for training and 20% for testing the model.
o The 80% of samples and a value of C (iterated from 1 to 1000 in steps of 0.25) were used to build a model to classify the testing samples. After trying the candidate values of C, the best one was chosen at 10.25. The differences in classification between C values are really small, depending on the quantity of samples collected as well as on the content of the samples.
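The split-and-search procedure can be sketched as follows; `evaluate` stands for a hypothetical function that trains a LIBLINEAR model on the training part with the given C and returns its accuracy on the test part:

```python
import random

def split_data(samples, train_frac=0.8, seed=0):
    """Shuffle and split the samples into training and test parts (80/20)."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

def best_c(train, test, evaluate, c_values):
    """Try each C and keep the one whose model classifies the test part best."""
    return max(c_values, key=lambda c: evaluate(train, test, c))

# Candidate values of C from 1 to 1000 in steps of 0.25, as in the report.
c_values = [1 + 0.25 * i for i in range(int((1000 - 1) / 0.25) + 1)]
```

With a stub in place of `evaluate`, e.g. `best_c(train, test, lambda tr, te, c: -abs(c - 10.25), c_values)`, the mechanics can be checked without training any model.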
c. Model Testing and Integration
About 300 more web pages of each category (Football and Non-Football) were collected to test the model. In the test, 97% of the web pages were correctly classified.
The model is then integrated into the search engine. The data of the web pages the search engine collects is analyzed and transformed into vectors of features (words), and the model is used to classify the web pages. Each web page is scored with the probability that it belongs to the football category, and a threshold decides whether a website is about football or not. The most appropriate value for the threshold after some testing is 70%.
However, this score is only one criterion for ranking the websites within the football category. A combination of this criterion and the page rank from the Bing/Yahoo search engine was used to get the best results in our search engine.
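The filtering and ranking step can be sketched as follows. The 70% threshold comes from our testing, but the exact way the classifier probability is combined with the engine rank (here, a small rank penalty subtracted from the probability) is an illustrative choice, not the report's exact formula:

```python
THRESHOLD = 0.70   # probability above which a page counts as football

def filter_and_rank(pages):
    """pages: list of (url, engine_rank, football_prob) tuples."""
    kept = [p for p in pages if p[2] >= THRESHOLD]
    # Order by probability with a small penalty for a worse engine rank.
    return sorted(kept, key=lambda p: -(p[2] - 0.01 * p[1]))

results = [
    ("goal.com/news", 0, 0.95),
    ("finance.example.com", 1, 0.10),   # below threshold: dropped
    ("bbc.com/sport", 2, 0.85),
]
print([url for url, rank, prob in filter_and_rank(results)])
```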
There are two approaches to analyzing the category of a web page. The expensive way is to go directly to the website and seize its full content. Alternatively, information about the topic of the web page can be grabbed from the text snippet in the search result page of the Bing/Yahoo search engine, which takes less time and effort to vectorize. We applied the two approaches at the same time to compare their results.
3. Keyword Suggestion Model
3.1. Introduction
When users want to find something, they may not clearly remember its name, or they may remember only part of it. A search engine is more useful if it can correct such mistakes and fill in the missing parts. That is exactly the function of the keyword suggestion model. It is also useful when users want to find things related to the keyword.
3.2. Operation
To find proper suggestions, the keyword suggestion model asks for the available suggestions in other search engines and then runs them through filters to find the most meaningful ones. Specifically, the following work is done:
• Similar to the metasearch module, it sends requests to search engines and extracts suggestions from the returned HTML pages.
• After getting a list of suggested phrases, it re-ranks them using the Information Gain algorithm.
• The phrases with the highest scores are sent to the Search Interface.
Relying on the other search engines' suggestions saves a considerable amount of work. However, the number of suggestions obtained is generally small, and about half of them are useless. To improve the suggestions' quantity and quality, additional techniques were used.
3.3. Algorithms
3.3.1. Suggestion collecting technique
First, keywords are sent directly to the search engines and their suggestions are collected. If there are not enough of them, the keywords are expanded before being sent. For example, in our football search:
• Normally, the word "football" is appended to the keyword.
• If the keyword has the form of a name, it is extended with "football player" or "football club".
• If the keyword contains several names, it may describe a match, so the word "match" or "vs" is used.
A keyword may be extended in multiple ways. The suggestions are merged and duplicates are removed before going to the next step.
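The expansion rules can be sketched as follows; the name detection here is a crude capitalization heuristic, used only for illustration:

```python
def expand_keyword(keyword):
    """Apply the expansion rules; returns the sorted set of expanded queries."""
    words = keyword.split()
    looks_like_name = all(w[:1].isupper() for w in words)
    variants = {keyword + " football"}              # the default expansion
    if looks_like_name:
        variants.add(keyword + " football player")
        variants.add(keyword + " football club")
        if len(words) >= 2:                         # several names: maybe a match
            variants.add(keyword + " vs")
    return sorted(variants)

print(expand_keyword("Rooney"))
```

Each variant is sent to the search engines separately, and their suggestion lists are merged afterwards.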
3.3.2. Document frequency thresholding
"Document frequency is the number of documents in which a term occurs. We computed the document frequency for each unique term in the training corpus and removed from the feature space those terms whose document frequency was less than some predetermined threshold. The basic assumption is that rare terms are either non-informative for category prediction, or not influential in global performance. In either case, removal of rare terms reduces the dimensionality of the feature space. Improvement in categorization accuracy is also possible if rare terms happen to be noise terms." [11] In other words, suggestions with probabilities lower than a predefined threshold are removed. In addition, phrases that are likely not in the topic of interest are also removed: in the football search, if the probability that a phrase belongs to the football domain is smaller than the probability that it does not, it is removed.
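Document frequency thresholding over a training corpus can be sketched as follows:

```python
from collections import Counter

def df_filter(documents, min_df=2):
    """documents: list of token lists; returns terms whose document
    frequency (number of documents containing them) reaches min_df."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))          # count each term at most once per document
    return {t for t, n in df.items() if n >= min_df}

docs = [["goal", "match"], ["goal", "referee"], ["goal", "match"]]
print(sorted(df_filter(docs)))       # ['goal', 'match']
```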
Figure 4: Keyword suggestion workflow
Denote Pr(t|c_i), the probability of term t given category c_i:

Pr(t|c_i) = count(t, c_i) / count(c_i)
3.3.3. Information gain
Information gain is frequently employed as a term goodness criterion in the field of machine learning. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. Let {c_i} denote the set of categories in the target space. The information gain of term t is defined to be:

G(t) = - sum_{i=1..m} Pr(c_i) log Pr(c_i)
       + Pr(t) sum_{i=1..m} Pr(c_i|t) log Pr(c_i|t)
       + Pr(¬t) sum_{i=1..m} Pr(c_i|¬t) log Pr(c_i|¬t)
[11]
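The information gain of a term can be computed directly from document counts, as in the following sketch; the corpus of 2000 documents split evenly between two categories, with the term in 300 football and 200 non-football documents, is a hypothetical example:

```python
from math import log2

def info_gain(n_docs, cat_counts, t_counts):
    """Information gain G(t) of a term from document counts.
    cat_counts[i]: documents in category i; t_counts[i]: those containing t."""
    n_t = sum(t_counts)
    # -sum_i Pr(c_i) log Pr(c_i): entropy of the category distribution
    g = -sum(c / n_docs * log2(c / n_docs) for c in cat_counts)
    absent = [c - t for c, t in zip(cat_counts, t_counts)]
    for counts, total in ((t_counts, n_t), (absent, n_docs - n_t)):
        if total == 0:
            continue
        # + Pr(t) sum_i Pr(c_i|t) log Pr(c_i|t), and the Pr(not t) term
        g += total / n_docs * sum(x / total * log2(x / total) for x in counts if x)
    return g

# 2000 documents, 1000 per category; the term appears in 300 football and
# 200 non-football documents (hypothetical counts).
print(round(info_gain(2000, [1000, 1000], [300, 200]), 4))
```

A term that appears only in one category carries a full bit of information here, while a term spread evenly over both categories carries almost none.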

(The workflow in Figure 4: for each suggestion t, Pr(t|c1) is calculated and compared against a threshold and against Pr(t|c2); a suggestion that falls below the threshold, or for which Pr(t|c1) < Pr(t|c2), is removed; otherwise it is accepted.)
In the formula:
Pr(c_i) denotes the probability of each category. For example, if we have two categories, c1: "football" and c2: "nonFootball", each containing 1000 documents, then:

Pr(c1) = Pr(c2) = 1000 / (1000 + 1000) = 1/2

Pr(t) is the probability of a document containing the term t. For example, if among 2000 documents the word "goal" is found in 500 of them, then:

Pr(goal) = 500 / 2000 = 0.25

Pr(c_i|t) is the probability of each category given that the term t appears. For example, if among the 500 pages that contain the word "goal", 300 belong to the "football" category, then:

Pr(c1|goal) = 300 / 500 = 0.6

Pr(¬t) is the probability of a document not containing the term t:

Pr(¬t) = 1 - Pr(t)

Pr(c_i|¬t) stands for the probability of each category given that the term t is not found. The formula is similar to that for Pr(c_i|t).
4. Search Interface
The interface is important as the most visible part of a search engine, helping the user get the results easily. Common web building tools such as JavaServer Pages (JSP), Cascading Style Sheets (CSS), JavaScript, and HTML are used to develop the interface. The basic components of the search webpage are a search form and a button to submit the search keywords.
Figure 5: Search Home Page
Whenever a keyword is submitted to the search engine, the results are displayed below the search form; each result shows, from top to bottom, the header, the link, and the information about that link (which was used to analyze and categorize the web page). Ten results are shown per page.
Figure 6: Search Result for the simple keyword “football”
Depending on how many results the search engine collected, they are displayed across a number of pages.
Figure 7: Search Results with page change
V. EXPERIMENTAL RESULT
1. Experiments in search engine application
At first, the search engine was intended to combine the results from the Bing and Yahoo search engines, in order to get a variety of web pages for a search term, as well as multiple perspectives due to the differences in the two engines' web-crawling algorithms. However, for most keywords, the differences between the results of the two search engines are not noticeable, so the metasearch engine works with the Bing search engine only.
After the web page classification module was integrated into the search engine, some keywords with several meanings, such as Manchester City, Arsenal, Seagames, and Olympics, were used to test the search results.
The original search engine result: (First 10 results)
The results after refining:
Figure 8: Vertical Search Experiment
Compared with the results sent back by the original search engine, the vertical search engine removed the websites that are not related to football, while keeping the order of the football websites. With this upgrade, people who intend to find football pages can get the information they need more easily. Some advantages and disadvantages of the current search engine:
• Advantages:
o New websites are kept up to date with the original search engines.
o Unrelated websites are removed.
• Disadvantages:
o The classification works only on English pages (due to our lack of time and labour).
o The vertical search engine may take longer to search than the original one.
o Websites built with Flash cannot be analyzed for categorizing (no text can be found), so they are removed from the list, despite the fact that some Flash websites are heavily related to football.
As mentioned above, two types of classification module were used in the experiments. After getting the results from the original search engines, either the websites' full text is analyzed and classified, or only the small text snippet in the search result page is used. The first method takes much longer than the second one; however, the classification results may stay the same.
2. Experiments in keyword suggestion

Keywords in three different categories (football, nonFootball, and ambiguous) were tested. Each keyword was sent to three different search engines: Yahoo, Bing, and our vertical search engine. The suggestions from the three search engines were then collected manually and compared visually. Within one category, the results are similar between keywords; however, they differ greatly from category to category.
For words that clearly belong to football, our suggestion model generally shows only a small improvement, due to the excellent performance of the other two. In some cases, our results are even worse. The reasons have been carefully investigated and are discussed in the assessment part.
Example Word: “Manchester”

Figure 9: Keyword Suggestion experiment 1
The experiment continues with ambiguous words. Our results are almost all football phrases, while the majority of suggestions from the other two engines are not.
Example Word: “England”
Figure 10: Keyword Suggestion experiment 2
Advantages:
• Good suggestions are highly ranked.
• New vertical suggestions are formed.
Disadvantages:
• No instant suggestions, due to speed and the dependence on other engines.
• Performance depends heavily on the results of the other search engines.
• The ranking algorithm is not good enough: currently, the score of a suggestion containing several words is the average of its parts' scores, which gives higher priority to shorter suggestions.
VI. FUTURE WORK
• The webpage filter module, as well as the knowledge base, should be expanded to multiple languages, as English alone is not sufficient.
• The keyword suggestion module should be integrated into the search engine once its running time is optimized.
• The project can be extended into a framework for quickly setting up a vertical search engine: data for a specific domain, in the appropriate format, can be plugged into the system to produce a search engine for that domain.
VII. CONCLUSION
Search engines are important tools for Internet users, given the structure and capacity of the Internet. General search engines are very popular, but there are also vertical search engines, which have some advantages over general ones. Our research's purpose was to build this kind of search engine: one that finds pages in one specific domain.
Currently, we have successfully built the separate modules but got stuck at the integration step. In addition, the performance of the search engine is not very good: low speed, poor interface. However, when applied to a related domain, the search results are clearly improved.
Although the result of our first experiment was not as good as expected, it confirms that vertical search engines have advantages that general search engines cannot obtain. Vertical search will achieve considerable results if it is investigated properly.