Architectural Issues of Web−Enabled Electronic Business (Part 5)

Chapter 9: Intelligent Web Search Through Adaptive
Learning From Relevance Feedback
Zhixiang Chen
University of Texas−Pan American
Binhai Zhu
Montana State University
Xiannong Meng
Bucknell University
Copyright © 2003, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.
Abstract
In this chapter, machine−learning approaches to real−time intelligent Web search are discussed. The goal is to
build an intelligent Web search system that can find the user's desired information with as little relevance
feedback from the user as possible. The system can achieve a significant search precision increase with a
small number of iterations of user relevance feedback. A new machine−learning algorithm is designed as the
core of the intelligent search component. This algorithm is applied to three different search engines with
different emphases. This chapter presents the algorithm, the architectures, and the performances of these
search engines. Future research issues regarding real−time intelligent Web search are also discussed.
Introduction
This chapter presents the authors' approaches to intelligent Web search systems that are built on top of existing
search engine design and implementation techniques. An intelligent search engine would use the search
results of the general−purpose search engines as its starting search space, from which it would adaptively
learn from the user's feedback to boost and enhance the search performance and accuracy. It may use feature
extraction, document clustering and filtering, and other methods to help the adaptive learning process. The
goal is to design practical and efficient algorithms by exploiting the nature of Web search. With these new
algorithms, three intelligent Web search engines, WEBSAIL, YARROW and FEATURES, were built; they are
able to achieve a significant increase in search precision with just four to five iterations of real−time learning
from a user's relevance feedback. The characteristics of those three intelligent search engines are reported in this
chapter.
Background
Recently, three general approaches have been taken to increase Web search accuracy and performance. One is


the development of meta−search engines that forward user queries to multiple search engines at the same
time in order to increase the coverage and hope to include what the user wants in a short list of top−ranked
results. Examples of such meta−search engines include MetaCrawler (MC), Inference Find (IF), and Dogpile
(DP). Another approach is the development of topic−specific search engines that are specialized in particular
topics. These topics range from vacation guides (VG) to kids' health (KH). The third approach is to use some
group or personal profiles to personalize the Web search. Examples of such efforts include GroupLens
(Konstan et al., 1997), PHOAKS (Terveen, Hill, Amento, McDonald, & Creter, 1997), among others. The first
generation meta−search engines address the problem of decreasing coverage by simultaneously querying
multiple general−purpose engines. These meta−search engines suffer, to a certain extent, from the inherited problem
of information overload: it is difficult for users to pin down the specific information for which they are
searching. Specialized search engines typically contain much more accurate and narrowly focused
information. However, it is not easy for a novice user to know where and which specialized engine to use.
Most personalized Web search projects reported so far involve collecting users' behavior at a centralized
server or a proxy server. While this is effective for e−commerce, where vendors can collectively
learn consumer behaviors, the approach does raise a privacy concern. Users of the search engines would
have to submit their search habits to some type of servers, though most likely the information collected is
anonymous.
The clustering, user profiling, and other advanced techniques used by these search engines and other projects
(Bollacker, Lawrence, & Giles, 1998, 1999) are static in the sense that they are built before the search begins.
They cannot be changed dynamically during the real−time search process. Thus, they do not reflect the
changing interests of the user at different times, in different locations, or on different subjects. The static
nature of the existing search engines makes it very difficult, if not impossible, to support dynamic changes in
the user's search interests. The augmented features of personalization (or customization) certainly help a
search engine increase its search performance; however, their ability is very limited. An intelligent search
engine should be built on top of existing search engine design and implementation techniques. It should use
the search results of the general−purpose search engines as its starting search space, from which it would
adaptively learn in real−time from the user's relevance feedback to boost and enhance the search
performance and the relevance accuracy. With the ability to perform real−time adaptive learning from
relevance feedback, the search engine is able to learn the user's search interest changes or shifts, and thus

provides the user with improved search results.
Relevance feedback is the most popular query reformulation method in information retrieval (Baeza−Yates &
Ribeiro−Neto, 1999; Salton et al., 1975). It is essentially an adaptive learning process from the document examples
judged by the user as relevant or irrelevant. It requires a sequence of iterations of relevance feedback to search
for the desired documents. As shown in Salton et al. (1975), a single iteration of similarity−based relevance
feedback usually produces improvements of 40 to 60 percent in search precision, evaluated at certain
fixed levels of recall and averaged over a number of user queries. Some people might think that Web
search users are not willing to try iterations of relevance feedback to search for their desired documents.
However, the authors think otherwise. The question is not whether Web search users are willing to
try iterations of relevance feedback to perform their search. Rather, it is whether an adaptive
learning system can be built that supports high search precision increase with just a few iterations of relevance
feedback. The Web search users may have no patience to try more than a dozen iterations of relevance
feedback. But, if a system has a 20% or so search precision increase with just about four to five iterations of
relevance feedback, are the users willing to use such a system? The authors believe that the answer is yes.
Intelligent Web search systems that dynamically learn the users' information needs in real−time must be built
to advance the state of the art in Web search. Machine−learning techniques can be used to improve Web
search, because machine−learning algorithms are able to adjust the search process dynamically so as to satisfy
the users' information needs. Unfortunately, the existing machine−learning algorithms (e.g., Angluin, 1987;
Littlestone, 1988), including the most popular similarity−based relevance feedback algorithm (Rocchio,
1971), suffer from the large number of iterations required to achieve the search goal. Average users are not
willing to go through too many iterations of learning to find what they want.
Web Search and Adaptive Learning
Overview
There have been great research efforts on applications of machine−learning to automatic extraction, clustering
and classification of information from the Web. Some earlier research includes WebWatcher (Armstrong,
Freitag, Joachims, & Mitchell, 1995), which interactively helps users locate desired information by employing
learned knowledge about which hyperlinks are likely to lead to the target information; Syskill and Webert
(Pazzani, Muramatsu, & Billus, 1996), a system that uses a Bayesian classifier to learn about interesting Web
pages for the user; and NewsWeeder (Lang, 1995), a news−filtering system that allows the users to rate each
news article being read and learns a user profile based on those ratings. Some research is aimed at providing
adaptive Web services through learning. For example, Ahoy! The Homepage Finder (Shakes, Langheinrich,
& Etzioni, 1997) performs dynamic reference sifting; Adaptive Web Sites (Etzioni & Weld, 1995;
Perkowitz & Etzioni, 2000) automatically improve their organization and presentation based on user access
data; and the Adaptive Web Page Recommendation Service (Balabanovi, 1997) recommends potentially
interesting Web pages to the users. Since so much work has been done on intelligent Web search and on
learning from the Web by many researchers, a comprehensive review is beyond the scope and the limited
space of this chapter. Interested readers may find good surveys of the previous research on learning the Web
in Kobayashi and Takeda (2000).
Dynamic Features and Dynamic Vector Space
In spite of the World Wide Web's size and the high dimensionality of Web document index features, the
traditional vector space model in information retrieval (Baeza−Yates & Ribeiro−Neto, 1999; Salton, 1989;
Salton et al., 1975) has been used for Web document representation and search. However, to implement
real−time adaptive learning with limited computing resources, the traditional vector space model cannot be
applied directly. Recall that back in 1998, the AltaVista (AV) system was running on 20 multi−processor
machines, together having more than 130 gigabytes of RAM and over 500 gigabytes of disk space
(Baeza−Yates & Ribeiro−Neto, 1999). A new model is needed that is efficient enough in both time and space
for Web search implementations with limited computing resources. The new model may also be used to
enhance the computing performance of a Web search system even if enough computing resources are
available.
Let us now examine indexing in Web search. In the discussion, keywords are used as document index
features. Let X denote the set of all index keywords for the whole Web (or, practically, a portion of the whole
Web). Given any Web document d, let I(d) denote the set of all index keywords in X that are used to index d
with non−zero values. Then, the following two properties hold:
• The size of I(d) is substantially smaller than the size of X. Practically, I(d) can be bounded by a
constant. The rationale behind this is that in the simplest case only a few of the keywords in d are
needed to index it.
• For any search process related to the search query q, let D(q) denote the collection of all the
documents that match q. Then the set of index keywords relevant to q, denoted by F(q), is

$$F(q) = \bigcup_{d \in D(q)} I(d).$$

Although the size of F(q) varies from query to query, it is still substantially smaller than the size of
X, and might be bounded by a few hundred or a few thousand in practice.

Definition 1. Given any search query q, F(q), which is given in the above paragraph, is defined as the set of
dynamic features relevant to the search query q.
Definition 2. Given any search query q, the dynamic vector space V(q) relevant to q is defined as the vector
space that is constructed with all the documents in D(q) such that each of those documents is indexed by the
dynamic features in F(q).
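To make these definitions concrete, the following Python sketch builds D(q), F(q) and V(q) from a toy inverted index; the index layout and the sample data are illustrative assumptions, not the authors' implementation.

# Hypothetical inverted index: keyword -> ids of the documents containing it.
index = {"web": {1, 2}, "search": {1, 3}, "learning": {2, 3}}
# I(d) for each document d: keyword -> non-zero index weight.
doc_features = {1: {"web": 0.7, "search": 0.3},
                2: {"web": 0.4, "learning": 0.6},
                3: {"search": 0.5, "learning": 0.5}}

def dynamic_space(query_terms):
    # D(q): all documents that match at least one query term.
    D = set().union(*(index.get(t, set()) for t in query_terms))
    # F(q): the union of I(d) over all matched documents d.
    F = sorted({k for d in D for k in doc_features[d]})
    # V(q): each matched document as a vector over the features in F(q).
    V = {d: [doc_features[d].get(k, 0.0) for k in F] for d in D}
    return D, F, V

D, F, V = dynamic_space(["web", "search"])   # D = {1, 2, 3}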
The General Setting of Learning
Let S be a Web search system. For any query q, S first finds the set of documents D(q) that match the query q. It
finds D(q) with the help of a general−purpose search strategy through searching its internal database, or
through an external search engine such as AltaVista (AV) when no matches are found within its internal
database. It then finds the set of dynamic features F(q), and later constructs the dynamic vector space V(q).
Once D(q), F(q) and V(q) have been found, S starts its adaptive learning process with the help of the learning
algorithm that is to be presented in the following subsections. More precisely, let $F(q) = \{K_1, \ldots, K_n\}$, where
each $K_i$ denotes a dynamic feature (i.e., an index keyword). S maintains a common weight vector
$w = (w_1, \ldots, w_n)$ for the dynamic features in F(q). The components of w have non−negative real values. The
learning algorithm uses w to extract and learn the most relevant features and to classify documents in D(q) as
relevant or irrelevant.
Algorithm TW2
As the authors have investigated (Chen, Meng, & Fowler, 1999; Chen & Meng, 2000; Chen, Meng, Fowler, &
Zhu, 2000), intelligent Web search can be modeled as an adaptive learning process,
where the search engine acts as the learner and the user as the teacher. The user sends a query to the engine, and
the engine uses the query to search the index database and returns a list of URLs that are ranked according to
a ranking function. Then the user provides the engine with relevance feedback, and the engine uses the feedback to
improve its next search and returns a refined list of URLs. The learning (or search) process ends when the
engine finds the desired documents for the user. Conceptually, a query entered by the user can be understood
as the logical expression of the collection of the documents wanted by the user. A list of URLs returned by the

engine can be interpreted as an approximation of the collection of the desired documents.
Let us now consider how to use adaptive learning from equivalence queries to approach the problem of Web
search. The vector space model (Baeza−Yates & Ribeiro−Neto, 1999; Salton, 1989; Salton et al., 1975) is
used to represent documents. The vector space may consist of Boolean vectors. It may also consist of
discretized vectors, for example, the frequency vector of the index keywords. A target concept is a collection
of documents, which is equivalent to the set of vectors of the documents in the collection. The learner is the
search engine and the teacher is the user. The goal of the search engine is to find the target concept in
real−time with a minimal number of mistakes (or equivalence queries).
The authors designed the algorithm TW2, a tailored version of Winnow2 (Littlestone, 1988), which is described
in the following. As described in the general setting of learning, for each query q entered by the user,
algorithm TW2 uses a common weight vector w and a real−valued threshold $\theta$ to classify documents in D(q).
Initially, all weights in w have a value of 0. Let $\alpha > 1$ be the promotion and demotion factor. Algorithm TW2
classifies a document with vector $x = (x_1, \ldots, x_n)$ as relevant if $\sum_{i=1}^{n} w_i x_i \ge \theta$, and as irrelevant otherwise. If the
user provides a document that contradicts the classification of TW2, then TW2 is said to have made a mistake.
When the user responds with a document that may or may not contradict the current classification, TW2
updates the weights through promotion or demotion. It should be noted that, in contrast to algorithm
Winnow2, which sets all initial weights in w to 1, algorithm TW2 sets all initial weights in w to 0 and has a different
promotion strategy accordingly. Another substantial difference between TW2 and Winnow2 is that TW2
accepts document examples that may not contradict its current classification to promote or demote its weight
vector, while Winnow2 only accepts examples that contradict its current classification to perform promotion
or demotion. The rationale behind setting all the initial weights to 0 by algorithm TW2 is to focus attention on
the propagation of the influence of the relevant documents, and to use irrelevant documents to adjust the
focused search space. Moreover, this approach is computationally feasible because existing effective
document−ranking mechanisms can be coupled with the learning process.
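A minimal sketch of the TW2 update rules in Python follows. The handling of zero−valued weights under promotion is an assumption made for the sketch (a zero weight is raised to 1 before multiplying, since multiplying 0 by the factor has no effect); the classification rule and the promotion/demotion structure follow the description above.

ALPHA = 2.0   # promotion/demotion factor alpha > 1 (value assumed)
THETA = 1.0   # classification threshold theta (value assumed)

def relevant(w, x):
    # Classify as relevant iff the weighted sum reaches the threshold.
    return sum(wi * xi for wi, xi in zip(w, x)) >= THETA

def promote(w, x):
    # The user judged the document relevant: raise the weights of its
    # non-zero features; zero weights are first set to 1 (assumed rule).
    return [(wi if wi > 0 else 1.0) * ALPHA if xi > 0 else wi
            for wi, xi in zip(w, x)]

def demote(w, x):
    # The user judged the document irrelevant: shrink the weights of its
    # non-zero features.
    return [wi / ALPHA if xi > 0 else wi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]          # all initial weights are 0, as in TW2
w = promote(w, [1, 0, 1])    # a relevant example touching features 1 and 3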
In contrast to the linear lower bounds proved for Rocchio's similarity−based relevance feedback algorithm
(Chen & Zhu, 2002), algorithm TW2 has surprisingly small mistake bounds for learning any collection of
documents represented by a disjunction of a small number of relevant features. The mistake bounds are
independent of the dimensionality of the index features. For example, one can show that to learn a collection
of documents represented by a disjunction of at most k relevant features (or index keywords) over the
n−dimensional Boolean vector space, TW2 makes a number of mistakes bounded in terms of k and A alone,
where A is the number of dynamic features that occurred in the learning process. The actual implementation
of algorithm TW2 requires the help of document ranking and equivalence query simulation, which are
addressed later.
Feature Learning Algorithm FEX (Feature EXtraction)
Given any user query q, for any dynamic feature $K_i \in F(q)$ with $1 \le i \le n$, define the rank of $K_i$ as
$h(K_i) = h_0(K_i) + w_i$, where $h_0(K_i)$ is the initial rank for $K_i$. Recall that $K_i$ is some index keyword. With the
feature ranking function h and the common weight vector w, FEX extracts the most relevant features by
promoting and demoting the weights of the features the user judges relevant or irrelevant and selecting those
with the highest ranks.
Document Ranking
Let g be a ranking function independent of TW2 and FEX. Define the ranking function f for documents in D(q)
for any user query q as follows. For any Web document $d \in D(q)$ with vector $d = (x_1, \ldots, x_n) \in V(q)$, define

$$f(d) = \beta(d) + \gamma(d)\left(g(d) + \sum_{i=1}^{n} w_i x_i\right).$$
Here, g remains constant for each document d during the learning process of the learning algorithm. Various
strategies can be used to define g, for example, PageRank (Brin & Page, 1998), classical tf−idf scheme, vector
spread, or citation−based rankings (Yuwono & Lee, 1996). The two additional tuning parameters, $\beta(d)$ and
$\gamma(d)$, are used to perform individual promotions or demotions of the documents that have been judged by
the user. Initially, $\beta(d) = 0$ and $\gamma(d) = 1$; both can be updated in a similar fashion as the weight value $w_i$
is updated by algorithm TW2.
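Under the form of f shown above, one round of judging and re−ranking might be sketched as follows; the exact update amounts for $\beta(d)$ and $\gamma(d)$ are assumptions chosen only to mirror the additive/multiplicative style of TW2.

def rank_score(doc, w):
    # f(d) = beta(d) + gamma(d) * (g(d) + sum_i w_i * x_i)
    weighted = sum(wi * xi for wi, xi in zip(w, doc["x"]))
    return doc["beta"] + doc["gamma"] * (doc["g"] + weighted)

def judge(doc, is_relevant, alpha=2.0):
    # Individual promotion or demotion of a user-judged document.
    if is_relevant:
        doc["beta"] += 1.0        # additive boost (amount assumed)
        doc["gamma"] *= alpha     # multiplicative boost
    else:
        doc["beta"] -= 1.0
        doc["gamma"] /= alpha

# Re-rank after feedback: highest f(d) first.
# ranked = sorted(docs, key=lambda d: rank_score(d, w), reverse=True)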
Equivalence Query Simulation
The system uses the ranking function f defined above to rank the documents in D(q) for each user
query q; for each iteration of learning, it returns the top 10 ranked documents to the user. These top 10
ranked documents represent an approximation to the classification made by the learning algorithm that has
been used by the system. The quantity 10 can be replaced by, say, 25 or 50. But it should not be too large for
two reasons: (1) the user may only be interested in a very small number of top ranked documents, and (2) the
display space for visualization is limited. The user can examine the short list of documents and can end the
search process, or, if some documents are judged as misclassified, document relevance feedback can be
provided. Sometimes, in addition to the top 10 ranked documents, the system may also provide the user with a
short list of other documents below the top 10. Documents in the second short list may be selected randomly,
or the bottom 10 ranked documents can be included. The motivation for the second list is to give the user

a better view of the classification made by the learning algorithm.
The Websail System and the Yarrow System
The WEBSAIL System is a real−time adaptive Web search learner designed and implemented to show that
the learning algorithm TW2 not only works in theory but also in practice. A detailed report of the
system can be found in Chen, Meng, Zhu, and Fowler (2000). WEBSAIL employs TW2 as its learning component and is able to
help the user search for the desired documents with as little relevance feedback as possible. WEBSAIL has a
graphic user interface to allow the user to enter his/her query and to specify the number of the top matched
document URLs to be returned. WEBSAIL maintains an internal index database of about 800,000 documents.
Each of those documents is indexed with about 300 keywords. It also has a meta−search component to query
AltaVista whenever needed. When the user enters a query and starts a search process, WEBSAIL first
searches its internal index database. If no relevant documents can be found within its database then it receives
a list of top matched documents externally with the help of its meta−search component. WEBSAIL displays
the search result to the user in a format as shown in Figure 1.
Figure 1: The display format of WEBSAIL
Also as shown in Figure 1, WEBSAIL provides at each iteration the top 10 and the bottom 10 ranked
document URLs. Each document URL is preceded with two radio buttons for the user to judge whether the
document is relevant to the search query or not. The document URLs are clickable for viewing the actual
document contents so that the user can judge more accurately whether a document is relevant or not. After the
user clicks a few radio buttons, he/she can click the feedback button to submit the feedback to TW2.
WEBSAIL has a function to parse out the feedback provided by the user when the feedback button is clicked.
Having received the feedback from the user, TW2 updates its common weight vector w and also performs
individual document promotions or demotions. At the end of the current iteration of learning, WEBSAIL
re−ranks the documents and displays the top 10 and the bottom 10 document URLs to the user.
At each iteration, the dispatcher of WEBSAIL parses query or relevance feedback information from the
interface and decides which of the following components should be invoked to continue the search process:
TW2, or Index Database Searcher, or Meta−Searcher. When meta−search is needed, Meta−Searcher is called
to query AltaVista to receive a list of the top matched documents. The Meta−Searcher has a parser and an
indexer that work in real−time to parse the received documents and to index each of them with at most 64

keywords. The received documents, once indexed, will also be cached in the index database.
The following relative Recall and relative Precision are used to measure the performance of WEBSAIL. For
any query q, the relative Recall and the relative Precision are

$$\text{relative Recall} = \frac{R_m}{R}, \qquad \text{relative Precision} = \frac{R_m}{m},$$

where R is the total number of relevant documents among the set of the retrieved documents, and $R_m$ is the
number of relevant documents ranked among the top m positions in the final search result of the search
engine. The authors have selected 100 queries to calculate the average relative Recall of WEBSAIL. Each
query is represented by a collection of at most five keywords. For each query, WEBSAIL is tested with the
returning document number m as 50, 100, 150, 200, respectively. For each test, the number of iterations used
and the number of documents judged by the user were recorded. The relative Recall and Precision were
calculated based on manual examination of the relevance of the returned documents. The experiments reveal
that WEBSAIL achieves an average of 0.95 relative Recall and an average of 0.46 relative Precision with an
average of 3.72 iterations and an average of 13.46 documents judged as relevance feedback.
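As a worked illustration of the two measures (hypothetical numbers, not the experimental data above): if 40 of the retrieved documents are relevant (R = 40) and 19 of them rank among the top m = 50 positions (R_m = 19), then the relative Recall is 19/40 = 0.475 and the relative Precision is 19/50 = 0.38.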
The Yarrow system (Chen & Meng, 2000) is a multi−threaded program. Its architecture differs from that of
WEBSAIL in two aspects: (1) it replaces the meta−searcher of WEBSAIL with a generic Query Constructor
and a group of meta−searchers, and (2) it does not maintain its own internal index database. For each search
process, it creates a thread and destroys the thread when the search process ends. Because of its light−weight
size, it can be easily converted or ported to run in different environments or platforms. The predominant
feature of YARROW, compared with existing meta−search engines, is that it learns from the user's
feedback in real−time on the client side. The learning algorithm TW2 used in YARROW has a surprisingly
small mistake bound. YARROW may well be used as a plug−in component for Web browsers on the client side.
A detailed report of the Yarrow system is given in Chen and Meng (2000).
The Features System
The FEATURES system (Chen, Meng, Fowler, & Zhu, 2001) is also a multi−threaded system, and its
architecture is shown in Figure 2. The key difference between FEATURES and WEBSAIL is that
FEATURES employs the two learning algorithms, FEX and TW2, to update the common weight vector w
concurrently.
Figure 2: The architecture of FEATURES
For each query, FEATURES usually shows the top 10 ranked documents, plus the top 10 ranked features, to

the user for him/her to judge document relevance and feature relevance. The format of presenting the top 10
ranked documents together with the top 10 ranked features is shown in Figure 3. In this format, document
URLs and features are preceded by radio buttons for the user to indicate whether they are relevant or not.
Figure 3: The display format of FEATURES
If the current task is a learning process from the user's document and feature relevance feedback, Dispatcher
sends the feature relevance feedback information to the feature learner FEX and the document relevance
feedback information to the document learner TW2. FEX uses the relevant and irrelevant features as judged
by the user to promote and demote the related feature weights in the common weight vector w. TW2 uses the
relevant and irrelevant documents judged by the user as positive and negative examples to promote and
demote the weight vector. Once FEX and TW2 have finished promotions and demotions, the updated weight
vector w is sent to Query Searcher and to Feature Ranker. Feature Ranker re−ranks all the dynamic features,
which are then sent to Html Constructor. Query Searcher searches Index Database to find the matched
documents that are then sent to Document Ranker. Document Ranker re−ranks the matched documents and
then sends them to Html Constructor to select documents and features to be displayed. Empirical results
(Chen et al., 2001) show that FEATURES has substantially better search performance than AltaVista.
Timing Statistics
On December 13th and 14th of 2001, the authors conducted experiments to collect timing statistics for
using WEBSAIL, YARROW and FEATURES. Thirty (30) query words were used to test each of these
meta−search engines. Every time a query was sent, the wall−clock time needed for the meta−search engine to
list the sorted result was recorded in the program. Also recorded was the wall−clock time to refine the search
results based on the users feedback. Since YARROW supports multiple external search engines,
ALTAVISTA and NORTHERN LIGHT were selected as the external search engines when YARROW was
tested. The external search engine used by WEBSAIL and FEATURES is ALTAVISTA. The following tables
show the statistical results at the 95% confidence level. The initial response time is $t_{orig}$, the
refining time is $t_{refine}$, and C.I. denotes the confidence interval.
Table 1: Response time of WEBSAIL (in seconds)
Table 2: Response Time of YARROW (in seconds)
Table 3: Response time of FEATURES (in seconds)

The statistics from the tables indicate that while the standard deviations and the confidence intervals are
relatively high, they are within a range that users can accept. It takes WEBSAIL, YARROW and
FEATURES on the order of a few seconds to 20 seconds to respond initially, because they need to get the
information from external search engines over the network. Even so, the initial response time is not
long and hence is acceptable to the user.
The Commercial Applications
Intelligent Web search can find many commercial applications. This section concentrates on
applications to E−commerce. E−commerce can be viewed as having three major components: the service and
goods suppliers, the consumers, and the information intermediaries (or infomediaries). The service and goods
suppliers are the producers or the source of the e−commerce flow. The consumers are the destination of the
flow. Infomediaries, according to Grover and Teng (2001), are an essential part of E−commerce. An
enormous amount of information has to be produced, analyzed and managed in order for e−commerce to
succeed. In this context, Web search is a major player in the infomediaries. Other components of
infomediaries include communities of interest (e.g., online purchase), industry magnet sites (e.g.,
www.amazon.com), e−retailers, or even individual corporate sites (Grover & Teng, 2001). The
machine−learning approaches in Web search studied in this chapter are particularly important in the whole
context of E−commerce. The key feature of the machine−learning approach for Web search is interactive
learning and narrowing the search results to what the user wants. This feature can be used in many
e−commerce applications. The following are a few examples.
Building a partnership: As pointed out in Tewari et al. (2001), building a partnership between the buyers and
the seller is extremely important for the success of an e−Business. Tewari et al. used the Multi−Attribute
Resource Intermediaries (MARI) infrastructure to approximate buyer and seller preferences. They compare
the degree of matching between buyers and sellers by computing a distance between the two preference
vectors. When the interactive learning features explored in this chapter are used in this process, the buyers
and the sellers can negotiate the deal in real−time, thus greatly enhancing the capability of the system. A
collection of sellers may provide an initial list of items available at certain prices for buyers to choose from.
The buyers may also have a list of expectations. In the model proposed in Tewari et al. (2001), the possibility
of a match is computed statically. If a machine−learning approach is taken, the buyers and the sellers may
interactively find the best

deal, similar to the situation where a face−to−face negotiation is taking place.
Brokering between buyers and sellers: Brokerage between the producers and the consumers is a critical
E−commerce component. Given a large number of producers and a large number of consumers, how can one
efficiently find a match between what is offered on the market and what a buyer is looking for? The work
described in Meyyappan (2001) and Santos et al. (2001) provides a framework for e−commerce search
brokers. A broker here compares price information, product features, the reputation of the producer, and
other information for a potential buyer. While in the previous category the seller and the buyer may negotiate
interactively, here the buyer interacts with the broker(s) only, very similar to the real−world situation. The
interactive machine−learning and related Web search technology can be applied in this category as well. The
machine−learning algorithm uses the collection of potential sellers as a starting space and interactively
searches for the optimal seller for the user based on the information collected by the brokerage software. The
machine−learning
algorithm discussed in this chapter can be used for a buyer to interact with the broker to get the best that is
available on the market. For example, a broker may act as a meta−search engine that collects information
from a number of sellers, behaving very much like a general−purpose search engine. A buyer asks her broker
to get certain information; the broker, a meta−search engine equipped with TW2 or other learning
algorithms, may search, collect, collate and rank the information returned from seller sources for the buyer.
The buyer can interact with the broker, just as in the scenario of Web search. The broker will refine its list
until the buyer finds a satisfactory product and seller.
Interactive catalog: The service providers or the sellers can allow consumers to browse the catalog
interactively. While the customer browses, the learning algorithm can pick up his or her interests and supply
better information to the customer, much like what adaptive Web sites (Perkowitz & Etzioni, 2000) do for
their customers. Here the learning can take place in two forms. The seller can explicitly ask how the potential
buyers (browsers of
the catalog) feel about the usefulness of the catalog. This can be analogous to interactive learning using
algorithms such as TW2. Researchers have reported approaches of this type (though they did not use TW2
explicitly); see Herlocker and Konstan (2001) for an example and other similar projects. In the second
approach, the provider of the catalog (the seller) learns the user's interests and behaviors implicitly, as
reported in Claypool et al. (2001). A learning algorithm such as TW2 can be embedded in the catalog

software. The buyer's interests and intentions can be captured through modified browser software. The
learning algorithm can then revise the catalog listings by taking the buyer's Web page clicks as feedback. This
is very
similar to the Web search situation.
Web commerce infrastructure: Chaudhury, Mallick, and Rao (2001) describe using the Web in e−commerce
as various channels. The Web can be used as an advertising channel, an ordering channel, and a customer
support channel. All these channels should be supported by an interactive system where customer feedback
can be
quickly captured, analyzed and used in updating the e−commerce system.
Future Work
In the future, the authors plan to improve the interface of their systems. Right now, the systems display the
URLs of the documents. If the user wants to know the contents of the document, he/she needs to click the
URL to view the content. The authors plan to display the URL of a document together with a good preview of
its content. The authors also want to highlight those index keywords in the preview and allow them to be
clickable for feature extracting and learning.
The authors also plan to apply clustering techniques to increase the performance of their system. It is easy to
observe that in most cases documents that are relevant to a search query can be divided into a few different
clusters or groups. The authors believe that document clustering techniques such as graph spectral partitioning
can be used to reduce the number of the iterations of the learning process and to increase the performance of
the system.
Acknowledgment
The authors thank the two anonymous referees and the editor, Dr. Nansi Shi, for their valuable comments on
the draft of this chapter. The final presentation of the chapter has greatly benefited from their comments.
URL References
(AV) AltaVista: www.altavista.com
(IF) Inference Find: www.infind.com
(KH) Kidshealth.com: www.kidshealth.com
(VG) Vacations.Com: www.vacations.com
(DP) Dogpile: www.dogpile.com
(IS) Infoseek: www.infoseek.com

(MC) MetaCrawler: www.metacrawler.com
References
Angluin, D. (1987). Queries and concept learning. Machine Learning, 2, 319−342.
Armstrong, R., Freitag, D., Joachims, T., & Mitchell, T. (1995). Webwatcher: A learning apprentice for the
World Wide Web. In Working Notes of the AAAI Spring Symposium on Information Gathering from
Heterogeneous, Distributed Environments, 6−12. AAAI Press.
Balabanovi, M. (1997). An adaptive Web page recommendation service. In Proceedings of the First
International Conference on Autonomous Agents, 378−387. New York: ACM Press.
Baeza−Yates, R., & Ribeiro−Neto, B. (1999). Modern Information Retrieval. Reading, MA:
Addison−Wesley.
Bollacker, K., Lawrence, S., & Giles, C.L. (1998). Citeseer: An autonomous Web agent for automatic
retrieval and identification of interesting publications. In Proceedings of the Second International Conference
on Autonomous Agents, 116−123. New York: ACM Press.
Bollacker, K., Lawrence, S., & Giles, C. L. (1999). A system for automatic personalized tracking of scientific
literature on the Web. In Proceedings of the Fourth ACM Conference on Digital Libraries, 105−113. New
York: ACM Press.
Brin, S., & Page, L. (1998). The anatomy of a large−scale hypertextual Web search engine. In Proceedings of
the Seventh World Wide Web Conference.
Chaudhury, A., Mallick, D.N. & Rao, H.R. (2001). Web channels in E−commerce. Communications of the
ACM, 44(1), 99−103.
Chen, Z., & Meng, X. (2000). Yarrow: A real−time client site meta search learner. In Proceedings of the AAAI
2000 Workshop on Artificial Intelligence for Web Search (the full version will appear in Journal of Intelligent
Information Systems), pp. 12−17.
Chen, Z., Meng, X., & Fowler, R.H. (1999). Searching the Web with queries. Knowledge and Information
Systems, 1, 369−375.
Chen, Z., Meng, X., Fowler, R. H., & Zhu, B. (2001). FEATURES: Real time adaptive features and document
learning for Web search. Journal of the American Society for Information Science, 52(8), 655−665.
Chen, Z., & Zhu, B. (2002). Some formal analysis of the Rocchios similarity−based relevance feedback
algorithm. Information Retrieval, 5(1), 61−86.
Chen, Z., Meng, X., Zhu, B., & Fowler, R. (2000). Websail: From on−line learning to Web search. In

Proceedings of the 2000 International Conference on Web Information Systems Engineering (the full version
appeared in Knowledge and Information Systems, 4, 219−227).
Claypool, M., Brown, D., Le, P., & Waseda, M. (2001). Inferring user interest. IEEE Internet Computing,
5(6), 32−39.
Etzioni, O. & Weld, D. (1995). Intelligent agents on the Internet: Fact, fiction and forecast. IEEE Expert,
10(3), 44−49.
Grover, V. & Teng, J.T.C. (2001). E−commerce and the information market. Communications of the ACM,
44(4), 79−86.
Herlocker, J.L. & Konstan, J.A. (2001). Content−independent task−focused recommendation, IEEE Internet
Computing, 5(6), 40−47, November.
Ide, E. (1971). New experiments in relevance feedback. In G. Salton (Ed.), The SMART retrieval system:
Experiments in automatic document processing, 337−354. Englewood Cliffs, NJ: Prentice Hall.
Kobayashi, M. & Takeda, K. (2000). Information Retrieval on the Web. ACM Computing Surveys, 32(2),
144−173.
Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., & Riedl, J. (1997). GroupLens: Applying
collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77−87.
Lang, K. (1995). Newsweeder: Learning to filter news. In Proceedings of the Twelfth International
Conference on Machine Learning, 331−339.
Lewis, D. (1991). Learning in intelligent information retrieval. In Proceedings of the Eighth International
Workshop on Machine Learning, 235−239.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear−threshold algorithm.
Machine Learning, 2, 285−318.
Meng, X., & Chen, Z. (1999). Personalize Web search using information on clients side. In Advances in
Computer Science and Technologies (985−992). Denver, CO: International Academic Publishers.
Meyyappan, A. (2001). Proposing a new multi−routing agent architecture for E−marketplace. In Proceedings
of the 2001 International Internet Computing Conference, 275−277.
Pazzani, M., Muramatsu, J. & Billus, D. (1996). Syskill & Webert: Identifying interesting Web Sites. In
Proceedings of the Thirteenth National Conference on Artificial Intelligence, 54−61.
Perkowitz, M. & Etzioni, O. (2000). Adaptive Web sites: Concept and case study. Artificial Intelligence, 118,

245−275.
Rocchio, J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval
system: Experiments in automatic document processing, 313−323. Englewood Cliffs, NJ: Prentice Hall.
Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by
computer. Reading, MA: Addison−Wesley.
Salton, G., Wong, A., & Yang, C. (1975). A vector space model for automatic indexing. Communications of
the ACM, 18(11), 613−620.
Santos, S.C., Anglim, S., & Meira, S.R.L. (2001). A framework for Web−commerce search brokers. In
Proceedings of the 2001 International Internet Computing Conference, 261−267.
Shakes, J., Langheinrich, M., & Etzioni, O. (1997). Dynamic reference sifting: A case study in the homepage
domain. In Proceedings of the Sixth International World Wide Web Conference, 189−200.
Terveen, L., Hill, W., Amento, B., McDonald, D., & Creter, J. (1997). PHOAKS: A system for sharing
recommendations. Communications of the ACM, 40(3), 59−62.
Tewari, G., Berkovich, A., Gabovich, V., Liang, S., Ramakrishnan A., & Maes, P. (2001). Sustaining
individual incentives while maximizing aggregate social welfare: A mediated brokering technique for trading
agents in next−generation electronic markets. In Proceedings of the 2001 International Internet Computing
Conference, pp. 247−253.
Yuwono, B., & Lee, D. (1996). Search and ranking algorithms for locating resources on the World Wide Web.
In Proceedings of the International Conference on Data Engineering, 164−171.
Chapter 10: World Wide Web Search Engines
Wen−Chen Hu
University of North Dakota
Jyh−Haw Yeh
Boise State University
Copyright © 2003, Idea Group Inc. Copying or distributing in print or electronic forms without written
permission of Idea Group Inc. is prohibited.

Abstract
The World Wide Web now holds more than 800 million pages covering almost all issues. The Web's
fast−growing size and lack of structural style present a new challenge for information retrieval. Numerous search
technologies have been applied to Web search engines; however, the dominant search method has yet to be
identified. This chapter provides an overview of the existing technologies for Web search engines and
classifies them into six categories: 1) hyperlink exploration, 2) information retrieval, 3) metasearches, 4) SQL
approaches, 5) content−based multimedia searches, and 6) others. At the end of this chapter, a comparative
study of major commercial and experimental search engines is presented, and some future research directions
for Web search engines are suggested.
Introduction
One of the most common tasks performed on the Web is to search Web pages, which is also one of the most
frustrating and problematic. The situation is getting worse because of the Web's fast−growing size and lack of
structural style, as well as the inadequacy of existing Web search engine technologies (Lawrence & Giles,
1999a). Traditional search techniques are based on users typing in search keywords, which the search services
can then use to locate the desired Web pages. However, this approach normally retrieves too many
documents, of which only a small fraction are relevant to the user's needs. Furthermore, the most relevant
documents do not necessarily appear at the top of the query output list. A number of corporations and research
organizations are taking a variety of approaches to try to solve these problems. These approaches are diverse,
and none of them dominate the field. This chapter provides a survey and classification of the available World
Wide Web search engine techniques, with an emphasis on nontraditional approaches. Related Web search
technology reviews can also be found in (Gudivada, Raghavan, Grosky, & Kasanagottu, 1997; Lawrence &
Giles, 1998b; Lawrence & Giles, 1999b; Lu & Feng, 1998).
Requirements of Web Search Engines
It is first necessary to examine what kind of features a Web search engine is expected to have in order to
conduct effective and efficient Web searches and what kind of challenges may be faced in the process of
developing new Web search techniques. The requirements for a Web search engine are listed below, in order
of importance:
1. effective and efficient location and ranking of Web documents;
2. thorough Web coverage;
3. up−to−date Web information;
4. unbiased access to Web pages;
5. an easy−to−use user interface which also allows users to compose any reasonable query;
6. expressive and useful search results; and
7. a system that adapts well to user queries.
Web Search Engine Technologies
Numerous Web search engine technologies have been proposed, and each technology employs a very
different approach. This survey classifies the technologies into six categories: i) hyperlink exploration, ii)
information retrieval, iii) metasearches, iv) SQL approaches, v) content−based multimedia searches, and vi)
others. The chapter is organized as follows: Section 2 introduces the general structure of a search engine, and
Sections 3 to 8 introduce each of the six Web search engine technologies in turn. A comparative study of
major commercial and experimental search engines is shown in Section 9 and the final section gives a
summary and suggests future research directions.
Search Engine Structure
Two different approaches are applied to Web search services: genuine search engines and directories. The
difference lies in how listings are compiled:
• Search engines, such as Google, create their listings automatically.
• A directory, such as Yahoo!, depends on humans for its listings.
Some search engines, known as hybrid search engines, maintain an associated directory. Search engines
traditionally consist of three components: the crawler, the indexing software, and the search and ranking
software (Greenberg & Garber, 1999; Yuwono & Lee, 1996). Figure 1 shows the system structure of a typical
search engine.
Figure 1: System structure of a Web search engine
Crawler
A crawler is a program that automatically scans various Web sites and collects Web documents from them.
Crawlers follow the links on a site to find other relevant pages. Two search algorithms, breadth−first search
and depth−first search, are widely used by crawlers to traverse the Web. The crawler views the Web as a
graph, with the nodes being the objects located at Uniform Resource Locators (URLs). The objects can be of
various protocol types: Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), mailto (e−mail),
news, telnet, etc.
They also return to sites periodically to look for changes. To speed up the collection of Web documents,
several crawlers are usually sent out to traverse the Web at the same time. Three simple tools can be used to
implement an experimental crawler:
• lynx: Lynx is a text browser for Unix systems. For example, the command lynx −source <URL> prints
the source code of the Web page at <URL>, while lynx −dump prints its rendered text.
• java.net: The java.net package of the Java language provides plenty of networking utilities. Two
classes in the package, java.net.URL and java.net.URLConnection, can be used to download Web pages.
• Comprehensive Perl Archive Network (CPAN): Perl has been used intensively for Web−related
applications. Some scripts provided by CPAN (www.cpan.org) are useful for crawler construction.

To construct an efficient and practical crawler, some other networking tools have to be used.
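To make the breadth−first strategy concrete, a toy crawler skeleton in Python is sketched below. The link−extraction regular expression is a crude stub; a practical crawler would add robots.txt handling, politeness delays, and robust HTML parsing.

from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin
import re

LINK_RE = re.compile(r'href="(http[^"]+)"')   # crude link extraction (stub)

def crawl(seed, max_pages=10):
    # Breadth-first traversal of the Web graph starting from seed.
    seen, frontier, pages = {seed}, deque([seed]), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                           # skip unreachable pages
        pages[url] = html
        for link in LINK_RE.findall(html):
            link = urljoin(url, link)
            if link not in seen:               # enqueue unseen links only
                seen.add(link)
                frontier.append(link)
    return pages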
Indexing Software
Automatic indexing is the process of algorithmically examining information items to build a data structure
that can be quickly searched. Filtering (Baeza−Yates, 1992) is one of the most important pre−processes for
indexing. Filtering is a typical transformation in information retrieval and is often used to reduce the size of a
document and/or standardize it to simplify searching. Traditional search engines utilize the following
information, provided by HTML scripts, to locate the desired Web pages:
• Content: Page content provides the most accurate, full−text information. However, it is also the
least−used type of information, since context extraction is still far from practical.
• Descriptions: Page descriptions can either be constructed from the metatags or submitted by Web
masters or reviewers.
• Hyperlink: Hyperlinks contain high−quality semantic clues to a page's topic. A hyperlink to a page
represents an implicit endorsement of the page to which it points (Chakrabarti et al., 1999).
• Hyperlink text: Hyperlink text is normally a title or brief summary of the target page.
• Keywords: Keywords can be extracted from full−text documents or metatags.
• Page title: The title tag, which is only valid in a head section, defines the title of an HTML document.
• Text with a different font: Emphasized text is usually given a different font to highlight its
importance.
• The first sentence: The first sentence of a document is also likely to give crucial information related to
the document.

Search and Ranking Software
Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. A
user enters a keyword or keywords, along with Boolean modifiers such as and, or, or not, into a search engine,
which then scans indexed Web pages for the keywords. To determine in which order to display pages to the
user, the engine uses an algorithm to rank pages that contain the keywords (Zhang & Dong, 2000). For
example, the engine may count the number of times the keyword appears on a page. To save time and space,
the engine may only look for keywords in metatags, which are HTML tags that provide information about a
Web page. Unlike most HTML tags, metatags do not affect a document's appearance. Instead, they include
such information as a Web page's contents and some relevant keywords. The following six sections give
various methods of indexing, searching, and ranking the Web pages.
Hyperlink Exploration
Hypermedia documents contain cross references to other related documents by using hyperlinks, which allow
the user to move easily from one to the other. Links can be tremendously important sources of information for
indexers; the creation of a hyperlink by the author of a Web page represents an implicit endorsement of the
page to which it points. This approach is based on identifying two important types of Web pages for a
given topic:
• Authorities, which provide the best source of information on the topic, and
• Hubs, which provide collections of links to authorities.
For the example of professional basketball information, the official National Basketball Association site
(www.nba.com) is considered to be an authority, while the ESPN site (www.espn.com) is a hub.

Authorities and hubs are either given top ranking in the search results or used to find related Web pages (Dean
& Henzinger, 1999).
Analyzing the interconnections of a series of related pages can identify the authorities and hubs for a
particular topic. A simple method to update a non−negative authority weight $x_p$ and a non−negative
hub weight $y_p$ is given by Chakrabarti et al. (1999). If a page is pointed to by many good hubs, its
authority weight is updated by using the following formula:

$$x_p = \sum_{q \to p} y_q,$$

where the notation $q \to p$ indicates that q links to p. Similarly, if a page points to many good authorities, its
hub weight is updated via

$$y_p = \sum_{p \to q} x_q.$$
Unfortunately, applying the above formulas to the entire Web to find authorities and hubs is impracticable.
Ideally, the formulas are applied to a small collection $S_\sigma$ of pages that contains plenty of relevant documents.
The concepts of a root set and a base set have been proposed by Kleinberg (1999) to find $S_\sigma$. The root set is
usually constructed by collecting the t highest−ranked pages for the query $\sigma$ from a search engine such as
Google or Yahoo!. However, the root set may not contain most of the strongest authorities. A base set is
therefore built by including any page pointed to by a page in the root set and any page that points to a page in
the root set. Figure 2 shows an example of a root set and a base set. The above formulas can then be applied to
a much smaller set, the base set, instead of the entire Web.
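The iteration over the base set can be written compactly as below; normalizing the weights after each round is assumed here (it is standard in Kleinberg's formulation) so that repeated updates stay bounded.

def hits(links, rounds=20):
    # links: page -> set of pages it points to, restricted to the base set.
    pages = set(links) | {q for targets in links.values() for q in targets}
    x = {p: 1.0 for p in pages}    # authority weights x_p
    y = {p: 1.0 for p in pages}    # hub weights y_p
    for _ in range(rounds):
        # x_p = sum of y_q over all q with q -> p
        x = {p: sum(y[q] for q in pages if p in links.get(q, ()))
             for p in pages}
        # y_p = sum of x_q over all q with p -> q
        y = {p: sum(x[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the weights do not grow without bound.
        nx = sum(v * v for v in x.values()) ** 0.5 or 1.0
        ny = sum(v * v for v in y.values()) ** 0.5 or 1.0
        x = {p: v / nx for p, v in x.items()}
        y = {p: v / ny for p, v in y.items()}
    return x, y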
In addition to the methods used to find authorities and hubs, a number of search methods based on
connectivity have been proposed. A comparative study of various hypertext link analysis algorithms is given
in (Borodin et al., 2001). The most widely used method is a Page Rank model (Brin & Page, 1998), which
suggests the reputation of a page on a topic is proportional to the sum of the reputation weights of pages
pointing to it on the same topic. That is, links emanating from pages with high reputations are weighted more
heavily. The concepts of authorities and hubs, together with the Page Rank model, can also be used to
compute the reputation rank of a page; those topics for which the page has a good reputation are then
identified (Rafiei & Mendelzon, 2000). Some other ad hoc methods include an Hyperlink Vector Voting
(HVV) method (Li, 1998) and a system known as WebQuery (Carriere & Kazman, 1997). The former method
uses the content of hyperlinks to a document to rank its relevance to the query terms, while the latter system
studies the structural relationships among the nodes returned in a content−based query and gives the highest
ranking to the most highly connected nodes. An improved algorithm, obtained by augmenting connectivity
analysis with content analysis, is introduced in Bharat and Henzinger (1998).
Figure 2: Expanding the root set into a base set
Information Retrieval (IR)
IR techniques are widely used in Web document searches (Gudivada et al., 1997). Among them, relevance
feedback and data clustering are two of the most popular techniques used by search engines. The former
method has not so far been applied to any commercial products because it requires some interaction with
users, who normally prefer to use a keyword−only interface. The latter method has achieved more success
since it does not require any interaction with users to achieve acceptable results.
Relevance Feedback
An initial query is usually a wild guess. Retrieved query results are then used to help construct a more precise
query or modify the database indexes (Chang & Hsu, 1999). For example, if the following query is submitted
to a search engine, "Which TOYOTA dealer in Atlanta has the lowest price for a Corolla 2002?", the engine may
produce the following list of ranked results:
1. Get the BEST price on a new Toyota, Lexus car or truck.
2. Toyota of Glendale: Your #1 Toyota dealer. (toyota−of−glendale.com)
3. Leith Toyota: Raleigh, North Carolina.
4. Atlanta rental cars & auto rentals.
This list includes three relevant results: 1, 2, and 3, and one irrelevant result: 4. The following two relevance
feedback methods can be used to improve the search results:
• Query modification: Adjusts the initial query in an attempt to avoid unrelated or less−related query
results. For example, the above query could be modified by adding a condition excluding rental cars.
• Indexing modification: Through feedback from the users, system administrators can modify an
unrelated document's terms to render it unrelated or less related to such a query. For example, the
information concerning rental cars could be removed from the database indexes of car sales and
prices.

For the above example, the search results after modification should not include Result #4.

Data Clustering
Data clustering is used to improve the search results by dividing the whole data set into data clusters. Each
data cluster contains objects of high similarity, and clusters are produced that group documents relevant to the
user's query separately from irrelevant ones. For example, the formula below gives a similarity measure:

$$\text{similarity}(D_i, D_j) = \sum_{k} \text{weight}_{ik} \times \text{weight}_{jk},$$

where $\text{weight}_{ik}$ is the weight assigned to $\text{term}_k$ in a document $D_i$ (Baeza−Yates, 1992). Clustering should not
be based on the whole Web resource, but on smaller separate query results. In Zamir and Etzioni (1998), a
Suffix Tree Clustering (STC) algorithm based on phrases shared between documents is used to create clusters.
Besides clustering the search results, a proposed similarity function has been used to cluster similar queries
according to their contents as well as user logs (Wen, Nie, & Zhang, 2001). The resulting clusters can provide
useful information for Frequently Asked Queries (FAQ) identification. Another Web document clustering
algorithm is suggested in Chang and Hsu (1999).
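A small sketch of the similarity measure above, together with a naive threshold−based grouping of query results, is given below; the greedy grouping rule is a simplification for illustration, not the STC algorithm.

def similarity(wi, wj):
    # Inner-product similarity of two documents given as term->weight maps.
    return sum(w * wj[t] for t, w in wi.items() if t in wj)

def group(docs, threshold=0.5):
    # Greedy grouping: put a document into the first cluster whose seed
    # document is similar enough, otherwise start a new cluster.
    clusters = []
    for d in docs:
        for c in clusters:
            if similarity(d, c[0]) >= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters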
Metasearches
None of the current search engines is able to cover the Web comprehensively. Using an individual search
engine may miss some critical information that is provided by other engines. Metasearch engines (Dreilinger
& Howe, 1997; Howe & Dreilinger, 1997; Selberg & Etzioni, 1997) conduct a search using several other
search engines simultaneously and then present the results in some sort of integrated format. This lets users
see at a glance which particular search engine returned the best results for a query without having to search
each one individually. Metasearch engines typically do not maintain their own Web indexes. Figure 3 shows the system structure
of a metasearch engine, which consists of three major components:
• Dispatch: Determines to which search engines a specific query is sent. The selection is usually based
on network and local computational resources, as well as the long−term performance of search
engines on specific query terms.
• Interface: Adapts the user's query format to match the format of a particular search engine, which
varies from engine to engine.
• Display: Raw results from the selected search engines are integrated for display to the user. Each
search engine produces different raw results from other search engines, and these must be
combined to give a uniform format for ease of use.


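The Python sketch below illustrates the three components end to end; the engine records, the merging rule,
and all names are hypothetical.

# A minimal sketch of the dispatch/interface/display pipeline (hypothetical).
def dispatch(query, engines):
    # Dispatch: choose engines; real systems also weigh past performance
    # on the query terms and current network/computational load.
    return [e for e in engines if e["available"]]

def interface(query, engine):
    # Interface: adapt the query to this engine's native request format.
    return engine["format"](query)

def display(result_lists):
    # Display: merge the raw result lists into one uniformly ranked list
    # (here by summing reciprocal ranks across engines).
    merged = {}
    for results in result_lists:
        for rank, url in enumerate(results):
            merged[url] = merged.get(url, 0.0) + 1.0 / (rank + 1)
    return sorted(merged, key=merged.get, reverse=True)

engines = [{"available": True,
            "format": lambda q: {"q": q},
            "search": lambda req: ["http://a.example", "http://b.example"]}]
selected = dispatch("corolla 2002", engines)
raw = [e["search"](interface("corolla 2002", e)) for e in selected]
print(display(raw))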
Figure 3: System structure of a metasearch engine
Current search engines provide a multiplicity of interfaces and result formats, which makes the construction
of metasearch engines a very difficult task. The STARTS protocol (Gravano, Chang, Garcia-Molina, Lagoze, &
Paepcke, 1997) has been proposed to standardize Internet retrievals and searches. The goals are to choose the
best sources (search engines) to evaluate a query, submit the query to the sources selected, and finally merge
the query results obtained from the different sources. However, this protocol has received little recognition
since none of the most−often−used search engines apply it. Another approach (Huang, Hemmje, & Neuhold,
2000) to solving this problem is to use an adaptive model which employs a mediator−wrapper architecture.
The mediator provides users with integrated access to multiple heterogeneous data sources, while each
wrapper represents access to a specific data source. It maps a query from a general mediator format into the
specific wrapper format required by each search engine.
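A sketch of the mediator-wrapper idea follows; the class and method names are invented for illustration and
do not reproduce the cited system.

# A minimal sketch of a mediator-wrapper architecture (hypothetical names).
class Wrapper:
    # Each wrapper maps a general mediator query into one engine's format.
    def __init__(self, name, to_native):
        self.name = name
        self.to_native = to_native
    def query(self, mediator_query):
        native = self.to_native(mediator_query)
        # A real wrapper would send `native` to its engine and parse replies.
        return ["{} result for {}".format(self.name, native)]

class Mediator:
    # The mediator gives integrated access to heterogeneous sources.
    def __init__(self, wrappers):
        self.wrappers = wrappers
    def search(self, terms):
        mediator_query = {"terms": terms}
        return [r for w in self.wrappers for r in w.query(mediator_query)]

m = Mediator([Wrapper("EngineA", lambda q: "+".join(q["terms"])),
              Wrapper("EngineB", lambda q: " AND ".join(q["terms"]))])
print(m.search(["XML", "W3C"]))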
Metasearch engines rely on the summaries and ranks of URLs returned by standard search engines. However,
not all standard search engines give unbiased results and this will distort the metasearch results. The NEC
Research Institute (NECI) metasearch engine (Lawrence & Giles, 1998a) solved this problem by downloading
and analyzing each document and then displaying results in a format that shows the query terms in context.
This helps users more readily determine if the document is relevant without having to download each page.
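In the same spirit, the sketch below extracts query terms in context from a downloaded page (a keyword-in-
context display); the window size and all names are invented, not the NECI engine's actual code.

# A minimal sketch of showing query terms in context (hypothetical).
def contexts(document, term, window=30):
    text, needle = document.lower(), term.lower()
    snippets, start = [], text.find(needle)
    while start != -1:
        lo = max(0, start - window)
        hi = min(len(document), start + len(needle) + window)
        snippets.append("..." + document[lo:hi] + "...")
        start = text.find(needle, start + 1)
    return snippets

page = "Toyota of Atlanta offers the lowest price on a new Corolla today."
print(contexts(page, "lowest price"))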
The authors of Q−pilot (Sugiura & Etzioni, 2000) noticed that thousands of specialized, topic−specific search
engines are accessible on the Web, and these topic-specific engines return far better results for on-topic
queries than standard search engines. Q−pilot dynamically routes each user query to the most appropriate
specialized search engines by using two methods: neighborhood−based topic identification, and query
expansion.
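The sketch below gives the flavor of such query routing; the topic vocabularies and engine names are
hypothetical and far cruder than Q-pilot's actual topic identification.

# A minimal sketch of routing a query to a topic-specific engine.
TOPIC_ENGINES = {
    "cars":   {"terms": {"dealer", "corolla", "toyota", "price"},
               "engine": "car-search.example"},
    "health": {"terms": {"symptom", "diagnosis", "pediatric"},
               "engine": "health-search.example"},
}

def route(query):
    words = set(query.lower().split())
    # Pick the topic whose vocabulary overlaps the query the most.
    topic, info = max(TOPIC_ENGINES.items(),
                      key=lambda kv: len(words & kv[1]["terms"]))
    return info["engine"] if words & info["terms"] else "general-search.example"

print(route("TOYOTA dealer Atlanta lowest price Corolla 2002"))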
SQL Approaches
Learning how to use a new language is normally an arduous task for users. However, a new system which
uses a familiar language is usually adopted relatively smoothly by the users. Structured Query Language
(SQL) is a well−known and widely−used database language. SQL approaches (Florescu, Levy, & Mendelzon,
1998; Mendelzon & Milo, 1998) view the World Wide Web as a huge database where each record matches a
Web page, and use SQL−like languages to support effective and flexible query processing. A typical
SQL-like language syntax (Konopnicki & Shmueli, 1998; Mendelzon, Mihaila, & Milo, 1997; Spertus &
Stein, 2000) is
Query := select Attribute_List from Domain_Specifications
[ where Search_Conditions ];
Three query examples are given below to show the use of the language.
SQL Example 1: Find pages in the World Wide Web Consortium (W3C) site where the pages have fewer
than 2000 bytes.
select url from http://www.w3.org/ where bytes < 2000;
url is a page's URL, and each page has attributes such as bytes, keywords, and text.
SQL Example 2: Find educational pages containing the keyword database.
select url from http://%.edu/ where "database" in keywords;
Regular expressions are widely used in the query language; e.g., the symbol % is a wild card matching any
string. The in predicate checks whether the string "database" is one of the keywords.
SQL Example 3: Find documents about XML in the W3C Web site where the documents have paths of length
two or less from the root page.
select d.url, d.title
from Document d such that "http://www.w3.org" =|->|->-> d
where d.text like '%XML%';
The symbol "|" denotes alternation and the symbol "->" denotes a link. The string "=|->|->->" is a regular
expression representing the set of paths of length at most two from the root, with "=" matching the root page
itself. The like predicate is used for string matching in this example.
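To show how such predicates can be evaluated, the Python sketch below runs SQL-like selections over a toy
page table; the pages and the helper function are invented, not part of any cited language.

# A minimal sketch of evaluating SQL-like predicates over a toy page table.
import re

def like(text, pattern):
    # Translate the SQL '%' wildcard into a regular expression.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, re.S) is not None

pages = [{"url": "http://www.w3.org/XML/", "bytes": 1500, "text": "XML home"},
         {"url": "http://www.w3.org/Style/", "bytes": 4000, "text": "CSS home"}]

# select url from ... where bytes < 2000 and text like '%XML%';
print([p["url"] for p in pages
       if p["bytes"] < 2000 and like(p["text"], "%XML%")])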
Various SQL-like languages have been proposed for Web search engines. The methods introduced previously
treat the Web as a graph of discrete objects; another object-oriented approach (Arocena & Mendelzon, 1998)
considers the Web as a graph of structured objects. However, neither approach has achieved much success,
largely because of their complicated syntax; this is especially true of the latter method.
Content−Based Multimedia Searches
In order to allow for the wide range of new types of data that are now available on the World Wide Web,
including audio, video, graphics, and images, the use of hypermedia was introduced to extend the capabilities
of hypertext. The first Internet search engine, Archie, was created in 1990; however, it was not until the
introduction of multimedia to the browser Mosaic that the number of Internet documents began to increase
explosively. Only a few multimedia search engines are available currently, most of which use name or
keyword matching where the keywords are entered by Web reviewers rather than using automatic indexing.
The low number of content−based multimedia search engines is mainly due to the difficulty of automated
multimedia indexing. Numerous multimedia indexing methods have been suggested in the literature (Chang &
Hsu, 1992; Yoshitaka & Ichikawa, 1999), yet most do not meet the efficiency requirements of Web
multimedia searches, where users expect both a prompt response and the search of a huge volume of Web
multimedia data. A few content−based image and video search engines are available online (Benitez, Beigi, &
Chang, 1998; Gevers & Smeulders, 1999; Lew, 2000; Smith & Chang, 1997; Taycher, Cascia, & Sclaroff,
1997). Various indexing methods are applied to locate the desired images or video. The major technologies
include using camera/object motion, colors, examples, locations, positional color/texture, shapes, sketches,
text, and texture, as well as relevance feedback (Flickner et al., 1995). However, a de facto Web image or
video search engine is still out of reach because the system's key component, image or video collection and
indexing, is either not yet fully automated or not practicable. Similarly, effective Web audio search engines
have yet to be constructed since audio information retrieval is considered to be one of the most difficult
challenges for multimedia retrieval (Foote, 1999).
Others
Apart from the above major search techniques, some ad hoc methods worth mentioning include:
Work aimed at making the components needed for Web searches more efficient and effective, such as
better ranking algorithms and more efficient crawlers. In Zhang and Dong (2000), a ranking algorithm
based on a Markov model is proposed. It synthesizes the relevance, authority, integrativity, and
novelty of each Web resource, and can be computed efficiently by solving a group of linear
equations; a generic sketch of this idea is given after this list. A variety of other improved ranking
algorithms can be found in Dwork, Kumar, Naor, and Sivakumar (2001) and in Singhal and Kaszkiel (2001).
Various enhanced crawlers can be found in the literature (Aggarwal, Al-Garawi, & Yu, 2001;
Edwards, McCurley, & Tomlin, 2001; Najork & Wiener, 2001). Some crawlers are extensible,
personally customized, relocatable, scalable, and Web-site-specific (Heydon & Najork, 1999; Miller
& Bharat, 1998). Web viewers usually consider certain Web pages more important, and a crawler
that collects those important pages first is advantageous for users (Cho, Garcia-Molina, & Page, 1998).
Artificial Intelligence (AI) can also be used to collect and recommend Web pages. The Webnaut
system (Nick & Themis, 2001) learns the user's interests and can adapt as his or her interests change
over time. The learning process is driven by user feedback to an intelligent agent's filtered selections.
To make the system easier to use, an interface has been designed to accept and understand a natural
language query (Ask Jeeves).
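Following up the ranking item above, here is a generic Python sketch of ranking pages by the stationary
distribution of a Markov chain over the link graph; the transition matrix is hypothetical, and this shows only
the broad idea behind such algorithms, not the specific method of Zhang and Dong (2000).

# A minimal sketch of ranking as a Markov chain's stationary distribution.
def stationary(transition, iterations=100):
    n = len(transition)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [sum(rank[j] * transition[j][i] for j in range(n))
                for i in range(n)]
    return rank

# Row-stochastic transition matrix for three pages (hypothetical links).
T = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
print(stationary(T))  # pages with higher stationary mass rank higher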
Major Search Engines
Some of the currently available major commercial search engines are listed in Table 1, although many table
entries are incomplete because some of the information is kept confidential for business reasons (Search
Engine Watch, http://www.searchenginewatch.com). Most search services are backed up by, or cooperate
with, several other services, since an independent or stand-alone service contains less information and thus
tends to lose its users. In the table, the Backup column gives the major backup information provider, and
rows with an empty Method entry use keyword matching to locate the desired documents. Most
search engines on the list not only provide Web search services but also act as portals, which are Web home
bases from which users can access a variety of services, including searches, e−commerce, chat rooms, news,
etc. Table 2 lists some major experimental search engines, which use advanced search technologies not yet
implemented by the commercial search engines. The list in Table 2 is a snapshot of the current situation; the
list is highly volatile, either because a successful experimental search engine is usually commercialized in a
short time or because a prototype system is normally removed after its founders leave the organization. The
two tables list major general−purpose search engines; special−purpose search engines, including specialty
searches, regional searches, kid searches, etc., are not considered in this chapter. They use much smaller
databases and therefore give more precise and limited search results.
Table 1: Major commercial Web search engines. SE: Search Engine and AS: Answering Service
No.  Name             URL  Type       Backup          Method
1    AOL Search       -    Hybrid SE  Open Directory  -
2    AltaVista        -    SE         LookSmart       -
3    Ask Jeeves       -    AS         -               natural language
4    Direct Hit       -    SE         HotBot          hyperlink
5    Excite           -    SE         LookSmart       -
6    FAST Search      -    -          -               scalability
7    Google           -    SE         -               hyperlink
8    HotBot           -    Hybrid SE  Direct Hit      -
9    IWon             -    Hybrid SE  Inktomi         -
10   Inktomi          -    SE         -               -
11   LookSmart        -    Directory  Inktomi         reviewers
12   Lycos            -    Directory  Open Directory  -
13   MSN Search       -    Directory  LookSmart       -
14   Netscape Search  -    SE         Open Directory  -
15   Northern Light   -    SE         -               filtering
16   Open Directory   -    Directory  -               volunteers
17   RealNames        -    -          -               keywords
18   Yahoo!           -    Directory  Google          reviewers
Table 2: Major experimental Web search engines
No.  Name         URL              Method
1    Clever       -                hyperlink
2    Grouper      -                clustering
3    HuskySearch  -                metasearch
4    ImageRover   -                image
5    ImageScape   -                image
6    Inquirus     /inquirus.html   metasearch
7    Mercator     -                crawler
8    MetaSEEk     -                image
9    PicToSeek    :5345/ret_user/  image
10   W3QS         -                SQL
11   WebOQL       -                Object SQL
12   WebSQL       -                SQL
Summary
In less than a decade, the World Wide Web has become one of the three major media, with the other two
being print and television. Searching for Web pages is both one of the most common tasks performed on the
Web and one of the most frustrating and problematic. This chapter gave an overview of the current
technologies for Web search engines with an emphasis on non−traditional approaches and classified the
technologies into six categories. However, apart from the traditional keyword matching techniques, no one
method dominates Web search engine technologies. The major reason for this is that the amount of
information posted on the World Wide Web is huge and the page formats vary widely.
Future Directions
Users of search engines often submit ambiguous queries. Ambiguous queries can be categorized into four
types: 1) disorderly, 2) incomplete, 3) incorrect, and 4) superfluous queries. Below are examples of perfect
and ambiguous queries, together with the ranked search results returned by Infoseek for the book
Intelligent multimedia information retrieval, edited by Mark T. Maybury (1997); a toy sketch that
distinguishes these query types follows the examples.
Perfect query: Intelligent multimedia information retrieval
1. Intelligent multimedia information retrieval

Disorderly query: Multimedia information intelligent retrieval
1. Artificial intelligence, fuzzy logic and neural networks
2. Intelligent access to information: research in natural language, information retrieval,
computer vision, multimedia and database
3. Multimedia color PC notebooks
4. Intelligent multimedia information retrieval

Incomplete query: Multimedia information retrieval
1. Abstract Stein Mulleller Thiel 95
2. Corpora Oct 1998 to -: Corpora: TWLT 14: language technology in multimedia information
3. 3 2.1 Introduction to the workplan
6. Intelligent multimedia information retrieval

Incorrect query: Intelligent multimedia information retrieval
1. Artificial intelligence research laboratory at Iowa State University
2. Vasant Honavar's home in cyberspace
3. CIIR multimedia indexing
31. Intelligent multimedia information retrieval

Superfluous query: Intelligent multimedia information retrieval systems
1. Research in multimedia and multimodal parsing and generation
2. Intelligent multimedia information retrieval
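A toy Python sketch that distinguishes these query types against a known target phrase is given below; the
heuristics are invented for illustration and much simpler than what a real engine would need.

# A minimal sketch of classifying an ambiguous query against a target phrase.
def classify(query, target):
    q, t = query.lower().split(), target.lower().split()
    if q == t:
        return "perfect"
    if sorted(q) == sorted(t):
        return "disorderly"   # same terms, different order
    if set(q) < set(t):
        return "incomplete"   # terms missing from the query
    if set(q) > set(t):
        return "superfluous"  # extra terms in the query
    return "incorrect"        # mistyped or otherwise wrong terms

target = "Intelligent multimedia information retrieval"
print(classify("Multimedia information intelligent retrieval", target))
print(classify("Intelligent multimedia information retrieval systems", target))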