The WWW and The PageRank Related Problems

Bạn đang xem bản rút gọn của tài liệu. Xem và tải ngay bản đầy đủ của tài liệu tại đây (12.71 MB, 81 trang )

ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN
Nguyễn Hoài Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
LUẬN VĂN THẠC SĨ KHOA HỌC
Hà Nội - 2006
ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN
Nguyễn Hoài Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
Chuyên ngành: Đảm bảo toán học cho máy tính và hệ thống tính toán
Mã số: 1.01.07
LUẬN VĂN THẠC SĨ KHOA HỌC
NGƯỜI HƯỚNG DẪN KHOA HỌC:
PGS. TS. HÀ QUANG THỤY
Hà Nội - 2006
HANOI NATIONAL UNIVERSITY
UNIVERSITY OF SCIENCE
Nguyen Hoai Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
Major: Mathematical assurances for computers and computing systems
Code: 1.01.07
MASTER THESIS
THESIS SUPERVISOR:
ASSOC. PROF. HA QUANG THUY
Hanoi - 2006
Intentionally left blank for your note
Contents

List of Figures ii
List of Tables iii
Introduction iv
Acknowledgement vi
Abstract ix
List of Glossaries xi
1 Objects’ ranks and applications to WWW 1
1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Rank for objects different from web page. . . . . . . . . . 2
1.1.2 Linkage usage in search engine . . . . . . . . . . . . . 4
1.2 Basis of PageRank . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Mathematical basis . . . . . . . . . . . . . . . . . . . 11
1.2.2 Practical issues. . . . . . . . . . . . . . . . . . . . . 13
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 19
2 Some PageRank-related problems 20
2.1 Accelerating problems . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Related works . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Exploiting block structure of the Web . . . . . . . . . . . 22
2.2 Connected-component PageRank approach . . . . . . . . . . . 30
2.2.1 Initial ideas. . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Mathematical basis of CCP . . . . . . . . . . . . . . . 32
2.2.3 On practical side of CCP. . . . . . . . . . . . . . . . . 35
2.3 Spam and spam detection . . . . . . . . . . . . . . . . . . . 37
2.3.1 Introduction to Web spam . . . . . . . . . . . . . . . . 37
2.3.2 Spam techniques . . . . . . . . . . . . . . . . . . . . 38
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 42
3 Implementations and experimental results 43
3.1 Search engine Nutch . . . . . . . . . . . . . . . . . . . . . 43
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 Infrastructure and data sets . . . . . . . . . . . . . . . 48

3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . 48
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 58
Conclusions and future works 59
References 61
List of Figures
1.1 A small directed Web graph with 7 nodes . . . . . . . . . . . . . . . 13
1.2 Ranks of page in SmallWeb graph with α = .9 . . . . . . . . . . . . . 16
1.3 Figures exemplifying results with different α of SmallSet . . . . . . . 17
1.4 Graph of iterations needed with α ∈ [.5; .9] of SmallSet . . . . . . . . 18
2.1 Convergence rates for standard PageRank vs. BlockRank . . . . . . 29
2.2 Unarranged matrix and arranged matrix . . . . . . . . . . . . . . . . 31
2.3 Boosting techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Nutch packages dependencies . . . . . . . . . . . . . . . . . . . . . 46
3.2 Matrices from set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Ranks from PageRank vs. CCP for set 1 . . . . . . . . . . . . . . . . 50
3.4 Ranks and differences corresponding to each block for set 1 . . . . 51
3.5 Time of two methods with different decay values for set 1 . . . . . . 51
3.6 No. of iterations of two methods with different decay values for set 1 52
3.7 Matrices from set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Ranks from PageRank vs. CCP for set 2 . . . . . . . . . . . . . . . . 53
3.9 Ranks and differences corresponding to each block for set 2 . . . . 53
3.10 Time of two methods with different decay values for set 2 . . . . . . 54
3.11 No. of iterations of two methods with different decay values for set 2 54
3.12 Matrices from set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 Ranks from PageRank vs. CCP for set 3 . . . . . . . . . . . . . . . . 55
3.14 Ranks and differences corresponding to each block for set 3 . . . . 56
3.15 Time of two methods with different decay values for set 3 . . . . . . 56
3.16 No. of iterations of two methods with different decay values for set 3 57
i
3.17 Mean case for 3 sets with different decay values . . . . . . . . . . . 57

List of Tables
1.1 Iterations needed for each value of decay-factor α of SmallSet . . . 17
2.1 The closeness of local PageRank vector to global PageRank vector 25
3.1 Each set with corresponding sources . . . . . . . . . . . . . . . . . . 49
3.2 Each set with corresponding number of pages and links . . . . . . . 49
iii
Introduction
Human beings’ history has witnessed many big improvements but nothing is
comparable to networks’, in general, and the WWW’s improvement, in partic-
ular.
During the last decades, databases have grown exponentially in large stores
and companies. In the old days, seekers faced many difficulties in finding
enough data to feed their needs. The picture has changed and now the reverse
picture is a daily problem – how to find relevant data in the information which
is regularly accumulated. The need therefore arises for better approaches that
are able to handle complex models in a reasonable amount of time because
the old models in statistics, machine learning and traditional data analysis
fail to cope with this level of complexity. That is named data-mining with an
import ant branch, Web-mining.
The key inspiration of web-mining based on W WW’s growth. There is no
doubt that WWW is the most valuable source of information but it is not very
easy for you to get used to if you do not know the way. Several studies t ry to
estimate the size of th e Web and most o f them agree that there are over billions
of pages. Additionally, the changing rate of the Web is even more dramatic [4].
According to [4], the size of the Web has doubled in less than two years, and
this growth rate is projected to continue for the next two years. Although a
enormous amount of information is available on the Web, you can hardly find
appropriate information if you nothing supporting. For that reason, you need
a search engine to make your work easier.
We can consider search engines roughly as places that store information

and return what we need after some operations. Everything returne d by a
iv
search engine is sorted descendingly, that means en tries which appear first
seem to be more relevant. Even if that seems normal, many efforts were made
to make this step straightforward. For var ious ways of approaching, there are
many methods that deal with this problem. I n this problem, we have to deter-
mine a page’s importance score in order to sort it in the list of pages [16, 6].
The most famous approach discovered is PageRank
TM
from Google [1]. This
thesis focuses on PageRank and its modifications to tweak up the computing
process.
With the aim of understanding network phenomena and, especially, PageR-
ank computation, this thesis is organized as follow:
• Chapter 1: Demontrates on the rank concept and PageRank. This chap-
ter is the background on which the following chapters flourish.
• Chapter 2: Takes care of som e PageRank-related problems such as ac-
celerating problem, spam. And, a new method is also proposed in this
chapter (section 3.2) which we has published [19].
• Chapter 3: Concentrates on the practical side of PageRank with im-
plementations and experiments. Conclusions and future works are also
mentioned in this chapter.
Acknowledgement
My first words are to acknowledge all those people who had helped, contr ibuted
to the works in this thesis. It is o bvious that I can not f inish this thesis alone.
It will be an impossible task if I, myself alone, work on all things such as de-
ciding orientation, doing research, implementing and many more things. I
tried my best, and if your name is not listed, please tr ust implicitly in me that
my gratitude is not less than those who are listed below.
In the foremost position of my gratitude would stand Assoc. Prof. Ha Quang

Thuy for his supervision, advice and guidance to me from the very first day
when I newly accustom science. His endless enthusiasm for science has in-
spired me in doing research and finally this thesis can be realized. Above all
and most nee ded, he always tries his utmost to encourage and support me in
various ways. Once again, I would like to send my most nicely thankful words
to him.
Many thanks would go to Assoc. Prof. Hoang Chi Thanh, Head of Depart-
ment of Informatics where I am working, for his help and advice to me in both
work and life. Without his help, my thesis can not be here as it is being now.
It is a pleasure for me to express my grateful thanks to Prof. Pham Ky Anh
for his generous disposition. I was allowed to use many facilities in the Center
for High Performance Computing, where he is the Director, to do research and
these facilities helped forming a big important par t of my thesis.
To Doan Duy Hai, Departme nt of Computational and Applied Mathemat-
ics, it is a great joy for me to have such a wonderful friend, colleague like him.
May be he does not know how much I am indebted to him for his kind he lp.
I thank him for his valuable advice in science discussion and for his precious
vi
time spending on helping me. His precise ideas were quite an unforgettable
assistance to me with my scientific works. F u rthermore, with his superb skill
in L
A
T
E
X, I can have help from an expert to drive the trou bles away.
Working environment is a key point in forming the way and att itude of
working to me. I am very lucky to work in a good environment with wise
and scholarly colleagues. I also benefit by academic courses and professional
subjects from these wise and scholarly colleagues including my teachers and
friends. I thank Nguyen T rung Kien who, with his unconditional help of im-

plementing, running and storing data in my experiments, makes t hings easier
for me.
I have a nice collaboration to the seminar KDD group from Faculty of In-
formation Technology, College of Technology. I was so providential to have sev-
eral opportunities to work, discuss about science with those intellectuals. I
especially thank Nguyen Thu Trang for her help and time working on pro-
gramming with me.
This work was supported in par t by the National Project "Developing con-
tent filter systems to support management and implementation public secu-
rity - ensure policy" and th e MoST-203906 Project "Information Extraction
Models for discovering entities and semantic relations from Vietnamese Web
pages".
I can not wait anymore to wholeheartedly acknowledge my sweetheart. She
is really an angel whose dedication, love, confidence in me and intelligence
really took me over hard works.
I am extremely lucky to be born in my family. My father, the first person
who set th e basis of my learning character, taught me the joy of intellectual
recreation even when I was a child. My mother, the best mother in the world,
with her unselfish sacrifice, bring me up with gently love. And, my younger
brother, without his spiritual support, may I fail many times.
My last words are to thank to everybody because you deserve that for your
import ance to me.
Hanoi, Octorber 25
th
, 2006
Nguyen Hoai Nam
Abstract
In this thesis, we will study about the whole WWW on the point of view of
social n etworks. There are many directions to come up to WWW but we chose
the way of social networks because it will give us many advantages in doing

research.
Everytime we heard of WWW, that refer us to the Internet - the biggest
network all over the world. Networks are all around us, all the time. From the
biochemistry of our cells to the web of friendships across the planet. From the
circuitry of modern electronics to chains of historical events. A network is the
result of the forces that shaped it. Thus the principles o f network formation can
be, to some extent, deciphered from the network itself. All such information
comprises the structure of the network.
Moreover, the Inte rnet has been improving rapidly, from a small network
for experimental purpose to a world-wide network. The plentiful content of the
World-Wide Web is useful to millions. But infomation is only available if you
know where to find it. This thesis centres around some aspects of network-
related problems:
• How can we find needed infomation? The answer is the appearance of
search engines.
• How is returned information sorted? The key idea is to u se a ranking ap-
proach and the mo st famous approach, PageRank
TM
, is carefully studied
in this thesis.
• What are raised problems associated to the answers of two previous ques-
tions? These problems cover storage capacity, computational complexity,
ix
accelerating methods, programming skills and many more.
To check whether the the ory to tweak up the existing method is true, we
implemented with data sets downloaded from the Internet. Experimental fig-
ures show that our method is relatively good and it is really promissing.
List of Glossaries
Glossary name Description
CCP Connected-component PageRank

IR Information Retrieval
PR PageRank
URL Unified Resource Locator
WWW World Wide Web
xi
CHAPTER
Objects’ ranks and applications to
WWW
WWW is a very famous entity that you can meet everyday at any
time and in any circumstance. Gradually it prevails on every aspect
of our lives and it is definitely useful. Because of endlessly enormous
information, which can be provided by the WWW, we must have a
scheme to sort things in order to determine which is relatively suit-
able. This chapter will guide you through one of the most topical
problems addressed.
1.1 Rank of objects
Many datasets of interest t oday are best described as a linked collection of
interrelated objects. These may represent homogeneous networks, in which
there is a single-object type and link type, or richer, heterogeneous networks,
in which there may be multiple object and link types (and possibly other se-
mantic information). Examples of homogeneous networks include single mode
social networks, such as people connected by friendship links, or the WWW,
1
1.1 Rank of objects
a collection of linked web pages. Examples of heterogeneous networks include
those in medical domains descr ibing patients, diseases, treatme nts and con-
tacts, or in bibliographic domains describing publications, authors, and venues.
These examples may include set of scientific papers. As all of us know, it is
necessary that we credit others’ works if we use their works in our paper. That
can be considered a link between our papers and credit-owner’s works. Link

mining refers to data mining techniques that explicitly consider these links
when building predictive or descriptive models of the linked data. Commonly
addressed link mining tasks include object ranking, group detection, collective
classification, link prediction and subgraph discovery. While network analysis
has been studied in depth in part icular areas such as social network analy-
sis, hypertext m ining, and web analysis, only recently has there been a cross-
fertilization of ideas among these different commu n ities. This is an exciting,
rapidly expanding area.
In this section, examples on ranking objects in many kinds o f datasets are
examined to insist that an approach to sort relevant results is required. Then,
the section will go on with the need of a ranking algorithm for WWW–one of
the most important communities in our contemporary world.
1.1.1 Rank for objects different from web page
Progress in digital data acquisition and storage technology has resulted in
the growth of huge databases. This has occurred in all areas of human en-
deavor, from the mundane (such as supermarket transaction data, credit card
usage records, telephone call details, and government statistics) to the more
exotic (such as images o f astronomical bodies, molecular databases, and med-
ical records). Little wonder, then, that interest has grown in the possibility
of tapping these data, of extracting from them information that might be of
value to the owner of the database. The discipline concerned with this task
has become known as data mining.
Data mining, with its aim to find knowledge from a "big mountain" of infor-
mation, needs a way to sort things extracted. We will study some examples to
decide that a ranking algorithm is "must-have" part of every mining system.
2
1.1 Rank of objects
Example 1.1. Consider some online computers store that accept queries on
the Configuration and Price attributes of a goods. Agents could treat query
conditions as if they were regular Boolean conditions then returns relevant

things. This way, agent or any client can determine an acceptable radius of
pattern and price around the preferred products. However, suppose we are
browsing a very big store, there could be too many matching items making
users’ tasks difficult. Additionally, many items that are slightly weaker in
specification but price are very good in the searching range are missing.
Thus, we should t h ink of a new ranking system which ranks results if they
are closest to the Configuration given and have best Price or relatively inex-
pensive. 
In addition, we can propose, briefly, a method to rank items as follow.
Example 1.2. Suppose, with two criteria Configuration and Price used, each
criterion is assigned weight as in 0.8c+0.2p where c ∈ [0, 1] shows how close the
result is to the target configuration and p ∈ [0, 1] indicates how close the price
is to the target price. Looking at the formula, if a computer has higher relevant
configuration, it seems to be ranked higher in results. If another store would
want to weigh conf iguration and price equally, the formula should be 0.5c+0.5p.
Suppose that a customer wants to look for a computer with specific con-
figuration and target price of 500US$. Besides, in the store, there is only one
computer with the specific configuration (c = 1, quite good) and high price
(i.e. p = 0.2). All the remaining computers are in the near configuration with
c = 0.8 and the price for all is moderate, p = 0.5. Note that, with the same or
near configuration, computers can differ in price due t o the make of manufac-
turers.
Applying the definition above, we compute the relevance score of each com-
puter type. The computer, which suites best in conf iguration have score of
0.8 × 1 + 0.2 × 0.2 = 0.84 whereas other computers shou ld have scores of
0.8 × 0.8 + 0.2 × 0.5 = 0.74. Therefore, in this store, the computer with best rele-
vant configuration should be ranked first in results and others will have lower
ranks. What will happen if we consider a store which assigns the same weight
to both configuration and price? Let us have a computation of these computers.
The computer, which suites best in configuration has score of 0.5×1+0.5×0.2 =

3
1.1 Rank of objects
0.6 whereas other computers should have scores of 0.5 × 0.8 + 0.5 × 0.5 = 0.65.
Therefore, in this store, the answer is different. Computers with unlike config-
uration but good price should be the first choice because they have high scores
and the computer with the same configuration but high price would receive a
worse vote from the store’s algorithm. 
Problem arises in Example 1.2 is not the only problem. We will face another
problem which is described Example 1.3
Example 1.3. Suppose that there is a se rvice, which combines results from
many online stores to provide information to customers. Therefore, this ser-
vice has to query many stores at once then combines the results into a single
ranked result. Nevertheless, what will happen if this service uses a different
algorithm to stores as the case descr ibed in Example 1.2? This is an open ques-
tion. 
It is hard to determine which system is right, which is better, to which we
should follow for previous examples. Moreover, not only computers but also
many kinds of entities around us need to be ranked. That leads us to a big job:
Ranking objects. This is an open question that the th esis’ author intends to
solve in the future.
In previous examples, computers with different configurations and pr ices
have no interrelated information. How about objects those have connections
between? Is it easier to mine and extract useful information? In the next sec-
tion, we will consider web pages–special objects which have links and know
how links between web pages benefit us.
1.1.2 Linkage usage in search engine
Links or more generically relationships, among data instances are ubiquitous.
These links often exhibit patterns that can indicate properties of t h e data in-
stances such as the importance, rank, or category of the object. In some cases,
not all links will be observed; therefore, we may be interested in predicting

the existence of links between instances. In other domains, where the links
are evolving over time, our goal may be to predict whether a link will exist
4
1.1 Rank of objects
in the future, given the previously obser ved links. By con sidering links, more
patterns that are complex arise as well. This leads to other challenges focused
on discovering substructures, such as communities, groups, or common sub-
graphs.
Traditional data mining algorithms such as association rule mining, mar-
ket basket analysis, and cluster analysis commo n ly attempt to find patterns
in a dataset characterized by a collection of independent instances of a single
relation. One can think of this process as learning a model for the node at-
tributes of a homogeneous graph while ignoring the links between the nodes.
A key emerging challenge for data mining is tackling the problem of mining
richly structured, heterogeneous datasets. These kinds of datasets are best de-
scribed as networks or graphs. The domains often consist of a variety of object
types; the objects can be linked in a variety of ways. Thus, the graph may have
different node and edge types. Naively applying traditional statistical infer-
ence procedures, which assume that instances are independent, can lead to
inappropriate conclusions about t h e data. Care must be taken that potential
correlations due to links are handled appropriately. In fact, object linkage is
knowledge that should be exploited. This information can be used to improve
the predictive accuracy of the learned models: attributes of linked objects are
often correlated, and links are mo re likely to exist between objects that have
some commonality. In addition, the graph structure itself may be an impor-
tant element to include in the model. Structural properties such as deg ree and
connectivity can be important indicators. Link mining is a newly emerging re-
search area that is at the intersection of the work in link analysis, hypertext
and web mining, relational learning and inductive logic programming, and
graph mining. We use the term link mining to put a special emphasis on the

links, moving them up to first-class citizens in the data analysis endeavor.
Now we will take care of link mining on the point of view of the main
inspiration–WWW. There is no dou bt that the Web is really gigantic and deal-
ing with it is really a challenging task. Paper [4] has point out that the size and
the content of the Web is changing second by second with the dramatic growth
rate. The size of WWW is estimated to have dou bled in less than two years and
this growth rate is projected to continue for the next two years. As Google [1]
has announced, they have a repository of over 8 billion pages indexed and this
5
1.1 Rank of objects
number does not reflect well of all pages exist on the WWW.
Aside from newly created pages, the existing pages are continuously up-
dated and, consequently, contents of Web change incessantly. For example,
with the study of over a half of million pages, [4] also shows that about 23%
of pages changed daily. In the .com domain, 40% of the pages changed daily,
and in about 10 days half of the pages are gone (i.e., their URLs are no longer
valid).
In addition to size and rapid change, the interlinked nature of the Web
sets it apart from many other collections. Several studies aim to understand
how t h e Web’s linkage is structured and how th at structure can be modeled.
One recent study, for example, suggests that the link structure of the Web is
somewhat like a "bow-tie" [7]. That is, about 28% of the pages constitutes a
strongly connected core (the center of the bow tie). About 22% form one of the
tie’s loops: these are pages that can be reached from t he core but not vice versa.
The other loop consists of 22% of the pages that can reach the core, but cannot
be reached from it. (The remaining nodes can neither reach the core nor can
be reached from the core.)
With its own complex structure, WWW has created many difficulties in
people’s aim to understand it, and, furthermore, to utilize and find useful in-
formation from its enormous warehouse of information resources. As many

insisting sentences in thesis, this is the main reason why search engines ap-
pear.
Example 1.4. For many of us, if we want to search for scientific papers, be-
sides non-free sites, one of the first-thought-of websites should be .i
This website has a huge collection of scientific papers, especially in Infor-
mation Technology. It collects papers from many sources then stores in its
database. If you want to search for any paper, just get to th e URL above, type
what you want and look up at the results returned. This is quite an easy task
for us. How about people who are newbies to using Internet? It may be im-
possible for them to know t h e URL n eeded to find what they want. Moreover,
Citeseer is limited in a small amount of available papers, and scientific papers
are not the only thing that people want to look up from the WWW. 
What shown in Example 1.4 has addressed one of the smallest t hings that
6

The WWW and The PageRank Related Problems

Tài liệu liên quan

Tài liệu bạn tìm kiếm đã sẵn sàng tải về