VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Hoàng Cường
Research on node ranking in peer-to-peer networks
UNDERGRADUATE GRADUATION THESIS (FULL-TIME PROGRAM)
Major: Information Technology
HANOI - 2010
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Acknowledgements
First of all, I would like to express my deep gratitude to Dr. Nguyễn Hoài Sơn and
the teachers who guided me and were a source of inspiration throughout my research.
I would also like to thank the lecturers of the Faculty of Information Technology,
University of Engineering and Technology, Vietnam National University, Hanoi. They
taught and guided us and always created the best conditions for our studies throughout
our university years, especially during the work on this graduation thesis.
Hanoi, May 22, 2010
Hoàng Cường
ABSTRACT
This thesis defines and describes a fully distributed node ranking algorithm for
peer-to-peer (P2P) systems. It puts forward a new approach to ranking nodes in
peer-to-peer networks, synthesizing existing foundations into a method that is
feasible for such networks. Integrating this algorithm into P2P keyword search can
produce substantial benefits, both in effectiveness for users and in reduced network
traffic. The incremental search algorithm provided approximately a ten-fold reduction
in network traffic for two-word and three-word queries.
Table of Contents
Abstract
List of Images
List of Tables
Chapter 1: Peer to Peer and Ranking Problem
1.1. Peer to Peer
1.1.1. Peer to Peer overview
1.1.2. Architecture of Peer to Peer Systems
1.1.3. Distributed hash tables
1.2. Ranking in Peer to Peer networks
1.2.1. Introduction
1.2.2. Ranking Roles
1.2.3. Research’s important objects
Chapter 2: Ranking on DHT Peer to Peer Networks
2.1. Chord Protocol
2.2. PageRank
2.2.1. Description
2.2.2. Algorithms
2.3. Distributed Computing
2.3.1. Introduction
2.3.2. Algorithms
2.4. tf-idf
Chapter 3: Building a new algorithm for ranking in Chord networks
3.1. Targets and Missions of Research
3.2. Idea
3.2.1. Major problems to exploit
3.2.2. Ranking Idea
Chapter 4: Ranking on Details
4.1. Ranking algorithm
4.2. Ranking’s features
Chapter 5: Evaluation
Chapter 6: Related Work
Chapter 7: Contributions and future work
References
List of Images
Image 1.1.1: “Peer to Peer” means “connected together”
Image 1.1.3: Distributed hash tables example
Image 1.2.1: The system must have a ranking engine to find the one
Image 2.1: A 16-node Chord network example
Image 2.2.2: How PageRank works
Image 2.3: Distributed Nodes Graph example
Image 3.2.1: Google almost is not exact
Image 3.2.2: Intersect Idea
Image 3.4: Factor Percent
Image 4.1: Bandwidth is the key of ranking trusted
Image 4.1.2: Example of sub-graph semantic rank
Fig 4: A global graph of both local nodes and external nodes
Fig 5: An external local graph without a strategy
Fig 6: An external local graph
Image 4.2: Eigenvalue
Image 4.2.3: Random walk
Image 4.2.4: (n+1) graph nodes
Image 4.2.5: Graph example - 6 nodes
Image 4.2.6: Multiplication result example
Image 4.2.7: Multiplication result example - at iterators
List of Tables
Table 3.2.1: PageRank convergence and HITS convergence
Table 3.2.2: PageRank convergence increasing too fast
Table 3.2.3: PageRank convergence is not steady when epsilon is small
Table 3.2.4: HITS convergence (takes more time than PageRank)
Table 5.1: The number of iterations needed to converge
Chapter 1:
Peer to Peer and Ranking Problem
1.1. Peer to Peer
A peer-to-peer network, commonly abbreviated to P2P, is any distributed network
architecture composed of participants that make a portion of their resources (such as
processing power, disk storage or network bandwidth) directly available to other
network participants, without the need for central coordination instances (such as
servers or stable hosts). Peers are both suppliers and consumers of resources, in
contrast to the traditional client–server model where only servers supply and clients
consume.
Peer-to-peer was popularized by file sharing systems like Napster. File
sharing is the practice of distributing or providing access to digitally stored
information, such as computer programs, multimedia (audio, video), documents, or
electronic books. It may be implemented through a variety of storage, transmission,
and distribution models. Common methods of file sharing include manual
sharing using removable media, centralized computer file server installations
on computer networks, World Wide Web-based hyperlinked documents, and the use
of distributed peer-to-peer networking.
1.1.1. Peer to Peer overview
In its simplest form, a peer-to-peer (P2P) network is created when two or more
PCs are connected and share resources without going through a separate server. A P2P
network can be an ad hoc connection—a couple of computers connected via a
Universal Serial Bus to transfer files. A P2P network also can be a permanent
infrastructure that links a half-dozen computers in a small office over copper wires. Or
a P2P network can be a network on a much grander scale in which special protocols
and applications set up direct relationships among users over the Internet.
The initial use of P2P networks in business followed the deployment in the
early 1980s of free-standing PCs. The mini-mainframes of the day (e.g.
Fujitsu/ICL, IBM AS/400, IBM Mainframe, Unisys), which were used by over 16,000
organizations, served up word processing and other applications to dumb
terminals from a central computer and stored files on a central hard drive. In
contrast, the then-new PCs had self-contained hard drives and built-in CPUs. The smart
boxes also had onboard applications, which meant they could be deployed to desktops
and be useful without an umbilical cord linking them to a mainframe.
Shared file and printer access within a local area network may be based either
on a centralized file server or print server, sometimes denoted the client–server
paradigm, or on a decentralized model, denoted peer-to-peer network topology or
workgroup. In client–server communications, a client process on the local
user computer takes the initiative to start the communication, while a server process on
the remote file server or print server passively waits for requests to start a
communication session. In a peer-to-peer network, any computer can act as a server as
well as a client, which makes the model flexible and efficient.
In effect, every connected PC is at once a server and a client. There's no special
network operating system residing on a robust machine that supports special server-
side applications like directory services (specialized databases that control who has
access to what).
Image 1.1.1 : “Peer to Peer” means “connected together”
Unlike client-server networks, where network information is stored on a
centralized file server PC and made available to tens, hundreds, or thousands of
client PCs, the information stored across peer-to-peer networks is decentralized.
Because peer-to-peer PCs have their own hard disk drives that are accessible by all
computers, each PC acts as both a client (information requestor) and a server
(information provider). Image 1.1.1 shows three peer-to-peer workstations. Although
not capable of handling the same volume of information flow that a client-server
network might, all three computers can communicate directly with each other and share
one another's resources.
1.1.2 Architecture of P2P systems
Peer-to-Peer architecture distinguishes itself by its distribution of power and
function. Rather than concentrating power in the server, Peer-to-Peer models rely
on the power and bandwidth of participants. They form ad hoc connections between
nodes for sharing all kinds of information and files. Peer-to-Peer discards the
hierarchical notion of clients and servers and replaces it with equal peer nodes that
function simultaneously as clients and servers. This also discards the idea of a
central server, which exists in Client-Server architecture.
There are several classifications of Peer-to-Peer networks. These include
pure/hybrid and structured/unstructured Peer-to-Peer networks. Pure
P2P networks merge the roles of clients and servers as equals and do not provide a
central server for managing the network or a central router that forwards requests to
other networks.
Hybrid P2P models, on the other hand, do contain a central server that stores
information about peers and responds to requests for that information. In this
configuration, peers host the available resources, since the central server does not
provide this function.
Peers also make the central server aware of what resources they want to share and
make those resources available to peers that request them. Route terminals are used
as addresses, which are referenced by a set of indices to obtain an absolute address.
The structure of P2P networks is determined by the nature of the overlay
network, which consists of all participating peers as equal nodes. Nodes in an overlay
network are connected through virtual or logical links, each of which corresponds to a
path in the underlying network.
Essentially, overlay networks are networks built on top of other networks. Peer-
to-Peer networks are considered overlay networks because they are usually built on
top of the Internet. Structured P2P networks use a global protocol so that searches can
be routed to any peer by any other peer on the network.
To retrieve rare files, more structured overlay links are required. The most
common structured P2P network is the distributed hash table (DHT). DHTs are
decentralized distributed systems that store names and values. Any participating node
in the network can look up and retrieve values. Maintenance of the DHT mapping is
distributed among the nodes. The ownership of each file is assigned to a peer, but the
addition or deletion of peers or files doesn’t cause major disruptions. This makes
DHTs very scalable.
Unstructured P2P networks establish links more arbitrarily. To join, a peer only
has to copy the links of an existing node and then add its own links as it develops. To
find a desired file, however, the request must be flooded throughout the network. This
does not always return the desired result if the requested file is rare, because there
is no correlation between a peer and the content it holds. Flooding also increases
network traffic, slowing down responses and file sharing.
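The cost of flooding can be illustrated with a small simulation. The sketch below is not taken from any particular P2P client: the overlay `links`, the `holdings` table, and the hop limit `ttl` are hypothetical names chosen for illustration, and the whole network is simulated inside one process.

```python
from collections import deque

def flood_search(links, start, wanted, holdings, ttl=3):
    """Flood a query outward from `start`, forwarding to every neighbor
    until the hop limit `ttl` is used up.

    links    -- dict: peer -> list of neighbor peers (hypothetical overlay)
    holdings -- dict: peer -> set of file names stored at that peer
    Returns (peers that hold the file, number of query messages sent).
    """
    seen = {start}
    queue = deque([(start, ttl)])
    hits, messages = [], 0
    while queue:
        peer, hops_left = queue.popleft()
        if wanted in holdings.get(peer, set()):
            hits.append(peer)
        if hops_left == 0:
            continue
        for neighbor in links.get(peer, []):
            messages += 1              # every forwarded copy costs traffic
            if neighbor not in seen:   # duplicate queries are dropped on arrival
                seen.add(neighbor)
                queue.append((neighbor, hops_left - 1))
    return hits, messages

# Hypothetical five-peer overlay; only peer "e" holds the rare file.
links = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
         "d": ["b", "c", "e"], "e": ["d"]}
holdings = {"e": {"rare.mp3"}, "b": {"common.mp3"}}
```

With a hop limit of 2 the query from peer "a" never reaches "e", so the rare file is not found even though it exists; raising the limit to 3 finds it at the price of more messages.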
The primary advantage of P2P networks is that all clients contribute their
resources. These resources include computing power, bandwidth, and storage space. In
traditional Client-Server models there is a fixed number of servers, so the addition of
clients slows down network processing. In Peer-to-Peer models, as nodes are added,
system resources increase (contributed by the added nodes) to accommodate demand.
Comparisons
Client-Server architecture provides certain advantages over peer-to-peer
models. Client-Server models offer easier maintenance, security, and
administration. For example, encapsulation makes it possible for servers
to be repaired, upgraded, or replaced without clients being affected.
Encapsulation is the process by which an object can hide its data and methods
without revealing them to users. Also, because all data is stored on servers, the data is
more secure. Servers control access and ensure that only screened clients can access
and manipulate data. Again, since data is centralized on servers, updates occur on
the server and are then transmitted to clients as they request services.
In P2P models, updates must be applied and copied to peers in the network,
which requires a lot of labor and is prone to errors. However, Client-Server paradigms
often suffer from network traffic congestion. This is not a problem for P2P, since
network resources are in direct proportion to the number of peers in the network. Also,
Client-Server paradigms lack the robustness of P2P networks. Robustness refers to a
network’s ability to bounce back or continue functioning if one of the components
fails. If a server fails in Client-Server models, the request cannot be completed. In
P2P, a node can fail or abandon the request. Other nodes still have access to resources
needed to complete the download.
1.1.3. Distributed hash tables
Image 1.1.3: Distributed hash tables example
Distributed hash tables (DHTs) are a class of decentralized distributed systems
that provide a lookup service similar to a hash table: (key, value) pairs are stored in the
DHT, and any participating node can efficiently retrieve the value associated with a
given key. Responsibility for maintaining the mapping from keys to values is
distributed among the nodes, in such a way that a change in the set of participants
causes a minimal amount of disruption. This allows DHTs to scale to extremely large
numbers of nodes and to handle continual node arrivals, departures, and failures.
DHTs form an infrastructure that can be used to build peer-to-peer networks.
Notable distributed networks that use DHTs include BitTorrent's distributed tracker,
the Kad network, the Storm botnet, YaCy, and the Coral Content Distribution
Network.
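The lookup service described above can be sketched in a few lines. This is a single-process toy, not a real DHT implementation: the class name `ToyDHT`, the helper `ident`, and the 32-bit identifier width are assumptions made for illustration, and a real DHT would route requests between machines instead of consulting a shared list.

```python
import bisect
import hashlib

def ident(name, bits=32):
    """Hash a string to a `bits`-bit integer identifier (SHA-1, reduced)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class ToyDHT:
    """Each key lives on the first node whose identifier is at or after the
    key's identifier, wrapping around the end of the identifier space."""

    def __init__(self, node_names):
        self.ring = sorted((ident(n), n) for n in node_names)
        self.store = {n: {} for n in node_names}

    def owner(self, key):
        ids = [node_id for node_id, _ in self.ring]
        index = bisect.bisect_left(ids, ident(key)) % len(self.ring)
        return self.ring[index][1]

    def put(self, key, value):
        self.store[self.owner(key)][key] = value

    def get(self, key):
        return self.store[self.owner(key)].get(key)

# Hypothetical node addresses and keys.
names = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
keys = ["file%d" % i for i in range(200)]
before = ToyDHT(names)
after = ToyDHT(names + ["10.0.0.5"])
moved = [k for k in keys if before.owner(k) != after.owner(k)]
```

Every key in `moved` ends up on the newly added node, and all other keys keep their owner, which is the "minimal amount of disruption" property in miniature.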
1.2. Ranking in Peer to Peer networks
1.2.1 Introduction
This thesis discusses techniques for performing distributed page ranking on top
of structured peer-to-peer networks. Distributed page ranking is needed because the
size of the web grows at a remarkable speed and centralized page ranking is not
scalable. Open System PageRank is presented, based on the traditional PageRank used
by Google.
Image 1.2.1: The system must have a ranking engine to find the one
We then propose some distributed page ranking algorithms, partially prove their
convergence, and discuss some of their interesting properties. Indirect transmission
is introduced to reduce the communication overhead between page rankers and to
achieve scalable communication. The relationship between convergence time and
consumed bandwidth is also discussed. Finally, we verify some of these discussions by
experiments based on real data sets.
1.2.2. Ranking Roles
Page ranking based on link structure, which determines the “importance” of web
pages, has become an important technique in search engines. In particular, the HITS
algorithm maintains a hub score and an authority score for each page, computed from
the linkage relationships of pages in the hyperlinked environment. The PageRank
algorithm used by Google determines the “scores” of web pages by iteratively
computing the eigenvector of a matrix.
As the size of the web grows, it becomes harder and harder for existing search
engines to cover the entire web. We need distributed search engines that are scalable
with respect to the number of pages and the number of users. In a distributed search
engine, page ranking is not only needed to improve query results, as in its
centralized counterpart, but should also be performed in a distributed manner for
scalability and availability. A straightforward way to achieve distributed page
ranking is to scale the HITS or PageRank algorithm to a distributed environment. But
this is not trivial. Both HITS and PageRank are iterative algorithms. Because each
iteration step needs the computation results of the previous step, synchronous
operation is needed. However, synchronous communication is hard to achieve in a
widely spread distributed environment. Moreover, page partitioning and communication
overhead must be considered carefully when performing distributed page ranking.
Structured peer-to-peer overlay networks have recently gained popularity as a
platform for the construction of self-organized, resilient, large-scale distributed
systems. In this thesis, we try to perform effective page ranking on top of
structured peer-to-peer networks. We first propose some distributed page ranking
algorithms based on Google's PageRank and present some interesting properties and
results about them. Because communication overhead is more important than CPU and
memory usage in distributed page ranking, we then discuss page partitioning
strategies and ideas for alleviating communication overhead. Through this work, the
thesis makes the following contributions:
• We provide two distributed page ranking algorithms, partially prove their
convergence, and verify their characteristics using a real data set.
• We identify the major issues and problems related to distributed page
ranking on top of structured P2P networks.
Chapter 2:
Foundation of Ranking on DHT Peer to Peer Networks
2.1. Chord Protocol
Using the Chord lookup protocol, node keys are arranged in a circle.
The circle cannot have more than $2^m$ nodes. The circle can have IDs/keys
ranging from 0 to $2^m - 1$.
IDs and keys are assigned an m-bit identifier using consistent hashing. The
SHA-1 algorithm is the base hashing function for consistent hashing. Consistent
hashing is integral to the robustness and performance of Chord because both keys and
IDs (IP addresses) are uniformly distributed and in the same identifier space.
Consistent hashing is also necessary to let nodes join and leave the network without
disruption.
Each node has a successor and a predecessor. The successor to a node (or key)
is the next node (key) in the identifier circle in a clockwise direction. The predecessor
is counter-clockwise. If there is a node for each possible ID, the successor of node 2 is
node 3, and the predecessor of node 1 is node 0; however, normally there are holes in
the sequence. For example, the successor of node 153 may be node 167 (and nodes
from 154 to 166 will not exist); in this case, the predecessor of node 167 will be node
153.
Since the successor (or predecessor) node may disappear from the network
(because of failure or departure), each node records a whole segment of the circle
adjacent to it, i.e. the r nodes preceding it and the r nodes following it. This list
gives a high probability that a node is able to correctly locate its successor or
predecessor, even if the network in question suffers from a high failure rate.
Image 2.1: A 16-node Chord network example
The Chord protocol is one solution for connecting the peers of a P2P network.
Chord consistently maps a key onto a node. Both keys and nodes are assigned an m-bit
identifier. For nodes, this identifier is a hash of the node's IP address. For keys, this
identifier is a hash of a keyword, such as a file name. It is not uncommon to use the
words "nodes" and "keys" to refer to these identifiers, rather than actual nodes or keys.
There are many other algorithms in use by P2P, but this is a simple and common
approach.
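The identifier circle and the successor and predecessor relations described in this section can be sketched as follows. This is a simplified illustration, not the Chord protocol itself: every function here sees the full list of node identifiers, whereas real Chord nodes only know part of the ring and route lookups between each other. The names `chord_id`, `successor`, `next_node`, `prev_node`, and the choice `M = 8` are assumptions made for the example.

```python
import bisect
import hashlib

M = 8  # identifier bits, so the circle holds IDs 0 .. 2**M - 1

def chord_id(text, m=M):
    """m-bit identifier from SHA-1, usable for node addresses and for keys."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids, ident):
    """Node responsible for `ident`: the first node at or after it,
    moving clockwise and wrapping past 2**M - 1 back to 0."""
    ring = sorted(node_ids)
    return ring[bisect.bisect_left(ring, ident) % len(ring)]

def next_node(node_ids, node):
    """Successor of a node: the next node strictly clockwise from it."""
    ring = sorted(node_ids)
    return ring[bisect.bisect_right(ring, node) % len(ring)]

def prev_node(node_ids, node):
    """Predecessor of a node: the next node counter-clockwise from it."""
    ring = sorted(node_ids)
    return ring[(bisect.bisect_left(ring, node) - 1) % len(ring)]

# Hypothetical ring containing the two nodes used in the text's example.
nodes = [20, 153, 167, 240]
```

On this ring `next_node(nodes, 153)` is 167 and `prev_node(nodes, 167)` is 153, matching the example above, while a key with identifier 250 wraps around and is assigned to node 20.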
2.2. PageRank
2.2.1 Description
PageRank is a link analysis algorithm, named after Google co-founder Larry
Page and used by the Google Internet search engine, that assigns a numerical
weighting to each element of a hyperlinked set of documents, such as the World Wide
Web, with the purpose of "measuring" its relative importance within the set. The
algorithm may be applied to any collection of entities with reciprocal quotations and
references. The numerical weight that it assigns to any given element E is also called
the PageRank of E and is denoted by PR(E).
2.2.2 Algorithm
PageRank is a probability distribution used to represent the likelihood that a
person randomly clicking on links will arrive at any particular page. PageRank can be
calculated for collections of documents of any size. Several research papers assume
that the distribution is evenly divided between all documents in the collection at the
beginning of the computational process. The PageRank computation requires several
passes, called "iterations", through the collection to adjust approximate PageRank
values so that they more closely reflect the theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability
is commonly expressed as a "50% chance" of something happening. Hence, a PageRank
of 0.5 means there is a 50% chance that a person clicking on a random link will be
directed to the document with the 0.5 PageRank.
Simplified algorithm
Image 2.2.2: How PageRank Works
Assume a small universe of four web pages: A, B, C and D. The initial
approximation of PageRank is divided evenly between these four documents. Hence,
each document begins with an estimated PageRank of 0.25.
In the original form of PageRank, initial values were simply 1. This meant that
the sum over all pages was the total number of pages on the web. Later versions of
PageRank (see the formulas below) assume a probability distribution between 0 and 1.
Here a simple probability distribution is used, hence the initial value of 0.25.
If pages B, C and D each link only to A, they each confer 0.25 PageRank to A.
All PageRank PR( ) in this simplistic system thus gathers at A, because all links
point to A:
$$PR(A) = PR(B) + PR(C) + PR(D).$$
This is 0.75.
Now suppose page B also has a link to page C, and page D has links to all
three pages. The value of the link-votes is divided among all the outbound links on a
page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page
C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083):
$$PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3}.$$
In other words, the PageRank conferred by an outbound link is equal to the
document's own PageRank score divided by the normalized number of outbound links
L( ) (it is assumed that links to a specific URL count only once per document):
$$PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}.$$
In the general case, the PageRank value for any page u can be expressed as:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)},$$
i.e. the PageRank value for a page u is dependent on the PageRank values for
each page v out of the set $B_u$ (this set contains all pages linking to page u),
divided by the number L(v) of links from page v.
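The four-page example can be checked numerically. The sketch below applies one pass of the simplified formula, without damping, to the link structure described above (B links to A and C, C links to A, D links to A, B and C). The function name `simple_pagerank_step` and the dictionary layout are assumptions made for illustration.

```python
def simple_pagerank_step(links, rank):
    """One pass of PR(u) = sum over v in B_u of PR(v) / L(v), with no damping.

    links -- dict: page -> list of pages it links to
    rank  -- dict: page -> current PageRank estimate
    """
    new_rank = {page: 0.0 for page in rank}
    for v, targets in links.items():
        for u in targets:
            # Page v splits its current score evenly over its outbound links.
            new_rank[u] += rank[v] / len(targets)
    return new_rank

# Link structure from the example; A has no outbound links here.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
rank = {page: 0.25 for page in links}
step = simple_pagerank_step(links, rank)
```

After this single pass, A holds 0.125 + 0.25 + 0.25/3, roughly 0.458. Because A has no outbound links in this example, its own 0.25 is not passed on, so the scores no longer sum to 1; this is the sink problem that the damped formulation addresses.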
Damping factor
The PageRank theory holds that even an imaginary surfer who is randomly
clicking on links will eventually stop clicking. The probability, at any step, that the
person will continue is the damping factor d. Various studies have tested different
damping factors, but it is generally assumed that the damping factor will be set
around 0.85.
The damping factor is subtracted from 1 (and in some variations of the
algorithm, the result is divided by the number of documents N in the collection), and
this term is then added to the product of the damping factor and the sum of the
incoming PageRank scores. That is,
$$PR(A) = \frac{1-d}{N} + d\left(\frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots\right).$$
So any page's PageRank is derived in large part from the PageRanks of other
pages. The damping factor adjusts the derived value downward. The original paper,
however, gave the following formula, which has led to some confusion:
$$PR(A) = 1 - d + d\left(\frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots\right).$$
The difference between them is that the PageRank values in the first formula
sum to one, while in the second formula each PageRank is multiplied by N and the sum
becomes N. A statement in Page and Brin's paper that "the sum of all PageRanks is
one" and claims by other Google employees support the first variant of the formula
above.
Google recalculates PageRank scores each time it crawls the Web and rebuilds its
index. As Google increases the number of documents in its collection, the initial
approximation of PageRank decreases for all documents.
The formula uses a model of a random surfer who gets bored after several clicks
and switches to a random page. The PageRank value of a page reflects the chance that
the random surfer will land on that page by clicking on a link. It can be understood
as a Markov chain in which the states are pages and the transitions, all equally
probable, are the links between pages.
If a page has no links to other pages, it becomes a sink and therefore terminates
the random surfing process. However, the solution is quite simple. If the
random surfer arrives at a sink page, it picks another URL at random and continues
surfing.
When calculating PageRank, pages with no outbound links are assumed to link out
to all other pages in the collection. Their PageRank scores are therefore divided
evenly among all other pages. In other words, to be fair to pages that are not sinks,
these random transitions are added to all nodes in the Web, with a residual
probability of usually d = 0.85, estimated from the frequency with which an average
surfer uses the browser's bookmark feature.
Therefore, the equation is as follows:
$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$
where $p_1, p_2, \ldots, p_N$ are the pages under consideration, $M(p_i)$ is the set
of pages that link to $p_i$, $L(p_j)$ is the number of outbound links on page $p_j$,
and $N$ is the total number of pages.
The PageRank values are the entries of the dominant eigenvector of
the modified adjacency matrix. This makes PageRank a particularly elegant metric: the
eigenvector is
$$\mathbf{R} = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix}$$
where R is the solution of the equation
$$\mathbf{R} = \begin{bmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix} + d \begin{bmatrix} \ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\ \ell(p_2,p_1) & \ddots & & \vdots \\ \vdots & & \ell(p_i,p_j) & \\ \ell(p_N,p_1) & \cdots & & \ell(p_N,p_N) \end{bmatrix} \mathbf{R}$$
where the adjacency function $\ell(p_i, p_j)$ is 0 if page $p_j$ does not link to
$p_i$, and normalized such that, for each $j$,
$$\sum_{i=1}^{N} \ell(p_i, p_j) = 1,$$
i.e. the elements of each column sum up to 1 (for more details see
the computation section below). This is a variant of the eigenvector centrality measure
used commonly in network analysis.
Because of the large eigengap of the modified adjacency matrix above, the
values of the PageRank eigenvector are fast to approximate (only a few iterations are
needed).
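The damped formula can be evaluated by repeated substitution (power iteration). The sketch below is one possible centralized implementation, not the distributed algorithm developed later in this thesis. It assumes that the score of a sink page is spread evenly over all N pages, including the sink itself, so that the scores keep summing to one; the function name `pagerank`, the tolerance `tol`, and the iteration cap `max_iter` are illustrative choices.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Iterate PR(p_i) = (1 - d)/N + d * sum(PR(p_j) / L(p_j)) until the
    total change between two passes drops below `tol`.

    links -- dict: page -> list of pages it links to (every page is a key)
    Returns (dict of scores, number of passes used).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        # Assumption: a sink's score is shared evenly among all N pages.
        sink_mass = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - d) / n + d * sink_mass / n for p in pages}
        for v in pages:
            for u in links[v]:
                new_rank[u] += d * rank[v] / len(links[v])
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank, iteration

# The four-page example: B -> A, C; C -> A; D -> A, B, C; A is a sink.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
scores, passes = pagerank(links)
```

On the four-page example the scores sum to one and are ordered A > C > B > D, as expected, since A receives links from every other page and D receives none.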
As a result of Markov theory, it can be shown that the PageRank of a page is the
probability of being at that page after many clicks. This happens to equal $t^{-1}$,
where $t$ is the expected number of clicks (or random jumps) required to get from the
page back to itself.
The main disadvantage is that PageRank favors older pages, because a new page,
even a very good one, will not have many links unless it is part of an existing site
(a site being a densely connected set of pages, such as Wikipedia). The Google
Directory (itself a derivative of the Open Directory Project) allows users to see
results sorted by PageRank within categories. The Google Directory is the only
service offered by Google where PageRank directly determines display order. In
Google's other search services (e.g. its main Web search), PageRank is used to weight
the relevance scores of pages shown in search results. Several strategies have been
proposed to accelerate the computation of PageRank. Various strategies to manipulate
PageRank have also been employed in concerted efforts to improve search result
rankings and to monetize advertising links.
These strategies have severely affected the reliability of the PageRank concept,
which seeks to determine which documents are actually highly valued by the Web
community. Google is known to penalize link farms and other schemes designed to
artificially inflate PageRank. In December 2007, Google started actively penalizing
sites selling paid text links. How Google identifies link farms and other PageRank
manipulation tools is among Google's trade secrets.
2.3 Distributed Computing
Distributed computing is the field of computer science that studies distributed
systems. A distributed system consists of multiple autonomous computers that
communicate through a computer network. The computers interact with each other in
order to achieve a common goal. A computer program that runs in a distributed system
is called a distributed program, and distributed programming is the process of writing
such programs. Distributed computing also refers to the use of distributed systems to
solve computational problems.
In distributed computing, a problem is divided into many tasks, each of which is
solved by one computer.
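The idea of dividing a problem into tasks can be sketched in a few lines. In the illustration below, threads inside one process stand in for the autonomous computers of a distributed system, which is a simplifying assumption; the helper names `split_into_tasks`, `solve_task`, and `distributed_sum_of_squares` are hypothetical and chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(data, n_tasks):
    """Divide the input into n_tasks contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def solve_task(chunk):
    """Work done by one 'computer': a partial sum of squares."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, n_workers=4):
    """Split the problem, solve each task on a worker, and combine the results."""
    tasks = split_into_tasks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(solve_task, tasks))
    return sum(partial_results)
```

In a real distributed system the workers would exchange messages over a network rather than share memory, but the split, solve, and combine structure is the same.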
Image 2.3: Distributed Nodes Graph example
Many tasks that we would like to automate by using a computer are of the
question-answer type: we ask a question and the computer should produce an answer. In
theoretical computer science, such tasks are called computational problems. Formally,
a computational problem consists of instances together with a solution for each
instance. Instances are the questions that we can ask, and solutions are the desired
answers to these questions.
Theoretical computer science seeks to understand which computational problems can be
solved by using a computer (computability theory) and how efficiently (computational
complexity theory). Traditionally, a problem is said to be solvable by a computer if
we can design an algorithm that produces a correct solution for every instance. Such
an algorithm can be implemented as a computer program that runs on a general-purpose
computer: the program reads a problem instance from its input, performs some
computation, and produces the solution as its output.
Formalisms such as random access machines or universal Turing machines can be used as
abstract models of a sequential general-purpose computer executing such an algorithm.
The field of concurrent and distributed computing studies similar questions for the
case of multiple computers, or of a computer that executes a network of interacting
processes: which computational problems can be solved in such a network, and how
efficiently? However, it is not at all obvious what "solving a problem" means in the
case of a concurrent or distributed system.
2.4 Computing PageRank in a distributed system
Recent research on distributed systems has addressed the case in which the web graph
is partitioned into disjoint websites or domains. The web is modeled as consisting of
many disjoint web servers. The hyperlinks of the web are divided into two categories:
intra-server links and inter-server links. Intra-server links are links between pages
on the same server, and they are used to compute a local PageRank vector on each
server. Inter-server links are links between pages on different servers, and they are
used to compute ServerRank. ServerRank measures the relative importance of the
different web servers. The results from the many web servers are finally merged by the
server that submitted the query to produce a ranked list of hyperlinks.
A ranking algebra has been proposed that handles rankings at different levels of
granularity; it can also be used to aggregate local rankings and site rankings into a
global ranking. In work on approximating PageRank in a fully decentralized system,
each peer is autonomous, and the peers may overlap. In the proposed JXP algorithm,
each peer computes local PageRank scores, then meets other peers at random, gradually
increases its knowledge of the global web graph by exchanging information, and then
recomputes its local PageRank scores.
This process of meeting and recomputation is repeated until the peer has collected
enough information. If the peers meet and exchange information a sufficient number of
times, the JXP scores eventually converge to the true global PageRank scores. The
out-degree of each page in the global graph is assumed to be known. However, the focus
of all these works is on approximating the ranking of the global graph, whether in a
centralized or a distributed system.
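The split of hyperlinks into the two categories described above can be sketched as follows. This is an illustration under assumptions, not the method of the cited work: a page's server is taken to be the host part of its URL, and the function names `server_of` and `partition_links` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def server_of(url):
    """Assumption: a page's server is identified by the host part of its URL."""
    return urlsplit(url).netloc


def partition_links(links):
    """Split (source_url, target_url) pairs into the two link categories.

    Returns (intra, inter): intra maps each server to its page-level links,
    which would feed the local PageRank computation on that server; inter is
    a list of (source_server, target_server) pairs, which would feed the
    ServerRank computation.
    """
    intra = defaultdict(list)
    inter = []
    for src, dst in links:
        s, t = server_of(src), server_of(dst)
        if s == t:
            intra[s].append((src, dst))
        else:
            inter.append((s, t))
    return dict(intra), inter
```

Each server only needs its own entry of `intra` to compute a local ranking, while the much smaller `inter` list is enough to rank the servers against each other.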
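The meet-and-recompute loop can be illustrated with a deliberately simplified sketch. It is not the JXP algorithm itself: here peers simply merge the link lists they know when they meet, and the way JXP summarizes the unknown part of the graph is not modeled. The `Peer` class, `run_meetings`, and the `pagerank` helper with its parameters are all assumptions made for this example. Each known page is stored with its full out-link list, which reflects the assumption that every page's out-degree is known.

```python
import random


def pagerank(links, d=0.85, iterations=100):
    """Plain power iteration on the part of the graph a peer currently knows."""
    pages = sorted(set(links) | {q for ts in links.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank of pages without known out-links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            for q in links.get(p, []):
                new_rank[q] += d * rank[p] / len(links[p])
        rank = new_rank
    return rank


class Peer:
    def __init__(self, local_links):
        # page -> complete list of out-links, so its out-degree is exact
        self.links = dict(local_links)
        self.scores = pagerank(self.links)

    def meet(self, other):
        """Exchange knowledge with another peer, then recompute locally."""
        merged = {**self.links, **other.links}
        self.links = dict(merged)
        other.links = dict(merged)
        self.scores = pagerank(self.links)
        other.scores = pagerank(other.links)


def run_meetings(peers, n_meetings, seed=0):
    """Let randomly chosen pairs of peers meet n_meetings times."""
    rng = random.Random(seed)
    for _ in range(n_meetings):
        a, b = rng.sample(peers, 2)
        a.meet(b)
```

In this simplified setting a peer's scores equal the global PageRank scores once it has learned every page's out-links, which mirrors the convergence claim only loosely.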
2.5. tf-idf
The tf–idf weight (term frequency–inverse document frequency) is a weight
often used in
information retrieval and text mining. This weight is a statistical measure
used to evaluate how important a word is to a document in a collection or corpus. The
importance increases proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus. Variations of the tf–
idf weighting scheme are often used by search engines as a central tool in scoring and
ranking a document's relevance given a user query.
One of the simplest ranking functions is computed by summing the tf-idf weights of
the query terms; many more sophisticated ranking functions are variants of this simple
model.
Motivation
Suppose we have a set of English text documents and wish to determine which
document is most relevant to the query "the brown cow." A simple way to start out is
by eliminating documents that do not contain all three words "the," "brown," and
"cow," but this still leaves many documents. To further distinguish them, we might
count the number of times each term occurs in each document and sum them all
together; the number of times a term occurs in a document is called its term frequency.
However, because the term "the" is so common, this will tend to incorrectly emphasize
documents which happen to use the word "the" more, without giving enough weight to
the more meaningful terms "brown" and "cow". Also the term "the" is not a good
keyword to distinguish relevant and non-relevant documents and terms like "brown"
and "cow" that occur rarely are good keywords to distinguish relevant documents from
the non-relevant documents. Hence an inverse document frequency factor is
incorporated which diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely.
Mathematical details
The term count in the given document is simply the number of times a given
term appears in that document. This count is usually normalized to prevent a bias
towards longer documents (which may have a higher term count regardless of the
actual importance of that term in the document) to give a measure of the importance of
the term $t_i$ within the particular document $d_j$. Thus we have the term frequency,
defined as follows.

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the considered term ($t_i$) in document
$d_j$, and the denominator is the sum of the number of occurrences of all terms in
document $d_j$.
The inverse document frequency is a measure of the general importance of the
term (obtained by dividing the total number of documents by the number of
documents containing the term, and then taking the logarithm of that quotient).
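
$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

with
• $|D|$: total number of documents in the corpus
• $|\{d : t_i \in d\}|$: number of documents where the term $t_i$ appears (that is,
$n_{i,j} \neq 0$). If the term is not in the corpus, this will lead to a division by
zero. It is therefore common to use $1 + |\{d : t_i \in d\}|$.
Then

$$(\mathrm{tf\text{-}idf})_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$
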
A high weight in tf–idf is reached by a high term frequency (in the given document)
and a low document frequency of the term in the whole collection of documents; the
weights hence tend to filter out common terms. Without the adjusted denominator, the
ratio inside the logarithm is at least 1, so the tf-idf value of a term is always
greater than or equal to zero.
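The definitions above can be checked on the "the brown cow" example with a small sketch. The function names, the whitespace tokenization, and the three toy documents are assumptions made for illustration. Returning an idf of 0 for a term that appears in no document is also an assumption, used here in place of the adjusted denominator.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """tf: occurrences of the term divided by the total number of terms in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def inverse_document_frequency(term, corpus):
    """idf: log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: avoid division by zero for unseen terms
    return math.log(len(corpus) / containing)


def tfidf_score(query, doc_tokens, corpus):
    """Simple ranking function: sum of tf-idf over the query terms."""
    return sum(
        term_frequency(t, doc_tokens) * inverse_document_frequency(t, corpus)
        for t in query
    )


def raw_count_score(query, doc_tokens):
    """Baseline from the motivation: sum of raw term counts."""
    return sum(doc_tokens.count(t) for t in query)


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the brown cow eats the grass".split(),
    "the dog chased the cat".split(),
    "the the the the the weather".split(),
]
query = ["the", "brown", "cow"]

best_by_count = max(range(len(corpus)), key=lambda j: raw_count_score(query, corpus[j]))
best_by_tfidf = max(range(len(corpus)), key=lambda j: tfidf_score(query, corpus[j], corpus))
```

Because "the" occurs in every document, its idf is zero and it no longer influences the ranking: raw counts prefer the third document, while the summed tf-idf score prefers the first, which is the only one mentioning "brown" and "cow".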