MULTI-DIMENSIONAL RANGE QUERY EVALUATION
FOR DISTRIBUTED HASH TABLE BASED
PEER-TO-PEER SYSTEMS

ZHANG GONG

NATIONAL UNIVERSITY OF SINGAPORE
2004


MULTI-DIMENSIONAL RANGE QUERY EVALUATION
FOR DISTRIBUTED HASH TABLE BASED
PEER-TO-PEER SYSTEMS

ZHANG GONG
(B. Sci., Xi'an Jiaotong University, China)

A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2004

Acknowledgement
First, I would like to express my heartfelt thanks to my supervisor, Dr. Gary S. H. Tan,
for his supervision throughout my master's studies. My sincere gratitude also goes to
Associate Professor Kian-Lee Tan for his advice and constant guidance during all
phases of this thesis. They have conscientiously provided me with careful guidance at
every stage of my research, offered ideas whenever I ran into difficulties, and
constructively corrected my mistakes in the course of my work. Participating in their
projects has given me many opportunities to develop my research and analytical
abilities. Their support enabled me to learn and to write what is presented in this
thesis. In addition, they have given me constructive suggestions on my attitude to
work, which are helpful to my career development.

I would also like to thank Ng Yew Kwong, Hu Yu and Gozali Johan Prawira, with whom
I enjoyed discussing P2P systems and programming questions. My sincere appreciation
also goes to my lab fellows, Ameya Virkar, Liu Ming and Liu Peng, for their generous
help both in my research and in my life, and for the pleasant and friendly environment
of the computer system lab.

Last but not least, I would like to convey my gratitude to the thesis examiners for
taking time from their busy schedules to assess my research work.

Table of Contents
Chapter 1 Introduction
1.1 DHT-based P2P Systems
1.2 Complex Query Over DHT-based P2P systems
1.2.1 Marrying Database and P2P paradigm
1.2.2 Complex Query over DHT-based P2P systems
1.3 Multi-dimensional Range Query Evaluation
1.3.1 Motivations: Three Complex Queries
1.3.2 Multi-dimensional range query processing
1.4 Research Contributions
1.5 Organization of the Thesis

Chapter 2 Literature Review
2.1 Related work
2.2 One-dimensional Indexing for P2P designs
2.3 Multi-dimensional Indexing for P2P system
2.4 Multi-dimensional Indexing Using Hilbert space filling curve

Chapter 3 System Model
3.1 Problem Formulation
3.1.1 Data Management Process in DHT-based P2P Systems
3.1.2 Problem Formulation
3.2 Design Principles
3.2.1 General Principles
3.2.2 Design Goals
3.3 System Model
3.3.1 Three-tier Architecture
3.3.2 Application layer
3.3.3 Multi-dimensional Indexing layer
3.3.4 Two-fold Property of Partitioning Manner
3.3.5 File systems

Chapter 4 Multi-dimensional Range Query Evaluation
4.1 Multi-dimensional Single Point Query Evaluation
4.2 Multi-dimensional Range Query Evaluation
4.3 Zone Maintenance
4.4 System Performance Evaluation
4.4.1 Multi-dimensional Single Point Query
4.4.2 Multi-dimensional Range Query Evaluation
4.5 Selectivity Factor
4.6 Design Improvements
4.6.1 Parallelism Strategy
4.7 Comparison with Naïve Flooding Method

Chapter 5 Conclusions

References

Appendix A Sample code for Mapping Multi-dimensional Point into Hilbert sequence number
Appendix B Join Time Load Balance Code
Appendix C Overlay Network’s Transit-Stub Topologies Building

Summary
In this thesis, we investigate how to enable current DHT-based P2P systems to support
multi-dimensional range queries, as a step towards the long-term goal of providing
complex query facilities in P2P systems. We adopt a multi-dimensional coordinate
space model whose points are ordered by a Hilbert space-filling curve. This ordering
makes range query processing in a multi-dimensional coordinate space possible. The
space is partitioned in a two-fold manner: it is divided into zones, and it is also
divided as a single one-directional sequence. This extends the DHT layer's efficient
exact-match lookup to higher dimensions. We propose a relay-race-like query
scheme and introduce strategies to improve its performance, such as parallelism and
lookup during processing.
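
To make the role of the Hilbert ordering concrete, the following is a minimal
two-dimensional sketch written for illustration; it is not the code of Appendix A, and
the function names (hilbert_index, range_to_intervals) and the small grid size are
assumptions. It uses the standard bitwise point-to-index conversion for a 2^order by
2^order grid and then shows how a rectangular range query can be rewritten as a few
contiguous intervals of Hilbert sequence numbers.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so that the sub-curve has the standard orientation."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def hilbert_index(order, x, y):
    """Map grid cell (x, y) to its position along a 2-D Hilbert curve.

    The grid has 2**order cells per side; the result lies in [0, 4**order).
    """
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


def range_to_intervals(order, x_lo, x_hi, y_lo, y_hi):
    """Rewrite an inclusive rectangle as sorted, merged intervals of Hilbert indices.

    Enumerating every cell is only practical for small grids; it is used here
    purely to keep the illustration short.
    """
    indices = sorted(
        hilbert_index(order, x, y)
        for x in range(x_lo, x_hi + 1)
        for y in range(y_lo, y_hi + 1)
    )
    intervals = []
    for d in indices:
        if intervals and d == intervals[-1][1] + 1:
            intervals[-1][1] = d          # extend the current run
        else:
            intervals.append([d, d])      # start a new run
    return [tuple(run) for run in intervals]


if __name__ == "__main__":
    print(range_to_intervals(2, 0, 1, 0, 1))  # [(0, 3)]
    print(range_to_intervals(2, 0, 1, 1, 2))  # [(2, 4), (7, 7)]
```

Because neighbouring sequence numbers always belong to neighbouring cells, a
rectangle tends to break into a small number of runs, and each run can be handled as
a one-dimensional range.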

The performance of the system is evaluated via simulation. Several metrics are
explored, such as the number of nodes visited, query latency and per-hop latency.
The evaluation shows that the proposed model not only preserves scalability but also
processes multi-dimensional range queries at bounded cost. The system can be
incorporated into computational grids to enhance their information discovery capability.

Compared with current methods of processing range queries in P2P systems, our
method, as a general method, has the following advantages:

1 Existing methods are all one-dimensional. Our method provides a general way to
answer multi-dimensional range queries, of which a one-dimensional query is a special
case. To our knowledge, this is the first approach that supports multi-dimensional
range query processing.

2 Because our method is oriented to multi-dimensional range queries, it avoids
expensive join operations. With current one-dimensional methods, finding a resource
specified by several attributes (dimensions) requires that an independent index
infrastructure first be built for each attribute. The information infrastructure then
queries the appropriate index for each attribute present in the query and
concatenates the results in a database-like “join” operation [13]. Our method
handles multiple attributes with a single DHT-based system.

3 Existing methods process range queries by “flooding”. Our method processes
multi-dimensional range queries by lookup. Hence, the fine properties of DHT-based
P2P systems are extended: a deterministic structure and performance guarantees in
terms of logical hops are provided for processing multi-dimensional queries, and the
high network overhead of flooding is avoided.

4 Our method answers multi-dimensional range queries in a deterministic and
complete manner.

List of Figures
Figure 1.1 Napster: Centralized P2P systems
Figure 1.2 Routing in CAN
Figure 2.1 Hilbert Curve in Two-dimensional Space
Figure 3.1 Three-tier Architecture
Figure 3.2 Two-dimensional space and range query region
Figure 3.3 System Model
Figure 3.4 P2P system with 7 peers
Figure 3.5 Zone partitioning
Figure 4.1 Query evaluation
Figure 4.2 Space partitioning
Figure 4.3 Multi-dimensional single point query performance
Figure 4.4 Average Path length
Figure 4.5 Query response time
Figure 4.6 Effect of parallelism strategy on response time
Figure 4.7 Extra communication overheads introduced by parallelism strategy
Figure 4.8 Comparison of the two schemes on the aspect of the number of visited nodes
Figure 4.9 Comparison of the two schemes on the user perceived response time

List of Tables
Table 1.1 Relational table distributed into the P2P system
Table 1.2 The second table stored in the P2P system

List of Queries
Query 1.1
Query 1.2
Query 1.3
Query 1.4
Query 1.5
Query 3.1
Query 3.2
Query 3.3
Query 4.1
Query 4.2

Chapter 1

Introduction

Peer-to-peer (P2P) systems are among the fastest growing technologies in today’s
computing. In such systems, content is stored at peer nodes, without centralized
control or planning. Data can thus be exchanged directly between peers, in contrast
to the traditional client-server model. Their self-organization, fault tolerance and
scalability have made P2P systems develop rapidly in recent years.
However, there are two serious limitations for current P2P system development: poor
scalability and weak semantics. Scalability has always been a concern for unstructured
P2P designs, from the initial centralized design of Napster [1], to the completely
decentralized design of Gnutella [2], to the hierarchical design of FastTrack [3], until
structured P2P systems were proposed.

Structured P2P designs provide a Distributed Hash Table (DHT), so they are also called
DHT-based P2P systems. Scalability is largely solved in DHT-based P2P networks,
because a lookup can be resolved in log n (or n^α for small α) overlay routing hops in
an overlay P2P network with n hosts. However, the query facility is further
impoverished in DHT-based P2P systems, because a hash table efficiently supports only
the lookup operation, which is one of the most fundamental and simplistic query forms.

Traditional database research prides itself on its most notable features: powerful
relational query facilities, a strict relational model and reliable data management.
These features are missing in the distributed environment of P2P systems. How to bring
the rich query processing facilities of databases into the widely distributed
environment of P2P networks poses an important challenge for both the database
community and the P2P community.

In this thesis, we present one important step towards the long-term plan of marrying
the powerful query facilities of traditional databases with P2P networks:
multi-dimensional range query evaluation in DHT-based P2P systems.

This chapter is organized as follows. First, we briefly introduce DHT-based P2P
designs. Next, we outline the need for building complex query facilities over
DHT-based P2P systems. Then, we explore the motivation for processing
multi-dimensional range queries in P2P systems and present an overview of the
proposed method. Finally, we summarize the research contributions.


1.1 DHT-based P2P Systems

Unstructured P2P systems were the first to appear on the Internet and produced many
hugely successful, popular deployments such as Napster [1], Gnutella [2] and BestPeer
[4]. Napster was introduced in mid-1999 and, by December 2000, the software had
been downloaded by 50 million users, making it the fastest growing application on the
Internet. It is envisioned that P2P will transform the Internet from a shared bandwidth
infrastructure into a combined bandwidth infrastructure, and that P2P systems may lead
to new content distribution models for applications such as software distribution and
file sharing.

However, current P2P systems have two serious limitations: poor scalability and weak
semantics. For example, in Napster a central server stores the index of all the files
available within the user community. Although the actual file transmission occurs
between user machines, every transmission must start with a request to the centralized
index server. It is expensive to scale the central directory. Moreover, because the
file directory is concentrated on a few central machines, the system has a single point
of failure. Gnutella [2] is a completely decentralized P2P design. It processes queries
by flooding. Although this reduces the risk of a single point of failure, queries are
processed inefficiently and incompletely. Flooding on every request is clearly not
scalable and, because the flooding has to be curtailed at some point in practice, the
system may fail to find content that is actually in the system.

Figure 1.1 Napster: Centralized P2P systems (diagram labels: Napster Server; Client;
Request & result; Retrieve File)

The emergence of structured P2P designs largely resolves the scaling problem.
Structured P2P systems build on the idea of the Distributed Hash Table (DHT). The
underlying networking infrastructure is a logical network, called an overlay network.
Pairs of the form (key, value) are stored among the peers. Structured P2P systems have
different implementations; however, most of them support the basic operations
put(key, value) and get(key). get(key) is the main operation that a DHT-based P2P
system offers: given a key, such as a file name, the system looks up the values stored
under exactly that key. In general, for an overlay network with n peers, this lookup
operation can be resolved in log n (or n^α for small α) overlay routing hops.

Let us illustrate structured P2P designs with one representative: CAN [5], the Content
Addressable Network. As illustrated in Figure 1.2, CAN has a d-dimensional virtual
coordinate space. At any instant, the entire coordinate space is partitioned among all
the nodes in the system. Each CAN node owns one zone in the virtual space. In addition,
a node holds information about a small number of “adjacent” zones in its routing table.
A uniform hash function is used to map keys to points in the d-dimensional space. All
the points falling into the zone of a specific node are maintained by that node. The
basic operations performed on a CAN are insertion, lookup and deletion of (key, value)
pairs. Requests (insert, lookup, or delete) for a particular key are routed by
intermediate CAN nodes towards the target node whose zone contains that key. The
design of CAN is completely distributed, scalable and fault-tolerant. Figure 1.2
illustrates the typical routing mechanism of CAN.

Figure 1.2 Routing in CAN (diagram labels: nodes A and B; point (p, q))
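
The key-to-zone mapping described above can be sketched in a few lines. This is a toy
model written for illustration and not CAN's actual implementation: the class and
function names (Node, ToyCAN, key_to_point, make_quadrant_can), the use of SHA-1, and
the fixed four-quadrant partition are all assumptions, and hop-by-hop routing between
neighbouring zones is replaced by a direct search for the owning zone.

```python
import hashlib
from dataclasses import dataclass, field


def key_to_point(key, d=2):
    """Hash a key to a point in the unit d-cube (one 4-byte SHA-1 slice per axis, d <= 5)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return tuple(
        int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2**32 for i in range(d)
    )


@dataclass
class Node:
    """A peer that owns the half-open zone [lo, hi) and stores the pairs that fall in it."""
    name: str
    lo: tuple
    hi: tuple
    store: dict = field(default_factory=dict)

    def owns(self, point):
        return all(l <= p < h for l, p, h in zip(self.lo, point, self.hi))


class ToyCAN:
    """Zones partition the whole space, so every key has exactly one owner."""

    def __init__(self, nodes):
        self.nodes = nodes

    def owner(self, key):
        point = key_to_point(key)
        # A real CAN forwards the request from neighbour to neighbour;
        # here we simply scan all zones.
        return next(node for node in self.nodes if node.owns(point))

    def put(self, key, value):
        self.owner(key).store[key] = value

    def get(self, key):
        return self.owner(key).store.get(key)


def make_quadrant_can():
    """Four peers, each owning one quadrant of the unit square."""
    return ToyCAN([
        Node("A", (0.0, 0.0), (0.5, 0.5)),
        Node("B", (0.5, 0.0), (1.0, 0.5)),
        Node("C", (0.0, 0.5), (0.5, 1.0)),
        Node("D", (0.5, 0.5), (1.0, 1.0)),
    ])


if __name__ == "__main__":
    can = make_quadrant_can()
    can.put("Carmen", "10.0.0.7")   # hypothetical (music_name, ip) pair
    print(can.owner("Carmen").name, can.get("Carmen"))
```

The same hash function is applied on insertion and on lookup, which is why an
exact-match request always reaches the peer that holds the pair.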

1.2 Complex Query Over DHT-based P2P systems

1.2.1 Marrying Database and P2P paradigm
Let us view the direction of this thesis from a broader vision: how can database
technology be applied to the P2P paradigm?

The semantics provided by today’s P2P technology are typically quite weak.
In theory, the data within a P2P system should be accessible at many levels of
granularity. However, today’s P2P systems support only the atomic level of
granularity. That is, data consists of a collection of indivisible objects, e.g.,
complete MP3 files. We can either place an entire object at a peer or not at all.
Forming a hierarchy of granularities in P2P systems therefore remains a challenge. In
most cases, current P2P file-sharing systems are limited to applications in which
objects are large and opaque, and whose content is already described precisely by
their names.

The database community has many strong data management tools, such as queries and
views, to express relationships between objects and to define new objects. Complex
queries can be posed across multiple sources, and the results of one query can be
materialized to answer other queries. If these data management techniques can be used
to develop better solutions to the weak semantics problem, P2P systems will offer not
only their inherent popular properties but also powerful semantic support.
This is the track that this thesis follows.

1.2.2 Complex Query over DHT-based P2P systems

As one important component of applying database technology to the P2P community, we
aim to build a complex query facility over P2P systems, in particular DHT-based P2P
systems.

Current P2P designs support a simplistic query form: “search”. This tool finds all
the files whose names contain a given string. However, “search” is a limited form of
querying, intended for identifying (“finding”) individual items. Rich query languages
should do more than “find” things: they should be able to involve multiple attributes,
and they should also allow queries over combinations and correlations of attributes.

As an example, it is possible to search in Gnutella for music by Beethoven, but it is
not possible to ask specifically for all of Beethoven’s overtures, since their file
names do not typically contain the word “overture”. It is difficult to answer a complex
query such as the one shown below:
Query 1.1
Select peer_name
From music P2P system
Where author_name = 'Beethoven'
AND music_format = 'Overture'
AND year > 1999

As the new generation of P2P systems, structured P2P designs have largely resolved the
scaling problem. In other respects, however, the core idea of the Distributed Hash
Table impoverishes the query facility. A distributed hash table is essentially a
decentralized hash table, and the most notable functionality of a hash table is quick
exact-match lookup. An index based on a hash table cannot support range queries except
by a full scan. Hence, DHT-based P2P systems support only exact-match lookups and do
not support range queries or multi-attribute queries efficiently. This inherent
deficiency aggravates the poor query facilities of P2P systems.
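
The reason a hash index cannot serve range queries can be illustrated with a short
sketch. Everything here is an assumption made for illustration: the function names,
the use of SHA-1 modulo the number of nodes as the placement rule, and the year
domain 1900 to 2059. Under uniform hashing, adjacent key values are assigned to
unrelated nodes, so a range gives no hint about which nodes to contact; an
order-preserving placement, by contrast, maps a range to a contiguous run of nodes.

```python
import hashlib


def hashed_node_for(key, num_nodes):
    """Uniform-hash placement: the node id carries no information about key order."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def ordered_node_for(value, domain_lo, domain_hi, num_nodes):
    """Order-preserving placement: equal-width slices of the domain, one per node."""
    domain_size = domain_hi - domain_lo + 1
    return (value - domain_lo) * num_nodes // domain_size


def nodes_touched(lo, hi, place):
    """Set of nodes that hold at least one key of the inclusive range [lo, hi]."""
    return {place(v) for v in range(lo, hi + 1)}


if __name__ == "__main__":
    num_nodes = 16
    hashed = nodes_touched(1999, 2002, lambda y: hashed_node_for(y, num_nodes))
    ordered = nodes_touched(
        1999, 2002, lambda y: ordered_node_for(y, 1900, 2059, num_nodes)
    )
    print("hashed placement touches nodes:", sorted(hashed))
    print("ordered placement touches nodes:", sorted(ordered))
```

With hashing, the only way to enumerate the range is one lookup per possible value,
which is impossible for continuous or very large domains, or a scan of every node.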

Hence the need arises to provide complex query facilities over P2P systems. If we can
process complex relational queries in a P2P network over DHTs, we can certainly
execute traditional exact-match lookups as a special case. As DHT-based P2P systems
have largely solved the scaling problem, the critical need is to provide complex query
facilities over DHT-based P2P systems while still maintaining the scalability of the
DHT infrastructure.

1.3 Multi-dimensional Range Query Evaluation

1.3.1 Motivations: Three Complex Queries

Before describing our approach, we first discuss some general issues in processing
complex queries over P2P systems, and then explain why we propose
multi-dimensional range query evaluation in DHT-based P2P systems.

As an example, consider a P2P music file sharing system. Putting aside the issues of
protocol and the architecture of the underlying network, we assume that each resource
is described by a set of attributes with globally known types. The collection of these
attributes and their values forms one relational table. This table is distributed
across the peers; that is, each peer stores one horizontal partition of the table. In
our example, a music file is described by the schema (Music_id, Music_name, Author,
Orchestra, Year), as shown in Table 1.1. In DHT-based P2P systems, such a relational
table is distributed into the system based on one attribute. CAN [5] can be used to
construct the overlay network. We choose Music_name as the hash key whose hashed value
decides where a given tuple is stored. Given a pair (music_name, ip), the music_name
is deterministically mapped to a point P in the coordinate space by a uniform hash
function. The corresponding (key, value) pair is then stored at the node whose zone
encloses the point P. If a user issues a query for music_name="Carmen", the same hash
function is applied to this key, and the query is converted into a search for the point
corresponding to the key. If the point does not lie in the zone of the local peer, the
search request is forwarded and routed by the CAN overlay infrastructure. Finally, if
the search target is found, the tuple is returned to the peer that issued the query.

Music (R)
Music_id (key)   Music_name   Author      Orchestra                       Year
12               Carmen       Bizet       London Symphony Orchestra       2000
11               Fidelio      Beethoven   Vienna Philharmonic Orchestra   1999

Table 1.1 Relational table distributed into the P2P system

A common problem arises: how can we answer a query that involves a non-key
attribute? For example, if a user asks for all the music played by the
“London Symphony Orchestra”, how can we process the SQL query below?

Query 1.2
Select Music_name
From Music
Where Orchestra = 'London Symphony Orchestra'

Unfortunately, current query processing mechanisms built on top of the DHT layer
select only one attribute as the hash key. Such a one-dimensional query system is
unable to process this kind of query directly. The available solution is to build
several DHT layers based on different keys and to choose the DHT index that matches
the attribute given in the selection. Naturally, this introduces the expensive cost of
building extra DHT layers. As a result, the communication cost increases.
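
The multi-layer workaround described above can be sketched as follows. This is an
illustrative model only: the function names, the SHA-1 placement rule and the
in-memory lists standing in for peers are assumptions, and the rows are modeled on
Table 1.1. The point it makes is that every tuple must be inserted once per indexed
attribute, so storage and insertion traffic grow with the number of layers.

```python
import hashlib


def node_for(value, num_nodes):
    """Pick the node responsible for a value by uniform hashing."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def build_layers(rows, attributes, num_nodes):
    """Build one independent hash layer per attribute.

    layers[attribute][node_id] is that node's local dict: value -> list of rows.
    """
    layers = {a: [dict() for _ in range(num_nodes)] for a in attributes}
    for row in rows:
        for a in attributes:            # one insertion per layer
            node = layers[a][node_for(row[a], num_nodes)]
            node.setdefault(row[a], []).append(row)
    return layers


def lookup(layers, attribute, value, num_nodes):
    """Exact-match lookup in the layer that was built on the given attribute."""
    node = layers[attribute][node_for(value, num_nodes)]
    return node.get(value, [])


def stored_copies(layers):
    """Total number of row copies held across all layers and nodes."""
    return sum(
        len(rows) for layer in layers.values() for node in layer for rows in node.values()
    )


MUSIC = [
    {"Music_id": 12, "Music_name": "Carmen", "Author": "Bizet",
     "Orchestra": "London Symphony Orchestra", "Year": 2000},
    {"Music_id": 11, "Music_name": "Fidelio", "Author": "Beethoven",
     "Orchestra": "Vienna Philharmonic Orchestra", "Year": 1999},
]

if __name__ == "__main__":
    layers = build_layers(MUSIC, ["Music_name", "Orchestra"], num_nodes=8)
    hits = lookup(layers, "Orchestra", "London Symphony Orchestra", num_nodes=8)
    print([row["Music_name"] for row in hits])
    print("row copies stored:", stored_copies(layers))
```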

Thus a new indexing facility that supports multiple attributes is needed. With such a
multi-dimensional indexing facility over the DHT layer, we can even answer more
complex multi-attribute queries like the one below:
Query 1.3:
Select Music_name
From Music
Where Orchestra = 'London Symphony Orchestra'
And Author = 'Bizet'

Another critical kind of query that poses a challenge for DHT-based P2P designs is the
range query. This is inherently because a hash table supports only exact-match lookup.
In this setting, the DHT-based P2P infrastructure provides no help in answering the
query below:

Query 1.4:
Select Music_name
From Music
Where Year > 1998
And Year < 2003

The only approach is to flood the query into the system and check each peer.
Accordingly, it suffers from the same drawbacks as Gnutella. Flooding on every request
is not scalable. Further, because users in a Gnutella network self-organize into an
application-level mesh on which requests for a file are flooded within a certain scope,
the flooding has to be curtailed at some point. This leads to the possibility of
failing to find content that is actually in the community.
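
The incompleteness of scoped flooding can be seen in a small simulation. The sketch
below is an assumption-laden illustration rather than Gnutella's actual protocol: the
function name flood, the five-peer chain topology and the hop limit (ttl) are
hypothetical. It counts the messages sent and shows that a peer holding the answer is
never reached when it lies beyond the flooding scope.

```python
from collections import deque


def flood(graph, start, ttl, has_item):
    """Breadth-first flooding limited to `ttl` hops.

    Returns (peers that answered, number of messages sent).
    """
    visited = {start}
    frontier = deque([(start, ttl)])
    hits = []
    messages = 0
    while frontier:
        peer, remaining = frontier.popleft()
        if has_item(peer):
            hits.append(peer)
        if remaining == 0:
            continue                      # scope exhausted: do not forward
        for neighbour in graph[peer]:
            messages += 1                 # duplicates are still sent, then dropped
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, remaining - 1))
    return hits, messages


# Hypothetical overlay: a chain of five peers, with the content stored at E.
CHAIN = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

if __name__ == "__main__":
    print(flood(CHAIN, "A", 2, lambda p: p == "E"))  # ([], 3)
    print(flood(CHAIN, "A", 4, lambda p: p == "E"))  # (['E'], 7)
```

Raising the hop limit restores completeness only at the price of more messages, which
is the trade-off the paragraph above describes.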

More importantly, range lookup is one of the fundamental functions needed to support
general-purpose database query processing. In practice, the range selection operation
is typically implemented at the leaves of a query plan. Hence, supporting range
queries is the foundation for an indexing facility that is to support complex queries
in P2P systems. However, current methods, including the research agenda of Harren
et al. [6], do not provide an efficient way for P2P systems to support general range
lookup. One of the goals that constrain the design of Harren et al.’s work is “minimal
extension to DHT APIs”. It is true that this consideration keeps the DHT APIs as thin
and general-purpose as possible. But we contend that, in order to achieve the larger
goal of supporting more general and complex queries involving ranges in the context of
DHT-based systems, it is necessary to extend current P2P designs from exact-match
lookup to range lookup. Without support for range queries, it is difficult for a
DHT-based indexing facility to support general or complex queries. This gives the
second goal of our design: extending P2P systems to support range lookup.

Two design goals have now emerged, but this is not the whole story. Consider the
third complex query described below.

Suppose that, in the above P2P music file sharing system, information about the author
of the music is also of great interest to users. In the scenario of the existing
one-attribute indexing infrastructure, we have to build another relational table to
store this independent theme, in addition to the hash indexing infrastructure built on
music_name. This table is distributed into the P2P system according to the hash
value of the attribute author_name. As a result, the P2P music sharing system now has
two tables distributed into it. It is easy to find information confined to an
individual table. For instance, given the name of an author, the author’s country can
be found by the usual DHT exact-match lookup based on the value of author_name.

Author (T)
Author_id (key)   Author_name   Author_birth   Country   Representative_works
12                Bizet         1756           France    Carmen
11                Beethoven     1770           German    Fidelio

Table 1.2 The second table stored in the P2P system

But how do we answer a query that involves two relational tables? For instance, how
can we find the music files produced earlier than 1999 whose author was born in
Germany?
Query 1.5:
Select Music_name
From Music AS R, Author AS T
Where R.author = T.author_name
AND R.year < 1999
AND T.country = 'German'

Clearly, this query covers both the Music table and the Author table. To answer it, we
must search both tables, so it is natural to consider joining them. Provided that an
independent indexing infrastructure is built for each attribute involved, a range
query specified by several attributes (dimensions) must be answered by first querying
the indexing infrastructure for each attribute present in the query and then combining
the results through an expensive join operation. This approach is adopted in the
agenda of Harren et al. [6]. They implement the join operation over multiple tables
and propose some reasonable algorithms such as “hash join”.
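
As a rough illustration of what such a join involves, the following sketch evaluates
a query of the shape of Query 1.5 with a textbook in-memory hash join. It is not the
algorithm of Harren et al. [6]; the function name, the parameterization of the two
selection predicates, and the local lists standing in for distributed tables are
assumptions. The rows are modeled on Tables 1.1 and 1.2. In a P2P setting, both the
build phase and the probe phase would additionally require shipping tuples between
peers.

```python
MUSIC = [
    {"music_name": "Carmen", "author": "Bizet", "year": 2000},
    {"music_name": "Fidelio", "author": "Beethoven", "year": 1999},
]

AUTHORS = [
    {"author_name": "Bizet", "country": "France"},
    {"author_name": "Beethoven", "country": "German"},
]


def hash_join_query(music, authors, year_before, country):
    """Return music names with year < year_before whose author has the given country."""
    # Build phase: hash the selected Author tuples on the join attribute.
    build = {}
    for t in authors:
        if t["country"] == country:
            build.setdefault(t["author_name"], []).append(t)

    # Probe phase: each selected Music tuple looks up its join partners.
    result = []
    for r in music:
        if r["year"] < year_before:
            for _match in build.get(r["author"], []):
                result.append(r["music_name"])
    return result


if __name__ == "__main__":
    # Query 1.5 as written (year < 1999) matches no row of the two sample tables.
    print(hash_join_query(MUSIC, AUTHORS, 1999, "German"))  # []
    # Relaxing the bound to year < 2000 returns the Beethoven entry.
    print(hash_join_query(MUSIC, AUTHORS, 2000, "German"))  # ['Fidelio']
```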

However, even in traditional centralized database systems, join is an expensive
operation. In a distributed environment, this operator unavoidably introduces
excessive communication cost and causes maintenance problems. Further, as more tables
are inserted into the system, the cost of joins grows. P2P users are impatient and
expect quick responses. Hence, an indexing facility that avoids joins is preferred.
Avoiding expensive join operations becomes the third theme of our design.

In summary, given the current query processing functionality of P2P systems, a new
indexing functionality is needed that
1) supports multiple attributes
2) supports range queries
3) avoids expensive join operations.
Our method of multi-dimensional range query processing aims to provide these key
functionalities and thereby move towards supporting more general complex queries.

1.3.2 Multi-dimensional range query processing
In this section, we briefly overview the proposed multi-dimensional range query
processing model for P2P networks.

The work reported in this thesis is similar in spirit to that of Harren et al. [6], in
that we are interested in supporting database query processing over P2P networks. Our
contention is that, in order to support complex queries in the distributed context of
peer-to-peer systems, we need to extend current P2P exact-name lookups to range
searches. In this thesis, we propose a method for efficiently answering
multi-dimensional range queries on a peer-to-peer data sharing system. Our long-term
goal is to support the various types of complex queries in P2P data-sharing systems by
extending them to support more general queries on potentially more complex and more
structured datasets. The query scheme keeps the efficient exact-match lookup of
previous DHT P2P designs while extending it to multi-attribute queries. For instance,
a multi-dimensional single point query can be evaluated as efficiently as a
single-keyword exact-match lookup. This avoids the dilemma described in the first
query example in the previous section.

Unlike previous DHT P2P designs, which process one-dimensional range queries by
inefficient “flooding” and do not support multi-dimensional range queries,
the proposed method aims to exploit the processing power of range queries in DHT P2P
architectures in a multi-dimensional setting. In the proposed mechanism,