13
Adaptive Demand-based
Heuristics for Traffic Reduction
in Distributed Information
Systems
George Bilchev and Sverrir Olafsson
13.1 Introduction
As Internet connectivity is reaching the global community, information systems are
becoming more and more distributed. Inevitably, this overnight exponential growth has also
caused traffic overload at various places in the network. Until recently, it was believed that
scaling the Internet was simply an issue of adding more resources, i.e. bandwidth and
processing power could be brought to where they were needed. The Internet’s exponential
growth, however, exposed this impression as a myth. Information access has not been and
will not be evenly distributed. As has been observed, user requests create ‘hot-spots’ of
network load, with the same data transmitted over the same network links again and again.
These hot-spots are not static but move around, making it impossible to accurately
predict the right network capacity to be installed. All of this justifies the need to
develop new infrastructure for data dissemination on an ever-increasing scale, and to
design adaptive heuristics for traffic reduction.
In this chapter, we develop a distributed file system model and use it as an experimental
simulation tool to design, implement and test network adaptation algorithms. Section 13.2
describes in detail the distributed file system model and explains the implemented
simulation environment. Two adaptation algorithms are developed in section 13.3. One is
based on the ‘greedy’ heuristic principle and the other is a genetic algorithm tailored to
handle the constraints of our problem. Experiments are shown in section 13.4, and section
13.5 gives conclusions and discusses possible future research directions.
Telecommunications Optimization: Heuristic and Adaptive Techniques.
Edited by David W. Corne, Martin J. Oates, George D. Smith
Copyright © 2000 John Wiley & Sons Ltd
ISBNs: 0-471-98855-3 (Hardback); 0-470-84163X (Electronic)
Figure 13.1 A schematic representation of the network and the distributed file system.
13.2 The Adaptation Problem of a Distributed File System
The World Wide Web is rapidly moving us towards a distributed, interconnected
information environment, in which an object will be accessed from multiple locations that
may be geographically distributed worldwide. For example, a database of customers’
information can be accessed from the location where a salesperson is working for the day. In
another example, an electronic document may be co-authored and edited by several users.
In such distributed information environments, the replication scheme of the objects has
crucial implications for system performance, since reading an object locally is faster
and less costly than reading it from a remote server. In general, the optimal replication
scheme of an object depends on the request pattern, i.e. the number of times users request
the data. Presently, the replication scheme of a distributed database is established in a static
fashion when the database is designed. The replication scheme remains fixed until the
designer manually intervenes to change the number of replicas or their location. If the
request pattern is fixed and known a priori, then this is a reasonable solution. However, in
practice the request patterns are often dynamic and difficult to predict. Therefore, we need
an adaptive network that manages to optimize itself as the pattern changes. We proceed
with the development of a mathematical model of a distributed information/file system.
A distributed file system consists of interconnected nodes, where each node $i$, $i = 1, \dots, N$, has
a local disk with capacity $d_i$ to store files (see Figures 13.1 and 13.2). There is a collection
of $M$ files, each of size $s_j$, $j = 1, \dots, M$. Copies of the files can reside on any one of the disks
provided there is enough capacity. The communication cost $c_{i,k}$ between nodes $i$ and $k$
(measured as transferred bytes per simulated second) is also given.
Figure 13.2 Users connect to each node of the distributed file system and generate requests.
In our model each node runs a file manager which is responsible for downloading files
from the network (Figure 13.3). To do that, each file manager $i$ maintains an index vector
$l_{i,j}$ containing the location where each file $j$ is downloaded from. User applications running on
the nodes generate file requests the frequency of which can be statistically monitored in a
matrix $\{p_{i,j}\}$.
To account for contention and to distribute the file load across the network it has been
decided to model how busy the file managers are at each node $k$ as follows:
\[
b_k = \frac{\sum_{(i,j)\,:\,l_{i,j}=k} p_{i,j}\, s_j}{\sum_{n=1}^{N} \sum_{m=1}^{M} p_{n,m}\, s_m}
\]
Figure 13.3 Each node runs a file manager (responsible for allocating files on the network) and a
number of user applications which generate the requests.
Thus, the response time of the file manager at node $k$ can be expressed as waiting time in a
buffer (Schwartz, 1987):
\[
r_k = \begin{cases} \dfrac{1}{\tau_k - b_k} & \text{if } \tau_k > b_k \\ \infty & \text{otherwise} \end{cases}
\]
where $\tau_k$ reflects the maximum response capacity of the individual servers. The overall
performance at node $i$ can be measured as the time during which applications wait for files
to download (i.e. response time):
\[
O_i = \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) p_{i,j}
\]
The first term in the sum represents the time needed for the actual transfer of the data and
the second term reflects the waiting time for that transfer to begin. The goal is to minimize
the average network response time:
\[
\min_{\{l_{i,j}\}} \; \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}
\]
The minimization is over the index matrix $\{l_{i,j}\}$. There are two constraints: (1) the available
disk capacity on each node should not be exceeded; and (2) each file must have at least one
copy somewhere on the network.
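The model can be made concrete with a small evaluation routine. The following is a minimal sketch, not the authors' simulator. It assumes that b_k is the share of all requested bytes served by node k, that r_k = 1/(τ_k − b_k) while τ_k > b_k and is infinite otherwise, that indices are 0-based, and that `c[i][i]` holds a (large) local read rate. The names `busy`, `response`, `average_response`, `is_feasible`, `loc` and `stored` are illustrative only.

```python
import math

def busy(k, loc, p, s):
    """b_k: share of all requested bytes that is served by node k."""
    total = sum(p[n][m] * s[m] for n in range(len(p)) for m in range(len(s)))
    served = sum(p[i][j] * s[j]
                 for i in range(len(p)) for j in range(len(s))
                 if loc[i][j] == k)
    return served / total if total > 0 else 0.0

def response(k, loc, p, s, tau):
    """r_k: waiting time at node k (assumed infinite once b_k reaches tau_k)."""
    b = busy(k, loc, p, s)
    return 1.0 / (tau[k] - b) if tau[k] > b else math.inf

def average_response(loc, p, s, c, tau):
    """(1/N) * sum_i sum_j (s_j / c[i][l_ij] + r_{l_ij}) * p_ij."""
    n_nodes = len(p)
    r = [response(k, loc, p, s, tau) for k in range(n_nodes)]
    total = 0.0
    for i in range(n_nodes):
        for j in range(len(s)):
            if p[i][j] > 0:            # skip unused files (avoids inf * 0)
                k = loc[i][j]          # node from which i downloads file j
                total += (s[j] / c[i][k] + r[k]) * p[i][j]
    return total / n_nodes

def is_feasible(stored, s, d):
    """Constraint (1): disk capacity respected; (2): every file has a copy."""
    fits = all(sum(s[j] for j in files) <= d[i] for i, files in enumerate(stored))
    covered = all(any(j in files for files in stored) for j in range(len(s)))
    return fits and covered

# Two nodes, two files; each node holds the file it requests.
s = [10.0, 20.0]                       # file sizes
p = [[1.0, 0.0], [0.0, 1.0]]           # request frequencies p[i][j]
c = [[100.0, 10.0], [10.0, 100.0]]     # transfer rates c[i][k]
tau = [2.0, 2.0]                       # response capacities
loc = [[0, 1], [0, 1]]                 # index matrix l[i][j]
avg = average_response(loc, p, s, c, tau)
```

In this toy configuration b_0 = 1/3 and b_1 = 2/3, so r_0 = 0.6 and r_1 = 0.75, and the average response time is (0.7 + 0.95) / 2 = 0.825.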
13.2.1 The Simulation Environment
In our distributed file system model the users generate file requests and the network
responds by allocating and downloading the necessary files. The file requests can be
statistically monitored and future requests predicted from observed patterns. The simulation
environment captures these ideas by modeling the user file requests as random walks:
\[
p_{i,j}(t) = p_{i,j}(t-1) + \gamma
\]
where $\gamma$ is drawn from a uniform distribution $U(-r, r)$. The parameter $r$ determines the
‘randomness’ of the walk. If it is close to zero then $p_{i,j}(t) \approx p_{i,j}(t-1)$.
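The random-walk request model can be sketched in a few lines. This is an illustration under stated assumptions: the function name `step_requests` is hypothetical, and clamping at zero is an added assumption, since the text does not say how a negative frequency would be handled.

```python
import random

def step_requests(p, r, rng):
    """One random-walk step: p_ij(t) = p_ij(t-1) + gamma, gamma ~ U(-r, r)."""
    # Clamping at zero is an assumption; the chapter does not specify it.
    return [[max(0.0, x + rng.uniform(-r, r)) for x in row] for row in p]

rng = random.Random(0)                 # seeded for repeatability
p0 = [[1.0, 2.0], [3.0, 4.0]]
p1 = step_requests(p0, 0.1, rng)       # each entry moves by at most 0.1
```

With r = 0 the pattern is unchanged, which matches the remark that a small r gives p(t) ≈ p(t−1).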
During the simulated interval [t, t+1] the model has information about the file requests
that have occurred in the previous interval [t–1, t]. Thus the dynamics of the simulation can
be formally defined as:
For t = 1, 2, 3, …
    generate new file requests: $P(t) = P(t-1) + \{\gamma_{i,j}\}$
    simulate network: $O(t) = \frac{1}{N} \sum_{i=1}^{N} O_i(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) p_{i,j}(t)$
An adaptive distributed file system would optimize its file distribution according to the
user requests. Since the future user requests are not known and can only be predicted the
optimization algorithm would have to use an expected value of the requests derived from
previous observations:
\[
\tilde{P}(t) = \mathrm{Prediction}(P(t-1))
\]
Thus an adaptive distributed file system can be simulated as follows:
For t = 1, 2, 3, …
    file requests prediction: $\tilde{P}(t) = \mathrm{Prediction}(P(t-1))$
    optimization: $L(t) = \mathrm{Optimize}(\tilde{P}(t))$
    generate new file requests: $P(t) = P(t-1) + \{\gamma_{i,j}\}$
    simulate network: $O(t) = \frac{1}{N} \sum_{i=1}^{N} O_i(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) p_{i,j}(t)$
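The adaptive loop can be sketched as a driver that takes the optimizer and the network evaluation as parameters. This is a minimal sketch under assumptions: the prediction step is taken to be the identity (the last observed pattern), negative frequencies are clamped at zero, and `simulate`, `toy_optimize` and `toy_evaluate` are hypothetical names. The two toy functions are stand-ins for the optimization algorithms and the response-time model, not implementations of them.

```python
import random

def simulate(p0, steps, optimize, evaluate, r=0.05, seed=0):
    """Run the adaptive loop and return the list of O(t) values."""
    rng = random.Random(seed)
    p_prev, history = p0, []
    for _ in range(steps):
        p_pred = p_prev                    # Prediction(P(t-1)): identity (assumed)
        placement = optimize(p_pred)       # L(t) = Optimize(predicted requests)
        p_now = [[max(0.0, x + rng.uniform(-r, r)) for x in row]
                 for row in p_prev]        # P(t) = P(t-1) + {gamma_ij}
        history.append(evaluate(placement, p_now))   # O(t) on the real requests
        p_prev = p_now
    return history

# Toy stand-ins (hypothetical): one copy per file, placed on the node that
# requested it most; the score is the volume of requests served remotely.
def toy_optimize(p):
    return [max(range(len(p)), key=lambda i: p[i][j]) for j in range(len(p[0]))]

def toy_evaluate(placement, p):
    return sum(p[i][j] for i in range(len(p)) for j in range(len(p[0]))
               if placement[j] != i)

history = simulate([[1.0, 0.2], [0.1, 0.9]], steps=5,
                   optimize=toy_optimize, evaluate=toy_evaluate)
```

The point of the structure is that the optimizer only ever sees the pattern from the previous interval, while the score is computed on the pattern that actually occurs.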
The next section describes the developed optimization algorithms in detail.
13.3 Optimization Algorithms
13.3.1 Greedy Algorithm
The ‘greedy’ principle consists of selfishly allocating resources (provided constraints allow
it) without regard to the performance of the other members of the network (Cormen et al.,
1990). While greedy algorithms are optimal for certain problems (e.g. the minimum spanning
tree problem), in practice they often produce only near-optimal solutions. Greedy
algorithms, however, are very fast and are usually used as a heuristic method. The greedy
approach seems very well suited to our problem since the uncertainties in the file request
prediction mean that we never actually optimize the real problem, but our expectation of it.
The implemented greedy algorithm works as follows:
For each file j check every node i to see if there is enough
space to accommodate it and if enough space is available
calculate the response time of the network if file j was at
node i:
\[
\sum_{k=1}^{N} \left( \frac{s_j}{c_{k,i}} + r_i \right) p_{k,j}
\]
After all nodes are checked copy the file to the best found
node.
The above described algorithm loads only one copy of each file into the distributed file
system. If multiple copies are allowed, then add copies of the files in the following way:
For each node i get the most heavily used file
(i.e., $\max_j (p_{i,j} \cdot s_j)$) which is not already present.
Check if there is enough space to accommodate it.
If yes, copy it. Continue until all files are checked.
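Both phases of the greedy procedure can be sketched as follows. This is an illustration rather than the implemented algorithm: it assumes that the response times `r` are fixed estimates during placement, that `c[k][i]` is the transfer rate between requesting node k and holding node i, and that indices are 0-based. The function name `greedy_place` and the `stored` representation (one set of file indices per node) are hypothetical.

```python
def greedy_place(p, s, c, d, r):
    """Place one copy of every file, then add replicas where space remains."""
    n, m = len(d), len(s)
    free = list(d)                          # remaining disk space per node
    stored = [set() for _ in range(n)]      # files held by each node

    # Phase 1: one copy per file, on the node giving the lowest network cost.
    for j in range(m):
        best, best_cost = None, float("inf")
        for i in range(n):
            if free[i] < s[j]:
                continue                    # not enough space on node i
            cost = sum((s[j] / c[k][i] + r[i]) * p[k][j] for k in range(n))
            if cost < best_cost:
                best, best_cost = i, cost
        if best is None:
            raise ValueError(f"no node has room for file {j}")
        stored[best].add(j)
        free[best] -= s[j]

    # Phase 2: per node, copy the most heavily used missing files that fit.
    for i in range(n):
        by_usage = sorted(range(m), key=lambda j: p[i][j] * s[j], reverse=True)
        for j in by_usage:
            if j not in stored[i] and s[j] <= free[i]:
                stored[i].add(j)
                free[i] -= s[j]
    return stored

stored = greedy_place(p=[[5.0, 0.0], [0.0, 5.0]], s=[1.0, 1.0],
                      c=[[100.0, 10.0], [10.0, 100.0]],
                      d=[1.0, 2.0], r=[0.0, 0.0])
```

In this example each file first lands on the node that requests it, and node 1, which has spare capacity, then takes a replica of file 0.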
13.3.2 Genetic Algorithm
Genetic Algorithms (GAs) are very popular due to their simple idea and wide applicability
(Holland, 1975; Goldberg, 1989). The simple GA is a population-based search in which the
individuals (each representing a point from the search space) exchange information (i.e.
reproduce) to move through the search space. The exchange of information is done through
operators (such as mutation and crossover) and is based on the ‘survival of the fittest’
principle, i.e. better individuals have greater chance to reproduce.
It is well established that in order to produce good results the basic GA must be tailored
to the problem at hand by designing problem specific representation and operators. The
overall flow of control in our implemented GA for the distributed file system model is
similar to the steady state genetic algorithm described in Chapter 1.
In order to describe the further implementation details of our GA we need to answer the
following questions: How are the individuals represented? How is the population
initialized? How is the selection process implemented? What operators are used?
Individuals representation: each individual from the population represents a distribution
state for the file system captured by the matrix $\{l_{i,j}\}$.
Initialization: it is important to create a random population of feasible individuals. The
implemented initialization process randomly generates a node index for each object and
tries to accommodate it on that node. In case of a failure the process is repeated for the
same object.
Selection process: the individuals are first linearly ranked according to their fitness and
then are selected by a roulette-wheel process using their rank value.
Operators: the main problem is the design of operators which preserve feasibility of the
solutions (Bilchev and Parmee, 1996). This is important for our problem since it
intrinsically has two constraints: (i) the disk capacity of the nodes must not be
exceeded; and (ii) each file must have at least one copy somewhere on the network. (If
feasibility is not preserved by the operators, then the fitness function would need to
be modified by an appropriate penalty function in order to drive the population into the
feasible region.) We have developed two new operators, both preserving feasibility:
Safe-add and Safe-delete. Safe-add works as follows:
For each node randomly select and copy a file which is not
already locally present and whose size is smaller than the
available disk space.
Check to see if any of the nodes would respond faster by
downloading files from the new locations and if yes, update
the matrix $\{l_{i,j}\}$.
Safe-delete is as follows:
For each node randomly select and delete a file provided it
is not the last copy.
Update the matrix $\{l_{i,j}\}$ to reflect the above changes.
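The two feasibility-preserving operators can be sketched as below. This is a hedged illustration, not the authors' code: it represents an individual by a hypothetical `stored` list (one set of file indices per node) together with the remaining space `free`, treats a file as fitting when its size does not exceed the free space, and rebuilds the index matrix by letting every node fetch each file from the holder with the lowest transfer-plus-waiting time, with the response times `r` assumed fixed.

```python
import random

def copies(stored, j):
    """Number of nodes currently holding file j."""
    return sum(j in files for files in stored)

def safe_add(stored, free, s, rng):
    """Each node copies one random missing file that fits on its disk."""
    for i, files in enumerate(stored):
        candidates = [j for j in range(len(s))
                      if j not in files and s[j] <= free[i]]
        if candidates:
            j = rng.choice(candidates)
            files.add(j)
            free[i] -= s[j]

def safe_delete(stored, free, s, rng):
    """Each node deletes one random file, never the last copy of it."""
    for i, files in enumerate(stored):
        candidates = [j for j in sorted(files) if copies(stored, j) > 1]
        if candidates:
            j = rng.choice(candidates)
            files.remove(j)
            free[i] += s[j]

def refresh_index(stored, s, c, r):
    """Rebuild l[i][j]: node i fetches file j from the fastest holder."""
    n = len(stored)
    return [[min((k for k in range(n) if j in stored[k]),
                 key=lambda k: s[j] / c[i][k] + r[k])
             for j in range(len(s))]
            for i in range(n)]

rng = random.Random(1)
s = [1.0, 1.0]
stored, free = [{0}, {1}], [1.0, 1.0]
safe_add(stored, free, s, rng)        # both nodes now hold both files
safe_delete(stored, free, s, rng)     # every file keeps at least one copy
```

Because candidates are filtered before any change is made, neither operator can produce an individual that violates the capacity or the last-copy constraint, so no penalty term is needed in the fitness.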
In our experiments we have used a population size of 70 individuals for 30 generations.
During each generation 50 safe-add and three safe-delete operators are applied. During the
selection process the best individual has 5% more chances of being selected as compared to
the second best, and so on.
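The rank-based roulette wheel can be sketched as follows. The weighting is one possible reading of the 5% figure, namely that the selection weight falls by a fixed step of 0.05 per rank; the chapter does not give the exact formula, so `rank_weights`, `select` and the `step` parameter are assumptions. Lower fitness is treated as better, since the objective is a response time.

```python
import random

def rank_weights(n, step=0.05):
    """Weights for rank 0 (best) to n-1 (worst); neighbours differ by `step`."""
    return [1.0 + step * (n - 1 - rank) for rank in range(n)]

def select(population, fitness, rng, step=0.05):
    """Roulette-wheel selection on linear rank; lower fitness is better."""
    ranked = sorted(population, key=fitness)
    weights = rank_weights(len(ranked), step)
    return rng.choices(ranked, weights=weights, k=1)[0]

rng = random.Random(0)
scores = {"a": 3.0, "b": 1.0, "c": 2.0}    # hypothetical response times
parent = select(list(scores), scores.get, rng)
```

Selecting on rank rather than on raw fitness keeps the selection pressure constant even when the response times of the individuals differ by orders of magnitude.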
13.4 Simulation Results
In our experiments we start with an offline simulation, in which the optimization
algorithms are run while the file system is not in use (e.g. overnight). In this
scenario, we assume that both algorithms have enough time to finish their optimization
before the file system is used again.
A typical simulation is shown in Figure 13.4. Tests are done using seven nodes and 100
files. All simulation graphs start from the same initial state of the distributed file system.
Then the two optimization algorithms are compared against a non-adaptive (static) file
system (i.e. when no optimization is used). The experiments show that the
adaptive distributed file system produces better results than a static file system.
The graphs also show that the GA optimizer consistently outperforms the greedy
algorithm.
Figure 13.4 An offline simulation. Adaptive distributed file systems utilizing a genetic algorithm and
a greedy algorithm respectively are compared against a static distributed file system. The experiments
use seven nodes and 31 files.
To show the effect of delayed information, we run the greedy algorithm once using the
usage pattern collected from the previous simulation step, P(t–1) (which is available in
practice), and once using the actual P(t) (which is not known in practice). The difference in
performance reveals how much better we could do if perfect information were available
(Figure 13.5).

[Plot for Figure 13.4: response time O(t) against t (1 to 100); curves for the static file system, the greedy algorithm and the genetic algorithm.]
Figure 13.5 A greedy algorithm with perfect information is compared to a greedy algorithm with
delayed information. In practice, we only have delayed information.
Figure 13.6 Online simulation. The circles indicate when the GA optimization takes place. The
GA/greedy algorithm ratio is 60 (i.e. the GA is run once for every 60 runs of the greedy algorithm).
[Plot for Figure 13.5: response time O(t) against t (1 to 30); curves for the greedy algorithm and the greedy algorithm with perfect information.]
[Plot for Figure 13.6: response time O(t) against t (1 to 100); curves for the static file system, the greedy algorithm and the genetic algorithm.]
Figure 13.7 Online simulation. The GA/greedy algorithm ratio is 40. This is the critical ratio where
the average performance of both algorithms is comparable.
Figure 13.8 Online simulation. The GA/greedy algorithm ratio is 30. The GA manages to maintain
its performance advantage.
[Plots for Figures 13.7 and 13.8: response time O(t) against t (1 to 100); curves for the static file system, the greedy algorithm and the genetic algorithm.]
Next we consider online simulations. In an online simulation a faster optimization
algorithm would be executed at a greater rate than a slower algorithm. To account for the
fact that the GA is much slower than the greedy algorithm, we introduce a new simulation
variable, namely the GA/greedy algorithm ratio. It shows how many greedy optimizations
can be achieved during the time it takes for one GA optimization to finish.
Figure 13.6 shows a typical online simulation. Here the GA is run every 60 steps while
the greedy algorithm is run every step. The GA optimizes the file system at the beginning
of the simulation. Then for the next 60 steps the file system is static. This allows the greedy
algorithm to take over and outperform the GA.
It is interesting to note the critical GA/greedy algorithm ratio above which the greedy
algorithm outperforms the GA. For our simulation model, this happens to be around 40.
Figure 13.7 shows such a simulation. When the ratio drops below this value, the GA is
capable of maintaining its performance advantage (Figure 13.8).
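The effect of the GA/greedy ratio can be illustrated with a toy online loop in which an optimizer is only allowed to refresh the placement every `ratio` steps and the placement stays frozen in between. This is a sketch under assumptions: `online_run` is a hypothetical name, the request pattern is reduced to a single number, and the stand-in optimizer simply remembers the last pattern it saw.

```python
def online_run(patterns, optimize, evaluate, ratio):
    """patterns[0] is the initial observation; step t is scored on patterns[t]."""
    history, placement = [], None
    for t in range(1, len(patterns)):
        if (t - 1) % ratio == 0:           # steps 1, 1 + ratio, 1 + 2*ratio, ...
            placement = optimize(patterns[t - 1])   # uses delayed information
        history.append(evaluate(placement, patterns[t]))
    return history

# Toy stand-ins (hypothetical): the "placement" is the remembered pattern and
# the score is how far the current pattern has drifted away from it.
drifting = [0, 1, 2, 3, 4]
remember = lambda p: p
drift = lambda placement, p: abs(placement - p)

every_step = online_run(drifting, remember, drift, ratio=1)    # greedy-like
every_other = online_run(drifting, remember, drift, ratio=2)   # slower optimizer
```

With a steadily drifting pattern, the optimizer that runs every step stays one step behind, while the one that runs every second step falls further behind between runs. This is the mechanism by which a faster but weaker optimizer can overtake a slower one as the ratio grows.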
The simulation results suggest that the GA is more suitable for situations where the file
request pattern does not change very quickly, while the greedy algorithm is more
appropriate for rapidly changing request patterns.
13.5 Conclusions and Future Work
Our investigations showed that adaptation is crucial for network performance optimization.
Its importance would be even more evident in a real network where there are limited
resources and slow communication links.
The simulation results confirmed our initial hypothesis that a tailored genetic algorithm
would outperform ‘classical’ heuristics such as the greedy algorithm. However, while it
finds better solutions, the GA is considerably slower and does not scale as well as the
greedy algorithm. That said, both algorithms are centralized: they run on a single node
and use global information. Centralized algorithms in general suffer from scalability
problems, and operations such as collecting global information can be prohibitive for
large networks. This suggests a direction for future studies, namely distributed
adaptation algorithms in which only local information is used and optimization decisions
are executed locally on each node. While it is difficult to predict how global network
behavior emerges out of the many local interactions, such an algorithm, once designed,
would be intrinsically scalable.
Another important aspect requiring future studies is network resilience. User requests
(file requests in our model) should not fail if a node is temporarily disconnected. While
designing resilient networks is a huge problem in itself, it is even more important to design
resilient networks which are optimal and adaptable.
Acknowledgements
We would like to thank Brian Turton (University of Wales), Dave Corne (Reading
University), Marc Wennink (BT) and Alan Pengelly (BT) for useful comments.
