Managing and Mining Graph Data

A Survey on Streaming Algorithms for Massive Graphs 397
Other variants of the streaming model also exist. For example, the W-stream
model [15] allows the algorithm to write to (annotate) the stream during each
pass. These annotations can then be utilized by the algorithm in the successive
passes. Another variant [1] augments the streaming model by adding a sorting
primitive.
3. Statistics and Counting Triangles
In this section, we describe a set of problems that involve graphs but es-
sentially can be reduced to problems whose input is an array presented as
a stream of the array elements (or as a sequence of increments to the el-
ements). For example, the array 𝑎 = [2, 1, 3] can be given as a stream
{(𝑎[1] + 1), (𝑎[3] + 1), (𝑎[3] + 1), (𝑎[2] + 1), (𝑎[1] + 1), (𝑎[3] + 1)}. As-
suming all the entries of the array take value 0 at the beginning of the stream,
after the operations in the stream, we obtain the array 𝑎.
There are many streaming algorithms that compute, for this array, statistics
such as frequency moments [3, 24, 31] and heavy hitters [13, 10], and that
construct succinct data structures supporting queries such as range queries
[38]. These algorithms can be applied directly once the graph problem is
reduced to the corresponding problem on an array.
We first consider such problems involving the degrees of the graph nodes. For
an undirected graph, the degree of a node is the number of edges incident
to the node. One may view each graph as having an associated virtual array $D$
such that $D[i]$ is the degree of the $i$-th node. In the streaming
setting, a stream of edges translates to updates to the array $D$. For example,
the stream $\{(1, 2), (4, 8), (2, 7), \ldots\}$ means the operation sequence
$\{(D[1] + 1), (D[2] + 1), (D[4] + 1), (D[8] + 1), (D[2] + 1), (D[7] + 1), \ldots\}$.
(The degree array can be extended to directed graphs, where we may have one
out-degree array and one in-degree array.)
The frequency moment problem is to compute the $k$-th moment
$f_k = \sum_{i=1}^{n} (D[i])^k$ of the node degrees. The heavy hitter problem
is to report, after seeing the graph stream, the nodes having the largest
degrees. The range query problem is to construct a succinct representation of
the array (one that is much smaller in size than the array), from which
$\sum_{i=j}^{k} D[i]$ can be calculated, given $j$ and $k$ as query input.
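The three quantities just defined can be illustrated with a small exact computation. The sketch below is only a reference implementation that stores the whole degree array; it is not one of the small-space streaming algorithms from the cited works, and the function names are hypothetical.

```python
def degree_array(edge_stream):
    """Apply the updates D[u] + 1 and D[v] + 1 for every streamed edge (u, v)."""
    D = {}
    for u, v in edge_stream:
        D[u] = D.get(u, 0) + 1
        D[v] = D.get(v, 0) + 1
    return D


def frequency_moment(D, k):
    """Exact k-th frequency moment: the sum of D[i]**k over stored entries."""
    return sum(d ** k for d in D.values())


def heavy_hitters(D, top):
    """The `top` nodes of largest degree (ties broken by node id)."""
    return sorted(D, key=lambda i: (-D[i], i))[:top]


def range_sum(D, j, k):
    """Exact answer to the range query: D[j] + ... + D[k], inclusive."""
    return sum(D.get(i, 0) for i in range(j, k + 1))
```

On the example stream (1, 2), (4, 8), (2, 7), node 2 ends with degree 2 and every other touched node with degree 1.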
Cormode and Muthukrishnan show [14] that these problems can be solved using the
corresponding streaming algorithms that work for an array. They further provide
algorithms for these problems when the graph is a multigraph, but the degree
of a node is defined to count only the distinct edges. (For example, if the
stream for a multigraph has edges (1, 2), (2, 5), (1, 2), the degree of node 1
is 1, not 2, and the degree of node 2 is 2, not 3.) The details of the
algorithms are beyond the scope of this survey. We refer readers to [14] and
the aforementioned literature for streaming algorithms that compute statistics
and other queries for an array.
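To make the distinct-edge degree concrete, here is a minimal exact sketch. It remembers every edge seen so far, so it uses space linear in the number of distinct edges and is not the space-efficient method of [14]; the function name is hypothetical and self-loops are assumed absent.

```python
def distinct_degree(edge_stream):
    """Degree array of a multigraph stream, counting each distinct edge once."""
    seen = set()   # undirected edges already counted
    D = {}
    for u, v in edge_stream:
        key = frozenset((u, v))
        if key in seen:
            continue  # repeated edge: does not change any degree
        seen.add(key)
        D[u] = D.get(u, 0) + 1
        D[v] = D.get(v, 0) + 1
    return D
```

For the stream (1, 2), (2, 5), (1, 2) this yields degree 1 for node 1 and degree 2 for node 2, matching the example in the text.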
The node degree of a graph is also related to the entropy $H$ of an unbiased
random walk on the graph [9]. In particular, $H$ is defined to be
$H = \frac{1}{2|E|} \sum_{i=1}^{n} D[i] \log D[i]$. A streaming algorithm that
computes the entropy of an array whose $i$-th entry represents the frequency of
the $i$-th element in a set is given in [9]. The authors showed that the
algorithm can be applied to compute the entropy when the array is the
node-degree array $D$ of a graph, and therefore the entropy of an unbiased
random walk can be calculated for a graph stream. They also extended the
algorithm to multigraphs where only distinct edges are counted for the degree.
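As a small illustration of the formula for $H$, the following sketch evaluates it exactly from an edge stream. It keeps the full degree array, so it is not the streaming algorithm of [9]. The base of the logarithm is not stated in the text; base 2 is an assumption here, as is the function name.

```python
import math


def random_walk_entropy(edge_stream):
    """Evaluate H = (1 / (2|E|)) * sum_i D[i] * log2(D[i]) exactly.

    Assumes a non-empty stream of a simple undirected graph.
    """
    D = {}
    m = 0
    for u, v in edge_stream:
        m += 1
        D[u] = D.get(u, 0) + 1
        D[v] = D.get(v, 0) + 1
    return sum(d * math.log2(d) for d in D.values()) / (2 * m)
```

For a triangle every node has degree 2 and $|E| = 3$, so the formula gives $H = (3 \cdot 2 \cdot 1)/6 = 1$ under the base-2 assumption.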
Another problem that can be reduced to computing statistics of an array
is the triangle counting problem, i.e., finding the number of triangles in an
undirected graph. We describe here the reduction introduced by Bar-Yossef
et al. [6]. Similar to the earlier problems, there is a virtual array $P$
associated with the graph. Each entry in the array corresponds to an
(unordered) triple of graph nodes; e.g., if $v_i$, $v_j$, $v_k$ are three nodes
in the graph, there is an entry $P[(i, j, k)]$ in the array that corresponds to
the triple $\{v_i, v_j, v_k\}$. The value of the entry counts how many of the
three pairs $\{v_i, v_j\}$, $\{v_i, v_k\}$, and $\{v_j, v_k\}$ are actual edges
in the graph. There are four possible values for the entries: 0, 1, 2, and 3.
Let $T_0$, $T_1$, $T_2$, and $T_3$ be the number of entries that take the
corresponding value. Clearly, $T_3$ is exactly the number of triangles in the
graph. (We will abuse the notation and also use $T_i$ to denote the set of
triples whose entry value is $i$.)
Different from the reduction described earlier, an edge in the graph stream
here maps into updates of multiple entries in the array. If we see an edge
$(u, v)$, it means $(P[(u, v, s)] + 1)$ for all nodes $s \neq u, v$. Now
consider the frequency moments of the array, $f_k = \sum_{t} (P[t])^k$. It can
be decomposed into $f_k = T_1 \cdot 1^k + T_2 \cdot 2^k + T_3 \cdot 3^k$
because each entry with value 1 contributes $1^k$ to $f_k$, each entry with
value 2 contributes $2^k$, and each entry with value 3 contributes $3^k$. We
thus have the following equations:

$$\begin{pmatrix} f_0 \\ f_1 \\ f_2 \end{pmatrix} =
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 4 & 9 \end{pmatrix}
\begin{pmatrix} T_1 \\ T_2 \\ T_3 \end{pmatrix}.$$
Using streaming algorithms one can estimate $f_0$ and $f_2$, while $f_1$ can be
easily obtained from the stream: each edge increments exactly $n - 2$ entries,
so $f_1 = m(n - 2)$. Solving the above equations then gives an estimate
of $T_3$. (Although the virtual array is much larger than the graph stream,
e.g., a stream of $m$ edges translates into $m(n - 2)$ updates to the array,
the estimation algorithms often use space logarithmic in the size of the
array. Therefore, the memory space needed is not significantly affected by the
reduction.)
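The reduction can be checked on a tiny graph by building the virtual array $P$ explicitly and solving the linear system for $T_3$, which gives $T_3 = f_0 - \frac{3}{2} f_1 + \frac{1}{2} f_2$. The sketch below uses exact moments in place of the streaming estimates of [6] and stores $P$ in full, so it only illustrates the algebra; the function names are hypothetical and the graph is assumed simple.

```python
from collections import Counter


def triple_array(edges, nodes):
    """Build P explicitly: every edge (u, v) increments P[{u, v, s}] for s != u, v."""
    P = Counter()
    for u, v in edges:
        for s in nodes:
            if s != u and s != v:
                P[frozenset((u, v, s))] += 1
    return P


def triangles_from_moments(edges, nodes):
    """Recover T3 from f0, f1, f2 via T3 = f0 - 1.5 * f1 + 0.5 * f2."""
    P = triple_array(edges, nodes)
    f0 = len(P)                          # entries with a non-zero value
    f1 = sum(P.values())                 # equals m * (n - 2)
    f2 = sum(c * c for c in P.values())
    return (2 * f0 - 3 * f1 + f2) // 2   # integer form of the formula above
```

For a triangle with one pendant edge, the four triples take the values 3, 1, 2, 2, so $f_0 = 4$, $f_1 = 8$, $f_2 = 18$, and the formula returns one triangle.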
In [6], Bar-Yossef et al. also proposed improved streaming algorithms for
estimating frequency moments. Using the reduction and their frequency-moment
estimation, they show that for $\epsilon, \delta > 0$, the number of triangles
in a graph can be estimated within $\epsilon$ error (i.e., the estimate is
bounded between $(1 - \epsilon) T_3$ and $(1 + \epsilon) T_3$) with probability
at least $1 - \delta$. The algorithm uses space

$$s = O\left( \frac{1}{\epsilon^3} \cdot \log\frac{1}{\delta} \cdot
\left( \frac{T_1 + T_2 + T_3}{T_3} \right)^3 \cdot \log n \right)$$

and poly($s$) processing time for each edge. When the stream is an incidence
stream, they show that the number of triangles can be
$(\epsilon, \delta)$-estimated using space

$$O\left( \frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot
\left( \frac{T_3 + T_2}{T_3} \right)^2 \cdot \log n + d_{max} \log n \right),$$

where $d_{max}$ is the maximum degree.
In a follow-up work, Jowhari and Ghodsi [33] introduced several estimators
for the number of triangles. One estimator uses sequences of random numbers
in a way similar to [3]. Let $R$ be an array of uniform, $\pm 1$-valued random
numbers, i.e., $P(R[i] = 1) = P(R[i] = -1) = 0.5$ and $E(R[i]) = 0$.
The random numbers in the array are 12-wise independent. A family of such
random arrays can be constructed using the BCH code [3] in log-space. While
the edges stream by, one computes $Z = \sum_{(i,j) \in E} R[i] R[j]$.
$X = Z^3 / 6$ is then an estimator for the number of triangles in the graph.
This is so because $E(R^k[i]) = 0$ for odd $k$ and the numbers in $R$ are
12-wise independent. After the expansion of $Z^3$, the expectations of the
terms all evaluate to zero except those of the form
$6 R^2[i] R^2[j] R^2[k]$, which correspond to the triangles. Jowhari
and Ghodsi showed that the variance of the estimator can be controlled such
that only
$O\left( \frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \left( \frac{m^3 + m C_4 + C_6}{T_3^2} + 1 \right) \cdot \log n \right)$
space and per-edge processing time is needed for an
$(\epsilon, \delta)$-estimation. ($C_k$ is the number of cycles of length $k$
in the graph.) Two further sample-based estimators are also proposed in [33].
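The unbiasedness of $X = Z^3/6$ can be verified on small graphs by averaging over every possible sign array, which equals the expectation under fully independent uniform signs. This sketch deliberately replaces the BCH-based 12-wise independent family with exhaustive enumeration, so it is a check of the expectation and not the space-efficient estimator of [33]. Nodes are assumed to be numbered 0 to n-1 and the graph is assumed simple; the function names are hypothetical.

```python
from itertools import product


def z_statistic(edge_stream, signs):
    """One pass over the edges: Z = sum over edges (i, j) of R[i] * R[j]."""
    z = 0
    for i, j in edge_stream:
        z += signs[i] * signs[j]
    return z


def expected_estimate(edges, n):
    """Average Z**3 / 6 over all 2**n sign arrays (exact expectation)."""
    total = 0
    for signs in product((-1, 1), repeat=n):
        total += z_statistic(edges, signs) ** 3
    return total / (6 * 2 ** n)
```

For a single triangle, $Z = 3$ for the two constant sign arrays and $Z = -1$ for the other six, so the average of $Z^3$ is $(2 \cdot 27 - 6)/8 = 6$ and the estimate is exactly 1.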
Buriol et al. also proposed sample-based algorithms for counting triangles
in [8]. We present one of their algorithms in Algorithm 13.1. Its output
$\beta$ is a $\{0, 1\}$-valued random variable whose expectation is
$\frac{3 T_3}{T_1 + 2 T_2 + 3 T_3}$. We have
$T_1 + 2 T_2 + 3 T_3 = m(n - 2)$: consider the triples that consist of the two
end nodes of an edge plus one node from the other $n - 2$. There are
$m(n - 2)$ such combinations. On the other hand, this way of counting counts
each triple in $T_1$ once, each triple in $T_2$ twice, and each triple in
$T_3$ three times; hence the equality. $T_3$ can therefore be estimated using
a set of samples of $\beta$. For an $(\epsilon, \delta)$-estimation, this
algorithm uses
$O\left( \frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \frac{T_1 + T_2 + T_3}{T_3} \right)$
memory space and constant expected per-edge processing time.
Algorithm 13.1: Sample Triangle
  1st pass: Count the number of edges in the graph.
  2nd pass: Sample an edge $(u, v)$ uniformly. Choose a node $w$ uniformly
            from $V \setminus \{u, v\}$.
  3rd pass:
    if both $(u, w)$ and $(v, w)$ are actual edges in the stream then
      $\beta = 1$
    else
      $\beta = 0$
    end
  return $\beta$

Buriol et al. further showed that Algorithm 13.1 can be modified into a
one-pass algorithm. The uniform sampling of the edges can be done in one pass
by reservoir sampling [43]. One difference here is that edges $(u, w)$ and
$(v, w)$ may arrive before $(u, v)$ in the stream. When $(u, v)$ gets selected
as a sample, we have then already missed $(u, w)$ or $(v, w)$ and would not
detect $\{u, v, w\}$ as a triangle. This happens whenever $(u, v)$ is not the
first edge of the triangle in the stream, and it reduces the expectation of
$\beta$ by a factor of 3. Sample-based algorithms are also proposed in [8] for
incidence streams.
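The three-pass version of Algorithm 13.1 can be sketched as follows. The stream is modelled as a list that can be read repeatedly, the node set is passed in explicitly, and the averaging wrapper `estimate_triangles` (which rescales by $m(n-2)/3$) is an illustrative addition; all names here are hypothetical rather than taken from [8].

```python
import random


def sample_triangle(stream, nodes, rng):
    """One run of the three-pass sampler; returns beta in {0, 1}."""
    # 1st pass: count the edges.
    m = sum(1 for _ in stream)
    # 2nd pass: take the edge at a uniformly random position.
    target = rng.randrange(m)
    for idx, edge in enumerate(stream):
        if idx == target:
            u, v = edge
            break
    # Choose w uniformly from the nodes other than u and v.
    w = rng.choice([x for x in nodes if x != u and x != v])
    # 3rd pass: check whether both (u, w) and (v, w) occur in the stream.
    needed = {frozenset((u, w)), frozenset((v, w))}
    for a, b in stream:
        needed.discard(frozenset((a, b)))
    return 1 if not needed else 0


def estimate_triangles(stream, nodes, samples, seed=0):
    """Average beta over independent runs and rescale by m * (n - 2) / 3."""
    rng = random.Random(seed)
    m, n = len(stream), len(nodes)
    mean_beta = sum(sample_triangle(stream, nodes, rng) for _ in range(samples)) / samples
    return mean_beta * m * (n - 2) / 3
```

On a complete graph every sampled pair of edge and node closes a triangle, so $\beta$ is always 1 and the estimate equals $m(n-2)/3$ exactly; on a triangle-free graph $\beta$ is always 0.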
4. Graph Matching
A matching in a graph is a set of edges without common nodes. For an un-
weighted graph, the maximum matching problem is to find a matching having
the largest cardinality (number of edges). For a weighted graph, the problem
is to find a matching whose edges give the largest weight sum. We survey un-
weighted and weighted matching algorithms for graph streams in the following
sections.
4.1 Unweighted Matching
An early algorithm for approximating unweighted bipartite matching in the
streaming model is given in [22]. We describe the algorithm here. It is easy to
see that a maximal matching (a matching to which no more edges can be added,
because every edge outside the matching shares a vertex with some edge in the
matching) can be constructed in one pass over the graph stream.
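A minimal sketch of this one-pass construction keeps an edge whenever both of its endpoints are still free. The function name is hypothetical.

```python
def greedy_maximal_matching(edge_stream):
    """Single pass: keep an edge if neither endpoint is already matched."""
    matched = set()   # nodes covered by the matching so far
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

The result depends on the arrival order: on the path 0-1-2-3, seeing the middle edge first yields a maximal matching of size 1, while seeing an end edge first yields the maximum matching of size 2.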
Given a matching $M$ for a bipartite graph $G = (L \cup R, E)$, a length-3
augmenting path for an edge $e = (u, v) \in M$, with $u \in L$ and $v \in R$,
is a quadruple $(w_l, u, v, w_r)$ such that $(u, w_l), (w_r, v) \in E$ and
$w_l$ and $w_r$ are free vertices. We call $w_l$ and $w_r$ the wing-tips of
the augmenting path, $(u, w_l)$ the left wing, and $(w_r, v)$ the right wing.
A set of simultaneously augmentable length-3 augmenting paths is a set of
length-3 augmenting paths that are vertex disjoint.
Algorithm 13.2: Find Augmenting Paths
Input: a graph $G = (L \cup R, E)$, a matching $M$ for $G$ and a parameter
       $0 < \delta < 1$.
  while true do
    In one pass, find a maximal set of disjoint left wings. If the number of
    left wings found is $\leq \delta |M|$, terminate.
    In a second pass, for the edges in $M$ with left wings, find a maximal
    set of disjoint right wings.
    In a third pass, identify the set of vertices that are
      endpoints of a matched edge that got a left wing, or
      the wing-tips of a matched edge that got both wings, or
      endpoints of a matched edge that is no longer 3-augmentable.
    Remember these vertices and, in subsequent passes, ignore any
    edge incident on one of these vertices.
  end
Given a bipartite graph and a matching in the graph, the subroutine in Al-
gorithm 13.2 finds a set of simultaneously augmentable length-3 augmenting
paths. It will be used in the main algorithm that computes the matching for a
bipartite graph.
Let $X$ be a maximum-sized set of simultaneously augmentable length-3
augmenting paths for the maximal matching $M$. Let $\alpha = |X| / |M|$. It is
shown in [22] that Algorithm 13.2 finds at least
$\frac{\alpha |M| - 2 \delta |M|}{3}$ simultaneously augmentable length-3
augmenting paths in $3/\delta$ passes.
The main matching algorithm increases the size of a matching by repeatedly
finding a set of simultaneously augmentable length-3 augmenting paths and
augmenting the matching using these paths.
The for-loop in Algorithm 13.3 runs
$\left\lceil \frac{\log 6\epsilon}{\log 8/9} \right\rceil$ times. During each
run, the subroutine described in Algorithm 13.2 makes $3/\delta$ passes over
the input graph stream. Therefore, Algorithm 13.3 makes
$O\left( \frac{\log 1/\epsilon}{\epsilon} \right)$ passes over the stream in
total. Each call to the subroutine finds a set of simultaneously augmentable
length-3 augmenting paths, which increases the size of the matching. The final
matching size reaches at least $(2/3 - \epsilon)$ of the maximum matching. The
algorithm processes each edge in $O(1)$ time in each pass except the first
pass, in which the bipartition is found. The storage space required by the
algorithm is $O(n \log n)$.
Algorithm 13.3: Unweighted Bipartite Matching
Input: a bipartite graph $G = (L \cup R, E)$ and a parameter
       $0 < \epsilon < 1/3$.
  In one pass, find a maximal matching $M$ and the bipartition of $G$.
  for $k = 1, 2, \ldots, \left\lceil \frac{\log 6\epsilon}{\log 8/9} \right\rceil$ do
    Run Algorithm 13.2 with $G$, $M$ and $\delta = \frac{\epsilon}{2 - 3\epsilon}$.
    for each $e = (u, v) \in M$ for which an augmenting path
        $(w_l, u, v, w_r)$ is found by Algorithm 13.2 do
      remove $(u, v)$ from $M$ and add $(u, w_l)$ and $(w_r, v)$ to $M$.
    end
  end
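The total pass count of Algorithm 13.3 follows from multiplying the number of loop iterations by the passes used per call of Algorithm 13.2. The short derivation below spells this out, using only the loop bound and the choice of $\delta$ given in the algorithm; it is a worked check, not a quotation from [22].

```latex
\begin{align*}
\text{iterations} &= \left\lceil \frac{\log 6\epsilon}{\log 8/9} \right\rceil
   = \left\lceil \frac{\log \frac{1}{6\epsilon}}{\log \frac{9}{8}} \right\rceil
   = O\!\left(\log \tfrac{1}{\epsilon}\right), \\
\text{passes per iteration} &= \frac{3}{\delta}
   = \frac{3(2 - 3\epsilon)}{\epsilon}
   = \frac{6}{\epsilon} - 9
   = O\!\left(\tfrac{1}{\epsilon}\right), \\
\text{total passes} &= O\!\left(\log \tfrac{1}{\epsilon}\right) \cdot
   O\!\left(\tfrac{1}{\epsilon}\right)
   = O\!\left(\frac{\log 1/\epsilon}{\epsilon}\right).
\end{align*}
```

For $\epsilon \geq 1/6$ the loop bound is not positive, so the loop body never runs; this is consistent with the guarantee, since a maximal matching already has at least half the size of a maximum matching and $1/2 \geq 2/3 - \epsilon$ in that range.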
Figure 13.1. Layered Auxiliary Graph. Left, a graph with a matching (solid edges); Right, a layered
auxiliary graph. (An illustration, not constructed from the graph on the left. The solid edges show
potential augmenting paths.)
In [35], McGregor introduced an improved algorithm to find augmenting
paths in an unweighted graph for which a maximal matching has been constructed.
Given the original input graph $G$ and a matching $M$, McGregor constructs an
auxiliary graph $G_A$ to help search for augmenting paths. Fig. 13.1 gives an
example of an auxiliary graph. The auxiliary graph is a layered graph with a
small number, $k + 2$, of layers. It is derived as follows. Let
$L_0, L_1, \ldots, L_{k+1}$ be the layers in $G_A$. The free nodes in $G$,
i.e., the nodes that are not covered by an edge in $M$, are randomly projected
to be nodes in $L_0$ or $L_{k+1}$. Each edge in $M$ is projected to be a node
in $G_A$, and this node is randomly assigned to one of the layers
$L_1, L_2, \ldots, L_k$. There is an edge between a node $x \in L_i$
(corresponding to $(v_1, v_2) \in M$) and a node $y \in L_{i-1}$
(corresponding to $(v_3, v_4) \in M$) if $(v_2, v_3) \in G$. With this
construction, an $(i + 1)$-length path in $G_A$ can be mapped to a
$(2i + 1)$-length augmenting path for $M$ in $G$. Identifying a set of
augmenting paths for $M$ in $G$ is thus transformed into finding a set of
node-disjoint paths in $G_A$. Because one does not have enough space to store
the whole graph $G$ in the streaming model, the auxiliary graph $G_A$ normally
cannot be stored as a whole either. However, the nodes in $G_A$ can be stored.
While the algorithm passes through the input stream of $G$, the edges in $G_A$
also get revealed. Hence, the problem boils down to finding a near-maximal set
of node-disjoint paths in $G_A$.
A search algorithm was proposed in [35] for this purpose. The algorithm
finds a maximal matching between layers $L_{i-1}$ and $L_i$. Let
$S_i \subseteq L_i$ be the set of nodes involved in this matching. The
algorithm then goes ahead to find a maximal matching between $S_i$ and
$L_{i+1}$. It continues in this fashion to grow a set of node-disjoint paths.
Clearly, the size of $S_i$ may decrease as $i$ increases, and $S_i$ may become
empty before the last layer is reached. To avoid this, the path growth process
may backtrack if the size of $S_i$ becomes too small. The backtracking is done
by marking the nodes in $S_i$ as dead ends, removing them from $G_A$, and
continuing path growth in the remainder of $G_A$.
For a particular $G_A$ construction and path growth, the resulting set of paths
may be small. However, the $G_A$ construction is random because the nodes
corresponding to the edges in $M$ are randomly assigned to the layers. A
matching algorithm is given in [35] that is similar to Algorithm 13.3 in
structure but utilizes the $G_A$-based augmenting-path search. It is shown
that, with high probability, this algorithm finds a matching in
$O_\epsilon(1)$ passes (the number of passes is a function of $\epsilon$, and
is a constant if $\epsilon$ is constant) whose size is at least
$\frac{1}{1 + \epsilon}$ of the maximum matching.
4.2 Weighted Matching
The streaming version of the problem was first studied in [22] where a
streaming algorithm (Algorithm 13.4) was proposed. The algorithm uses only
one pass over the stream and manages to find a matching whose weight is at
least $\frac{1}{6}$ of the optimal weight.
Algorithm 13.4: Weighted Matching
   1  Maintain a matching $M$ at all times.
   2  while there are edges in the stream do
   3    Let $e$ be the next edge in the stream and $w(e)$ be the weight of $e$;
   4    Let $w(C)$ be the sum of the weights of the edges in
        $C = \{e' \mid e' \in M$ and $e'$ and $e$ share an end point$\}$.
        ($w(C) = 0$ if $C$ is empty.)
   5    if $w(e) > 2 w(C)$ then
   6      update $M \leftarrow M \cup \{e\} \setminus C$.
   7    else
   8      ignore $e$
   9    end
  10  end
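Algorithm 13.4 can be sketched in Python as follows. The representation of the stream as `(u, v, weight)` triples, the `mate` dictionary, and the function name are assumptions made for illustration; weights are assumed positive and self-loops are skipped.

```python
def streaming_weighted_matching(edge_stream):
    """One pass over (u, v, weight) triples; returns {frozenset({u, v}): weight}."""
    mate = {}  # node -> (other endpoint, weight) of the matching edge covering it
    for u, v, w in edge_stream:
        if u == v:
            continue
        # C: the (at most two) edges of M that share an endpoint with the new edge.
        conflicts = {}
        for x in (u, v):
            if x in mate:
                y, wx = mate[x]
                conflicts[frozenset((x, y))] = wx
        # Condition of line 5: the new edge must outweigh twice the weight of C.
        if w > 2 * sum(conflicts.values()):
            for edge in conflicts:          # drop C from M
                for x in edge:
                    del mate[x]
            mate[u] = (v, w)                # add the new edge to M
            mate[v] = (u, w)
    return {frozenset((x, y)): w for x, (y, w) in mate.items()}
```

On the stream (1, 2, 1), (2, 3, 3), (3, 4, 10) each new edge replaces the previous one, so the result is the single edge {3, 4} of weight 10, while the optimal matching {1, 2}, {3, 4} has weight 11.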
The following property of Algorithm 13.4 is shown in [22].
Theorem 13.2. In 1 pass and $O(n \log n)$ storage, Algorithm 13.4 constructs
a weighted matching whose weight is at least $\frac{1}{6}$ of the optimal
weight.
Proof: For any set of edges $S$, let $w(S) = \sum_{e \in S} w(e)$. We say that
an edge is selected if it is ever part of $M$. We say that an edge is dropped
if it was selected earlier but later removed from $M$ (step 6 in Algorithm
13.4) by a new heavier edge. This new edge replaces the dropped edge. We say an
edge is a survivor if it is selected and never dropped. Let the set of
survivors be $S$. The weight of the matching we find is therefore $w(S)$.
For each survivor $e$, let the Trail of Drops leading to this edge be
$T(e) = C_1 \cup C_2 \cup \ldots$, where $C_0 = \{e\}$,
$C_1 = \{\text{the edges replaced by } e\}$, and
$C_i = \cup_{e' \in C_{i-1}} \{\text{the edges replaced by } e'\}$. We have
$w(T(e)) \leq w(e)$. This is because for each replacing edge $e$, $w(e)$ is at
least twice the cost of the replaced edges, and an edge has at most one
replacing edge. Hence, for all $i$, $w(C_i) \geq 2 w(C_{i+1})$ and

$$2 w(T(e)) = \sum_{i \geq 1} 2 w(C_i) \leq \sum_{i \geq 0} w(C_i) = w(T(e)) + w(e).$$
Now consider the optimal solution that includes edges
$\mathrm{opt} = \{o_1, o_2, \ldots\}$. We are going to charge the costs of the
edges in opt to the survivors and their trails of drops,
$\cup_{e \in S} T(e) \cup \{e\}$. We hold an edge $e$ in this set accountable
to $o \in \mathrm{opt}$ if either $e = o$ or if $o$ wasn't selected because $e$
was in $M$ when $o$ arrived. Note that, in the second case, it is possible for
two edges to be accountable to $o$. If only one edge $e$ is accountable for $o$
then we charge $w(o)$ to $e$. If two edges $e_1$ and $e_2$ are accountable for
$o$, then we charge $\frac{w(o) w(e_1)}{w(e_1) + w(e_2)}$ to $e_1$ and
$\frac{w(o) w(e_2)}{w(e_1) + w(e_2)}$ to $e_2$. In either case, the amount
charged by $o$ to any edge $e$ is at most $2 w(e)$.
We now redistribute these charges as follows: (for distinct $u_1, u_2, u_3$) if
$e = (u_1, v)$ gets charged by $o = (u_2, v)$, and $e$ subsequently gets
replaced by $e' = (u_3, v)$, we transfer the charge from $e$ to $e'$. Note
that we maintain the property that the amount charged by $o$ to any edge $e$ is
at most $2 w(e)$ because $w(e') \geq w(e)$. What this redistribution of charges
achieves is that now every edge in a trail of drops is only charged by one edge
in opt. Survivors can, however, be charged by two edges in opt. We charge
$w(\mathrm{opt})$ to the survivors and their trails of drops, and hence

$$w(\mathrm{opt}) \leq \sum_{e \in S} \left( 2 w(T(e)) + 4 w(e) \right).$$

Because $w(T(e)) \leq w(e)$,

$$\sum_{e \in S} \left( 2 w(T(e)) + 4 w(e) \right) \leq 6 w(S)$$

and the theorem follows. □
The condition on line 5 of Algorithm 13.4 can be generalized to
$w(e) > (1 + \gamma) w(C)$, where
$C = \{e' \mid e' \in M$ and $e'$ and $e$ share an end point$\}$. By setting
$\gamma$ appropriately and repeating Algorithm 13.4 until the improvement
yielded falls below some threshold, a matching can be constructed [35] in
$O_\epsilon(1)$ passes whose size is at least $\frac{1}{2 + \epsilon}$ of the
maximum matching.
Another improvement for weighted matching was made recently by
Zelke [46]. Zelke's algorithm is also based on Algorithm 13.4, but incorporates
some improvements. In particular, the algorithm stores a few edges that
have been in $M$ in the past but were replaced later, to potentially reinsert
them into $M$ in the future. Such edges are called "shadow edges" in [46]. With
shadow edges, when a new edge arrives in the stream, besides the (two) edges
that share endpoints with the new edge, a few other edges (edges in $M$ as
well as the shadow edges) in the vicinity of the new edge can be examined to
find potential augmenting paths. This improves the approximation from 1/5.82
(by an algorithm in [35]) to 1/5.58.
5. Graph Distance
We consider the shortest-path distance in a graph. The shortest path between
two vertices in a graph is the path that has the smallest number of edges (for
an unweighted graph) or the smallest sum of the weights of the path edges (for
a weighted graph). There may be more than one such shortest path.
A structure often used in approximating graph distance is the graph
spanner [39, 11, 18]. An undirected graph $G = (V, E)$ induces a metric space
$\mathcal{U}$ in which the vertex set $V$ serves as the set of points, and the
shortest-path distances serve as the distances between the points. The graph
spanner $G' = (V, H)$, $H \subseteq E$, is a sparse skeleton of the graph $G$
whose induced metric space $\mathcal{U}'$ is a close approximation of the
metric space $\mathcal{U}$ of the graph $G$. That is, the distance between two
vertices in $G'$ is not far from the distance between the same two vertices in
$G$. For example, a subgraph $G' = (V, H)$, $H \subseteq E$, is a
(multiplicative) $t$-spanner of the graph $G$ if for every pair of vertices
$u, v \in V$, $dist_{G'}(u, v) \leq t \cdot dist_G(u, v)$ (where
$dist_G(u, v)$ stands for the distance between the vertices $u$ and $v$ in the
graph $G$). The stretch factor of a spanner is the parameter (or parameters)
that determines how closely the spanner approximates the distances in the
original graph, e.g., in the case of a $t$-spanner, the parameter $t$.
Clearly, if a spanner can be constructed for a massive graph, one can approx-
imate the node distance in the graph using the spanner. Because the spanner
is much smaller than the original graph, it can often be stored in the main
memory. In fact, an early application of spanners is to maintain a succinct rep-
resentation of the routing information [39, 11]. Instead of the original network
graph, spanners are passed and stored by the routers for calculating the routing
paths. Besides distances, the diameter of a graph can be approximated using

the spanner diameter.
In [22], Feigenbaum et al. gave a simple streaming algorithm for spanner
construction by adapting the technique of [4]. It exploits a connection
between the girth of a graph and the spanner. (The girth of a graph is the
length of the shortest cycle in the graph.) However, in the worst case, the
algorithm needs more than $O(n)$ time to process an edge. Such a processing
time is prohibitively high for the streaming model.
For an unweighted graph, the algorithm of [22] constructs a
$(\log n / \log\log n)$-spanner $S$ in one pass. Because a graph whose girth
is larger than $k$ has at most $\lceil n^{1 + 2/(k-2)} \rceil$ edges
[7, 17, 2], the algorithm constructs $S$ by adding an edge in the stream to $S$
if the edge does not cause a cycle of length less than $\log n / \log\log n$
in the $S$ constructed so far. Otherwise, the edge is ignored. Note that for
each ignored edge, there is a path $P$ of length at most
$\log n / \log\log n$ in $S$ that connects the two endpoints of this edge. Any
shortest path in the original graph that uses this edge can be replaced by a
path in $S$ that uses $P$. Therefore, $S$ is a $(\log n / \log\log n)$-spanner
of the original graph.
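The cycle-avoidance rule can be sketched with a depth-limited breadth-first search over the spanner built so far. As an assumption for illustration, the sketch takes a generic stretch parameter `t` instead of fixing it to $\log n / \log\log n$, and it skips an edge exactly when its endpoints are already within distance `t` in $S$ (that is, when adding it would close a cycle of length at most `t` + 1). The function names are hypothetical, and the search cost per edge reflects the high worst-case processing time noted earlier.

```python
from collections import deque


def within_distance(adj, src, dst, limit):
    """True if dst can be reached from src using at most `limit` edges of adj."""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == limit:
            continue  # do not expand beyond the distance limit
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return False


def stream_spanner(edge_stream, t):
    """One pass: keep an edge unless its endpoints are within distance t in S."""
    adj = {}    # adjacency sets of the spanner S built so far
    kept = []
    for u, v in edge_stream:
        if within_distance(adj, u, v, t):
            continue  # S already has a path of length <= t between u and v
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        kept.append((u, v))
    return kept
```

On a 4-cycle streamed in order, `t = 3` drops the closing edge because the other three edges form a path of length 3 between its endpoints, while `t = 2` keeps all four edges.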
For a weighted graph, however, the construction in [4] requires sorting the
edges according to their weights, which is difficult in the streaming model.
Instead of sorting, a geometric grouping technique is used in [22] to extend
the spanner construction for unweighted graphs to a construction for weighted
graphs. This technique is similar to the one used in [12]. Let $\omega_{min}$
be the minimum weight and $\omega_{max}$ be the maximum weight. We divide the
range $[\omega_{min}, \omega_{max}]$ into intervals of the form
$[(1 + \epsilon)^i \omega_{min}, (1 + \epsilon)^{i+1} \omega_{min})$ and
round all the weights in the interval
$[(1 + \epsilon)^i \omega_{min}, (1 + \epsilon)^{i+1} \omega_{min})$ down to
$(1 + \epsilon)^i \omega_{min}$. For each induced graph $G_i = (V, E_i)$, where
$E_i$ is the set of edges in $E$ whose weight is in the interval
$[(1 + \epsilon)^i \omega_{min}, (1 + \epsilon)^{i+1} \omega_{min})$, a spanner
can be constructed in parallel using the above construction for unweighted
graphs. The union of the spanners for all the $G_i$,
$i \in \{0, 1, \ldots, \log_{(1+\epsilon)} \frac{\omega_{max}}{\omega_{min}} - 1\}$,
forms a spanner for the graph $G$. Note that this can be done without prior
knowledge of $\omega_{min}$ and $\omega_{max}$. The goal is only to break the
range $[\omega_{min}, \omega_{max}]$ into a small number of intervals. Given
any value $\omega \in [\omega_{min}, \omega_{max}]$, we can use the set of
intervals of the form
$[(1 + \epsilon)^i \omega, (1 + \epsilon)^{i+1} \omega)$ and
$\left[ \frac{\omega}{(1 + \epsilon)^{i+1}}, \frac{\omega}{(1 + \epsilon)^i} \right)$,
which are determined by $\omega$ alone.
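The grouping step can be sketched as a bucket index computed relative to a reference weight $\omega$. Taking the first weight seen in the stream as that reference is an assumption made here for illustration (the text only requires some value in the weight range), and the function names are hypothetical. Negative indices cover the intervals below the reference.

```python
import math


def bucket_index(weight, ref_weight, eps):
    """Index i with (1+eps)**i * ref <= weight < (1+eps)**(i+1) * ref.

    Floating-point rounding may misplace weights lying exactly on a boundary.
    """
    return math.floor(math.log(weight / ref_weight) / math.log1p(eps))


def rounded_weight(weight, ref_weight, eps):
    """Round a weight down to the lower end of its interval."""
    return ref_weight * (1 + eps) ** bucket_index(weight, ref_weight, eps)


def group_edges(weighted_stream, eps):
    """Split (u, v, weight) triples into the edge sets E_i, keyed by bucket index."""
    ref = None
    groups = {}
    for u, v, w in weighted_stream:
        if ref is None:
            ref = w  # assumption: the first weight seen serves as the reference
        groups.setdefault(bucket_index(w, ref, eps), []).append((u, v))
    return groups
```

Each group can then be fed to an unweighted spanner construction, and the union of the resulting spanners taken as described in the text.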
5.1 Distance Approximation using Multiple Passes
Elkin and Zhang gave a multiple-pass streaming spanner construction
in [21]. This algorithm builds an additive spanner. A subgraph $G' = (V, H)$
