
142 MANAGING AND MINING GRAPH DATA
for subgraph isomorphism. Procedure Search(𝑖) iterates on the 𝑖-th node to find feasible mappings for that node. Procedure Check(𝑢ᵢ, 𝑣) examines if 𝑢ᵢ can be mapped to 𝑣 by considering their edges. Line 12 maps 𝑢ᵢ to 𝑣. Lines 13–16 continue to search for the next node or, if it is the last node, evaluate the graph-wide predicate. If it is true, then a feasible mapping 𝜙 : 𝑉(𝒫) → 𝑉(𝐺) has been found and is reported (line 15). Line 16 stops searching immediately if only one mapping is required.
The graph pattern and the graph are represented as a vertex set and an edge
set, respectively. In addition, adjacency lists of the graph pattern are used to
support line 21. For line 22, edges of graph 𝐺 can be represented in a hashtable
where keys are pairs of the end points. To avoid repeated evaluation of edge
predicates (line 22), another hashtable can be used to store evaluated pairs of
edges.
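As a concrete illustration, the backtracking search and the hashtable-based edge representation can be sketched as follows. This is a minimal sketch, not the authors' implementation: node predicates are reduced to precomputed feasible-mate lists, and edge predicates are reduced to mere edge existence in the hashtable.

```python
def subgraph_search(pattern_nodes, pattern_edges, phi, graph_edges, first_only=False):
    """Backtracking search over the space phi[u_1] x ... x phi[u_k].

    pattern_nodes: list of pattern nodes u_1..u_k, in search order
    pattern_edges: set of pairs (u_i, u_j), the pattern's edges
    phi: dict u -> list of feasible mates of u in G
    graph_edges: set of pairs (v, w); plays the role of the hashtable
                 keyed by pairs of end points
    """
    k = len(pattern_nodes)
    mapping = {}          # partial mapping u -> v
    results = []

    def check(u, v):
        # u can be mapped to v only if every pattern edge between u and an
        # already-mapped pattern node has a counterpart edge in G
        for u2, w in mapping.items():
            if (u, u2) in pattern_edges or (u2, u) in pattern_edges:
                if (v, w) not in graph_edges and (w, v) not in graph_edges:
                    return False
        return True

    def search(i):
        if i == k:
            results.append(dict(mapping))   # feasible mapping found
            return first_only               # stop if one mapping suffices
        u = pattern_nodes[i]
        for v in phi[u]:
            if v not in mapping.values() and check(u, v):
                mapping[u] = v
                if search(i + 1):
                    return True
                del mapping[u]
        return False

    search(0)
    return results
```

With real edge predicates, a second hashtable keyed by evaluated edge pairs would memoize the predicate results, as the text suggests.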
The worst-case time complexity of Algorithm 4.1 is 𝑂(𝑛ᵏ) where 𝑛 and 𝑘 are the sizes of graph 𝐺 and graph pattern 𝒫, respectively. This complexity is a consequence of subgraph isomorphism, which is known to be NP-hard. In practice, the running time depends on the size of the search space.
We now consider possible ways to accelerate Algorithm 4.1:
1 How to reduce the size of Φ(𝑢ᵢ) for each node 𝑢ᵢ? How to efficiently retrieve Φ(𝑢ᵢ)?
2 How to reduce the overall search space Φ(𝑢₁) × ⋯ × Φ(𝑢ₖ)?
3 How to optimize the search order?
We present three techniques that respectively address the above questions.
The first technique prunes each Φ(𝑢ᵢ) individually and retrieves it efficiently
through indexing. The second technique prunes the overall search space by
considering all nodes in the pattern simultaneously. The third technique applies
ideas from traditional query optimization to find the right search order.
4.2 Local Pruning and Retrieval of Feasible Mates
Node attributes can be indexed directly using traditional index structures
such as B-trees. This allows for fast retrieval of feasible mates and avoids a
full scan of all nodes. To reduce the size of the feasible mate sets Φ(𝑢ᵢ) even further,
we can go beyond nodes and consider neighborhood subgraphs of the nodes.
The neighborhood information can be exploited to prune infeasible mates at an
early stage.
Definition 4.10. (Neighborhood Subgraph) Given graph 𝐺, node 𝑣 and radius
𝑟, the neighborhood subgraph of node 𝑣 consists of all nodes within distance
𝑟 (number of hops) from 𝑣 and all edges between the nodes.
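Definition 4.10 can be realized by a bounded breadth-first search. The following is a sketch (function and variable names are mine):

```python
from collections import deque

def neighborhood_subgraph(adj, v, r):
    """Return (nodes, edges) of the radius-r neighborhood subgraph of v.

    adj: dict node -> set of adjacent nodes.
    Nodes within r hops of v are collected by BFS; the edges are all
    edges of the graph whose both end points lie in that node set.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == r:
            continue                      # do not expand beyond radius r
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    nodes = set(dist)
    edges = {(x, y) for x in nodes for y in adj[x] if y in nodes and x < y}
    return nodes, edges
```

The `x < y` convention merely stores each undirected edge once and assumes comparable node identifiers.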

Query Language and Access Methods for Graph Databases 143
Node 𝑣 is a feasible mate of node 𝑢ᵢ only if the neighborhood subgraph of 𝑢ᵢ is sub-isomorphic to that of 𝑣 (with 𝑢ᵢ mapped to 𝑣). Note that if the radius is 0, then the neighborhood subgraphs degenerate to nodes.
Although neighborhood subgraphs have high pruning power, they incur a
large computation overhead. This overhead can be reduced by representing
neighborhood subgraphs by their light-weight profiles. For instance, one can
define the profile as the sequence of node labels in lexicographic order. The pruning condition then becomes whether one profile is a subsequence of the other.
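With profiles defined as lexicographically sorted label sequences, the pruning test is a standard subsequence check. A sketch (labels are assumed to be single characters here; multi-character labels would need a list-based comparison):

```python
def profile(labels):
    """Profile of a neighborhood subgraph: its node labels in sorted order."""
    return ''.join(sorted(labels))

def is_subsequence(p, q):
    """True iff profile p is a subsequence of profile q (greedy scan).

    This is the necessary condition used for pruning: if the pattern
    node's profile is not a subsequence of the data node's profile,
    the data node cannot be a feasible mate.
    """
    it = iter(q)
    return all(ch in it for ch in p)
```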
Figure 4.16. A sample graph pattern and graph (pattern 𝒫 over nodes labelled A, B, C; graph 𝐺 with nodes A₁, A₂, B₁, B₂, C₁, C₂)
Figure 4.17. Feasible mates using neighborhood subgraphs and profiles. The table lists the nodes of 𝐺, their neighborhood subgraphs of radius 1, and the corresponding profiles (e.g., ABC, ABCC, ABBC, BC, AB). The resulting search spaces are also shown for different pruning techniques: retrieve by nodes: {A₁, A₂} × {B₁, B₂} × {C₁, C₂}; retrieve by neighborhood subgraphs: {A₁} × {B₁} × {C₂}; retrieve by profiles of neighborhood subgraphs: {A₁} × {B₁, B₂} × {C₂}.
Figure 4.16 shows the sample graph pattern 𝒫 and the database graph 𝐺
again for convenience. Figure 4.17 shows the neighborhood subgraphs of ra-
dius 1 and their profiles for nodes of 𝐺. If the feasible mates are retrieved using
node attributes, then the search space is {𝐴₁, 𝐴₂} × {𝐵₁, 𝐵₂} × {𝐶₁, 𝐶₂}. If the feasible mates are retrieved using neighborhood subgraphs, then the search space is {𝐴₁} × {𝐵₁} × {𝐶₂}. Finally, if the feasible mates are retrieved using profiles, then the search space is {𝐴₁} × {𝐵₁, 𝐵₂} × {𝐶₂}. These are shown on the right side of Figure 4.17.
If the node attributes are selective, e.g., many unique attribute values, then
one can index the node attributes using a B-tree or hashtable, and store the
neighborhood subgraphs or profiles as well. Retrieval is done by indexed ac-
cess to the node attributes, followed by pruning using neighborhood subgraphs
or profiles. Otherwise, if the node attributes are not selective, one may have
to index the neighborhood subgraphs or profiles. Recent graph indexing tech-
niques [9, 17, 23, 34, 36, 39–42] or multi-dimensional indexing methods such
as R-trees can be used for this purpose.
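The first scenario (selective node attributes) can be sketched as a hashtable index on labels followed by local pruning on stored profiles. This is an illustration only; the profile values in the usage below are hypothetical, chosen to reproduce the search space {A₁} × {B₁, B₂} × {C₂} of Figure 4.17.

```python
def build_label_index(node_labels):
    """Hashtable index: label -> list of node ids (stands in for a B-tree)."""
    index = {}
    for v, lab in node_labels.items():
        index.setdefault(lab, []).append(v)
    return index

def feasible_mates(u_label, u_profile, index, node_profiles):
    """Indexed access by node attribute, then pruning by profile containment."""
    def is_subseq(p, q):
        it = iter(q)
        return all(ch in it for ch in p)
    return [v for v in index.get(u_label, ())
            if is_subseq(u_profile, node_profiles[v])]
```

Here every pattern node of a triangle query has profile "ABC" (its radius-1 neighborhood is the whole triangle), so only data nodes whose stored profile contains "ABC" as a subsequence survive.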
4.3 Joint Reduction of Search Space
We reduce the overall search space iteratively by an approximation algorithm called Pseudo Subgraph Isomorphism [17]. This prunes the search space by considering the whole pattern and the space Φ(𝑢₁) × ⋯ × Φ(𝑢ₖ) simultaneously. Essentially, this technique checks for each node 𝑢 in pattern 𝒫 and its

feasible mate 𝑣 in graph 𝐺 whether the adjacent subtree of 𝑢 is sub-isomorphic to that of 𝑣. The check can be defined recursively on the depth of the adjacent subtrees: the level-𝑙 subtree of 𝑢 is sub-isomorphic to that of 𝑣 only if the level-(𝑙−1) subtrees of 𝑢's neighbors can all be matched to those of 𝑣's neighbors. To avoid subtree isomorphism tests, a bipartite graph ℬ𝑢,𝑣 is defined between the neighbors of 𝑢 and 𝑣. If the bipartite graph has a semi-perfect matching, i.e., all neighbors of 𝑢 are matched, then 𝑢 is level-𝑙 sub-isomorphic to 𝑣. In the bipartite graph, an edge is present between two nodes 𝑢′ and 𝑣′ only if the level-(𝑙−1) subtree of 𝑢′ is sub-isomorphic to that of 𝑣′, or equivalently the bipartite graph ℬ𝑢′,𝑣′ at level 𝑙−1 has a semi-perfect matching. A more detailed description can be found in [17].
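The semi-perfect matching test at the heart of this check can be sketched with a standard augmenting-path bipartite matching. Hopcroft–Karp would be asymptotically faster; this simpler routine is only an illustration:

```python
def has_semi_perfect_matching(left, right, edge):
    """True iff every left-side node (the neighbors of u) can be matched
    to a distinct right-side node (a neighbor of v).

    left, right: lists of nodes; edge(l, r) -> bool defines the bipartite
    graph. Classic augmenting-path matching, O(|left| * |edges|).
    """
    match_of = {}  # right node -> left node currently matched to it

    def augment(l, seen):
        for r in right:
            if edge(l, r) and r not in seen:
                seen.add(r)
                # r is free, or its current partner can be re-routed
                if r not in match_of or augment(match_of[r], seen):
                    match_of[r] = l
                    return True
        return False

    return all(augment(l, set()) for l in left)
```

A matching that covers all of `left` is exactly the semi-perfect matching of the text; `right` may have unmatched nodes.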
Algorithm 4.2 outlines the refinement procedure. At each iteration (lines 3–20), a bipartite graph ℬ𝑢,𝑣 is constructed for each 𝑢 and its feasible mate 𝑣 (lines 5–9). If ℬ𝑢,𝑣 has no semi-perfect matching, then 𝑣 is removed from Φ(𝑢), thus reducing the search space (line 13).
The algorithm makes two implementation improvements on the refinement procedure discussed in [17]. First, it avoids unnecessary bipartite matchings. A pair ⟨𝑢, 𝑣⟩ is marked if it needs to be checked for a semi-perfect matching (lines 2, 4). If the semi-perfect matching exists, then the pair is unmarked (lines 10–11). Otherwise, the removal of 𝑣 from Φ(𝑢) (line 13) may affect the existence of semi-perfect matchings of the neighboring ⟨𝑢′, 𝑣′⟩ pairs. As a result,
Algorithm 4.2: Refine Search Space
Input: Graph pattern 𝒫, graph 𝐺, search space Φ(𝑢₁) × ⋯ × Φ(𝑢ₖ), level 𝑙
Output: Reduced search space Φ′(𝑢₁) × ⋯ × Φ′(𝑢ₖ)
 1  begin
 2      foreach 𝑢 ∈ 𝒫, 𝑣 ∈ Φ(𝑢) do Mark ⟨𝑢, 𝑣⟩;
 3      for 𝑖 ← 1 to 𝑙 do
 4          foreach 𝑢 ∈ 𝒫, 𝑣 ∈ Φ(𝑢) such that ⟨𝑢, 𝑣⟩ is marked do
 5              // Construct bipartite graph ℬ𝑢,𝑣
 6              𝑁𝒫(𝑢), 𝑁𝐺(𝑣): neighbors of 𝑢, 𝑣;
 7              foreach 𝑢′ ∈ 𝑁𝒫(𝑢), 𝑣′ ∈ 𝑁𝐺(𝑣) do
 8                  ℬ𝑢,𝑣(𝑢′, 𝑣′) ← 1 if 𝑣′ ∈ Φ(𝑢′); 0 otherwise;
 9              end
10              if ℬ𝑢,𝑣 has a semi-perfect matching then
11                  Unmark ⟨𝑢, 𝑣⟩;
12              else
13                  Remove 𝑣 from Φ(𝑢);
14                  foreach 𝑢′ ∈ 𝑁𝒫(𝑢), 𝑣′ ∈ 𝑁𝐺(𝑣) with 𝑣′ ∈ Φ(𝑢′) do
15                      Mark ⟨𝑢′, 𝑣′⟩;
16                  end
17              end
18          end
19          if there is no marked ⟨𝑢, 𝑣⟩ then break;
20      end
21  end
these pairs are marked and checked again (line 14). Second, the ⟨𝑢, 𝑣⟩ pairs are stored and manipulated using a hashtable instead of a matrix. This reduces the space and time complexity from 𝑂(𝑘 ⋅ 𝑛) to 𝑂(∑ᵢ₌₁ᵏ ∣Φ(𝑢ᵢ)∣). The overall time complexity is 𝑂(𝑙 ⋅ ∑ᵢ₌₁ᵏ ∣Φ(𝑢ᵢ)∣ ⋅ (𝑑₁𝑑₂ + 𝑀(𝑑₁, 𝑑₂))), where 𝑙 is the refinement level, 𝑑₁ and 𝑑₂ are the maximum degrees of 𝒫 and 𝐺 respectively, and 𝑀() is the time complexity of maximum bipartite matching (𝑂(𝑛^2.5) for Hopcroft and Karp's algorithm [19]).
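A compact sketch of this refinement loop, with the marked pairs kept in a Python set standing in for the hashtable and the bipartite test inlined as a simple augmenting-path matching (an illustration under these assumptions, not the authors' implementation):

```python
def refine_search_space(p_adj, g_adj, phi, level):
    """Prune phi by the semi-perfect-matching condition, up to `level` rounds.

    p_adj, g_adj: adjacency dicts of pattern P and graph G
    phi: dict u -> set of feasible mates; modified in place and returned
    """
    def semi_perfect(u, v):
        # bipartite graph B_{u,v}: edge (u', v') present iff v' in phi[u']
        match_of = {}
        def augment(l, seen):
            for r in g_adj[v]:
                if r in phi[l] and r not in seen:
                    seen.add(r)
                    if r not in match_of or augment(match_of[r], seen):
                        match_of[r] = l
                        return True
            return False
        return all(augment(l, set()) for l in p_adj[u])

    marked = {(u, v) for u in phi for v in phi[u]}
    for _ in range(level):
        if not marked:
            break
        for u, v in list(marked):
            marked.discard((u, v))
            if v not in phi[u]:
                continue                      # already pruned earlier
            if not semi_perfect(u, v):
                phi[u].discard(v)
                # re-mark neighboring pairs whose matchings may be affected
                marked |= {(u2, v2) for u2 in p_adj[u]
                           for v2 in g_adj[v] if v2 in phi[u2]}
    return phi
```

In the usage below, a triangle pattern A–B–C is refined against a triangle 1–2–3 with a pendant node 4; the infeasible mate 4 is pruned from Φ(B) because its single neighbor cannot cover both pattern neighbors.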
Figure 4.18 shows an execution of Algorithm 4.2 on the example in Figure 4.16. At level 1, 𝐴₂ and 𝐶₁ are removed from Φ(𝐴) and Φ(𝐶), respectively. At level 2, 𝐵₂ is removed from Φ(𝐵) since the bipartite graph ℬ𝐵,𝐵₂ has no semi-perfect matching (note that 𝐴₂ was already removed from Φ(𝐴)).
Figure 4.18. Refinement of the search space (levels 0–2; input search space {A₁, A₂} × {B₁, B₂} × {C₁, C₂}, output search space {A₁} × {B₁} × {C₂}; at level 2 the bipartite graph for ⟨B, B₂⟩ has no semi-perfect matching)

Whereas the neighborhood subgraphs discussed in Section 4.2 prune infeasible mates by using local information, the refinement procedure in Algorithm 4.2 prunes the search space globally. The global pruning has a larger
overhead and is dependent on the output of the local pruning. Therefore, both
pruning methods are indispensable and should be used together.
4.4 Optimization of Search Order
Next, we consider the search order of Algorithm 4.1. The goal here is to find
a good search order for the nodes. Since the search procedure is equivalent to
multiple joins, it is similar to a typical query optimization problem [7]. Two
principal issues need to be considered. One is the cost model for a given search
order. The other is the algorithm for finding a good search order. The cost
model is used as the objective function of the search algorithm. Since the
search algorithm is relatively standard (e.g., dynamic programming, greedy
algorithm), we focus on the cost model and illustrate that it can be customized
in the domain of graphs.
Cost Model. A search order (a.k.a. a query plan) can be represented as a
rooted binary tree whose leaves are nodes of the graph pattern and each internal
node is a join operation. Figure 4.19 shows two examples of search orders.
We estimate the cost of a join (a node in the query plan tree) as the product
of cardinalities of the collections to be joined. The cardinality of a leaf node
is the number of feasible mates. The cardinality of an internal node can be
estimated as the product of cardinalities of collections reduced by a factor 𝛾.
Figure 4.19. Two examples of search orders: (a) the left-deep plan over leaves A, B, C, i.e., (𝐴 ⊳⊲ 𝐵) ⊳⊲ 𝐶; (b) the plan over A, C, B, i.e., (𝐴 ⊳⊲ 𝐶) ⊳⊲ 𝐵
Definition 4.11. (Result size of a join) The result size of join 𝑖 is estimated by
𝑆𝑖𝑧𝑒(𝑖) = 𝑆𝑖𝑧𝑒(𝑖.𝑙𝑒𝑓𝑡) × 𝑆𝑖𝑧𝑒(𝑖.𝑟𝑖𝑔ℎ𝑡) × 𝛾(𝑖)

where 𝑖.𝑙𝑒𝑓𝑡 and 𝑖.𝑟𝑖𝑔ℎ𝑡 are the left and right child nodes of 𝑖 respectively, and 𝛾(𝑖) is the reduction factor.
A simple way to estimate the reduction factor 𝛾(𝑖) is to approximate it by a
constant. A more elaborate way is to consider the probabilities of edges in the
join: Let ℰ(𝑖) be the set of edges involved in join 𝑖, then
𝛾(𝑖) = ∏_{𝑒(𝑢,𝑣) ∈ ℰ(𝑖)} 𝑃(𝑒(𝑢, 𝑣))
where 𝑃 (𝑒(𝑢, 𝑣)) is the probability of edge 𝑒(𝑢, 𝑣) conditioned on 𝑢 and 𝑣.
This probability can be estimated as
𝑃(𝑒(𝑢, 𝑣)) = 𝑓𝑟𝑒𝑞(𝑒(𝑢, 𝑣)) / (𝑓𝑟𝑒𝑞(𝑢) ⋅ 𝑓𝑟𝑒𝑞(𝑣))
where 𝑓𝑟𝑒𝑞() denotes the frequency of the edge or node in the large graph.
Definition 4.12. (Cost of a join) The cost of join 𝑖 is estimated by
𝐶𝑜𝑠𝑡(𝑖) = 𝑆𝑖𝑧𝑒(𝑖.𝑙𝑒𝑓𝑡) × 𝑆𝑖𝑧𝑒(𝑖.𝑟𝑖𝑔ℎ𝑡)
Definition 4.13. (Cost of a search order) The total cost of a search order Γ is
estimated by
𝐶𝑜𝑠𝑡(Γ) = ∑_{𝑖 ∈ Γ} 𝐶𝑜𝑠𝑡(𝑖)
For example, let the input search space be {𝐴₁} × {𝐵₁, 𝐵₂} × {𝐶₂}. If we use a constant reduction factor 𝛾, then 𝐶𝑜𝑠𝑡(𝐴 ⊳⊲ 𝐵) = 1 × 2 = 2, 𝑆𝑖𝑧𝑒(𝐴 ⊳⊲ 𝐵) = 2𝛾, and 𝐶𝑜𝑠𝑡((𝐴 ⊳⊲ 𝐵) ⊳⊲ 𝐶) = 2𝛾 × 1 = 2𝛾. The total cost is 2 + 2𝛾. Similarly, the total cost of (𝐴 ⊳⊲ 𝐶) ⊳⊲ 𝐵 is 1 + 2𝛾. Thus, the search order (𝐴 ⊳⊲ 𝐶) ⊳⊲ 𝐵 is better than (𝐴 ⊳⊲ 𝐵) ⊳⊲ 𝐶.
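The cost computation of Definitions 4.11–4.13 for a left-deep plan can be sketched directly; a constant reduction factor is assumed (γ = 0.5 below, purely for illustration):

```python
def plan_cost(sizes, order, gamma=0.5):
    """Total cost of a left-deep search order under a constant reduction factor.

    sizes: dict leaf -> cardinality of its feasible-mate set
    order: sequence of leaves, joined left to right
    Returns (total_cost, final_size_estimate).
    """
    size = sizes[order[0]]
    total = 0.0
    for leaf in order[1:]:
        total += size * sizes[leaf]         # Cost(i) = Size(left) x Size(right)
        size = size * sizes[leaf] * gamma   # Size(i) = Size(left) x Size(right) x gamma
    return total, size
```

With cardinalities {A: 1, B: 2, C: 1} and γ = 0.5, the order (A, B, C) costs 2 + 2γ = 3.0 and (A, C, B) costs 1 + 2γ = 2.0, reproducing the example above.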
Search Order. The number of all possible search orders is exponential in the number of nodes, so it is expensive to enumerate all of them. As in many query optimization techniques, we consider only left-deep query plans, i.e., the outer node of each join is always a leaf node. Traditional dynamic programming would take 𝑂(2ᵏ) time for a graph pattern of size 𝑘, which is not scalable to large graph patterns. Therefore, we adopt a simple greedy approach in our implementation: at join 𝑖, choose a leaf node that minimizes the estimated cost of the join.
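The greedy choice can be sketched as follows, again under the constant-γ cost model of Definitions 4.11–4.13. A fuller implementation would also weigh per-join reduction factors and pattern connectivity; this sketch picks purely by estimated join cost:

```python
def greedy_order(sizes, gamma=0.5):
    """Left-deep search order chosen greedily: at each join pick the leaf
    that minimizes the estimated cost of that join.

    sizes: dict leaf -> cardinality of its feasible-mate set
    """
    remaining = dict(sizes)
    first = min(remaining, key=remaining.get)   # start from the smallest set
    order = [first]
    size = remaining.pop(first)
    while remaining:
        # Cost of joining leaf u next is size * |phi(u)|; pick the minimum
        nxt = min(remaining, key=lambda u: size * remaining[u])
        order.append(nxt)
        size = size * remaining.pop(nxt) * gamma
    return order
```

On the running example it recovers the better order (A ⊳⊲ C) ⊳⊲ B.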
5. Experimental Study
In this section, we evaluate the performance of the presented graph pattern
matching algorithms on large real and synthetic graphs. The graph specific
optimizations are compared with an SQL-based implementation as described
in Figure 4.2. MySQL server 5.0.45 is used and configured with storage engine = MyISAM (non-transactional) and key_buffer_size = 256M; other parameters are left at their defaults. For each large graph, two tables V(vid, label) and E(vid1, vid2) are created as in Figure 4.2. B-tree indices are built for each field of the tables.
The presented graph pattern matching algorithms were written in Java and
compiled with Sun JDK 1.6. All the experiments were run on an AMD Athlon
64 X2 4200+ 2.2GHz machine with 2GB memory running MS Win XP Pro.

5.1 Biological Network
The real dataset is a yeast protein interaction network [2]. This graph consists
of 3112 nodes and 12519 edges. Each node represents a unique protein and
each edge represents an interaction between proteins.
To allow for meaningful queries, we add Gene Ontology (GO) [14] terms
to the proteins. The Gene Ontology is a hierarchy of categories that describes
cellular components, biological processes, and molecular functions of genes
and their products (proteins). Each GO term is a node in the hierarchy and
has one or more parent GO terms. Each protein has one or more GO terms.
We use high level GO terms as labels of the proteins (183 distinct labels in
total). We index the node labels using a hashtable, and store the neighborhood
subgraphs and profiles with radius 1 as well.
Clique Queries. The clique queries are generated with sizes (number
of nodes) between 2 and 7 (sizes greater than 7 have no answers). For each
size, a complete graph is generated with each node assigned a random label.
The random label is selected from the top 40 most frequent labels. A total of
1000 clique queries are generated and the results are averaged. The queries
are divided into two groups according to the number of answers returned: low
hits (less than 100 answers) and high hits (more than 100 answers). Queries
having no answers are not counted in the statistics. Queries having too many
hits (more than 1000) are terminated immediately and counted in the group of
high hits.
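The query generation just described can be sketched as follows; the label universe passed in (here, the top-40 most frequent labels) is an input of the sketch:

```python
import random

def make_clique_query(size, top_labels, rng=random):
    """A complete graph on `size` nodes, each node labelled at random
    from the given pool of most-frequent labels."""
    labels = {i: rng.choice(top_labels) for i in range(size)}
    edges = {(i, j) for i in range(size) for j in range(i + 1, size)}
    return labels, edges
```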
To evaluate the pruning power of the local pruning (Section 4.2) and the
global pruning (Section 4.3), we define the reduction ratio of search space as
𝛾(Φ, Φ₀) = (∣Φ(𝑢₁)∣ × ⋯ × ∣Φ(𝑢ₖ)∣) / (∣Φ₀(𝑢₁)∣ × ⋯ × ∣Φ₀(𝑢ₖ)∣)

where Φ₀ refers to the baseline search space.
Figure 4.20. Search space for clique queries: (a) low hits, (b) high hits. Each panel plots the reduction ratio against clique size for “Retrieve by profiles”, “Retrieve by subgraphs”, and “Refined search space”.
Figure 4.21. Running time for clique queries (low hits): (a) individual steps (retrieve by profiles, retrieve by subgraphs, refine search space, search with and without the optimized order); (b) total query processing time for the Optimized, Baseline, and SQL-based approaches.
Figure 4.20 shows the reduction ratios of search space by different methods.
“Retrieve by profiles” finds feasible mates by checking profiles and “Retrieve
by subgraphs” finds feasible mates by checking neighborhood subgraphs (Sec-
tion 4.2). “Refined search space” refers to the global pruning discussed in Sec-
tion 4.3 where the input search space is generated by “Retrieve by profiles”.
The maximum refinement level 𝑙 is set to the size of the query. As can be seen from the figure, the refinement procedure always reduces the search space retrieved by profiles. Retrieval by subgraphs results in the smallest search space. This is because the radius-1 neighborhood subgraph of each node of a clique query is the entire clique.
Figure 4.21(a) shows the average processing time for individual steps under
varying clique sizes. The individual steps include retrieval by profiles, retrieval
by subgraphs, refinement, search with the optimized order (Section 4.4), and
search without the optimized order. The time for finding the optimized order is
negligible since we take a greedy approach in our implementation. As shown
in the figure, retrieval by subgraphs has a large overhead although it produces
a smaller search space than retrieval by profiles. Another observation is that
the optimized order improves upon the search time.
Figure 4.21(b) shows the average total query processing time in comparison
to the SQL-based approach on low hits queries. The “Optimized” processing
consists of retrieval by profiles, refinement, optimization of search order, and
search with the optimized order. The “Baseline” processing consists of retrieval by node attributes and search without the optimized order on the baseline space. The query processing time in the “Optimized” case is improved
greatly due to the reduced search space.
The SQL-based approach takes much longer and does not scale to large
clique queries. This is due to the unpruned search space and the large number
of joins involved. Whereas our graph pattern matching algorithm (Section 4.1)
is exponential in the number of nodes, the SQL-based approach is exponential
in the number of edges. For instance, a clique of size 5 has 10 edges. This
requires 20 joins between nodes and edges (as illustrated in Figure 4.2).
5.2 Synthetic Graphs
The synthetic graphs are generated using a simple Erdős–Rényi [13] random graph model: generate 𝑛 nodes, and then generate 𝑚 edges by randomly
choosing two end nodes. Each node is assigned a label (100 distinct labels in
total). The distribution of the labels follows Zipf's law, i.e., the probability 𝑝(𝑥) of the 𝑥-th label is proportional to 𝑥⁻¹. The queries are generated by randomly extracting a connected subgraph from the synthetic graph.
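The generator can be sketched as below (a minimal sketch; Zipf sampling is done via weighted choice over the 1/x weights, and duplicate edge draws collapse, so slightly fewer than 𝑚 edges may result):

```python
import random

def synthetic_graph(n, m, num_labels=100, rng=random):
    """Erdos-Renyi-style random graph with Zipf-distributed node labels.

    n nodes; m edges drawn by randomly choosing two end points.
    Label x (1-based) is drawn with probability proportional to 1/x.
    """
    weights = [1.0 / x for x in range(1, num_labels + 1)]
    labels = {v: rng.choices(range(1, num_labels + 1), weights=weights)[0]
              for v in range(n)}
    # undirected edges stored with sorted end points; duplicates collapse
    edges = {tuple(sorted(rng.sample(range(n), 2))) for _ in range(m)}
    return labels, edges
```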
We first fix the synthetic graph size at 𝑛 = 10𝐾 with 𝑚 = 5𝑛, and vary the
query size between 4 and 20. Figure 4.22 shows the search space and pro-
cessing time for individual steps. Unlike clique queries, the global pruning
produces the smallest search space, which outperforms the local pruning by
full neighborhood subgraphs.

Figure 4.22. Search space and running time for individual steps (synthetic graphs, low hits): (a) search space (reduction ratio vs. query size for retrieval by profiles, retrieval by subgraphs, and the refined search space); (b) time for individual steps.
Figure 4.23. Running time (synthetic graphs, low hits): (a) varying query sizes (graph size: 10K); (b) varying graph sizes (query size: 4), comparing the Optimized, Baseline, and SQL-based approaches.
Figure 4.23 shows the total time with varying query sizes and graph sizes.
As can be seen, the SQL-based approach is not scalable to large queries, though it scales to large graphs with small queries. In either case, the “Optimized” processing produces the smallest running time.
To summarize the experimental results: retrieval by profiles has much less overhead than retrieval by subgraphs. The refinement step (Section 4.3)
greatly reduces the search space. The overhead of the search step is well com-
pensated by the extensive reduction of search space. A practical combination
would be retrieval by profiles, followed by refinement, and then search with
an optimized order. This combination scales well with various query sizes and
graph sizes. SQL-based processing is not scalable to large queries. Overall, the
optimized processing performs orders of magnitude better than the SQL-based
approach. While small improvements in SQL-based implementations can be
