Keyword Search in Databases- P17 pps

3.5. SUBGRAPH-BASED KEYWORD SEARCH 79
Algorithm 29 GetCommunity(G_D, C, R_max)
Input: a data graph G_D, a core C = [c_1, ···, c_l], and a radius threshold R_max.
Output: a community uniquely determined by C.
1: Find the set of cnodes, V_c, by running |C| copies of Dijkstra's single-source shortest path algorithm
2: Run a single copy of Dijkstra's algorithm to find the shortest distance to the nearest knode for
   each node v ∈ V(G_D), i.e., dist_k(v) = min_{c ∈ C} dist(v, c)
3: Run a single copy of Dijkstra's algorithm to find the shortest distance from the nearest cnode
   for each node v ∈ V(G_D), i.e., dist_c(v) = min_{v_c ∈ V_c} dist(v_c, v)
4: V ← {u ∈ V(G_D) | dist_c(u) + dist_k(u) ≤ R_max}
5: Construct the subgraph R of G_D induced by V and return it
set path nodes (pnode) that include all the nodes that appear on any path from a cnode v_c ∈ V_c to a
knode v_l ∈ V_l with dist(v_c, v_l) ≤ R_max. E(R) is the set of edges induced by V(R).
A community, R, is uniquely determined by the set of knodes, V_l, which is called the core
of the community and denoted as core(R). The weight of a community R, w(R), is defined as the
minimum value among the total edge weights from a cnode to every knode; more precisely,
w(R) = min_{v_c ∈ V_c} Σ_{v_l ∈ V_l} dist(v_c, v_l). (3.8)
For simplicity, we use C to represent a core as a list of l nodes, C = [c_1, c_2, ···, c_l], and we may
use C[i] to denote c_i ∈ C, where c_i contains the keyword term k_i. By the definition of a
community, once the core C is provided, the community is uniquely determined, and it can be found
by Algorithm 29, which is self-explanatory.
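The five steps of Algorithm 29 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the data graph is a directed, weighted adjacency dictionary of the form `{node: [(neighbor, weight), ...]}`. The helper names `dijkstra`, `reverse`, and `get_community` are hypothetical. The weight of Equation (3.8) is computed alongside the community for convenience.

```python
import heapq

INF = float("inf")

def dijkstra(adj, sources, limit=INF):
    """Multi-source Dijkstra; returns {node: distance} for nodes within `limit`."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd <= limit and nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def reverse(adj):
    """Reverse every edge, so distances *to* a node can be computed."""
    rev = {}
    for u, edges in adj.items():
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    return rev

def get_community(adj, core, r_max):
    rev = reverse(adj)
    # Step 1: one Dijkstra per knode; a cnode reaches every knode within r_max.
    per_knode = [dijkstra(rev, [c], r_max) for c in core]
    cnodes = set(per_knode[0])
    for d in per_knode[1:]:
        cnodes &= set(d)
    if not cnodes:
        return set(), [], None
    # Step 2: distance from each node to its nearest knode (one multi-source run).
    dist_k = dijkstra(rev, core, r_max)
    # Step 3: distance to each node from its nearest cnode (one multi-source run).
    dist_c = dijkstra(adj, cnodes, r_max)
    # Step 4: keep nodes lying on a short enough cnode-to-knode path.
    nodes = {u for u in dist_c if u in dist_k and dist_c[u] + dist_k[u] <= r_max}
    # Step 5: induced subgraph.
    edges = [(u, v, w) for u in nodes for v, w in adj.get(u, ()) if v in nodes]
    # Equation (3.8): minimum over cnodes of the summed distances to all knodes.
    weight = min(sum(d[vc] for d in per_knode) for vc in cnodes)
    return nodes, edges, weight

# Toy graph (hypothetical): "a" reaches both knodes "x" and "y"; "b" reaches only "x".
toy = {"a": [("x", 1), ("y", 1)], "b": [("x", 1)], "z": [("a", 5)]}
nodes, edges, weight = get_community(toy, ["x", "y"], r_max=2)
```

In the toy graph, "a" is the only cnode, so the community consists of "a", "x", and "y"; "b" is dropped because it cannot reach "y", and "z" is dropped because it is too far from any knode.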
Qin et al. [2009b] enumerate all (or the top-k) communities with polynomial delay by adopting
Lawler's procedure [Lawler, 1972]. The general idea is the same as EnumTreePD (Algorithm 19).
But it is much easier here, because EnumTreePD enumerates trees, which have structure,
while in this case only the cores are enumerated, where each core is just a set of l keyword nodes.
In this problem, the answer space is S_1 × S_2 × ··· × S_l, where each S_i is the set of nodes in G_D
that contain keyword k_i. A subspace is described by V_1 × V_2 × ··· × V_l, where V_i ⊆ S_i, and it can
also be compactly described by a set of inclusion constraints and exclusion constraints. Based on
Lawler's procedure, it is straightforward to obtain an algorithm that enumerates the communities in
increasing cost order with delay O(l · c(l)), where c(l) is the time complexity of computing the best
community.
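To make the inclusion and exclusion constraints concrete, the following sketch shows one common way to split a subspace V_1 × ··· × V_l after its best core has been removed, in the spirit of Lawler's procedure. The function name `split_subspace` and the toy node names are hypothetical. The i-th new subspace fixes the first i−1 positions to the core's nodes (inclusion), removes the core's node from position i (exclusion), and leaves the remaining positions unchanged.

```python
from itertools import product

def split_subspace(subspace, core):
    """Split V_1 x ... x V_l minus {core} into at most l pairwise disjoint subspaces.

    `subspace` is a list of sets, and `core` is a tuple with core[i] in subspace[i].
    """
    parts = []
    for i in range(len(core)):
        part = [{core[j]} for j in range(i)]            # inclusion constraints
        part.append(subspace[i] - {core[i]})            # exclusion constraint
        part.extend(set(v) for v in subspace[i + 1:])   # unconstrained positions
        if all(part):                                   # drop empty subspaces
            parts.append(part)
    return parts

# Toy answer space with l = 3 keywords (node names are made up).
space = [{"a1", "a2"}, {"b1", "b2"}, {"c1"}]
best = ("a1", "b1", "c1")
parts = split_subspace(space, best)

# Every remaining core falls into exactly one of the new subspaces.
remaining = [c for part in parts for c in product(*part)]
```

Each of the at most l subspaces then needs one best-core computation, which is where the O(l · c(l)) delay comes from.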
Two algorithms are proposed for enumerating communities with delay O(c(l)): one enumerates
all communities in arbitrary order with polynomial delay, and the other enumerates the top-k
communities in increasing weight order with polynomial delay. In the following, we discuss the
second algorithm.
80 3. GRAPH-BASED KEYWORD SEARCH
Algorithm 30 COMM-K(G_D, Q, R_max)
Input: a data graph G_D, a keyword set Q = {k_1, ···, k_l}, and a radius threshold R_max.
Output: Enumerate the top-K communities in increasing weight order.
1: Find the sets of knodes {S_1, ···, S_l} and their corresponding neighborhood nodes {N_1, ···, N_l}
2: Find the best core (with lowest weight) and the corresponding weight from {N_1, ···, N_l},
   denoted (C, weight)
3: Initialize H ← ∅; H.insert(C, weight, 1, ∅)
4: while H ≠ ∅ and fewer than K communities have been output do
5:   g ← H.pop(); {g = (C, weight, pos, prev)}
6:   R' ← GetCommunity(G_D, g.C, R_max), and output R'
7:   ∀i ∈ [1, l]: update N_i to be the neighborhood nodes of g.C[i], V_i ← S_i
8:   update {V_1, ···, V_l} by following the links g.prev recursively
9:   for i = l downto g.pos do
10:    V_i ← V_i − {g.C[i]}, update N_i to be the neighborhood nodes of V_i
11:    Find the best core from the current {N_1, ···, N_l}, denoted (C', weight')
12:    H.insert(C', weight', i, g) if C' exists
13:    V_i ← V_i ∪ {g.C[i]}, update N_i to be the neighborhood nodes of V_i
Algorithm 30 shows the high-level pseudocode. H is a priority heap that stores the intermediate
and potential cores with additional information. The general idea is to consider the entire
set of potential cores as an l-dimensional space S_1 × S_2 × ··· × S_l and, at each step, to divide a
subspace into smaller subspaces and find a best core in each newly generated subspace. At any
intermediate step, the subspaces are pairwise disjoint, and their union is guaranteed to cover the
whole space. Each time a core with the lowest weight is removed from H, it is guaranteed to be the
next community in order (line 5). The best core of a subspace V_1 × V_2 × ··· × V_l, where V_i ⊆ S_i, is
found as follows (lines 2, 11). First, a neighborhood node set N_i is found for each set V_i, which
consists of all the nodes with a shortest distance no greater than R_max to at least one of the nodes
in V_i. This can be done by running a shortest path algorithm. Second, a linear scan of the nodes
finds the best core together with its best center and weight. When the next best core g.C is found,
the subspace from which g.C is found is partitioned into several subspaces (lines 9-13); the best
core from each newly generated subspace is found (line 11) and inserted into H (line 12). Each entry
in H consists of four fields, (C, weight, pos, prev), where C is the core and weight is the
corresponding weight; pos and prev are used to reconstruct efficiently the subspace from which C is
computed, without storing the description of the subspace explicitly.
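The two-step search for the best core of a subspace (neighborhood computation, then a linear scan) can be sketched as below. This is an illustrative sketch under stated assumptions, not the implementation of Qin et al. It assumes the graph is supplied with its edges already reversed, as `{node: [(predecessor, weight), ...]}`, so that a multi-source Dijkstra from V_i yields each node's distance to its nearest node in V_i. The names `nearest_in_set` and `find_best_core` are hypothetical.

```python
import heapq

INF = float("inf")

def nearest_in_set(rev_adj, sources, r_max):
    """N_i: for each node within r_max of some source, (distance, nearest source)."""
    best = {s: (0.0, s) for s in sources}
    heap = [(0.0, s, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u, src = heapq.heappop(heap)
        if d > best[u][0]:
            continue  # stale heap entry
        for v, w in rev_adj.get(u, ()):
            nd = d + w
            if nd <= r_max and nd < best.get(v, (INF, None))[0]:
                best[v] = (nd, src)
                heapq.heappush(heap, (nd, v, src))
    return best

def find_best_core(rev_adj, subspace, r_max):
    """Return (weight, core, center) for the lowest-weight core, or None."""
    neigh = [nearest_in_set(rev_adj, v_i, r_max) for v_i in subspace]
    best = None
    # Linear scan: a center must lie in every neighborhood N_1, ..., N_l.
    for v in neigh[0]:
        if all(v in n for n in neigh):
            weight = sum(n[v][0] for n in neigh)
            if best is None or weight < best[0]:
                best = (weight, tuple(n[v][1] for n in neigh), v)
    return best

# Toy reversed graph (hypothetical): forward edges are
# a->x1 (1), a->y1 (1), b->x2 (1), b->y1 (3).
rev_toy = {"x1": [("a", 1)], "x2": [("b", 1)], "y1": [("a", 1), ("b", 3)]}
first = find_best_core(rev_toy, [{"x1", "x2"}, {"y1"}], r_max=3)
second = find_best_core(rev_toy, [{"x2"}, {"y1"}], r_max=3)
```

Here `first` is the core ("x1", "y1") with center "a", and `second` shows the best core after "x1" has been excluded from V_1, as line 10 of Algorithm 30 would do. The scan is valid because, for a fixed center, the cheapest core simply picks the nearest node of each V_i independently.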

Algorithm 30 enumerates the top-k communities in increasing weight order, with time complexity
O(l(n log n + m)) and space O(l^2 · k + l · n + m) [Qin et al., 2009b]. Note that finding the
best core in a subspace (under inclusion constraints and exclusion constraints) also takes time c(l) =
O(l(n log n + m)). According to the discussion of EnumTreePD, it is easy to get an enumeration
algorithm with delay l · c(l). However, information can be shared during consecutive executions of
Line 11 of EnumTreePD, so Algorithm 30 can enumerate communities with delay c(l).

CHAPTER 4
Keyword Search in XML Databases
In this chapter, we focus on keyword search in XML databases, where an XML database is treated as
a large data tree. We introduce various semantics to answer a keyword query on an XML tree, and we
discuss efficient algorithms to find the answers under such semantics. A main difference between
this chapter and the previous chapters is that the underlying data structure is a large tree instead of
a large graph.
In Section 4.1, we introduce several important concepts and definitions, such as Lowest Common
Ancestor (LCA), Smallest LCA (SLCA), Exclusive LCA (ELCA), and Compact LCA (CLCA). Their
properties and the relationships among LCA, SLCA, and ELCA will be discussed. In Section 4.2, we
discuss the algorithms that find answers based on SLCA. In Section 4.3, we discuss the algorithms
that focus on identifying meaningful return information. We discuss algorithms to find answers
based on ELCA in Section 4.4. In Section 4.5, we briefly give several approaches based on meaningful
LCA, interconnection, and relevance-oriented ranking.
4.1 XML AND PROBLEM DEFINITION
XML is modeled as a rooted and labeled tree, such as the one shown in Figure 4.1. Each internal
node v in the tree corresponds to an XML element, called an element node, and is labeled with a
tag/label tag(v). Each leaf node of the tree corresponds to a data value, called a value node. For
example, in Figure 4.1, "Dean" and "Title" are element nodes, and "John" and "Ben" are value nodes.
In this model, the attribute nodes are modeled as children of the associated element node, and we do
not distinguish them from element nodes.
Each node (element node or value node) in the XML tree is assigned a unique Dewey ID.
The Dewey IDs of nodes are assigned in the following way: the relative position of each node among
its siblings is recorded, and the concatenation of these relative positions, separated by dots ('.') and
starting from the root, composes the Dewey ID of the node. For example, the node with Dewey ID
0.1.2.1 (Students) is the second child of its parent node 0.1.2 (Class). We denote the Dewey ID of a
node v as pre(v), as it is compatible with the preorder numbering, i.e., a node v_1 precedes another
node v_2 in the preorder left-to-right depth-first traversal of the tree if and only if
pre(v_1) < pre(v_2). The < relationship between two Dewey IDs is the same as the comparison between
two sequences. Besides preserving order information, the Dewey ID can also be used to detect sibling
and ancestor-descendant relationships between nodes.
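The properties of Dewey IDs described above can be illustrated with a short sketch. The helper names (`parse`, `precedes`, `is_ancestor`, `are_siblings`, `lca`) are hypothetical, and IDs are assumed to be dot-separated strings of non-negative integers. Components are compared as integers rather than as text, so that 0.2 precedes 0.10.

```python
def parse(dewey):
    """Turn a Dewey ID such as "0.1.2.1" into a tuple of integers."""
    return tuple(int(part) for part in dewey.split("."))

def precedes(a, b):
    """True if node a comes before node b in preorder (sequence comparison)."""
    return parse(a) < parse(b)

def is_ancestor(a, b):
    """True if a is a proper ancestor of b, i.e., a's ID is a proper prefix of b's."""
    pa, pb = parse(a), parse(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa

def are_siblings(a, b):
    """True if a and b are distinct nodes with the same parent."""
    pa, pb = parse(a), parse(b)
    return pa != pb and len(pa) == len(pb) and pa[:-1] == pb[:-1]

def lca(a, b):
    """Dewey ID of the lowest common ancestor: the longest common prefix."""
    common = []
    for x, y in zip(parse(a), parse(b)):
        if x != y:
            break
        common.append(str(x))
    return ".".join(common)
```

For instance, `is_ancestor("0.1.2", "0.1.2.1")` holds for the Class and Students nodes from the example, and `lca("0.1.2.1", "0.1.3")` evaluates to `"0.1"`. Python compares tuples lexicographically and treats a proper prefix as smaller, which matches the rule that an ancestor precedes its descendants in preorder.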
