
Keyword Search in Databases


4.2. SLCA-BASED SEMANTICS 89
4.2.1 PROPERTIES OF LCA AND SLCA
Property 4.9 Given a set $S$ and two nodes $v_i$ and $v_j$ with $v_i < v_j$, then $\mathit{closest}(v_i, S) \le \mathit{closest}(v_j, S)$.
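Proof. We prove it by contradiction. Assume that $\mathit{closest}(v_i, S) > \mathit{closest}(v_j, S)$. Then $\mathit{closest}(v_i, S) = \mathit{rm}(v_i, S)$, $\mathit{closest}(v_j, S) = \mathit{lm}(v_j, S)$, and $\mathit{rm}(v_i, S) > \mathit{lm}(v_j, S)$. Recall that $\mathit{closest}(v, S)$ is chosen from $\mathit{lm}(v, S)$ and $\mathit{rm}(v, S)$, and that $\mathit{lm}(v_i, S) \le \mathit{lm}(v_j, S)$ and $\mathit{rm}(v_i, S) \le \mathit{rm}(v_j, S)$ if they all exist. If $\mathit{lm}(v_j, S) < \mathit{rm}(v_i, S)$, then $\mathit{lm}(v_j, S) \le \mathit{lm}(v_i, S)$, and therefore $\mathit{lm}(v_i, S) = \mathit{lm}(v_j, S)$ because $\mathit{lm}(v_i, S) \le \mathit{lm}(v_j, S)$. Similarly, $\mathit{rm}(v_i, S) = \mathit{rm}(v_j, S)$. We also have $\mathit{lm}(v_i, S) \ne \mathit{rm}(v_i, S)$, since otherwise $\mathit{closest}(v_i, S) = \mathit{lm}(v_i, S)$.
Let $lm$ denote $\mathit{lm}(v_i, S)$ and $rm$ denote $\mathit{rm}(v_i, S)$. It holds that $lm < v_i < v_j < rm$. According to Property 4.2, $\mathit{lca}(lm, v_j) \preceq \mathit{lca}(lm, v_i)$ and $\mathit{lca}(rm, v_i) \preceq \mathit{lca}(rm, v_j)$. According to the definition of closest, $\mathit{lca}(lm, v_i) \prec \mathit{lca}(rm, v_i)$ and $\mathit{lca}(rm, v_j) \preceq \mathit{lca}(lm, v_j)$. Chaining these four relations gives $\mathit{lca}(lm, v_j) \preceq \mathit{lca}(lm, v_i) \prec \mathit{lca}(rm, v_i) \preceq \mathit{lca}(rm, v_j) \preceq \mathit{lca}(lm, v_j)$, which is a contradiction.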
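The monotonicity stated in Property 4.9 can be checked on a small example. The sketch below is only illustrative: it assumes Dewey IDs are represented as Python tuples (so tuple order is document order), and it assumes the Section 4.1.2 definitions, which are not reproduced in this excerpt, amount to "lm is the largest node of S not after v, rm is the smallest node of S not before v, and closest is whichever of the two has the deeper LCA with v". The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def lca(u, v):
    """LCA of two Dewey IDs = their longest common prefix."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def lm(v, S):
    """Left match (assumed definition): largest node of sorted S that is <= v."""
    i = bisect_right(S, v)
    return S[i - 1] if i > 0 else None

def rm(v, S):
    """Right match (assumed definition): smallest node of sorted S that is >= v."""
    i = bisect_left(S, v)
    return S[i] if i < len(S) else None

def closest(v, S):
    """Pick lm or rm, whichever has the deeper LCA with v (ties go to lm)."""
    left, right = lm(v, S), rm(v, S)
    if left is None or right is None:
        return right if left is None else left
    return right if len(lca(v, right)) > len(lca(v, left)) else left

S = [(0, 0, 1), (0, 1, 0), (0, 2)]                      # sorted Dewey IDs
queries = [(0, 0, 0), (0, 0, 2), (0, 1), (0, 1, 5), (0, 3)]  # increasing order
matches = [closest(v, S) for v in queries]
```

For increasing query nodes, `matches` is itself non-decreasing, which is what Property 4.9 asserts.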

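Property 4.10 Let $V$ and $U$ be lists of nodes, i.e., $V = \{v_1, \cdots, v_l\}$ and $U = \{u_1, \cdots, u_l\}$, such that $V \le U$, i.e., $v_i \le u_i$ for $1 \le i \le l$. Let $\mathit{lca}(V)$ and $\mathit{lca}(U)$ be the LCA of the nodes in $V$ and $U$, respectively. Then,
1. if $\mathit{lca}(V) \ge \mathit{lca}(U)$, then $\mathit{lca}(U) \preceq \mathit{lca}(V)$,
2. if $\mathit{lca}(V) < \mathit{lca}(U)$, then
• either $\mathit{lca}(V) \prec \mathit{lca}(U)$,
• or $\mathit{lca}(V) \nprec \mathit{lca}(U)$, in which case for any $W$ with $U \le W$, $\mathit{lca}(V) \nprec \mathit{lca}(W)$.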
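Proof. This is an extension of Property 4.3 to more than two nodes. The proof is by induction. When $V$ and $U$ contain only two nodes, the claim is Property 4.3. Assume that it is true for $V$, $U$, and $W$; we prove that it is true for $V'$, $U'$, and $W'$, where $V' = V \cup \{v_l\}$ and $U' = U \cup \{u_l\}$, with $v_l \le u_l$.
One important property of lca is that $\mathit{lca}(V') = \mathit{lca}(\mathit{lca}(V), v_l)$. If $\mathit{lca}(U) \preceq \mathit{lca}(V)$, then either $\mathit{lca}(U') \preceq \mathit{lca}(V')$ or $\mathit{lca}(V') \prec \mathit{lca}(U')$. Otherwise, $\mathit{lca}(V) < \mathit{lca}(U)$; according to Property 4.3, there are three cases for $\mathit{lca}(V')$ and $\mathit{lca}(U')$, and we only need to prove the last one, i.e., the case that $\mathit{lca}(V') < \mathit{lca}(U')$ and $\mathit{lca}(V') \nprec \mathit{lca}(U')$. Then for any $W' = W \cup \{w_l\}$, if $\mathit{lca}(U) \le \mathit{lca}(W)$, then we are done; otherwise $\mathit{lca}(W) \prec \mathit{lca}(U)$, and then $\mathit{lca}(V') \nprec \mathit{lca}(W')$, because $\mathit{lca}(W') \preceq \mathit{lca}(W)$. ✷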
90 4. KEYWORD SEARCH IN XML DATABASES
[Figure 4.3: Stack Data Structure. Each stack entry has the columns id, $k_1$, $k_2$, $\cdots$, $k_l$; from top to bottom the entries have ids $id_m$, $\cdots$, $id_2$, $id_1$.]
4.2.2 EFFICIENT ALGORITHMS FOR SLCAS
In this section, we consider three algorithms, namely StackAlgorithm, IndexedLookupEager, and ScanEager [Xu and Papakonstantinou, 2005], that find all the $\mathit{slca}(S_1, \cdots, S_l)$ efficiently. Each algorithm has different characteristics and is efficient in different situations. MultiwaySLCA further improves the performance of IndexedLookupEager by proposing some heuristics, but it has the same worst-case time complexity as IndexedLookupEager. Note that these algorithms only return the SLCAs; they do not keep the match nodes for the SLCAs. Finding the match nodes for all the SLCAs can be done efficiently by one scan of the SLCAs and one scan of $S_1, \cdots, S_l$, provided that the SLCAs are in increasing Dewey ID order.
Stack Algorithm: This is an adaptation of the stack-based sort-merge algorithm [Guo et al., 2003] to compute all the SLCAs. It uses a stack in which each entry has a pair of components (id, keyword), as shown in Figure 4.3. Assume the id components from the bottom entry to a stack entry en are $id_1, \cdots, id_m$, respectively; then the stack entry en denotes the node with the Dewey ID $id_1.id_2.\cdots.id_m$. keyword is an array of $l$ Boolean values, where keyword[i] = true means that the subtree rooted at the node denoted by the entry contains keyword $k_i$ directly or indirectly.
The general idea of StackAlgorithm is to use a stack to simulate the postorder traversal of a virtual XML tree formed by the union of the paths from the root to each node in $S_1, \cdots, S_l$, while the nodes are read in a preorder fashion. When an entry en is popped, all the descendant-or-self nodes of en in $S_1, \cdots, S_l$ have been visited, so it is known whether or not each keyword appears in the subtree rooted at en. StackAlgorithm merges all keyword lists and computes the longest common prefix of the node with the smallest Dewey ID from the input lists and the node denoted by the top entry of the stack. Then it pops the top entries until the longest common prefix is reached. If the keyword component of a popped entry en contains all the keywords, then the node denoted by en is a SLCA node. By the definition of SLCA, none of the ancestors of a SLCA node can be a SLCA, so this information is recorded. Otherwise, the keyword containment information of en is used to update its parent entry's keyword array. Also, a stack entry is created for each Dewey component of the currently visited node that is not part of the common prefix, where each new entry corresponds to one node on the path from the longest common prefix to the current node. Essentially, the node represented by the top entry of the stack is the node visited in the preorder traversal.

Algorithm 31 StackAlgorithm($S_1, \cdots, S_l$)
Input: $l$ lists of Dewey IDs; $S_i$ is the list of Dewey IDs of the nodes containing keyword $k_i$.
Output: All the SLCAs
1: stack ← ∅
2: while has not reached the end of all Dewey lists do
3:   v ← getSmallestNode()
4:   p ← lca(stack, v)
5:   while stack.size > p do
6:     en ← stack.pop()
7:     if en.keyword[i] = true, ∀i (1 ≤ i ≤ l) then
8:       output en as a SLCA
9:       mark all the entries in stack so that they can never be SLCA nodes
10:     else
11:       ∀i (1 ≤ i ≤ l): stack.top().keyword[i] ← true, if en.keyword[i] = true
12:   ∀i (p < i ≤ v.length): stack.push(v[i], [])
13:   stack.top().keyword[i] ← true, where v ∈ $S_i$
14: check entries of the stack and return any SLCA node if exists
StackAlgorithm is shown in Algorithm 31. It first initializes the stack stack to be empty (line 1). As long as there are Dewey lists that have not been fully read (line 2), it reads the next node with the smallest Dewey ID (line 3) and performs the necessary operations. Essentially, reading nodes in this order is equivalent to a preorder traversal of the original XML tree that ignores irrelevant nodes. Let stack[i] denote the node represented by the i-th entry of stack starting from the bottom, and v[i] denote the i-th component of the Dewey ID of v. After getting v, it computes the LCA of v and the node represented by the top of stack (line 4), which is stack[p]. This means that all the keyword nodes that are descendants of stack[p + 1], if they exist, have been read, and the keyword containment information has been stored in the corresponding stack entries. Then all the nodes represented by stack[i] (p < i ≤ stack.size) are popped (lines 5-11). For each popped entry en (line 6), it first checks whether en is a SLCA node (line 7); if so, en is output (line 8) and the information is recorded that none of its ancestors can be SLCAs (line 9). Otherwise, the keyword containment information of its parent node is updated (line 11). After popping all the non-ancestor nodes from stack, v and its ancestors are pushed onto stack (line 12), and the keyword containment information is stored (line 13). At this moment, the node represented by the top entry of stack is v, the whole stack represents all the nodes on the path from the root to v, and the keyword containment information is stored compactly. After all the Dewey lists have been read, all the entries need to be popped from stack, and a check is performed to see if there exists any SLCA node (line 14).
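The walkthrough above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: Dewey IDs are assumed to be non-empty tuples of integers, each input list is assumed to be sorted, the "mark" of line 9 is modeled as a hypothetical `blocked` flag per stack entry, and `heapq.merge` stands in for getSmallestNode.

```python
import heapq

def stack_slca(lists):
    """Return the SLCAs of l sorted lists of Dewey IDs (tuples of ints)."""
    l = len(lists)
    # Tag every node with its keyword index and merge in Dewey (preorder) order.
    merged = heapq.merge(*[[(v, i) for v in S] for i, S in enumerate(lists)])
    stack = []    # each entry: [id component, keyword flags, blocked]
    result = []

    def pop_entry():
        comp, flags, blocked = stack.pop()
        if blocked:                        # a descendant is already a SLCA
            return
        if all(flags):                     # lines 7-9 of the pseudocode
            result.append(tuple(e[0] for e in stack) + (comp,))
            for e in stack:                # ancestors can never be SLCAs
                e[2] = True
        elif stack:                        # line 11: update the parent entry
            parent_flags = stack[-1][1]
            for i, seen in enumerate(flags):
                parent_flags[i] = parent_flags[i] or seen

    for v, i in merged:
        p = 0                              # length of the common prefix (line 4)
        while p < min(len(stack), len(v)) and stack[p][0] == v[p]:
            p += 1
        while len(stack) > p:              # lines 5-11
            pop_entry()
        for comp in v[p:]:                 # line 12
            stack.append([comp, [False] * l, False])
        stack[-1][1][i] = True             # line 13
    while stack:                           # line 14
        pop_entry()
    return result
```

For example, with `S_1 = [(0,0,0), (0,1,0), (0,2)]` and `S_2 = [(0,0,1), (0,1,1,0), (0,3)]`, the function returns `[(0, 0), (0, 1)]`: the root also contains both keywords, but it is blocked by its SLCA descendants.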
StackAlgorithm outputs all the SLCA nodes, i.e., $\mathit{slca}(S_1, \cdots, S_l)$, in time $O(d \sum_{i=1}^{l} |S_i|)$, or $O(ld|S|)$ [Xu and Papakonstantinou, 2005]. Note that the above time complexity does not take into account the time to merge $S_1, \cdots, S_l$, which takes time $O(d \log l \cdot \sum_{i=1}^{l} |S_i|)$. getSmallestNode (line 3) just retrieves the next node with the smallest Dewey ID from the merged list.
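The $\log l$ factor in the merge cost comes from keeping one cursor per list in a priority queue. A hedged sketch of such a getSmallestNode, assuming tuple-encoded Dewey IDs and sorted input lists (the generator name `smallest_nodes` is hypothetical):

```python
import heapq

def smallest_nodes(lists):
    """Yield (dewey_id, keyword_index) pairs in increasing Dewey ID order."""
    # One heap item per non-empty list: (current node, list index, position).
    heap = [(S[0], i, 0) for i, S in enumerate(lists) if S]
    heapq.heapify(heap)
    while heap:
        v, i, pos = heapq.heappop(heap)    # O(log l) comparisons of O(d) each
        yield v, i
        if pos + 1 < len(lists[i]):        # advance the cursor of list i
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
```

Each node is pushed and popped once, and each heap operation compares $O(\log l)$ Dewey IDs of length at most $d$, which matches the stated merge cost.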
Indexed Lookup Eager: StackAlgorithm treats all the Dewey lists $S_1, \cdots, S_l$ equally, but sometimes $|S_1|, \cdots, |S_l|$ vary dramatically. Xu and Papakonstantinou [2005] propose IndexedLookupEager to compute all the SLCA nodes in the situation where $|S_1|$ is much smaller than $|S|$. It is based on the following properties of the slca function.
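Property 4.11 $\mathit{slca}(\{v\}, S) = \mathit{lca}(v, \mathit{closest}(v, S))$, and $\mathit{slca}(\{v\}, S_2, \cdots, S_l) = \mathit{slca}(\mathit{slca}(\{v\}, S_2, \cdots, S_{l-1}), S_l) = \mathit{lca}(v, \mathit{closest}(v, S_2), \cdots, \mathit{closest}(v, S_l))$ for $l > 2$.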
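Property 4.11 translates into a short routine. The sketch below is illustrative only: Dewey IDs are assumed to be tuples, each $S_i$ a sorted list, and `closest` follows an assumed reading of the Section 4.1.2 definition (nearest left or right neighbor, whichever has the deeper LCA with $v$). The name `slca_single` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def lca(u, v):
    """LCA of two Dewey IDs = their longest common prefix."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def closest(v, S):
    """Assumed definition: of the nearest left and right neighbors of v in
    the sorted list S, return the one whose LCA with v is deeper."""
    i, j = bisect_right(S, v), bisect_left(S, v)
    left = S[i - 1] if i > 0 else None
    right = S[j] if j < len(S) else None
    if left is None or right is None:
        return right if left is None else left
    return right if len(lca(v, right)) > len(lca(v, left)) else left

def slca_single(v, others):
    """slca({v}, S_2, ..., S_l) = lca(v, closest(v, S_2), ..., closest(v, S_l))."""
    x = v
    for S in others:
        c = closest(v, S)
        if c is None:          # some keyword has no match at all
            return None
        x = lca(x, c)          # folding pairwise LCAs gives the LCA of all
    return x
```

Each call to `closest` performs two binary searches over Dewey IDs of length at most $d$, so one call to `slca_single` costs $O(d \sum_{i=2}^{l} \log |S_i|)$.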
Property 4.11 suggests that we can find the SLCA node of a node $v$ and a set of nodes $S$ by first finding the closest node of $v$ in $S$, and then finding the LCA node of $v$ and that closest node. The definition of closest is given in Section 4.1.2. Based on Property 4.11, we can compute $\mathit{slca}(\{v_1\}, S_2, \cdots, S_l)$ by first finding the closest point of $v_1$ in each set $S_i$, denoted as $\mathit{closest}(v_1, S_i)$; the slca then consists of the single node $\mathit{lca}(v_1, \mathit{closest}(v_1, S_2), \cdots, \mathit{closest}(v_1, S_l))$. The computation of $\mathit{slca}(\{v_1\}, S_2, \cdots, S_l)$ takes time $O(d \sum_{i=2}^{l} \log |S_i|)$. For arbitrary $S_1, \cdots, S_l$, we have the following property.
Property 4.12 $\mathit{slca}(S_1, \cdots, S_l) = \mathit{removeAncestor}\left(\bigcup_{v_1 \in S_1} \mathit{slca}(\{v_1\}, S_2, \cdots, S_l)\right)$.
Property 4.12 shows that in order to find the SLCA nodes of $S_1, \cdots, S_l$, we can first find $\mathit{slca}(\{v_1\}, S_2, \cdots, S_l)$ for each $v_1 \in S_1$, and then remove all the ancestor nodes from the result. Its correctness follows from the fact that $\mathit{slca}(S_1, \cdots, S_l) = \mathit{removeAncestor}(\mathit{lca}(S_1, \cdots, S_l))$. The definition of removeAncestor is given in Section 4.1.2.
The above two properties directly lead to an algorithm to compute $\mathit{slca}(S_1, \cdots, S_l)$: (1) first compute $\{x_i\} = \mathit{slca}(\{v_i\}, S_2, \cdots, S_l)$ for each $v_i \in S_1$ ($1 \le i \le |S_1|$); (2) $\mathit{removeAncestor}(\{x_1, \cdots, x_{|S_1|}\})$ is the answer. The time complexity of the algorithm is $O(|S_1| \sum_{i=2}^{l} d \log |S_i| + |S_1| d \log |S_1|)$, or $O(|S_1| l d \log |S|)$. The first step, computing $\mathit{slca}(\{v_i\}, S_2, \cdots, S_l)$ for each $v_i \in S_1$, takes time $O(|S_1| \sum_{i=2}^{l} d \log |S_i|)$. The second step takes time $O(|S_1| d \log |S_1|)$; it can be implemented by first sorting $\{x_1, \cdots, x_{|S_1|}\}$ in increasing Dewey ID order, and then finding the SLCA nodes by a linear scan. Note that this time complexity is different from the one given by Xu and Papakonstantinou [2005], which is $O(|S_1| \sum_{i=2}^{l} d \log |S_i| + |S_1|^2)$. Although it has the same time complexity as IndexedLookupEager, the above algorithm is a blocking algorithm, while IndexedLookupEager is non-blocking.
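The two-step blocking algorithm can be sketched as follows. This is a sketch under assumptions rather than a reference implementation: Dewey IDs are tuples, lists are sorted, `closest` uses an assumed reading of the Section 4.1.2 definition, and the helper names (`slca_single`, `remove_ancestors`, `slca_blocking`) are hypothetical. The ancestor removal relies on the fact that, in sorted Dewey order, the descendants of a node immediately follow it.

```python
from bisect import bisect_left, bisect_right

def lca(u, v):
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def closest(v, S):
    """Assumed definition: nearest neighbor of v in sorted S with the deeper LCA."""
    i, j = bisect_right(S, v), bisect_left(S, v)
    left = S[i - 1] if i > 0 else None
    right = S[j] if j < len(S) else None
    if left is None or right is None:
        return right if left is None else left
    return right if len(lca(v, right)) > len(lca(v, left)) else left

def slca_single(v, others):
    """Step (1) for one node: lca of v and its closest node in every other list."""
    x = v
    for S in others:
        c = closest(v, S)
        if c is None:
            return None
        x = lca(x, c)
    return x

def remove_ancestors(nodes):
    """Step (2): sort, then drop every node that is a prefix of its successor."""
    ordered = sorted(set(nodes))
    kept = []
    for k, x in enumerate(ordered):
        nxt = ordered[k + 1] if k + 1 < len(ordered) else None
        if nxt is None or nxt[:len(x)] != x:
            kept.append(x)
    return kept

def slca_blocking(lists):
    xs = [slca_single(v, lists[1:]) for v in lists[0]]
    return remove_ancestors(x for x in xs if x is not None)
```

The algorithm is blocking because no result can be emitted before every $x_i$ has been computed and sorted.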
Lemma 4.13 Given any two nodes $v_i$ and $v_j$, with $\mathit{pre}(v_i) < \mathit{pre}(v_j)$, and a set $S$ of Dewey IDs:
1. if $\mathit{slca}(\{v_i\}, S) \ge \mathit{slca}(\{v_j\}, S)$, then $\mathit{slca}(\{v_j\}, S) \preceq \mathit{slca}(\{v_i\}, S)$.
2. if $\mathit{slca}(\{v_i\}, S) < \mathit{slca}(\{v_j\}, S)$,
• either $\mathit{slca}(\{v_i\}, S)$ is an ancestor of $\mathit{slca}(\{v_j\}, S)$,
• or $\mathit{slca}(\{v_i\}, S)$ is not an ancestor of $\mathit{slca}(\{v_j\}, S)$, in which case for any $v$ such that $\mathit{pre}(v) > \mathit{pre}(v_j)$, $\mathit{slca}(\{v_i\}, S) \nprec \mathit{slca}(\{v\}, S)$.
The correctness of the above lemma directly follows from Property 4.3 and Property 4.11. It straightforwardly leads to a non-blocking algorithm to compute $\mathit{slca}(S_1, S_2)$ by removing ancestor nodes on-the-fly, which is shown as the subroutine getSLCA in IndexedLookupEager. The above lemma can be directly applied to multiple sets with the first set as a singleton, i.e., by replacing $S$ by $S_2, \cdots, S_l$ in the lemma. The correctness directly follows from Property 4.10, Property 4.9, and Property 4.11.
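A non-blocking computation of $\mathit{slca}(S_1, S_2)$ in the spirit of Lemma 4.13 can be sketched as a Python generator. This is an illustration under assumptions, not the getSLCA subroutine itself (Algorithm 32 is not reproduced in this excerpt): Dewey IDs are tuples, both lists are sorted, and `closest` follows an assumed reading of the Section 4.1.2 definition. The variable `u` holds the single pending candidate.

```python
from bisect import bisect_left, bisect_right

def lca(u, v):
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def closest(v, S):
    """Assumed definition: nearest neighbor of v in sorted S with the deeper LCA."""
    i, j = bisect_right(S, v), bisect_left(S, v)
    left = S[i - 1] if i > 0 else None
    right = S[j] if j < len(S) else None
    if left is None or right is None:
        return right if left is None else left
    return right if len(lca(v, right)) > len(lca(v, left)) else left

def get_slca(S1, S2):
    """Yield slca(S1, S2) in increasing Dewey order, one node at a time."""
    u = None                           # pending candidate
    for v in S1:
        c = closest(v, S2)
        if c is None:                  # S2 is empty: no SLCA exists
            return
        x = lca(v, c)                  # slca({v}, S2) by Property 4.11
        if u is None:
            u = x
        elif u <= x:                   # case 2 of Lemma 4.13
            if x[:len(u)] != u:        # u is not an ancestor-or-self of x,
                yield u                # so no later node can invalidate u
            u = x
        # else: case 1, x is an ancestor-or-self of u and is dropped
    if u is not None:
        yield u
```

Because a candidate is emitted as soon as the next candidate is not its descendant, a consumer can start processing results before $S_1$ has been fully read.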
Property 4.14 $\mathit{slca}(S_1, \cdots, S_l) = \mathit{slca}(\mathit{slca}(S_1, \cdots, S_{l-1}), S_l)$ for $l > 2$.
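Property 4.14 means that the multi-list case can be reduced to repeated two-list computations. The sketch below chains a two-list routine over $S_2, \cdots, S_l$; it deliberately omits the buffering of IndexedLookupEager and assumes tuple-encoded Dewey IDs, sorted lists, $l \ge 2$, and an assumed reading of closest from Section 4.1.2. The names `get_slca` and `slca_all` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def lca(u, v):
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def closest(v, S):
    """Assumed definition: nearest neighbor of v in sorted S with the deeper LCA."""
    i, j = bisect_right(S, v), bisect_left(S, v)
    left = S[i - 1] if i > 0 else None
    right = S[j] if j < len(S) else None
    if left is None or right is None:
        return right if left is None else left
    return right if len(lca(v, right)) > len(lca(v, left)) else left

def get_slca(X, S):
    """Two-list slca(X, S); results come out in increasing Dewey order."""
    u = None
    for v in X:
        c = closest(v, S)
        if c is None:
            return
        x = lca(v, c)
        if u is None:
            u = x
        elif u <= x:
            if x[:len(u)] != u:        # u cannot be an ancestor of later results
                yield u
            u = x
    if u is not None:
        yield u

def slca_all(lists):
    """slca(S_1, ..., S_l) = slca(slca(S_1, ..., S_{l-1}), S_l), applied left to right."""
    X = lists[0]
    for S in lists[1:]:
        X = list(get_slca(X, S))       # each intermediate result stays sorted
    return X
```

Each intermediate list is sorted and free of ancestor-descendant pairs, so it is a valid first argument for the next step.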
IndexedLookupEager, as shown in Algorithm 32, directly follows from Lemma 4.13, Property 4.11, Property 4.12, and Property 4.14. $p$ in line 3 is the buffer size; it can be any value from 1 to $|S_1|$, and the smaller $p$ is, the faster the algorithm produces the first SLCA. The algorithm first computes $X_2 = \mathit{slca}(X_1, S_2)$, where $X_1$ is the next $p$ nodes from $S_1$ (line 3). Then it computes $X_3 = \mathit{slca}(X_2, S_3)$ and so on, until it computes $X_l = \mathit{slca}(X_{l-1}, S_l)$ (lines 4-5). Note that at any step, the nodes in $X_i$ are in increasing Dewey ID order, and there is no ancestor-descendant relationship between any two nodes in $X_i$. All nodes in $X_l$ except the first and the last one are guaranteed to be SLCA nodes (line 9). The first node of $X_l$ is checked at line 6. The last node of $X_l$ is carried over to the next iteration (line 9), where it is determined whether or not it is a SLCA (line 7). IndexedLookupEager outputs all the SLCA nodes, i.e., $\mathit{slca}(S_1, \cdots, S_l)$, in time $O(|S_1| \sum_{i=2}^{l} d \log |S_i|)$, or $O(|S_1| l d \log |S|)$ [Xu and Papakonstantinou, 2005].
Scan Eager: When the keyword frequencies, i.e., $|S_1|, \cdots, |S_l|$, do not differ significantly, the total cost of finding matches by lookups using binary search may exceed the total cost of finding the matches by scanning the keyword lists, i.e., $O(|S_1| l d \log |S|)$ may exceed $O(l d |S|)$. ScanEager (Algorithm 33) [Xu and Papakonstantinou, 2005] modifies line 15 of IndexedLookupEager by using a linear scan to find $\mathit{lm}()$ and $\mathit{rm}()$. It takes advantage of the fact that the accesses to any keyword list are strictly in increasing order in IndexedLookupEager. Consider the getSLCA($S_1$, $S_2$) subroutine in IndexedLookupEager. In order to find $\mathit{lm}(v, S_2)$ and $\mathit{rm}(v, S_2)$, ScanEager maintains a cursor for each keyword list, and it advances the cursor of $S_2$ until it finds the node that is closest to $v$ from the left or the right side. Note that if $\mathit{rm}(v, S_2)$ exists, then it is the node that immediately follows $\mathit{lm}(v, S_2)$ in $S_2$, or the first node in $S_2$ if $\mathit{lm}(v, S_2) = \perp$. The main idea is based on the fact that, for any $v_i$ and $v_j$ in $S_1$ with $\mathit{pre}(v_i) < \mathit{pre}(v_j)$, $\mathit{lm}(v_i, S_2) \le \mathit{lm}(v_j, S_2)$ and $\mathit{rm}(v_i, S_2) \le \mathit{rm}(v_j, S_2)$, assuming that all the $\mathit{lm}()$ and $\mathit{rm}()$ are not equal to $\perp$. Note that, in order to ensure the correctness of
