Advances in Database Technology - P11

482
M. Marx
Remark 1. XPath 1.0 (and hence Core XPath) has a strange asymmetry between the vertical (parent and child) and the horizontal (sibling) axis relations. For the vertical direction, both the transitive and the reflexive–transitive closure of the basic steps are primitives. For the horizontal direction, only the transitive closures of the immediate_left_sibling and immediate_right_sibling axes are primitives (under the rather ambiguous names preceding_sibling and following_sibling). The language defined here removes this asymmetry and has all four one-step navigational axes as primitives. XPath 1.0 also has two primitive axes related to document order, as well as the transitive closures of the one-step navigational axes. These are just syntactic sugar, as each is definable from the one-step navigational axes. So we can conclude that the language is at least as expressive as Core XPath, and has a more elegant set of primitives.
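The syntactic-sugar reading of the sibling axes can be made concrete on a toy tree model. The sketch below is ours, not the paper's: it derives following_sibling as the transitive closure of a primitive one-step sibling axis, and the document-order following axis from the one-step axes (all class and function names are our own).

```python
# Toy ordered tree: each node records its parent and an ordered list of children.
class Node:
    def __init__(self, label, children=()):
        self.label, self.parent = label, None
        self.children = list(children)
        for c in self.children:
            c.parent = self

def immediate_right_sibling(n):
    """One-step horizontal axis (a primitive in the language sketched here)."""
    if n.parent is None:
        return None
    sibs = n.parent.children
    i = sibs.index(n)
    return sibs[i + 1] if i + 1 < len(sibs) else None

def transitive_closure(step, n):
    """All nodes reachable by one or more applications of `step`."""
    out, cur = [], step(n)
    while cur is not None:
        out.append(cur)
        cur = step(cur)
    return out

def following_sibling(n):
    """XPath's following-sibling axis: the transitive closure of the
    one-step immediate-right-sibling axis; preceding-sibling is symmetric."""
    return transitive_closure(immediate_right_sibling, n)

def descendant_or_self(n):
    yield n
    for c in n.children:
        yield from descendant_or_self(c)

def following(n):
    """Document-order `following`: descendants-or-self of the right siblings
    of the node and of each of its ancestors, in document order."""
    out = []
    while n is not None:
        for s in following_sibling(n):
            out.extend(descendant_or_self(s))
        n = n.parent
    return out
```

Both derived axes are plain compositions and closures of the one-step primitives, which is all that the "syntactic sugar" claim asserts.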
The reader might wonder why the set of XPath axes was not closed under taking converses. The next theorem states that this is not needed.

Theorem 2. The sets of axes of all four XPath languages are closed under taking converses.
4 Expressive Power
This section describes the relations between the four defined XPath dialects and two
XPath dialects from the literature. We also embed the XPath dialects into first order and
monadic second order logic of trees and give a precise characterization of the conditional
path dialect in terms of first order logic.
XPath with Conditional Axis Relations
483
Core XPath [17] was introduced in the previous section. [4] considers the Core XPath fragment obtained by deleting the sibling relations and the booleans on the filter expressions.

Theorem 3.
1. The fragment of [4] is strictly contained in Core XPath, and Core XPath is strictly contained in the base dialect with all four one-step axes.
2. The base dialect is strictly contained in the dialect with conditional axes.
3. The dialect with conditional axes is strictly contained in the dialect with regular expressions.
4. The dialects with regular expressions, with and without tests, are equally expressive.

We now view the XPath dialects as query languages over trees, and compare them to first and second order logic interpreted on trees. Before we start, we must make clear what kind of queries XPath expresses.

In the literature on XPath it is often tacitly assumed that each expression is evaluated at the root. The meaning of an expression is then naturally viewed as a set of nodes; stated differently, it is a query with one variable in the select clause. This tacit assumption can be made explicit by the notion of an absolute XPath expression /locpath. The answer set of /locpath evaluated on a tree is the set of nodes reachable from the root via locpath. The relation between filter expressions and arbitrary XPath expressions evaluated at the root becomes clear from the following equivalence: a filter expression fexpr is true at the root if and only if the absolute expression obtained by using fexpr as a filter on self has a nonempty answer set. On the other hand, Table 1 makes it immediate that in general an expression denotes a binary relation.

Let FO and MSO be the first order and monadic second order languages in the signature with two binary relation symbols Descendant and Sibling and countably many unary predicates P, Q, .... Both languages are interpreted on node labeled sibling ordered trees in the obvious manner: Descendant is interpreted as the descendant relation, Sibling as the strict total order on the siblings, and each unary predicate P as the set of nodes labeled with P. For second order formulas, we always assume that all second order variables are quantified; so a formula in two free variables means a formula in two free first order variables ranging over nodes.

It is not hard to see that every expression is equivalent⁴ to an MSO formula in two free variables, and that every filter expression is equivalent to an MSO formula in one free variable. MSO can express truly second order properties, as shown in Example 2. A little bit harder is:

Proposition 1. Every (filter) expression is equivalent to an FO formula in two (one) free variable(s).

The converse of this proposition would state that the language is powerful enough to express every first order expressible query. For one variable queries on Dedekind complete linear structures, the converse is known as Kamp's Theorem [21]. Kamp's result was generalized to other linear structures and given a simple proof in the seminal paper [14]. [24] showed that the result can be generalized further to sibling ordered trees:

Theorem 4 ([24]). Every FO query in one free variable is expressible as an absolute expression.

⁴ In fact, there is a straightforward logspace translation; see the proof of Theorem 7.(i).
Digression: is first order expressivity enough? [5] argues for the non-counting property of natural languages. A counting property expresses that a path consists of a number of nodes that is a multiple of a given number. Since first order definable properties of trees are non-counting [32], by Theorem 4 the language has the non-counting property. Example 2 shows that with regular expressions one can express counting properties. DTDs also allow one to express these: e.g., the rule A ⇒ (B, B, B)* expresses that A nodes have a number of B children divisible by 3.
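A counting property such as "the number of B children is divisible by 3" is exactly what a finite automaton with a modulus counter checks; by [32] it is not first order definable. A small sketch of the automaton view (the encoding of children as a label list is ours):

```python
def b_children_divisible_by_3(children_labels):
    """A 3-state DFA over the child sequence: the state is the number
    of B children seen so far, modulo 3; accept in state 0."""
    state = 0
    for label in children_labels:
        if label == "B":
            state = (state + 1) % 3
    return state == 0
```

Running the automaton over ["B", "C", "B", "B"] accepts, while ["B", "B"] is rejected.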
It seems to us that for the node addressing language, first order expressivity is sufficient. This granted, Theorem 4 means that one need not look for other (i.e., more expressive) XPath fragments. Whether this also holds for constraints is debatable. It is a fact that many natural counting constraints can equivalently be expressed with first order constraints. Take for example the DTD rule couple ⇒ (man, woman)*, which describes the couple element as a sequence of man, woman pairs. Clearly this expresses a counting property, but the same constraint can be expressed by first order inclusions on sets of nodes, in which both the left and the right hand side are filter expressions. (In these rules we just write man instead of the cumbersome self::man, and similarly for the other node labels.)
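The claim that the couple constraint needs no counting can be checked operationally: membership in (man, woman)* is decidable by purely local conditions on each child and its right neighbor, mirroring the first order inclusions. A sketch (function name and encoding ours):

```python
def matches_couple_rule(labels):
    """(man, woman)* via purely local, first-order-style conditions:
    first child is man, last is woman, every man is immediately followed
    by woman, and every woman is last or immediately followed by man."""
    if not labels:
        return True                       # zero pairs is allowed
    n = len(labels)
    if labels[0] != "man" or labels[-1] != "woman":
        return False
    for i, lab in enumerate(labels):
        nxt = labels[i + 1] if i + 1 < n else None
        if lab == "man" and nxt != "woman":
            return False
        if lab == "woman" and nxt not in ("man", None):
            return False
        if lab not in ("man", "woman"):
            return False
    return True
```

Each condition inspects only a node and its immediate right sibling, which is why the constraint stays first order.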
In the next two sections, on the complexity of query evaluation and containment, we will continue discussing more powerful languages, partly because there is no difference in complexity, and partly because DTDs are expressible in them.
5 Query Evaluation
The last two sections are about the computational complexity of two key problems related
to XPath: query evaluation and query containment. We briefly discuss the complexity
classes used in this paper. For more thorough surveys of the related theory see [20,
27]. By PTIME and EXPTIME we denote the well-known complexity classes of problems
solvable in deterministic polynomial and deterministic exponential time, respectively,
on Turing machines.
Queries can be evaluated in linear time in the size of the data and the query. This is the same bound as for Core XPath, and it is optimal because the combined complexity of Core XPath is already PTIME hard [16]. It is not hard to see that the linear time algorithm for Core XPath in [15] can be extended to work for the full language, but the result also follows from known results about Propositional Dynamic Logic model checking [3] via the translation given in Theorem 10.

Theorem 5. For a model M, a node n in M, and an expression e, the value of e at n can be computed in time linear in |M| · |e|.
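The linear time bound can be made plausible with a bottom-up evaluator in the style of PDL/CTL model checking: each subexpression is evaluated once at every node, giving O(|tree| · |expression|) work overall. The tuple encoding of filter expressions below is ours and covers only a small fragment (node tests, booleans, child and descendant steps):

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def nodes_postorder(root):
    for c in root.children:
        yield from nodes_postorder(c)
    yield root

def subformulas(f):
    """Subformulas of a tuple-encoded filter expression, innermost first."""
    for sub in f[1:]:
        if isinstance(sub, tuple):
            yield from subformulas(sub)
    yield f

def eval_filter(root, f):
    """Bottom-up labeling: one pass over the tree per subformula, so the
    total work is proportional to |tree| times |formula|.
    Encoding (ours): ("label", a) is self::a, ("not", f), ("and", f, g),
    ("child", f) means some child satisfies f, ("descendant", f) likewise
    for some proper descendant."""
    sat = {}                            # subformula -> set of satisfying nodes
    for g in subformulas(f):
        op = g[0]
        if op == "label":
            sat[g] = {n for n in nodes_postorder(root) if n.label == g[1]}
        elif op == "not":
            sat[g] = set(nodes_postorder(root)) - sat[g[1]]
        elif op == "and":
            sat[g] = sat[g[1]] & sat[g[2]]
        elif op == "child":
            sat[g] = {n for n in nodes_postorder(root)
                      if any(c in sat[g[1]] for c in n.children)}
        elif op == "descendant":
            s = set()
            for n in nodes_postorder(root):   # postorder: children done first
                if any(c in sat[g[1]] or c in s for c in n.children):
                    s.add(n)
            sat[g] = s
        else:
            raise ValueError(op)
    return sat[f]
```

The descendant case shows the trick: a single postorder pass propagates satisfaction upward, avoiding the quadratic re-exploration a naive recursive evaluator would do.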
6 Query Containment under Constraints
We discuss equivalence and containment of XPath expressions in the presence of constraints on the document trees. Constraints can take the form of a Document Type Definition (DTD) [37] or an XML Schema [40] specification, and in general can be any statement. For XPath expressions containing the child and descendant axes and union of paths, this problem, given a DTD, is already EXPTIME-complete [26]. The containment problem has been studied in [36,9,25,26]. For example, consider a pair of XPath expressions in abbreviated syntax⁵ where the first obviously implies the second, but the converse does not hold, as there are trees with nodes having just a single child. Of course there are many situations in which the "counterexamples" to the converse containment disappear, and in which the two expressions are equivalent. For instance when
(a) every node has a child; this is the case, for instance, when the DTD contains a rule forcing a child at every node, or in general on trees satisfying the corresponding filter constraint;
(b) every node has a sibling; this holds in all trees satisfying the corresponding filter constraint;
(c) no node of the relevant label has such a child; again, this holds in all trees satisfying a filter constraint.

We consider the following decision problem: for XPath expressions, does an equivalence follow from a given finite set of equivalences? The example above gives four instances of this problem:
The first instance does not hold; all the others do. A number of instructive observations can be drawn from these examples. First, the constraints are all expressed as equivalences between XPath expressions denoting sets of nodes, while the conclusion relates two sets of pairs of nodes. This seems general: constraints on trees are naturally expressed on nodes, not on edges⁶. A DTD is a prime example. From the constraints on sets of nodes we deduce equivalences about sets of pairs of nodes. In example (a) this is immediate by substitution of equivalents. In example (b), some simple algebraic manipulation is needed.

That constraints on trees are naturally given in terms of nodes is fortunate, because we can reason about sets of nodes instead of sets of edges. The latter quickly becomes undecidable on arbitrary graphs. The desire for computationally well behaved languages for reasoning on graphs resulted in the development of several languages which can specify sets of nodes of trees or graphs, like Propositional Dynamic Logic [19], Computation Tree Logic [11] and the propositional μ-calculus [22]. Such languages are, like XPath,
⁵ In our syntax they are written with explicit axes.
⁶ This does not mean that we cannot express general properties of trees. For instance, one filter expression states that the tree has depth at most two, and another that the tree is at most binary branching.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.

486
M. Marx
typically two-sorted, having a sort for the relations between nodes and a sort for sets
of nodes. Concerns about computational complexity together with realistic modeling
needs determine which operations are allowed on which sort. For instance, boolean
intersection is useful on the set sort (and available in XPath 1.0) but not in the relational
sort. Similarly, complementation on the set sort is still manageable computationally but
leads to high complexity when allowed on the relation sort.
All these propositional languages contain a construct ⟨π⟩φ, with π from the edge sort and φ from the set sort; ⟨π⟩φ denotes the set of nodes from which there exists a π-path to a φ-node. Full boolean operators can be applied to these constructs. Note that XPath contains exactly the same construct: a location path with a filter, evaluated as a filter expression.

Thus we consider the constraint inference problem restricted to XPath expressions denoting sets of nodes, earlier called filter expressions after their use as filters of node sets in XPath. This restriction is natural, computationally still feasible, and places us in an established research tradition.
We distinguish three notions of equivalence. The first two are the same as in [4]; the third is new. Two expressions are equivalent if they denote the same relation in every tree model. They are root equivalent if their answer sets are the same when evaluated at the root⁷. The difference between these two notions is easily seen from an example: two expressions can agree when evaluated at the root (e.g., both denoting the empty set there) but nowhere else. If two filter expressions agree as filters on self at every node of every tree model, we call them filter equivalent. The same equivalence notation is used for all three notions, letting context decide which one is meant. Root and filter equivalence are closely related notions:

Theorem 6. Root equivalence can effectively be reduced to filter equivalence and vice versa.

A consequence statement expresses that an equivalence is logically implied by a finite set of constraints: this is the case if, in each model in which all the constraints hold, the equivalence also holds. The following results follow easily from the literature.
Theorem 7.
(i) The problem whether a consequence statement between expressions holds is decidable.
(ii) For filter expressions in which child is the only axis, the consequence problem is EXPTIME hard.
Decidability for the problem in (i) is obtained by an interpretation into monadic second order logic, whence the complexity is non-elementary [30]. On the other hand, (ii) shows that a single exponential complexity is virtually unavoidable: it already obtains for filter expressions with the most popular axis. Above we argued that a restriction to filter expressions is a good idea when looking for "low" complexity, and this still yields a very useful fragment. And indeed we have:

Theorem 8. For root or filter expressions, the consequence problem is in EXPTIME.
⁷ That is, if their answer sets computed at the root coincide in every tree model.
6.1 Expressing DTDs
In this section we show how a DTD can effectively be transformed into a set of constraints on filter expressions. The DTD and this set are equivalent in the sense that a tree model conforms to the DTD if and only if each filter expression is true at every node. A regular expression is a formula generated by the usual grammar over element names, with choice, sequencing, and iteration. A DTD rule is a statement assigning a regular expression to an element name. A DTD consists of a set of such rules and a rule stating the label of the root. An example of a DTD rule is family ⇒ wife; husband; kid*. A document conforms to this rule if each family node has a wife, a husband, and zero or more kid nodes as children, in that order. But that is equivalent to saying that, for this document, a certain equivalence between filter expressions holds, where first and last abbreviate the filter expressions stating the absence of a left, respectively right, sibling (thus denoting the first and the last child). This example yields the idea for an effective reduction. Note that instead of husband[fexpr] we could have used a conditional axis.
For ease of translation, we assume without loss of generality that each DTD rule assigns to its element name a regular expression in which each element name has been replaced by its conditional variant. Now a DTD rule is transformed into an equivalent filter expression constraint⁸. That this transformation is correct is most easily seen by thinking of the finite state automaton (FSA) corresponding to the regular expression. The word to be recognized is the sequence of children of a node with the rule's name, in order. The filter expression on the right hand side of the constraint processes this sequence just like an FSA would. We assumed that every such node has at least one child; the transition from the initial state corresponds to making the step to the first child, and the output state of the FSA is encoded by last, indicating the last child. The remaining parts of the expression describe the transitions of the FSA from the first child to the output state. If a DTD specifies the label of the root, this corresponds to a constraint on root. (Here and elsewhere root is an abbreviation for the filter expression stating the absence of a parent.) So we have shown the following:
Theorem 9. Let a DTD be given, and let Σ be the set of expressions obtained by the above transformation. Then, for each tree T in which each node has a single label: T conforms to the DTD if and only if every expression in Σ is true at every node of T.
Together with Theorem 8 this yields:

Corollary 1. Both root equivalence and filter equivalence of expressions given a DTD can be decided in EXPTIME.
⁸ The symbol used in the constraint denotes set inclusion. Inclusion of node sets is definable in terms of the other connectives: one set is included in another iff the first is equivalent to the conjunction of the two.
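The FSA reading of a DTD rule can be prototyped directly: compile each content model to an automaton and run it over the children of every node carrying the rule's element name. In the sketch below (ours, not the paper's construction) Python's re engine stands in for the finite automaton, with child labels encoded as a space-separated word:

```python
import re

# DTD rules as regular expressions over child-label sequences; labels are
# space-separated so the `re` engine plays the role of the FSA.
DTD = {"family": r"wife husband( kid)*"}

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def conforms(node):
    """A tree conforms iff, at every node whose label has a rule, the
    sequence of children labels is accepted by the rule's automaton."""
    rule = DTD.get(node.label)
    if rule is not None:
        word = " ".join(c.label for c in node.children)
        if re.fullmatch(rule, word) is None:
            return False
    return all(conforms(c) for c in node.children)
```

The recursive check mirrors Theorem 9: the tree conforms iff the constraint derived from the rule holds at every node.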
7 Conclusions
We can conclude that the language presented here can be seen as a stable fixed point in the development of XPath languages. It is expressively complete for first order properties on node labeled sibling ordered trees. The extra expressive power comes at no (theoretical) extra cost: query evaluation is in linear time, and query equivalence given a DTD is in exponential time. These results even hold for the much stronger language with regular expressions.

Having a stable fixed point means that new research directions come easily. We mention a few. An important question is whether the language is also complete with respect to first order definable paths (i.e., first order formulas in two free variables). Another expressivity question concerns XPath with regular expressions and tests: is there a natural extension of first order logic for which that language is complete? An empirical question related to first and second order expressivity (see the discussion at the end of Section 4) is whether second order expressivity is really needed, both for XPath and for constraint languages like DTD and XML Schema. Finally, we would like to have a normal form theorem for these languages. In particular, it would be useful to have an effective algorithm transforming expressions into directed step expressions (cf. the proof of Theorem 8).
References
1. S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web. Morgan Kaufmann, 2000.
2. N. Alechina, S. Demri, and M. de Rijke. A modal perspective on path constraints. Journal of Logic and Computation, 13:1–18, 2003.
3. N. Alechina and N. Immerman. Reachability logic: An efficient fragment of transitive closure logic. Logic Journal of the IGPL, 8(3):325–337, 2000.
4. M. Benedikt, W. Fan, and G. Kuper. Structural properties of XPath fragments. In Proc. ICDT'03, 2003.
5. R. Berwick and A. Weinberg. The Grammatical Basis of Natural Languages. MIT Press, Cambridge, MA, 1984.
6. P. Blackburn, B. Gaiffe, and M. Marx. Variable free reasoning on finite trees. In Proceedings of Mathematics of Language (MOL-8), Bloomington, 2003.
7. D. Calvanese, G. De Giacomo, and M. Lenzerini. Representing and reasoning on XML documents: A description logic approach. J. of Logic and Computation, 9(3):295–318, 1999.
8. E.M. Clarke and B.-H. Schlingloff. Model checking. Elsevier Science Publishers, to appear.
9. A. Deutsch and V. Tannen. Containment of regular path expressions under integrity constraints. In Knowledge Representation Meets Databases, 2001.
10. J. Doner. Tree acceptors and some of their applications. J. Comput. Syst. Sci., 4:405–451, 1970.
11. E.M. Clarke and E.A. Emerson. Design and synthesis of synchronization skeletons using branching time temporal logic. In D. Kozen, editor, Proceedings of the Workshop on Logics of Programs, volume 131 of LNCS, pages 52–71. Springer, 1981.
12. M. Fischer and R. Ladner. Propositional dynamic logic of regular programs. J. Comput. Syst. Sci., 18(2):194–211, 1979.
13. D.M. Gabbay, I. Hodkinson, and M. Reynolds. Temporal Logic. Oxford Science Publications, 1994. Volume 1: Mathematical Foundations and Computational Aspects.
14. D.M. Gabbay, A. Pnueli, S. Shelah, and J. Stavi. On the temporal analysis of fairness. In Proc. 7th ACM Symposium on Principles of Programming Languages, pages 163–173, 1980.
15. G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms for processing XPath queries. In Proc. of the 28th International Conference on Very Large Data Bases (VLDB 2002), 2002.
16. G. Gottlob, C. Koch, and R. Pichler. The complexity of XPath query evaluation. In PODS 2003, pages 179–190, 2003.
17. G. Gottlob and C. Koch. Monadic queries over tree-structured data. In Proc. LICS, Copenhagen, 2002.
18. D. Harel. Dynamic logic. In D.M. Gabbay and F. Guenther, editors, Handbook of Philosophical Logic, volume 2, pages 497–604. Reidel, Dordrecht, 1984.
19. D. Harel, D. Kozen, and J. Tiuryn. Dynamic Logic. MIT Press, 2000.
20. D. Johnson. A catalog of complexity classes. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume B, pages 67–161. Elsevier, 1990.
21. J.A.W. Kamp. Tense Logic and the Theory of Linear Order. PhD thesis, University of California, Los Angeles, 1968.
22. D. Kozen. Results on the propositional mu-calculus. Th. Comp. Science, 27, 1983.
23. D. Kozen. Kleene algebra with tests. ACM Transactions on Programming Languages and Systems, 19(3):427–443, May 1997.
24. M. Marx. The expressively complete XPath fragment. Manuscript, July 2003.
25. G. Miklau and D. Suciu. Containment and equivalence for an XPath fragment. In Proc. PODS'02, pages 65–76, 2002.
26. F. Neven and T. Schwentick. XPath containment in the presence of disjunction, DTDs, and variables. In ICDT 2003, 2003.
27. Ch. Papadimitriou. Computational Complexity. Addison–Wesley, 1994.
28. V. Pratt. Models of program logics. In Proceedings FoCS, pages 115–122, 1979.
29. M. Rabin. Decidability of second order theories and automata on infinite trees. Transactions of the American Mathematical Society, 141:1–35, 1969.
30. K. Reinhardt. The complexity of translating logic to finite automata. In E. Grädel et al., editors, Automata, Logics, and Infinite Games, volume 2500 of LNCS, pages 231–238. 2002.
31. J. Rogers. A Descriptive Approach to Language Theoretic Complexity. CSLI Press, 1998.
32. W. Thomas. Logical aspects in the study of tree languages. In B. Courcelle, editor, Ninth Colloquium on Trees in Algebra and Programming, pages 31–50. CUP, 1984.
33. M.Y. Vardi and P. Wolper. Automata-theoretic techniques for modal logics of programs. Journal of Computer and System Sciences, 32:183–221, 1986.
34. P. Wadler. Two semantics for XPath. Technical report, Bell Labs, 2000.
35. M. Weyer. Decidability of S1S and S2S. In E. Grädel et al., editors, Automata, Logics, and Infinite Games, volume 2500 of LNCS, pages 207–230. Springer, 2002.
36. P. Wood. On the equivalence of XML patterns. In Proc. 1st Int. Conf. on Computational Logic, volume 1861 of LNCS, pages 1152–1166, 2000.
37. W3C. Extensible markup language (XML) 1.0.
38. W3C. XML path language (XPath 1.0).
39. W3C. XML path language (XPath 2.0).
40. W3C. XML schema part 1: Structures.
41. W3C. XQuery 1.0: A query language for XML. http://www.w3.org/TR/xquery/.
42. W3C. XSL transformations language XSLT 2.0. http://www.w3.org/TR/xslt20/.
A Appendix

A.1 XPath and Propositional Dynamic Logic

The next theorem states that PDL and the filter expressions are equally expressive and effectively reducible to each other.
For our translation it is easier to use a variant of standard PDL in which the state formulas are boolean combinations of formulas ⟨π⟩, for π a program; ⟨π⟩ has the same meaning as in standard PDL. Clearly this variant is equally expressive as PDL, by the standard diamond equivalences.

Theorem 10. There are logspace translations from filter expressions to PDL formulas, and in the converse direction, such that in every model each expression and its translation are true at the same nodes.
PROOF OF THEOREM 10. Let the translation from filter expressions to PDL formulas be defined by structural recursion. Then (2) holds. For the other direction, let the translation commute with the booleans and map each diamond to a filter expression on self, where each occurrence of a test in the program is replaced by a test on self. Then (3) holds.
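The shape of the translation can be sketched as a structural recursion. The constructor names and the fragment covered (node tests, booleans, and a single location step with a filter) are ours; the clause for steps shows the key idea that axis::t[f] becomes the PDL diamond ⟨axis⟩(t ∧ f°):

```python
# Filter expressions (tuple-encoded, our encoding) map to PDL formulas
# homomorphically: a node test becomes a proposition letter, booleans
# commute, and a location step  axis::t[f]  becomes  <axis>(t and f°).
def to_pdl(fexpr):
    op = fexpr[0]
    if op == "test":                      # self::t
        return ("prop", fexpr[1])
    if op == "not":
        return ("not", to_pdl(fexpr[1]))
    if op == "and":
        return ("and", to_pdl(fexpr[1]), to_pdl(fexpr[2]))
    if op == "step":                      # axis::t[f]
        axis, test, f = fexpr[1], fexpr[2], fexpr[3]
        return ("dia", axis, ("and", ("prop", test), to_pdl(f)))
    raise ValueError(op)
```

The translation is a single structural pass over the expression, in line with the logspace bound stated in Theorem 10.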
A.2 Proofs
PROOF OF THEOREM 2. Every axis containing converses can be rewritten into one without converses by pushing them inward: the converse of a composition is the reversed composition of the converses, converse distributes over union and star, tests are their own converses, and the converse of a primitive one-step axis is again a primitive one-step axis.
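The converse-pushing equivalences are the standard laws of Kleene algebra with converse: (p; q)⁻ = q⁻; p⁻, (p ∪ q)⁻ = p⁻ ∪ q⁻, (p*)⁻ = (p⁻)*, a test is its own converse, and the converse of a one-step axis is the opposite axis. A sketch of the rewriting (tuple encoding ours):

```python
INVERSE_AXIS = {"child": "parent", "parent": "child",
                "right": "left", "left": "right"}

def push_converse(path):
    """Rewrite conv(p) into a converse-free path, pushing conv inward
    with the Kleene-algebra-with-converse laws until it hits the atoms."""
    op = path[0]
    if op == "conv":
        inner = path[1]
        iop = inner[0]
        if iop == "axis":                 # converse of a primitive axis
            return ("axis", INVERSE_AXIS[inner[1]])
        if iop == "seq":                  # conv(p;q) = conv(q);conv(p)
            return ("seq", push_converse(("conv", inner[2])),
                           push_converse(("conv", inner[1])))
        if iop == "union":                # conv distributes over union
            return ("union", push_converse(("conv", inner[1])),
                             push_converse(("conv", inner[2])))
        if iop == "star":                 # conv(p*) = conv(p)*
            return ("star", push_converse(("conv", inner[1])))
        if iop == "test":                 # tests are self-converse
            return inner
        if iop == "conv":                 # conv(conv(p)) = p
            return push_converse(inner[1])
        raise ValueError(iop)
    if op in ("seq", "union"):
        return (op, push_converse(path[1]), push_converse(path[2]))
    if op == "star":
        return ("star", push_converse(path[1]))
    return path
```

Repeated application terminates because each step moves the conv constructor strictly closer to the atoms.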

PROOF OF THEOREM 3. The fragment of [4] is a fragment of Core XPath but does not have negation on predicates; that explains the first strict inclusion. The definition given here was explicitly designed to make the connection with Core XPath immediate. The language does not have the Core XPath axes following and preceding as primitives, but they are definable, as explained in Remark 1. All other Core XPath axes are clearly present. The inclusion is strict because Core XPath does not contain the immediate right and left sibling axes.
The next strict inclusion holds already on linear structures. This follows from a fundamental result in temporal logic stating that on such structures the temporal operator until is not expressible by the operators next-time, sometime-in-the-future, and their inverses [13]. The last two correspond to the XPath axes child and descendant, respectively. Until(A, B) is expressible with conditional paths.
The next strict inclusion also holds on linear structures already. The language with conditional paths is a fragment of the first order logic of ordered trees by Proposition 1, but Example 2 expresses a relation which is not first order expressible on trees [32].

For the last item, one direction holds because the conditional axis can be expressed by ? and ;. For the other direction, let an axis be given, and apply to all its subterms the following rewrite rules until no more is applicable⁹. From the axioms of Kleene algebras with tests [23] it follows that all rules state an equivalence. Then all tests occurring under the scope of a star or in a sequence will be conditional to an atomic axis.
Now consider a location step axes::ntst[fexpr]. If axes is a conditional expression or a sequence, the location step is in the language. If it is a union or a test, delete the tests using the appropriate equivalences.
PROOF OF PROPOSITION 1. The only interesting cases are transitive closures of conditional paths. Such an expression translates, letting (·)° denote the translation, to a first order formula built from the translations of its parts.
PROOF OF THEOREM 5. Computing the set of states in a model at which a PDL formula is true can be done in time linear in the size of the model times the length of the formula [3]. Thus, for filter expressions, the result follows from Theorem 10. Now let an arbitrary expression be given, and suppose we want to compute its value. Expand the model with a new label marking the node of evaluation, and use the technique in the proof of Theorem 6 to obtain an equivalent filter expression (here use the new label instead of root).
PROOF OF THEOREM 6. We use the fact that the axes are closed under converse (Theorem 2) and that every location path is equivalent to a simple location path. We claim that a root equivalence holds in each model if and only if two corresponding filter equivalences hold; this is easily proved by writing out the definitions. Filter equivalence can be reduced to root equivalence in the same way.
PROOF OF THEOREM 7. (i) Decidability follows from an interpretation in the monadic second order logic over variably branching trees of [31]. The consequence problem for that logic is shown to be decidable by an interpretation into a decidable monadic second order theory of trees, in which finiteness of a tree can be expressed. The translation of expressions into monadic second order logic is straightforward given the meaning definition. We only have to use second order quantification to define the transitive closure of a relation. This can be done by the usual definition: for R any binary relation, a pair (x, y) is in the transitive closure of R iff y belongs to every set which contains all R-successors of x and is closed under R.

(ii) Theorem 10 gives an effective reduction from PDL tree formulas to filter expressions. The consequence problem for ordinary PDL interpreted on graphs is EXPTIME hard [12]. An inspection
⁹ An exponential blowup cannot be avoided: consider a composition of union expressions over the atomic axes. The only way of rewriting this into an axis is to fully distribute the unions over the compositions, leaving a union with exponentially many members.
of the proof shows that a single path can be used as the only program and that checking consequence on finite trees is sufficient.
PROOF OF THEOREM 8. We first give a proof by a reduction to deterministic PDL with converse. Then we give a direct algorithm which might be easier to implement; this algorithm unfortunately works only for expressions in which no arrow and its inverse occur together under the Kleene star.
Proof by reduction. By Theorem 10 and standard PDL reasoning, we need only decide satisfiability of PDL formulas in the signature with the four arrow programs, interpreted on finite trees. Call this language compass-PDL. Consider the PDL language with only two programs and their inverses, interpreted on finite, at most binary-branching trees, with the two programs interpreted by the first and second daughter relation, respectively. We will effectively reduce compass-PDL satisfiability to satisfiability in this binary language, which is a fragment of deterministic PDL with converse. [33] shows that the satisfiability problem for this latter language is decidable in EXPTIME over the class of all models. This is done by constructing for each formula a tree automaton which accepts exactly all tree models in which the formula is satisfied. Thus deciding satisfiability reduces to checking emptiness of the automaton, which can be done in time polynomial in the size of the automaton. As the size of the automaton is exponential in the length of the formula, this yields the exponential time decision procedure.

But we want satisfiability on finite trees. This is easy to cope with in an automata-theoretic framework: construct an automaton which accepts only finite binary trees, and check emptiness of the product automaton. The size of the finiteness automaton does not depend on the formula, so this problem is still in EXPTIME.

The reduction from compass-PDL to the binary language is very simple: replace the compass-PDL programs by suitable programs over the first and second daughter relations. It is straightforward to prove that this reduction preserves satisfiability, following the reduction to S2S explained in [35]: a compass-PDL model is turned into a binary model by letting the first daughter encode the first child and the second daughter the right sibling, and conversely.
Direct proof. Consider expressions as in the theorem. By Theorem 6 we may assume that all terms are filter expressions. For this proof, assume also that the expressions are directed: in no subexpression under a star may an arrow occur together with its inverse. First we bring these expressions into a normal form. Define the set of step paths by a grammar whose basic case is a single arrow, possibly conditional, where the conditions can be any paths. The next lemma gives the reason for considering directed step paths; its simple proof is omitted.

Lemma 1. Let p be a step path not containing an arrow axis and its inverse. Then in any model the relation denoted by p is irreflexive and conversely well-founded.

Lemma 2. Every expression is effectively reducible to one in which the argument of every Kleene star is a step path.
Proof. Define a function G from expressions to sets of expressions which yields the set of step paths that "generate" the expression: G maps a step path to itself, and decomposes other expressions into the generators of their parts. Now define a translation into expressions in which the argument of every star is a step path: it leaves step paths unchanged and otherwise rewrites a starred expression using its generators. A rather straightforward induction shows that the translation preserves meaning and that its output is linear in the size of its input.
The rest of the proof is a reduction to the consequence problem for a simple language interpreted on binary trees. The language is equivalent to filter expressions with only two axes, first and second child. We could give the proof in terms of that filter language, but its syntax is very cumbersome to work with. For that reason we use the effective reduction to PDL formulas from Theorem 10 and do the rest of the proof in PDL syntax.

The proof consists of two linear reductions and a decision algorithm. The first reduction removes the transitive closure operation by adding new propositional symbols; similar techniques are employed in [29,10] for obtaining normalized monadic second order formulas. The second reduction is based on Rabin's reduction to S2S.

Let the base language be the modal language¹⁰ with only two atomic paths and the modal constant root, interpreted on finite binary trees, with the two paths interpreted by the first and second daughter relation, respectively; root holds exactly at the root. The semantics is the same as that of PDL.

The consequence problem, for PDL and for the base language, is the following: given formulas, determine whether the conclusion holds in each model (with node set T) in which the premises hold.
Lemma 3. The consequence problem for the base language is decidable in EXPTIME.

Proof. A direct proof of this lemma, using a bottom up version of Pratt's [28] EXPTIME Hintikka set elimination technique, was given in [6]. As the base language is a sublanguage of deterministic PDL with converse, the lemma also follows from the argument given above in the proof by reduction.
A PDL formula in which the programs are axes, with the restriction that for every occurrence of, is a step path not containing an arrow and its inverse, is called a directed step PDL formula.
The next lemma, together with Lemmas 2 and 3 and Theorem 10, yields the EXPTIME result for directed expressions.
Lemma 4. There is an effective reduction from the consequence problem for directed step PDL
formulas to the consequence problem for formulas.
Proof. Let be a directed step PDL formula. Let be the Fisher–Ladner closure [12] of
We associate a formula with as follows. We create for each a new propositional
variable and for each
we also create
Now “axiomatizes”
these new variables as follows:
We claim that for every model which validates for every node and for every new
variable In the proof we use the usual logical notation instead of the
more cumbersome
The proof is by induction on the structure of the formula and the path and for the left to right
direction of the case by induction on the depth of direction of. The case for proposition letters and the Boolean cases are immediate by the axioms in. For the diamonds, we also need an induction on the structure of the paths, so let
¹⁰ That is, the syntax is defined as that of PDL, with and
for
Iff, by
the axiom, Iff for some such that Iff, by induction hypothesis,
for some such that Iff The cases for the conditional arrows are
shown similarly. The cases for composition and union and the right to left direction of the Kleene
star case follow immediately from the axioms. For the left to right direction of the Kleene star we
also do an induction on the depth of the direction of This is possible by Lemma 1. As an example
let the direction be Let be a leaf and let By the axiom in or
The latter implies, by induction on the complexity of paths, that
But that is not possible by Lemma 1 as is a leaf. Thus and by IH, whence
Now let be a node with descendants, and let the claim hold for nodes with
descendants. Let Then by the axiom In the first case,
by IH. In the second case, by the induction on the structure of paths,
As is a directed step path, there exists a child of and Whence, by the induction
on the depth of the direction But then also
Hence for directed step PDL formulas, we have
As the language is closed under conjunction, we need only consider problems of the form
and we do that from now on.
Note that the only modalities occurring in are for one of the four arrows. We
can further reduce the number of arrows to only when we add two modal constants root
and first for the root and first elements, respectively. Let be a formula in this fragment. As
before create a new variable for each (single negation of a) subformula of Create
each subformula and add to the axioms
We claim that for every model which validates for every node and for every subformula
iff An easy induction shows this. Consider the case of
then the parent of models whence by inductive hypothesis, it models
so by the axiom Conversely, if then by axiom
is not the root. So the parent of exists and it models Then it models
by axiom and by inductive hypothesis it models Thus Hence,
Note that the formulas on the right hand side only contain the modalities and Now we
come to the second reduction: to the consequence problem of binary branching trees. Let be a
formula, translate it to a formula as in the proof by reduction. Note that this is a directed
step PDL formula. Finally use the first reduction again to reduce the consequence problem to
the consequence problem of the language with just the modalities and interpreted on
binary trees.
Chaining all these reductions together, we obtain the reduction stated in Lemma 4.
Declustering Two-Dimensional Datasets over
MEMS-Based Storage*
Hailing Yu, Divyakant Agrawal, and Amr El Abbadi
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, 93106, USA
{hailing,agrawal,amr}@cs.ucsb.edu
Abstract. Due to the large difference between seek time and transfer time in
current disk technology, it is advantageous to perform large I/O using a single
sequential access rather than multiple small random I/O accesses. However, prior
optimal cost and data placement approaches for processing range queries over two-
dimensional datasets do not consider this property. In particular, these techniques
do not consider the issue of sequential data placement when multiple I/O blocks
need to be retrieved from a single device. In this paper, we reevaluate the optimal cost of range queries by declustering two-dimensional datasets over multiple devices, and prove that, in general, it is impossible to achieve the new optimal cost.
This is because disks cannot facilitate two-dimensional sequential access which
is required by the new optimal cost. Fortunately, MEMS-based storage is being
developed to reduce I/O cost. We first show that the two-dimensional sequential access requirement cannot be satisfied by simply modeling MEMS-based storage
as conventional disks. Then we propose a new placement scheme that exploits the
physical properties of MEMS-based storage to solve this problem. Our theoretical analysis and experimental results show that the new scheme achieves almost
optimal results.
1 Introduction
Multi-dimensional datasets have received a lot of attention due to their wide range of applications, such as relational databases, images, GIS, and spatial databases. Range queries are an
important class of queries for multi-dimensional data applications. In recent years, the
size of datasets has been rapidly growing; hence, fast retrieval of relevant data is the key to improving query performance. A common method of browsing in geographic applications is to display a low-resolution map of some area on a screen. A user may specify a rectangular
bounding box over the map, and request the retrieval of the higher resolution map of the
specified region. In general, the amount of data associated with different regions may
be quite large, and may involve a large number of I/O transfers. Furthermore, due to
advances in processor and semi-conductor technologies, the gap in access cost between
main memory and disk is constantly increasing. Thus, I/O cost will dominate the cost of range queries.
* This research is supported by the NSF under grants CCR-9875732, IIS-0220152, and EIA 00-80134.
E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 495–512, 2004.
© Springer-Verlag Berlin Heidelberg 2004
To alleviate this problem, two-dimensional datasets are often uniformly
divided into tiles. The tiles are then declustered over multiple disks to improve query
performance through parallel I/O. In [10], it is shown that strictly optimal solutions
for range queries exist in only a small number of cases. Given a query that retrieves A tiles, the optimal cost is ⌈A/M⌉, where M is the number of I/O devices. A data
assignment scheme is said to be strictly optimal if it satisfies the optimal cost. Because it
is impossible to find the optimal solution for the general case, several assignment schemes
have been proposed [5,11,3,4,6] to achieve near optimal solutions. However, all prior
research, including the optimal cost formulation itself, is based on the assumption that
the retrieval of each tile takes constant time and is retrieved independently (random I/O
involving both seek and transfer time). This condition does not hold when considering
current disk technology. Recent trends in hard disk development favor reading large
chunks of consecutive disk blocks (sequential access). Thus, the solutions for allocating
two-dimensional data across multiple disk devices should also account for this factor.
In this paper, we reevaluate the optimal cost for current disk devices, and show that it is
still impossible to achieve the new optimal cost for the general case.
Even though the performance gap between main memory and disks is mitigated
by a variety of techniques, it is still bounded by the access time of hard disks. The
mechanical positioning system in disks limits the access time improvements to only
about 7% per year. Therefore, this performance gap can only be overcome by inventing
new storage technology. Due to advances in nano-technology, Micro-ElectroMechanical
Systems (MEMS) based storage systems are appearing on the horizon as an alternative
to disks [12,1,2]. MEMS are devices that have microscopic moving parts made by using
techniques similar to those used in semiconductor manufacturing. As a result, MEMS
devices can be produced and manufactured at a very low cost. Preliminary studies [13]
have shown that stand-alone MEMS based storage devices reduce I/O stall time by 4 to
74 times over disks and improve the overall application run times by 1.9X to 4.4X.
MEMS-based storage devices have very different characteristics from disk devices.
When compared to disk devices, head movement in MEMS-based storage devices can
be achieved in both X and Y dimensions, and multiple probe tips (over thousands)
can be activated concurrently to access data. In [8], Griffin et al. consider the problem
of integrating MEMS-based storage into computers by modeling them as conventional
disks. In [9], we proposed a new data placement scheme for relational data on MEMS-based storage by taking its characteristics into account. The performance analysis showed
that it is beneficial to use MEMS storage in a novel way. Range queries over two-dimensional data (maps or images) are another data-intensive application; thus, in this
paper, we explore data placement schemes for two-dimensional data over MEMS-based
storage to facilitate efficient processing of range queries. In our new scheme, we tile two-dimensional datasets according to the physical properties of I/O devices. In all previous
work, the dataset is divided into tiles along each dimension uniformly. Through the
performance study, we show that tiling the dataset based on the properties of devices
can significantly improve the performance without any adverse impact on users, since
users only care about query performance, and do not need to know how or where data
is stored. The significant improvement in performance further demonstrates the great
potential that novel storage devices, like MEMS, have for database applications.
The rest of the paper is organized as follows. Section 2 revisits the optimal cost of a
range query on a two-dimensional dataset. In Section 3, we first give a brief introduction
of MEMS-based storage; then we adapt existing data placement schemes proposed for
disks to MEMS-based storage. A novel scheme is proposed in Section 4. Section 5
presents the experimental results. We conclude the paper in Section 6.
2 Background
In this section, we first address the problem of existing optimal cost evaluation for range
queries over two-dimensional data, then we derive the optimal cost for current disks.
2.1 Access Time versus Transfer Time
A disk is organized as a sequence of identically-sized disk pages. In general, the I/O
performance of queries is expressed in terms of the number of pages accessed, assuming
that the access cost is the same for each page. Access time is composed of three elements:
seek time, rotational delay, and transfer time. We refer to the sum of the seek time and
rotational delay as access time. Based on this assumption, the optimal cost of a query is
derived as

    Cost_prior = ⌈A/M⌉ × (T_access + T_transfer)    (1)

where A is the number of tiles intersecting the query, M is the number of disk devices, T_access is the average access time of one page, and T_transfer is the average transfer time of one page (all the terms have the same meaning throughout the paper). We will refer to equation (1) as the prior optimal cost. Because each tile is retrieved by an independent
random I/O, in order to improve the access cost of a query, existing data placement
schemes decluster the dataset into tiles and assign neighboring tiles to different devices,
thus allowing concurrent access of multiple pages.
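As a sketch, the prior optimal cost can be written as a small function; the timing values passed in below are illustrative stand-ins (in milliseconds), not figures from the paper:

```python
import math

def prior_optimal_cost(num_tiles, num_devices, t_access, t_transfer):
    """Prior optimal cost (equation 1): every tile is fetched by an
    independent random I/O, so the ceil(A/M) tiles assigned to the
    busiest device each pay a full access plus a transfer."""
    per_device = math.ceil(num_tiles / num_devices)
    return per_device * (t_access + t_transfer)

# 8 tiles declustered over 4 disks: 2 random I/Os per device.
cost = prior_optimal_cost(8, 4, t_access=5.0, t_transfer=0.05)  # ms
```

Because every tile pays the access overhead, the only lever this model leaves is parallelism across devices, which is exactly why the schemes above spread neighboring tiles.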
Fig. 1. An image placed on two disk devices
In recent years, disk external transfer rates have improved significantly, from 4 MB/s to over 500 MB/s, and the internal transfer rate is up to 160 MB/s. However, access time has only improved by a factor of 2. For a disk with an 8 KB page size, a 160 MB/s internal transfer rate, and 5 ms access time, the ratio of access time over transfer time is about
100. Thus, current disk technology is heavily skewed towards sequential access instead
of random access due to the high cost of access time. This principle can benefit range
queries over two-dimensional data, if we could place the related tiles of a query that are
allocated on a single disk device sequentially. For example, Figure 1 shows an image
with sixteen disk pages (or tiles). Assume that this image is placed on two disk devices.
According to [10], tiles will be placed as shown in Figure 1. If tiles 1 and 3 are placed
sequentially on disk device 1 and tiles 2 and 4 are placed on device 2, the access cost of query Q (given by the dashed rectangle in Figure 1) is T_access + 2 × T_transfer instead of 2 × (T_access + T_transfer). Based on this observation, we reevaluate the optimal cost
of range queries for hard disks.
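The arithmetic behind the Figure 1 example can be checked directly; the constants below are the illustrative disk figures quoted above (5 ms access time, 8 KB pages, 160 MB/s internal transfer rate):

```python
T_ACCESS = 5.0                          # ms: average seek + rotational delay
T_TRANSFER = 8 / (160 * 1024) * 1000    # ms: one 8 KB page at 160 MB/s

# Query Q touches two tiles on each device (tiles 1 and 3 on device 1).
random_cost = 2 * (T_ACCESS + T_TRANSFER)    # two independent random I/Os
sequential_cost = T_ACCESS + 2 * T_TRANSFER  # one seek, then two consecutive pages

ratio = T_ACCESS / T_TRANSFER  # roughly 100, as stated above
assert sequential_cost < random_cost
```

With the access/transfer ratio near 100, placing the two tiles consecutively nearly halves the per-device cost of this query.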
2.2 Optimal Cost Estimation
Because current hard disks are more efficient when accessing data sequentially, the advantage of sequential access must be considered to facilitate fast retrieval of the
related tiles in a range query. Ideally, if all tiles related to a query, which are allocated
to the same device, are placed sequentially, then only one access time is incurred for the
query. Thus the optimal retrieval cost is given by the following formula (referred to as the new optimal cost):

    Cost_new = T_access + ⌈A/M⌉ × T_transfer + (⌈⌈A/M⌉ / B⌉ - 1) × T_switch    (2)

where B is the average number of blocks in a track and T_switch is the time of switching
from one track to a neighboring track. The first term in the formula is the cost of one
access time, because all the related tiles on one device are placed consecutively, and
all initial accesses to different disks are performed concurrently. The second term is the
maximal transfer time of all devices, which is derived by considering the parallelism
among multiple disk devices. The third term represents the track switching time if the
number of involved tiles is too large to be stored on one track. However, it
is impossible to achieve the optimal cost for all queries, as shown by the following
argument.
In general, the two-dimensional data is divided into tiles. The number of disk devices is M. Consider a query with A tiles. Without loss of generality, assume A is much smaller than the track capacity (thus there is no track switching time), and A is equal to M, as shown in Figure 2 (a). Based on the new optimal cost formula (2), each tile in this query is allocated to a different device. Thus, the access cost of this query is T_access + T_transfer. This cost is optimal. Figure 2 (b) shows a second query which extends the first query in the X dimension. The number of tiles in this second query is 2A. In order to achieve the optimal cost, each device is allocated two tiles, and these two tiles need to be placed sequentially. Alternatively, we extend the first query in the Y dimension to generate a new query, as shown in Figure 2 (c). This third query also accesses 2A tiles. Therefore, each device needs to place two tiles (one from A and one from
Fig. 2. Queries in a data set
Fig. 3. Tile allocation to sequential disk pages
sequentially to achieve the optimal cost. However, it is impossible to achieve the optimal
cost for both query and Assume tiles are on device tile intersects
tiles intersect and tiles intersect In order to achieve sequential access
for query, tiles and need to be placed next to each other. Thus, tile cannot be placed next to, as shown in Figure 3. Alternatively, if is placed next to, then cannot be placed next to. Thus, in general, it is impossible to achieve the optimal cost for
all queries of a data set, given the physical characteristics of disks.
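The three-term cost of equation (2) can be sketched as follows, under one plausible reading of the track-switch term; the parameter names, and the exact form of that third term, are our assumptions rather than the paper's notation:

```python
import math

def new_optimal_cost(num_tiles, num_devices, t_access, t_transfer,
                     blocks_per_track, t_switch):
    """New optimal cost (equation 2): one access to position the heads,
    a parallel sequential transfer of ceil(A/M) tiles on each device,
    and a track switch for every extra track the run spills onto."""
    per_device = math.ceil(num_tiles / num_devices)
    extra_tracks = max(0, math.ceil(per_device / blocks_per_track) - 1)
    return t_access + per_device * t_transfer + extra_tracks * t_switch

# The query of Figure 2 (a): A = M, so one sequential page per device.
cost = new_optimal_cost(4, 4, t_access=5.0, t_transfer=0.05,
                        blocks_per_track=100, t_switch=1.0)
```

Note that the access term is paid once rather than ⌈A/M⌉ times, which is the entire gap between equations (1) and (2).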
3 Adapting Disk Allocation Schemes to MEMS Storage
MEMS devices are novel experimental systems that can be used as storage. Their
two-dimensional structure and inherent parallel retrieval capability hold the promise of
solving the problems disks face during the retrieval of two-dimensional datasets. In this
section, we first briefly introduce MEMS-based storage, then show how to adapt existing
disk allocation schemes to MEMS-based storage.
3.1 MEMS-Based Storage
MEMS are extremely small mechanical structures on the order of to
which are formed by the integration of mechanical elements, actuators, electronics,
and sensors. These micro-structures are fabricated on the surface of silicon wafers by
using photolithographic processes similar to those employed in manufacturing standard
semiconductor devices. MEMS-based storage systems use MEMS for the positioning of
read/write heads. Therefore, MEMS-based storage systems can be manufactured at a very
low cost. Several research centers are developing real MEMS-based storage devices, such
as IBM Millipede [12], Hewlett-Packard Laboratories ARS [2], and Carnegie Mellon
University CHIPS [1].
Due to the difficulty of manufacturing efficient and reliable rotating parts in silicon,
MEMS-based storage devices do not make use of rotating platters. All of the designs
Fig. 4. The architecture of CMU CHIPS
use a large-scale MEMS array, thousands of read/write heads (called probe tips), and
a recording media surface. The probe tips are mounted on micro-cantilevers embedded
in a semiconductor wafer and arranged in a rectangular fashion. The recording media
is mounted on another rectangular silicon wafer (called the media sled) that uses different techniques [12,1] for recording data. The IBM Millipede uses special polymers
for storing and retrieving information [12]. The CMU CHIPS project adopts the same
techniques as data recording on magnetic surfaces [1]. To access data under this model,
the media sled is moved from its current position to a specific position defined by X and Y coordinates. After the "seek" is performed, the media sled moves in the Y direction while
the probe tips access the data. The data can be accessed sequentially in the Y dimension.
The design is shown in Figure 4 (a).
The media sled is organized into rectangular regions at the lower level. Each of
these rectangular regions contains M × N bits and is accessible by one tip, as shown
in Figure 4 (b). Bits within a region are grouped into vertical 90-bit columns called tip
sectors. Each tip sector contains 8 data bytes, which is the smallest accessible unit of
data in MEMS-based storage [8]. In this paper, we will use the CMU CHIPS model
to motivate two-dimensional data allocation techniques for MEMS-based storage. The
parameters are given in Table 1. Because of heat-dissipation constraints, in this model,
not all tips can be activated simultaneously, even though it is theoretically feasible. In the
CMU CHIPS model, the maximal number of tips that can be activated concurrently is
limited to 1280 (which provides the intra-device parallelism). Each rectangular region
stores 2000 × 2000 bits.
3.2 Adapting Disk Placement Approaches
Even though MEMS-based storage has very different characteristics from disk devices,
in order to simplify the process of integrating it into computing systems, some prior
approaches have been proposed to map MEMS-based storage into disk-like devices.
Using disk terminology, a cylinder is defined as the set of all bits with identical
offset within a region; i.e., a cylinder consists of all bits accessible by all tips when
the sled moves only in the Y direction. Due to power and heat considerations, only a subset of the tips can be activated simultaneously. Thus cylinders are divided into tracks. A track
consists of all bits within a cylinder that can be accessed by a group of concurrently
active tips. Each track is composed of multiple tip sectors, which contain less data than
sectors of disks. Sectors can be grouped into logical blocks. In [8], each logical block
is 512 bytes and is striped across 64 tips. Therefore, by using the MEMS-based storage
devices in Table 1, there are up to 20 logical blocks that can be accessed concurrently
(1280 concurrent tips/64 tips = 20). Data is accessed based on its logical block number
on the MEMS-based storage.
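The disk-like mapping can be sketched from the CMU CHIPS figures quoted above (a 512-byte logical block striped across 64 tips, at most 1280 tips active at once); the variable names here are ours:

```python
CONCURRENT_TIPS = 1280   # maximum simultaneously active probe tips
TIPS_PER_BLOCK = 64      # tips striping one logical block [8]
BLOCK_BYTES = 512        # logical block size in the disk-like mapping

concurrent_blocks = CONCURRENT_TIPS // TIPS_PER_BLOCK   # 20 blocks in parallel
bytes_per_parallel_access = concurrent_blocks * BLOCK_BYTES

assert concurrent_blocks == 20
```

So one parallel access under this mapping moves 20 × 512 bytes, regardless of where those blocks sit in the two-dimensional dataset, which is precisely the property the next section's placement scheme exploits.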
After this mapping, a MEMS-based storage device can be treated as a regular disk.
Thus, all prior approaches for two-dimensional data allocation proposed for disk devices can be adapted to MEMS-based storage, such as the Generalized Fibonacci (GFIB)
scheme [11], Golden Ratio Sequences (GRS) [4], Almost optimal access [3], and other
approaches based on replication [6]. All of these approaches are based on the prior optimal cost (1) and do not factor in the sequential access property of current disks. Hence, when adapting them to MEMS-based storage devices, this problem still remains unsolved. For example, a two-dimensional dataset with 4 × 4 tiles is shown in Figure 5
(a). Two MEMS-based storage devices are used to store the data (M = 2). The MEMS
devices in this example have four probe tips each, and only two of them can be activated
concurrently. Based on the prior optimal cost (1), the tiles should be distributed evenly
and alternately to the two devices, as shown in Figure 5 (a). The number in a tile
means that the tile is allocated to MEMS device and stands for the relative position
Fig. 5. An example of placing tiles on MEMS-based storage
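The even, alternating assignment of Figure 5 (a) is the classic modular declustering pattern; a minimal sketch (the device-numbering convention is ours):

```python
def decluster(rows, cols, num_devices):
    """Assign tile (r, c) of a rows x cols grid to device (r + c) mod M,
    so that horizontally and vertically adjacent tiles land on different
    devices, as in the Figure 5 (a) layout."""
    return {(r, c): (r + c) % num_devices
            for r in range(rows) for c in range(cols)}

placement = decluster(4, 4, 2)
# No two side-by-side tiles share a device:
assert all(placement[(r, c)] != placement[(r, c + 1)]
           for r in range(4) for c in range(3))
```

This balances load for small queries, but, as argued above, it says nothing about whether the tiles that share a device are stored sequentially, which is the gap the proposed MEMS-aware scheme addresses.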