Fig. 7. Impact of cube dimensionality increase on the CUBE File size
We used synthetic data sets produced with an OLAP data generator that we have developed. Our aim was to create data sets with a realistic number of dimensions and hierarchy levels. In Table 1, we present the hierarchy configuration for each dimension used in the experimental data sets. The shortest hierarchy consists of 2 levels, while the longest consists of 10 levels. We tried to make each data set consist of a good mixture of hierarchy lengths. Table 2 shows the data set configuration for each series of experiments. In order to evaluate the adaptation to sparse data spaces, we created cubes that were very sparse; therefore, the number of input tuples was kept between a small and a moderate level. To simulate the cube data distribution, for each cube we created ten hyper-rectangular regions as data point containers. These regions are defined randomly at the most detailed level of the cube and not by combinations of hierarchy values (although this would be more realistic), so as not to particularly favor the CUBE File due to its hierarchical chunking. We then filled each region with uniformly spread data points, trying to maintain the same number of data points in each region.
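As an illustration of this generation scheme, the following is a minimal sketch; the dimension cardinalities, region count, and points per region are illustrative placeholders, not the actual experimental parameters.

```python
import random

def generate_cube_points(cardinalities, num_regions=10, points_per_region=10_000):
    """Generate sparse cube data points clustered in random hyper-rectangular regions
    defined at the grain (most detailed) level of each dimension."""
    points = []
    for _ in range(num_regions):
        # A random hyper-rectangle: one (low, high) interval per dimension.
        region = [tuple(sorted(random.sample(range(card), 2))) for card in cardinalities]
        # Spread roughly the same number of uniformly distributed points in each region.
        for _ in range(points_per_region):
            coords = tuple(random.randint(lo, hi) for lo, hi in region)
            measure = round(random.uniform(0.0, 1000.0), 2)
            points.append(coords + (measure,))
    return points

# Example: a 5-dimensional cube with illustrative grain-level cardinalities.
tuples = generate_cube_points([1000, 500, 365, 50, 20], points_per_region=100)
print(len(tuples), tuples[0])
```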
Fig. 8. Size ratio between the UB-tree and the CUBE File for increasing dimensionality
Fig. 9. Size scalability in the number of input tuples (i.e., stored data points)
4.2 Structure Experiments
Fig. 7 shows the size of the CUBE File as the dimensionality of the cube increases. The vertical axis is on a logarithmic scale. We see the cube data space size (i.e., the product of the dimension grain-level cardinalities) "exploding" exponentially as the number of dimensions increases. The CUBE File size remains many orders of magnitude smaller than the data space. Moreover, the CUBE File size is also smaller than the ASCII file containing the input tuples to be loaded into SISYPHUS. This clearly shows that the CUBE File:

1. Adapts to the large sparseness of the cube, allocating space comparable to the actual number of data points.
2. Achieves a compression of the input data, since it does not store the data point coordinates (i.e., the h-surrogate keys of the dimension values) in each cell but only the measure values.
Furthermore, we wish to point out that the current CUBE File implementation [6] does not apply any compression to the intermediate nodes (i.e., the directory chunks). Only the data chunks are compressed, by means of a bitmap representing the
cell offsets, which, however, is itself stored uncompressed. This was a deliberate choice, in order to evaluate the compression achieved merely by the "pruning ability" of our chunk-to-bucket allocation scheme, according to which no space is allocated for empty chunk-trees (i.e., empty data space regions). Therefore, the following could improve the compression ratio even further: (a) compression of the directory chunks and (b) compression of the offset-bitmaps (e.g., with run-length encoding).
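As a sketch of the two compression opportunities mentioned above, the following shows an offset bitmap for a data chunk and a simple run-length encoding of that bitmap. This is only an illustration, under the assumption that a data chunk is a flat array of cells addressed by offset; the encoding shown is one simple choice rather than the scheme used in SISYPHUS.

```python
def offset_bitmap(cells):
    """One bit per cell offset: 1 if the cell holds a data point, 0 if it is empty."""
    return [0 if cell is None else 1 for cell in cells]

def run_length_encode(bits):
    """Compress the offset bitmap as (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(run) for run in runs]

# A very sparse data chunk: only the measure values and the bitmap need to be stored.
chunk = [None, None, 42.0, None, None, None, 17.5, None]
bitmap = offset_bitmap(chunk)
print(bitmap)                     # [0, 0, 1, 0, 0, 0, 1, 0]
print(run_length_encode(bitmap))  # [(0, 2), (1, 1), (0, 3), (1, 1), (0, 1)]
```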
Fig. 8 shows the ratio of the UB-tree size to the CUBE File size for increasing dimensionality. We see that the UB-tree imposes a greater storage overhead than the CUBE File in almost all cases. Indeed, the CUBE File remains 2-3 times smaller in size than the UB-tree/MHC. For eight dimensions both structures have approximately the same size, but for nine dimensions the CUBE File is four times larger than the UB-tree. This is primarily due to the increase in the size of the intermediate nodes of the CUBE File, since for 9 dimensions and 100,000 data points the data space has become extremely sparse. As we noted above, our implementation does not apply any compression to the directory chunks. Therefore, it is reasonable that for such extremely sparse data spaces the overhead from these chunks becomes significant, since a single data point might trigger the allocation of all the cells in the parent nodes. An implementation that also compressed the directory chunks would substantially reduce this effect.
Fig. 9 depicts the size of the CUBE File as the number of cube data points (i.e., input tuples) scales up, while the cube dimensionality remains constant (five dimensions with a good mixture of hierarchy lengths – see Table 1). In the same graph we show the corresponding size of the UB-tree/MHC and the size of the root-bucket. The CUBE File maintains a lower storage cost for all tuple cardinalities. Moreover, the UB-tree size increases at a faster rate, widening the difference between the two as the number of tuples increases. The root-bucket size is substantially lower than that of the CUBE File and remains almost constant. Note that in our implementation we store the whole root-directory in the root-bucket, and thus the whole root-directory is kept in main memory during query evaluation. The graph therefore also shows that the root-directory size very quickly becomes negligible compared to the CUBE File size as the number of data points increases. Indeed, for cubes containing more than 1 million tuples the root-directory size is below 5% of the CUBE File size, even though the directory chunks are stored uncompressed in our current implementation. Hence it is feasible to keep the whole root-directory in main memory.
4.3 Query Experiments
For the query experiments we ran a total of 5,234 HPP queries on both the CUBE File and the UB-tree/MHC. These queries were classified into three classes: (a) 1,593 prefix queries, (b) 1,806 prefix range queries, and (c) 1,835 prefix multi-range queries. A prefix query is one in which we access the data points by a specific chunk-id prefix; such a query is represented by a chunk expression denoting the restriction imposed on each hierarchy of the cube (e.g., of a 3-dimensional cube with 4 chunking depth levels).
A chunk expression represents a chunk-id access pattern, denoting the cells that we need to access in each chunk. A wildcard entry means "any", i.e., no restriction is imposed on the corresponding dimension level. The greatest depth containing at least one restriction is called the maximum depth of restrictions; for instance, if the only restrictions appear at the first chunking depth, the maximum depth of restrictions equals 1. The greater the maximum depth of restrictions, the fewer data points are returned (smaller cube selectivity), and vice versa. A prefix range query is a prefix query that includes at least one range selection on a hierarchy level, thus resulting in a larger selection hyper-rectangle at the grain level of the cube. Finally, a prefix multi-range query is a prefix range query that includes at least one multiple-range restriction on a hierarchy level, of the form {[a-b],[c-d],...}. This results in multiple disjoint selection hyper-rectangles at the grain level.
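The following is a minimal sketch of how such chunk-expression access patterns and the maximum depth of restrictions can be represented; the restriction values are hypothetical and the representation (one entry per chunking depth, one restriction per dimension) is an illustration rather than the actual SISYPHUS encoding.

```python
ANY = "*"  # "any": no restriction imposed on this dimension level

def max_depth_of_restrictions(chunk_expr):
    """Greatest chunking depth (1-based) whose entry contains at least one restriction."""
    max_depth = 0
    for depth, level in enumerate(chunk_expr, start=1):
        if any(r != ANY for r in level):
            max_depth = depth
    return max_depth

# A 3-dimensional cube with 4 chunking depth levels; values are illustrative.
prefix_query = [
    [(2, 2), ANY, (0, 0)],   # depth 1: point restrictions on dimensions 1 and 3
    [ANY, ANY, ANY],
    [ANY, ANY, ANY],
    [ANY, ANY, ANY],
]
prefix_range_query = [
    [(2, 2), ANY, (0, 0)],
    [(1, 4), ANY, ANY],      # depth 2: a range restriction on dimension 1
    [ANY, ANY, ANY],
    [ANY, ANY, ANY],
]
print(max_depth_of_restrictions(prefix_query))        # 1
print(max_depth_of_restrictions(prefix_range_query))  # 2
```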
As mentioned earlier, our goal was to evaluate the hierarchical clustering achieved, by means of the I/Os performed in evaluating these queries. To this end, we ran two series of experiments: the hot-cache experiments and the cold-cache ones. In the hot-cache experiments we assumed that the root-bucket (containing the whole root-directory) is cached in main memory and counted only the remaining bucket I/Os. For the UB-tree in the hot-cache case, we counted only the page I/Os at the leaf level, omitting the intermediate node accesses altogether. In contrast, in the cold-cache experiments, for each query on the CUBE File we also counted the size of the whole root-bucket, while for the UB-tree we counted both intermediate and leaf-level page accesses. The root-bucket size equals 295 buckets, according to the following sizes of the two structures for the data set used:
UB-tree total number of pages: 15,752
CUBE File total number of buckets: 4,575
Root-bucket number of buckets: 295
Fig. 11 shows the I/O ratio between the UB-tree and the CUBE File for all three classes of queries in the hot-cache case. This ratio is calculated from the total number of I/Os over all queries with the same maximum depth of restrictions for each data structure. As the maximum depth of restrictions increases, the cube selectivity essentially decreases (i.e., fewer data points are returned in the result set). We see that the UB-tree performs more I/Os for all depths and for all query classes. For small-depth restrictions, where the selection rectangles are very large, the CUBE File performs 3 times fewer I/Os than the UB-tree. Moreover, for more restrictive queries the CUBE File is several times better, achieving up to 37 times fewer I/Os. An explanation for this is that the smaller the selection hyper-rectangle, the greater the percentage of accessed UB-tree leaf pages that contain very few (or even no) qualifying data points. Thus more I/Os are required on the whole in order to evaluate the restriction, and for large-depth restrictions the UB-tree performs even worse, because it essentially fails to cluster the data with respect to the more detailed hierarchy levels. This behaviour was also observed in [7], where for queries with small cube selectivities the UB-tree performance was worse and the hierarchical clustering effect was
reduced. We believe this is due to the way data are clustered into z-regions (i.e., disk pages) along the z-curve [1]. In contrast, the hierarchical chunking applied in the CUBE File creates groups of data (i.e., chunks) that belong to the same "hierarchical family", even for the most detailed levels. This, in combination with the chunk-to-bucket allocation, which guarantees that hierarchical families will be clustered together, results in better hierarchical clustering of the cube even for the most detailed levels of the hierarchies.
Fig. 10. Size ratio between the UB-tree and the CUBE File for increasing tuple cardinality
Fig. 11. I/O ratios for the hot-cache experiments
Note that for two subsets of queries the returned result set was empty (prefix multi-range queries for two of the maximum depths of restrictions). The UB-tree had to descend to the leaf level and access the corresponding pages, performing I/Os essentially for nothing. In contrast, the CUBE File performed no I/Os, since directly from a root-directory node it could identify an empty subtree and thus terminate the search immediately. Since the denominator was zero, we depict the corresponding ratios for these two cases in Fig. 11 with a zero value.
Fig. 12 shows the I/O ratios for the cold-cache experiments. In this figure we can observe the impact of having to read the whole root-directory into memory for each query on the CUBE File. For queries with small-depth restrictions (large result set), the difference in the I/Os performed by the two structures remains essentially the same as in the hot-cache case. However, for larger-depth restrictions (smaller result set), the overhead imposed by the root-directory reduces the difference between the two, as expected.
Nevertheless, the CUBE File is still several times better in all cases, clearly demonstrating better hierarchical clustering. Furthermore, note that even if no cache area is available, in reality there will never be a case where the whole root-directory is accessed to answer a single query; naturally, only the relevant buckets of the root-directory are accessed for each query.
Fig. 12. I/O ratios for the cold-cache experiments
5 Summary and Conclusions
In this paper we presented the CUBE File, a novel file structure for organizing the most detailed data of an OLAP cube. This structure is primarily aimed at speeding up ad hoc OLAP queries containing restrictions on the hierarchies, which constitute the most typical OLAP workload.

The key features of the CUBE File are the following. It is a natively multidimensional data structure. It explicitly supports dimension hierarchies, enabling fast access to cube data via a directory of chunks formed directly from the hierarchies. It clusters data with respect to the dimension hierarchies, resulting in reduced I/O cost for query evaluation. It imposes a low storage overhead, basically for two reasons: (a) it adapts perfectly to the extensive sparseness of the cube, not allocating space for empty regions, and (b) it does not need to store the dimension values along with the measures of the cube, due to its location-based cell access mechanism. Together, these result in a significant compression of the data space; moreover, this compression can be increased even further if compression of the intermediate nodes is employed. Finally, it achieves a high space utilization, filling the buckets to capacity. We have verified the aforementioned performance aspects of the CUBE File by running an extensive set of experiments, and we have also shown that the CUBE File outperforms the UB-tree/MHC, the most effective method proposed up to now for hierarchically clustering the cube, in terms of both storage cost and number of disk I/Os. Furthermore, the CUBE File fits perfectly into the processing framework for ad hoc OLAP queries over hierarchically clustered fact tables (i.e., cubes) proposed in our previous work [7]. In addition, it directly supports the effective hierarchical pre-grouping transformation [13, 19], since it uses hierarchically encoded surrogate keys. Finally, it can be used as a physical base for implementing a chunk-based caching scheme [3].
Acknowledgements. We wish to thank Transaction Software GmbH for providing us with Transbase Hypercube to run the UB-tree/MHC experiments. This work has been partially funded by the European Union's Information Society Technologies Programme (IST) under project EDITH (IST-1999-20722).
References
1. R. Bayer: The Universal B-Tree for Multi-dimensional Indexing: General Concepts. WWCA 1997.
2. C. Y. Chan, Y. E. Ioannidis: Bitmap Index Design and Evaluation. SIGMOD 1998.
3. P. Deshpande, K. Ramasamy, A. Shukla, J. F. Naughton: Caching Multidimensional Queries Using Chunks. SIGMOD 1998.
4. J. Gray, A. Bosworth, A. Layman, H. Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996.
5. N. Karayannidis: Storage Structures, Query Processing and Implementation of On-Line Analytical Processing Systems. Ph.D. Thesis, National Technical University of Athens, 2003. Available at: kos/thesis/PhD_thesis_en.pdf.
6. N. Karayannidis, T. Sellis: SISYPHUS: The Implementation of a Chunk-Based Storage Manager for OLAP Data Cubes. Data and Knowledge Engineering 45(2): 155-188, May 2003.
7. N. Karayannidis et al.: Processing Star-Queries on Hierarchically-Clustered Fact-Tables. VLDB 2002.
8. L. V. S. Lakshmanan, J. Pei, J. Han: Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.
9. V. Markl, F. Ramsak, R. Bayer: Improving OLAP Performance by Multidimensional Hierarchical Clustering. IDEAS 1999.
10. P. E. O'Neil, G. Graefe: Multi-Table Joins Through Bitmapped Join Indices. SIGMOD Record 24(3): 8-11, 1995.
11. J. Nievergelt, H. Hinterberger, K. C. Sevcik: The Grid File: An Adaptable, Symmetric Multikey File Structure. TODS 9(1): 38-71, 1984.
12. P. E. O'Neil, D. Quass: Improved Query Performance with Variant Indexes. SIGMOD 1997.
13. R. Pieringer et al.: Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Join Processing. ICDE 2003.
14. F. Ramsak et al.: Integrating the UB-Tree into a Database System Kernel. VLDB 2000.
15. S. Sarawagi: Indexing OLAP Data. Data Engineering Bulletin 20(1): 36-43, 1997.
16. Y. Sismanis, A. Deligiannakis, N. Roussopoulos, Y. Kotidis: Dwarf: Shrinking the PetaCube. SIGMOD 2002.
17. S. Sarawagi, M. Stonebraker: Efficient Organization of Large Multidimensional Arrays. ICDE 1994.
18. The Transbase Hypercube® relational database system.
19. A. Tsois, T. Sellis: The Generalized Pre-Grouping Transformation: Aggregate-Query Optimization in the Presence of Dependencies. VLDB 2003.
20. R. Weber, H.-J. Schek, S. Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205.
Efficient Schema-Based Revalidation of XML

Mukund Raghavachari¹ and Oded Shmueli²

¹ IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
² Technion – Israel Institute of Technology, Haifa, Israel
Abstract. As XML schemas evolve over time or as applications are integrated, it is sometimes necessary to validate an XML document known to conform to one schema with respect to another schema. More generally, XML documents known to conform to a schema may be modified, and then, require validation with respect to another schema. Recently, solutions have been proposed for incremental validation of XML documents. These solutions assume that the initial schema to which a document conforms and the final schema with which it must be validated after modifications are the same. Moreover, they assume that the input document may be preprocessed, which in certain situations may be computationally and memory intensive. In this paper, we describe how knowledge of conformance to an XML Schema (or DTD) may be used to determine conformance to another XML Schema (or DTD) efficiently. We examine both the situation where an XML document is modified before it is to be revalidated and the situation where it is unmodified.
1 Introduction
The ability to validate XML documents with respect to an XML Schema [21]
or DTD is central to XML’s emergence as a key technology for application
integration. As XML data flow between applications, the conformance of the
data to either a DTD or an XML schema provides applications with a guarantee
that a common vocabulary is used and that structural and integrity constraints
are met. In manipulating XML data, it is sometimes necessary to validate data
with respect to more than one schema. For example, as a schema evolves over
time, XML data known to conform to older versions of the schema may need to
be verified with respect to the new schema. An intra-company schema used by
a business might differ slightly from a standard, external schema and XML data
valid with respect to one may need to be checked for conformance to the other.
The validation of an XML document that conforms to one schema with re-
spect to another schema is analogous to the cast operator in programming lan-
guages. It is useful, at times, to access data of one type as if it were associated
with a different type. For example, XQuery [20] supports a validate operator
which converts a value of one type into an instance of another type. The type
safety of this conversion cannot always be guaranteed statically. At runtime,
XML fragments known to correspond to one type must be verified with respect
to another. As another example, in XJ [9], a language that integrates XML into
Java, XML variables of a type may be updated and then cast to another type.
A compiler for such a language does not have access to the documents that
are to be revalidated. Techniques for revalidation that rely on preprocessing the
document [3,17] are not appropriate.
The question we ask is how can one use knowledge of conformance of a doc-
ument to one schema to determine whether the document is valid according to
another schema? We refer to this problem as the schema cast validation prob-
lem. An obvious solution is to revalidate the document with respect to the new
schema, but in doing so, one is disregarding useful information. The knowledge
of a document’s conformance to a schema can help determine its conformance to
another schema more efficiently than full validation. The more general situation,
which we refer to as schema cast with modifications validation, is where a docu-
ment conforming to a schema is modified slightly, and then, verified with respect
to a new schema. When the new schema is the same as the one to which the
document conformed originally, schema cast with modifications validation ad-
dresses the same problem as the incremental validation problem for XML [3,17].
Our solution to this problem has different characteristics, as will be described.
The scenario we consider is that a source schema A and a target schema B
are provided and may be preprocessed statically. At runtime, documents valid
according to schema A are verified with respect to schema B. In the modification
case, inserts, updates, and deletes are performed to a document before it is
verified with respect to B. Our approach takes advantage of similarities (and
differences) between the schemas A and B to avoid validating portions of a
document if possible. Consider the two XML Schema element declarations for
purchaseOrder shown in Figure 1. The only difference between the two is that
whereas the
billTo element is optional in the schema of Figure 1a, it is required
in the schema of Figure 1b. Not all XML documents valid with respect to the first
schema are valid with respect to the second — only those with a billTo element
would be valid. Given a document valid according to the schema of Figure 1a,
an ideal validator would only check the presence of a billTo element and ignore
the validation of the other components (they are guaranteed to be valid).
This paper focuses on the validation of XML documents with respect to
the structural constraints of XML schemas. We present algorithms for schema
cast validation with and without modifications that avoid traversing subtrees of
an XML document where possible. We also provide an optimal algorithm for
revalidating strings known to conform to a deterministic finite state automaton
according to another deterministic finite state automaton; this algorithm is used
to revalidate content model of elements. The fact that the content models of
XML Schema types are deterministic [6] can be used to show that our algorithm
for XML Schema cast validation is optimal as well. We describe our algorithms in
terms of an abstraction of XML Schemas, abstract XML schemas, which model
the structural constraints of XML schema. In our experiments, our algorithms
achieve 30-95% performance improvement over Xerces 2.4.
Fig. 1. Schema fragments defining a purchaseOrder element in (a) Source Schema (b) Target Schema.
The contributions of this paper are the following:

1. An abstraction of XML Schema, abstract XML Schema, which captures the structural constraints of XML Schema more precisely than specialized DTDs [16] and regular type expressions [11].
2. Efficient algorithms for schema cast validation (with and without updates) of XML documents with respect to XML Schemas. We describe optimizations for the case where the schemas are DTDs. Unlike previous algorithms, our algorithms do not preprocess the documents that are to be revalidated.
3. Efficient algorithms for revalidation of strings, with and without modifications, according to deterministic finite state automata. These algorithms are essential for efficient revalidation of the content models of elements.
4. Experiments validating the utility of our solutions.
Structure of the Paper: We examine related work in Section 2. In Section 3,
we introduce abstract XML Schemas and provide an algorithm for XML schema
revalidation. The algorithm relies on an efficient solution to the problem of string
revalidation according to finite state automata, which is provided in Section 4.
We discuss the optimality of our algorithms in Section 5. We report on experi-
ments in Section 6, and conclude in Section 7.
2 Related Work
Papakonstantinou and Vianu [17] treat incremental validation of XML docu-
ments (typed according to specialized DTDs). Their algorithm keeps data struc-
tures that encode validation computations with document tree nodes and utilizes
these structures to revalidate a document. Barbosa et al. [3] present an algorithm
that also encodes validation computations within tree nodes. They take advan-
tage of the 1-unambiguity of content models of DTDs and XML Schemas [6],
and structural properties of a restricted set of DTDs, to revalidate documents.
Our algorithm is designed for the case where schemas can be preprocessed, but
the documents to be revalidated are not available a priori to be preprocessed.
Examples include message brokers, programming language and query compilers, etc. In these situations, techniques that preprocess the document and store state information at each node could incur unacceptable memory and computing overhead, especially if the number of updates is small with respect to the document, or the size of the document is large. Moreover, our algorithm handles the case where the document must be revalidated with respect to a different schema.

Kane et al. [12] use a technique based on query modification for handling the incremental update problem. Bouchou and Halfeld-Ferrari [5] present an algorithm that validates each update using a technique based on tree automata. Again, both algorithms consider only the case where the schema to which the document must conform after modification is the same as the original schema.

The subsumption of XML schema types used in our algorithm for schema cast validation is similar to Kuper and Siméon's notion of type subsumption [13]. Their type system is more general than our abstract XML schema. They assume that a subsumption mapping is provided between types such that if one schema is subsumed by another, and if a value conforming to the subsumed schema is annotated with types, then by applying the subsumption mapping to these type annotations, one obtains an annotation for the subsuming schema. Our solution is more general in that we do not require either schema to be subsumed by the other, but do handle the case where this occurs. Furthermore, we do not require type annotations on nodes. Finally, we consider the notion of disjoint types in addition to subsumption in the revalidation of documents.

One approach to handling XML and XML Schema has been to express them in terms of formal models such as tree automata. For example, Lee et al. describe how XML Schema may be represented in terms of deterministic tree grammars with one lookahead [15]. The formalism for XML Schema and the algorithms in this paper are a more direct solution to the problem, which obviates some of the practical problems of the tree automata approach, such as having to encode unranked XML trees as ranked trees.

Programming languages with XML types [1,4,10,11] define notions of types and subtyping that are enforced statically. XDuce [10] uses tree automata as the base model for representing XML values. One difference between our work and XDuce is that we are interested in dynamic typing (revalidation) where static analysis is used to reduce the amount of needed work. Moreover, unlike XDuce's regular expression types and specialized DTDs [17], our model for XML values captures exactly the structural constraints of XML Schema (and is not equivalent to regular tree automata). As a result, our subtyping algorithm is polynomial rather than exponential in complexity.
3 XML Schema and DTD Conformance
In this section, we present the algorithm for revalidation of documents according to XML Schemas. We first define our abstractions for XML documents, ordered labeled trees, and for XML Schema, abstract XML Schema. Abstract XML Schema captures precisely the structural constraints of XML Schema.
Ordered Labeled Trees. We abstract XML documents as ordered labeled trees. An ordered labeled tree over a finite alphabet is a pair consisting of an ordered tree, with a finite set of nodes N and a set of edges E, together with a labeling function that associates a label with each node of N. A special label, which can only be associated with leaves of the tree, represents XML Schema simple values. We use root(T) to denote the root of a tree T, and we abuse notation slightly by letting the label of T denote the label of the root node of the ordered labeled tree T. We write a(T₁, ..., Tₖ) for an ordered tree whose root is labeled a and whose subtrees are T₁, ..., Tₖ; a() denotes an ordered tree whose root has no children. Finally, we denote by 𝒯 the set of all ordered labeled trees.
Abstract XML Schema. XML Schemas, unlike DTDs, permit the decoupling
of an element tag from its type; an element may have different types depending on
context. XML Schemas are not as powerful as regular tree automata. The XML
Schema specification places restrictions on the decoupling of element tags and
types. Specifically, in validating a document according to an XML Schema, each
element of the document can be assigned a single type, based on the element’s
label and the type of the element’s parent (without considering the content of
the element). Furthermore, this type assignment is guaranteed to be unique.
We define an abstraction of XML Schema, an abstract XML Schema, as a 4-tuple consisting of:

– the alphabet of element labels (tags);
– the set of types defined in the schema;
– a set of type declarations, one for each type, where a declaration is either a simple type (of the form type : simple) or a complex type consisting of a regular expression over the element labels, whose language defines the allowed content, together with a function that assigns a type to each element label used in the type declaration. This function abstracts the notion of XML Schema that each child of an element can be assigned a type based on its label, without considering the child's content; it also models the XML Schema constraint that if two children of an element have the same label, they must be assigned the same type;
– a partial function which states which element labels can occur as the root element of a valid tree according to the schema, and the type this root element is assigned.

Consider the XML Schema fragment of Figure 1a. The root function maps global element declarations to their appropriate types; that is, it maps the purchaseOrder element to the type POType1. Table 1 shows the type declaration for POType1 in our formalism.
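A minimal sketch of one possible representation of an abstract XML Schema is given below; the class and field names are illustrative, and the content model is written, purely for convenience, as a Python regular expression over the space-separated string of child labels rather than as a formal regular expression over the label alphabet.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleType:
    name: str

@dataclass
class ComplexType:
    name: str
    content_model: str                               # regex over the child-label string
    child_type: dict = field(default_factory=dict)   # element label -> type name

@dataclass
class AbstractSchema:
    labels: set    # alphabet of element tags
    types: dict    # type name -> SimpleType | ComplexType
    roots: dict    # root element label -> type name

# Illustrative fragment in the spirit of Figure 1a (billTo optional); the Figure 1b
# counterpart would simply use the content model r"shipTo billTo items".
po_schema = AbstractSchema(
    labels={"purchaseOrder", "shipTo", "billTo", "items"},
    types={
        "POType1": ComplexType("POType1", r"shipTo (billTo )?items",
                               {"shipTo": "Address", "billTo": "Address",
                                "items": "Items"}),
        "Address": SimpleType("Address"),
        "Items": SimpleType("Items"),
    },
    roots={"purchaseOrder": "POType1"},
)
```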
Abstract XML Schemas do not explicitly represent atomic types, such as
xsd:integer.
For simplicity of exposition, we have assumed that all XML Schema
atomic and simple types are represented by a single simple type. Handling atomic
and simple types, restrictions on these types and relationships between the values
denoted by these types is a straightforward extension. We do not address the
identity constraints (such as key and keyref constraints) of XML Schema in
this paper. This is an area of future work. Other features of XML Schema such
as substitution groups, subtyping, and namespaces can be integrated into our
model. A discussion of these issues is beyond the scope of the paper.
We define the validity of an ordered labeled tree with respect to an abstract XML Schema as follows:

Definition 1. The set of ordered labeled trees that are valid with respect to a type is defined in terms of the least solution to a set of equations, one for each type declaration: (1) a tree is valid with respect to a simple type if it is a leaf carrying a simple value; (2) a tree is valid with respect to a complex type if the string formed by its children's labels belongs to the language of the type's regular expression and each child is valid with respect to the type assigned to its label.

An ordered labeled tree T is valid with respect to a schema if the root function is defined on the root's label and T is valid with respect to the assigned type. If a type is complex and the language of its regular expression contains the empty string, then its set of valid trees contains all trees of height 0 whose root node carries an appropriate label; that is, the element may have an empty content model.

We are interested only in productive types, that is, types whose set of valid trees is non-empty. We assume that for a schema all types are productive. Whether a type is productive can be verified easily as follows:
1. Mark all simple types as productive, since by the definition of validity they contain trees of height 1 with appropriate labels.
2. For each complex type, compute the set of element labels in its content model whose assigned type is productive.
3. Mark a complex type as productive if its content-model language contains the empty string or contains a string that uses only labels from the set computed in Step 2.
4. Repeat Steps 2 and 3 until no more types can be marked as productive.

This procedure identifies all productive types defined in a schema. There is a straightforward algorithm for converting a schema with non-productive types into one that contains only productive types. The basic idea is to
modify the regular expression of each productive type so that the language of the new regular expression contains only strings over labels whose assigned types are productive.
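A minimal sketch of the marking procedure is shown below; it assumes a simplified content-model representation in which each complex type is given as a list of alternative child-label assignments (a stand-in for the regular-expression content models), so the "string over productive labels" test reduces to checking one alternative.

```python
def productive_types(types):
    """Fixpoint marking of productive types.

    `types` maps a type name either to the string "simple" or, for a complex type,
    to a list of alternatives, each alternative being a dict {label: child type name}.
    An empty alternative ({}) models a nullable content model.
    """
    productive = {name for name, decl in types.items() if decl == "simple"}
    changed = True
    while changed:
        changed = False
        for name, decl in types.items():
            if name in productive or decl == "simple":
                continue
            # Productive if some alternative uses only labels whose child types are productive.
            if any(all(t in productive for t in alt.values()) for alt in decl):
                productive.add(name)
                changed = True
    return productive

# Illustrative example: C is productive only because B is; D never becomes productive.
types = {
    "A": "simple",
    "B": [{"a": "A"}],
    "C": [{"b": "B", "a": "A"}, {"d": "D"}],
    "D": [{"d": "D"}],
}
print(productive_types(types))  # {'A', 'B', 'C'}
```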
Pseudocode for validating an ordered labeled tree with respect to an abstract XML Schema is provided below. constructstring is a utility method (not shown) that creates a string from the labels of the root nodes of a sequence of trees (it returns the empty string if the sequence is empty). Note that if a node has no children, the body of the foreach loop will not be executed.
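The original pseudocode is not reproduced in this copy; the following is a minimal sketch of the validation procedure it describes, reusing the SimpleType/ComplexType/AbstractSchema classes sketched earlier, with content-model membership delegated to a simple stand-in check over the space-separated child-label string.

```python
import re

def matches_content_model(regex, child_label_string):
    """Stand-in content-model check against the space-separated string of child labels."""
    return re.fullmatch(regex, child_label_string) is not None

def validate(node, type_name, schema):
    """Validate the subtree rooted at `node` against `type_name`.

    A node is a tuple (label, children, value); `schema.types[type_name]` is a
    SimpleType or ComplexType as sketched above.
    """
    label, children, _value = node
    decl = schema.types[type_name]
    if isinstance(decl, SimpleType):
        return not children                          # a simple-typed element is a leaf
    child_labels = " ".join(c[0] for c in children)  # the role of constructstring
    if not matches_content_model(decl.content_model, child_labels):
        return False
    # Each child's type is determined by its label alone.
    return all(validate(c, decl.child_type[c[0]], schema) for c in children)

def validate_document(tree, schema):
    root_label = tree[0]
    return root_label in schema.roots and validate(tree, schema.roots[root_label], schema)

doc = ("purchaseOrder",
       [("shipTo", [], "addr"), ("billTo", [], "addr"), ("items", [], "...")],
       None)
print(validate_document(doc, po_schema))  # True (using the po_schema sketched earlier)
```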
A DTD can be viewed as an abstract XML Schema in which each element label is assigned a unique type, irrespective of the context in which it is used. In other words, for every element label there is a single type such that every type declaration that uses the label assigns it that type; if the label may appear at the root, it is assigned the same type there as well.
3.1 Algorithm Overview
Given two abstract XML Schemas, a source schema S and a target schema, and an ordered labeled tree T that is valid according to S, our algorithm validates T with respect to S and the target schema in parallel. Suppose that during the validation of T with respect to the target schema we wish to validate a subtree of T with respect to a target type. Let the source type be the type assigned to that subtree during the validation of T with respect to S. If one can assert that every ordered labeled tree that is valid according to the source type is also valid according to the target type, then one can immediately deduce the validity of the subtree according to the target type. Conversely, if no ordered labeled tree that is valid according to the source type is also valid according to the target type, then one can stop the validation immediately, since the subtree will not be valid according to the target type.

We use subsumed type and disjoint type relationships to avoid traversals of subtrees of T where possible:

Definition 2. A type is subsumed by a type if every ordered labeled tree that is valid with respect to the first is also valid with respect to the second. Note that the two types can belong to different schemas.

Definition 3. Two types are disjoint if no ordered labeled tree is valid with respect to both. Again, note that the two types can belong to different schemas.
In the following sections, we present algorithms for determining whether an
abstract XML Schema type is subsumed by another or is disjoint from another.
We present an algorithm for efficient schema cast validation of an ordered la-
beled tree, with and without updates. Finally, in the case where the abstract
XML Schemas represent DTDs, we describe optimizations that are possible if
additional indexing information is available on ordered labeled trees.
3.2 Schema Cast Validation
Our algorithm relies on two relations that capture precisely all subsumed type and disjoint type information with respect to the types defined in the source and target schemas. We first describe how these relations are computed, and then present our algorithm for schema cast validation.
Computing the Subsumption Relation

Definition 4. Given two schemas, a source and a target, the subsumption relation between their types is the largest relation such that, for every related pair of types, one of the following two conditions holds:

i. both are simple types;
ii. both are complex types, the content-model language of the first is contained in the content-model language of the second, and, for every element label appearing in the first content model, the type it is assigned is related to the type assigned to the same label in the second content model (wherever the latter is defined).

As mentioned before, for exposition reasons, we have chosen to merge all simple types into one common simple type. It is straightforward to extend the definition above so that the various XML Schema atomic and simple types, and their derivations, are used to bootstrap the definition of the subsumption relationship. Also, observe that this is a finite relation, since there are finitely many types.

The following theorem states that the relation captures precisely the notion of subsumption defined earlier:

Theorem 1. A pair of types belongs to the relation of Definition 4 if and only if the first type is subsumed by the second.
We now present an algorithm for computing the relation. The algorithm starts with a subset of the cross product of the two type sets and refines it successively until the subsumption relation is obtained.

1. Start with the set of all pairs of types such that both are simple types or both are complex types.
2. For every pair of complex types, if the content-model language of the first is not contained in that of the second, remove the pair from the relation.
3. For every pair of complex types, if there exists an element label in the first content model whose assigned type is not related (under the current relation) to the type assigned to that label in the second content model, remove the pair from the relation.
4. Repeat Step 3 until no more tuples can be removed from the relation.
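A minimal sketch of this refinement is shown below, again over the simplified alternative-based content models; the language-containment test of Step 2 is replaced by a naive check (every source alternative must be matched by some target alternative with pairwise-related child types), so this illustrates the fixpoint structure rather than the exact procedure.

```python
def compute_subsumption(src_types, tgt_types):
    """Greatest-fixpoint refinement of the subsumed-type relation between two schemas.

    Types are either "simple" or a list of alternatives, each a dict {label: child type},
    as in the earlier sketch. Returns a set of (source type, target type) name pairs.
    """
    def same_kind(d1, d2):
        return (d1 == "simple") == (d2 == "simple")

    rel = {(s, t) for s in src_types for t in tgt_types
           if same_kind(src_types[s], tgt_types[t])}

    def ok(s, t):
        ds, dt = src_types[s], tgt_types[t]
        if ds == "simple":
            return True
        # Every alternative of s must be allowed by t with pairwise-subsumed child types.
        return all(
            any(alt_s.keys() == alt_t.keys() and
                all((alt_s[l], alt_t[l]) in rel for l in alt_s)
                for alt_t in dt)
            for alt_s in ds)

    changed = True
    while changed:
        changed = False
        for pair in list(rel):
            if not ok(*pair):
                rel.discard(pair)
                changed = True
    return rel

src = {"A": "simple", "P": [{"a": "A"}]}
tgt = {"A2": "simple", "P2": [{"a": "A2"}, {"a": "A2", "b": "A2"}]}
print(compute_subsumption(src, tgt))  # both ('A', 'A2') and ('P', 'P2') survive
```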
Computing the disjointness relation. Rather than computing the disjointness relation directly, we compute its complement, the overlap relation. Formally:

Definition 5. Given two schemas, a source and a target, the overlap relation is defined as the smallest relation (least fixpoint) such that a pair of types belongs to it if:

i. both are simple types, or
ii. both are complex types and their content models admit a common string of element labels in which every label's pair of assigned types already belongs to the relation.

To compute the relation, the algorithm begins with an empty relation and adds tuples until the least fixpoint is obtained.

1. Start with the empty relation.
2. Add all pairs of types in which both types are simple.
3. For each pair of complex types not yet in the relation, if their content models admit a common string whose labels all have pairs of assigned types already in the relation, add the pair.
4. Repeat Step 3 until no more tuples can be added to the relation.

Theorem 2. Two types are disjoint if and only if their pair does not belong to the overlap relation.
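A corresponding sketch of the least-fixpoint computation of the complement (the "overlap" relation) follows, under the same simplified representation; disjointness is read off as absence from the result.

```python
def compute_overlap(src_types, tgt_types):
    """Least-fixpoint computation of the 'not disjoint' relation between two schemas;
    two types are disjoint exactly when their pair is absent from the result."""
    overlap = {(s, t) for s in src_types for t in tgt_types
               if src_types[s] == "simple" and tgt_types[t] == "simple"}
    changed = True
    while changed:
        changed = False
        for s, ds in src_types.items():
            for t, dt in tgt_types.items():
                if (s, t) in overlap or ds == "simple" or dt == "simple":
                    continue
                # Some common shape whose corresponding child types already overlap.
                if any(alt_s.keys() == alt_t.keys() and
                       all((alt_s[l], alt_t[l]) in overlap for l in alt_s)
                       for alt_s in ds for alt_t in dt):
                    overlap.add((s, t))
                    changed = True
    return overlap
```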
Algorithm for Schema Cast Validation. Given the subsumption and overlap relations, if at any time a subtree of the document that is valid with respect to a type from S is being validated with respect to a type from the target schema and the first type is subsumed by the second, then the subtree need not be examined (since, by definition, the subtree is valid with respect to the target type). On the other hand, if the two types are disjoint, the document can be determined to be invalid with respect to the target schema immediately. Pseudocode for incremental validation of the document is provided below. Again, constructstring is a utility method (not shown) that creates a string from the labels of the root nodes of a sequence of trees (returning the empty string if the sequence is empty). We can efficiently verify the content model of a node with respect to the target type by using techniques for finite automata schema cast validation, as will be described in Section 4.
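Since the pseudocode itself is not reproduced in this copy, the following is a minimal sketch of the parallel validation it describes, reusing the earlier class and helper sketches; the two relations are assumed to be given as sets of (source type, target type) name pairs.

```python
def revalidate(node, src_type, tgt_type, schema_src, schema_tgt, subsumed, overlap):
    """Validate `node` (known valid w.r.t. src_type in the source schema) against
    tgt_type in the target schema, skipping subtrees whenever the relations allow."""
    if (src_type, tgt_type) in subsumed:
        return True        # every tree valid for src_type is also valid for tgt_type
    if (src_type, tgt_type) not in overlap:
        return False       # the two types are disjoint
    label, children, _value = node
    decl_src = schema_src.types[src_type]
    decl_tgt = schema_tgt.types[tgt_type]
    if isinstance(decl_tgt, SimpleType):
        return not children
    child_labels = " ".join(c[0] for c in children)
    if not matches_content_model(decl_tgt.content_model, child_labels):
        return False
    return all(
        revalidate(c, decl_src.child_type[c[0]], decl_tgt.child_type[c[0]],
                   schema_src, schema_tgt, subsumed, overlap)
        for c in children)
```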
3.3 Schema Cast Validation with Modifications
Given an ordered labeled tree T that is valid with respect to an abstract XML Schema S, and a sequence of insertions and deletions of nodes and modifications of element tags, we discuss how the tree may be validated efficiently with respect to a new abstract XML Schema. The updates permitted are the following:
1. Modify the label of a specified node with a new label.
2. Insert a new leaf node before, after, or as the first child of a node.
3. Delete a specified leaf node.
Given a sequence of updates, we perform the updates on T and, at each step, we encode the modifications on T to obtain an annotated tree by extending the label alphabet with special element tags. A node whose label records a pair of tags represents the modification of the first element tag in T into the second; similarly, a special label marks a newly inserted node with its tag, and another special label denotes a node deleted from T. Nodes that have not been modified have their labels unchanged. By discarding all deleted nodes and converting the labels of all modified or inserted nodes into their new tags, one obtains the tree that is the result of performing the modifications on T.
We assume the availability of a function modified on the nodes of the annotated tree that returns, for each node, whether any part of the subtree rooted at that node has been modified. The function modified can be implemented efficiently as follows. We assume we have the Dewey decimal number of the node (generated dynamically as we process). Whenever a node is updated, we keep it in a trie [7] according to its Dewey decimal number. To determine whether a descendant of a node was modified, the trie is searched according to the Dewey decimal number of the node. Note that we can navigate the trie in parallel to navigating the XML tree.
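A minimal sketch of this modification tracking is shown below, assuming Dewey numbers are represented as tuples of child positions; the trie is simply a nested dictionary.

```python
class ModificationTrie:
    """Records the Dewey numbers of updated nodes and answers whether the subtree
    rooted at a given node contains any recorded update."""

    def __init__(self):
        self.root = {}

    def record_update(self, dewey):
        node = self.root
        for step in dewey:
            node = node.setdefault(step, {})
        node["#updated"] = True

    def modified(self, dewey):
        """True if an update was recorded at `dewey` or below it."""
        node = self.root
        for step in dewey:
            if step not in node:
                return False
            node = node[step]
        return True  # we reached a trie node, so some update lies at or below `dewey`

trie = ModificationTrie()
trie.record_update((1, 3, 2))   # e.g. the node at /child[1]/child[3]/child[2] was updated
print(trie.modified((1, 3)))    # True  - an update happened below this node
print(trie.modified((1, 2)))    # False - nothing changed in this subtree
```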
The algorithm for efficient validation of schema casts with modifications validates the annotated tree with respect to S and the target schema in parallel. While processing a subtree of the annotated tree with respect to a type from S and a type from the target schema, one of the following cases applies:

1. If modified is false for the subtree's root, we can run the algorithm described in the previous subsection on this subtree. Since the subtree is unchanged and we know its type when checked with respect to S, we can treat its validation as an instance of the schema cast validation problem (without modifications) described in Section 3.2.
2. Otherwise, if the root is marked as deleted, we do not need to validate the subtree with respect to any type, since that subtree has been deleted.
3. Otherwise, if the root is marked as newly inserted, we have no knowledge of the subtree's validity with respect to any other schema; therefore, we must validate the whole subtree explicitly.
4. Otherwise, since elements may have been added to or deleted from the original content model of the node, we must ensure that the content of the node is valid with respect to the target type. If the target type is a simple type, the content must satisfy (1) of
Definition 1. Otherwise, one must check that the (updated) children fit into the content model of the target type, as specified by its regular expression. In verifying the content model, we check the string of the children's labels against the target content-model language, where the label of a modified node is taken to be its new tag, the label of an inserted node is its tag, and deleted nodes are omitted; the corresponding string for the source type is defined analogously. If the content model check succeeds, and the source type is also a complex type, then we continue recursively validating each child with respect to the types assigned to its label in S and in the target schema (note that if a child is marked as deleted, we do not have to validate it, since it has been deleted). If the source type is not a complex type, we must validate each child explicitly.
3.4 DTDs
Since the type of an element in an XML Schema may depend on the context in which it appears, in general it is necessary to process the document in a top-down manner to determine the type with which one must validate an element (and its subtree). For DTDs, however, an element label uniquely determines the element's type. As a result, there are optimizations that apply to the DTD case that cannot be applied to the general XML Schema case. If one can access all instances of an element label in an ordered labeled tree directly, one need only visit those elements whose types in S and in the target schema are neither subsumed by nor disjoint from each other, and verify their immediate content model.
4 Finite Automata Conformance
In this section, we examine the schema cast validation problem (with and without
modifications) for strings verified with respect to finite automata. The algorithms
described in this section support efficient content model checking for DTDs and
XML Schemas (for example, in the statement of the method validate of Sec-
tion 3.2:
if
Since XML Schema
content models correspond directly to deterministic finite state automata, we
only address that case. Similar techniques can be applied to non-deterministic
finite state automata, though the optimality results do not hold. For reasons of
space, we omit details regarding non-deterministic finite state automata.
4.1 Definitions
A deterministic finite automaton is a 5-tuple A = (Q, Σ, q₀, F, δ), where Q is a finite set of states, Σ is a finite alphabet of symbols, q₀ ∈ Q is the start state, F ⊆ Q is the set of final, or accepting, states, and δ is the transition relation; δ is a map from Q × Σ to Q. Without loss of generality, we assume that δ(q, a)
is defined for all q ∈ Q and a ∈ Σ. We write δ(q, a) = q′, where q, q′ ∈ Q and a ∈ Σ, to denote that δ maps (q, a) to q′. For a string w ∈ Σ* and a state q, δ*(q, w) denotes the state reached by operating on w one symbol at a time. A string w is accepted by a finite state automaton if δ*(q₀, w) ∈ F; w is rejected by the automaton if it is not accepted by it. The language accepted (or recognized) by a finite automaton A, denoted L_A, is the set of strings accepted by A. We also define L_A(q) as the set of strings w such that δ*(q, w) ∈ F. Note that, for a finite state automaton, if a string a·w is in L_A(q) and δ(q, a) = q′, then w is in L_A(q′). We shall drop the subscript from L_A when the automaton is clear from the context.

A state q is a dead state if either q is not reachable from the start state or no final state is reachable from q. We can identify all dead states in a finite state automaton in time linear in the size of the automaton via a simple graph search.
Intersection Automata. Given two automata, A = (Q_A, Σ, q_A, F_A, δ_A) and B = (Q_B, Σ, q_B, F_B, δ_B), one can derive an intersection automaton A ∩ B such that A ∩ B accepts exactly the language L_A ∩ L_B. The intersection automaton evaluates a string on both A and B in parallel and accepts only if both would. Formally, A ∩ B = (Q_A × Q_B, Σ, (q_A, q_B), F_A × F_B, δ), where δ((p, q), a) = (δ_A(p, a), δ_B(q, a)). Note that if A and B are deterministic, A ∩ B is deterministic as well.
Immediate Decision Automata. We introduce immediate decision automata as modified finite state automata that accept or reject strings as early as possible. Immediate decision automata can accept or reject a string when certain conditions are met, without scanning the entire string. Formally, an immediate decision automaton is a 7-tuple (Q, Σ, q₀, F, δ, IA, IR), where IA and IR are disjoint sets and IA, IR ⊆ Q (each member of IA and IR is a state). As with ordinary finite state automata, a string w is accepted by the automaton if δ*(q₀, w) ∈ F. Furthermore, the automaton also accepts w after evaluating a strict prefix u of w (that is, w = u·v) if δ*(q₀, u) ∈ IA; it rejects w after evaluating a strict prefix u of w if δ*(q₀, u) ∈ IR. We can derive an immediate decision automaton from a finite state automaton so that both automata accept the same language.

Definition 6. Let A = (Q, Σ, q₀, F, δ) be a finite state automaton. The derived immediate decision automaton is (Q, Σ, q₀, F, δ, IA, IR), where:

1. if q ∈ IA, then δ*(q, w) ∈ F for every string w;
2. if q ∈ IR, then δ*(q, w) ∉ F for every string w.
It can be easily shown that A and its derived immediate decision automaton accept the same language. For deterministic automata, we can determine all states that belong to IA and IR efficiently, in time linear in the number of states of the automaton. The members of IR can be derived easily from the dead states of A.
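The following sketch derives the two state sets by graph reachability, under the reading of Definition 6 given above (immediate-accept states lead only to accepting states, immediate-reject states cannot reach an accepting state); the per-state search shown is quadratic, whereas the paper notes that a linear-time computation is possible.

```python
def derive_decision_states(states, alphabet, delta, finals):
    """Compute immediate-accept (IA) and immediate-reject (IR) state sets of a DFA.

    `delta` maps (state, symbol) -> state and is assumed total; `finals` is a set.
    """
    def reachable_from(q):
        seen, stack = {q}, [q]
        while stack:
            s = stack.pop()
            for a in alphabet:
                nxt = delta[(s, a)]
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    ia = {q for q in states if reachable_from(q) <= finals}       # only final states ahead
    ir = {q for q in states if not (reachable_from(q) & finals)}  # no final state reachable
    return ia, ir

def run_with_early_decision(start, delta, finals, ia, ir, string):
    """Evaluate `string`, accepting or rejecting as soon as IA or IR is entered."""
    q = start
    for symbol in string:
        if q in ia:
            return True
        if q in ir:
            return False
        q = delta[(q, symbol)]
    return q in finals
```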
4.2 Schema Cast Validation
The problem that we address is the following: Given two deterministic finite-state automata, A and B, and a string w ∈ L(A), does w ∈ L(B)? One could, of course, scan w using B to determine acceptance by B. When many strings that belong to L(A) are to be validated with respect to B, it can be more efficient to preprocess A and B so that the knowledge of acceptance by A can be used to determine membership in L(B). Without loss of generality, we assume that A and B are defined over the same alphabet.

Our method for the efficient validation of a string in L(A) with respect to B relies on evaluating the string on A and B in parallel. Assume that after parsing a prefix of the string we are in a state p in A and a state q in B. Then, we can:

1. Accept the string immediately if L(p) ⊆ L(q), because the remaining suffix is guaranteed to be in L(p) (since A accepts the string), which implies that it will be in L(q); by the definition of L(q), B will accept the string.
2. Reject the string immediately if L(p) ∩ L(q) = ∅. Then, the remaining suffix is guaranteed not to be in L(q), and therefore, B will not accept the string.
We construct an immediate decision automaton from the intersection automaton of A and B, with IA and IR based on the two conditions above:

Definition 7. Let A ∩ B be the intersection automaton derived from two finite state automata A and B. The derived immediate decision automaton has the same states, start state, final states, and transition function as A ∩ B, with IA = {(p, q) | L(p) ⊆ L(q)} and IR = {(p, q) | L(p) ∩ L(q) = ∅}.

Theorem 3. For all w ∈ L(A), the derived immediate decision automaton accepts w if and only if w ∈ L(B).
The determination of the members of IA and IR can be done efficiently for deterministic finite state automata. The following proposition is useful to this end.

Proposition 1. For any state (p, q) of the intersection automaton, L(p) is not contained in L(q) if and only if there exist states p′ and q′ such that (p′, q′) is reachable from (p, q), p′ ∈ F_A, and q′ ∉ F_B.

We now present an alternative, equivalent definition of the derived immediate decision automaton:

Definition 8. IA contains every pair (p, q) for which no pair (p′, q′) with p′ ∈ F_A and q′ ∉ F_B is reachable from (p, q), and IR contains every pair (p, q) that is a dead state of A ∩ B.
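To close the section, the following sketch puts the pieces together for string revalidation: the source and target DFAs are run in parallel on their product, with early decisions taken from decision sets computed by the helpers above. The early-accept set used here is a conservative simplification of the IA of Definition 7 (it does not exploit the fact that the string is accepted by the source automaton), but the final answer is still correct for strings in the source language.

```python
def revalidate_string(dfa_a, dfa_b, string):
    """Given that `string` is accepted by dfa_a, decide acceptance by dfa_b, stopping
    as early as the (conservative) decision sets of the product automaton allow.

    Each DFA is a tuple (states, alphabet, delta, start, finals) with a total delta;
    both automata are assumed to share the same alphabet.
    """
    states_a, alphabet, delta_a, start_a, finals_a = dfa_a
    states_b, _, delta_b, start_b, finals_b = dfa_b

    product_states = {(p, q) for p in states_a for q in states_b}
    product_delta = {((p, q), a): (delta_a[(p, a)], delta_b[(q, a)])
                     for (p, q) in product_states for a in alphabet}
    product_finals = {(p, q) for (p, q) in product_states
                      if p in finals_a and q in finals_b}
    ia, ir = derive_decision_states(product_states, alphabet, product_delta,
                                    product_finals)
    return run_with_early_decision((start_a, start_b), product_delta,
                                   product_finals, ia, ir, string)
```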