Chapter 21

Query Processing

Chapter Objectives
In this chapter you will learn:
■  The objectives of query processing and optimization.

■  Static versus dynamic query optimization.

■  How a query is decomposed and semantically analyzed.

■  How to create a relational algebra tree to represent a query.

■  The rules of equivalence for the relational algebra operations.

■  How to apply heuristic transformation rules to improve the efficiency of a query.

■  The types of database statistics required to estimate the cost of operations.

■  The different strategies for implementing the relational algebra operations.

■  How to evaluate the cost and size of the relational algebra operations.

■  How pipelining can be used to improve the efficiency of queries.

■  The difference between materialization and pipelining.

■  The advantages of left-deep trees.

■  Approaches for finding the optimal execution strategy.

■  How Oracle handles query optimization.

When the relational model was first launched commercially, one of the major criticisms
often cited was inadequate performance of queries. Since then, a significant amount of
research has been devoted to developing highly efficient algorithms for processing queries.
There are many ways in which a complex query can be performed, and one of the aims of
query processing is to determine which one is the most cost effective.
In first generation network and hierarchical database systems, the low-level procedural
query language is generally embedded in a high-level programming language such as
COBOL, and it is the programmer’s responsibility to select the most appropriate execution
strategy. In contrast, with declarative languages such as SQL, the user specifies what data
is required rather than how it is to be retrieved. This relieves the user of the responsibility
of determining, or even knowing, what constitutes a good execution strategy and makes
the language more universally usable. Additionally, giving the DBMS the responsibility
for selecting the best strategy prevents users from choosing strategies that are known to be
inefficient and gives the DBMS more control over system performance.
There are two main techniques for query optimization, although the two strategies are
usually combined in practice. The first technique uses heuristic rules that order the operations in a query. The other technique compares different strategies based on their relative
costs and selects the one that minimizes resource usage. Since disk access is slow compared with memory access, disk access tends to be the dominant cost in query processing
for a centralized DBMS, and it is the one that we concentrate on exclusively in this chapter when providing cost estimates.


Structure of this Chapter
In Section 21.1 we provide an overview of query processing and examine the main phases
of this activity. In Section 21.2 we examine the first phase of query processing, namely
query decomposition, which transforms a high-level query into a relational algebra query
and checks that it is syntactically and semantically correct. In Section 21.3 we examine the
heuristic approach to query optimization, which orders the operations in a query using
transformation rules that are known to generate good execution strategies. In Section 21.4
we discuss the cost estimation approach to query optimization, which compares different
strategies based on their relative costs and selects the one that minimizes resource usage.
In Section 21.5 we discuss pipelining, which is a technique that can be used to further
improve the processing of queries. Pipelining allows several operations to be performed in
parallel, rather than requiring one operation to be complete before another can start.
We also discuss how a typical query processor may choose an optimal execution strategy.
In the final section, we briefly examine how Oracle performs query optimization.
In this chapter we concentrate on techniques for query processing and optimization in
centralized relational DBMSs, being the area that has attracted most effort and the model
that we focus on in this book. However, some of the techniques are generally applicable
to other types of system that have a high-level interface. Later, in Section 23.7 we briefly
examine query processing for distributed DBMSs. In Section 28.5 we see that some of the
techniques we examine in this chapter may require further consideration for the Object-Relational DBMS, which supports queries containing user-defined types and user-defined
functions.
The reader is expected to be familiar with the concepts covered in Section 4.1 on the
relational algebra and Appendix C on file organizations. The examples in this chapter are
drawn from the DreamHome case study described in Section 10.4 and Appendix A.

21.1  Overview of Query Processing

Query processing    The activities involved in parsing, validating, optimizing, and executing a query.

The aims of query processing are to transform a query written in a high-level language,
typically SQL, into a correct and efficient execution strategy expressed in a low-level
language (implementing the relational algebra), and to execute the strategy to retrieve the
required data.
Query optimization    The activity of choosing an efficient execution strategy for processing a query.

An important aspect of query processing is query optimization. As there are many
equivalent transformations of the same high-level query, the aim of query optimization
is to choose the one that minimizes resource usage. Generally, we try to reduce the total
execution time of the query, which is the sum of the execution times of all individual
operations that make up the query (Selinger et al., 1979). However, resource usage may
also be viewed as the response time of the query, in which case we concentrate on
maximizing the number of parallel operations (Valduriez and Gardarin, 1984). Since the
problem is computationally intractable with a large number of relations, the strategy
adopted is generally reduced to finding a near optimum solution (Ibaraki and Kameda,
1984).
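To make the two measures concrete, the implied cost model can be written out as a sketch (the notation op_1, ..., op_n for the individual operations of a chosen execution strategy is ours, not the authors'):

    T_{\mathrm{total}}(Q) = \sum_{i=1}^{n} T(op_i), \qquad
    T_{\mathrm{response}}(Q) \approx \max_{1 \le i \le n} T(op_i)

The second expression is an idealization that assumes all n operations can run fully in parallel; in practice the response time is bounded below by the longest chain of operations that must execute one after another.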
Both methods of query optimization depend on database statistics to evaluate properly
the different options that are available. The accuracy and currency of these statistics
have a significant bearing on the efficiency of the execution strategy chosen. The statistics
cover information about relations, attributes, and indexes. For example, the system catalog
may store statistics giving the cardinality of relations, the number of distinct values for
each attribute, and the number of levels in a multilevel index (see Appendix C.5.4).
Keeping the statistics current can be problematic. If the DBMS updates the statistics
every time a tuple is inserted, updated, or deleted, this would have a significant impact on
performance during peak periods. An alternative, and generally preferable, approach is
to update the statistics on a periodic basis, for example nightly, or whenever the system is
idle. Another approach taken by some systems is to make it the users’ responsibility to
indicate when the statistics are to be updated. We discuss database statistics in more detail
in Section 21.4.1.
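As a concrete illustration of the last approach, Oracle (to which we return in Section 21.6) allows statistics to be gathered on demand. A minimal sketch follows; the schema name DREAMHOME is an assumption made for illustration, not something defined in the text:

    ANALYZE TABLE Staff  COMPUTE STATISTICS;   -- legacy Oracle syntax for gathering optimizer statistics
    ANALYZE TABLE Branch COMPUTE STATISTICS;

    -- The DBMS_STATS package is the more recent Oracle interface for the same task
    -- (invoked here with the SQL*Plus EXEC shorthand; 'DREAMHOME' is an assumed schema name).
    EXEC DBMS_STATS.GATHER_TABLE_STATS('DREAMHOME', 'STAFF');
    EXEC DBMS_STATS.GATHER_TABLE_STATS('DREAMHOME', 'BRANCH');

Scheduling such calls nightly, or whenever the system is idle, corresponds to the periodic-refresh policy described above.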
As an illustration of the effects of different processing strategies on resource usage, we
start with an example.

Example 21.1 Comparison of different processing strategies
Find all Managers who work at a London branch.

We can write this query in SQL as:
SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
(s.position = ‘Manager’ AND b.city = ‘London’);



Three equivalent relational algebra queries corresponding to this SQL statement are:
(1) σ(position=‘Manager’) ∧ (city=‘London’) ∧ (Staff.branchNo=Branch.branchNo)(Staff × Branch)
(2) σ(position=‘Manager’) ∧ (city=‘London’)(Staff ⋈Staff.branchNo=Branch.branchNo Branch)
(3) (σposition=‘Manager’(Staff)) ⋈Staff.branchNo=Branch.branchNo (σcity=‘London’(Branch))
For the purposes of this example, we assume that there are 1000 tuples in Staff, 50 tuples
in Branch, 50 Managers (one for each branch), and 5 London branches. We compare these
three queries based on the number of disk accesses required. For simplicity, we assume
that there are no indexes or sort keys on either relation, and that the results of any intermediate operations are stored on disk. The cost of the final write is ignored, as it is the
same in each case. We further assume that tuples are accessed one at a time (although in
practice disk accesses would be based on blocks, which would typically contain several
tuples), and main memory is large enough to process entire relations for each relational
algebra operation.
The first query calculates the Cartesian product of Staff and Branch, which requires
(1000 + 50) disk accesses to read the relations, and creates a relation with (1000 * 50)
tuples. We then have to read each of these tuples again to test them against the selection
predicate at a cost of another (1000 * 50) disk accesses, giving a total cost of:
(1000 + 50) + 2*(1000 * 50) = 101 050 disk accesses
The second query joins Staff and Branch on the branch number branchNo, which again
requires (1000 + 50) disk accesses to read each of the relations. We know that the join of
the two relations has 1000 tuples, one for each member of staff (a member of staff can only
work at one branch). Consequently, the Selection operation requires 1000 disk accesses to
read the result of the join, giving a total cost of:
2*1000 + (1000 + 50) = 3050 disk accesses
The final query first reads each Staff tuple to determine the Manager tuples, which
requires 1000 disk accesses and produces a relation with 50 tuples. The second Selection
operation reads each Branch tuple to determine the London branches, which requires 50
disk accesses and produces a relation with 5 tuples. The final operation is the join of the
reduced Staff and Branch relations, which requires (50 + 5) disk accesses, giving a total
cost of:
1000 + 2*50 + 5 + (50 + 5) = 1160 disk accesses
Clearly the third option is the best in this case, by a factor of 87:1. If we increased the
number of tuples in Staff to 10 000 and the number of branches to 500, the improvement
would be by a factor of approximately 870:1. Intuitively, we may have expected this as the
Cartesian product and Join operations are much more expensive than the Selection operation, and the third option significantly reduces the size of the relations that are being joined
together. We will see shortly that one of the fundamental strategies in query processing is
to perform the unary operations, Selection and Projection, as early as possible, thereby
reducing the operands of any subsequent binary operations.
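The same idea can be expressed directly in SQL by pushing each selection into a derived table before the join. The sketch below restates strategy (3); the derived-table aliases s and b are ours, and a cost-based optimizer would normally reach this plan from the original statement anyway:

    SELECT *
    FROM (SELECT * FROM Staff  WHERE position = 'Manager') s,
         (SELECT * FROM Branch WHERE city = 'London') b
    WHERE s.branchNo = b.branchNo;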

Figure 21.1  Phases of query processing.

Query processing can be divided into four main phases: decomposition (consisting
of parsing and validation), optimization, code generation, and execution, as illustrated in
Figure 21.1. In Section 21.2 we briefly examine the first phase, decomposition, before
turning our attention to the second phase, query optimization. To complete this overview,
we briefly discuss when optimization may be performed.

Dynamic versus static optimization
There are two choices for when the first three phases of query processing can be carried
out. One option is to dynamically carry out decomposition and optimization every time the
query is run. The advantage of dynamic query optimization arises from the fact that all
information required to select an optimum strategy is up to date. The disadvantages are
that the performance of the query is affected because the query has to be parsed, validated,
and optimized before it can be executed. Further, it may be necessary to reduce the number of execution strategies to be analyzed to achieve an acceptable overhead, which may
have the effect of selecting a less than optimum strategy.
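Prepared statements give a rough feel for this trade-off: the statement is parsed, validated, and (in many systems) optimized when it is prepared, and only executed thereafter. A minimal sketch in MySQL-style dynamic SQL; the statement name manager_query and the session variable @city are illustrative, not from the text:

    PREPARE manager_query FROM
      'SELECT *
       FROM Staff s, Branch b
       WHERE s.branchNo = b.branchNo
         AND s.position = ''Manager''
         AND b.city = ?';

    -- Each execution reuses the prepared form; only the parameter value changes.
    SET @city = 'London';
    EXECUTE manager_query USING @city;

With fully dynamic optimization, by contrast, the complete statement text would be decomposed and optimized afresh on every submission.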
The alternative option is static query optimization, where the query is parsed, validated, and optimized once. This approach is similar to the approach taken by a compiler
for a programming language. The advantages of static optimization are that the runtime overhead of repeated parsing, validation, and optimization is removed from each execution of the query.


Index

nested model 616–18
sagas 618–19
workflow models 621–2
architecture for 576–7
classification of 724–5
concurrency control 577–605
deadlock 594–7

granularity 602–5
locking methods 587–94
multiversion timestamp ordering
600–1
need for 577–80
optimistic techniques 601–2
recoverability 587
serializability 580–6
timestamping 597–600
and denormalization 531
design 300–1
as logical units of work 573
in object model 907
in OODBMS 871
in Oracle 774
in physical design 502–6, 1331
data usage 505–6
frequency information 504–5
paths to relations 503–4
properties 575–6
in RDBMS 813
and recovery 607–9
serializability of 580–6
conflict, testing for 582–3
distributed 737
view serializability 583–6
testing for 584–6
in SQL 187–9
transform methods in OODBMS 835
transform-oriented languages 109

transformation rules in relational algebra
640–4
transformation tools in data warehousing
1165–6
transient objects 867, 902
transient versions 872
transitive closure in RDBMS 812–13
transitive dependency 396–7
transitive persistence 870
transparency of DDBMS 690, 719–28
distribution transparency 719–22
performance transparency 725–8
transaction transparency 722–5
transparent network access in Web-DBMS
applications 1008
transparent SQL access in Oracle 775
tree induction 1235
tree structure 1280
triggers
and denormalization 531
in Oracle 245, 263–7

in replication 790–1
in SQL 967–70
tuple relational calculus 103–7
expressions 105–7
safety of 106–7
formulae 105–7
tuple variables 103
tuples 69, 73–4

distributing 529
Tuxedo 63
two-phase commit (2PC) 746–52
communication topologies 751–2
election protocols 751
termination protocols 748–50
two-phase locking (2PL) 589–91
in DDBMS
centralized 738
distributed 2PL 739–40
majority locking 740
primary copy 739
Two Phase Optimization 672
type hierarchy 374
type inheritance in Oracle 982
type model 982
typed views in SQL 965–6
types in Object Model 906
typespecs in ObjectStore 927
typing judgment 1126
unary operations 89–91
unary relationship 349
uncommitted dependency problem 577,
578–9, 590–1
undo operation 607, 608, 612
undone transaction 574
unfederated multidatabase systems 699
unicode compression property in Access
232–3
Unified Modeling Language (UML) 288, 894
OODBMS design with 836–44
UML diagrams 837–42
usage of UML 842–4
Uniform Resource Identifiers (URIs) 1002
Uniform Resource Locators (URL)
1002–4
Uniform Resource Names (URNs) 1002
unilateral abort 746
union 92, 102
of tables in SQL 147, 148
union operations in relational algebra 642
uniqueness of candidate key 78
Universal Discovery, Description and
Integration (UDDI) 1088–91
universal object storage standards 899
Universe of Discourse (UoD) 44
University Accommodation Office case
study 1255–8

data requirements 1255–7
query transactions 1257–8
unnormalized form (UNF) 402, 403
unnormalized table 403
unordered (heap) files 1270, 1288
unpinned data page 609
unsafe expressions 107
unstructured complex objects 825
unstructured interviews 317
unsupervised learning approach to database segmentation 1236
update anomalies 391
update-anywhere ownership 784, 787–8
UPDATE in SQL 117, 152–3
update of data 48
restrictions in SQL 186
update query 217–20
update transactions 301
Upper-CASE tools 307
use case diagrams in UML 838–9, 840
user-accessible catalog 48–9
user-defined data types in Oracle 978–83
user-defined routines in SQL 953–5
user-defined types in SQL 948–51
user-defined words in SQL 116
user interface design 301–3
user-level security in Microsoft Office
Access 555–8
user transactions
in conceptual design 456–8, 1327
in logical design 474, 1329
user views
in database planning 287
in DreamHome case study 336–7
in physical design 515–16, 1331
users in Oracle 247
utility services 52
validation phase of optimistic concurrency
control protocol 602
validation property in Access 232

validation rules 235–6
validation techniques in normalization
389
VB.net 304
VBScript 1012–13
Versant OODBMS 834, 850
version history 872
version management 872
versionable classes 873
versions 872–3
vertical fragmentation
in distributed query optimization 764–5
of DDBMS 708, 713–15
vertical partitioning 529
view maintenance 187
view materialization 176, 186–7
view mechanism 18


view resolution 176, 180–1
view serializability 583–6
testing for 584–6
views 83–5

in DBMS 17–18
lack of in OODBMS 885
in Oracle 245
purpose of 84–5
and security 550
in SQL 176–87
advantages 185–6
WITH CHECK OPTION 183–4
creating 177–9
disadvantages 186
grouped and joined 179
horizontal 176–7
materialization 186–7
removing 179–80
resolution of 180–1
restrictions 181
updating 181–2
vertical 177
typed, in SQL 965–6
updating 85
virtual memory mapping architecture in
ObjectStore 923–4
virtual relation 83
Visual Basic (VB) 40, 304
Visual FoxPro database system 25
volatile storage 606, 1268
wait-for graph (WFG) 595
warehouse manager in data warehousing
1158
weak entity type 356, 465

Web 998–1011
ActiveX security 569
integration with DBMS 1005–6
advantages of 1006–8
approaches to 1011
disadvantages of 1008–11
security on 562–9
digital certificates 564–5
digital signatures 564
firewalls 563–4
Java security 566–9
Kerberos 565
message digest algorithm 564
proxy servers 563

secure electronic transactions 566
secure sockets layer 565–6
server, extending 1020–1
services 1004–5
static and dynamic pages 1004
Web-based database solutions 228
Web data in Oracle Warehouse Builder
1199
Web services 1004–5
Web Services Description Language
(WSDL) 1088
Web sites
interactive and dynamic 808
well-formed formula 104
Wellmeadows Hospital case study 1260–7

data requirements 1260–6
transaction requirements 1266–7
wide area networks 61, 700
width-balanced histograms in query
optimization 677
wildcard characters 202
windowing calculations in OLAP 1223–4
windows in Oracle 268
Windows NT 63
Wireless Application Protocol (WAP) 702
Wisconsin benchmarking 878–9
WITH CHECK OPTION in SQL 183–4
wizards in Office Access 226, 229
workflow ownership 784, 787
working versions 872
workload and physical database design
502
write fault 924
write phase of optimistic concurrency
control protocol 602
write_timestamp 598
X/Open Distributed Transaction
Processing 62
Model 758–61
XML 1073–82
advantages 1074–6
CDATA 1078
comments 1078
and databases 1128–39
schema independent representation 1131–2
storing in an attribute 1130
storing in shredded form 1130

declaration 1076
document type definitions 1078–82
elements 1076–7
entity references 1077
ordering 1078
related technologies 1082–91
schema 1091–100
and SQL 1132–7
mapping functions 1135–7
new data type 1132–4
XML Information Set 1114–15
XML Linking Language (XLink) 1086
XML Metadata Interchange (XMI) 895
XML Path Language (XPath) 1085
2.0 data model 1115–20
XML Pointer Language (XPointer)
1085–6
XML Query Languages 1100–28
formal semantics 1121–8
dynamic evaluation 1126–7
logical expressions 1127–8
normalization 1121–5
static type analysis 1125–6
information set 1114–15
Lore and Lorel, extending 1100–1
query working group 1101–3

XQuery 1103–14
XQuery 1.0 data model 1115–20
XML schema 1091–100
built-in types 1092
cardinality 1093
constraints 1096
groups 1094–5
lists and unions 1095–6
new types 1094
references 1093–4
simple and complex types 1092–3
XML see eXtensible Mark-up Language
(XML)
XQuery 1103–14
built-in functions and user-defined
functions 1111–12
1.0 data model 1115–20
FLWOR expressions 1105–11
path expressions 1103–4
types and sequence types 1112–14
XSL Transformations (XSLT) 1084
Yes/No data type 229


