Distributed Database Management Systems
Lecture 33
In the previous lecture
• Final phase of QD
• Data Localization: for HF,
VF and DF.
In today’s Lecture
• Data Localization for
Hybrid Fragmentation
• Query Optimization.
Reduction for HyF
• HyF contains both types of
Fragmentations
• EMP1 = σ_{eNo ≤ E4}(Π_{eNo, eName}(EMP))
• EMP2 = σ_{eNo > E4}(Π_{eNo, eName}(EMP))
• EMP3 = Π_{eNo, title}(EMP)
• Select eName from EMP
where eNo = E5
[Figure: query trees]
Generic query: Π_{eName}(σ_{eNo = E5}((EMP1 ∪ EMP2) ⋈_{eNo} EMP3))
Reduced query: Π_{eName}(σ_{eNo = E5}(EMP2))
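The reduction can be checked on a small example. The sketch below is only an illustration: the EMP rows are made up, the helper functions `project`, `select` and `join` are hypothetical names, and eNo values are compared as plain strings (which works here because every number has one digit). It builds the three fragments, runs the query over all of them, and compares the result with the query run on EMP2 alone.

```python
# Toy EMP relation; the rows are made-up illustration data.
EMP = [
    {"eNo": "E1", "eName": "Ann", "title": "Engineer"},
    {"eNo": "E2", "eName": "Bob", "title": "Analyst"},
    {"eNo": "E4", "eName": "Dan", "title": "Programmer"},
    {"eNo": "E5", "eName": "Eve", "title": "Analyst"},
    {"eNo": "E6", "eName": "Fay", "title": "Engineer"},
]

def project(rows, attrs):
    """Relational projection with duplicate elimination."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def select(rows, pred):
    """Relational selection."""
    return [r for r in rows if pred(r)]

def join(left, right, attr):
    """Equi-join on a single shared attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Hybrid fragmentation: vertical split first, then a horizontal split of one part.
EMP1 = select(project(EMP, ["eNo", "eName"]), lambda r: r["eNo"] <= "E4")
EMP2 = select(project(EMP, ["eNo", "eName"]), lambda r: r["eNo"] > "E4")
EMP3 = project(EMP, ["eNo", "title"])

def is_e5(r):
    return r["eNo"] == "E5"

# Generic query: rebuild EMP from all fragments, then select and project.
generic = project(select(join(EMP1 + EMP2, EMP3, "eNo"), is_e5), ["eName"])
# Reduced query: EMP1 contradicts eNo = E5, and EMP3 adds no needed attribute.
reduced = project(select(EMP2, is_e5), ["eName"])

print(generic, reduced)
```

Both queries return the same single row, but the reduced one touches only EMP2.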
Summary of what we
have done so far
• Query Decomposition:
generates an efficient query
in relational algebra
– Normalization, Analysis,
Simplification, Rewriting
• Data Localization: applies
global query to fragments;
reduction improves the
fragment query
• So, next is the cost-based
optimization
• Mainly concentrates on
the order of performing
joins
• Characteristics of relations
like cardinalities are
considered
• First QO in general
• QO refers to
producing a Query
Execution plan (QEP)
that represents
execution strategy.
• Components of Optimizer
• Search Space: set of equivalent
alternative execution plans
• Cost Model: predicts the cost
of an execution plan
• Search Strategy:
produces the best plan
Search Space
• Search space consists of
equivalent query trees
produced using
transformation rules
• Optimizer concentrates
on join trees, since join
cost dominates
• Example:
• Select eName, resp
From EMP, ASG, PROJ
where EMP.eNo = ASG.eNo
and ASG.pNo = PROJ.pNo
[Figure: three equivalent join trees]
(a) (EMP ⋈eNo ASG) ⋈pNo PROJ
(b) (PROJ ⋈pNo ASG) ⋈eNo EMP
(c) (PROJ × EMP) ⋈pNo, eNo ASG
• The number of alternative
join trees for N relations is
O(N!), from commutativity
and associativity of joins
• So, restrictions are
applied
1- Heuristics
- Selection and
projection on base
relations
- Avoid Cartesian
product
2- Shape of Tree
- Linear tree: at least one
operand of each operator
node is a base relation
- Bushy tree: may have
operators whose operands
are all intermediate tables;
allows parallel execution
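To get a feel for how fast the search space grows, the following sketch counts join trees for small N. It assumes Cartesian products are allowed and that left and right operands are distinguished; the function names are hypothetical. Left-deep trees, a common kind of linear tree, correspond to permutations of the relations, and the count of all binary trees with N labeled leaves is the standard combinatorial result (2N-2)!/(N-1)!.

```python
from math import factorial

def left_deep_join_orders(n):
    """Left-deep trees over n relations: one per permutation, so n!."""
    return factorial(n)

def bushy_join_trees(n):
    """All binary trees with n labeled leaves (left/right operand order
    matters): n! * Catalan(n - 1) = (2n - 2)! / (n - 1)!."""
    return factorial(2 * n - 2) // factorial(n - 1)

for n in range(2, 8):
    print(n, left_deep_join_orders(n), bushy_join_trees(n))
```

Restricting the optimizer to linear trees and avoiding Cartesian products cuts these numbers down considerably, which is why the restrictions above are applied.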
Search Strategy
• Most popular is Dynamic
Programming (DP)
• It starts with base
relations and keeps
adding relations, calculating
the cost at each step
• DP is almost exhaustive,
so it produces the best plan
• Too expensive with more
than 5 relations
• The other option is a
randomized strategy
• It does not guarantee the
best plan
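A minimal sketch of the dynamic programming idea follows. It is not a real optimizer: it considers left-deep plans only, uses the sum of estimated intermediate result sizes as the cost, and the cardinalities and join selectivities in `CARD` and `SEL` are made-up numbers. All names are hypothetical.

```python
from itertools import combinations

# Hypothetical statistics, made up for illustration.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SEL = {
    frozenset(["EMP", "ASG"]): 0.0025,
    frozenset(["ASG", "PROJ"]): 0.005,
}

def result_size(rels):
    """Estimated cardinality of joining every relation in rels."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for a, b in combinations(sorted(rels), 2):
        # A missing selectivity means no join predicate: Cartesian product.
        size *= SEL.get(frozenset([a, b]), 1.0)
    return size

def best_left_deep_plan(relations):
    """DP over subsets: best[S] holds (cost, join order) for subset S."""
    best = {frozenset([r]): (0.0, (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in combinations(relations, k):
            s = frozenset(subset)
            candidates = []
            for r in subset:  # r is the relation joined last
                sub_cost, sub_order = best[s - {r}]
                candidates.append((sub_cost + result_size(s), sub_order + (r,)))
            best[s] = min(candidates)  # keep only the cheapest plan per subset
    return best[frozenset(relations)]

cost, order = best_left_deep_plan(["EMP", "ASG", "PROJ"])
print(cost, order)
```

With these numbers the plan joins ASG and PROJ first and adds EMP last, because that keeps the intermediate result smallest. The table `best` has one entry per subset of relations, which is why the approach becomes expensive as the number of relations grows.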
Cost Model
• Uses the cost of operators and
statistics of base data to predict
the size of intermediate tables
• Cost is considered as Total
Time and Response Time
• Total time = CPU time +
I/O time + transmission (tr) time
• In a WAN, the major cost is
tr time
• Initially the ratio of tr to
I/O cost was 20:1; for a LAN
it is 1:1.6
• Response time = CPU
time + I/O time + tr
time
• Difference?
– Response time is elapsed
time, so work done in
parallel overlaps
• T_CPU = time for a CPU instruction
• T_I/O = time for a disk I/O
• T_MSG = fixed time for
initiating and receiving a message
• T_TR = time to transmit a data unit
from one site to another
[Figure: Site 1 sends x units to Site 3; Site 2 sends y units to Site 3]
• TT = 2 T_MSG + T_TR * (x + y)
• RT = max{T_MSG + T_TR * x,
T_MSG + T_TR * y}
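The two formulas can be compared with a few numbers. In the sketch below the values of the message time, the per-unit transmission time, and the sizes x and y are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def total_time(t_msg, t_tr, x, y):
    """Two messages are sent and x + y units are transmitted in total."""
    return 2 * t_msg + t_tr * (x + y)

def response_time(t_msg, t_tr, x, y):
    """The two transfers run in parallel, so only the slower one counts."""
    return max(t_msg + t_tr * x, t_msg + t_tr * y)

# Hypothetical values: message setup 5, transmission 2 per unit.
t_msg, t_tr = 5, 2
x, y = 100, 40

print(total_time(t_msg, t_tr, x, y))     # 2*5 + 2*140 = 290
print(response_time(t_msg, t_tr, x, y))  # max(205, 85) = 205
```

Total time adds up all the work, while response time reflects only the longest of the parallel transfers.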
Database Statistics
• Major factor is the size of
intermediate tables
• If the intermediate results are
to be transmitted, then an
estimate of their size is a
must
• More precise statistics cost
more to maintain
• For each relation R[A1, A2, …, An]
fragmented as R1, …, Rr
1. length of each attribute: length(Ai)
2. the number of distinct values for
each attribute in each fragment:
card(Π_{Ai}(Rj))
3. maximum and minimum values in
the domain of each attribute:
min(Ai), max(Ai)
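As a rough illustration, the statistics listed above could be gathered as in the sketch below. The fragments and their rows are made up and `fragment_stats` is a hypothetical helper. Two simplifications are assumed: attribute length is taken as the longest value in characters rather than a schema-defined size, and the minimum and maximum are taken over the values present in the fragment rather than over the attribute's domain.

```python
# Hypothetical fragments of EMP(eNo, eName); the rows are made up.
EMP1 = [
    {"eNo": "E1", "eName": "Ann"},
    {"eNo": "E2", "eName": "Bob"},
    {"eNo": "E3", "eName": "Ann"},
]
EMP2 = [
    {"eNo": "E5", "eName": "Eve"},
    {"eNo": "E6", "eName": "Fay"},
]

def fragment_stats(fragment, attr):
    """Collect simple statistics for one attribute of one fragment."""
    values = [row[attr] for row in fragment]
    return {
        "length": max(len(v) for v in values),  # assumption: longest value
        "distinct": len(set(values)),           # card of the projection on attr
        "min": min(values),
        "max": max(values),
    }

for name, frag in [("EMP1", EMP1), ("EMP2", EMP2)]:
    for attr in ["eNo", "eName"]:
        print(name, attr, fragment_stats(frag, attr))
```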