Distributed Database Management Systems: Lecture 33



In the previous lecture
• Final phase of Query Decomposition (QD)
• Data Localization for horizontal (HF), vertical (VF) and derived (DF) fragmentation

In today's lecture
• Data Localization for Hybrid Fragmentation
• Query Optimization

Reduction for HyF
• Hybrid fragmentation (HyF) combines both types of fragmentation, horizontal and vertical
• EMP1 = σ_{eNo ≤ E4}(Π_{eNo, eName}(EMP))
• EMP2 = σ_{eNo > E4}(Π_{eNo, eName}(EMP))
• EMP3 = Π_{eNo, title}(EMP)

• Select eName from EMP where eNo = E5
• Generic query: Π_{eName}(σ_{eNo = E5}((EMP1 ∪ EMP2) ⋈_{eNo} EMP3))
• Reduced query: Π_{eName}(σ_{eNo = E5}(EMP2))
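The reduction for the eNo = E5 query can be checked on a small example. The sketch below is only an illustration: the sample rows and the helper functions `select`, `project` and `join` are assumptions made for this example, not part of the lecture. It builds the three hybrid fragments and compares the generic query with the reduced one.

```python
# Hypothetical sample rows; only the attribute names come from the lecture.
EMP = [
    {"eNo": "E1", "eName": "Ali", "title": "Engineer"},
    {"eNo": "E4", "eName": "Sara", "title": "Programmer"},
    {"eNo": "E5", "eName": "Omar", "title": "Analyst"},
    {"eNo": "E7", "eName": "Hina", "title": "Engineer"},
]

def select(rows, pred):
    """Selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    """Projection: keep only attrs and drop duplicate rows."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def join(left, right, attr):
    """Equi-join on a single common attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Hybrid fragments EMP1, EMP2, EMP3. Plain string comparison is enough here
# only because every sample key has a single digit (an assumption).
EMP1 = select(project(EMP, ["eNo", "eName"]), lambda r: r["eNo"] <= "E4")
EMP2 = select(project(EMP, ["eNo", "eName"]), lambda r: r["eNo"] > "E4")
EMP3 = project(EMP, ["eNo", "title"])

def is_e5(row):
    return row["eNo"] == "E5"

# Generic query: rebuild EMP from its fragments, then select and project.
generic = project(select(join(EMP1 + EMP2, EMP3, "eNo"), is_e5), ["eName"])
# Reduced query: only EMP2 can hold eNo = E5, and eName needs no join with EMP3.
reduced = project(select(EMP2, is_e5), ["eName"])

print(generic == reduced)
```

The selection on eNo contradicts the predicate of EMP1, and the projection on eName makes the join with EMP3 unnecessary, so only EMP2 has to be accessed.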

Summary of what we have done so far

• Query Decomposition: generates an efficient query in relational algebra
– Normalization, Analysis, Simplification, Rewriting

• Data Localization: applies the global query to fragments; increases the optimization level

• So, next is cost-based optimization

• It mainly concentrates on the order of performing joins
• Characteristics of relations, such as cardinalities, are considered

• First, query optimization (QO) in general
• QO refers to producing a Query Execution Plan (QEP) that represents the execution strategy for the query

• Components of the Optimizer
• Search Space: the set of equivalent alternative execution plans
• Cost Model: predicts the cost of an execution plan
• Search Strategy: explores the search space and produces the best plan

Search Space
• The search space consists of equivalent query trees produced using transformation rules
• The optimizer concentrates on join trees, since join cost is the most significant

• Example:

• Select eName, resp
From EMP, ASG, PROJ
where EMP.eNo = ASG.eNo and ASG.pNo = PROJ.pNo

Equivalent join trees:
• (EMP ⋈_{eNo} ASG) ⋈_{pNo} PROJ
• (ASG ⋈_{pNo} PROJ) ⋈_{eNo} EMP
• (EMP × PROJ) ⋈_{pNo, eNo} ASG

• With N relations there are O(N!) alternative join trees, obtained by applying the commutativity and associativity of join
• So, restrictions are applied to limit the search space

1- Heuristics
- Perform selection and projection on base relations
- Avoid Cartesian products

2- Shape of Tree
- Linear tree: at least one operand of each operator node is a base relation
- Bushy tree: may have operators whose operands are all intermediate tables; allows parallel execution
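To make the size of the search space and the effect of the Cartesian-product heuristic concrete, here is a minimal sketch. It assumes that each permutation of the relations stands for one linear join order, and it encodes the join graph of the example query (EMP-ASG on eNo, ASG-PROJ on pNo) by hand; the names `JOIN_EDGES` and `avoids_cartesian_product` are hypothetical.

```python
from itertools import permutations
from math import factorial

RELATIONS = ["EMP", "ASG", "PROJ"]
# Join graph of the example query: EMP-ASG on eNo, ASG-PROJ on pNo.
JOIN_EDGES = {frozenset(("EMP", "ASG")), frozenset(("ASG", "PROJ"))}

def avoids_cartesian_product(order):
    """True if every newly added relation joins with one already in the plan."""
    joined = {order[0]}
    for rel in order[1:]:
        if not any(frozenset((rel, j)) in JOIN_EDGES for j in joined):
            return False  # no join predicate available: Cartesian product
        joined.add(rel)
    return True

all_orders = list(permutations(RELATIONS))  # N! linear orders
kept = [o for o in all_orders if avoids_cartesian_product(o)]

print(len(all_orders), factorial(len(RELATIONS)))
for order in kept:
    print(" -> ".join(order))
```

Orders that start with EMP and PROJ are discarded because those two relations share no join predicate. For larger N the factorial term grows quickly, which is why such restrictions matter.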

Search Strategy
• The most popular is Dynamic Programming (DP)
• It starts with base relations and keeps adding relations, calculating the cost at each step

• DP is almost exhaustive, so it produces the best plan
• Too expensive with more than 5 relations
• The other option is a randomized strategy
• It does not guarantee the best plan
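The dynamic programming idea can be sketched as follows. This is a simplified illustration, not the lecture's algorithm: it considers only linear (left-deep) plans, takes the sum of intermediate result sizes as the cost, and uses made-up cardinalities and join selectivities (`CARD`, `SEL`). All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics, chosen only for illustration.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}

def result_size(rels):
    """Estimated size of joining rels: product of cardinalities and selectivities."""
    size = Fraction(1)
    for r in rels:
        size *= CARD[r]
    for pair, sel in SEL.items():
        if pair <= rels:  # both relations of the predicate are present
            size *= sel
    return size

def best_left_deep_plan(relations):
    """Return (cost, order) of the cheapest left-deep plan found by DP."""
    # Base case: a single base relation costs nothing to "join".
    best = {frozenset([r]): (Fraction(0), (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in combinations(relations, k):
            s = frozenset(subset)
            candidates = []
            for r in subset:  # r is the relation added last
                cost, order = best[s - {r}]
                candidates.append((cost + result_size(s), order + (r,)))
            best[s] = min(candidates)  # keep only the cheapest plan per subset
    return best[frozenset(relations)]

cost, order = best_left_deep_plan(["EMP", "ASG", "PROJ"])
print(cost, order)
```

Only the best plan for each subset of relations is kept, so plans that begin with an expensive Cartesian product of EMP and PROJ are pruned early. The number of subsets still grows exponentially with the number of relations, which is why the approach becomes too expensive for larger queries.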

Cost Model
• Uses the cost of operators and statistics on base data to predict the size of intermediate tables
• Cost is measured as Total Time or Response Time

• Total time = CPU time + I/O time + transmission (tr) time
• In a WAN, the major cost is tr time
• Initially the ratio of tr cost to I/O cost was 20:1 for a WAN; for a LAN it is 1:1.6

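• Response time = CPU time + I/O time + tr time
• Difference? Total time adds up the work done at all sites; response time is the elapsed time, so work done in parallel at different sites overlaps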

• T_CPU = time for a CPU instruction
• T_I/O = time for a disk I/O
• T_MSG = fixed time for initiating and receiving a message
• T_TR = time to transmit a data unit from one site to another

Example: Site 1 sends x units and Site 2 sends y units to Site 3

• TT = 2 * T_MSG + T_TR * (x + y)
• RT = max{T_MSG + T_TR * x, T_MSG + T_TR * y}
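The two formulas can be compared numerically. In the sketch below the function names and the sample values (message time 1, transmission time 2 per unit, x = 100, y = 50) are assumptions chosen only for illustration; as in the formulas above, only communication cost is considered.

```python
def total_time(t_msg, t_tr, x, y):
    """Total time: both transfers are added up (two messages, x + y units)."""
    return 2 * t_msg + t_tr * (x + y)

def response_time(t_msg, t_tr, x, y):
    """Response time: the transfers run in parallel, so the slower one decides."""
    return max(t_msg + t_tr * x, t_msg + t_tr * y)

# Hypothetical values in arbitrary time units.
t_msg, t_tr = 1, 2
x, y = 100, 50

tt = total_time(t_msg, t_tr, x, y)
rt = response_time(t_msg, t_tr, x, y)
print(tt, rt)
```

With these values the total time is 302 and the response time is 201: the transfer from Site 2 is hidden behind the longer transfer from Site 1.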

Database Statistics

• The major factor is the size of intermediate tables
• If the intermediate results are to be transmitted, then an estimate of their size is a must
• More precise statistics cost more to maintain

• For each relation R[A1, A2, …, An] fragmented as R1, …, Rr:
1. the length of each attribute: length(Ai)
2. the number of distinct values of each attribute in each fragment: card(Π_{Ai}(Rj))
3. the maximum and minimum values in the domain of each attribute: min(Ai), max(Ai)
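As an illustration of these statistics, the sketch below computes them for one attribute of one fragment. It is only an approximation under stated assumptions: the fragment rows are made up, the fragment is assumed non-empty, length(Ai) is approximated by the longest value seen rather than taken from the schema, and min/max are taken over the stored values rather than over the declared domain. The function name `attribute_stats` is hypothetical.

```python
def attribute_stats(fragment, attr):
    """Collect simple statistics for one attribute of a non-empty fragment."""
    values = [row[attr] for row in fragment]
    return {
        "length": max(len(str(v)) for v in values),  # approximates length(Ai)
        "card": len(set(values)),                    # distinct values in the fragment
        "min": min(values),                          # smallest stored value
        "max": max(values),                          # largest stored value
    }

# Hypothetical fragment EMP1(eNo, eName).
EMP1 = [
    {"eNo": "E1", "eName": "Ali"},
    {"eNo": "E2", "eName": "Sara"},
    {"eNo": "E3", "eName": "Ali"},
]

stats = attribute_stats(EMP1, "eName")
print(stats)
```

A real system would keep such values in its catalog and refresh them periodically, since recomputing them on every query would be costly.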
