Joe Celko's SQL for Smarties - Advanced SQL Programming

282 CHAPTER 13: BETWEEN AND OVERLAPS PREDICATES
celebration. A little algebra tells you that the length of an event is
(Celebrations.finish_date - Celebrations.start_date + INTERVAL '1' DAY)
and that the length of a guest’s stay is
(Guests.depart_date - Guests.arrival_date + INTERVAL '1' DAY). Let’s do
one of those timeline charts again (Figure 13.3):

What we want is the part of the Guests interval that is inside the
Celebrations interval. Guests 1 and 2 spent only part of their time at
the celebration; Guest 3 spent all of his time at the celebration; and
Guest 4 stayed even longer than the celebration. The part we want is the
interval defined by the two points
(GREATEST(arrival_date, start_date), LEAST(depart_date, finish_date)).
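As a rough sketch of that definition, the overlap can be computed directly in products that offer GREATEST() and LEAST() (PostgreSQL and MySQL do, for example; not every product does). The table layouts below are assumptions reconstructed from the column names used in this section, and only one guest and one celebration are loaded, with dates taken from the sample data shown further on.

```sql
-- Assumed table layouts; only the column names come from the text.
CREATE TABLE Celebrations
(celeb_name VARCHAR(30) NOT NULL PRIMARY KEY,
 start_date DATE NOT NULL,
 finish_date DATE NOT NULL);

CREATE TABLE Guests
(guest_name VARCHAR(30) NOT NULL PRIMARY KEY,
 arrival_date DATE NOT NULL,
 depart_date DATE NOT NULL);

CREATE TABLE GuestCelebrations
(guest_name VARCHAR(30) NOT NULL,
 celeb_name VARCHAR(30) NOT NULL,
 PRIMARY KEY (guest_name, celeb_name));

INSERT INTO Celebrations
VALUES ('Garlic Festival', DATE '2005-01-15', DATE '2005-02-15');
INSERT INTO Guests
VALUES ('Dorothy Gale', DATE '2005-02-01', DATE '2005-11-01');
INSERT INTO GuestCelebrations
VALUES ('Dorothy Gale', 'Garlic Festival');

-- The overlap starts at the later of the two starting dates
-- and ends at the earlier of the two ending dates.
SELECT GE.guest_name, GE.celeb_name,
       GREATEST(G1.arrival_date, E1.start_date) AS entered,
       LEAST(G1.depart_date, E1.finish_date) AS exited
  FROM GuestCelebrations AS GE, Guests AS G1, Celebrations AS E1
 WHERE G1.guest_name = GE.guest_name
   AND E1.celeb_name = GE.celeb_name;
```

For this one pairing the query should return an entered date of 2005-02-01 and an exited date of 2005-02-15.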
Not every product offers GREATEST() and LEAST(). Instead, you can use
the aggregate functions in SQL to build a VIEW on a VIEW, like this:
CREATE VIEW Working (guest_name, celeb_name, entered, exited)
AS SELECT GE.guest_name, GE.celeb_name, E1.start_date, E1.finish_date
     FROM GuestCelebrations AS GE, Celebrations AS E1
    WHERE E1.celeb_name = GE.celeb_name
   UNION
   SELECT GE.guest_name, GE.celeb_name, G1.arrival_date, G1.depart_date
     FROM GuestCelebrations AS GE, Guests AS G1
    WHERE G1.guest_name = GE.guest_name;
VIEW Working
guest_name       celeb_name            entered       exited
=================================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Dorothy Gale'   'St. Fred's Day'      '2005-02-01'  '2005-11-01'
Figure 13.3
Timeline Diagram.
13.2 OVERLAPS Predicate 283
'Dorothy Gale'   'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Dorothy Gale'   'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-28'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Apple Month'         '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-01'  '2005-10-01'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-10-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-10-01'
'Don Quixote'    'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Don Quixote'    'St. Fred's Day'      '2005-01-01'  '2005-10-01'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T. Kirk'  'Apple Month'         '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Garlic Festival'     '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Garlic Festival'     '2005-01-15'  '2005-02-15'
'James T. Kirk'  'St. Fred's Day'      '2005-02-01'  '2005-02-28'
'James T. Kirk'  'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'James T. Kirk'  'Year of the Prune'   '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Year of the Prune'   '2005-01-01'  '2005-12-31'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-01-01'  '2005-12-31'
This puts the starting points of both intervals into one column and the
ending points into another. The overlap runs from the latest start to
the earliest finish, so now we can construct a VIEW like this:
CREATE VIEW Attendees (guest_name, celeb_name, entered, exited)
AS SELECT guest_name, celeb_name, MAX(entered), MIN(exited)
FROM Working
GROUP BY guest_name, celeb_name;
VIEW Attendees
guest_name       celeb_name            entered       exited
=================================================================
'Dorothy Gale'   'Apple Month'         '2005-02-01'  '2005-02-28'
'Dorothy Gale'   'Garlic Festival'     '2005-02-01'  '2005-02-15'
'Dorothy Gale'   'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Dorothy Gale'   'Year of the Prune'   '2005-02-01'  '2005-11-01'
'Indiana Jones'  'Apple Month'         '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Garlic Festival'     '2005-02-01'  '2005-02-01'
'Indiana Jones'  'Year of the Prune'   '2005-02-01'  '2005-02-01'
'Don Quixote'    'Apple Month'         '2005-02-01'  '2005-02-28'
'Don Quixote'    'Garlic Festival'     '2005-01-15'  '2005-02-15'
'Don Quixote'    'National Pear Week'  '2005-01-01'  '2005-01-07'
'Don Quixote'    'New Year's Day'      '2005-01-01'  '2005-01-01'
'Don Quixote'    'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'Don Quixote'    'Year of the Prune'   '2005-01-01'  '2005-10-01'
'James T. Kirk'  'Apple Month'         '2005-02-01'  '2005-02-28'
'James T. Kirk'  'Garlic Festival'     '2005-02-01'  '2005-02-15'
'James T. Kirk'  'St. Fred's Day'      '2005-02-24'  '2005-02-24'
'James T. Kirk'  'Year of the Prune'   '2005-02-01'  '2005-02-28'
'Santa Claus'    'Christmas Season'    '2005-12-01'  '2005-12-25'
'Santa Claus'    'Year of the Prune'   '2005-12-01'  '2005-12-25'
The Attendees VIEW can be used to compute the total number of room days
for each celebration. Assume that subtracting two dates returns the
number of days between them; one day is added because both the first
and the last day of a stay count as room days:
SELECT celeb_name,
SUM(exited - entered + INTERVAL '1' DAY) AS roomdays
FROM Attendees
GROUP BY celeb_name;
Result
celeb_name            roomdays
==============================
'Apple Month'               85
'Christmas Season'          25
'Garlic Festival'           63
'National Pear Week'         7
'New Year's Day'             1
'St. Fred's Day'             3
'Year of the Prune'        602
If you would like to get a count of the room days sold in the month of
January, you could use this query, which avoids a BETWEEN or OVERLAPS
predicate completely:

SELECT SUM(CASE WHEN depart_date > DATE '2005-01-31'
                THEN DATE '2005-01-31'
                ELSE depart_date END
         - CASE WHEN arrival_date < DATE '2005-01-01'
                THEN DATE '2005-01-01'
                ELSE arrival_date END
         + INTERVAL '1' DAY) AS room_days
  FROM Guests
 WHERE depart_date >= DATE '2005-01-01'
   AND arrival_date <= DATE '2005-01-31';


CHAPTER 14: THE [NOT] IN() PREDICATE

The IN() predicate is very natural. It takes a value and sees whether
that value is in a list of comparable values. Standard SQL allows value
expressions in the list, or lets you use a query to construct the list.
The syntax is:

<in predicate> ::=
    <row value constructor> [NOT] IN <in predicate value>

<in predicate value> ::=
    <table subquery> | (<in value list>)

<in value list> ::=
    <row value expression> { <comma> <row value expression> }...


The expression <row value constructor> NOT IN <in predicate value> has
the same effect as NOT (<row value constructor> IN <in predicate value>).
This pattern for the use of the keyword NOT is found in most of the
other predicates.

The expression <row value constructor> IN <in predicate value> has, by
definition, the same effect as
<row value constructor> = ANY <in predicate value>. Most optimizers will
recognize this and execute the same code for both
expressions. This means that if the <in predicate value> is empty, such
as one you would get from a subquery that returns no rows, no comparison
can be TRUE and the result is FALSE. If the <in predicate value> is
instead an explicit list of NULLs, the predicate reduces to comparisons
of the form (<row value constructor> = NULL), each of which always
evaluates to UNKNOWN, so the result will be UNKNOWN. Please remember
that there is a difference between an empty table and a table with rows
of all NULLs.
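The difference between an empty table and a table of NULLs is easiest to see with NOT IN(). The following is a minimal sketch; the table names Probe, EmptyList, and NullList are hypothetical and exist only to contrast the two cases.

```sql
-- Hypothetical tables, used only to contrast the two cases.
CREATE TABLE Probe (x INTEGER NOT NULL);
CREATE TABLE EmptyList (x INTEGER);
CREATE TABLE NullList (x INTEGER);

INSERT INTO Probe VALUES (1);
INSERT INTO NullList VALUES (NULL);

-- Empty subquery: x IN (...) is FALSE, so NOT IN is TRUE
-- and the row from Probe is returned.
SELECT x
  FROM Probe
 WHERE x NOT IN (SELECT x FROM EmptyList);

-- Subquery of NULLs: x IN (...) is UNKNOWN, NOT UNKNOWN is still
-- UNKNOWN, and the WHERE clause rejects the row, so nothing is returned.
SELECT x
  FROM Probe
 WHERE x NOT IN (SELECT x FROM NullList);
```

The first query returns the single row from Probe; the second returns no rows.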

IN() predicates with a subquery can sometimes be converted into EXISTS
predicates, but there are some problems and differences in the
predicates. The conversion to an EXISTS predicate is often a good way
to improve performance, but it will not be as easy to read as the
original IN() predicate. An EXISTS predicate can use indexes to find (or
fail to find) a single value that confirms (or denies) the predicate,
whereas the IN() predicate often has to build the results of the
subquery in a working table.
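A sketch of such a conversion, borrowing the Personnel and SportTeams table names from the example in Section 14.1. The column layouts (emp_name, team_name) and the sample rows are assumptions made only so that the two forms can be run side by side.

```sql
-- Assumed layouts; only the table names and city_name come from the text.
CREATE TABLE Personnel
(emp_name VARCHAR(30) NOT NULL PRIMARY KEY,
 city_name VARCHAR(30) NOT NULL);

CREATE TABLE SportTeams
(team_name VARCHAR(30) NOT NULL PRIMARY KEY,
 city_name VARCHAR(30) NOT NULL);

INSERT INTO Personnel VALUES ('Alice', 'Chicago');
INSERT INTO Personnel VALUES ('Bob', 'Smallville');
INSERT INTO SportTeams VALUES ('Team One', 'Chicago');
INSERT INTO SportTeams VALUES ('Team Two', 'Chicago');

-- IN() form: the subquery is uncorrelated.
SELECT P1.emp_name
  FROM Personnel AS P1
 WHERE P1.city_name IN (SELECT S1.city_name
                          FROM SportTeams AS S1);

-- EXISTS form: the subquery is correlated on city_name.
SELECT P1.emp_name
  FROM Personnel AS P1
 WHERE EXISTS (SELECT *
                 FROM SportTeams AS S1
                WHERE S1.city_name = P1.city_name);
```

Both queries return 'Alice' exactly once, even though two teams share her city. The two forms agree here because city_name is declared NOT NULL in both tables.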

14.1 Optimizing the IN() Predicate

Most database engines have no statistics about the relative frequency of
the values in a list of constants, so they will scan them in the order in
which they appear in the list. People like to order lists alphabetically or
by magnitude, but it would be better to order the list from most
frequently occurring values to least frequent. It is also pointless to have
duplicate values in the constant list, since the predicate will return TRUE
if it matches the first duplicate it finds, and never get to the second
occurrence. Likewise, if the predicate is FALSE for that value, it wastes
computer time to traverse a needlessly long list.
Many SQL engines perform an IN() predicate with a subquery by building
the result set of the subquery first as a temporary working table, then
scanning that result table from beginning to end. This can be expensive
in many cases; for example, in a query to find employees in a city with
a major sports team (we want them to get tickets for us), we could write
(assuming that city names are unique):

SELECT *
  FROM Personnel
 WHERE city_name
       IN (SELECT city_name
             FROM SportTeams);

But let us further assume that our personnel are located in (n) cities
and the sports teams are in (m) cities, where (m) is much greater than
(n). If the matching cities appear near the front of the list generated
by the subquery expression, it will perform much faster than if they
appear at the end of the list. In the case of a subquery expression, you
have no control over how the subquery is presented back in the
containing query.
However, you can order the expressions in a list in the order in which
they are most likely to occur, such as:


SELECT *
  FROM Personnel
 WHERE city_name
       IN ('New York', 'Chicago', 'Atlanta', ..., 'Austin');

Incidentally, Standard SQL allows row expression comparisons, so if
you have a Standard SQL implementation with separate columns for the
city and state, you could write:

SELECT *
  FROM Personnel
 WHERE (city_name, state)
       IN (SELECT city_name, state
             FROM SportTeams);

Teradata did not get correlated subqueries until 1996, so its
programmers often used this syntax as a workaround. I am not sure if
you should count them as being ahead or behind the technology for that.
Today, all major versions of SQL remove duplicates in the result table
of the subquery, so you do not have to use a SELECT DISTINCT in the
subquery. You might still see SELECT DISTINCT used this way in legacy
code.

A trick that can work for large lists on some products is to force the
engine to construct a list ordered by frequency. This involves first
constructing a VIEW that has an ORDER BY clause; this practice is not
part of the SQL standard, which does not allow a VIEW to have an
ORDER BY clause. For example, a paint company wants to find all the
products offered by their competitors who use the same color as one of
their products. First construct a VIEW that orders the colors by
frequency of appearance:

CREATE VIEW PopColor (color, tally)
AS SELECT color, COUNT(*) AS tally
     FROM Paints
    GROUP BY color
    ORDER BY tally DESC;


Then go to the Competitor data and do a simple column SELECT on the
VIEW, thus:

SELECT *
FROM Competitor
WHERE color IN (SELECT color FROM PopColor);

The VIEW is grouped, so it will be materialized in sort order. The
subquery will then be executed and (we hope) the sort order will be
maintained and passed along to the IN() predicate.

Another trick is to replace the IN() predicate with a JOIN operation.
For example, you have a table of restaurant telephone numbers and a
guidebook, and you want to pick out the four-star places, so you write
this query:

SELECT restaurant_name, phone_nbr
FROM Restaurants
WHERE restaurant_name
IN (SELECT restaurant_name
FROM QualityGuide
WHERE stars = 4);

If there is an index on QualityGuide.stars, the SQL engine will
probably build a temporary table of the four-star places and pass it on to
the outer query. The outer query will then handle it as if it were a list of
constants.
However, this is not the sort of column that you would normally
index. Without an index on stars, the engine will simply do a sequential
search of the QualityGuide table. This query can be replaced with a JOIN
query, thus:

SELECT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, QualityGuide
 WHERE stars = 4
   AND Restaurants.restaurant_name = QualityGuide.restaurant_name;

This query should run faster, since restaurant_name is a key for both
tables and will be indexed to ensure uniqueness. However, a JOIN written
this way can return duplicate rows when the joined column is not a key
in both tables, which you can handle with a SELECT DISTINCT. Consider a
more budget-minded query, where we want places with a meal that costs
$10 or less, and the menu guidebook lists all the meals. The query looks
about the same:

SELECT restaurant_name, phone_nbr
FROM Restaurants
WHERE restaurant_name
IN (SELECT restaurant_name
FROM MenuGuide
WHERE price <= 10.00);

And you would expect to be able to replace it with:

SELECT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name;


Every item in Murphy’s Two-Dollar Hash House will get a line in the
results of the JOINed version. However, this can be fixed by changing
the SELECT to a SELECT DISTINCT on the same columns, but it will cost
more time to do a sort to remove the duplicates. There is no good
general advice, except to experiment with your particular product.
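The duplicate-row effect can be reproduced with a very small data set. This is only a sketch: the column layouts, the meal_name column, the phone number, and the three menu rows are all made up for illustration.

```sql
-- Assumed layouts and invented sample rows.
CREATE TABLE Restaurants
(restaurant_name VARCHAR(40) NOT NULL PRIMARY KEY,
 phone_nbr VARCHAR(12) NOT NULL);

CREATE TABLE MenuGuide
(restaurant_name VARCHAR(40) NOT NULL,
 meal_name VARCHAR(30) NOT NULL,
 price DECIMAL(6,2) NOT NULL,
 PRIMARY KEY (restaurant_name, meal_name));

INSERT INTO Restaurants
VALUES ('Murphy''s Two-Dollar Hash House', '555-0100');

INSERT INTO MenuGuide
VALUES ('Murphy''s Two-Dollar Hash House', 'Hash', 2.00);
INSERT INTO MenuGuide
VALUES ('Murphy''s Two-Dollar Hash House', 'Eggs', 2.00);
INSERT INTO MenuGuide
VALUES ('Murphy''s Two-Dollar Hash House', 'Coffee', 1.00);

-- IN() version: the restaurant appears once.
SELECT restaurant_name, phone_nbr
  FROM Restaurants
 WHERE restaurant_name
       IN (SELECT restaurant_name
             FROM MenuGuide
            WHERE price <= 10.00);

-- JOIN version: one row per qualifying meal, so three rows.
SELECT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name;

-- JOIN with DISTINCT: back to a single row.
SELECT DISTINCT Restaurants.restaurant_name, phone_nbr
  FROM Restaurants, MenuGuide
 WHERE price <= 10.00
   AND Restaurants.restaurant_name = MenuGuide.restaurant_name;
```

With these rows the three queries return one, three, and one row respectively.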
The NOT IN() predicate is probably better replaced with a NOT EXISTS
predicate. Using the restaurant example again, our friend John has a
list of eateries and we want to see those that are not in the
guidebook. The natural formulation of the query is:


SELECT *
FROM JohnsBook
WHERE restaurant_name
NOT IN (SELECT restaurant_name
FROM QualityGuide);

But you can write the same query with a NOT EXISTS predicate and it
will probably run faster:
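The query itself is cut off in this excerpt. The following is a sketch of how such a rewrite is commonly written, not necessarily the author's own version; the table layouts and the sample rows are assumptions.

```sql
-- Assumed layouts and invented sample rows.
CREATE TABLE JohnsBook
(restaurant_name VARCHAR(40) NOT NULL PRIMARY KEY);

CREATE TABLE QualityGuide
(restaurant_name VARCHAR(40) NOT NULL PRIMARY KEY,
 stars INTEGER NOT NULL);

INSERT INTO JohnsBook VALUES ('Murphy''s Two-Dollar Hash House');
INSERT INTO JohnsBook VALUES ('Chez Hypothetical');
INSERT INTO QualityGuide VALUES ('Chez Hypothetical', 4);

-- Keep the rows of JohnsBook for which no matching
-- guidebook entry can be found.
SELECT *
  FROM JohnsBook AS J1
 WHERE NOT EXISTS
       (SELECT *
          FROM QualityGuide AS Q1
         WHERE Q1.restaurant_name = J1.restaurant_name);
```

With these rows the query returns only 'Murphy''s Two-Dollar Hash House'. The NOT IN() and NOT EXISTS forms agree as long as QualityGuide.restaurant_name cannot be NULL; if it can, NOT IN() may return no rows at all, while NOT EXISTS still returns the unmatched restaurants.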
