Joe Celko's SQL for Smarties: Advanced SQL Programming

462 CHAPTER 21: AGGREGATE FUNCTIONS
or:
SELECT P2.dept_nbr, MIN(P1.salary_amt)
FROM Personnel AS P1, Personnel AS P2
WHERE P1.dept_nbr = P2.dept_nbr
AND P1.salary_amt >= P2.salary_amt
GROUP BY P2.dept_nbr, P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3;
21.4.4 GREATEST() and LEAST() Functions
Oracle has a proprietary pair of functions that return the greatest and least
values, respectively: a sort of “horizontal” MAX() and MIN(). The syntax is
GREATEST (<list of values>) and LEAST (<list of values>). Awkwardly, DB2
allows MIN and MAX as synonyms for LEAST and GREATEST.
If you have NULLs, then you have to decide whether they sort high or low, and
whether they will be excluded or will propagate the NULL, so you can define
this function several ways.
If you don’t have NULLs in the data:
CASE WHEN col1 > col2
THEN col1 ELSE col2 END
If you want the highest non-NULL value:
CASE WHEN col1 > col2
THEN col1 ELSE COALESCE(col2, col1) END
If you want to return NULL where one of the cols is NULL:
CASE WHEN col1 > col2 OR col1 IS NULL
THEN col1 ELSE col2 END
But for the rest of this section, let’s assume (a < b) and NULL is high:


GREATEST (a, b) = b
GREATEST (a, NULL) = NULL
GREATEST (NULL, b) = NULL
GREATEST (NULL, NULL) = NULL
21.4 Extrema Functions 463
We can write this as:
GREATEST(x, y) ::= CASE WHEN (x > y) OR (x IS NULL)
THEN x
ELSE y END
The rules for LEAST() are:
LEAST (a, b) = a
LEAST (a, NULL) = a
LEAST (NULL, b) = b
LEAST (NULL, NULL) = NULL
This is written:
LEAST(x, y) ::= CASE WHEN (COALESCE (x, y) <= COALESCE (y, x))
THEN COALESCE (x, y)
ELSE COALESCE (y, x) END
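A quick way to convince yourself that a two-argument definition matches its rule table is to run it over a handful of sample pairs. The following is only a sketch: the derived table T and its column names x and y are made up for the test, GREATEST is written with the NULL-propagating CASE form shown earlier for two columns, and the CAST on each NULL is there because some products cannot infer a data type for a bare NULL in a VALUES list.

```sql
-- Sketch: check the two-argument definitions against sample pairs.
-- T, x and y are illustrative names, not part of any schema in this chapter.
SELECT x, y,
       CASE WHEN (x > y) OR (x IS NULL)
            THEN x ELSE y END AS greatest_null_high,
       CASE WHEN COALESCE (x, y) <= COALESCE (y, x)
            THEN COALESCE (x, y)
            ELSE COALESCE (y, x) END AS least_skip_nulls
  FROM (VALUES (1, 2),
               (2, 1),
               (1, CAST (NULL AS INTEGER)),
               (CAST (NULL AS INTEGER), 2),
               (CAST (NULL AS INTEGER), CAST (NULL AS INTEGER)))
       AS T(x, y);
```

The first two rows should return 2 and 1 in either argument order. The rows with a single NULL should return NULL for the greatest value and the non-NULL argument for the least value, and the all-NULL row should return NULL in both columns.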
This can be done in Standard SQL, but takes a little bit of work. Let’s
assume that we have a table that holds the scores for a player in a series
of five games and we want to get his best score from all five games.
CREATE TABLE Games
(player CHAR(10) NOT NULL PRIMARY KEY,
score_1 INTEGER NOT NULL DEFAULT 0,
score_2 INTEGER NOT NULL DEFAULT 0,
score_3 INTEGER NOT NULL DEFAULT 0,
score_4 INTEGER NOT NULL DEFAULT 0,
score_5 INTEGER NOT NULL DEFAULT 0);
and we want to find the GREATEST (score_1, score_2, score_3,
score_4, score_5).

SELECT player, MAX(CASE X.seq_nbr
WHEN 1 THEN score_1
WHEN 2 THEN score_2
WHEN 3 THEN score_3
WHEN 4 THEN score_4
WHEN 5 THEN score_5
ELSE NULL END) AS best_score
FROM Games
CROSS JOIN
(VALUES (1), (2), (3), (4), (5)) AS X(seq_nbr)
GROUP BY player;
Another approach is to use a pure CASE expression:

CASE
WHEN score_1 >= score_2 AND score_1 >= score_3
     AND score_1 >= score_4 AND score_1 >= score_5
THEN score_1
WHEN score_2 >= score_3 AND score_2 >= score_4
     AND score_2 >= score_5
THEN score_2
WHEN score_3 >= score_4 AND score_3 >= score_5
THEN score_3
WHEN score_4 >= score_5
THEN score_4
ELSE score_5
END
A final trick is to use a bit of algebra. You can define:
GREATEST(a, b) ::= (a + b + ABS(a - b)) / 2
LEAST(a, b) ::= (a + b - ABS(a - b)) / 2

Then iterate on it as a recurrence relation on numeric values. For
example, for three items, you can use GREATEST (GREATEST(a, b), c),
which expands to:
((a + b) + ABS(a - b)
+ 2 * c + ABS((a + b) + ABS(a - b)
- 2 * c))/4
You need to watch for possible overflow errors if the numbers are
large, and NULLs propagate in the math functions. Here is the answer for
five scores.
(score_1 + score_2 + 2*score_3 + 4*score_4 + 8*score_5
 + ABS(score_1 - score_2)
 + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
 + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
       + ABS(score_1 - score_2)
       + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2)))
 + ABS(score_1 + score_2 + 2*score_3 + 4*score_4 - 8*score_5
       + ABS(score_1 - score_2)
       + ABS((score_1 + score_2) + ABS(score_1 - score_2) - 2*score_3)
       + ABS(score_1 + score_2 + 2*score_3 - 4*score_4
             + ABS(score_1 - score_2)
             + ABS((score_1 + score_2 - 2*score_3) + ABS(score_1 - score_2))))) / 16
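If the fully expanded formula is too hard to read or maintain, the same recurrence can be staged through a derived table, one step per level. This is a minimal sketch against the Games table for the first three scores only; the names G12, g12, and best_of_three are invented for the illustration. Integer division is safe here because a + b + ABS(a - b) is always twice the larger value.

```sql
-- Sketch: GREATEST(GREATEST(score_1, score_2), score_3) in two steps.
-- G12 and g12 are illustrative names for the intermediate result.
SELECT G12.player,
       (G12.g12 + G12.score_3 + ABS(G12.g12 - G12.score_3)) / 2
         AS best_of_three
  FROM (SELECT player, score_3,
               (score_1 + score_2 + ABS(score_1 - score_2)) / 2 AS g12
          FROM Games) AS G12;
```

Each further score adds one more nesting level of the same shape, and the overflow warning still applies to each intermediate sum.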
21.5 The LIST() Aggregate Function
The LIST([DISTINCT] <string expression>) function is part of Sybase’s
SQL Anywhere (formerly WATCOM SQL). It is the only aggregate
function to work on character strings. It takes a column of strings,
removes the NULLs, and merges them into a single result string with
commas between each of the original strings. The DISTINCT option
removes duplicates as well as NULLs before concatenating the strings.
This function is a generalized version of concatenation, just as SUM() is a
generalized version of addition.

MySQL 4.1 extended this function into the GROUP_CONCAT()
function, which does the same thing but adds options for ORDER BY and
SEPARATOR.
This is handy when you use SQL to write SQL queries. As one simple
example, you can apply it against the schema tables and obtain the
names of all the columns in a table, then use that list to expand a
SELECT * into the current column list.
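As a rough illustration of SQL writing SQL, here is how the column list might be generated in MySQL. This is a sketch under several assumptions: it uses MySQL's GROUP_CONCAT(), CONCAT(), and DATABASE() functions together with the INFORMATION_SCHEMA.COLUMNS view, and the table name 'Clothes' and the alias expanded_query are only placeholders. Other products expose their schema tables differently.

```sql
-- Sketch (MySQL dialect): build an explicit column list from the schema tables.
-- 'Clothes' and expanded_query are placeholder names.
SELECT CONCAT('SELECT ',
              GROUP_CONCAT(column_name
                           ORDER BY ordinal_position
                           SEPARATOR ', '),
              ' FROM Clothes;') AS expanded_query
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND table_name = 'Clothes';
```

The ORDER BY on ordinal_position keeps the generated list in the same order that SELECT * would have used.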
One nonproprietary way of doing this query is with scalar subquery
expressions. Assume we have these two tables:
CREATE TABLE People
(person_id INTEGER NOT NULL PRIMARY KEY,
person_name CHAR(10) NOT NULL);
INSERT INTO People
VALUES (1, 'John'), (2, 'Mary'), (3, 'Fred'), (4, 'Jane');
CREATE TABLE Clothes
(person_id INTEGER NOT NULL,
seq_nbr INTEGER NOT NULL,
item_name CHAR(10) NOT NULL,
worn_flag CHAR(1) NOT NULL
CONSTRAINT worn_flag_yes_no
CHECK (worn_flag IN ('Y', 'N')),
PRIMARY KEY (person_id, seq_nbr));
INSERT INTO Clothes
VALUES (1, 1, 'Hat', 'Y'),
(1, 2, 'Coat', 'N'),
(1, 3, 'Glove', 'Y'),
(2, 1, 'Hat', 'Y'),
(2, 2, 'Coat', 'Y'),
(3, 1, 'Shoes', 'N'),
(4, 1, 'Pants', 'N'),
(4, 2, 'Socks', 'Y');

Using the LIST() function, we could get an output of the outfits of
the people with the simple query:
SELECT P0.person_id, P0.person_name, LIST(item_name) AS fashion
FROM People AS P0, Clothes AS C0
WHERE P0.person_id = C0.person_id
AND C0.worn_flag = 'Y'
GROUP BY P0.person_id, P0.person_name;
Result
id name fashion
=======================
1 'John' 'Hat, Glove'
2 'Mary' 'Hat, Coat'
4 'Jane' 'Socks'
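For readers on MySQL, the same report might be written with GROUP_CONCAT(). This is a sketch, not a tested port: it assumes the name column in People is called person_name and that Clothes is joined on its person_id column, and it uses the ORDER BY and SEPARATOR options to fix the item order and put a comma and a space between items.

```sql
-- Sketch (MySQL dialect): the LIST() report rewritten with GROUP_CONCAT().
-- Assumes People.person_name and Clothes.person_id as the column names.
SELECT P0.person_id, P0.person_name,
       GROUP_CONCAT(C0.item_name
                    ORDER BY C0.seq_nbr
                    SEPARATOR ', ') AS fashion
  FROM People AS P0, Clothes AS C0
 WHERE P0.person_id = C0.person_id
   AND C0.worn_flag = 'Y'
 GROUP BY P0.person_id, P0.person_name;
```

Because the WHERE clause filters on worn_flag before grouping, a person with nothing marked as worn drops out of the result entirely, just as in the LIST() version.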
21.5.1 The LIST() Function with a Procedure
To do this without an aggregate function, you must first know the
highest sequence number, so you can create the query. In this case, the
query is a simple “SELECT MAX(seq_nbr) FROM Clothes” statement, but you
might have to use a COUNT(*) for other tables.
SELECT DISTINCT P0.person_id, P0.person_name,
       -- COALESCE covers a missing seq_nbr, where the scalar subquery is NULL
       SUBSTRING (
         COALESCE ((SELECT CASE WHEN C1.worn_flag = 'Y'
                           THEN (', ' || C1.item_name) ELSE '' END
                      FROM Clothes AS C1
                     WHERE C1.person_id = C0.person_id
                       AND C1.seq_nbr = 1), '') ||
         COALESCE ((SELECT CASE WHEN C2.worn_flag = 'Y'
                           THEN (', ' || C2.item_name) ELSE '' END
                      FROM Clothes AS C2
                     WHERE C2.person_id = C0.person_id
                       AND C2.seq_nbr = 2), '') ||
         COALESCE ((SELECT CASE WHEN C3.worn_flag = 'Y'
                           THEN (', ' || C3.item_name) ELSE '' END
                      FROM Clothes AS C3
                     WHERE C3.person_id = C0.person_id
                       AND C3.seq_nbr = 3), '')
         FROM 3) AS list
  FROM People AS P0, Clothes AS C0
 WHERE P0.person_id = C0.person_id;
id name list
===========================
1 John Hat, Glove
2 Mary Hat, Coat
3 Fred
4 Jane Socks
Again, the CASE expression on worn_flag can be replaced with an IS
NULL test to replace NULLs with an empty string. If you don’t want to see that
Fred is naked (he has an empty string of clothing), then change the
outermost WHERE clause to read:

WHERE P0.person_id = C0.person_id AND C0.worn_flag = 'Y';
Since you don’t want to see a leading comma, remember to TRIM() it
off or to use the SUBSTRING() function to remove the first two
characters. I opted for the SUBSTRING(), because the TRIM() function
requires a scan of the string.
21.5.2 The LIST() Function by Crosstabs
Carl Federl used this to get a similar result:
CREATE TABLE Crosstabs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
seq_nbr_1 INTEGER NOT NULL,
seq_nbr_2 INTEGER NOT NULL,
seq_nbr_3 INTEGER NOT NULL,
seq_nbr_4 INTEGER NOT NULL,
seq_nbr_5 INTEGER NOT NULL);
INSERT INTO Crosstabs
VALUES (1, 1, 0, 0, 0, 0),
(2, 0, 1, 0, 0, 0),
(3, 0, 0, 1, 0, 0),
(4, 0, 0, 0, 1, 0),
(5, 0, 0, 0, 0, 1);
SELECT Clothes.person_id,
TRIM (MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_1 * 10))
|| ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_2 * 10))
|| ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_3 * 10))
|| ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_4 * 10))
|| ' ' || MAX(SUBSTRING(item_name FROM 1 FOR seq_nbr_5 * 10)))
FROM Clothes, Crosstabs
WHERE Clothes.seq_nbr = Crosstabs.seq_nbr
AND Clothes.worn_flag = 'Y'
GROUP BY Clothes.person_id;
21.6 The PRD() Aggregate Function
Bob McGowan sent me a message on CompuServe asking for help with a
problem. His client, a financial institution, tracks investment
performance with a table something like this:
CREATE TABLE Performance
(portfolio_id CHAR(7) NOT NULL,
execute_date DATE NOT NULL,
rate_of_return DECIMAL(13,7) NOT NULL);
To calculate a rate of return over a date range, you use the formula:
(1 + rate_of_return [day_1])
* (1 + rate_of_return [day_2])
* (1 + rate_of_return [day_3])
* (1 + rate_of_return [day_4])
* ...
* (1 + rate_of_return [day_N])
How would you construct a query that would return one row for each
portfolio’s return over the date range? What Mr. McGowan really wants is
an aggregate function in the SELECT clause to return a columnar
product, like the SUM() returns a columnar total.
If you were a math major, you would write these functions as capital
Sigma (∑) for summation and capital Pi (∏) for product. If such an
aggregate function existed in SQL, the syntax for it would look
something like:
PRD ([DISTINCT] <expression>)
While I am not sure that there is any use for the DISTINCT option,
the new aggregate function would let us write his problem simply as:
SELECT portfolio_id, PRD(1.00 + rate_of_return)
FROM Performance
WHERE execute_date BETWEEN start_date AND end_date
GROUP BY portfolio_id;
21.6.1 PRD() Function by Expressions
There is a trick to doing this, but you need a second table that looks like
this and covers a period of five days:
CREATE TABLE BigPi
(execute_date DATE NOT NULL,
day_1 INTEGER NOT NULL,
day_2 INTEGER NOT NULL,
day_3 INTEGER NOT NULL,
day_4 INTEGER NOT NULL,
day_5 INTEGER NOT NULL);
Let’s assume we wanted to look at January 6 to 10, so we need to
update the execute_date column to that range, thus:
INSERT INTO BigPi
VALUES ('2006-01-06', 1, 0, 0, 0, 0),
('2006-01-07', 0, 1, 0, 0, 0),
('2006-01-08', 0, 0, 1, 0, 0),
('2006-01-09', 0, 0, 0, 1, 0),
('2006-01-10', 0, 0, 0, 0, 1);
The idea is that there is a one in the column when BigPi.execute_date
is equal to the nth date in the range, and a zero otherwise. The query for
this problem is:
SELECT portfolio_id,
(SUM((1.00 + P1.rate_of_return) * M1.day_1) *
SUM((1.00 + P1.rate_of_return) * M1.day_2) *
SUM((1.00 + P1.rate_of_return) * M1.day_3) *
SUM((1.00 + P1.rate_of_return) * M1.day_4) *
SUM((1.00 + P1.rate_of_return) * M1.day_5)) AS product
FROM Performance AS P1, BigPi AS M1
WHERE M1.execute_date = P1.execute_date
AND P1.execute_date BETWEEN '2006-01-06' AND '2006-01-10'
GROUP BY portfolio_id;
If anyone is missing a rate_of_return entry on a date in that range, his
or her product will be zero. That might be fine, but if you needed to get a
NULL when you have missing data, then replace each SUM() expression
with a CASE expression like this:
CASE WHEN SUM((1.00 + P1.rate_of_return) * M1.day_N) = 0.00
THEN CAST (NULL AS DECIMAL(6, 4))
ELSE SUM((1.00 + P1.rate_of_return) * M1.day_N)
END
Alternately, if your SQL has the full SQL set of expressions, use this
version, which returns NULL when the sum is zero:
NULLIF (SUM((1.00 + P1.rate_of_return) * M1.day_N), 0.00)
21.6.2 The PRD() Aggregate Function by Logarithms
Roy Harvey, another SQL guru who answered questions on
CompuServe, found a different solution—one that could only come
from someone old enough to remember slide rules and multiplication by
adding logs. The nice part of this solution is that you can also use the
DISTINCT option in the SUM() function.

But there are a lot of warnings about this approach. Some older SQL
implementations might have trouble with using an aggregate function
result as a parameter. This has always been part of the standard, but
some SQL products use very different mechanisms for the aggregate
functions.
Another, more fundamental problem is that a log of zero or less is
undefined, so your SQL might return a NULL or an error message. You
will also see some SQL products that use LN() for the natural log and
LOG10() for the logarithm base ten, and some SQLs that use
LOG(<parameter>, <base>) for a general logarithm function.
Given all those warnings, the expression for the product of a column
from logarithm and exponential functions is:
SELECT ((EXP (SUM (LN (CASE WHEN nbr = 0.00
                           THEN CAST (NULL AS FLOAT)
                           ELSE ABS(nbr) END))))
        * (CASE WHEN MIN (ABS (nbr)) = 0.00
                THEN 0.00
                ELSE 1.00 END)
        * (CASE WHEN MOD (SUM (CASE WHEN SIGN(nbr) = -1
                                    THEN 1
                                    ELSE 0 END), 2) = 1
                THEN -1.00
                ELSE 1.00 END)) AS big_pi
  FROM NumberTable;
The nice part of this is that you can also use the SUM (DISTINCT
<expression>) option to get the equivalent of PRD (DISTINCT
<expression>).
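Applied to the Performance table from the start of this section, the logarithm trick gets much shorter if you can assume that every rate_of_return is greater than −1, so that (1.00 + rate_of_return) is always positive and the zero and sign handling can be dropped. The following is a sketch under that assumption. It also assumes your product spells the natural logarithm LN() and accepts DATE literals, and it reuses the January 6 to 10 range from the earlier example.

```sql
-- Sketch: compounded return per portfolio via logarithms.
-- Assumes rate_of_return > -1 on every row, so LN() never sees zero or less.
SELECT portfolio_id,
       EXP (SUM (LN (1.00 + rate_of_return))) AS product
  FROM Performance
 WHERE execute_date BETWEEN DATE '2006-01-06' AND DATE '2006-01-10'
 GROUP BY portfolio_id;
```

Unlike the BigPi approach, this version needs no auxiliary table and works for any number of days in the range, at the cost of floating-point rounding in the result.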

You should watch the data type of the column involved and use either
integer 0 and 1 or decimal 0.00 and 1.00 as is appropriate in the CASE
expressions. It is worth studying the three CASE expressions that make up
the terms of the product calculation.
The first CASE expression ensures that every zero is converted to a NULL
and every negative number to its absolute value before the LN() function
sees it, just in case your SQL raises an exception for a logarithm of zero
or less.
The second CASE expression will return zero as the answer if there is
a zero in the nbr column of any selected row. The MIN(ABS(nbr))
trick is handy for detecting the existence of a zero in a list of both
positive and negative numbers with an aggregate function.
The third CASE expression will return −1 if there is an odd number of
negative numbers in the nbr column. The innermost CASE expression
uses a SIGN() function, which returns +1 for a positive number, −1 for
a negative number, and 0 for a zero. The SUM() counts the −1 results,
