Joe Celko's SQL for Smarties - Advanced SQL Programming

452 CHAPTER 21: AGGREGATE FUNCTIONS
'Charles' 900.00
'Delta' 800.00
'Eddy' 700.00
'Fred' 700.00
'George' 700.00
Able, Baker, and Charles are the three highest paid personnel, but
$1,000.00, $900.00, and $800.00 are the three highest salaries. The
highest salaries belong to Able, Baker, Charles and Delta—a set with four
elements.
Most new SQL programmers solve this problem by producing a result with
an ORDER BY clause, then reading the first so many rows from that cursor
result. In Standard SQL, cursors have an ORDER BY clause but no way to
return a fixed number of rows. However, most SQL products have
proprietary syntax to clip the result set at exactly some number of rows.
Oh, yes, did I mention that the whole table has to be sorted, and that
this can take some time if the table is large?
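To make the difference between "top three rows" and "top three values" concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table layout (emp_name, salary_amt) is an assumption for illustration, and SQLite's LIMIT clause stands in for the proprietary row-clipping syntax mentioned above; it is not Standard SQL. The rows are the sample data used in this section.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)

# Clip the sorted result at three rows: this answers "top three people".
top_rows = conn.execute(
    "SELECT emp_name, salary_amt FROM Personnel "
    "ORDER BY salary_amt DESC LIMIT 3").fetchall()

# Clip the sorted DISTINCT values: this answers "top three salaries".
top_salaries = [row[0] for row in conn.execute(
    "SELECT DISTINCT salary_amt FROM Personnel "
    "ORDER BY salary_amt DESC LIMIT 3")]

top_names = {name for name, _ in top_rows}
```

The two results describe different sets: the clipped rows stop at Charles, while the clipped values reach down to the 800.00 salary that belongs to Delta.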
The best algorithm for this problem is the Partition algorithm by C.
A. R. Hoare. This is the procedure in QuickSort that splits a set of values
into three partitions—those greater than a pivot value, those less than
the pivot and those values equal to the pivot. The expected run time is
only (2*n) operations.
In practice, it is a good idea to start with a pivot at or near the kth
position you seek, because real data tends to have some ordering already
in it. If the file is already in sorted order, this trick will return an answer
in one pass. Here is the algorithm in Pascal.
CONST
  list_length = { some large number };

TYPE
  LIST = ARRAY [1..list_length] OF REAL;

{ Swap is not shown in the original text; this helper is assumed. }
PROCEDURE Swap (VAR a, b: REAL);
VAR temp: REAL;
BEGIN
  temp := a;
  a := b;
  b := temp;
END;

PROCEDURE FindTopK (Kth: INTEGER; VAR records: LIST);
VAR pivot: REAL;
    left, right, start, finish: INTEGER;
BEGIN
  start := 1;
  finish := list_length;
  WHILE start < finish
  DO BEGIN
    pivot := records[Kth];
    left := start;
    right := finish;
    REPEAT
      { skip values that are already on the correct side of the pivot }
      WHILE (records[left] > pivot) DO left := left + 1;
      WHILE (records[right] < pivot) DO right := right - 1;
      IF (left <= right)
      THEN BEGIN { swap right and left elements }
        Swap (records[left], records[right]);
        left := left + 1;
        right := right - 1;
      END;
    UNTIL (left > right);
    IF (right < Kth) THEN start := left;
    IF (left > Kth) THEN finish := right;
  END;
  { the first k numbers are in positions 1 through kth, in no
    particular order except that the kth highest number is in
    position kth }
END;
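For readers who do not have a Pascal compiler at hand, the same partitioning idea can be sketched in Python. This is an illustrative translation under a few assumptions: the function name find_top_k is made up here, indexes are 0-based, the input is copied instead of being modified in place, and k is expected to lie between 1 and the length of the list.

```python
def find_top_k(values, k):
    """Return a rearranged copy of values whose first k items are the k largest.

    The first k items come back in no particular order, except that the
    kth largest value sits at index k - 1 (0-based).
    """
    records = list(values)
    kth = k - 1
    start, finish = 0, len(records) - 1
    while start < finish:
        pivot = records[kth]
        left, right = start, finish
        while left <= right:
            # Skip values that are already on the correct side of the pivot.
            while records[left] > pivot:
                left += 1
            while records[right] < pivot:
                right -= 1
            if left <= right:
                records[left], records[right] = records[right], records[left]
                left += 1
                right -= 1
        # Narrow the search to the part that still contains position kth.
        if right < kth:
            start = left
        if left > kth:
            finish = right
    return records


salaries = [700.0, 900.0, 1000.0, 700.0, 800.0, 900.0, 700.0]
arranged = find_top_k(salaries, 3)
top_three = sorted(arranged[:3], reverse=True)
```

Only the slice arranged[:3] is meaningful as "the top three"; the rest of the list is left in whatever order the partitioning passes produced.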
The original articles in Explain magazine gave several solutions
(Murchison n.d.; Wankowski n.d.).
One involved UNION operations on nested subqueries. The first result
table was the maximum for the whole table; the second result table was
the maximum for the table entries less than the first maximum; and so
forth. The pattern is extensible. It looked like this:
SELECT MAX(salary)
FROM Personnel
UNION
SELECT MAX(salary)
FROM Personnel
WHERE salary < (SELECT MAX(salary)
FROM Personnel)
UNION
SELECT MAX(salary)
FROM Personnel
WHERE salary < (SELECT MAX(salary)
FROM Personnel
WHERE salary
< (SELECT MAX(salary) FROM Personnel));
This answer can give you a pretty serious performance problem
because of the subquery nesting and the UNION operations. Every
UNION will trigger a sort to remove duplicate rows from the results, since
salary is not a UNIQUE column.
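The UNION pattern can be tried out on the sample data with a short Python script and the standard sqlite3 module. This is only a sketch: the two-column Personnel layout is assumed, and SQLite is used because it ships with Python, not because the original articles used it.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)

# Each branch finds the maximum below the maximum of the previous branch.
UNION_QUERY = """
SELECT MAX(salary) FROM Personnel
UNION
SELECT MAX(salary) FROM Personnel
 WHERE salary < (SELECT MAX(salary) FROM Personnel)
UNION
SELECT MAX(salary) FROM Personnel
 WHERE salary < (SELECT MAX(salary) FROM Personnel
                  WHERE salary < (SELECT MAX(salary) FROM Personnel))
"""

three_highest = sorted((row[0] for row in conn.execute(UNION_QUERY)),
                       reverse=True)
```

Each extra level adds one more UNION branch and one more layer of nesting, which is where the cost comes from.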

A special case of the use of the scalar subquery with the MAX()
function is finding the last two values in a set to look for a change. This is
most often done with date values for time series work. For example, to
get the last two reviews for an employee:
SELECT :search_name, MAX(P1.review_date), P2.review_date
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.emp_name = :search_name
   AND P2.emp_name = :search_name
   AND P1.review_date < P2.review_date
   AND P2.review_date = (SELECT MAX(review_date)
                           FROM Personnel
                          WHERE emp_name = :search_name)
 GROUP BY P2.review_date;
The scalar subquery is not correlated, so it should run pretty fast and
be executed only once.
An improvement on the UNION approach was to find the third
highest salary with a subquery, then return all the records with salaries
that were equal or higher; this would handle ties. It looked like this:
SELECT DISTINCT salary
FROM Personnel
WHERE salary >=
(SELECT MAX(salary)
FROM Personnel
WHERE salary < (SELECT MAX(salary)
FROM Personnel
WHERE salary <
(SELECT MAX(salary) FROM Personnel)));
Another answer was to use correlation names and return a single-row
result table. This pattern is more easily extensible to larger groups; it also
presents the results in sorted order without requiring the use of an
ORDER BY clause. The disadvantage of this answer is that it will return a
single row, not a column result. That might make it unusable for joining
to other queries. It looked like this:
21.4 Extrema Functions 455
SELECT MAX(P1.salary_amt), MAX(P2.salary_amt),
MAX(P3.salary_amt)
FROM Personnel AS P1, Personnel AS P2, Personnel AS P3
WHERE P1.salary_amt > P2.salary_amt
AND P2.salary_amt > P3.salary_amt;

This approach will return the three highest salaries.
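As a quick check of the single-row pattern, the following Python sketch runs the three-way self-join against the sample data with the standard sqlite3 module. The table layout is assumed for illustration.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)

# P1 > P2 > P3 forces three different salary levels in every joined row,
# so the three MAX() values are the three highest distinct salaries.
SINGLE_ROW_QUERY = """
SELECT MAX(P1.salary_amt), MAX(P2.salary_amt), MAX(P3.salary_amt)
  FROM Personnel AS P1, Personnel AS P2, Personnel AS P3
 WHERE P1.salary_amt > P2.salary_amt
   AND P2.salary_amt > P3.salary_amt
"""

single_row = conn.execute(SINGLE_ROW_QUERY).fetchone()
```

The answer comes back as one row of three columns, already in descending order, which is convenient for display but awkward to join against.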
The best variation on the single row approach is done with the scalar
subquery expressions in SQL. The query becomes:
SELECT (SELECT MAX (salary)
          FROM Personnel) AS s1,
       (SELECT MAX (salary)
          FROM Personnel
         WHERE salary NOT IN (s1)) AS s2,
       (SELECT MAX (salary)
          FROM Personnel
         WHERE salary NOT IN (s1, s2)) AS s3,
       ...
       (SELECT MAX (salary)
          FROM Personnel
         WHERE salary NOT IN (s1, s2, ..., s[n-1])) AS sn
  FROM Dummy;
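In this case, the table Dummy can be any table that holds exactly one
row. An empty table would return no rows at all, and a table with several
rows would repeat the same answer once per row.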
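Many products will not let one column of a SELECT list refer to the alias of another, so the s1, s2 references are best read as shorthand. One way to spell the pattern out is to generate the nested subqueries mechanically. The sketch below does that in Python with the standard sqlite3 module; the helper names nth_highest_sql and top_salaries_row, the table layout, and the one-row Dummy table are all assumptions made for this illustration.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)
conn.execute("CREATE TABLE Dummy (x INTEGER NOT NULL)")
conn.execute("INSERT INTO Dummy VALUES (1)")  # exactly one row


def nth_highest_sql(n):
    """Build the text of a scalar subquery for the nth highest distinct salary."""
    sql = "SELECT MAX(salary) FROM Personnel"
    for _ in range(n - 1):
        # Each level keeps only salaries below the maximum of the level inside it.
        sql = f"SELECT MAX(salary) FROM Personnel WHERE salary < ({sql})"
    return sql


def top_salaries_row(connection, n):
    """Return one row holding the n highest distinct salaries as columns s1..sn."""
    columns = ", ".join(f"({nth_highest_sql(i)}) AS s{i}"
                        for i in range(1, n + 1))
    return connection.execute(f"SELECT {columns} FROM Dummy").fetchone()


top_three_row = top_salaries_row(conn, 3)
```

The generated text grows quickly with n, but it contains no correlated references, so every scalar subquery can be evaluated once.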
There are single column answers based on the fact that SQL is a
set-oriented language, so we ought to use a set-oriented specification. We
want to get a subset of salary values that has a count of (n), has the
greatest value from the original set as an element, and includes all values
greater than its least element.
The idea is to take each salary and build a group of other salaries that
are greater than or equal to it; this value is the boundary of the subset.
The groups with three or fewer rows are what we want to see. The third
element of an ordered list is also the maximum or minimum element of a
set of three unique elements, depending on the ordering. Think of
concentric sets, nested inside each other. This query gives a columnar
answer, and the query can be extended to other numbers by changing
the constant in the HAVING clause.
SELECT MIN(P1.salary_amt)  -- the element on the boundary
  FROM Personnel AS P1,    -- P2 gives the elements of the subset
       Personnel AS P2     -- P1 gives the boundary of the subset
 WHERE P1.salary_amt >= P2.salary_amt
 GROUP BY P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3;
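A similar query can be written with a correlated subquery. Note that
COUNT(*) counts rows, not distinct salary values, so tied salaries are
treated differently: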
SELECT P1.salary_amt
FROM Personnel AS P1
WHERE (SELECT COUNT(*)
FROM Personnel AS P2
WHERE P2.salary_amt >= P1.salary_amt) <= 3;
However, the correlated subquery might be more expensive than the
GROUP BY clause.
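It is worth running the two forms side by side on data that contains ties, because they do not count the same thing. The Python sketch below uses the standard sqlite3 module and the sample rows from this section; the table layout is assumed.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)

# Counts DISTINCT salary levels at or above each boundary value.
GROUP_BY_VERSION = """
SELECT MIN(P1.salary_amt)
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.salary_amt >= P2.salary_amt
 GROUP BY P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3
"""

# Counts ROWS at or above each salary, so a tie uses up two places.
CORRELATED_VERSION = """
SELECT P1.salary_amt
  FROM Personnel AS P1
 WHERE (SELECT COUNT(*)
          FROM Personnel AS P2
         WHERE P2.salary_amt >= P1.salary_amt) <= 3
"""

grouped = sorted((r[0] for r in conn.execute(GROUP_BY_VERSION)), reverse=True)
correlated = sorted((r[0] for r in conn.execute(CORRELATED_VERSION)),
                    reverse=True)
```

On this data the grouped form returns the three highest salary values, while the correlated form returns one row per person among the three highest paid, so the tied 900.00 shows up twice and 800.00 drops out.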
If you would like to know how many ties you have for each value, the
query can be modified to this:
SELECT MIN(P1.salary_amt) AS top,
       COUNT (CASE WHEN P1.salary_amt = P2.salary_amt
                   THEN 1 ELSE NULL END) / 2 AS ties
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.salary_amt >= P2.salary_amt
 GROUP BY P2.salary_amt
HAVING COUNT(DISTINCT P1.salary_amt) <= 3;
If the salary is unique, the ties column will return a zero; otherwise,
you will get the number of occurrences of that value on each row of the
result table.
Or if you would like to see the ranking next to the employees, here is
another version using a GROUP BY:
SELECT P1.emp_name,
       SUM (CASE WHEN (P1.salary_amt || P1.emp_name)
                      < (P2.salary_amt || P2.emp_name)
                 THEN 1 ELSE 0 END) + 1 AS rank
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.emp_name <> P2.emp_name
 GROUP BY P1.emp_name
HAVING SUM (CASE WHEN (P1.salary_amt || P1.emp_name)
                      < (P2.salary_amt || P2.emp_name)
                 THEN 1 ELSE 0 END) <= (:n - 1);
The concatenation is to make ties in salary different by adding the
key to a string conversion. This query assumes automatic data type
conversion, but you can use an explicit CAST() function. This also
assumes that the collation has a particular ordering of digits and
letters (the old “ASCII versus EBCDIC” problem). You can use nested
CASE expressions to get around this.
SELECT P1.emp_name,
       SUM (CASE WHEN P1.salary_amt < P2.salary_amt THEN 1
                 WHEN P1.salary_amt > P2.salary_amt THEN 0
                 ELSE CASE WHEN P1.emp_name < P2.emp_name
                           THEN 1 ELSE 0 END
            END) + 1 AS rank
  FROM ...
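The nested CASE form of the ranking can be exercised with the Python sketch below, which uses the standard sqlite3 module. Several things here are assumptions: the FROM, WHERE, and GROUP BY clauses are borrowed from the concatenation version, the output column is named emp_rank instead of rank to stay clear of keyword clashes, and the table layout is invented for the sample rows.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)

# Count the colleagues who outrank each employee: a higher salary wins,
# and on equal salaries the name that sorts later wins the tie.
RANK_QUERY = """
SELECT P1.emp_name,
       SUM(CASE WHEN P1.salary_amt < P2.salary_amt THEN 1
                WHEN P1.salary_amt > P2.salary_amt THEN 0
                ELSE CASE WHEN P1.emp_name < P2.emp_name
                          THEN 1 ELSE 0 END
           END) + 1 AS emp_rank
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.emp_name <> P2.emp_name
 GROUP BY P1.emp_name
"""

ranks = dict(conn.execute(RANK_QUERY).fetchall())
```

Because the name breaks every tie, each employee gets a distinct rank; whether the earlier or the later name should win a tie is a choice made by the direction of the inner comparison.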
Here is another version that will produce the ties on separate lines
with the names of the personnel who made the cut. This answer is due to
Pierre Boutquin.
SELECT P1.emp_name, P1.salary_amt
FROM Personnel AS P1, Personnel AS P2
WHERE P1.salary_amt >= P2.salary_amt
GROUP BY P1.emp_name, P1.salary_amt
HAVING (SELECT COUNT(*) FROM Personnel) - COUNT(*) + 1 <= :n;
The idea is to use a little algebra. If we want to find (n of k) things,
then the rejected subset of the set is of size (k-n). Using the sample data,
we would get this result.
Results
name salary
==================
'Able' 1000.00
'Baker' 900.00
'Charles' 900.00
If we add a new employee at $900, we would also get him, but we
would not get a new employee at $800 or less. In many ways, this is the
most satisfying answer.
Here are two more versions of the solution:
SELECT P1.emp_name, P1.salary_amt
FROM Personnel AS P1, Personnel AS P2

GROUP BY P1.emp_name, P1.salary_amt
HAVING COUNT(CASE WHEN P1.salary_amt < P2.salary_amt
THEN 1
ELSE NULL END) + 1 <= :n;
SELECT P1.emp_name, P1.salary_amt
FROM Personnel AS P1
LEFT OUTER JOIN
Personnel AS P2
ON P1.salary_amt < P2.salary_amt
GROUP BY P1.emp_name, P1.salary_amt
HAVING COUNT(P2.salary_amt) + 1 <= :n;
The subquery is unnecessary and can be eliminated with either of the
above solutions.
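A small Python script can confirm that the three formulations agree on the sample data. This is a sketch using the standard sqlite3 module; the table layout is assumed, and :n is bound as a named parameter.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)

# Table size minus the rows at or below me, plus one, is my position.
WITH_SUBQUERY = """
SELECT P1.emp_name, P1.salary_amt
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.salary_amt >= P2.salary_amt
 GROUP BY P1.emp_name, P1.salary_amt
HAVING (SELECT COUNT(*) FROM Personnel) - COUNT(*) + 1 <= :n
"""

# Count the rows strictly above me directly, over the full cross join.
WITH_CASE = """
SELECT P1.emp_name, P1.salary_amt
  FROM Personnel AS P1, Personnel AS P2
 GROUP BY P1.emp_name, P1.salary_amt
HAVING COUNT(CASE WHEN P1.salary_amt < P2.salary_amt
                  THEN 1 ELSE NULL END) + 1 <= :n
"""

# Same count, with the outer join keeping people who have nobody above them.
WITH_OUTER_JOIN = """
SELECT P1.emp_name, P1.salary_amt
  FROM Personnel AS P1
       LEFT OUTER JOIN
       Personnel AS P2
       ON P1.salary_amt < P2.salary_amt
 GROUP BY P1.emp_name, P1.salary_amt
HAVING COUNT(P2.salary_amt) + 1 <= :n
"""


def run(sql, n):
    """Run one variant and return its rows in a stable order."""
    return sorted(conn.execute(sql, {"n": n}).fetchall())


results = [run(sql, 3) for sql in (WITH_SUBQUERY, WITH_CASE, WITH_OUTER_JOIN)]
```

All three variants count the people paid strictly more than each employee, so both of the tied 900.00 employees make the cut for n = 3.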
As an aside, if you were awake during your college set theory course,
you will remember that John von Neumann’s definition of ordinal
numbers is based on nested sets. You can get a lot of ideas for self-joins
from set theory theorems. John von Neumann was one of the greatest
mathematicians of the twentieth century; he was the inventor of the modern
stored program computer and game theory. Know your nerd heritage!
It should be obvious that any number can replace three in the query.
A subtle point is that the predicate “P1.salary_amt <= P2.salary_amt”
will include the boundary value, and therefore implies that if we have
three or fewer personnel, then we still have a result. If you want to call
off the competition for lack of a quorum, then change the predicate to
“P1.salary_amt < P2.salary_amt” instead.
Another way to express the query would be:


SELECT Elements.name, Elements.salary_amt
FROM Personnel AS Elements
WHERE (SELECT COUNT(*)
FROM Personnel AS Boundary
WHERE Elements.salary_amt < Boundary.salary_amt) < 3;
Likewise, the COUNT(*) and comparisons in the scalar subquery
expression can be changed to give slightly different results.
You might want to test each version to see which one runs faster on
your particular SQL product. If you want to swap the subquery and the
constant for readability, you may do so in current Standard SQL, but not
in SQL-89.
What if I want to allow ties? Then just change COUNT() to a
COUNT(DISTINCT) function in the HAVING clause, thus:
SELECT Elements.name, Elements.salary_amt
FROM Personnel AS Elements, Personnel AS Boundary
WHERE Elements.salary_amt <= Boundary.salary_amt
GROUP BY Elements.name, Elements.salary_amt
HAVING COUNT(DISTINCT Boundary.salary_amt) <= 3;
This says that I want to count the values of salary, not the
salespersons, so that if two or more of the crew hit the same total, I will
include them in the report as tied for a particular position. This also
means that the results can be more than three rows, because I can have
ties. As you can see, it is easy to get a subtle change in the results with
just a few simple changes in predicates.
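The effect of that one-word change is easy to see on the sample data. The Python sketch below, using the standard sqlite3 module, runs the same self-join twice, once with COUNT(DISTINCT ...) and once with COUNT(*). The column is called emp_name here where the query above uses name, and the rest of the table layout is likewise assumed.

```python
import sqlite3

# Sample rows used in this section; the column names are assumed.
ROWS = [("Able", 1000.00), ("Baker", 900.00), ("Charles", 900.00),
        ("Delta", 800.00), ("Eddy", 700.00), ("Fred", 700.00),
        ("George", 700.00)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel "
             "(emp_name TEXT PRIMARY KEY, salary_amt REAL NOT NULL)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)", ROWS)

# {counter} is filled in with the aggregate to test in the HAVING clause.
TEMPLATE = """
SELECT Elements.emp_name
  FROM Personnel AS Elements, Personnel AS Boundary
 WHERE Elements.salary_amt <= Boundary.salary_amt
 GROUP BY Elements.emp_name, Elements.salary_amt
HAVING {counter} <= 3
"""


def names(counter):
    """Return the qualifying employee names for one choice of counter."""
    sql = TEMPLATE.format(counter=counter)
    return sorted(row[0] for row in conn.execute(sql))


by_value = names("COUNT(DISTINCT Boundary.salary_amt)")  # counts salary levels
by_row = names("COUNT(*)")                               # counts people
```

Counting distinct values lets Delta in as the holder of the third highest salary, giving four rows, while counting rows stops after the three highest paid people.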
Notice that you can change the comparisons from “<=” to “<” and the
“COUNT(*)” to “COUNT(DISTINCT P2.salary_amt)” to change the
specification.

Ken Henderson came up with another version that uses derived tables
and scalar subquery expressions in SQL:
SELECT P2.salary_amt
  FROM (SELECT (SELECT COUNT(DISTINCT P1.salary_amt)
                  FROM Personnel AS P1
                 WHERE P3.salary_amt <= P1.salary_amt) AS ranking,
               P3.salary_amt
          FROM Personnel AS P3) AS P2
 WHERE P2.ranking <= 3;
You can get other aggregate functions by using this query with the IN
predicate. Assume that I have a SalaryHistory table from which I wish to
determine the average pay for the three most recent pay changes of each
employee. I am going to further assume that if you had three or fewer old
salaries, you would still want to average the first, second, or third values
you have on record.
SELECT S0.emp_nbr, AVG(S0.last_salary)
  FROM SalaryHistory AS S0
 WHERE S0.change_date
       IN (SELECT P1.change_date
             FROM SalaryHistory AS P1, SalaryHistory AS P2
            WHERE P1.emp_nbr = S0.emp_nbr
              AND P2.emp_nbr = S0.emp_nbr
              AND P1.change_date <= P2.change_date
            GROUP BY P1.change_date
           HAVING COUNT(*) <= 3)
 GROUP BY S0.emp_nbr;
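Here is a small Python sketch of the "average of the three most recent changes" idea, using the standard sqlite3 module. The SalaryHistory rows are invented for the illustration, and the subquery is correlated on emp_nbr so that the three most recent dates are picked separately for each employee; that correlation is an assumption about the intent. It also assumes one row per employee per change date.

```python
import sqlite3

# Hypothetical history: (emp_nbr, change_date, last_salary).
HISTORY = [
    (1, "2001-01-01", 100.00), (1, "2002-01-01", 200.00),
    (1, "2003-01-01", 300.00), (1, "2004-01-01", 400.00),
    (2, "2003-06-01", 500.00), (2, "2004-06-01", 700.00),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SalaryHistory
    (emp_nbr INTEGER NOT NULL,
     change_date TEXT NOT NULL,
     last_salary REAL NOT NULL,
     PRIMARY KEY (emp_nbr, change_date))""")
conn.executemany("INSERT INTO SalaryHistory VALUES (?, ?, ?)", HISTORY)

# For each employee, keep the dates that have at most three of that
# employee's change dates at or after them, then average those salaries.
RECENT_AVG_QUERY = """
SELECT S0.emp_nbr, AVG(S0.last_salary)
  FROM SalaryHistory AS S0
 WHERE S0.change_date
       IN (SELECT P1.change_date
             FROM SalaryHistory AS P1, SalaryHistory AS P2
            WHERE P1.emp_nbr = S0.emp_nbr
              AND P2.emp_nbr = S0.emp_nbr
              AND P1.change_date <= P2.change_date
            GROUP BY P1.change_date
           HAVING COUNT(*) <= 3)
 GROUP BY S0.emp_nbr
"""

recent_avg = dict(conn.execute(RECENT_AVG_QUERY).fetchall())
```

Employee 1 has four changes on record, so the oldest one is left out of the average; employee 2 has only two, and both are used.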
21.4.3 Multiple Criteria Extrema Functions
Since the generalized extrema functions are based on sorting the data, it
stands to reason that you could further generalize them to use multiple
columns in a table. This can be done by changing the WHERE search
condition. For example, to locate the top (n) tall and heavy employees
for the basketball team, we could write:
SELECT P1.emp_id
  FROM Personnel AS P1, Personnel AS P2
 WHERE P2.height > P1.height           -- major sort term
    OR (P2.height = P1.height          -- next sort term
        AND P2.weight >= P1.weight)
 GROUP BY P1.emp_id
HAVING COUNT(*) <= :n;
Procedural programmers will recognize this predicate, because it is
what they used to write to do a sort on more than one field in a file
system. Now it is very important to look at the predicates at each level of
nesting to be sure that you have the right theta operator. The ordering of
the predicates is also critical—there is a difference between ordering by
height within weight or by weight within height.
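To see how much the order of the sort terms matters, the Python sketch below runs the two-level predicate both ways with the standard sqlite3 module. The roster is invented for the illustration, the helper name top_n is hypothetical, and the major term is compared with a strict ">" so that equal values fall through to the next term.

```python
import sqlite3

# Hypothetical roster: (emp_id, height, weight).
ROSTER = [(1, 200, 100), (2, 200, 90), (3, 195, 110),
          (4, 190, 80), (5, 185, 120)]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Personnel
    (emp_id INTEGER PRIMARY KEY,
     height INTEGER NOT NULL,
     weight INTEGER NOT NULL)""")
conn.executemany("INSERT INTO Personnel VALUES (?, ?, ?)", ROSTER)


def top_n(major, minor, n):
    """Return emp_ids of the top n rows, sorted by `major` and then `minor`.

    The column names are trusted constants from this script, not user input.
    """
    sql = f"""
        SELECT P1.emp_id
          FROM Personnel AS P1, Personnel AS P2
         WHERE P2.{major} > P1.{major}
            OR (P2.{major} = P1.{major}
                AND P2.{minor} >= P1.{minor})
         GROUP BY P1.emp_id
        HAVING COUNT(*) <= :n
    """
    return sorted(row[0] for row in conn.execute(sql, {"n": n}))


by_height_then_weight = top_n("height", "weight", 3)
by_weight_then_height = top_n("weight", "height", 3)
```

The two calls pick different teams from the same roster, which is the difference between sorting weight within height and height within weight.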
One improvement would be to use row comparisons:
SELECT P1.emp_id
  FROM Personnel AS P1, Personnel AS P2
 WHERE (P2.height, P2.weight) >= (P1.height, P1.weight)
 GROUP BY P1.emp_id
HAVING COUNT(*) <= 4;
The down side of this approach is that you cannot easily mix
ascending and descending comparisons in the same comparison
predicate. The trick is to make numeric columns negative to reverse the
sense of the theta operator.
Before you attempt it, here is the scalar subquery version of the
multiple extrema problems:
SELECT

(SELECT MAX(P0.height)
FROM Personnel AS P0
WHERE P0.weight = (SELECT MAX(weight)
FROM Personnel AS P1)) AS s1,
(SELECT MAX(P0.height)
FROM Personnel AS P0
WHERE height NOT IN (s1)
AND P0.weight = (SELECT MAX(weight)
FROM Personnel AS P1
WHERE height NOT IN (s1))) AS s2,
(SELECT MAX(P0.height)
FROM Personnel AS P0
WHERE height NOT IN (s1, s2)
AND P0.weight = (SELECT MAX(weight)
FROM Personnel AS P1
WHERE height NOT IN (s1, s2))) AS s3
FROM Dummy;
Again, multiple criteria and their ordering would be expressed as
multiple levels of subquery nesting. This picks the tallest people and
decides ties with the greatest weight within that subset of personnel.
While this looks awful and is hard to read, it does run fairly fast, because
the predicates are repeated and can be factored out by the optimizer.
Another form of multiple criteria is finding the generalized extrema
functions within groupings; for example, finding the top three salaries in
each department. Adding the grouping constraints to the subquery
expressions gives us an answer.
SELECT dept_nbr, salary_amt
  FROM Personnel AS P1
 WHERE (SELECT COUNT(*)
          FROM Personnel AS P2
         WHERE P2.dept_nbr = P1.dept_nbr
           AND P2.salary_amt > P1.salary_amt) < :n;
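The grouped version can be tried with the Python sketch below, which uses the standard sqlite3 module. The department assignments are invented for the illustration, and the comparison is written so that the subquery counts colleagues in the same department who are paid more, which is what a "top (n)" reading needs.

```python
import sqlite3

# Hypothetical rows: (emp_name, dept_nbr, salary_amt).
ROWS = [("Able", 10, 1000.00), ("Baker", 10, 900.00),
        ("Charles", 10, 900.00), ("Delta", 10, 800.00),
        ("Eddy", 20, 600.00), ("Fred", 20, 500.00)]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Personnel
    (emp_name TEXT PRIMARY KEY,
     dept_nbr INTEGER NOT NULL,
     salary_amt REAL NOT NULL)""")
conn.executemany("INSERT INTO Personnel VALUES (?, ?, ?)", ROWS)

# Keep a row when fewer than :n people in its department earn more.
TOP_PER_DEPT = """
SELECT dept_nbr, salary_amt
  FROM Personnel AS P1
 WHERE (SELECT COUNT(*)
          FROM Personnel AS P2
         WHERE P2.dept_nbr = P1.dept_nbr
           AND P2.salary_amt > P1.salary_amt) < :n
"""


def top_per_dept(n):
    """Return (dept_nbr, salary_amt) rows for the top n places per department."""
    return sorted(conn.execute(TOP_PER_DEPT, {"n": n}).fetchall())


top_three = top_per_dept(3)
top_one = top_per_dept(1)
```

Department 20 has only two people, so both are returned for n = 3; reversing the comparison would return the lowest salaries in each department instead.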
