Joe Celko's SQL for Smarties: Advanced SQL Programming

442 CHAPTER 21: AGGREGATE FUNCTIONS

'Chester' 'A.' 'Arthur' 'R' 1881 1885
'Grover' ' ' 'Cleveland' 'D' 1885 1889
'Benjamin' ' ' 'Harrison' 'R' 1889 1893
'Grover' ' ' 'Cleveland' 'D' 1893 1897
'William' ' ' 'McKinley' 'R' 1897 1901
'Theodore' ' ' 'Roosevelt' 'R' 1901 1909
'William' 'H.' 'Taft' 'R' 1909 1913
'Woodrow' ' ' 'Wilson' 'D' 1913 1921
'Warren' 'G.' 'Harding' 'R' 1921 1923
'Calvin' ' ' 'Coolidge' 'R' 1923 1929
'Herbert' 'C.' 'Hoover' 'R' 1929 1933
'Franklin' 'D.' 'Roosevelt' 'D' 1933 1945
'Harry' 'S.' 'Truman' 'D' 1945 1953
'Dwight' 'D.' 'Eisenhower' 'R' 1953 1961
'John' 'F.' 'Kennedy' 'D' 1961 1963
'Lyndon' 'B.' 'Johnson' 'D' 1963 1969
'Richard' 'M.' 'Nixon' 'R' 1969 1974
'Gerald' 'R.' 'Ford' 'R' 1974 1977
'James' 'E.' 'Carter' 'D' 1977 1981
'Ronald' 'W.' 'Reagan' 'R' 1981 1989
'George' 'H.W.' 'Bush' 'R' 1989 1993
'William' 'J.' 'Clinton' 'D' 1993 2001
'George' 'W. ' 'Bush' 'R' 2001 NULL

Your civics teacher has just asked you to tell her how many people
have been President of the United States. So you write the query as
SELECT COUNT(*) FROM Presidents; and get the wrong answer. For
those of you who have been out of high school too long, more than one
Adams, more than one John, and more than one Roosevelt have served
as president. Many people have had more than one term in office, and
Grover Cleveland served two discontinuous terms. In short, this
database is not a simple one-row, one-person system. What you really
want is not COUNT(*), but something that is able to look at unique
combinations of multiple columns. You cannot do this in one column, so
you need to construct an expression that is unique. The point is that you
need to be very sure that the expression you are using as a parameter is
really what you wanted to count.
The COUNT([ALL] <value expression>) returns the number of
members in the <value expression> set. The NULLs are thrown
away before the counting takes place, and an empty set returns zero. The
best way to read this is: “Count the number of known values in this
expression,” with stress on the word known. In this example you might
use COUNT(first_name || ' ' || initial || ' ' || last_name).
The COUNT(DISTINCT <value expression>) returns the
number of unique members in the <value expression> set. The
NULLs are thrown away before the counting takes place, and then all
redundant duplicates are removed (i.e., we keep one copy). Again, an
empty set returns a zero, just as with the other counting functions.
Applying this function to a key or a unique column is the same as using
the COUNT(*) function, but the optimizer may not be smart enough to
spot it.
Notice that the use of the keywords ALL and DISTINCT follows the
same pattern here as they did in the [ALL | DISTINCT] options in the
SELECT clause of the query expressions.
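The three counting forms can be compared side by side on a small sample. The following is only a sketch: the text names the columns first_name, initial, and last_name, while party, start_year, end_year, and the choice of key are assumptions made for illustration. The || concatenation operator is Standard SQL, but some products spell it differently.

```sql
-- Hypothetical layout; only first_name, initial, and last_name come from the text.
CREATE TABLE Presidents
(first_name VARCHAR(15) NOT NULL,
 initial VARCHAR(5) NOT NULL,
 last_name VARCHAR(15) NOT NULL,
 party CHAR(1) NOT NULL,
 start_year INTEGER NOT NULL PRIMARY KEY,
 end_year INTEGER);  -- NULL means the term has not ended

INSERT INTO Presidents VALUES ('Chester', 'A.', 'Arthur', 'R', 1881, 1885);
INSERT INTO Presidents VALUES ('Grover', ' ', 'Cleveland', 'D', 1885, 1889);
INSERT INTO Presidents VALUES ('Benjamin', ' ', 'Harrison', 'R', 1889, 1893);
INSERT INTO Presidents VALUES ('Grover', ' ', 'Cleveland', 'D', 1893, 1897);
INSERT INTO Presidents VALUES ('George', 'W.', 'Bush', 'R', 2001, NULL);

SELECT COUNT(*) AS row_cnt,              -- 5: every row is counted
       COUNT(end_year) AS ended_cnt,     -- 4: the NULL end_year is skipped
       COUNT(DISTINCT first_name || ' ' || initial || ' ' || last_name)
         AS person_cnt                   -- 4: Cleveland is counted once
  FROM Presidents;
```

Only the DISTINCT form answers the civics teacher's question, and only as long as no two different people share the same full name.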

21.2 SUM() Functions

This function works only with numeric values. You should also consult
your particular product’s manuals to find out the precision of the results
for exact and approximate numeric data types.
SUM([ALL] <value expression>) returns the numeric total of
all known values. The NULLs are removed before the summation takes
place. An empty set returns a NULL, not a zero.
SUM(DISTINCT <value expression>) returns the numeric total
of all known, unique values. The NULLs and all redundant duplicates
are removed before the summation takes place. Again, an empty set
returns a NULL, not a zero.
That last rule is hard for people to understand. A query that uses
aggregate functions without a GROUP BY clause treats the whole table as
a single group, so it always returns exactly one row, even when the table
is empty. In that row COUNT() returns a zero and the other aggregate
functions return a NULL. An empty result set appears only when a
GROUP BY clause finds no groups to report. This is true for the rest of
the Standard aggregate functions:

one row with (NULL) in it
SELECT SUM(x)
FROM EmptyTable;

one row with (0, NULL) in it
SELECT COUNT(*), SUM(x)
FROM EmptyTable;
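The behavior of aggregates over an empty table can be checked directly. This is a minimal sketch using the EmptyTable name from the listings above; the COALESCE() wrapper is one common way to ask for a zero explicitly, offered here as a suggestion rather than a rule.

```sql
CREATE TABLE EmptyTable (x INTEGER);

-- No GROUP BY: the whole (empty) table is one group, so one row comes back.
SELECT COUNT(*) AS row_cnt, SUM(x) AS total
  FROM EmptyTable;              -- one row: (0, NULL)

-- If a zero is wanted instead of a NULL, say so explicitly.
SELECT COALESCE(SUM(x), 0) AS total
  FROM EmptyTable;              -- one row: (0)

-- With GROUP BY there are no groups, hence no rows at all.
SELECT x, SUM(x) AS total
  FROM EmptyTable
 GROUP BY x;                    -- empty result set
```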


The summation of a set of numbers looks as though it should be
easy, but it is not. Make two tables with the same set of positive and
negative approximate numeric values, but put one in random order
and have the other sorted by absolute value. The sorted table will give
more accurate results. The reason is simple: positive and negative
values of the same magnitude will be added together and will get a
chance to cancel each other out. There is also less chance of an
overflow or underflow error during calculations. Most PC SQL
implementations and a lot of mainframe implementations do not
bother with this trick, because it would require a sort for every SUM()
statement, which would take a long time.
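The effect of summation order can be seen with three values added in two different orders. This sketch assumes that DOUBLE PRECISION is a 64-bit IEEE floating-point type and that the product accepts a SELECT without a FROM clause (the Standard requires one); both are assumptions about the product, not guarantees.

```sql
SELECT (CAST(1.0E20 AS DOUBLE PRECISION)
        + CAST(-1.0E20 AS DOUBLE PRECISION))
        + CAST(1.0E0 AS DOUBLE PRECISION) AS cancel_first,  -- 1: the big values cancel, then 1 is added
       (CAST(1.0E20 AS DOUBLE PRECISION)
        + CAST(1.0E0 AS DOUBLE PRECISION))
        + CAST(-1.0E20 AS DOUBLE PRECISION) AS cancel_last; -- 0: the 1 is lost when added to 1.0E20
```

The same three numbers give two different totals, which is why the order in which a SUM() visits its rows can matter for approximate numeric columns.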
Whenever an exact or approximate numeric value is assigned to an exact
numeric column, it may not fit into the storage allowed for it. SQL says that the
database engine will use an approximation that preserves leading
significant digits of the original number after rounding or truncating.
The choice of whether to truncate or round is implementation-defined,
however. This can lead to some surprises when you have to shift data
among SQL implementations, or move storage values from a host
language program into an SQL table. It is probably a good idea to create
the columns with one more decimal place than you think you need.

Truncation is defined as truncation toward zero; this means that 1.5
would truncate to 1, and -1.5 would truncate to -1. This is not true for
all programming languages; everyone agrees on truncation toward zero
for the positive numbers, but you will find that negative numbers may
truncate away from zero (e.g., -1.5 would truncate to -2). SQL is also
wishy-washy on rounding, leaving the implementation free to determine
its method. There are two major types of rounding, the scientific method
and the commercial method, which are discussed in Section 3.2.1 on
rounding and truncation math in SQL.
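When the direction of truncation matters, it can be spelled out rather than left to the product. The sketch below uses a hypothetical Readings table and assumes the product supports the FLOOR() and CEILING() functions; the last column is shown without an expected value because the CAST() result is implementation-defined.

```sql
CREATE TABLE Readings (x DECIMAL(5,1) NOT NULL PRIMARY KEY);
INSERT INTO Readings VALUES (1.5);
INSERT INTO Readings VALUES (-1.5);

SELECT x,
       CASE WHEN x >= 0
            THEN FLOOR(x)
            ELSE CEILING(x) END AS toward_zero,  -- 1.5 gives 1, -1.5 gives -1
       FLOOR(x) AS toward_minus_infinity,        -- 1.5 gives 1, -1.5 gives -2
       CAST(x AS INTEGER) AS product_choice      -- may truncate or round
  FROM Readings;
```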

21.3 AVG() Functions

AVG([ALL] <value expression>) returns the average of the values
in the value expression set. An empty set returns a NULL. A
set of all NULLs will become an empty set. Remember that in general,
AVG(x) is not the same as (SUM(x)/COUNT(*)); the SUM(x) function
has thrown away the NULLs, but the COUNT(*) has not.
Likewise, AVG(DISTINCT <value expression>) returns the
average of the distinct known values in the <value expression> set.
Applying this function to a key or a unique column is the same as
using the AVG(<value expression>) function.
Remember that in general AVG(DISTINCT x) is not the same as
AVG(x) or (SUM(DISTINCT x)/COUNT(*)). The SUM(DISTINCT x)
function has thrown away the duplicate values and NULLs, but the
COUNT(*) has not. An empty set returns a NULL.
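The differences between these expressions show up as soon as a column holds a NULL and a duplicate. The Scores table and its values below are made up for illustration; the values were picked so that every division comes out even.

```sql
CREATE TABLE Scores
(score_id INTEGER NOT NULL PRIMARY KEY,
 x DECIMAL(8,2));

INSERT INTO Scores VALUES (1, 4.00);
INSERT INTO Scores VALUES (2, 4.00);
INSERT INTO Scores VALUES (3, 16.00);
INSERT INTO Scores VALUES (4, NULL);

SELECT AVG(x) AS avg_known,                        -- 24 / 3 = 8
       SUM(x) / COUNT(*) AS sum_over_rows,         -- 24 / 4 = 6
       SUM(x) / COUNT(x) AS sum_over_known,        -- 24 / 3 = 8
       AVG(DISTINCT x) AS avg_distinct,            -- (4 + 16) / 2 = 10
       SUM(DISTINCT x) / COUNT(*) AS distinct_over_rows  -- 20 / 4 = 5
  FROM Scores;
```

Dividing by COUNT(x) instead of COUNT(*) reproduces AVG(x), because both ignore the NULL.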
The SQL engine is probably using the same code for the totaling in
the
AVG() that it used in the SUM() function. This leads to the same
problems with rounding and truncation, so you should experiment a
little with your particular product to find out what happens.
But even more troublesome than those problems is the problem with
the average itself, because it does not really measure central tendency
and can be very misleading. Consider the chart below, from Darrell
Huff’s superlative little book, How to Lie with Statistics (Huff 1954). The
Sample Company has 25 employees, earning the following salaries:
Number of
Employees    Salary     Statistic
===================================
   12        $2,000     Mode, Minimum
    1        $3,000     Median
    4        $3,700
    3        $5,000
    1        $5,700     Average
    2       $10,000
    1       $15,000
    1       $45,000     Maximum
The average salary (or, more properly, the arithmetic mean) is
$5,700. When the boss is trying to look good to the unions, he uses this
figure. When the unions are trying to look impoverished, they use the
mode, which is the most frequently occurring value, to show that the
exploited workers are making $2,000 (which is also the minimum salary
in this case).
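The figures in the chart can be reproduced with ordinary aggregates. This is a sketch only: the SalaryChart table is a hypothetical one-row-per-salary-level layout of the chart, not a table from the text, and the mode query assumes a single most frequent value.

```sql
CREATE TABLE SalaryChart
(salary DECIMAL(8,2) NOT NULL PRIMARY KEY,
 emp_cnt INTEGER NOT NULL);

INSERT INTO SalaryChart VALUES (2000.00, 12);
INSERT INTO SalaryChart VALUES (3000.00, 1);
INSERT INTO SalaryChart VALUES (3700.00, 4);
INSERT INTO SalaryChart VALUES (5000.00, 3);
INSERT INTO SalaryChart VALUES (5700.00, 1);
INSERT INTO SalaryChart VALUES (10000.00, 2);
INSERT INTO SalaryChart VALUES (15000.00, 1);
INSERT INTO SalaryChart VALUES (45000.00, 1);

-- Arithmetic mean weighted by head count: 142,500 / 25 = 5,700
SELECT SUM(salary * emp_cnt) / SUM(emp_cnt) AS mean_salary,
       MIN(salary) AS min_salary,    -- 2,000
       MAX(salary) AS max_salary     -- 45,000
  FROM SalaryChart;

-- Mode: the salary level with the largest head count (2,000 here).
SELECT salary AS mode_salary
  FROM SalaryChart
 WHERE emp_cnt = (SELECT MAX(emp_cnt) FROM SalaryChart);
```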
A better measure in this case is the median, which will be discussed
later; that is, the employee with just as many cases above him as below
him. That gives us $3,000. The rule for calculating the median is that if
there is no actual entity with that value, you fake it.
Most people take an average of the two values on either side of where
the median would be; others jump to the higher or lower value. The
mode also has a problem, because not every distribution of values has
one mode. Imagine a country in which there are as many very poor
people as there are very rich people, and there is nobody in between.
This would be a bimodal distribution. If there were sharp classes of
incomes, that would be a multimodal distribution.
Some SQL products have median and mode aggregate functions as
extensions, but they are not part of the standard. We will discuss in
detail how to write them in pure SQL in Chapter 23.
21.3.1 Averages with Empty Groups
The query used here is a bit tricky, so this section can be skipped on
your first reading. Sometimes you need to count an empty set as part of
the population when computing an average.
This is easier to explain with an example that was posted on
CompuServe. A fish and game warden is sampling different bodies of
water for fish populations. Each sample falls into one or more groups
(muddy bottoms, clear water, still water, and so on) and she is trying to
find the average of something that is not there. This is neither quite as
strange as it first sounds, nor quite as simple, either. She is collecting
sample data on fish in a table like this:
CREATE TABLE Samples
(sample_id INTEGER NOT NULL,
 fish CHAR(20) NOT NULL,
 found_cnt INTEGER NOT NULL,
 PRIMARY KEY (sample_id, fish));
CREATE TABLE SampleGroups
(group_id INTEGER NOT NULL,
 sample_id INTEGER NOT NULL,
 PRIMARY KEY (group_id, sample_id));
Assume some of the data looks like this:
Samples
sample_id  fish       found_cnt
===============================
1          'Seabass'  14
1          'Minnow'   18
2          'Seabass'  19

SampleGroups
group_id  sample_id
===================
1         1
1         2
2         2
She needs to get the average number of each species of fish in the
sample groups. For example, using sample group 1 as shown, which has
samples 1 and 2, we could use the parameters :my_fish = 'Minnow'
and :my_group = 1 to find the average number of minnows in sample
group 1, thus:
SELECT fish, AVG(found_cnt)
  FROM Samples
 WHERE sample_id
       IN (SELECT sample_id
             FROM SampleGroups
            WHERE group_id = :my_group)
   AND fish = :my_fish
 GROUP BY fish;
But this query will give us an average of 18 minnows, which is wrong.
There were no minnows for sample_id = 2, so the average is ((18 + 0)/2)
= 9. The other way is to do several steps to get the correct answer: first
use a SELECT statement to get the number of samples involved, then
another SELECT to get the sum, and then manually calculate the
average.
The obvious answer is to enter a count of zero for each animal under
each sample_id, instead of letting it be missing, so you can use the
original query. You can create the missing rows with:
INSERT INTO Samples
-- DISTINCT keeps the self-join from proposing the same (sample_id, fish) pair twice
SELECT DISTINCT M1.sample_id, M2.fish, 0
  FROM Samples AS M1, Samples AS M2
 WHERE NOT EXISTS (SELECT *
                     FROM Samples AS M3
                    WHERE M1.sample_id = M3.sample_id
                      AND M2.fish = M3.fish);
Unfortunately, it turns out that we have over 100,000 different
species of fish and thousands of samples. This trick will fill up more disk
space than we have on the machine. The best trick is to use this
statement:
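SELECT fish, SUM(found_cnt)/
       (SELECT COUNT(sample_id)
          FROM SampleGroups
         WHERE group_id = :my_group)
  FROM Samples
 WHERE fish = :my_fish
   AND sample_id IN (SELECT sample_id  -- total only the samples in this group
                       FROM SampleGroups
                      WHERE group_id = :my_group)
 GROUP BY fish;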
This query is using the rule that the average is the sum of values
divided by the count of the set. Another way to do this would be to use
an OUTER JOIN and preserve all the group IDs, but that would create
NULLs for the fish that are not in some of the sample groups, and you
would have to handle them.
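The OUTER JOIN alternative might look something like the sketch below. It repeats the sample tables so that it can be run on its own, uses the literals 'Minnow' and 1 in place of the :my_fish and :my_group parameters, and casts to DECIMAL only to avoid integer division in products where AVG() of an INTEGER column returns an INTEGER. These choices are assumptions for illustration, not the method given in the text.

```sql
CREATE TABLE Samples
(sample_id INTEGER NOT NULL,
 fish CHAR(20) NOT NULL,
 found_cnt INTEGER NOT NULL,
 PRIMARY KEY (sample_id, fish));

CREATE TABLE SampleGroups
(group_id INTEGER NOT NULL,
 sample_id INTEGER NOT NULL,
 PRIMARY KEY (group_id, sample_id));

INSERT INTO Samples VALUES (1, 'Seabass', 14);
INSERT INTO Samples VALUES (1, 'Minnow', 18);
INSERT INTO Samples VALUES (2, 'Seabass', 19);

INSERT INTO SampleGroups VALUES (1, 1);
INSERT INTO SampleGroups VALUES (1, 2);
INSERT INTO SampleGroups VALUES (2, 2);

-- Every sample in the group is preserved; a sample without the fish
-- yields a NULL found_cnt, which COALESCE() turns into a zero.
SELECT AVG(CAST(COALESCE(S.found_cnt, 0) AS DECIMAL(10,2))) AS avg_found
  FROM SampleGroups AS G
       LEFT OUTER JOIN
       Samples AS S
       ON S.sample_id = G.sample_id
          AND S.fish = 'Minnow'
 WHERE G.group_id = 1;          -- (18 + 0) / 2 = 9
```

The fish test belongs in the ON clause; moving it to the WHERE clause would discard the preserved rows and bring back the wrong answer of 18.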
21.3.2 Averages across Columns
The sum of several columns can be done with the COALESCE() function,
which effectively removes the NULLs by replacing them with zeros:
SELECT (COALESCE(c1, 0.0)
        + COALESCE(c2, 0.0)
        + COALESCE(c3, 0.0)) AS c_total
  FROM Foobar;

Likewise, the minimum and maximum values of several columns can
be done with a CASE expression, or the GREATEST() and LEAST()
functions.
Taking an average across several columns is easy if none of the
columns are NULL. You simply add the values and divide by the number
of columns. However, getting rid of NULLs is a bit harder. The first trick
is to count the NULLs:
SELECT (COALESCE(c1-c1, 1)
+ COALESCE(c2-c2, 1)
+ COALESCE(c3-c3, 1)) AS null_cnt
FROM Foobar;
The trick is to watch out for a row with all NULLs in it. This could lead
to a division by zero error.
SELECT CASE WHEN COALESCE(c1, c2, c3) IS NULL
            THEN NULL
            ELSE (COALESCE(c1, 0.0)
                  + COALESCE(c2, 0.0)
                  + COALESCE(c3, 0.0))
                 / (3 - (COALESCE(c1-c1, 1)
                         + COALESCE(c2-c2, 1)
                         + COALESCE(c3-c3, 1)))
       END AS horizontal_avg
  FROM Foobar;
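A few sample rows make the arithmetic easier to follow. The sketch below assumes a Foobar table with a hypothetical foo_id key (the text shows only c1, c2, and c3) and made-up values covering the three cases: no NULLs, one NULL, and all NULLs.

```sql
CREATE TABLE Foobar
(foo_id INTEGER NOT NULL PRIMARY KEY,  -- hypothetical key, not in the text
 c1 DECIMAL(5,2),
 c2 DECIMAL(5,2),
 c3 DECIMAL(5,2));

INSERT INTO Foobar VALUES (1, 1.00, 2.00, 3.00);
INSERT INTO Foobar VALUES (2, 4.00, NULL, 6.00);
INSERT INTO Foobar VALUES (3, NULL, NULL, NULL);

SELECT foo_id,
       CASE WHEN COALESCE(c1, c2, c3) IS NULL
            THEN NULL
            ELSE (COALESCE(c1, 0.0) + COALESCE(c2, 0.0) + COALESCE(c3, 0.0))
                 / (3 - (COALESCE(c1-c1, 1)
                         + COALESCE(c2-c2, 1)
                         + COALESCE(c3-c3, 1)))
       END AS horizontal_avg
  FROM Foobar;
-- foo_id 1: (1 + 2 + 3) / 3 = 2
-- foo_id 2: (4 + 6) / 2 = 5, because the NULL in c2 is not counted
-- foo_id 3: NULL, because the CASE guard avoids dividing by zero
```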
21.4 Extrema Functions
The MIN() and MAX() functions are known as extrema functions in
mathematics. They assume that the elements of the set have an ordering,
so it makes sense to select a first or last element based on its value. SQL
provides two simple extrema functions, and you can write queries to
generalize these to (n) elements.
21.4.1 Simple Extrema Functions
MAX([ALL | DISTINCT] <value expression>) returns the
greatest known value in the <value expression> set. This function
will work on character and temporal values, as well as numeric values.
An empty set returns a NULL. Technically, you can write
MAX(DISTINCT <value expression>), but it is the same as
MAX(<value expression>); this form exists only for completeness,
and nobody ever uses it.
MIN([ALL | DISTINCT] <value expression>) returns the
smallest known value in the <value expression> set. This function
will also work on character and temporal values, as well as numeric
values. An empty set returns a NULL. Likewise, MIN(DISTINCT
<value expression>) exists, but it is defined only for completeness
and nobody ever uses it.
The MAX() for a set of numeric values is the largest. The MAX() for a
set of temporal data types is the one closest to 9999-12-31, which is
the final date in the ISO-8601 Standard. The MAX() for a set of character
strings is the last one in the ascending sort order. Likewise, the MIN()
for a set of numeric values is the smallest. The MIN() for a set of
temporal data types is the one furthest from 9999-12-31. The MIN()
for a set of character strings is the first one in the ascending sort order,
but you have to know the collation used.
People have a hard time understanding the MAX() and MIN()
aggregate functions when they are applied to temporal data types. They
seem to expect the MAX() to return the date closest to the current date.
Likewise, if the set has no dates before the current date, they seem to
expect the MIN() function to return the date closest to the current date.
Human psychology wants to use the current time as an origin point for
temporal reasoning.
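A small example may help separate what MIN() and MAX() actually do from what people expect. The Billings table, its columns other than billing_date, and its rows are assumptions made for this sketch, and the DATE literal syntax is Standard SQL that not every product accepts.

```sql
CREATE TABLE Billings
(billing_id INTEGER NOT NULL PRIMARY KEY,
 billing_date DATE NOT NULL,
 cust_name VARCHAR(10) NOT NULL);

INSERT INTO Billings VALUES (1, DATE '1999-12-31', 'Baker');
INSERT INTO Billings VALUES (2, DATE '2005-06-15', 'Able');
INSERT INTO Billings VALUES (3, DATE '2999-01-01', 'Charlie');

SELECT MIN(billing_date) AS first_date,    -- 1999-12-31
       MAX(billing_date) AS last_date,     -- 2999-01-01, however far in the future
       MIN(cust_name) AS first_cust_name,  -- 'Able' under an ordinary alphabetic collation
       MAX(cust_name) AS last_cust_name    -- 'Charlie'
  FROM Billings;
```

Neither function looks at CURRENT_DATE; the result depends only on the ordering of the values in the set.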
Consider the predicate “billing_date < (CURRENT_DATE -
INTERVAL '90' DAY)” as an example. Most people have to stop and
figure out that this is looking for billings that are over 90 days past due.
This same thing happens with MIN() and MAX() functions.
SQL also has funny rules about comparing VARCHAR strings, which
can cause problems. When two strings are compared for equality, the
shorter one is right-padded with blanks; then they are compared
position for position. Thus, the strings 'John ' and 'John   ', which
differ only in the number of trailing blanks, are equal.
You will have to check your implementation of SQL to see which string is
returned as the MAX() and which as the MIN(), or whether there is any
pattern to it at all.

There are some tricks with extrema functions in subqueries that differ
from product to product. For example, to find the current employee
status in a table of Salary Histories, the obvious query is:
SELECT *
  FROM SalaryHistory AS S0
 WHERE S0.change_date
       = (SELECT MAX(S1.change_date)
            FROM SalaryHistory AS S1
           WHERE S0.emp_id = S1.emp_id);
But you can also write the query as:
SELECT *
  FROM SalaryHistory AS S0
 WHERE NOT EXISTS
       (SELECT *
          FROM SalaryHistory AS S1
         WHERE S0.emp_id = S1.emp_id
           AND S0.change_date < S1.change_date);
The correlated subquery with a MAX() will be implemented by going
to the subquery and building a working table, which is grouped by
emp_id. Then for each group you will keep track of the maximum and
save it for the final result.
However, the NOT EXISTS version will find the first row that meets
the criteria and, when found, return TRUE. Therefore, the NOT
EXISTS() predicate might run faster.
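To compare the two forms on a given product, a small test table is enough. In this sketch the salary column and the sample rows are assumptions; the text shows only emp_id and change_date. Which form runs faster depends on the optimizer, so the point here is only that both return the same rows.

```sql
CREATE TABLE SalaryHistory
(emp_id INTEGER NOT NULL,
 change_date DATE NOT NULL,
 salary DECIMAL(8,2) NOT NULL,   -- assumed column, not shown in the text
 PRIMARY KEY (emp_id, change_date));

INSERT INTO SalaryHistory VALUES (1, DATE '2004-01-01', 1000.00);
INSERT INTO SalaryHistory VALUES (1, DATE '2005-01-01', 1100.00);
INSERT INTO SalaryHistory VALUES (2, DATE '2004-06-01', 900.00);

-- MAX() version
SELECT *
  FROM SalaryHistory AS S0
 WHERE S0.change_date
       = (SELECT MAX(S1.change_date)
            FROM SalaryHistory AS S1
           WHERE S0.emp_id = S1.emp_id);

-- NOT EXISTS version: keep a row only if no later row exists for that employee
SELECT *
  FROM SalaryHistory AS S0
 WHERE NOT EXISTS
       (SELECT *
          FROM SalaryHistory AS S1
         WHERE S0.emp_id = S1.emp_id
           AND S0.change_date < S1.change_date);

-- Both return (1, 2005-01-01, 1100.00) and (2, 2004-06-01, 900.00).
```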
21.4.2 Generalized Extrema Functions

This is known as the Top (or Bottom) (n) values problem, and it
originally appeared in Explain magazine; it was submitted by Jim
Wankowski of Hawthorne, CA (Wankowski n.d.). You are given a table
of Personnel and their salaries. Write a single SQL query that will display
the three highest salaries from that table. It is easy to find the maximum
salary with the simple query
SELECT MAX(salary) FROM
Personnel; but SQL does not have a maximum function that will
return a group of high values from a column. The trouble with this query
is that the specification is bad, for several reasons.
1. How do we define “best salary” in terms of an ordering? Is it
base pay or does it include commissions? For the rest of this
section, assume that we are using a simple table with a column
that has the salary for each employee.
2. What if we have three or fewer personnel in the company? Do
we report all the personnel we do have? Or do we return a
NULL, empty result set or error message? This is the equivalent
of calling the contest for lack of entries.
3. How do we handle two personnel who tied? Include them all
and allow the result set to be bigger than three? Pick an
arbitrary subset and exclude someone? Or do we return a
NULL, empty result set, or error message?
To make these problems more explicit, consider this table:
Personnel
emp_name salary
==================
'Able' 1000.00
'Baker' 900.00
