Joe Celko's SQL for Smarties: Advanced SQL Programming

542 CHAPTER 23: STATISTICS IN SQL
This approach works fine for two variables and would produce a table
that could be sent to a report writer program to give a final version. But
where are your column and row totals? This means you also need to write
these two queries:
SELECT race, COUNT(*) FROM Personnel GROUP BY race;
SELECT sex, COUNT(*) FROM Personnel GROUP BY sex;
However, what I wanted was a table with a row for males and a row
for females, with columns for each of the racial groups, just as I drew it.
But let us assume that we want to get this information broken down
within a third variable, such as a job code. I want to see the job_nbr and
the total by sex and race within each job code. Our query set starts to get
bigger and bigger. A crosstab can also include other summary data, such
as total or average salary within each cell of the table.
23.7.1 Crosstabs by Cross Join
A solution proposed by John M. Baird of Datapoint in San Antonio,
Texas involves creating a matrix table for each variable in the crosstab,
thus:
SexMatrix
sex   Male  Female
==================
'M'      1       0
'F'      0       1

RaceMatrix
race       asian  black  caucasian  latino  Other
========================================================
asian          1      0          0       0      0
black          0      1          0       0      0
caucasian      0      0          1       0      0
latino         0      0          0       1      0
Other          0      0          0       0      1


The query then constructs the cells by using a CROSS JOIN
(Cartesian product) and summation for each one, thus:
23.7 Cross Tabulations 543
SELECT job_nbr,
       SUM(asian * male) AS AsianMale,
       SUM(asian * female) AS AsianFemale,
       SUM(black * male) AS BlackMale,
       SUM(black * female) AS BlackFemale,
       SUM(caucasian * male) AS CaucMale,
       SUM(caucasian * female) AS CaucFemale,
       SUM(latino * male) AS LatinoMale,
       SUM(latino * female) AS LatinoFemale,
       SUM(other * male) AS OtherMale,
       SUM(other * female) AS OtherFemale
  FROM Personnel, SexMatrix, RaceMatrix
 WHERE (RaceMatrix.race = Personnel.race)
   AND (SexMatrix.sex = Personnel.sex)
 GROUP BY job_nbr;
Numeric summary data can be obtained from this table. For example,
the total salary for each cell can be computed by
SUM(<race> * <sex> * salary) AS <cell name> in place of what we have here.
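The matrix-table technique can be tried out with a small sketch. The one below uses Python's sqlite3 module rather than any particular SQL product. The Personnel columns (emp_nbr, job_nbr, sex, race, salary), the five sample rows, and the cut-down two-race RaceMatrix are all assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY,   -- hypothetical key column
 job_nbr INTEGER NOT NULL,
 sex CHAR(1) NOT NULL,
 race VARCHAR(10) NOT NULL,
 salary DECIMAL(8,2) NOT NULL);
INSERT INTO Personnel VALUES
 (1, 10, 'M', 'asian', 100.00),
 (2, 10, 'F', 'asian', 150.00),
 (3, 10, 'F', 'black', 200.00),
 (4, 20, 'M', 'black', 300.00),
 (5, 20, 'M', 'black',  50.00);

-- One indicator (0/1) column per category value.
CREATE TABLE SexMatrix
(sex CHAR(1) NOT NULL PRIMARY KEY, male INTEGER, female INTEGER);
INSERT INTO SexMatrix VALUES ('M', 1, 0), ('F', 0, 1);

CREATE TABLE RaceMatrix
(race VARCHAR(10) NOT NULL PRIMARY KEY, asian INTEGER, black INTEGER);
INSERT INTO RaceMatrix VALUES ('asian', 1, 0), ('black', 0, 1);
""")

# Each cell is the sum of the product of two indicators;
# multiplying by salary turns a head count into a salary total.
crosstab = conn.execute("""
SELECT job_nbr,
       SUM(asian * male) AS AsianMale,
       SUM(asian * female) AS AsianFemale,
       SUM(black * male) AS BlackMale,
       SUM(black * female) AS BlackFemale,
       SUM(black * male * salary) AS BlackMaleSalary
  FROM Personnel, SexMatrix, RaceMatrix
 WHERE RaceMatrix.race = Personnel.race
   AND SexMatrix.sex = Personnel.sex
 GROUP BY job_nbr
 ORDER BY job_nbr;
""").fetchall()

for row in crosstab:
    print(row)
```

Adding a new category value means adding both a row and a column to the matrix table, plus new SUM() expressions in the query.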
23.7.2 Crosstabs by Outer Joins
Another method, due to Jim Panttaja, uses a series of temporary tables or
VIEWs and then combines them with OUTER JOINs.
CREATE VIEW Guys (race, maletally)
AS SELECT race, COUNT(*)
FROM Personnel
WHERE sex = 'M'
GROUP BY race;

Correspondingly, you could have written:
CREATE VIEW Dolls (race, femaletally)
AS SELECT race, COUNT(*)
FROM Personnel
WHERE sex = 'F'
GROUP BY race;
But they can be combined for a crosstab, without column and row
totals, like this:
SELECT Guys.race, maletally, femaletally
FROM Guys LEFT OUTER JOIN Dolls
ON Guys.race = Dolls.race;
The idea is to build a starting column in the crosstab, then
progressively add columns to it. You use the LEFT OUTER JOIN to avoid
missing-data problems.
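A minimal sketch of the outer-join approach follows, again run through Python's sqlite3 module. The sample rows are invented, and the views name their columns with AS aliases instead of a view column list, which is only a portability choice for this sketch. COALESCE is an addition here: it turns the NULL that the outer join produces for a race with no women into a zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY,   -- hypothetical key column
 sex CHAR(1) NOT NULL,
 race VARCHAR(10) NOT NULL);
INSERT INTO Personnel VALUES
 (1, 'M', 'asian'), (2, 'F', 'asian'),
 (3, 'M', 'black'), (4, 'M', 'black');   -- no black women in this sample

CREATE VIEW Guys AS
SELECT race, COUNT(*) AS maletally
  FROM Personnel WHERE sex = 'M' GROUP BY race;

CREATE VIEW Dolls AS
SELECT race, COUNT(*) AS femaletally
  FROM Personnel WHERE sex = 'F' GROUP BY race;
""")

# The LEFT OUTER JOIN keeps 'black' even though Dolls has no row for it.
rows = conn.execute("""
SELECT Guys.race, maletally, COALESCE(femaletally, 0) AS femaletally
  FROM Guys LEFT OUTER JOIN Dolls
       ON Guys.race = Dolls.race
 ORDER BY Guys.race;
""").fetchall()

for row in rows:
    print(row)
```

A race that appears only in Dolls would still be lost, because Guys is the preserved side of the join.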
23.7.3 Crosstabs by Subquery
Another method takes advantage of the orthogonality of correlated
subqueries in SQL-92. Think about what each row or column in the
crosstab wants.
SELECT DISTINCT race,
(SELECT COUNT(*)
FROM Personnel AS P1
WHERE P0.race = P1.race
AND sex = 'M') AS MaleTally,
(SELECT COUNT(*)
FROM Personnel AS P2
WHERE P0.race = P2.race
AND sex = 'F') AS FemaleTally
FROM Personnel AS P0;

An advantage of this approach is that you can attach another column
to get the row tally by adding
(SELECT COUNT(*)
FROM Personnel AS P3
WHERE P0.race = P3.race) AS RaceTally
Likewise, to get the column tallies, union the previous query with:
SELECT 'Summary',
(SELECT COUNT(*)
FROM Personnel
WHERE sex = 'M') AS GrandMaleTally,
(SELECT COUNT(*)
FROM Personnel
WHERE sex = 'F') AS GrandFemaleTally,
(SELECT COUNT(*)
FROM Personnel) AS GrandTally
FROM Personnel;
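The subquery version, with the row tally column and the summary row attached, might be checked with a sketch like this one in Python's sqlite3 module. The four sample rows are invented. The summary SELECT is written here without a trailing FROM clause, which SQLite allows; that is a simplification for the sketch, not the form given above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY,   -- hypothetical key column
 sex CHAR(1) NOT NULL,
 race VARCHAR(10) NOT NULL);
INSERT INTO Personnel VALUES
 (1, 'M', 'asian'), (2, 'F', 'asian'),
 (3, 'M', 'black'), (4, 'M', 'black');
""")

rows = conn.execute("""
SELECT DISTINCT race,
       (SELECT COUNT(*) FROM Personnel AS P1
         WHERE P0.race = P1.race AND P1.sex = 'M') AS MaleTally,
       (SELECT COUNT(*) FROM Personnel AS P2
         WHERE P0.race = P2.race AND P2.sex = 'F') AS FemaleTally,
       (SELECT COUNT(*) FROM Personnel AS P3
         WHERE P0.race = P3.race) AS RaceTally
  FROM Personnel AS P0
UNION
SELECT 'Summary',
       (SELECT COUNT(*) FROM Personnel WHERE sex = 'M'),
       (SELECT COUNT(*) FROM Personnel WHERE sex = 'F'),
       (SELECT COUNT(*) FROM Personnel);
""").fetchall()

# Key the result by its first column so row order does not matter.
tallies = {race: (males, females, total)
           for race, males, females, total in rows}
print(tallies)
```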
23.7.4 Crosstabs by CASE Expression
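Probably the best method is to use the CASE expression. The first
SELECT below builds the body of the crosstab with a row total. If you
need the final row of the traditional crosstab, the column totals, you
can add the second SELECT with a UNION ALL: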
SELECT sex,
SUM(CASE race WHEN 'caucasian' THEN 1 ELSE 0 END) AS caucasian,
SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END) AS black,
SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END) AS asian,
SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END) AS latino,
SUM(CASE race WHEN 'other' THEN 1 ELSE 0 END) AS other,
COUNT(*) AS row_total
FROM Personnel
GROUP BY sex

UNION ALL
SELECT ' ',
SUM(CASE race WHEN 'caucasian' THEN 1 ELSE 0 END),
SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END),
SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END),
SUM(CASE race WHEN 'latino' THEN 1 ELSE 0 END),
SUM(CASE race WHEN 'other' THEN 1 ELSE 0 END),
COUNT(*) AS column_total
FROM Personnel;
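Here is a cut-down sketch of the CASE crosstab with only two race columns, run through Python's sqlite3 module. The sample rows and the 'total' label on the final row are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY,   -- hypothetical key column
 sex CHAR(1) NOT NULL,
 race VARCHAR(10) NOT NULL);
INSERT INTO Personnel VALUES
 (1, 'M', 'asian'), (2, 'F', 'asian'),
 (3, 'M', 'black'), (4, 'M', 'black');
""")

rows = conn.execute("""
SELECT sex,
       SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END) AS asian,
       SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END) AS black,
       COUNT(*) AS row_total
  FROM Personnel
 GROUP BY sex
UNION ALL
SELECT 'total',
       SUM(CASE race WHEN 'asian' THEN 1 ELSE 0 END),
       SUM(CASE race WHEN 'black' THEN 1 ELSE 0 END),
       COUNT(*)
  FROM Personnel;
""").fetchall()

# Key by the first column; UNION ALL promises no particular row order.
crosstab = {label: (asian, black, total)
            for label, asian, black, total in rows}
print(crosstab)
```

The whole crosstab comes from a single pass over Personnel per SELECT, with no helper tables or views to maintain.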
23.8 Harmonic Mean and Geometric Mean
The harmonic mean is defined as the reciprocal of the arithmetic mean
of the reciprocals of the values of a set. It is of limited use and is
found mostly in averaging rates and prices. Every value must be nonzero,
because the query divides by each one.
SELECT COUNT(*)/SUM(1.0/x) AS harmonic_mean
FROM Foobar;
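As a quick check of the formula, consider a trip driven at 40 mph one way and 60 mph back over the same distance. The average speed for the round trip is the harmonic mean, 48 mph, not the arithmetic mean of 50. The sketch below runs the query through Python's sqlite3 module; the two rows in Foobar are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Foobar (x FLOAT NOT NULL CHECK (x <> 0));
INSERT INTO Foobar VALUES (40.0), (60.0);   -- speeds out and back
""")

harmonic_mean = conn.execute("""
SELECT COUNT(*) / SUM(1.0 / x) AS harmonic_mean
  FROM Foobar;
""").fetchone()[0]

print(harmonic_mean)   # close to 48, up to floating-point rounding
```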
The geometric mean is the exponential of the mean of the logs of the
data items. You can also express it as the nth root of the product of the
(n) data items. This second form is more subject to rounding errors than
the first. The geometric mean is sometimes a better measure of central
tendency than the simple arithmetic mean when you are analyzing
change over time.
SELECT EXP (AVG (LOG (nbr))) AS geometric_mean
FROM NumberTable;
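If you have zero or negative numbers this will blow up, because the
logarithm is not defined for values less than or equal to zero. LOG here
must be the natural logarithm, the inverse of EXP; Standard SQL spells
it LN(), and in some products LOG() is base 10.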
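The geometric mean of 2 and 8 is 4, the square root of their product, which makes a handy test case. In the sketch below, LN and EXP are registered from Python's math module because a given SQLite build may not ship math functions. That registration and the sample rows are assumptions of the sketch, not part of the query above.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Supply natural log and exponential as user-defined SQL functions.
conn.create_function("LN", 1, math.log)
conn.create_function("EXP", 1, math.exp)

conn.executescript("""
CREATE TABLE NumberTable (nbr FLOAT NOT NULL CHECK (nbr > 0));
INSERT INTO NumberTable VALUES (2.0), (8.0);
""")

geometric_mean = conn.execute("""
SELECT EXP(AVG(LN(nbr))) AS geometric_mean
  FROM NumberTable;
""").fetchone()[0]

print(geometric_mean)   # close to 4, up to floating-point rounding
```

The CHECK (nbr > 0) constraint is one way to keep values out of the table that the logarithm cannot handle.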
23.9 Multivariable Descriptive Statistics in SQL
More and more SQL products are adding more complicated descriptive
statistics to their aggregate function library. For example, CA-Ingres
comes with a very nice set of such tools.
Many of the single-column aggregate functions for which we just gave
code are built-in functions. If you have that advantage, then use them.
They will have corrections for floating-point rounding errors and be
more accurate.
Descriptive statistics are not all single-column computations. You
often want to know relationships among several variables for prediction
and description. Let’s pick one statistic that is representative of this class
of functions and see what problems we have writing our own aggregate
function for it.
23.9.1 Covariance
The covariance is defined as a measure of the extent to which two
variables move together. Financial analysts use it to determine the degree
to which the returns on two securities are related over time. A high
positive covariance indicates similar movements. This code is due to
Steve Kass:
CREATE TABLE Samples
(sample_nbr INTEGER NOT NULL PRIMARY KEY,
x FLOAT NOT NULL,
y FLOAT NOT NULL);
INSERT INTO Samples
VALUES (1, 3, 9), (2, 2, 7), (3, 4, 12), (4, 5, 15), (5, 6, 17);
SELECT ((1.0/n) * SUM((x - xbar)*(y - ybar))) AS covariance
  FROM Samples
       CROSS JOIN
       (SELECT COUNT(*), AVG(x), AVG(y) FROM Samples)
       AS A (n, xbar, ybar)
 GROUP BY n;
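For the five sample rows the means are xbar = 4 and ybar = 12, the products of the deviations sum to 26, and 26 / 5 gives a covariance of 5.2. The sketch below checks that arithmetic with Python's sqlite3 module. SQLite does not accept a column list after a derived table alias, so the columns are named with AS inside the subquery; that rewrite is an adaptation for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Samples
(sample_nbr INTEGER NOT NULL PRIMARY KEY,
 x FLOAT NOT NULL,
 y FLOAT NOT NULL);
INSERT INTO Samples
VALUES (1, 3, 9), (2, 2, 7), (3, 4, 12), (4, 5, 15), (5, 6, 17);
""")

# The one-row derived table A carries n and both means to every sample row.
covariance = conn.execute("""
SELECT (1.0 / n) * SUM((x - xbar) * (y - ybar)) AS covariance
  FROM Samples
       CROSS JOIN
       (SELECT COUNT(*) AS n, AVG(x) AS xbar, AVG(y) AS ybar
          FROM Samples) AS A
 GROUP BY n;
""").fetchone()[0]

print(covariance)   # close to 5.2, up to floating-point rounding
```

Dividing by n gives the population covariance; dividing by (n - 1) instead would give the sample covariance.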
23.9.2 Pearson’s r

One of the most useful statistics built on the covariance is Pearson's r,
or the linear correlation coefficient. It measures the strength of the
linear association between two variables. In English, given a set of
observations (x1, y1), (x2, y2), . . . , (xn, yn), I want to know: when
one variable goes up or down, how well does the other variable follow it?
The correlation coefficient always takes a value between +1 and -1.
Positive one means that the two variables rise and fall together exactly.
Negative one means that increasing values in one variable correspond
exactly to decreasing values in the other variable. A correlation value
close to zero indicates no linear association between the variables. In
the real world, you will not see +1 or -1 very often; this would mean
that you are looking at a natural law, and not a statistical
relationship. The values in between are much more realistic, with 0.70
or greater being a strong correlation.
The formula translates into SQL in a straightforward manner.
CREATE TABLE Samples
(sample_name CHAR(3) NOT NULL PRIMARY KEY,
x REAL, y REAL);
INSERT INTO Samples
VALUES ('a', 1.0, 2.0), ('b', 2.0, 5.0), ('c', 3.0, 6.0);
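Aggregate functions cannot be nested, so the two means are computed
once in a derived table and joined to every row:
SELECT SUM((x - xbar) * (y - ybar))
       / SQRT(SUM((x - xbar)^2) * SUM((y - ybar)^2))
       AS pearson_r
  FROM Samples
       CROSS JOIN
       (SELECT AVG(x), AVG(y) FROM Samples)
       AS A (xbar, ybar);
For this sample data, r = 0.9608.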
SQRT() is the square root function, which is quite common in SQL
today, and ^2 is the square of the number. Some products use
POWER(x, n) instead of the exponent notation. Alternatively, you can
use repeated multiplication.
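The value r = 0.9608 for the three sample points can be reproduced with a sketch in Python's sqlite3 module. It uses repeated multiplication for the squares and registers SQRT from Python's math module, since a given SQLite build may not provide it. The means are computed in a one-row derived table because aggregates cannot be nested. Both choices are assumptions of this sketch.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("SQRT", 1, math.sqrt)   # user-defined square root

conn.executescript("""
CREATE TABLE Samples
(sample_name CHAR(3) NOT NULL PRIMARY KEY,
 x REAL, y REAL);
INSERT INTO Samples
VALUES ('a', 1.0, 2.0), ('b', 2.0, 5.0), ('c', 3.0, 6.0);
""")

pearson_r = conn.execute("""
SELECT SUM((x - xbar) * (y - ybar))
       / SQRT(SUM((x - xbar) * (x - xbar))
              * SUM((y - ybar) * (y - ybar))) AS pearson_r
  FROM Samples
       CROSS JOIN
       (SELECT AVG(x) AS xbar, AVG(y) AS ybar FROM Samples) AS A;
""").fetchone()[0]

print(round(pearson_r, 4))
```

If every x or every y is the same value, the denominator is zero and the correlation is undefined, so a production query would need to guard against that case.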
23.9.3 NULLs in Multivariable Descriptive Statistics
If (x, y) = (NULL, NULL), then the query will drop the pair in the
aggregate functions, as per the usual rules of SQL. But what is the correct
(or reasonable) behavior if (x, y) has one and only one NULL in the pair?
We can make several arguments.
1. Drop the pairs that contain any NULLs. That is quick and easy
with a “WHERE x IS NOT NULL AND y IS NOT NULL” clause added to
the query. The argument is that if you don’t know one or both
values, how can you know what their relationship is?
2. Convert (x, NULL) to (x, AVG(y)) and (NULL, y) to
(AVG(x), y). The idea is to “smooth out” the missing values
with a reasonable replacement that is based on the whole set
from which known values were drawn. There might be better
replacement values in a particular situation, but that idea
would still hold.
3. Replace (NULL, NULL) with (a, a) for some value to say that the
NULLs are in the same grouping. This kind of “pseudo-equality”
is the basis for putting NULLs into one group in a GROUP BY
operation. I am not sure what the correct practice for the
(x, NULL) and (NULL, y) pairs is.
4. First calculate a linear regression with the known pairs, say y =
(a + b*x), and then fill in the expected values. If you forgot
your high school algebra, that would be y[i] = a + b * x[i] for
the pair (x[i], NULL), and x[i] = (y[i] - a) / b for the pair
(NULL, y[i]).
5. Catch the SQLSTATE warning code (found in Standard SQL) that
shows that an aggregate function has dropped NULLs before doing
the computations, and use it to report to the user about the
missing data.
I can also use COUNT(*) and COUNT(x+y) to determine how much
data is missing. I think we would all agree that if I have a small subset of
non-NULL pairs, then my correlation is less reliable than if I obtained it
from a large subset of non-NULL pairs.
There is no right answer to this question. You will need to know the
nature of your data to make a good decision.
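The COUNT(*) versus COUNT(x+y) comparison and the first option (drop incomplete pairs) can be illustrated with a short sketch in Python's sqlite3 module. The five sample rows, three of them with at least one NULL, are invented for the purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Samples
(sample_name CHAR(3) NOT NULL PRIMARY KEY,
 x REAL, y REAL);
INSERT INTO Samples
VALUES ('a', 1.0, 2.0), ('b', 2.0, NULL), ('c', NULL, 5.0),
       ('d', NULL, NULL), ('e', 3.0, 6.0);
""")

# x + y is NULL when either operand is NULL, and COUNT(expression)
# skips NULLs, so COUNT(x + y) counts only the complete pairs.
total_pairs, complete_pairs = conn.execute("""
SELECT COUNT(*), COUNT(x + y)
  FROM Samples;
""").fetchone()

# Option 1: keep only the pairs where both values are known.
kept = conn.execute("""
SELECT sample_name
  FROM Samples
 WHERE x IS NOT NULL AND y IS NOT NULL
 ORDER BY sample_name;
""").fetchall()

print(total_pairs, complete_pairs, kept)
```

The ratio of complete pairs to total pairs is a simple figure to report alongside the correlation, so the reader can judge how much data stands behind it.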

CHAPTER 24

Regions, Runs, Gaps, Sequences, and Series

Tables do not have an ordering to their rows. Yes, the physical storage
of the rows in many SQL products might be ordered if the product is
built on an old file system. More modern implementations might not
construct and materialize the result set rows until the end of the query
execution.

The first rule in a relational database is that all relationships are
shown in tables by values in columns. This means that things
involving an ordering must have a table with at least two columns.
One column, the sequence number, is the primary key; the other
column has the value that holds that position in the sequence.
The sequence column has consecutive unique integers, without any
gaps in the numbering. Examples of this sort of data would be ticket
numbers, time series data taken at fixed intervals, and the like. The
ordering of those identifiers carries some information, such as physical
or temporal location. A subsequence is a set of consecutive unique
identifiers within a larger containing sequence that has some property.
This property is usually consecutive numbering.
For example, given the data

CREATE TABLE List
(seq_nbr INTEGER NOT NULL UNIQUE,
val INTEGER NOT NULL UNIQUE);

INSERT INTO List
VALUES (1, 99), (2, 10), (3, 11), (4, 12), (5, 13), (6, 14), (7, 0);

You can find subsequences of size three that follow the rule—(10, 11,
12), (11, 12, 13), and (12, 13, 14)—but the longest sequence is (10, 11,
12, 13, 14), and it is of size five.
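One way to find the size-three subsequences is a three-way self-join that requires both the positions and the values to be consecutive. The query below is an illustrative sketch run through Python's sqlite3 module, not a query given in the text; only the List table and its rows come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE List
(seq_nbr INTEGER NOT NULL UNIQUE,
 val INTEGER NOT NULL UNIQUE);
INSERT INTO List
VALUES (1, 99), (2, 10), (3, 11), (4, 12), (5, 13), (6, 14), (7, 0);
""")

# Three adjacent positions whose values each go up by exactly one.
subsequences = conn.execute("""
SELECT L1.val, L2.val, L3.val
  FROM List AS L1, List AS L2, List AS L3
 WHERE L2.seq_nbr = L1.seq_nbr + 1
   AND L3.seq_nbr = L2.seq_nbr + 1
   AND L2.val = L1.val + 1
   AND L3.val = L2.val + 1
 ORDER BY L1.seq_nbr;
""").fetchall()

print(subsequences)
```

Relaxing the two value predicates to L2.val > L1.val and L3.val > L2.val would find increasing runs of size three instead.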
A run is like a sequence, but the numbers do not have to be
consecutive, just increasing and contiguous. For example, given the run
{(1, 1), (2, 2), (3, 12), (4, 15), (5, 23)}, you can find subruns of size
three: (1, 2, 12), (2, 12, 15), and (12, 15, 23).
A region is contiguous, and all the values are the same. For example,
{(1, 1), (2, 0), (3, 0), (4, 0), (5, 25)} has a region of zeros that is three
items long.
In procedural languages, you would simply sort the data and scan it.
In SQL, you have to define everything in terms of sets and nested sets.
Some of these queries can be done with the OLAP addition to SQL-99,
but they are not yet common in SQL products.

24.1 Finding Subregions of Size (n)

This example is adapted from SQL and Its Applications (Lorie and
Daudenarde 1991). You are given a table of theater seats:

CREATE TABLE Theater
(seat_nbr INTEGER NOT NULL PRIMARY KEY, -- sequencing number
 occupancy_status CHAR(1) NOT NULL -- values
   CONSTRAINT valid_occupancy_status
   CHECK (occupancy_status IN ('A', 'S')));

In this table, an occupancy_status code of ‘A’ means available, and ‘S’
means sold. Your problem is to write a query that will return the
subregions of (n) consecutive seats still available. Assume for the
moment that consecutive seat_nbrs mean that the seats are also
physically consecutive, ignoring rows of seating where seat_nbr(n) and
seat_nbr((n) + 1) might be on different physical theater rows. For
(n) = 3, we can write a self-JOIN query, thus:

SELECT T1.seat_nbr, T2.seat_nbr, T3.seat_nbr
  FROM Theater AS T1, Theater AS T2, Theater AS T3
 WHERE T1.occupancy_status = 'A'
   AND T2.occupancy_status = 'A'
   AND T3.occupancy_status = 'A'
   AND T2.seat_nbr = T1.seat_nbr + 1
   AND T3.seat_nbr = T2.seat_nbr + 1;

The trouble with this answer is that it works only for (n = 3). This
pattern can be extended for any (n), but what we really want is a
generalized query where we can use (n) as a parameter to the query.
The solution given by Lorie and Daudenarde starts with a given
seat_nbr and looks at all the available seats between it and ((n) - 1)
seats further up. The real trick is switching from the English-language
statement “All seats between here and there are available” to the
passive-voice version, “Available is the occupancy_status of all the
seats between here and there,” so that you can see the query.



SELECT seat_nbr, ' thru ', (seat_nbr + (:n - 1))
  FROM Theater AS T1
 WHERE occupancy_status = 'A'
   AND 'A' = ALL (SELECT occupancy_status
                    FROM Theater AS T2
                   WHERE T2.seat_nbr > T1.seat_nbr
                     AND T2.seat_nbr <= T1.seat_nbr + (:n - 1));

Please notice that this returns subregions. That is, if seats (1, 2, 3, 4,
5) are available, this query will return (1, 2, 3), (2, 3, 4), and (3, 4, 5) as
its result set.
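A parameterized version can be sketched with Python's sqlite3 module. SQLite has no = ALL predicate, so this sketch swaps in a different but related test: a block qualifies when exactly (n) of the seats from seat_nbr through seat_nbr + (n) - 1 are available. That count-based form also rejects blocks that would run past the last seat in the table. The helper name free_blocks and the eight sample seats are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Theater
(seat_nbr INTEGER NOT NULL PRIMARY KEY,      -- sequencing number
 occupancy_status CHAR(1) NOT NULL
   CONSTRAINT valid_occupancy_status
   CHECK (occupancy_status IN ('A', 'S')));
INSERT INTO Theater VALUES
 (1, 'A'), (2, 'A'), (3, 'A'), (4, 'S'),
 (5, 'A'), (6, 'A'), (7, 'S'), (8, 'A');
""")


def free_blocks(conn, n):
    """Return (first_seat, last_seat) for each block of n available seats."""
    return conn.execute("""
    SELECT T1.seat_nbr, T1.seat_nbr + (:n - 1)
      FROM Theater AS T1
     WHERE :n = (SELECT COUNT(*)
                   FROM Theater AS T2
                  WHERE T2.seat_nbr BETWEEN T1.seat_nbr
                                        AND T1.seat_nbr + (:n - 1)
                    AND T2.occupancy_status = 'A')
     ORDER BY T1.seat_nbr;
    """, {"n": n}).fetchall()


blocks_of_three = free_blocks(conn, 3)
blocks_of_two = free_blocks(conn, 2)
print(blocks_of_three, blocks_of_two)
```

As in the text, overlapping subregions are all reported: seats 1 through 3 yield both (1, 2) and (2, 3) when (n) = 2.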

24.2 Numbering Regions

Instead of looking for a region, we want to number the regions in the
order in which they appear. For example, given a view or table with a
payment history, we want to break it into groupings of behavior—for
example, whether or not the payments were on time or late.

CREATE TABLE PaymentHistory
(payment_nbr INTEGER NOT NULL PRIMARY KEY,
paid_on_time CHAR(1) DEFAULT 'Y' NOT NULL
CHECK(paid_on_time IN ('Y', 'N')));
INSERT INTO PaymentHistory
VALUES (1006, 'Y'), (1005, 'Y'),
