Joe Celko's SQL for Smarties - Advanced SQL Programming

532 CHAPTER 23: STATISTICS IN SQL
CREATE TABLE Sales
(salesman CHAR(10),
client_name CHAR(10),
sales_amt DECIMAL (9,2) NOT NULL,
PRIMARY KEY (salesman, client_name));
The problem is to show each salesman, his client, the amount of that
sale, what percentage of his total sales volume that one sale represents,
and the cumulative percentage of his total sales we have reached at that
point. We will sort the clients from the largest amount to the smallest.
This problem is based on a salesman’s report originally written for a
small commercial printing company. The idea was to show the salesmen
where their business was coming from and to persuade them to give up
their smaller accounts (defined as the lower 20%) to new salesmen. The
report lets the salesman run his finger down the page and see which
customers represented the top 80% of his income.
We can use derived tables to build layers of aggregation in the same
query.
SELECT S0.salesman, S0.client_name, S0.sales_amt,
       ((S0.sales_amt * 100.00) / ST.salesman_total)
          AS percent_of_total,
       ((SUM(S1.sales_amt) * 100.00) / ST.salesman_total)
          AS cum_percent
  FROM Sales AS S0
       INNER JOIN
       Sales AS S1
       ON S0.salesman = S1.salesman
          AND (S1.sales_amt > S0.sales_amt
               OR (S1.sales_amt = S0.sales_amt
                   AND S1.client_name >= S0.client_name))
       INNER JOIN
       (SELECT S2.salesman, SUM(S2.sales_amt)
          FROM Sales AS S2
         GROUP BY S2.salesman) AS ST(salesman, salesman_total)
       ON S0.salesman = ST.salesman
 GROUP BY S0.salesman, S0.client_name, S0.sales_amt,
          ST.salesman_total;
However, if your SQL allows subqueries in the SELECT clause but not in
the FROM clause, you can fake it with this query:
23.6 Cumulative Statistics 533
SELECT S0.salesman, S0.client_name, S0.sales_amt,
       (S0.sales_amt * 100.00
        / (SELECT SUM(S1.sales_amt)
             FROM Sales AS S1
            WHERE S0.salesman = S1.salesman))
          AS percent_of_total,
       ((SELECT SUM(S3.sales_amt)
           FROM Sales AS S3
          WHERE S0.salesman = S3.salesman
            AND (S3.sales_amt > S0.sales_amt
                 OR (S3.sales_amt = S0.sales_amt
                     AND S3.client_name >= S0.client_name))) * 100.00
        / (SELECT SUM(S2.sales_amt)
             FROM Sales AS S2
            WHERE S0.salesman = S2.salesman)) AS cum_percent
  FROM Sales AS S0;
This query will probably run like glue.
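To see what the scalar-subquery version returns, here is a minimal sketch that runs it from Python's standard sqlite3 module. SQLite as the engine, the salesman 'Able', and his four clients are all assumptions made up for illustration; the column is named sales_amt, as the queries expect. An ORDER BY is added so the rows come back largest sale first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Sales
       (salesman CHAR(10) NOT NULL,
        client_name CHAR(10) NOT NULL,
        sales_amt DECIMAL(9,2) NOT NULL,
        PRIMARY KEY (salesman, client_name))"""
)
# Hypothetical sample data: one salesman, four clients, total sales 1000.
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("Able", "Acme", 500.0), ("Able", "Baker", 300.0),
     ("Able", "Cole", 150.0), ("Able", "Dunn", 50.0)],
)

QUERY = """
SELECT S0.salesman, S0.client_name, S0.sales_amt,
       S0.sales_amt * 100.0
         / (SELECT SUM(S1.sales_amt) FROM Sales AS S1
             WHERE S1.salesman = S0.salesman) AS percent_of_total,
       (SELECT SUM(S3.sales_amt) FROM Sales AS S3
         WHERE S3.salesman = S0.salesman
           AND (S3.sales_amt > S0.sales_amt
                OR (S3.sales_amt = S0.sales_amt
                    AND S3.client_name >= S0.client_name))) * 100.0
         / (SELECT SUM(S2.sales_amt) FROM Sales AS S2
             WHERE S2.salesman = S0.salesman) AS cum_percent
  FROM Sales AS S0
 ORDER BY S0.salesman, S0.sales_amt DESC, S0.client_name DESC
"""

cumulative_rows = conn.execute(QUERY).fetchall()
for row in cumulative_rows:
    print(row)
```

With this data the two largest clients already account for 80% of the total, which is the line the salesman would be looking for on the printed report.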
23.6.4 Rankings and Related Statistics
Martin Tillinger posted this problem on the MSACCESS forum of
CompuServe in early 1995. How do you rank your salesmen in each
territory, given a SalesReport table that looks like this?
CREATE TABLE SalesReport
(salesman CHAR(20) NOT NULL PRIMARY KEY
REFERENCES Salesforce(salesman),
territory INTEGER NOT NULL,
sales_tot DECIMAL (8,2) NOT NULL);
This statistic is called a ranking. A ranking is shown as integers that
represent the ordinal values (first, second, third, and so on) of the
elements of a set based on one of the values. In this case, sales personnel
are ranked by their total sales within a territory. The one with the highest
total sales is in first place, the next highest is in second place, and so
forth.
The hard question is how to handle ties. The rule is that if two
salespersons have the same value, they have the same ranking, and there
are no gaps in the rankings. This is the nature of ordinal numbers—there
cannot be a third place without a first and a second place. A query that
will do this for us is:
SELECT S1.salesman, S1.territory, S1.sales_tot,
(SELECT COUNT(DISTINCT sales_tot)
FROM SalesReport AS S2
WHERE S2.sales_tot >= S1.sales_tot
AND S2.territory = S1.territory) AS rank
FROM SalesReport AS S1;
You might also remember that this is really a version of the
generalized extrema functions we already discussed. Another way to
write this query is thus:
SELECT S1.salesman, S1.territory, MAX(S1.sales_tot),
       SUM (CASE
            WHEN S1.sales_tot < S2.sales_tot
                 OR (S1.sales_tot = S2.sales_tot
                     AND S1.salesman <= S2.salesman)
            THEN 1 ELSE 0 END) AS rank
  FROM SalesReport AS S1, SalesReport AS S2
 WHERE S1.salesman <> S2.salesman
   AND S1.territory = S2.territory
 GROUP BY S1.salesman, S1.territory;
This query uses the MAX() function on the nongrouping columns in
the SalesReport to display them so that the aggregation will work.
It is worth looking at the four possible variations on this basic query
to see what each change does to the result set.
Version 1: COUNT(DISTINCT) and >= yields a ranking.
SELECT S1.salesman, S1.territory, S1.sales_tot,
(SELECT COUNT(DISTINCT sales_tot)
FROM SalesReport AS S2
WHERE S2.sales_tot >= S1.sales_tot
AND S2.territory = S1.territory) AS rank
FROM SalesReport AS S1;
salesman   territory  sales_tot  rank
=====================================
'Wilson'           1     990.00     1
'Smith'            1     950.00     2
'Richard'          1     800.00     3
'Quinn'            1     700.00     4
'Parker'           1     345.00     5
'Jones'            1     345.00     5
'Hubbard'          1     345.00     5
'Date'             1     200.00     6
'Codd'             1     200.00     6
'Blake'            1     100.00     7
Version 2: COUNT(DISTINCT) and > yields a ranking, but it starts at zero.
SELECT S1.salesman, S1.territory, S1.sales_tot,
(SELECT COUNT(DISTINCT sales_tot)
FROM SalesReport AS S2
WHERE S2.sales_tot > S1.sales_tot
AND S2.territory = S1.territory) AS rank
FROM SalesReport AS S1;
salesman   territory  sales_tot  rank
=====================================
'Wilson'           1     990.00     0
'Smith'            1     950.00     1
'Richard'          1     800.00     2
'Quinn'            1     700.00     3
'Parker'           1     345.00     4
'Jones'            1     345.00     4
'Hubbard'          1     345.00     4
'Date'             1     200.00     5
'Codd'             1     200.00     5
'Blake'            1     100.00     6
Version 3: COUNT(ALL) and >= yields a standing which starts at one.
SELECT S1.salesman, S1.territory, S1.sales_tot,
(SELECT COUNT(sales_tot)
FROM SalesReport AS S2
WHERE S2.sales_tot >= S1.sales_tot
AND S2.territory = S1.territory) AS standing
FROM SalesReport AS S1;
salesman   territory  sales_tot  standing
=========================================
'Wilson'           1     990.00         1
'Smith'            1     950.00         2
'Richard'          1     800.00         3
'Quinn'            1     700.00         4
'Parker'           1     345.00         7
'Jones'            1     345.00         7
'Hubbard'          1     345.00         7
'Date'             1     200.00         9
'Codd'             1     200.00         9
'Blake'            1     100.00        10
Version 4: COUNT(ALL) and > yields a standing that starts at zero.
SELECT S1.salesman, S1.territory, S1.sales_tot,
(SELECT COUNT(sales_tot)
FROM SalesReport AS S2
WHERE S2.sales_tot > S1.sales_tot
AND S2.territory = S1.territory) AS standing
FROM SalesReport AS S1;
salesman   territory  sales_tot  standing
=========================================
'Wilson'           1     990.00         0
'Smith'            1     950.00         1
'Richard'          1     800.00         2
'Quinn'            1     700.00         3
'Parker'           1     345.00         4
'Jones'            1     345.00         4
'Hubbard'          1     345.00         4
'Date'             1     200.00         7
'Codd'             1     200.00         7
'Blake'            1     100.00         9
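The four variations differ only in the COUNT quantifier and the comparison operator, so they can be generated from one template and checked side by side. The sketch below is one way to do that, assuming SQLite driven from Python's standard sqlite3 module; the helper name counts and the alias place_nbr are made up for illustration, and the data is the ten-row sample from the result tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The REFERENCES Salesforce clause is left out because that table is not shown.
conn.execute(
    """CREATE TABLE SalesReport
       (salesman CHAR(20) NOT NULL PRIMARY KEY,
        territory INTEGER NOT NULL,
        sales_tot DECIMAL(8,2) NOT NULL)"""
)
# The ten sample rows that appear in the result tables, all in territory 1.
conn.executemany(
    "INSERT INTO SalesReport VALUES (?, 1, ?)",
    [("Wilson", 990), ("Smith", 950), ("Richard", 800), ("Quinn", 700),
     ("Parker", 345), ("Jones", 345), ("Hubbard", 345),
     ("Date", 200), ("Codd", 200), ("Blake", 100)],
)

TEMPLATE = """
SELECT S1.salesman,
       (SELECT COUNT({quantifier} S2.sales_tot)
          FROM SalesReport AS S2
         WHERE S2.sales_tot {comparison} S1.sales_tot
           AND S2.territory = S1.territory) AS place_nbr
  FROM SalesReport AS S1
 ORDER BY S1.sales_tot DESC, S1.salesman DESC
"""


def counts(quantifier, comparison):
    """Run one variation and return its counts, highest sales first."""
    sql = TEMPLATE.format(quantifier=quantifier, comparison=comparison)
    return [place for _salesman, place in conn.execute(sql)]


version_1 = counts("DISTINCT", ">=")  # ranking, starts at one
version_2 = counts("DISTINCT", ">")   # ranking, starts at zero
version_3 = counts("", ">=")          # standing, starts at one
version_4 = counts("", ">")           # standing, starts at zero

for label, result in [(1, version_1), (2, version_2),
                      (3, version_3), (4, version_4)]:
    print("Version", label, result)
```

An empty quantifier stands in for COUNT(ALL), which is the default behavior of COUNT on a column.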
Another system, used in some British schools and in horse racing,
will also leave gaps in the numbers, but in a different direction. For
example, given this set of marks:
Marks  class_standing
=====================
  100               1
   90               2
   90               2
   70               4
Both students with 90 were second because only one person had a
higher mark. The student with 70 was fourth because there were three
people ahead of him. With our data, that would be:
SELECT S1.salesman, S1.territory, S1.sales_tot,
(SELECT COUNT(S2.sales_tot)
FROM SalesReport AS S2
WHERE S2.sales_tot > S1.sales_tot
AND S2.territory = S1.territory) + 1 AS british
FROM SalesReport AS S1;
salesman   territory  sales_tot  british
========================================
'Wilson'           1     990.00        1
'Smith'            1     950.00        2
'Richard'          1     800.00        3
'Quinn'            1     700.00        4
'Parker'           1     345.00        5
'Jones'            1     345.00        5
'Hubbard'          1     345.00        5
'Date'             1     200.00        8
'Codd'             1     200.00        8
'Blake'            1     100.00       10
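The same "count the people strictly ahead, then add one" rule can be checked against the four-mark example. This is a small sketch, assuming SQLite via Python's sqlite3 module; the Marks table layout and its student_id column are invented here, since the text shows only the marks themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: student_id exists only to give each row a key.
conn.execute(
    "CREATE TABLE Marks (student_id INTEGER PRIMARY KEY, mark INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO Marks VALUES (?, ?)",
    [(1, 100), (2, 90), (3, 90), (4, 70)],
)

standings = conn.execute(
    """
    SELECT M1.mark,
           (SELECT COUNT(*)
              FROM Marks AS M2
             WHERE M2.mark > M1.mark) + 1 AS class_standing
      FROM Marks AS M1
     ORDER BY M1.mark DESC, M1.student_id
    """
).fetchall()

for mark, class_standing in standings:
    print(mark, class_standing)
```

Both students with 90 come out second, and the gap falls after the tie rather than before it, which is the difference from the Version 3 standing.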
As an aside for the mathematicians among the readers, I always use
the heuristic that it helps solve an SQL problem to think in terms of
sets. What we are looking for in these ranking queries is how to assign an
ordinal number to a subset of the SalesReport table. This subset is the
rows that have an equal or higher sales volume than the salesman at
whom we are looking. Or in other words, one copy of the SalesReport
table provides the elements of the subsets, and the other copy provides
the boundary of the subsets. This count is really a sequence of nested
subsets.
If you happen to have had a good set theory course, you would
remember John von Neumann’s definition of the nth ordinal number; it
is the set of all ordinal numbers less than the nth number.
23.6.5 Quintiles and Related Statistics
Once you have the ranking, it is fairly easy to classify the data set into
percentiles, quintiles, or deciles. These are coarser versions of a ranking
that use subsets of roughly equal size. A quintile is 1/5 of the population,
a decile is 1/10 of the population, and a percentile is 1/100 of the
population. I will present quintiles here, since whatever we do for them
can be generalized to other partitionings. This statistic is popular with
schools, so I will use the SAT scores for an imaginary group of students
for my example.
SELECT T1.student_id, T1.score, T1.rank,
CASE WHEN T1.rank <= 0.2 * T2.population_size THEN 1
WHEN T1.rank <= 0.4 * T2.population_size THEN 2
WHEN T1.rank <= 0.6 * T2.population_size THEN 3
WHEN T1.rank <= 0.8 * T2.population_size THEN 4
ELSE 5 END AS quintile
FROM (SELECT S1.student_id, S1.score,
(SELECT COUNT(*)
FROM SAT_Scores AS S2
WHERE S2.score >= S1.score)
FROM SAT_Scores AS S1) AS T1(student_id, score, rank)
CROSS JOIN
(SELECT COUNT(*) FROM SAT_Scores)
AS T2(population_size);
The idea is straightforward: compute the rank for each element and
then put it into a bucket whose size is determined by the population
size. There are the same problems with ties that we had with rankings, as
well as problems about what to do when the population is skewed.
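The bucketing can be tried on a toy population. The sketch below assumes SQLite through Python's sqlite3 module and ten made-up students with distinct scores. SQLite does not accept a column list after a derived-table alias, so the aliases score_rank and population_size are given inside the subqueries instead; that is an adaptation for this engine, not a change to the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SAT_Scores (student_id INTEGER PRIMARY KEY, score INTEGER NOT NULL)"
)
# Hypothetical data: ten students with scores 800, 780, ..., 620 (no ties).
conn.executemany(
    "INSERT INTO SAT_Scores VALUES (?, ?)",
    [(i, 800 - 20 * (i - 1)) for i in range(1, 11)],
)

quintile_rows = conn.execute(
    """
    SELECT T1.student_id, T1.score, T1.score_rank,
           CASE WHEN T1.score_rank <= 0.2 * T2.population_size THEN 1
                WHEN T1.score_rank <= 0.4 * T2.population_size THEN 2
                WHEN T1.score_rank <= 0.6 * T2.population_size THEN 3
                WHEN T1.score_rank <= 0.8 * T2.population_size THEN 4
                ELSE 5 END AS quintile
      FROM (SELECT S1.student_id, S1.score,
                   (SELECT COUNT(*)
                      FROM SAT_Scores AS S2
                     WHERE S2.score >= S1.score) AS score_rank
              FROM SAT_Scores AS S1) AS T1
           CROSS JOIN
           (SELECT COUNT(*) AS population_size FROM SAT_Scores) AS T2
     ORDER BY T1.score_rank
    """
).fetchall()

for row in quintile_rows:
    print(row)
```

With no ties and a population that divides evenly by five, each quintile gets exactly two students. Adding tied scores to the sample data is a quick way to watch the buckets become uneven.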
23.7 Cross Tabulations
A cross tabulation, or crosstab for short, is a common statistical report. It
can be done in IBM’s QMF tool, using the ACROSS summary option, and
in many other SQL-based reporting packages. SPSS, SAS, and other
statistical packages have library procedures or language constructs for
crosstabs. Many spreadsheets can load the results of SQL queries and
perform a crosstab within the spreadsheet.
If you can use a reporting package on the server in a client/server
system instead of the following method, do so. It will run faster and in
less space than the method discussed here.
However, if you have to use the reporting package on the client side,
the extra time required to transfer data will make these methods on the
server side much faster.
A one-way crosstab “flattens out” a table to display it in a report
format. Assume that we have a table of sales by product and the years the
sales were made. We want to print out a report of the sales of products
by year; the example uses five years, but the same pattern extends to a
full decade. The solution is to create a table and populate it
to look like an identity matrix (all elements on the diagonal are one, all
others zero) with a rightmost column of all ones to give a row total, then
JOIN the Sales table to it.
CREATE TABLE Sales
(product_name CHAR(15) NOT NULL,
product_price DECIMAL(5,2) NOT NULL,
qty INTEGER NOT NULL,
sales_year INTEGER NOT NULL);
CREATE TABLE Crosstabs
(year INTEGER NOT NULL,
year1 INTEGER NOT NULL,
year2 INTEGER NOT NULL,
year3 INTEGER NOT NULL,
year4 INTEGER NOT NULL,
year5 INTEGER NOT NULL,
row_total INTEGER NOT NULL);
The table would be populated as follows:
year  year1  year2  year3  year4  year5  row_total
==================================================
1990      1      0      0      0      0          1
1991      0      1      0      0      0          1
1992      0      0      1      0      0          1
1993      0      0      0      1      0          1
1994      0      0      0      0      1          1
The query to produce the report table is
SELECT S1.product_name,
       SUM(S1.qty * S1.product_price * C1.year1),
       SUM(S1.qty * S1.product_price * C1.year2),
       SUM(S1.qty * S1.product_price * C1.year3),
       SUM(S1.qty * S1.product_price * C1.year4),
       SUM(S1.qty * S1.product_price * C1.year5),
       SUM(S1.qty * S1.product_price * C1.row_total)
  FROM Sales AS S1, Crosstabs AS C1
 WHERE S1.sales_year = C1.year
 GROUP BY S1.product_name;
Obviously, (S1.product_price * S1.qty) is the total dollar
amount of each sale. The yearN column will be either a
one or a zero. If it is a zero, the sale contributes nothing to that SUM();
if it is a one, the sale's dollar amount goes into the SUM() unchanged.
This solution lets you adjust the time frame being shown in the report
by replacing the values in the year column with whatever consecutive years
you wish.
A two-way crosstab takes two variables and produces a
spreadsheet with all values of one variable on the rows and all values of
the other represented by the columns. Each cell in the table holds the
COUNT of entities that have those values for the two variables. NULLs will
not fit into a crosstab very well, unless you decide to make them a group
of their own or to remove them.
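The one-way crosstab built on the identity-matrix table can be run end to end on a few rows. This is a sketch only, assuming SQLite via Python's sqlite3 module; the products 'Widget' and 'Gadget', their prices, and the quantities are hypothetical, and the join is written on Sales.sales_year because that is the column the Sales table declares.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Sales
    (product_name CHAR(15) NOT NULL,
     product_price DECIMAL(5,2) NOT NULL,
     qty INTEGER NOT NULL,
     sales_year INTEGER NOT NULL);
    CREATE TABLE Crosstabs
    (year INTEGER NOT NULL,
     year1 INTEGER NOT NULL, year2 INTEGER NOT NULL,
     year3 INTEGER NOT NULL, year4 INTEGER NOT NULL,
     year5 INTEGER NOT NULL, row_total INTEGER NOT NULL);
    """
)

# Identity matrix for 1990 through 1994, plus a column of ones for the total.
FIRST_YEAR = 1990
for offset in range(5):
    flags = [1 if col == offset else 0 for col in range(5)]
    conn.execute(
        "INSERT INTO Crosstabs VALUES (?, ?, ?, ?, ?, ?, ?)",
        [FIRST_YEAR + offset] + flags + [1],
    )

# Hypothetical sales rows.
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [("Widget", 10.00, 2, 1990), ("Widget", 10.00, 1, 1990),
     ("Widget", 10.00, 3, 1992), ("Gadget", 5.00, 4, 1990),
     ("Gadget", 5.00, 1, 1994)],
)

report = conn.execute(
    """
    SELECT S1.product_name,
           SUM(S1.qty * S1.product_price * C1.year1),
           SUM(S1.qty * S1.product_price * C1.year2),
           SUM(S1.qty * S1.product_price * C1.year3),
           SUM(S1.qty * S1.product_price * C1.year4),
           SUM(S1.qty * S1.product_price * C1.year5),
           SUM(S1.qty * S1.product_price * C1.row_total)
      FROM Sales AS S1, Crosstabs AS C1
     WHERE S1.sales_year = C1.year
     GROUP BY S1.product_name
     ORDER BY S1.product_name
    """
).fetchall()

for row in report:
    print(row)
```

Shifting the report to another five-year window only requires changing FIRST_YEAR, which mirrors the point about replacing the values in the year column.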
Another trick is to use the POSITION() function to convert a string
into a one or a zero. For example, assume we have a “day of the week”
function that returns a three-letter abbreviation and we want to report
the sales of items by day of the week in a horizontal list.
CREATE TABLE Weekdays
(day_name CHAR(3) NOT NULL PRIMARY KEY,
mon INTEGER NOT NULL,
tue INTEGER NOT NULL,
wed INTEGER NOT NULL,
thu INTEGER NOT NULL,
fri INTEGER NOT NULL,
sat INTEGER NOT NULL,
sun INTEGER NOT NULL);
INSERT INTO Weekdays
VALUES ('MON', 1, 0, 0, 0, 0, 0, 0),
('TUE', 0, 1, 0, 0, 0, 0, 0),
('WED', 0, 0, 1, 0, 0, 0, 0),
('THU', 0, 0, 0, 1, 0, 0, 0),
('FRI', 0, 0, 0, 0, 1, 0, 0),
('SAT', 0, 0, 0, 0, 0, 1, 0),
('SUN', 0, 0, 0, 0, 0, 0, 1);
SELECT item,
       SUM(amt * qty
           * mon * POSITION('MON' IN DOW(sales_date))) AS mon_tot,
       SUM(amt * qty
           * tue * POSITION('TUE' IN DOW(sales_date))) AS tue_tot,
       SUM(amt * qty
           * wed * POSITION('WED' IN DOW(sales_date))) AS wed_tot,
       SUM(amt * qty
           * thu * POSITION('THU' IN DOW(sales_date))) AS thu_tot,
       SUM(amt * qty
           * fri * POSITION('FRI' IN DOW(sales_date))) AS fri_tot,
       SUM(amt * qty
           * sat * POSITION('SAT' IN DOW(sales_date))) AS sat_tot,
       SUM(amt * qty
           * sun * POSITION('SUN' IN DOW(sales_date))) AS sun_tot
  FROM Weekdays, Sales
 GROUP BY item;
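Here is a rough sketch of the POSITION() trick in a form that can be executed. Everything engine-specific is an assumption: SQLite has no POSITION() or DOW(), so the sketch uses SQLite's instr(string, substring) in place of POSITION(substring IN string) and registers a small Python function under the name DOW. The Sales columns (item, amt, qty, sales_date) and the sample rows are also hypothetical, since the text does not define them.

```python
import sqlite3
from datetime import date

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def dow(iso_date):
    """Stand-in for DOW(): three-letter day name for an ISO date string."""
    return DAYS[date.fromisoformat(iso_date).weekday()]


conn = sqlite3.connect(":memory:")
conn.create_function("DOW", 1, dow)

# Weekdays holds a 7-by-7 identity matrix keyed by day name.
conn.execute(
    "CREATE TABLE Weekdays (day_name CHAR(3) NOT NULL PRIMARY KEY, "
    + ", ".join(f"{d.lower()} INTEGER NOT NULL" for d in DAYS)
    + ")"
)
for i, day in enumerate(DAYS):
    flags = [1 if j == i else 0 for j in range(7)]
    conn.execute(
        "INSERT INTO Weekdays VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [day] + flags
    )

# Hypothetical Sales layout and rows; 2024-01-01 was a Monday.
conn.execute(
    """CREATE TABLE Sales
       (item CHAR(10) NOT NULL, amt DECIMAL(5,2) NOT NULL,
        qty INTEGER NOT NULL, sales_date CHAR(10) NOT NULL)"""
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [("pen", 2.00, 3, "2024-01-01"), ("pen", 2.00, 1, "2024-01-08"),
     ("pen", 2.00, 5, "2024-01-02"), ("ink", 4.00, 2, "2024-01-06")],
)

# One SUM() per weekday; instr() yields 1 on a match and 0 otherwise.
totals = ",\n       ".join(
    f"SUM(amt * qty * {d.lower()} * instr(DOW(sales_date), '{d}'))"
    f" AS {d.lower()}_tot"
    for d in DAYS
)
sql = (f"SELECT item,\n       {totals}\n"
       "  FROM Weekdays, Sales\n GROUP BY item\n ORDER BY item")
weekday_report = conn.execute(sql).fetchall()

for row in weekday_report:
    print(row)
```

Because the tables are cross joined, each sale meets all seven Weekdays rows, and only the row whose flag and whose day name both match contributes a nonzero term to the sum.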
There are also totals for each column and each row, as well as a grand
total. Crosstabs of (n) variables are defined by building an n-dimensional
spreadsheet. But you cannot easily print (n) dimensions on
two-dimensional paper. The usual trick is to display the results as a
two-dimensional grid with one or both axes as a tree structure. The way the
values are nested on the axis is usually under program control; thus,
“race within sex” shows sex broken down by race, whereas “sex within
race” shows race broken down by sex.
Assume that we have a table, Personnel (emp_nbr, sex, race, job_nbr,
salary_amt), keyed on employee number, with no NULLs in any
column. We wish to write a crosstab of employees by sex and race,
which would look like this:
        asian  black  caucasian  latino  Other  TOTALS
======================================================
Male        3      2         12       5      5      27
Female      1     10         20       2      9      42
TOTAL       4     12         32       7     14      69
The first thought is to use a GROUP BY and write a simple query, thus:
SELECT sex, race, COUNT(*)
FROM Personnel
GROUP BY sex, race;
