Joe Celko's SQL for Smarties - Advanced SQL Programming

CHAPTER 26: SET OPERATIONS
CREATE TABLE B (i INTEGER NOT NULL);
INSERT INTO B VALUES (2), (2), (3), (3);
The UNION and INTERSECT operations have regular behavior in that:
(A UNION B) = SELECT DISTINCT (A UNION ALL B) = ((1), (2), (3))
and
(A INTERSECT B) = SELECT DISTINCT (A INTERSECT ALL B) = (2)
However,
(A EXCEPT B) <> SELECT DISTINCT (A EXCEPT ALL B)
Or, more literally, (1) <> ((1), (2)) for the tables given in the example.
Likewise, we have:
(B EXCEPT A) = SELECT DISTINCT (B EXCEPT ALL A) = (3)
by a coincidence of the particular values used in these tables.
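
The duplicate-counting rules behind these identities can be checked outside SQL. The following is a minimal Python sketch that uses collections.Counter as a stand-in for a table with duplicate rows. The contents of A used here, (1), (1), (2), (2), (2), are an assumption chosen to agree with the results quoted above, and the helper names are made up for illustration.

```python
from collections import Counter

# Assumed contents of table A (hypothetical; chosen to match the results in the text).
table_a = Counter([1, 1, 2, 2, 2])
# Table B as declared in the text.
table_b = Counter([2, 2, 3, 3])


def except_all(x, y):
    """x EXCEPT ALL y: subtract duplicate counts, dropping values whose count reaches zero."""
    return x - y


def intersect_all(x, y):
    """x INTERSECT ALL y: keep the smaller duplicate count for each value."""
    return x & y


def except_distinct(x, y):
    """x EXCEPT y: the distinct values of x that do not occur in y at all."""
    return {value for value in x.elements() if y[value] == 0}


def select_distinct(x):
    """SELECT DISTINCT over a bag: the set of values that still occur."""
    return set(x.elements())


if __name__ == "__main__":
    print(except_distinct(table_a, table_b), select_distinct(except_all(table_a, table_b)))
    print(except_distinct(table_b, table_a), select_distinct(except_all(table_b, table_a)))
```

With these assumed rows, removing duplicates after EXCEPT ALL leaves a value that plain EXCEPT removes, which is the asymmetry the text describes.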
26.4 Equality and Proper Subsets
At one point, when SQL was still in the laboratory at IBM, there was a
CONTAINS operator that would tell you if one table was a subset of
another. It disappeared in later versions of the language and no vendor
picked it up. Set equality was never part of SQL as an operator, so you
would have had to use the pair of tests
((A CONTAINS B) AND (B CONTAINS A)) to find out.
Today, you can use the methods shown in the section on Relational
Division to determine containment or equality. However, Itzik Ben-Gan
came up with a novel approach for finding containment and equality
that is worth a mention.
SELECT SUM(DISTINCT match_col)
  FROM (SELECT CASE
               WHEN S1.col
                    IN (SELECT S2.col FROM S2)
               THEN 1 ELSE -1 END
          FROM S1) AS X(match_col)
HAVING SUM(DISTINCT match_col) = :n;
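You can set (:n) to 1, 0, or −1 for each particular test.
When I find a matching row in S1, I get a +1; when I find a
mismatched row in S1, I get a −1. SUM(DISTINCT) then adds up the
distinct values that occurred. If they sum to +1, every row of S1 has a
match in S2, so S1 is a subset of S2. If they sum to zero, both a +1 and
a −1 occurred, so S1 overlaps S2 but is not contained in it. If they sum
to −1, the two tables are disjoint. The query looks only at the rows of
S1, so a test for equality has to run it in both directions: the tables
hold the same values when both sums are +1, and S1 is a proper subset of
S2 when only the S1-against-S2 sum is +1.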
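
To see what each value of the sum reports, the query can be run against small tables. This is a rough sketch using Python's sqlite3 module. The function name and the sample values are hypothetical; the derived-table column list X(match_col) is replaced by an ordinary column alias, and the comparison with :n is done in Python instead of a HAVING clause, because SQLite versions differ in what they accept there.

```python
import sqlite3


def containment_score(s1_values, s2_values):
    """Return SUM(DISTINCT match_col) for S1 checked against S2 (None if S1 is empty)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE S1 (col INTEGER NOT NULL)")
    conn.execute("CREATE TABLE S2 (col INTEGER NOT NULL)")
    conn.executemany("INSERT INTO S1 VALUES (?)", [(v,) for v in s1_values])
    conn.executemany("INSERT INTO S2 VALUES (?)", [(v,) for v in s2_values])
    (score,) = conn.execute(
        """
        SELECT SUM(DISTINCT match_col)
          FROM (SELECT CASE
                       WHEN S1.col IN (SELECT S2.col FROM S2)
                       THEN 1 ELSE -1 END AS match_col
                  FROM S1) AS X
        """
    ).fetchone()
    conn.close()
    return score


if __name__ == "__main__":
    print(containment_score([1, 2], [1, 2, 3]))   # every S1 row matches
    print(containment_score([1, 4], [1, 2, 3]))   # some rows match, some do not
    print(containment_score([4, 5], [1, 2, 3]))   # no row matches
```

Swapping the two arguments gives the test in the other direction. An empty S1 yields NULL (None in Python) rather than one of the three values.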

CHAPTER 27: SUBSETS

I am defining subset operations as queries that extract a particular
subset from a given set, as opposed to set operations, which work
among sets. The obvious way to extract a subset from a table is just to
use a WHERE clause, which will pull out the rows that meet that
criterion. But not all the subsets we want are easily defined by such a
simple predicate. This chapter is a collection of tricks for constructing
useful, but not obvious, subsets from a table.

27.1 Every nth Item in a Table

SQL is a set-oriented language, which cannot identify individual rows
by their physical positions in a disk file that holds a table. Instead, a
row is located by a logical expression on a unique key and then
retrieved. If you are given a file of employees in which the ordering of
the file is based on their employee numbers, and you want to pick out
every nth employee record for a survey, the job is easy. You write a
procedure that loops through the file and writes every nth one to a
second file.
The immediate thought of how this should be done in SQL is to
simply compute MOD (emp_nbr, :n), where MOD() is the modulo
function found in most SQL implementations, and save those
employee rows where this function is zero. The trouble is that
employees are not issued consecutive identification numbers. The
identification numbers are unique, but there are gaps between them, so
the rows where the function is zero are not every nth employee.
Vendor extensions often include an exposed physical row locator that
gives a sequential numbering to the physical records; this sequential
numbering can be used to perform these functions. This practice is a
complete violation of Dr. Codd’s definition of a relational database, and it
requires that the underlying physical implementation use a contiguous
sequential record for each row. Such things are highly proprietary, but
because these features are so low-level, they will run very fast on that one
particular product.
Row numbers have more problems than being nonstandard. If the
physical storage is rearranged, then the row numbers have to change.
Users logged on and looking at the same base table through different
VIEWs may or may not get the same row number for the same physical
row. One of the advantages of an RDBMS was supposed to be that the
logical view of the data would be consistent, even when the physical
storage changed.
You can get similar results with a self-JOIN on the Personnel table
to partition it into a nested series of grouped tables, just as we did for
the “top n” problem. You then pick out the largest value in each
group. There may be an index or a uniqueness constraint on the
emp_nbr column to ensure uniqueness, so the EXISTS predicate will
get a performance boost.

SELECT P1.emp_nbr
  FROM Personnel AS P1
 WHERE EXISTS
       (SELECT MAX(emp_nbr)
          FROM Personnel AS P2
         WHERE P1.emp_nbr >= P2.emp_nbr
        HAVING MOD (COUNT(*), :n) = 0);

A nonnested version of the same query looks like this:

SELECT P1.emp_nbr
  FROM Personnel AS P1, Personnel AS P2
 WHERE P1.emp_nbr >= P2.emp_nbr
 GROUP BY P1.emp_nbr
HAVING MOD (COUNT(*), :n) = 0;

Both queries count the number of P2 rows with a value less than or
equal to that of the P1 row.
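
The nonnested form is easy to try against a key column with gaps in it. Here is a small sketch using Python's sqlite3 module. The function name and the employee numbers are made up for illustration, and the % operator stands in for MOD(), which not every SQLite build provides.

```python
import sqlite3


def every_nth(emp_nbrs, n):
    """Return every nth emp_nbr, counting rows in key order rather than by key value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Personnel (emp_nbr INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO Personnel VALUES (?)", [(e,) for e in emp_nbrs])
    rows = conn.execute(
        """
        SELECT P1.emp_nbr
          FROM Personnel AS P1, Personnel AS P2
         WHERE P1.emp_nbr >= P2.emp_nbr
         GROUP BY P1.emp_nbr
        HAVING COUNT(*) % :n = 0
         ORDER BY P1.emp_nbr
        """,
        {"n": n},
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Employee numbers are unique but not consecutive.
    print(every_nth([101, 105, 110, 117, 123, 130, 142], 3))
```

The COUNT(*) for each P1 row is that row's position in key order, so the gaps in the numbering no longer matter. The self-join compares every pair of rows, so the cost grows with the square of the table size.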

27.2 Picking Random Rows from a Table

The answer is that, basically, you cannot directly pick a set of random
rows from a table in SQL. There is no randomize operator in the
standard, and you don’t often find the same pseudo-random number
generator function in various vendor extensions, either.
Picking random rows from a table for a statistical sample is a handy
thing, and you do it in other languages with a pseudo-random number
generator. There are two kinds of random drawings from a set, with or
without replacement. If SQL had random number functions, I suppose
they would be shown as RANDOM(x) and RANDOM(DISTINCT x). But
there is no such function in SQL, and none is planned. Examples from
the real world include dealing a poker hand (a random with no
replacement situation) and shooting craps (a random with replacement
situation). If two players in a poker game get identical cards, you are
using a pinochle deck. In a craps game, each roll of the dice is
independent of the previous one and can repeat it.
The problem is that SQL is a set-oriented language, and wants to do
an operation “all at once” on a well-defined set of rows. Random sets
are, by definition, produced by a nondeterministic procedure instead of
a deterministic logic expression.
The SQL/PSM language does have an option to declare or create a
procedure that is DETERMINISTIC or NOT DETERMINISTIC. The
DETERMINISTIC option means that the optimizer can compute this
function once for a set of input parameter values and then use that result
everywhere in the current SQL statement that a call to the procedure
with those parameters appears. The NOT DETERMINISTIC option
means given the same parameters, you might not get the same results for
each call to the procedure within the same SQL statement.

Unfortunately, most SQL products do not have this feature in their
proprietary procedural languages. Thus, the random number function in
Oracle is nondeterministic and the one in SQL Server is deterministic.
For example,

CREATE TABLE RandomNbrs
(seq_nbr INTEGER NOT NULL PRIMARY KEY,
 randomizer FLOAT NOT NULL);

INSERT INTO RandomNbrs
VALUES (1, RANDOM()),
       (2, RANDOM()),
       (3, RANDOM());

This statement will result in the three rows all getting the same value
in the randomizer column in a version of SQL Server, but three different
numbers in a version of Oracle.
While subqueries are not allowed in DEFAULT clauses, system-related
functions such as CURRENT_TIMESTAMP and CURRENT_USER are
allowed. In some SQL implementations, this includes the RANDOM()
function.

CREATE TABLE RandomNbrs2
(seq_nbr INTEGER PRIMARY KEY,
 randomizer FLOAT -- warning !! not standard SQL
   DEFAULT (
    (CASE (CAST(RANDOM() + 0.5 AS INTEGER) * -1)
     WHEN 0.0 THEN 1.0 ELSE -1.0 END)
    * MOD (CAST(RANDOM() * 100000 AS INTEGER), 10000)
    * RANDOM())
   NOT NULL);

INSERT INTO RandomNbrs2
VALUES (1, DEFAULT),
       (2, DEFAULT),
       (3, DEFAULT),
       (4, DEFAULT),
       (5, DEFAULT),
       (6, DEFAULT),
       (7, DEFAULT),
       (8, DEFAULT),
       (9, DEFAULT),
       (10, DEFAULT);

Here is a sample output from an SQL Server 7.0 implementation.

seq_nbr  randomizer
============================
      1  -121.89758452446999
      2  -425.61113508053933
      3  3918.1554683876675
      4  9335.2668286173412
      5  54.463890640027664
      6  -5.0169085346410522
      7  -5430.63417246276
      8  915.9835973796487
      9  28.109161998753301
     10  741.79452047043048

The best way to do this is to add a column to the table to hold a
random number, then use a cursor in a host language that has a good
pseudo-random number generator in its function library to load the new
column with random values. You have to do it this way, because random
number generators work differently from other function calls. They
start with an initial value called a “seed” (shown as Random[0] in the
rest of this discussion) provided by the user or the system clock. The
seed is used to create the first number in the sequence, Random[1]. Then
each call, Random[n], to the function uses the previous number to
generate the next one, Random[n+1].
There is no way to do a sequence of actions in SQL without a cursor,
so you are in procedural code.
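
As a rough illustration of the host-language approach, the sketch below uses Python's sqlite3 and random modules in the role of the external language and its generator. The function names are hypothetical, and the table is a cut-down version of RandomNbrs whose randomizer column starts out empty.

```python
import random
import sqlite3


def make_table(n_rows):
    """Create an in-memory RandomNbrs table whose randomizer column is still empty."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE RandomNbrs "
        "(seq_nbr INTEGER NOT NULL PRIMARY KEY, randomizer FLOAT)"
    )
    conn.executemany(
        "INSERT INTO RandomNbrs (seq_nbr) VALUES (?)",
        [(i,) for i in range(1, n_rows + 1)],
    )
    return conn


def load_randomizer(conn, seed):
    """Walk the rows one at a time and store the next generator value in each."""
    rng = random.Random(seed)  # the seed plays the role of Random[0]
    keys = [row[0] for row in conn.execute(
        "SELECT seq_nbr FROM RandomNbrs ORDER BY seq_nbr")]
    for key in keys:
        # Each call advances the generator state, so the rows get successive values.
        conn.execute(
            "UPDATE RandomNbrs SET randomizer = ? WHERE seq_nbr = ?",
            (rng.random(), key),
        )
    conn.commit()


if __name__ == "__main__":
    conn = make_table(5)
    load_randomizer(conn, seed=42)
    for row in conn.execute("SELECT * FROM RandomNbrs ORDER BY seq_nbr"):
        print(row)
```

Passing the same seed reproduces the same column, which is handy for testing; a seed taken from the clock gives a different column on each run.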
A “pseudo-random number generator” is often referred to as just a
“random number generator,” but this is technically wrong. All of the
generators will eventually return a value that appeared in the sequence
earlier and the procedure will hang in a cycle. Procedures are
deterministic, and we are living in a mathematical heresy when we try to
use them to produce truly random results. However, if the sequence has
a very long cycle and meets some other tests for randomness over the
range of the cycle, then we can use it.
There are many kinds of generators. The linear congruential pseudo-
random number generator family has generator formulas of the form:

Random[n+1] := MOD ((x * Random[n] + y), m);

There are restrictions on the relationships among x, y, and m that deal
with their relative primality. Knuth gives a proof that if

Random[0] is not a multiple of 2 or 5
m = 10^e where (e >= 5)
y = 0
MOD (x, 200) is in the set (3, 11, 13, 19, 21, 27, 29, 37, 53,
59, 61, 67, 69, 77, 83, 91, 109, 117, 123, 131, 133, 139, 141, 147,
163, 171, 173, 179, 181, 187, 189, 197)

then the period will be 5 * 10^(e-2).
There are old favorites that many C programmers use from this
family, such as:

Random(n+1) := (Random(n) * 1103515245) + 12345;

Random(n+1) := MOD ((16807 * Random(n)), ((2^31) - 1));

The first formula has the advantage of not requiring a MOD function,
so it can be written in standard SQL. However, the simplest generator
that can be recommended (Park and Miller) uses:

Random(n+1) := MOD ((48271 * Random(n)), ((2^31) - 1));

Notice that the modulus is a prime number; this is important.
The period of this generator is ((2^31) − 2), which is 2,147,483,646,
or more than two billion numbers before this generator repeats. You
must determine whether this is long enough for your application.
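
The Park and Miller formula takes only a few lines in a host language. The following is a minimal Python sketch. The function and constant names are made up for illustration, and Python's arbitrary-precision integers sidestep the overflow handling that a 32-bit implementation would need for the product.

```python
MULTIPLIER = 48271
MODULUS = 2**31 - 1  # a prime number, as the text requires


def park_miller(seed):
    """Yield Random[1], Random[2], ... starting from Random[0] = seed."""
    if not 0 < seed < MODULUS:
        raise ValueError("seed must be between 1 and (2^31) - 2")
    value = seed
    while True:
        # Random[n+1] := MOD ((48271 * Random[n]), ((2^31) - 1))
        value = (MULTIPLIER * value) % MODULUS
        yield value


if __name__ == "__main__":
    generator = park_miller(seed=1)
    print([next(generator) for _ in range(5)])
    # Dividing by the modulus maps a value into the open interval (0, 1).
    print(next(generator) / MODULUS)
```

Because the multiplier is fixed, the whole sequence is determined by the seed; a zero seed is rejected because it would produce nothing but zeros.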
If you have an XOR function in your SQL, then you can also use shift
register algorithms. The XOR is the bitwise exclusive OR that works on
an integer as it is stored in the hardware; I would assume 32 bits on most
small computers. Some usable shift register algorithms are:

Random(n+1) := Random(n-103) XOR Random(n-250);
Random(n+1) := Random(n-1063) XOR Random(n-1279);
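
A shift register generator needs a history of earlier values before it can produce anything. The Python sketch below is one way to arrange that, under two assumptions that are not in the text: the register is filled with the first 250 outputs of a linear congruential generator, and the lags 103 and 250 are counted back from the value being produced. The function name is hypothetical.

```python
from collections import deque


def shift_register(seed, short_lag=103, long_lag=250):
    """Yield values where each one is the XOR of the values short_lag and long_lag steps back."""
    if not 0 < seed < 2**31 - 1:
        raise ValueError("seed must be between 1 and (2^31) - 2")
    # Assumed seeding scheme: fill the register from a linear congruential generator.
    state = seed
    history = deque(maxlen=long_lag)
    for _ in range(long_lag):
        state = (48271 * state) % (2**31 - 1)
        history.append(state)
    while True:
        new_value = history[-short_lag] ^ history[-long_lag]
        history.append(new_value)  # the oldest value drops out of the register
        yield new_value


if __name__ == "__main__":
    generator = shift_register(seed=12345)
    print([next(generator) for _ in range(5)])
```

The second pair of lags from the text can be tried by passing short_lag=1063 and long_lag=1279. The quality of the output depends on how the register is filled, so the seeding shown here should be treated as a placeholder.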

One method for writing a random number generator on the fly when
the vendor’s library does not have one is to pick a seed using one or more
key columns and a call to the system clock’s fractional seconds, such as
RANDOM(keycol + EXTRACT (SECOND FROM CURRENT_TIME)) * 1000.
This avoids problems with patterns in the keys, while the key
column values ensure uniqueness of the seed values.
Another method is to use a PRIMARY KEY or UNIQUE column(s) and
apply a hashing algorithm. You can pick one of the random number
generator functions already discussed and use the unique value, as if it
were the seed, as a quick way to get a hashing function. Hashing
algorithms try to be uniformly distributed, so if you can find a good one,
you will approach nearly unique random selection. The trick is that the
hashing algorithm has to be simple enough to be written in the limited
math available in SQL.
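
As a sketch of the hashing idea, the key can be pushed through a single step of the Park and Miller formula and the rows ordered by the result. The example below uses Python's sqlite3 module. The function name, the table layout, and the LIMIT clause (a vendor extension) are assumptions made for illustration, and the keys are assumed to be non-negative integers.

```python
import sqlite3


def sample_by_hash(keys, sample_size):
    """Pick sample_size keys by ordering the rows on a simple hash of the key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE LotteryTickets (keycol INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany("INSERT INTO LotteryTickets VALUES (?)", [(k,) for k in keys])
    rows = conn.execute(
        """
        SELECT keycol
          FROM LotteryTickets
         ORDER BY (48271 * keycol) % 2147483647, keycol
         LIMIT :k
        """,
        {"k": sample_size},
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(sample_by_hash(range(100000, 100020), 5))
```

A fixed hash returns the same sample on every run. That suits a repeatable survey; if the sample should change, the hash needs some varying input such as the clock value mentioned earlier.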
Once you have a column of random numbers, you can convert the
random numbers into a randomly ordered sequence with this statement:

UPDATE RandomNbrs
   SET randomizer = (SELECT COUNT(*)
                       FROM RandomNbrs AS S1
                      WHERE S1.randomizer <= RandomNbrs.randomizer);
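
The effect of turning random values into ranks can be seen on a few rows. In this Python sqlite3 sketch the ranks are computed with a SELECT first and written back afterward, so that every rank is counted against the original random values no matter how a given product orders the work inside a single UPDATE. The function name and the sample values are hypothetical.

```python
import sqlite3


def randomizer_ranks(values):
    """Replace each randomizer value with its rank (1 = smallest) and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE RandomNbrs "
        "(seq_nbr INTEGER NOT NULL PRIMARY KEY, randomizer FLOAT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO RandomNbrs VALUES (?, ?)", list(enumerate(values, start=1))
    )
    # Rank = number of rows whose random value is less than or equal to this one.
    ranks = conn.execute(
        """
        SELECT R1.seq_nbr,
               (SELECT COUNT(*)
                  FROM RandomNbrs AS S1
                 WHERE S1.randomizer <= R1.randomizer)
          FROM RandomNbrs AS R1
         ORDER BY R1.seq_nbr
        """
    ).fetchall()
    conn.executemany(
        "UPDATE RandomNbrs SET randomizer = ? WHERE seq_nbr = ?",
        [(rank, key) for key, rank in ranks],
    )
    result = conn.execute(
        "SELECT seq_nbr, randomizer FROM RandomNbrs ORDER BY seq_nbr"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in randomizer_ranks([0.42, 0.07, 0.91, 0.13]):
        print(row)
```

If two rows happen to hold the same random value, they receive the same rank, so the result is a true permutation only when the values are distinct.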

To get one random row from a table, you can use this approach:

CREATE VIEW LotteryDrawing (keycol, ..., spin)
AS SELECT LotteryTickets.*,
          (RANDOM(<keycol> + <fractional seconds from clock>))
     FROM LotteryTickets
    GROUP BY spin
   HAVING COUNT(*) = 1;

Then simply use this query:

SELECT *
  FROM LotteryDrawing
 WHERE spin = (SELECT MAX(spin)
                 FROM LotteryDrawing);

The pseudo-random number function is not standard SQL, but it is
common enough. Using the keycol as the seed probably means that you
will get a different value for each row, but we can avoid duplicates with
the GROUP BY ... HAVING. Adding the fractional seconds will change
the result every time, but it might be illegal in some SQL products,
which disallow variable elements in VIEW definitions.
Let’s assume you have a function called RANDOM() that returns a
random number between 0.00 and 1.00. If you just want one random
row out of the table, and you have a numeric key column, Tom Moreau
proposed that you could find the MAX() and MIN(), then calculate a
random number between them.

SELECT L1.*
FROM LotteryDrawing AS L1
WHERE col_1
