Joe Celko's SQL for Smarties: Advanced SQL Programming


32 CHAPTER 1: DATABASE DESIGN
they should be one table with a column for a sex code. I would have split
a table on sex. This is very obvious, but it can also be subtler.
Consider a subscription database that has both organizational and
individual subscribers. There are two tables with the same structure and
a third table that holds the split attribute, subscription type.
CREATE TABLE OrgSubscriptions
(subscr_id INTEGER NOT NULL PRIMARY KEY
   REFERENCES SubscriptionTypes(subscr_id),
 org_name CHAR(35),
 last_name CHAR(15),
 first_name CHAR(15),
 address1 CHAR(35) NOT NULL
 -- ...
);
CREATE TABLE IndSubscriptions
(subscr_id INTEGER NOT NULL PRIMARY KEY
   REFERENCES SubscriptionTypes(subscr_id),
 org_name CHAR(35),
 last_name CHAR(15),
 first_name CHAR(15),
 address1 CHAR(35) NOT NULL
 -- ...
);
CREATE TABLE SubscriptionTypes
(subscr_id INTEGER NOT NULL PRIMARY KEY,
 subscr_type CHAR(1) DEFAULT 'I' NOT NULL
   CHECK (subscr_type IN ('I', 'O')));
An organizational subscription can go to just a person (last_name,
first_name), or just the organization name (org_name), or both. If an
individual subscription has no particular person, it is sent to an
organization called {Current Resident} instead.
The original specifications enforce a condition that subscr_id be
universally unique in the schema.
The first step is to replace the three tables with one table for all
subscriptions and move the subscription type back into a column of its
own, since it is an attribute of a subscription. Next, we need to add
constraints that enforce the rules for each subscription type.
1.1 Schema and Table Creation 33
CREATE TABLE Subscriptions
(subscr_id INTEGER NOT NULL PRIMARY KEY,
 org_name CHAR(35) DEFAULT '{Current Resident}',
 last_name CHAR(15),
 first_name CHAR(15),
 subscr_type CHAR(1) DEFAULT 'I' NOT NULL
   CHECK (subscr_type IN ('I', 'O')),
 CONSTRAINT known_addressee
   CHECK (COALESCE (org_name, first_name, last_name) IS NOT NULL),
 CONSTRAINT junkmail
   CHECK (CASE WHEN subscr_type = 'I'
                AND org_name = '{Current Resident}'
               THEN 1
               WHEN subscr_type = 'O'
                AND org_name = '{Current Resident}'
               THEN 0 ELSE 1 END = 1),
 address1 CHAR(35) NOT NULL
 -- ...
);
The known_addressee constraint means that we have to have a line
with some addressee for this to be a valid subscription. The junkmail
constraint ensures that anything not aimed at a known person is
classified as an individual subscription.
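The effect of the two named constraints can be checked by feeding the table a few rows. What follows is a minimal sketch, not the author's code: it assumes SQLite (through Python's sqlite3 module) as a stand-in for a Standard SQL product, moves address1 ahead of the table constraints because SQLite wants column definitions first, and uses made-up sample rows and a hypothetical helper named try_insert.

```python
import sqlite3

# Single-table design from the text; address1 is listed before the
# table constraints because SQLite requires columns first (assumption).
DDL = """
CREATE TABLE Subscriptions
(subscr_id INTEGER NOT NULL PRIMARY KEY,
 org_name CHAR(35) DEFAULT '{Current Resident}',
 last_name CHAR(15),
 first_name CHAR(15),
 subscr_type CHAR(1) DEFAULT 'I' NOT NULL
   CHECK (subscr_type IN ('I', 'O')),
 address1 CHAR(35) NOT NULL,
 CONSTRAINT known_addressee
   CHECK (COALESCE(org_name, first_name, last_name) IS NOT NULL),
 CONSTRAINT junkmail
   CHECK (CASE WHEN subscr_type = 'I'
                AND org_name = '{Current Resident}'
               THEN 1
               WHEN subscr_type = 'O'
                AND org_name = '{Current Resident}'
               THEN 0 ELSE 1 END = 1)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Subscriptions "
                "(subscr_id, org_name, last_name, first_name,"
                " subscr_type, address1) VALUES (?, ?, ?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical sample rows, one per rule.
results = {
    "individual_to_resident": try_insert(
        (1, "{Current Resident}", None, None, "I", "1 Main St")),
    "org_to_resident": try_insert(
        (2, "{Current Resident}", None, None, "O", "2 Main St")),
    "nobody_at_all": try_insert(
        (3, None, None, None, "I", "3 Main St")),
    "named_org": try_insert(
        (4, "Acme", None, None, "O", "4 Main St")),
}
print(results)
```

Only the junk mail addressed to an individual and the named organization get through; the organizational {Current Resident} row trips junkmail and the row with no addressee trips known_addressee.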

Attribute Split Rows
Consider this table, which directly models a sign-in/sign-out sheet.
CREATE TABLE RegisterBook
(emp_name CHAR(35) NOT NULL,
sign_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
sign_action CHAR (3) DEFAULT 'IN' NOT NULL
CHECK (sign_action IN ('IN', 'OUT')),
PRIMARY KEY (emp_name, sign_time));
To answer any basic query, you need to use two rows in a self-join to
get the sign-in and sign-out pairs for each employee. The correct
design would have been:

CREATE TABLE RegisterBook
(emp_name CHAR(35) NOT NULL,
sign_in_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
sign_out_time TIMESTAMP, -- null means current
PRIMARY KEY (emp_name, sign_in_time));
The single attribute, duration, has to be modeled as two columns in
Standard SQL, but it was split into rows identified by a code to tell which
end of the duration each one represented. If this were longitude and
latitude, you would immediately see the problem and put the two parts
of the one attribute (geographical location) in the same row.
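To make the cost of the split concrete, here is a small sketch that loads the same two work periods into both designs and asks for the sign-in/sign-out pairs. It assumes SQLite via Python's sqlite3 module; the table name SplitRegisterBook, the employee, and the timestamps are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The split-row design (renamed here so both versions can coexist).
CREATE TABLE SplitRegisterBook
(emp_name CHAR(35) NOT NULL,
 sign_time TIMESTAMP NOT NULL,
 sign_action CHAR(3) DEFAULT 'IN' NOT NULL
   CHECK (sign_action IN ('IN', 'OUT')),
 PRIMARY KEY (emp_name, sign_time));

-- The corrected design: one row per duration.
CREATE TABLE RegisterBook
(emp_name CHAR(35) NOT NULL,
 sign_in_time TIMESTAMP NOT NULL,
 sign_out_time TIMESTAMP, -- null means current
 PRIMARY KEY (emp_name, sign_in_time));

INSERT INTO SplitRegisterBook VALUES
 ('Alice', '2024-01-02 09:00:00', 'IN'),
 ('Alice', '2024-01-02 12:00:00', 'OUT'),
 ('Alice', '2024-01-02 13:00:00', 'IN'),
 ('Alice', '2024-01-02 17:00:00', 'OUT');

INSERT INTO RegisterBook VALUES
 ('Alice', '2024-01-02 09:00:00', '2024-01-02 12:00:00'),
 ('Alice', '2024-01-02 13:00:00', '2024-01-02 17:00:00');
""")

# Split rows: a self-join is needed to pair each IN with the next OUT.
split_pairs = conn.execute("""
SELECT I.emp_name, I.sign_time, MIN(O.sign_time)
  FROM SplitRegisterBook AS I
  JOIN SplitRegisterBook AS O
    ON O.emp_name = I.emp_name
   AND O.sign_action = 'OUT'
   AND O.sign_time > I.sign_time
 WHERE I.sign_action = 'IN'
 GROUP BY I.emp_name, I.sign_time
 ORDER BY I.sign_time
""").fetchall()

# One row per duration: the pair is already there.
single_rows = conn.execute("""
SELECT emp_name, sign_in_time, sign_out_time
  FROM RegisterBook
 ORDER BY sign_in_time
""").fetchall()

print(split_pairs == single_rows)
```

Both queries return the same pairs, but only the second one gets them without reassembling an attribute that should never have been taken apart.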
1.1.11 Modeling Class Hierarchies in DDL
The classic scenario in an object-oriented (OO) model calls for a root
class with all of the common attributes and then specialized subclasses
under it. As an example, let’s take the class of Vehicles and find an
industry standard identifier (the Vehicle Identification Number, or VIN),
and add two mutually exclusive subclasses, sport utility vehicles and
sedans ('SUV', 'SED').

CREATE TABLE Vehicles
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) NOT NULL
   CHECK (vehicle_type IN ('SUV', 'SED')),
 UNIQUE (vin, vehicle_type)
 -- ...
);
Notice the overlapping candidate keys. I then use a compound
candidate key (vin, vehicle_type) and a constraint in each subclass table
to ensure that the vehicle_type is locked and agrees with the Vehicles
table. Add some DRI actions and you are done:
CREATE TABLE SUV
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT 'SUV' NOT NULL
   CHECK (vehicle_type = 'SUV'),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles(vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE
 -- ...
);
CREATE TABLE Sedans
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL
   CHECK (vehicle_type = 'SED'),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles(vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE
 -- ...
);
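The following sketch exercises the root table and the two subclass tables to show what the compound key buys. It assumes SQLite through Python's sqlite3 module (where referential actions must be switched on with a PRAGMA); the VIN value and the helper named accepted are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves DRI off by default
conn.executescript("""
CREATE TABLE Vehicles
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) NOT NULL
   CHECK (vehicle_type IN ('SUV', 'SED')),
 UNIQUE (vin, vehicle_type));

CREATE TABLE SUV
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT 'SUV' NOT NULL
   CHECK (vehicle_type = 'SUV'),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles (vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE);

CREATE TABLE Sedans
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL
   CHECK (vehicle_type = 'SED'),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles (vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE);
""")


def accepted(sql, params=()):
    """Run one statement; True if it succeeds, False if a constraint fires."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


VIN = "VIN00000000000001"  # made-up 17-character identifier

accepted("INSERT INTO Vehicles (vin, vehicle_type) VALUES (?, 'SED')", (VIN,))

# The sedan row agrees with the root table, so it is accepted.
sedan_ok = accepted("INSERT INTO Sedans (vin) VALUES (?)", (VIN,))

# Same VIN as an SUV: (vin, 'SUV') is not in Vehicles, so the FK rejects it.
suv_rejected = not accepted("INSERT INTO SUV (vin) VALUES (?)", (VIN,))

# Lying about the type inside the SUV table trips the CHECK instead.
mislabel_rejected = not accepted(
    "INSERT INTO SUV (vin, vehicle_type) VALUES (?, 'SED')", (VIN,))

# Deleting the root row cascades down to the subclass row.
accepted("DELETE FROM Vehicles WHERE vin = ?", (VIN,))
sedans_left = conn.execute("SELECT COUNT(*) FROM Sedans").fetchone()[0]

print(sedan_ok, suv_rejected, mislabel_rejected, sedans_left)
```

A vehicle can therefore appear in exactly one subclass table, and only in the one that matches the type recorded in Vehicles.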
I can continue to build a hierarchy like this. For example, if I had a
Sedans table that broke down into two-door and four-door sedans, I
could build a schema like this:

CREATE TABLE Sedans
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL
   CHECK (vehicle_type IN ('2DR', '4DR', 'SED')),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles(vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE
 -- ...
);
CREATE TABLE TwoDoor
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT '2DR' NOT NULL
   CHECK (vehicle_type = '2DR'),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Sedans(vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE
 -- ...
);
CREATE TABLE FourDoor
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT '4DR' NOT NULL
   CHECK (vehicle_type = '4DR'),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Sedans(vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE
 -- ...
);
The idea is to build a chain of identifiers and types in a UNIQUE()
constraint that goes up the tree when you use a REFERENCES constraint.
Obviously, you can do variants of this trick to get different class
structures.
If an entity doesn’t have to be exclusively one subtype, you play with
the root of the class hierarchy:
CREATE TABLE Vehicles
(vin CHAR(17) NOT NULL,
 vehicle_type CHAR(3) NOT NULL
   CHECK (vehicle_type IN ('SUV', 'SED')),
 PRIMARY KEY (vin, vehicle_type)
 -- ...
);
Now, start hiding all this stuff in VIEWs immediately and add an
INSTEAD OF trigger to those VIEWs.
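As a rough illustration of that advice, the sketch below puts a VIEW over the sedan subclass and lets an INSTEAD OF trigger do the two inserts the hierarchy requires. It assumes SQLite via Python's sqlite3 module; the view name SedanList, the trigger name, and the VIN are invented for the example and do not come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Vehicles
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) NOT NULL
   CHECK (vehicle_type IN ('SUV', 'SED')),
 UNIQUE (vin, vehicle_type));

CREATE TABLE Sedans
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) DEFAULT 'SED' NOT NULL
   CHECK (vehicle_type = 'SED'),
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles (vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE);

-- Users see only the VIEW (hypothetical name).
CREATE VIEW SedanList AS
SELECT vin FROM Sedans;

-- The trigger fills in the root row and the subclass row together.
CREATE TRIGGER SedanList_insert
INSTEAD OF INSERT ON SedanList
FOR EACH ROW
BEGIN
  INSERT INTO Vehicles (vin, vehicle_type) VALUES (NEW.vin, 'SED');
  INSERT INTO Sedans (vin) VALUES (NEW.vin);
END;
""")

VIN = "VIN00000000000002"  # made-up identifier

with conn:
    conn.execute("INSERT INTO SedanList (vin) VALUES (?)", (VIN,))

vehicles = conn.execute("SELECT vin, vehicle_type FROM Vehicles").fetchall()
sedans = conn.execute("SELECT vin, vehicle_type FROM Sedans").fetchall()
print(vehicles, sedans)
```

The person doing the insert never has to know about the vehicle_type column or the order in which the two base tables must be loaded.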
1.2 Generating Unique Sequential Numbers for Keys
One common vendor extension is using some method of generating a
sequence of integers to use as primary keys. These are very nonrelational
extensions that are highly proprietary, and have major disadvantages.
They all are based on exposing part of the physical state of the machine
during the insertion process, in violation of Dr. E. F. Codd’s rules for
defining a relational database (i.e., rule 8, physical data independence).
Dr. Codd’s rules are discussed in Chapter 2.
Early SQL products were built on existing file systems. The data was
kept in physically contiguous disk pages, in physically contiguous rows,
made up of physically contiguous columns; in short, just like a deck of
punch cards or a magnetic tape. Most of these sequence generators are
an attempt to regain the physical sequence that SQL took out of its
logical model, so we can pretend that we have physically contiguous
storage.
But physically contiguous storage is only one way of building a
relational database, and it is not always the best one. Aside from that, the
whole idea of a relational database is that the user is not supposed to know
how things are stored at all, much less write code that depends on the
particular physical representation in a particular release of a particular
product.
The exact method used to generate sequences of integers varies from
product to product, but the results are all the same: their behavior is
unpredictable.
Another major disadvantage of sequential numbers as keys is that
they have no check digits, so there is no way to determine if they are
valid or not (for a discussion of check digits, see Joe Celko’s Data and
Databases: Concepts in Practice).
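To show what a check digit adds, here is a brief sketch. The text does not name a scheme, so the Luhn algorithm is used purely as an assumption for illustration; the function names are hypothetical.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk the payload from the right; every second digit, starting
    # with the rightmost payload digit, is doubled.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def is_valid(number: str) -> bool:
    """True if the last digit of number is the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


print(luhn_check_digit("7992739871"))  # -> 3
print(is_valid("79927398713"))         # -> True
print(is_valid("79927398710"))         # -> False
```

A bare sequential number such as 42 offers nothing comparable: a mistyped 24 is just as plausible a key as the original.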
So why do people use them? System-generated values are a fast and
easy answer to the problem of obtaining a unique primary key. It
requires no research and no real data modeling. Drug abuse is also a fast
and easy answer to problems. I do not recommend either.
1.2.1 IDENTITY Columns
The Sybase/SQL Server family allows you to declare an exact numeric
column with the property IDENTITY in Sybase and DB2 or
AUTOINCREMENT in SQL Anywhere attached to it. These columns will
autoincrement with every row that is inserted into the table. The
numbering is totally dependent on the order in which the rows were
physically inserted into the table, even if they came into the table as a
single statement (i.e., INSERT INTO Foobar SELECT ...;).
Since this “feature” is highly proprietary, you can get all kinds of
implementations. For example, if the next value to be used causes an
overflow, then you might get a wraparound to negative values. This
occurs with numbers larger than (2^31 - 1) in SQL Anywhere, while
Sybase allows the user to set a NUMERIC(p, 0) column to any desired
size. Some products increment the internal counter before inserting a
row, so a rollback can cause gaps in the sequence. To even consider
this “feature” in production code, you have to know the current release
of your product and never expect your code to port.
Let’s look at the logical problems. First, try to create a table with two
columns and try to make them both IDENTITY columns. If you cannot
declare more than one column to be of a certain data type, then that
thing is not a data type at all, by definition.
Next, create a table with one column and make it an IDENTITY
column. Now try to insert, update, and delete different numbers from it.
If you cannot insert, update, and delete rows from a table, then it is not a
table by definition.
Finally, create a simple table with one IDENTITY column and a few
other columns. Use a few statements such as:
INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

These statements put a few rows into the table. Notice that the
IDENTITY column sequentially numbered them in the order they were
presented. If you delete a row, the gap in the sequence is not filled, and
the sequence continues from the highest number that has ever been used
in that column in that particular table.
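The gap behavior is easy to reproduce. The sketch below uses SQLite's AUTOINCREMENT (through Python's sqlite3 module) as a stand-in for an IDENTITY column; that substitution is an assumption, and other products may differ in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Foobar
(id INTEGER PRIMARY KEY AUTOINCREMENT, -- stand-in for IDENTITY
 a CHAR(2), b CHAR(2), c CHAR(2))
""")

conn.executemany(
    "INSERT INTO Foobar (a, b, c) VALUES (?, ?, ?)",
    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")],
)

# Remove the row that holds the highest number used so far ...
conn.execute("DELETE FROM Foobar WHERE a = 'a3'")

# ... and insert another row: the counter does not go back to 3.
conn.execute("INSERT INTO Foobar (a, b, c) VALUES ('a4', 'b4', 'c4')")

ids = [row[0] for row in conn.execute("SELECT id FROM Foobar ORDER BY id")]
print(ids)  # -> [1, 2, 4]
```

The value 3 is gone for good, which is harmless for a surrogate counter but fatal for anyone who assumed the numbers were dense.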
But now use a statement with a query expression in it, like this:
INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;
Since a query result is a table, and a table is a set that has no ordering,
what should the IDENTITY numbers be? The whole completed set is
presented to Foobar all at once, not a row at a time. There are (n!) ways
to number (n) rows, so which one do you pick? The answer has been to
use whatever the physical order of the result set happened to be. There’s
that nonrelational phrase “physical order” again.
But it is actually worse than that. If the same query is executed again,
but with new statistics, or after an index has been dropped or added, the
new execution plan could bring the result set back in a different physical
order. Can you explain from a logical model why the same rows in the
second query get different IDENTITY numbers? In the relational model,
they should be treated the same if all the values of all the attributes are
identical.
Think about trying to do replication on two databases that differ only
by an index, or by cache size, or by something that occasionally gives
them different execution plans for the same statements.
Want to try to maintain such a system?
1.2.2 ROWID and Physical Disk Addresses

Oracle has the ability to expose the physical address of a row on the hard
drive as a special variable called ROWID. This is the fastest way to locate a
row in a table, since the read-write head is positioned to the row
immediately. This exposure of the underlying physical storage at the
logical level means that Oracle is committed to using contiguous storage
for the rows of a table, which in turn means that Oracle cannot use
hashing, distributed databases, dynamic bit vectors, or any of several
newer techniques for VLDB (Very Large Databases). When the database
is moved or reorganized for any reason, the ROWID is changed.
1.2.3 Sequential Numbering in Pure SQL
The proper way to do this operation is to insert one row at a time with
this Standard SQL statement:
INSERT INTO Foobar (keycol, ...)
VALUES (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + 1, ...);
Notice the use of the COALESCE() function to handle the empty
table and to get the numbering started with one. This approach
generalizes from a row insertion to a table insertion:
INSERT INTO Foobar (keycol, ...)
VALUES (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + 1, ...),
       (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + 2, ...),
       ...
       (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + n, ...);
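A runnable version of the single-row form might look like the following. It is a sketch only: SQLite (via Python's sqlite3 module) is assumed as the engine, the second column a stands in for the columns elided in the statement above, and insert_next is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Foobar
(keycol INTEGER NOT NULL PRIMARY KEY,
 a CHAR(2) NOT NULL) -- placeholder for the elided columns
""")


def insert_next(a):
    """Insert one row, numbering it one past the current maximum key."""
    with conn:  # one transaction per row
        conn.execute(
            "INSERT INTO Foobar (keycol, a) "
            "VALUES (COALESCE((SELECT MAX(keycol) FROM Foobar), 0) + 1, ?)",
            (a,),
        )


for value in ("a1", "a2", "a3"):
    insert_next(value)

rows = conn.execute("SELECT keycol, a FROM Foobar ORDER BY keycol").fetchall()
print(rows)  # -> [(1, 'a1'), (2, 'a2'), (3, 'a3')]
```

The first call finds an empty table, so MAX() returns NULL and COALESCE() starts the numbering at one. With concurrent sessions, the SELECT and the INSERT would have to run under an isolation level that stops two sessions from reading the same maximum.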
Another approach is to put a TRIGGER on the table. Here is the code
for SQL-99 TRIGGERs; actual products may have a slightly different
syntax:
CREATE TRIGGER Autoincrement
BEFORE INSERT ON Foobar
REFERENCING NEW AS N1
FOR EACH ROW
BEGIN
UPDATE N1
   SET keycol = (SELECT COALESCE(MAX(F1.keycol), 0) + 1
                   FROM Foobar AS F1);
COMMIT; -- put each row into the table as it is processed
END;
Notice the use of the COALESCE() function to handle the first row
inserted into an empty table.
Umachandar Jayachandran (www.umachandar.com) suggested the
following method for generating unique identifiers in SQL. His original
note was for SQL Server, but it can be generalized to any product with a
random number function. The idea is to first split the counters into
several distinct ranges:
CREATE TABLE Counters
(id_nbr_set INTEGER NOT NULL PRIMARY KEY,
 low_val INTEGER NOT NULL,
 high_val INTEGER NOT NULL,
 CHECK (low_val < high_val), -- properly ordered
 CHECK (NOT EXISTS -- no overlaps
        (SELECT *
           FROM Counters AS C1
          WHERE C1.id_nbr_set <> Counters.id_nbr_set
            AND (Counters.low_val BETWEEN C1.low_val AND C1.high_val
                 OR Counters.high_val BETWEEN C1.low_val AND C1.high_val)))
);
INSERT INTO Counters VALUES (0, 0000000, 0999999);
INSERT INTO Counters VALUES (1, 1000000, 1999999);
INSERT INTO Counters VALUES (2, 2000000, 2999999);
...
INSERT INTO Counters VALUES (9, 9000000, 9999999);
-- and so on
The ranges can be any size you wish. However, uniform sizes have the
advantage of matching the uniform random number generator we will be
using in the code. The important thing is that the ranges should not
overlap each other. Here is a skeleton procedure body:
CREATE PROCEDURE GenerateCounters()
LANGUAGE SQL
IF (SELECT SUM(high_val - low_val) FROM Counters) > 0
THEN BEGIN
  DECLARE new_id_nbr INTEGER;
  DECLARE random_set INTEGER;
  SET new_id_nbr = NULL;
  WHILE (new_id_nbr IS NULL)
  DO -- This will randomly pick one row; the sets are numbered 0 to 9,
     -- and RAND() is assumed to return a value in [0, 1)
     SET random_set = FLOOR(RAND() * 10);
     SET new_id_nbr
         = (SELECT low_val
              FROM Counters
             WHERE id_nbr_set = random_set
               AND low_val < high_val);
     UPDATE Counters
        SET low_val = low_val + 1
      WHERE id_nbr_set = random_set
        AND low_val < high_val;
  END WHILE;
  -- code to create a check digit can go here
END;
ELSE BEGIN -- you are out of numbers
  -- You can reset the Counters table with an UPDATE.
  -- If you take no action, the new id number will be NULL
END;
END IF;
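The same logic can be tried out in a few lines of host-language code. The sketch below is an approximation under stated assumptions: SQLite via Python's sqlite3 module, three tiny ranges instead of ten large ones, the ordering CHECK relaxed to <= so that an exhausted range can be stored, the overlap CHECK left out, and a hypothetical function name generate_id.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Counters
(id_nbr_set INTEGER NOT NULL PRIMARY KEY,
 low_val INTEGER NOT NULL,
 high_val INTEGER NOT NULL,
 CHECK (low_val <= high_val)) -- relaxed so a used-up range is legal
""")
conn.executemany(
    "INSERT INTO Counters VALUES (?, ?, ?)",
    [(0, 0, 5), (1, 10, 15), (2, 20, 25)],  # small made-up ranges
)
conn.commit()


def generate_id(rng):
    """Return a fresh id from a randomly chosen range, or None if all are used."""
    remaining = conn.execute(
        "SELECT SUM(high_val - low_val) FROM Counters").fetchone()[0]
    if not remaining:
        return None  # out of numbers
    set_count = conn.execute("SELECT COUNT(*) FROM Counters").fetchone()[0]
    while True:
        # Sets are assumed to be numbered 0 .. set_count - 1.
        random_set = rng.randrange(set_count)
        row = conn.execute(
            "SELECT low_val FROM Counters "
            "WHERE id_nbr_set = ? AND low_val < high_val",
            (random_set,),
        ).fetchone()
        if row is not None:
            with conn:
                conn.execute(
                    "UPDATE Counters SET low_val = low_val + 1 "
                    "WHERE id_nbr_set = ?",
                    (random_set,),
                )
            return row[0]


rng = random.Random(42)
ids = [generate_id(rng) for _ in range(15)]
exhausted = generate_id(rng)
print(len(set(ids)), exhausted)  # -> 15 None
```

Because the ranges never overlap and each low_val only moves forward, no identifier can be handed out twice, whichever set the random number generator picks.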
1.2.4 GUIDs
Globally Unique Identifiers (GUIDs) are unique exposed physical locators
generated by a combination of UTC time and the network address of the
device creating them. Microsoft says that they should be unique for about a
century. According to Wikipedia (
“The algorithm used for generating new GUIDs has been
widely criticized. At one point, the user’s network card MAC
address was used as a base for several GUID digits, which
meant that, e.g., a document could be tracked back to the
computer that created it. After this was discovered, Microsoft
changed the algorithm so that it no longer contains the MAC
address. This privacy hole was used when locating the creator
of the Melissa worm.”
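The traceability problem described in the quotation can be seen with Python's standard uuid module, used here only as an illustration of the two styles of identifier and not as Microsoft's implementation. The node value is a made-up network address.

```python
import uuid

FAKE_MAC = 0x001122334455  # hypothetical 48-bit network address

# Version 1: built from a timestamp plus the node (network) address.
time_based = uuid.uuid1(node=FAKE_MAC, clock_seq=0)

# Version 4: built from random bits, with no address in it.
random_based = uuid.uuid4()

# The last twelve hex digits of a version 1 value are the node address.
embedded_mac = str(time_based)[-12:]

print(embedded_mac)                              # -> 001122334455
print(time_based.version, random_based.version)  # -> 1 4
```

Anyone holding a time-based value can read the creating machine's address straight out of it, which is the exposure the quotation refers to.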
