Joe Celko's SQL for Smarties: Advanced SQL Programming


CHAPTER 2: NORMALIZATION
Given:
1) (day, hour, gate) -> pilot
2) (day, hour, pilot) -> flight
prove that:
(day, hour, gate) -> flight.
3) (day, hour) -> (day, hour); Reflexive
4) (day, hour, gate) -> (day, hour); Augmentation on 3
5) (day, hour, gate) -> (day, hour, pilot); Union 1 & 4
6) (day, hour, gate) -> flight; Transitive 2 and 5
Q.E.D.
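This first derivation can also be checked mechanically: an FD X -> Y follows from a set of FDs exactly when every attribute of Y lands in the closure of X under that set. Here is a minimal sketch in Python (the helper function is mine, not part of the book's FD calculus):

```python
def closure(attrs, fds):
    """Closure of a set of attributes under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left-hand side is determined, so is the right.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# The two given FDs of the proof above.
fds = [
    (frozenset({"day", "hour", "gate"}), frozenset({"pilot"})),
    (frozenset({"day", "hour", "pilot"}), frozenset({"flight"})),
]

# (day, hour, gate) -> flight holds, as the proof concluded:
print("flight" in closure({"day", "hour", "gate"}, fds))  # True
```

The closure loop quietly applies reflexivity, augmentation, union, and transitivity all at once, which is why it agrees with the step-by-step proof.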
The answer is to start by attempting to derive each of the FDs from
the rest of the set. What we get is several short proofs, each requiring
different “given” FDs in order to get to the derived FD.
Here is a list of each of the proofs used to derive the ten fragmented
FDs in the problem. With each derivation, we include every derivation
step and the legal FD calculus operation that allows us to make that step.
An additional operation that we include here, which was not included in
the axioms we listed earlier, is left reduction. Left reduction says that if
XX → Y then X → Y. The reason it was not included is that this is actually
a theorem, and not one of the basic axioms (a side problem: can you
derive left reduction?).
Prove: (day, hour, pilot) -> gate
a) day -> day; Reflexive
b) (day, hour, pilot) -> day; Augmentation (a)
c) (day, hour, pilot) -> (day, flight); Union (6, b)
d) (day, hour, pilot) -> gate; Transitive (c, 3)
Q.E.D.
Prove: (day, hour, gate) -> pilot
a) day -> day; Reflexive
b) (day, hour, gate) -> day; Augmentation (a)
c) (day, hour, gate) -> (day, flight); Union (9, b)
d) (day, hour, gate) -> pilot; Transitive (c, 4)
Q.E.D.
Prove: (day, flight) -> gate
a) (day, flight, pilot) -> gate; Pseudotransitivity (2, 5)
b) (day, flight, day, flight) -> gate; Pseudotransitivity (a, 4)
c) (day, flight) -> gate; Left reduction (b)
Q.E.D.
Prove: (day, flight) -> pilot
a) (day, flight, gate) -> pilot; Pseudotransitivity (2, 8)
b) (day, flight, day, flight) -> pilot; Pseudotransitivity (a, 3)
c) (day, flight) -> pilot; Left reduction (b)
Q.E.D.
Prove: (day, hour, gate) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, gate) -> (day, hour); Augmentation (a)
c) (day, hour, gate) -> (day, hour, pilot); Union (b, 8)
d) (day, hour, gate) -> flight; Transitivity (c, 6)
Q.E.D.
Prove: (day, hour, pilot) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, pilot) -> (day, hour); Augmentation (a)
c) (day, hour, pilot) -> (day, hour, gate); Union (b, 5)
d) (day, hour, pilot) -> flight; Transitivity (c, 9)
Q.E.D.
Prove: (day, hour, gate) -> destination
a) (day, hour, gate) -> destination; Transitivity (9, 1)
Q.E.D.

Prove: (day, hour, pilot) -> destination
a) (day, hour, pilot) -> destination; Transitivity (6, 1)
Q.E.D.
Now that we’ve shown you how to derive eight of the ten FDs from
other FDs, you can try mixing and matching the FDs into sets so that
each set meets the following criteria:
1. Each attribute must be represented on either the left or right
side of at least one FD in the set.
2. If a given FD is included in the set, then all the FDs needed to
derive it cannot also be included.
3. If a given FD is excluded from the set, then the FDs used to
derive it must be included.
This produces a set of “nonredundant covers,” which can be found
through trial and error and common sense. For example, if we exclude
(day, hour, gate) → flight, we must then include (day, hour, gate) →
pilot, and vice versa, because each is used in the other’s derivation. If
you want to be sure your search was exhaustive, however, you may want
to apply a more mechanical method, which is what the CASE tools do
for you.
The algorithm for accomplishing this task is basically to generate all
the combinations of sets of the FDs. (flight → destination) and (flight →
hour) are excluded in the combination generation because they cannot
be derived. This gives us (2^8), or 256, combinations of FDs. Each
combination is then tested against the criteria.
Fortunately, a simple spreadsheet does all the tedious work. In this
problem, the first criterion eliminates only 15 sets. Then the second
criterion eliminates 152 sets, and the third criterion drops another 67.
This leaves us with 22 possible covers, 5 of which are the answers we are
looking for (we will explain the other 17 later).
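That first elimination step is easy to reproduce. Since flight -> destination and flight -> hour go into every candidate set, the attributes flight, hour, and destination are always represented, so criterion 1 really only tests day, gate, and pilot. A sketch of the brute-force generation (the FD encoding is mine, not the book's):

```python
from itertools import combinations

# The eight derivable FDs, as (determinant, dependent) pairs.
# flight -> destination and flight -> hour are in every candidate
# set, so flight, hour, and destination are always represented.
FDS = [
    ({"day", "flight"}, {"gate"}),
    ({"day", "flight"}, {"pilot"}),
    ({"day", "hour", "gate"}, {"flight"}),
    ({"day", "hour", "gate"}, {"pilot"}),
    ({"day", "hour", "pilot"}, {"flight"}),
    ({"day", "hour", "pilot"}, {"gate"}),
    ({"day", "hour", "gate"}, {"destination"}),
    ({"day", "hour", "pilot"}, {"destination"}),
]

ALL_ATTRS = {"day", "hour", "gate", "pilot", "flight", "destination"}

def passes_criterion_1(subset):
    """Every attribute must appear in some FD of the candidate set."""
    covered = {"flight", "hour", "destination"}  # from the two fixed FDs
    for lhs, rhs in subset:
        covered |= lhs | rhs
    return covered == ALL_ATTRS

subsets = [c for k in range(len(FDS) + 1) for c in combinations(FDS, k)]
survivors = [s for s in subsets if passes_criterion_1(s)]
print(len(subsets), len(subsets) - len(survivors))  # 256 15
```

Criteria 2 and 3 would be applied to the survivors in the same way, using whatever derivability test you trust; this sketch only mechanizes the first pass.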

These five nonredundant covers are:
Set I:
flight -> destination
flight -> hour
(day, hour, gate) -> flight
(day, hour, gate) -> pilot
(day, hour, pilot) -> gate
Set II:
flight -> destination
flight -> hour
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight
(day, hour, pilot) -> gate
Set III:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, flight) -> pilot
(day, hour, gate) -> flight
Set IV:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight
Set V:
flight -> destination
flight -> hour
(day, flight) -> pilot
(day, hour, gate) -> flight
(day, hour, pilot) -> gate
At this point, we perform unions on FDs with the same left-hand side
and make tables for each grouping with the left-hand side as a key. We
can also eliminate symmetrical FDs (defined as X → Y and Y → X, and
written with a two-headed arrow, X ↔ Y) by collapsing them into the
same table.
These possible schemas are at least in 3NF. They are given in
shorthand SQL DDL (Data Definition Language), without data type
declarations.
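The union-and-group step is mechanical enough to sketch. Grouping Set I's FDs by left-hand side gives the raw table shapes before any symmetric keys are collapsed (the code below is illustrative, not from the book):

```python
from collections import defaultdict

# Set I, as (left-hand side, dependent attribute) pairs.
set_i = [
    (("flight",), "destination"),
    (("flight",), "hour"),
    (("day", "hour", "gate"), "flight"),
    (("day", "hour", "gate"), "pilot"),
    (("day", "hour", "pilot"), "gate"),
]

# Union FDs with the same left-hand side into one grouping.
tables = defaultdict(list)
for lhs, rhs in set_i:
    tables[lhs].append(rhs)

# Each grouping becomes a table with its left-hand side as the key.
for i, (key, dependents) in enumerate(tables.items(), start=1):
    columns = ", ".join(key + tuple(dependents))
    print(f"CREATE TABLE R{i} ({columns}, PRIMARY KEY ({', '.join(key)}));")
```

Collapsing the last two groupings, whose keys determine each other, yields the two tables of Solution 1 below.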
Solution 1:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot,
PRIMARY KEY (day, hour, gate),
UNIQUE (day, hour, pilot),
UNIQUE (day, flight),
UNIQUE (flight, hour));
Solution 2:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, pilot,
PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight,
PRIMARY KEY (day, hour, gate),
UNIQUE (day, flight),
UNIQUE (flight, hour));
CREATE TABLE R4 (day, hour, pilot, flight,
PRIMARY KEY (day, hour, pilot));
Solution 3:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, pilot,
PRIMARY KEY (day, hour, gate),
UNIQUE (day, hour, pilot));
CREATE TABLE R4 (day, hour, pilot, flight,
PRIMARY KEY (day, hour, pilot),
UNIQUE(day, flight),
UNIQUE (flight, hour));
Solution 4:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, pilot, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight,
PRIMARY KEY (day, hour, gate),
UNIQUE (flight, hour));
CREATE TABLE R4 (day, hour, pilot, gate,
PRIMARY KEY (day, hour, pilot));
These solutions are a mess, but they are a 3NF mess! Is there a better
answer? Here is one in BCNF and only two tables, proposed by Chris
Date (Date 1995, p. 224).
CREATE TABLE DailySchedules (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE PilotSchedules (day, flight, gate, pilot,
PRIMARY KEY (day, flight));
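Filled out with illustrative data types (the types, lengths, and sample rows below are my assumptions; the book's shorthand omits them), the two-table schema can be exercised in any SQL product. A sketch using SQLite:

```python
import sqlite3

# Date's two-table BCNF schema, with assumed (illustrative) data types.
ddl = """
CREATE TABLE DailySchedules (
    flight      CHAR(6)  NOT NULL PRIMARY KEY,
    destination CHAR(3)  NOT NULL,
    hour        INTEGER  NOT NULL
);
CREATE TABLE PilotSchedules (
    day    INTEGER  NOT NULL,
    flight CHAR(6)  NOT NULL REFERENCES DailySchedules (flight),
    gate   CHAR(3)  NOT NULL,
    pilot  CHAR(20) NOT NULL,
    PRIMARY KEY (day, flight)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
conn.execute("INSERT INTO DailySchedules VALUES ('CE101', 'ORD', 9)")
conn.execute("INSERT INTO PilotSchedules VALUES (1, 'CE101', 'A1', 'Higgins')")

# The key (day, flight) rejects a second row for the same day and flight:
try:
    conn.execute("INSERT INTO PilotSchedules VALUES (1, 'CE101', 'B2', 'Jones')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The flight and pilot names here are invented; the point is only that the two keys do all the work.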
This is a workable schema, but we could expand the constraints to give
us better performance and more precise error messages, since schedules
are not likely to change:
CREATE TABLE DailySchedules
(flight, hour, destination,
UNIQUE (flight, hour, destination),
UNIQUE (flight, hour),
UNIQUE (flight));

CREATE TABLE PilotSchedules
(day, flight, hour, gate, pilot,
UNIQUE (day, flight, gate),
UNIQUE (day, flight, pilot),
UNIQUE (day, flight),
FOREIGN KEY (flight, hour) REFERENCES DailySchedules (flight, hour));
2.10 Practical Hints for Normalization
CASE tools implement formal methods for doing normalization. In
particular, E-R (entity-relationship) diagrams are very useful for this
process. However, a few informal hints can help speed up the process
and give you a good start.
Broadly speaking, tables represent either entities or relationships,
which is why E-R diagrams work so well as a design tool. Tables that
represent entities should have a simple, immediate name suggested by
their contents—a table named Students has student data in it, not
student data and bowling scores. It is also a good idea to use plural or
collective nouns as the names of such tables to remind you that a table is
a set of entities; the rows are the single instances of them.
Tables that represent many-to-many relationships should be named
by their contents, and should be as minimal as possible. For example,
Students are related to Classes by a third (relationship) table for their
attendance. These tables might represent a pure relationship, or they
might contain attributes that exist within the relationship, such as a
grade for the class attended. Since the only way to get a grade is to attend
the class, the relationship is going to have a column for it, and will be
named “ReportCards,” “Grades” or something similar. Avoid naming
entities based on many-to-many relationships by combining the two
table names. For example, Student_Course is a bad name for the
Enrollment entity.
Avoid NULLs whenever possible. If a table has too many NULL-able
columns, it is probably not normalized properly. Try to use a NULL
only for a value that is missing now, but which will be resolved
later. Even better, you can put missing values into the encoding
schemes for that column, as discussed in Section 5.2 of SQL
Programming Style (ISBN 0-12-088797-5) on encoding schemes.
A normalized database will tend to have a lot of tables with a small
number of columns per table. Don’t panic when you see that happen.
People who first worked with file systems (particularly on computers
that used magnetic tape) tend to design one monster file for an
application and do all the work against those records. This made sense
in the old days, since there was no reasonable way to
JOIN a number of
small files together without having the computer operator mount and
dismount lots of different tapes. The habit of designing this way carried
over to disk systems, since the procedural programming languages were
still the same.
The same nonkey attribute in more than one table is probably a
normalization problem. This is not a certainty, just a guideline. The key
that determines that attribute should be in only one table, and
therefore the attribute should be with it.
As a practical matter, you are apt to see the same attribute under
different names, and you will need to make the names uniform
throughout the entire database. The columns date_of_birth, birthdate,
birthday, and dob are very likely the same attribute for an employee.
2.11 Key Types
The logical and physical keys for a table can be classified by their
behavior and their source. Table 2.1 is a quick table of my classification
system.
Table 2.1 Classification System for Key Types
                                 Natural  Artificial  Exposed   Surrogate
                                                      Physical
                                                      Locator
--------------------------------------------------------------------------
Constructed from attributes
in the reality of the data model    Y         N          N          Y

Verifiable in reality               Y         N          N          N

Verifiable in itself                Y         Y          N          N

Visible to the user                 Y         Y          Y          N
2.11.1 Natural Keys
A natural key is a subset of attributes that occur in a table and act as a
unique identifier. The user sees them. You can go to external reality and
verify them. Examples of natural keys include the UPC codes on consumer
goods (read the package barcode) and coordinates (get a GPS).
Newbies worry about a natural compound key becoming very long.

My answer is, “So what?” This is the 21st century; we have much better
computers than we did in the 1950s, when key size was a real physical
issue. To replace a natural two- or three-integer compound key with a
huge GUID that no human being or other system can possibly
understand, because they think it will be faster, only cripples the system
and makes it more prone to errors. I know how to verify the (longitude,
latitude) pair of a location; how do you verify the GUID assigned to it?
A long key is not always a bad thing for performance. For example, if
I use (city, state) as my key, I get a free index on just (city) in many
systems. I can also add extra columns to the key to make it a super-key,
when such a super-key gives me a covering index (i.e., an index that
contains all of the columns required for a query, so that the base table
does not have to be accessed at all).
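The (city, state) point can be seen in any product whose keys are backed by B-tree indexes. Here is a sketch in SQLite (the table and population figures are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Cities (
    city       VARCHAR(30) NOT NULL,
    state      CHAR(2)     NOT NULL,
    population INTEGER,
    PRIMARY KEY (city, state))""")
conn.executemany("INSERT INTO Cities VALUES (?, ?, ?)",
                 [("Springfield", "IL", 114000),
                  ("Springfield", "MA", 155000),
                  ("Austin", "TX", 961000)])

# A search on just (city) can use the leading column of the
# (city, state) key's index, so no separate index on city is needed;
# state is in the same index, making it a covering index here.
plan = conn.execute("EXPLAIN QUERY PLAN "
                    "SELECT state FROM Cities WHERE city = 'Springfield'"
                    ).fetchall()
print(plan[0][-1])  # e.g. SEARCH ... USING COVERING INDEX ... (city=?)
```

The exact plan text varies by product and release; the stable fact is that the leading column of a compound key is usable on its own.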
2.11.2 Artificial Keys
An artificial key is an extra attribute added to the table that is seen by the
user. It does not exist in the external reality, but can be verified for
syntax or check digits inside itself. One example of an artificial key is the
open codes in the UPC/EAN scheme that a user can assign to his own
stuff. The check digits still work, but you have to verify them inside your
own enterprise.
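As a sketch of what that in-house verification looks like, here is the standard EAN-13 check digit rule in Python (the sample value is a commonly cited test code, not from the book):

```python
def ean13_check_digit(first12: str) -> int:
    """EAN-13: weight the first 12 digits 1, 3, 1, 3, ... and take
    the amount needed to reach the next multiple of 10."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(first12))
    return (10 - total % 10) % 10

def valid_ean13(code: str) -> bool:
    """True if code is 13 digits and its last digit checks out."""
    return (len(code) == 13 and code.isdigit()
            and ean13_check_digit(code[:12]) == int(code[12]))

print(valid_ean13("4006381333931"))  # True: a commonly cited example
print(valid_ean13("4006381333932"))  # False: one digit off
```

An artificial key issued in the open range is verified the same way; the difference is that only your own enterprise knows which numbers it has actually assigned.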
Experienced database designers tend toward keys they find in
industry standard codes, such as UPC/EAN, VIN, GTIN, ISBN, etc. They
know that they need to verify the data against the reality they are
modeling. A trusted external source is a good thing to have. I know why
this VIN is associated with this car, but why is an auto-number value of
42 associated with this car? Try to verify the relationship in the reality
you are modeling. It makes as much sense as locating a car by its parking
space number.
2.11.3 Exposed Physical Locators
An exposed physical locator is not based on attributes in the data model
and is exposed to the user. There is no way to predict it or verify it. The
system obtains a value through some physical process totally unrelated
to the logical data model. The user cannot change the locators without
destroying the relationships among the data elements.
Examples of exposed physical locators would be physical row
locations encoded as a number, string or proprietary data type. If
hashing tables were accessible in an SQL product they would qualify, but
they are usually hidden from the user.
Many programmers object to putting IDENTITY and other auto-numbering
devices into this category. To convert the number into a
physical location requires a search rather than a hashing table lookup or
positioning a read/writer head on a disk drive, but the concept is the
same. The hardware gives you a way to go to a physical location that has
nothing to do with the logical data model, and that cannot be changed in
the physical database or verified externally.
Most of the time, exposed physical locators are used for faking a
sequential file’s positional record number, so I can reference the physical
storage location—a 1960s ISAM file in SQL. You lose all the advantages
of an abstract data model and SQL set-oriented programming, because
you carry extra data and destroy the portability of code.
The early SQLs were based on preexisting file systems. The data was
kept in physically contiguous disk pages, in physically contiguous rows,
made up of physically contiguous columns—in short, just like a deck of
punch cards or a magnetic tape. Most programmers still carry that
mental model, which is why I keep ranting about file versus table, row
versus record and column versus field.
But physically contiguous storage is only one way of building a
relational database—and it is not the best one. The basic idea of a
relational database is that the user is not supposed to know how or
where things are stored at all, much less write code that depends on the
particular physical representation in a particular release of a particular
product on particular hardware at a particular time. This is discussed
further in Section 1.2.1, “IDENTITY Columns.”
Finally, an appeal to authority, with a quote from Dr. Codd:
“Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed to
them. . .”
This means that a surrogate ought to act like an index: created by the
user, managed by the system, and never seen by a user. That means never
used in code, DRI, or anything else that a user writes.
Codd also wrote the following:
“There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.
1. The actual values of user-controlled keys are determined by
users and must therefore be subject to change by them (e.g., if
two companies merge, the two employee databases might be
combined, with the result that some or all of the serial numbers
might be changed).
2. Two relations may have user-controlled keys defined on
distinct domains (e.g., one uses Social Security numbers, while
the other uses employee serial numbers), and yet the entities
denoted are the same.
3. It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value, or after
it has ceased to have one (e.g., an applicant for a job and a
retiree).”
These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on
common entities. One solution—proposed in Chapter 4 and more fully
in Chapter 14—is to introduce entity domains, which contain system-
assigned surrogates. “Database users may cause the system to generate or
delete a surrogate, but they have no control over its value, nor is its value
ever displayed to them. . .” (Codd 1979).
2.11.4 Practical Hints for Denormalization
The subject of denormalization is a great way to get into religious wars.
At one extreme, you will find relational purists who think that the idea of
not carrying a database design to at least 3NF is a crime against nature.
At the other extreme, you will find people who simply add and move
columns all over the database with
ALTER statements, never keeping the
schema stable.
The reason given for denormalization is performance. A fully
normalized database requires a lot of
JOINs to construct common VIEWs
of data from its components.
JOINs used to be very costly in terms of
time and computer resources, so “preconstructing” the
JOIN in a
denormalized table can save quite a bit.

Today, only data warehouses should be denormalized—never a
production OLTP system.
