The choice of primary key is largely a matter of convenience and what is easiest to use. We’ll
discuss primary keys later in this chapter in the context of relationships. The important thing to
remember is that when you have values that should exist only once in the database, you need to
protect against duplicates.
Choosing Keys
While keys can consist of any number of columns, it is best to try to limit the number of columns in
a key as much as possible. For example, you may have a Book table with the columns
Publisher_Name, Publisher_City, ISBN_Number, Book_Name, and Edition. From these attributes, the
following three keys might be defined:
• Publisher_Name, Book_Name, Edition: A publisher will likely publish more than one book.
Also, it is safe to assume that book names are not unique across all books. However, it is
probably true that the same publisher will not publish two books with the same title and the
same edition (at least, we assume that this is true!).
• ISBN_Number: The ISBN number is the unique identification number assigned to a book when
it is published.
• Publisher_City, ISBN_Number: Because ISBN_Number is unique, it follows that Publisher_City
and ISBN_Number combined is also unique.
The choice of (Publisher_Name, Book_Name, Edition) as a composite candidate key seems valid, but the
(Publisher_City, ISBN_Number) key requires more thought. The implication of this key is that in
every city, ISBN_Number can be used again, a conclusion that is obviously not appropriate. This is a
common problem with composite keys, which are often not thought out properly. In this case, you
might choose ISBN_Number as the PK and (Publisher_Name, Book_Name, Edition) as the AK.
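As a rough illustration of that choice, the following sketch declares ISBN_Number as the primary
key and the publisher, title, and edition columns as an alternate key (following the first
candidate key listed above). It uses Python's built-in sqlite3 module purely as a convenient
stand-in for SQL Server; the column types, the sample rows, and the try_insert helper are
assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Book (
        ISBN_Number    TEXT    NOT NULL PRIMARY KEY,    -- the PK
        Publisher_Name TEXT    NOT NULL,
        Publisher_City TEXT    NOT NULL,
        Book_Name      TEXT    NOT NULL,
        Edition        INTEGER NOT NULL,
        UNIQUE (Publisher_Name, Book_Name, Edition)     -- the AK
    )
""")

def try_insert(row):
    """Hypothetical helper: insert a row, return False if a key is violated."""
    try:
        conn.execute("INSERT INTO Book VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

first = try_insert(("111", "Acme Press", "Nashville", "Keys 101", 1))
# Same ISBN with different details: rejected by the primary key.
dup_isbn = try_insert(("111", "Other House", "Boston", "Other Title", 1))
# New ISBN, but same publisher, title, and edition: rejected by the alternate key.
dup_ak = try_insert(("222", "Acme Press", "Nashville", "Keys 101", 1))
# A second edition of the same title is a different key value, so it is accepted.
second_edition = try_insert(("333", "Acme Press", "Nashville", "Keys 101", 2))
```

Note that no key is declared on (Publisher_City, ISBN_Number): it would add nothing that the
primary key does not already guarantee.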
■ Note It is important not to confuse unique indexes with keys. There may be valid performance-based
reasons to implement the (Publisher_City, ISBN_Number) index in your SQL Server database. However,
this would not be identified as a key of a table. In Chapter 6, we’ll discuss implementing keys, and
in Chapter 8, we’ll cover implementing indexes for data access enhancement.
Having established what keys are, we’ll next discuss the two main types of keys: natural keys
(including smart keys) and surrogate keys.
Natural Keys
Wikipedia defines the term natural key as “a candidate key that has a
logical relationship to the attributes within that row” (at least it did when this chapter was written).
In other words, it is a “real” attribute of an entity that the user logically uses to uniquely identify
each instance of an entity. All of our candidate keys from the previous examples have been natural
keys: employee number, Social Security number (SSN), ISBN, and the (Publisher_Name, Book_Name,
Edition) composite key.
Some common examples of good natural keys are as follows:
• For people: Driver’s license numbers (including the state of issue), company identification
number, or other assigned IDs (e.g., customer numbers or employee numbers).
• For transactional documents (e.g., invoices, bills, and computer-generated notices): These usually
have some sort of number assigned when they are printed.
• For products for sale: These could be product numbers (product names are likely not unique).
CHAPTER 1
■
INTRODUCTION TO DATABASE CONCEPTS18
8662Ch01.qxp 7/28/08 3:37 PM Page 18
• For companies that clients deal with: These are commonly assigned a customer/client number
for tracking.
• For buildings: This is usually the complete address, including the postal code.
• For mail: These could be the addressee’s name and address and the date the item was sent.
Be careful when choosing a natural key. Ideally, you are looking for something that is stable,
that you can control, and that is definitely going to allow you to uniquely identify every row in your
database.
One thing of interest here is that what might be considered a natural key in your database is
often not actually a natural key in the place where it is defined, for example, the driver’s license
number of a person. In the example database, this is a number that every person has (or may need
before inclusion in our database, perhaps). However, the value of the driver’s license number is just
a series of integers. This number did not actually occur in nature tattooed on the back of the per-
son’s neck at birth. In the database where that number was created, it was actually more of a
surrogate key (which we will define in a later section).
Given that three-part names are common in the United States, it is usually relatively rare that
you’ll have two people working in the same company or attending the same school who have the
same three names. (Of course, if you work in a company with 200,000 people, the odds will go up
that you will have duplicates.) If you include prefixes and suffixes, it is a bit less likely, but “rare” or
even “extremely rare” cannot be implemented in a manner that makes it a safe key. If you happen to
hire two people called Sir Lester James Fredingston III, then the second of them probably isn’t going
to take kindly to being called Les for short just so your database system isn’t compromised.
One notable profession where names must be unique is acting. No two actors who have their
union cards can have the same name. Some change their names from Archibald Leach to some-
thing more pleasant like Cary Grant, but in some cases the person wants to keep his or her name, so
in the actors database they add a uniquifier to the name to make it unique.
A uniquifier might be some meaningless value added to a column or set of columns to give you
a unique key. For example, five people (up from four, last edition) are listed on the Internet Movie
Database site with the name Gary Grant (not Cary, but Gary). Each has a different number
associated with his name to make him a unique Gary Grant. (Of course, none of
these people has hit the big time, but watch out—it could be happening soon!)
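To make the idea of a uniquifier concrete, here is a minimal sketch in which each name is paired
with a meaningless number so that the pair is unique. The Actor table, its column names, and the
add_actor helper are hypothetical, and SQLite (through Python's sqlite3 module) is used only so
the example can be run as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Actor (
        Name       TEXT    NOT NULL,
        Uniquifier INTEGER NOT NULL,   -- meaningless number, unique per name
        UNIQUE (Name, Uniquifier)
    )
""")

def add_actor(name):
    """Hypothetical helper: store an actor and return the uniquifier assigned."""
    (highest,) = conn.execute(
        "SELECT COALESCE(MAX(Uniquifier), 0) FROM Actor WHERE Name = ?", (name,)
    ).fetchone()
    uniquifier = highest + 1
    conn.execute("INSERT INTO Actor VALUES (?, ?)", (name, uniquifier))
    return uniquifier

first_gary = add_actor("Gary Grant")    # first Gary Grant gets 1
second_gary = add_actor("Gary Grant")   # same name, next number
only_cary = add_actor("Cary Grant")     # numbering restarts for a new name
```

The number carries no information about the person; it exists only so that (Name, Uniquifier)
can serve as a key.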
■ Tip We tend to think of names in most systems as a kind of semiunique natural key. This isn’t good enough for
identifying a single row, but it’s great for a human to find a value. The phone book is a good example of this. Say
you need to find Ray Janakowski in the phone book. There might be more than one person with this name, but it
might be a “good enough” way to look up a person’s phone number. This semiuniqueness is a very interesting
attribute of a table and should be documented for later use, but only in rare cases would you use the semiunique
values and make a key from them using a uniquifier.
Smart Keys
A commonly occurring type of natural key in computer systems is a smart or intelligent key. Some
identifiers will have additional information embedded in them, often as an easy way to build a
unique value for helping a human identify some real-world thing. In most cases, the smart key can
be disassembled into its parts. In some cases, however, the data will probably not jump out at you.
Take the following example of the fictitious product serial number XJV102329392000123:
• X: Type of product (LCD television)
• JV: Subtype of product (32-inch console)
• 1023: Lot that the product was produced in (the 1023rd batch produced)
• 293: Day of year
• 9: Last digit of year
• 2: Color
• 000123: Order of production
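The breakdown above can be sketched as a small parser. This is only an illustration: the field
widths (1, 2, 4, 3, 1, 1, and 6 characters) are inferred from the single example value, and the
SerialParts type and parse_serial function are hypothetical names.

```python
from typing import NamedTuple

class SerialParts(NamedTuple):
    product_type: str      # 1 character, e.g. "X"
    subtype: str           # 2 characters, e.g. "JV"
    lot: int               # 4 digits
    day_of_year: int       # 3 digits
    year_digit: int        # 1 digit
    color: int             # 1 digit
    production_order: int  # 6 digits

def parse_serial(serial: str) -> SerialParts:
    """Split a serial number into its parts using fixed character positions."""
    if len(serial) != 18:
        raise ValueError(f"expected 18 characters, got {len(serial)}")
    return SerialParts(
        product_type=serial[0:1],
        subtype=serial[1:3],
        lot=int(serial[3:7]),
        day_of_year=int(serial[7:10]),
        year_digit=int(serial[10:11]),
        color=int(serial[11:12]),
        production_order=int(serial[12:18]),
    )

parts = parse_serial("XJV102329392000123")
```

Every piece of code that needs the lot or the color has to repeat this kind of slicing, which is
one reason to keep each part in a column of its own.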
The simple-to-use smart key values serve an important purpose to the end user, in that the
technician who received the product can decipher the value and see that in fact this product was
built in a lot that contained defective whatchamajiggers, and he needs to replace it. The essential
thing for us during the logical design phase is to find all the bits of information that make up the
smart keys because each of these values is likely going to need to be stored in its own column.
Smart keys, while useful in some cases, often present the database implementor with problems
that will occur over time. When at all possible, instead of implementing a single column with all of
these values, consider having multiple column values for each of the different pieces of information
and calculating the value of the smart key. The end user gets what they need, and you in turn get
what you need, a column value that never needs to be broken down into parts to work with.
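One way to picture this is to store each part in its own column and derive the smart key when it
is needed. The sketch below does that with a view in SQLite (via Python's sqlite3 module), which
stands in for whatever mechanism your platform offers; the table, view, and column names and the
zero-padding widths are assumptions based on the sample serial number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (
        ProductType     TEXT    NOT NULL,
        Subtype         TEXT    NOT NULL,
        Lot             INTEGER NOT NULL,
        DayOfYear       INTEGER NOT NULL,
        YearDigit       INTEGER NOT NULL,
        Color           INTEGER NOT NULL,
        ProductionOrder INTEGER NOT NULL
    );
    -- The smart key is calculated from the columns rather than stored.
    CREATE VIEW ProductWithSerial AS
    SELECT *,
           printf('%s%s%04d%03d%d%d%06d',
                  ProductType, Subtype, Lot, DayOfYear,
                  YearDigit, Color, ProductionOrder) AS ProductSerialNumber
    FROM Product;
""")
conn.execute("INSERT INTO Product VALUES ('X', 'JV', 1023, 293, 9, 2, 123)")

serial = conn.execute(
    "SELECT ProductSerialNumber FROM ProductWithSerial"
).fetchone()[0]
# Queries on a single part never have to decode the serial number.
lot = conn.execute("SELECT Lot FROM Product").fetchone()[0]
```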
A big problem with smart keys is that it is possible to run out of unique values for the
constituent parts, or some part of the key (e.g., the product type or subtype) may change. It is
imperative that you be very careful and plan ahead if you use smart keys to represent multiple
pieces of information. When you have to change the format of smart keys, it often becomes a large
validation problem to make sure that different values of the smart key are actually valid.
■ Note Smart keys are useful tools to communicate a lot of information to the user in a small package. However,
all the bits of information that make up the smart key need to be identified, documented, and implemented in a
straightforward manner. Optimum SQL code expects the data to all be stored in individual columns, and as such, it
is of great importance that you needn’t ever base computing decisions on decoding the value. We will talk more
about the subject of choosing implementation keys in Chapter 5.
Surrogate Keys
Surrogate keys (sometimes described as artificial keys) are kind of the opposite of natural keys. The
word surrogate means “something that substitutes for,” and in this case, a surrogate key substitutes
for a natural key. Sometimes there may not be a natural key that you think is stable or reliable
enough to use, in which case you may decide to use a surrogate key. In reality, many of our
examples of natural keys were actually surrogate keys in their original database but were elevated
to a natural status by usage in the “real” world.
A surrogate key can uniquely identify each instance of an entity, but it has no actual meaning
with regard to that entity other than to represent existence. Surrogate keys are usually maintained
by the system. Common methods for creating surrogate key values are using a monotonically
increasing number (e.g., an Identity column), some form of hash function, or even a globally
unique identifier (GUID), which is a very long identifier designed to be unique across all machines
in the world.
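Two of these methods can be sketched briefly. In the example below, SQLite's AUTOINCREMENT plays
the role of an increasing, system-assigned number, and Python's uuid module produces a random
GUID-style value. The Customer table, its columns, and the sample customer numbers are
hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerId     INTEGER PRIMARY KEY AUTOINCREMENT,  -- assigned by the system
        CustomerNumber TEXT NOT NULL UNIQUE                -- natural key the user sees
    )
""")

# The caller never supplies CustomerId; the database hands out the next value.
ids = [
    conn.execute(
        "INSERT INTO Customer (CustomerNumber) VALUES (?)", (number,)
    ).lastrowid
    for number in ("C-100", "C-200", "C-300")
]

# A randomly generated 128-bit identifier, shown in its usual 36-character form.
guid_a = uuid.uuid4()
guid_b = uuid.uuid4()
```

Neither kind of value says anything about the customer it identifies, which is the defining
trait of a surrogate.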
The concept of a surrogate key can be troubling to purists. Since the surrogate key does not
describe the row at all, can it really be an attribute of the row? Nevertheless, an exceptionally nice
aspect of a surrogate key is that the value of the key should never change. This, coupled with the fact
that surrogate keys are always a single column, makes several aspects of implementation far easier.
The only reason for the existence of the surrogate key is to identify a row. The main reason for
an artificial key is to provide a key that an end user never has to view and never has to interact with.
Think of it like your driver’s license number, an ID number that is given to you when you begin to
drive. It may have no other meaning than a number that helps a police officer look up who you are
when you’ve been testing to see just how fast you can go in sixth gear (although in the United King-
dom it is a scrambled version of the date of birth). The surrogate key should always have some
element that is just randomly chosen, and it should never be based on data that can change. If your
driver’s license number were a smart key and decoded to include your hair color, the driver’s license
number might change frequently (for some youth and we folks whose hair has turned a different
color). No, this value is good only for looking you up in a database.
Usually a true surrogate key is never shared with any users. It will be a value generated on the
computer system that is hidden from use, while the user directly accesses only the natural keys’ val-
ues. Probably the best reason for this definition is that once a user has access to a value, it then may
need to be modified. For example, if you were customer 0000013 or customer 00000666, you might
request a change.
■ Note In some ways, surrogate keys should probably not even be mentioned in the logical design section of this
book, but it is important to know of their existence, since they will undoubtedly still crop up in some logical
designs. A typical flame war on the newsgroups (and amongst the tech reviewers of this book) is concerning
whether surrogate keys are a good idea. I’m a proponent of their use (as you will see), but I try to be fairly open in
my approach in the book to demonstrate both ways of doing things. Generally speaking, if a value is going to be
accessible to the end user, my preference is that it really needs to be modifiable and readable. You can also have
two surrogate keys in a table: one that is the unchanging “address” of a value, the other that is built for user con-
sumption (that is compact, readable, and changeable if it somehow offends your user).
Just as the driver’s license number probably has no meaning to the police officer other than a
means to quickly call up and check your records, the surrogate is used to make working with the
data programmatically easier. Since the source of the value for the surrogate key does not have any
correspondence to something a user might care about, once a value has been associated with a row,
there is not ever a reason to change the value. This is an exceptionally nice aspect of surrogate keys.
The fact that the value of the key does not change, coupled with the fact that it is always a single col-
umn, makes several aspects of implementation far easier. This will be made clearer later in the book
when choosing a primary key.
Thinking back to the driver’s license analogy, if the driver’s license card has just a single value
(the surrogate key) on it, how would Officer Uberter Sloudoun determine whether you were actually
the person identified? He couldn’t, so there are other attributes listed, such as name, birth date, and
usually your picture, which is an excellent unique key for a human to deal with (except possibly for
identical twins, of course). In this very same way, a table ought to have other keys defined as well, or
it is not a proper table.
Consider the earlier example of a product identifier consisting of seven parts:
• X: Type of product (LCD television)
• JV: Subtype of product (32-inch console)
• 1023: Lot that the product was produced in (the 1023rd batch produced)
• 293: Day of year
• 9: Last digit of year
• 2: Color
• 000123: Order of production
A natural key would consist of these seven parts. There is also a product serial number, which is
the concatenation of the values, such as XJV102329392000123, to identify the row. Say you also have
a surrogate key on the table that has a value of 3384038483. If the only key defined on the rows is the
surrogate, the following situation might occur:
SurrogateKey  ProductSerialNumber
------------  -------------------
10            XJV102329392000123
3384038483    XJV102329392000123
3384434222    ZJV104329382043534
The first two rows are not duplicates, but since the surrogate key values have no real meaning,
in essence these are duplicate rows, since the user could not effectively tell them apart.
This sort of problem is common, because most people using surrogate keys do not understand
that only having a surrogate key opens them up to having rows with duplicate data in the columns
where the data has some logical relationship to each other. A user looking at the preceding table would
have no clue which row actually represented the product he or she was after, or if both rows did.
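The situation described above is easy to reproduce. The following sketch loads the same three
rows into two tables: one with only the surrogate key, and one that also declares the serial
number as a key. It uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, and
the table names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductSurrogateOnly (
        SurrogateKey        INTEGER PRIMARY KEY,
        ProductSerialNumber TEXT NOT NULL
    );
    CREATE TABLE ProductWithNaturalKey (
        SurrogateKey        INTEGER PRIMARY KEY,
        ProductSerialNumber TEXT NOT NULL UNIQUE   -- the natural key is enforced too
    );
""")

rows = [
    (10, "XJV102329392000123"),
    (3384038483, "XJV102329392000123"),
    (3384434222, "ZJV104329382043534"),
]

# With only the surrogate key, all three rows are accepted.
conn.executemany("INSERT INTO ProductSurrogateOnly VALUES (?, ?)", rows)
surrogate_only_count = conn.execute(
    "SELECT COUNT(*) FROM ProductSurrogateOnly"
).fetchone()[0]

# With the natural key declared, the repeated serial number is rejected.
rejected = []
for row in rows:
    try:
        conn.execute("INSERT INTO ProductWithNaturalKey VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)
```

The surrogate still identifies each row for the system; the additional key is what protects the
user from two rows that mean the same thing.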
■ Note When doing logical design, I tend to model each table with a surrogate key, since during the design
process I may not yet know what the final keys will in fact turn out to be. This approach will become obvious
throughout the book, especially in the case study presented throughout much of the book.
Missing Values (NULLs)
If you look up the definition of a “loaded subject” in a computer dictionary, you will likely find the
word NULL. In the database, there must exist some way to say that the value of a given column is not
known or that the value is irrelevant. Often, a value outside of the legitimate actual range
(sometimes referred to as a sentinel value) is used to denote this. For decades, programmers have
used ancient dates in a date column to indicate that a certain value does not matter, used a negative
value where it does not make sense in the context of a column, or simply used a text string of
'UNKNOWN' or 'N/A'. These approaches are fine, but special coding is required to deal with these val-
ues, for example:
IF (value<>'UNKNOWN') THEN ...
This is OK if it needs to be done only once. The problem, of course, is that this special coding is
needed every time a new type of column is added. Instead, it is common to use a value of NULL,
which in relational theory means an empty set or a set with no value. Going back to Codd’s rules,
the third rule states the following:
NULL values (distinct from empty character string or a string of blank characters or zero) are
supported in the RDBMS for representing missing information in a systematic way, independ-
ent of data type.
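The difference between sentinel values and NULL can be sketched as follows. In the first table,
each column needs its own "missing" marker and its own test; in the second, the same IS NOT NULL
test works for every column regardless of type. The Shipment tables, their columns, and the
sentinel values 'UNKNOWN' and -1 are assumptions made for the example, which runs on SQLite
through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ShipmentSentinel (
        ShipmentId INTEGER PRIMARY KEY, Carrier TEXT, WeightKg REAL
    );
    CREATE TABLE ShipmentNull (
        ShipmentId INTEGER PRIMARY KEY, Carrier TEXT, WeightKg REAL
    );
""")

# Sentinel style: each column invents its own "missing" marker.
conn.executemany(
    "INSERT INTO ShipmentSentinel VALUES (?, ?, ?)",
    [(1, "Acme Freight", 12.5), (2, "UNKNOWN", -1)],
)
# NULL style: Python's None is stored as NULL, whatever the column type.
conn.executemany(
    "INSERT INTO ShipmentNull VALUES (?, ?, ?)",
    [(1, "Acme Freight", 12.5), (2, None, None)],
)

# Special coding per column type for the sentinels...
known_sentinel = conn.execute(
    "SELECT ShipmentId FROM ShipmentSentinel "
    "WHERE Carrier <> 'UNKNOWN' AND WeightKg <> -1"
).fetchall()
# ...versus one systematic test for NULL.
known_null = conn.execute(
    "SELECT ShipmentId FROM ShipmentNull "
    "WHERE Carrier IS NOT NULL AND WeightKg IS NOT NULL"
).fetchall()
```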
There are a couple of properties of NULL that you need to consider:
• Any value concatenated with NULL is NULL. NULL can represent any valid value, so if an
unknown value is concatenated with a known value, the result is still an unknown value.
• All math operations with NULL will return NULL, for the very same reason that any value
concatenated with NULL returns NULL.
• Logical comparisons can get tricky when NULL is introduced.
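These properties can be observed directly. The sketch below uses SQLite through Python's sqlite3
module, where NULL comes back as None; other products may use a different concatenation operator
or have settings that alter this behavior, so treat it as an illustration rather than a statement
about any particular server. The table T and its Value column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Concatenation, arithmetic, and comparison with NULL all yield NULL (None).
concatenated, added, compared = conn.execute(
    "SELECT 'known' || NULL, 1 + NULL, NULL = NULL"
).fetchone()

conn.execute("CREATE TABLE T (Value TEXT)")
conn.executemany("INSERT INTO T VALUES (?)", [(None,), ("a",)])

# "= NULL" is never true, so it matches no rows; IS NULL is the test to use.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM T WHERE Value = NULL"
).fetchone()[0]
is_null = conn.execute(
    "SELECT COUNT(*) FROM T WHERE Value IS NULL"
).fetchone()[0]
```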