
Joe Celko's SQL for Smarties - Advanced SQL Programming



2 CHAPTER 1: DATABASE DESIGN

types, denormalization, and missing or incorrect constraints. As Elbert
Hubbard (American author, 1856-1915) put it: “Genius may have its
limitations, but stupidity is not thus handicapped.”

1.1 Schema and Table Creation

The major problem in learning SQL is that programmers are used to
thinking in terms of files rather than tables.
Programming languages are usually based on some underlying
model; if you understand the model, the language makes much more
sense. For example, FORTRAN is based on algebra. This does not mean
that FORTRAN is exactly like algebra. But if you know algebra,
FORTRAN does not look all that strange to you. You can write an
expression in an assignment statement or make a good guess as to the
names of library functions you have never seen before.
Programmers are used to working with files in almost every other
programming language. The design of files was derived from paper
forms; they are very physical and very dependent on the host
programming language. A COBOL file could not easily be read by a
FORTRAN program, and vice versa. In fact, it was hard to share files even
among programs written in the same programming language!
The most primitive form of a file is a sequence of records, ordered
within the file and referenced by physical position. You open a file, then
read a first record, followed by a series of next records until you come to
the last record to raise the end-of-file condition. You navigate among
these records and perform actions one record at a time. The actions you
take on one file have no effect on other files that are not in the same
program. Only programs can change files.


The model for SQL is data kept in sets, not in physical files. The “unit
of work” in SQL is the whole schema, not individual tables.
Sets are those mathematical abstractions you studied in school. Sets
are not ordered, and the members of a set are all of the same type. When
you perform an operation on a set, the action happens “all at once” to the
entire membership of the set. That is, if I ask for the subset of odd
numbers from the set of positive integers, I get all of them back as a
single set. I do not build the set of odd numbers by sequentially
inspecting one element at a time. I define odd numbers with a rule “If the
remainder is 1 when you divide the number by 2, it is odd” that could
test any integer and classify it. Parallel processing is one of many, many
advantages of having a set-oriented model.
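
As a minimal sketch of the "rule, not loop" idea, the snippet below runs SQL through Python's standard sqlite3 module. The table name Integers and the column n are hypothetical, and SQLite is only a stand-in for whatever SQL product you use. The WHERE clause states the rule for oddness once, and the engine returns the whole subset as a single result.

```python
import sqlite3

def odd_numbers(limit=10):
    """Return the odd members of {1..limit} using one set-oriented query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table holding the positive integers 1..limit.
    conn.execute("CREATE TABLE Integers (n INTEGER NOT NULL PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO Integers (n) VALUES (?)",
        [(i,) for i in range(1, limit + 1)],
    )
    # The rule "remainder is 1 when divided by 2" is declared once;
    # the host program does not walk the rows itself.
    rows = conn.execute("SELECT n FROM Integers WHERE n % 2 = 1").fetchall()
    conn.close()
    # A Python set is used because the result has no inherent order.
    return {n for (n,) in rows}
```

The result is compared as a set on purpose: without an ORDER BY clause, the query promises nothing about the order of the rows.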


SQL is not a perfect set language any more than FORTRAN is a
perfect algebraic language, as we will see. But if you are in doubt about
something in SQL, ask yourself how you would specify it in terms of sets,
and you will probably get the right answer.

1.1.1 CREATE SCHEMA Statement

A CREATE SCHEMA statement, defined in the SQL Standard, brings an
entire schema into existence all at once. In practice, each product has
very different utility programs to allocate physical storage and define a
schema. Much of the proprietary syntax is concerned with physical
storage allocations.
A schema must have a name and a default character set, usually
ASCII or a simple Latin alphabet as defined in the ISO Standards. There
is an optional AUTHORIZATION clause that holds a <schema
authorization identifier> for security. After that the schema is a
list of schema elements:

<schema element> ::=
<domain definition> | <table definition> | <view definition>
| <grant statement> | <assertion definition>
| <character set definition>
| <collation definition> | <translation definition>

A schema is the skeleton of an SQL database; it defines the structures
of the schema objects and the rules under which they operate. The data
is the meat on that skeleton.
The only data structure in SQL is the table. Tables can be persistent
(base tables), used for working storage (temporary tables), or virtual
(VIEWs, common table expressions, and derived tables). The differences
among these types are in implementation, not performance. One
advantage of having only one data structure is that the results of all
operations are also tables; you never have to convert structures, write
special operators, or deal with any irregularity in the language.
The <grant statement> has to do with limiting user access to
certain schema elements. The <assertion definition> is not
widely implemented yet, but it works as a constraint that applies to the
schema as a whole. Finally, the <character set definition>,
<collation definition>, and <translation definition>
deal with the display of data. We are not really concerned with any of
these schema objects; they are usually set in place by the DBA (database
administrator) for the users, and we mere programmers do not get to
change them.
Conceptually, a table is a set of zero or more rows, and a row is a set
of one or more columns. Each column has a specific data type and
constraints that make up an implementation of an abstract domain.
The way a table is physically implemented does not matter, because
you only access it with SQL. The database engine handles all the details
for you and you never worry about the internals, as you would with a
physical file.
In fact, almost no two SQL products use the same internal structures.
SQL Server uses physically contiguous storage accessed by two kinds of
indexes; Teradata uses hashing; Nucleus (SAND Technology) uses
compressed bit vectors; Informix and CA-Ingres use more than a dozen
different kinds of indexes.
There are two common conceptual errors made by programmers who
are accustomed to file systems or PCs. The first is thinking that a table is
a file; the second is thinking that a table is a spreadsheet. Tables do not
behave like either, and you will get surprises if you do not understand
the basic concepts.
It is easy to imagine that a table is a file, a row is a record, and a
column is a field. This concept is familiar, and when data moves from
SQL to the host language, it must be converted into host language data
types and data structures to be displayed and used.
The big differences between working with a file system and working
with SQL are in the way SQL fits into a host program. If you are using a
file system, your programs must open and close files individually. In
SQL, the whole schema is connected to or disconnected from the
program as a single unit. The host program might not be authorized to
see or manipulate all of the tables and other schema objects, but that is
established as part of the connection.
The program defines fields within a file, whereas SQL defines its
columns in the schema. FORTRAN uses the FORMAT and READ
statements to get data from a file. Likewise, a COBOL program uses a
Data Division to define the fields and a READ to fetch them. And so it goes
for every 3GL; the concept is the same, though the syntax
and options vary.
A file system lets you reference the same data by a different name in
each program. If a file’s layout changes, you must rewrite all the
programs that use that file. When a file is empty, it looks exactly like all
other empty files. When you try to read an empty file, the EOF (end of
file) flag pops up and the program takes some action. Column names
and data types in a table are defined within the database schema. Within
reasonable limits, the tables can be changed without the knowledge of
the host program.
The host program only worries about transferring the values to its
own variables from the database. Remember the empty set from your
high school math class? It is still a valid set. When a table is empty, it still
has columns, but has zero rows. There is no EOF flag to signal an
exception, because there is no final record.
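
The empty-table behavior can be sketched with Python's standard sqlite3 module. The table Personnel and its columns are hypothetical, and SQLite stands in for any SQL product. Querying the empty table is not an error: the result still has its column structure, and it simply contains zero rows.

```python
import sqlite3

def describe_empty_table():
    """Query a table with no rows and report what comes back."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; nothing is ever inserted into it.
    conn.execute(
        "CREATE TABLE Personnel ("
        " emp_name TEXT NOT NULL,"
        " dept_name TEXT NOT NULL)"
    )
    cur = conn.execute("SELECT emp_name, dept_name FROM Personnel")
    # The result set keeps its columns even though it has no rows.
    column_names = [d[0] for d in cur.description]
    rows = cur.fetchall()  # an empty list, not an end-of-file condition
    row_count = conn.execute("SELECT COUNT(*) FROM Personnel").fetchone()[0]
    conn.close()
    return column_names, rows, row_count
```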
Another major difference is that tables and columns can have
constraints attached to them. A constraint is a rule that defines what
must be true about the database after each transaction. In this sense, a
database is more like a collection of objects than a traditional passive file
system.
A table is not a spreadsheet, even though they look very similar when
you view them on a screen or in a printout. In a spreadsheet you can
access a row, a column, a cell, or a collection of cells by navigating with a
cursor. A table has no concept of navigation. Cells in a spreadsheet can
store instructions, not just data. There is no real difference between a
row and column in a spreadsheet; you could flip them around
completely and still get valid results. This is not true for an SQL table.

1.1.2 Manipulating Tables


The three basic table statements in the SQL DDL are CREATE TABLE,
DROP TABLE, and ALTER TABLE. They pretty much do what you would
think they do from their names: they bring a table into existence, remove
a table, and change the structure of an existing table in the schema,
respectively. We will explain them in detail shortly. Here is a simple list
of rules for creating and naming a table.

1. The table name must be unique in the schema, and the column
names must be unique within a table. SQL can handle a table
and a column with the same name, but it is a good practice to
name tables differently from their columns. (See items 4 and 6
in this list.)
2. The names in SQL can consist of letters, underscores, and
digits. Vendors commonly allow other printing characters, but
it is a good idea to avoid using anything except letters,
underscores, and digits. Special characters are not portable and
will not sort the same way in different products.



3. Standard SQL allows you to use spaces, reserved words, and
special characters in a name if you enclose them in double
quotation marks, but this should be avoided as much as
possible.
4. The use of collective, class, or plural names for tables helps you
think of them as sets. For example, do not name a table
“Employee” unless there really is only one employee; use
something like “Employees” or (better) “Personnel,” for the
table name.
5. Use the same name for the same attribute everywhere in the
schema. That is, do not name a column in one table “sex” and a
column in another table “gender” when they refer to the same
property. You should have a data dictionary that enforces this
on your developers.
6. Use singular attribute names for columns and other scalar
schema objects.
I have a separate book on SQL programming style that goes into more
detail about this, so I will not mention it again.
A table must have at least one column. Though it is not required, it is
also a good idea to place related columns in their conventional order in
the table. By default, the columns will print out in the order in which
they appear in the table. That means you should put name, address, city,
state, and ZIP code in that order, so that you can read them easily in a
display.
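
A small sketch of the default column order, using Python's standard sqlite3 module. The table Addresses and its column names are hypothetical; they were chosen to follow the naming rules above (a plural table name and singular column names). In SQLite, SELECT * returns the columns in the order in which they were declared.

```python
import sqlite3

def default_column_order():
    """Show that SELECT * follows the column order of the declaration."""
    conn = sqlite3.connect(":memory:")
    # Related columns are declared in their conventional reading order.
    conn.execute(
        "CREATE TABLE Addresses ("
        " addressee_name TEXT NOT NULL,"
        " street_address TEXT NOT NULL,"
        " city_name TEXT NOT NULL,"
        " state_code TEXT NOT NULL,"
        " zip_code TEXT NOT NULL)"
    )
    cur = conn.execute("SELECT * FROM Addresses")
    names = [d[0] for d in cur.description]
    conn.close()
    return names
```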
The conventions in this book are that keywords are in UPPERCASE,
table names are Capitalized, and column names are in lowercase. I also
use capital letter(s) followed by digit(s) for correlation names (e.g., the
table Personnel would have correlation names P0, P1, . . ., Pn), where the
digit shows the occurrence.

DROP TABLE <table name>

The DROP TABLE statement removes a table from the database. This is
not the same as making the table an empty table. When a schema object
is dropped, it is gone forever. The syntax of the statement is:

<drop table statement> ::= DROP TABLE <table name> [<drop behavior>]
<drop behavior> ::= RESTRICT | CASCADE
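
The difference between an empty table and a dropped table can be sketched with Python's standard sqlite3 module. The table Personnel and the helper table_exists are hypothetical. Note two assumptions specific to SQLite: it does not accept the RESTRICT or CASCADE keywords on DROP TABLE, so the statement is written without a drop behavior, and the catalog lookup uses SQLite's own sqlite_master table rather than a Standard schema information table.

```python
import sqlite3

def table_exists(conn, name):
    """Hypothetical helper: look the table up in SQLite's catalog."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row[0] == 1

def empty_versus_dropped():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Personnel (emp_name TEXT NOT NULL)")
    conn.execute("INSERT INTO Personnel (emp_name) VALUES ('Alice')")
    # Emptying the table removes the rows but keeps the schema object.
    conn.execute("DELETE FROM Personnel")
    exists_after_delete = table_exists(conn, "Personnel")
    # Dropping the table removes the schema object itself.
    conn.execute("DROP TABLE Personnel")
    exists_after_drop = table_exists(conn, "Personnel")
    conn.close()
    return exists_after_delete, exists_after_drop
```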


The <drop behavior> clause has two options. If RESTRICT is
specified, the table cannot be dropped while it is referenced in the query
expression of any view or the search condition of any constraint. This
clause is supposed to prevent the unpleasant surprise of having other
things fail because they depended on this particular table for their own
definitions. If CASCADE is specified, then such referencing objects will
also be dropped along with the table.
Without this clause, either the particular SQL product would post an
error message, and in effect do a RESTRICT, or you would find out about
any dependencies by having your database blow up when it ran into
constructs that needed the missing table.
The DROP keyword and <drop behavior> clause are also used in
other statements that remove schema objects, such as DROP VIEW,
DROP SCHEMA, DROP CONSTRAINT, and so forth.
This is usually a “DBA-only” statement that, for obvious reasons,
programmers are not typically allowed to use.

ALTER TABLE

The ALTER TABLE statement adds, removes, or changes columns and
constraints within a table. This statement is in Standard SQL; it existed in
most SQL products before it was standardized. It is still implemented in
many different ways, so you should consult your product's documentation
for details. Again, your DBA will not want you to use this statement
without permission.

The Standard SQL syntax looks like this:

ALTER TABLE <table name> <alter table action>
<alter table action> ::=
DROP [COLUMN] <column name> <drop behavior>
| ADD [COLUMN] <column definition>
| ALTER [COLUMN] <column name> <alter column action>
| ADD <table constraint definition>
| DROP CONSTRAINT <constraint name> <drop behavior>

The DROP COLUMN clause removes the column from the table.
Standard SQL gives you the option of setting the drop behavior, which
most current products do not. The two options are RESTRICT and
CASCADE. RESTRICT will not allow the column to disappear if it is
referenced in another schema object. CASCADE will also delete any
schema object that references the dropped column.


When this statement is available in your SQL product, I strongly
advise that you first use the RESTRICT option to see if there are
references before you use the CASCADE option.
As you might expect, the ADD COLUMN clause extends the existing
table by putting another column on it. The new column must have a
name that is unique within the table and that follows the other rules for a
valid column declaration. The location of the new column is usually at
the end of the list of the existing columns in the table.
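
The ADD COLUMN behavior can be sketched with Python's standard sqlite3 module. The table Personnel and its columns are hypothetical. SQLite is used as a stand-in, and it has its own restriction that is an assumption of this sketch: a NOT NULL column can only be added if it also carries a non-NULL default, which the existing rows then receive.

```python
import sqlite3

def add_column_demo():
    """Add a column to a populated table and inspect the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Personnel (emp_name TEXT NOT NULL)")
    conn.execute("INSERT INTO Personnel (emp_name) VALUES ('Alice')")
    # The new column is appended after the existing columns.
    conn.execute(
        "ALTER TABLE Personnel"
        " ADD COLUMN dept_name TEXT DEFAULT '{{unknown}}' NOT NULL"
    )
    cur = conn.execute("SELECT * FROM Personnel")
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    conn.close()
    return names, rows
```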
The ALTER COLUMN clause can change a column and its definition.
Exactly what is allowed will vary from product to product, but usually
the data type can be changed to a compatible data type [e.g., you can
make a CHAR(n) column longer, but not shorter; change an INTEGER to
a REAL; and so forth].
The ADD <table constraint definition> clause lets you put
a constraint on a table. Be careful, though, and find out whether your
SQL product will check the existing data to be sure that it can pass the
new constraint. It is possible in some older SQL products to leave bad
data in the tables, and then you will have to clean them out with special
routines to get to the actual physical storage.
The DROP CONSTRAINT clause requires that the constraint be given
a name, so naming constraints is a good habit to get into. If the
constraint to be dropped was given no name, you will have to find what
name the SQL engine assigned to it in the schema information tables and
use that name. The Standard does not say how such names are to be
constructed, only that they must be unique within a schema. Actual
products usually pick a long random string of digits and preface it with
some letters to make a valid name that is so absurd no human being
would think of it. A constraint name will also appear in warnings and
error messages, making debugging much easier. The <drop behavior>
option behaves as it did for the DROP COLUMN clause.
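
To illustrate why a named constraint helps with debugging, here is a minimal sketch using Python's standard sqlite3 module. The table Personnel, the column salary_amt, and the constraint name salary_must_be_positive are all hypothetical. The function returns the text of the error raised by a violating INSERT; in recent SQLite builds that text typically includes the constraint name, but the exact wording is product-specific and is an assumption here. SQLite itself cannot drop a constraint with ALTER TABLE, so only the naming side is shown.

```python
import sqlite3

def named_constraint_error():
    """Violate a named CHECK constraint and return the error text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Personnel ("
        " emp_name TEXT NOT NULL,"
        " salary_amt INTEGER NOT NULL,"
        " CONSTRAINT salary_must_be_positive CHECK (salary_amt > 0))"
    )
    try:
        # A negative salary breaks the CHECK constraint.
        conn.execute(
            "INSERT INTO Personnel (emp_name, salary_amt) VALUES ('Alice', -1)"
        )
    except sqlite3.IntegrityError as err:
        return str(err)
    finally:
        conn.close()
    return None  # reached only if the engine accepted the bad row
```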

CREATE TABLE

The CREATE TABLE statement does all the hard work. The basic syntax
looks like the following, but there are actually more options we will
discuss later.

CREATE TABLE <table name> (<table element list>)
<table element list> ::=
<table element> | <table element>, <table element list>
<table element> ::=
<column definition> | <table constraint definition>

The table definition includes data in the column definitions and rules
for handling that data in the table constraint definitions. As a result, a
table acts more like an object (with its data and methods) than like a
simple, passive file.

Column Definitions

Beginning SQL programmers often fail to take full advantage of the
options available to them, and they pay for it with errors or extra work in
their applications. A column is not like a simple passive field in a file
system. It has more than just a data type associated with it.

<column definition> ::=
<column name> <data type>
[<default clause>]
[<column constraint>]
<column constraint> ::= NOT NULL
| <check constraint definition>
| <unique specification>
| <references specification>

The first important thing to notice here is that each column must
have a data type, which it keeps unless you ALTER the table. The SQL
Standard offers many data types, because SQL must work with many
different host languages. The data types fall into three major categories:
numeric, character, and temporal data types. We will discuss the data
types and their rules of operation in other sections; they are fairly
obvious, so not knowing the details will not stop you from reading the
examples that follow.
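
The column definition syntax can be sketched as follows with Python's standard sqlite3 module. The tables Departments and Personnel and all column names are hypothetical. Two SQLite-specific assumptions apply: its type names (INTEGER, TEXT) are looser than the Standard's, and foreign keys are only enforced after PRAGMA foreign_keys = ON. Each bad row below breaks a different column constraint, and the engine rejects it without any validation code in the host program.

```python
import sqlite3

def column_constraints_demo():
    """Try rows that break UNIQUE, CHECK, and REFERENCES constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
    conn.execute("CREATE TABLE Departments (dept_name TEXT NOT NULL PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE Personnel ("
        " emp_id INTEGER NOT NULL UNIQUE,"
        " emp_name TEXT NOT NULL,"
        " salary_amt INTEGER DEFAULT 0 NOT NULL CHECK (salary_amt >= 0),"
        " dept_name TEXT NOT NULL REFERENCES Departments (dept_name))"
    )
    conn.execute("INSERT INTO Departments (dept_name) VALUES ('Sales')")
    conn.execute("INSERT INTO Personnel VALUES (1, 'Alice', 100, 'Sales')")
    bad_rows = {
        "duplicate emp_id": (1, "Bob", 100, "Sales"),
        "negative salary": (2, "Carol", -5, "Sales"),
        "unknown department": (3, "Dave", 100, "Research"),
    }
    rejected = []
    for label, row in bad_rows.items():
        try:
            conn.execute("INSERT INTO Personnel VALUES (?, ?, ?, ?)", row)
        except sqlite3.IntegrityError:
            rejected.append(label)
    row_count = conn.execute("SELECT COUNT(*) FROM Personnel").fetchone()[0]
    conn.close()
    return rejected, row_count
```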

DEFAULT Clause

The default clause is an underused feature, whose syntax is:

<default clause> ::=
[CONSTRAINT <constraint name>] DEFAULT <default option>
<default option> ::= <literal> | <system value> | NULL
<system value> ::= CURRENT_DATE | CURRENT_TIME |
CURRENT_TIMESTAMP | SYSTEM_USER | SESSION_USER | CURRENT_USER

The SQL 2003 Standard also added CURRENT_PATH and
<implicitly typed value specification>.
Whenever the SQL engine does not have an explicit value to put into
this column during an insertion, it will look for a DEFAULT clause and
insert that value. The default option can be a literal value of the relevant
data type, the current timestamp, the current date, the current user
identifier, and so forth. If you do not provide a DEFAULT clause and the
column is NULL-able, the system will provide a NULL as the default. If all
that fails, you will get an error message about missing data.
This approach is a good way to make the database do a lot of work
that you would otherwise have to code into all the application programs.
The most common tricks are to use a zero in numeric columns; a string
to encode a missing value ('{{unknown}}') or a true default (“same
address”) in character columns; and the system timestamp to mark
transactions.
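
The three common tricks just described can be sketched with Python's standard sqlite3 module. The table Orders and its columns are hypothetical, and SQLite stands in for any SQL product (it stores the timestamp as text of the form YYYY-MM-DD HH:MM:SS). The INSERT supplies only the key; the engine fills in the remaining columns from their DEFAULT clauses.

```python
import sqlite3

def default_clause_demo():
    """Insert a row with only the key and let the defaults fill the rest."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Orders ("
        " order_nbr INTEGER NOT NULL PRIMARY KEY,"
        " order_qty INTEGER DEFAULT 0 NOT NULL,"                     # zero in a numeric column
        " customer_name TEXT DEFAULT '{{unknown}}' NOT NULL,"        # encoded missing value
        " order_timestamp TEXT DEFAULT CURRENT_TIMESTAMP NOT NULL)"  # marks the insertion
    )
    conn.execute("INSERT INTO Orders (order_nbr) VALUES (1)")
    row = conn.execute(
        "SELECT order_qty, customer_name, order_timestamp"
        " FROM Orders WHERE order_nbr = 1"
    ).fetchone()
    conn.close()
    return row
```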

1.1.3 Column Constraints

Column constraints are rules attached to a table. All the rows in the table
are validated against them. File systems have nothing like this, since
validation is done in the application programs. Column constraints are
also one of the most underused features of SQL, so you will look like a
real wizard if you can master them.
Constraints can be given a name and some attributes. The SQL engine
will use the constraint name to alter the column and to display error
messages.

<constraint name definition> ::= CONSTRAINT <constraint name>
<constraint attributes> ::=
<constraint check time> [[NOT] DEFERRABLE]
| [NOT] DEFERRABLE [<constraint check time>]
<constraint check time> ::= INITIALLY DEFERRED | INITIALLY IMMEDIATE

A deferrable constraint can be “turned off” during a transaction. The
initial state tells you whether to enforce it at the start of the transaction or
wait until the end of the transaction, before the COMMIT. Only certain
combinations of these attributes make sense.

1. If INITIALLY DEFERRED is specified, then the constraint has
to be DEFERRABLE.
2. If INITIALLY IMMEDIATE is specified or implicit and neither
DEFERRABLE nor NOT DEFERRABLE is specified, then NOT
DEFERRABLE is implicit.

A transaction can then use the following statement to set
the constraints as needed.

<set constraints mode statement> ::=
SET CONSTRAINTS <constraint name list> {DEFERRED | IMMEDIATE}
<constraint name list> ::=
ALL | <constraint name> [{<comma> <constraint name>} ]

This feature was new with full SQL-92, and it is not widely
implemented in the smaller SQL products. In effect, they use
'NOT DEFERRABLE INITIALLY IMMEDIATE' on all the constraints.
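
A deferred constraint can be sketched with Python's standard sqlite3 module, under some loud assumptions: SQLite honors DEFERRABLE INITIALLY DEFERRED only on foreign key constraints, it does not implement the SET CONSTRAINTS statement, and foreign keys must be switched on with PRAGMA foreign_keys = ON. The tables Departments and Personnel and the constraint name personnel_dept_fk are hypothetical. The first transaction inserts a child row before its parent and still commits, because the check waits until COMMIT; the second never supplies the parent, so its COMMIT is rejected.

```python
import sqlite3

def deferred_fk_demo():
    """Show a foreign key that is checked at COMMIT, not per statement."""
    # isolation_level=None: we issue BEGIN / COMMIT / ROLLBACK ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
    conn.execute("CREATE TABLE Departments (dept_name TEXT NOT NULL PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE Personnel ("
        " emp_name TEXT NOT NULL PRIMARY KEY,"
        " dept_name TEXT NOT NULL,"
        " CONSTRAINT personnel_dept_fk"
        "  FOREIGN KEY (dept_name) REFERENCES Departments (dept_name)"
        "  DEFERRABLE INITIALLY DEFERRED)"
    )
    # Child row first, parent row second: valid by the time of COMMIT.
    conn.execute("BEGIN")
    conn.execute("INSERT INTO Personnel VALUES ('Alice', 'Sales')")
    conn.execute("INSERT INTO Departments VALUES ('Sales')")
    conn.execute("COMMIT")
    committed_rows = conn.execute("SELECT COUNT(*) FROM Personnel").fetchone()[0]

    # The parent row never arrives, so the COMMIT itself fails.
    conn.execute("BEGIN")
    conn.execute("INSERT INTO Personnel VALUES ('Bob', 'Research')")
    try:
        conn.execute("COMMIT")
        commit_rejected = False
    except sqlite3.IntegrityError:
        commit_rejected = True
        conn.execute("ROLLBACK")
    conn.close()
    return committed_rows, commit_rejected
```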

NOT NULL Constraint

The most important column constraint is NOT NULL, which forbids
the use of NULLs in a column. Use this constraint routinely, and remove
it only when you have good reason. It will help you avoid the
complications of NULL values when you make queries against the data.
The other side of the coin is that you should provide a DEFAULT value to
replace the NULL that would have been created.
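
The pairing of NOT NULL with a DEFAULT can be sketched with Python's standard sqlite3 module. The table Personnel and its columns are hypothetical, and SQLite is only a stand-in for your SQL product. Omitting dept_name lets the default fill the gap, while an explicit NULL is rejected by the constraint.

```python
import sqlite3

def not_null_demo():
    """Pair NOT NULL with a DEFAULT and try to store an explicit NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Personnel ("
        " emp_name TEXT NOT NULL,"
        " dept_name TEXT DEFAULT '{{unknown}}' NOT NULL)"
    )
    # Column omitted: the DEFAULT replaces the NULL that would have been created.
    conn.execute("INSERT INTO Personnel (emp_name) VALUES ('Alice')")
    try:
        # Explicit NULL: the NOT NULL constraint refuses the row.
        conn.execute(
            "INSERT INTO Personnel (emp_name, dept_name) VALUES ('Bob', NULL)"
        )
        null_rejected = False
    except sqlite3.IntegrityError:
        null_rejected = True
    rows = conn.execute("SELECT emp_name, dept_name FROM Personnel").fetchall()
    conn.close()
    return null_rejected, rows
```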
The NULL is a special marker in SQL that belongs to all data types.
SQL is the only language that has such a creature; if you can understand
how it works, you will have a good grasp of SQL. It is not a value; it is a
marker that holds a place where a value might go. But it must be cast to a
data type for physical storage.
A NULL means that we have a missing, unknown, miscellaneous, or
inapplicable value in the data. It can mean many other things, but just
consider those four for now. The problem is that which of these four
possibilities the NULL indicates depends on how it is used. To clarify
this, imagine that I am looking at a carton of Easter eggs and I want to
know their colors. If I see an empty hole, I have a missing egg, which I
hope will be provided later. If I see a foil-wrapped egg, I have an
unknown color value in my set. If I see a multicolored egg, I have a
miscellaneous value in my set. If I see a cue ball, I have an inapplicable
value in my set. The way you handle each situation is a little different.
