Joe Celko's SQL for Smarties - Advanced SQL Programming


192 CHAPTER 6: NULLS: MISSING DATA IN SQL
4
5
Now insert a NULL and reexecute the same query:
INSERT INTO Table1 (col1) VALUES (NULL);
SELECT col1
FROM Table2
WHERE col1 NOT IN (SELECT col1 FROM Table1);
The result will be empty. This is counterintuitive, but correct. The
NOT IN predicate is defined as:
SELECT col1
FROM Table2
WHERE NOT (col1 IN (SELECT col1 FROM Table1));
The IN predicate is defined as:
SELECT col1
FROM Table2
WHERE NOT (col1 = ANY (SELECT col1 FROM Table1));
This becomes:
SELECT col1
FROM Table2
WHERE NOT ((col1 = 1)
OR (col1 = 2)
OR (col1 = 3)
OR (col1 = 4)
OR (col1 = 5)
OR (col1 = NULL));
The last expression is always UNKNOWN, so, applying De Morgan’s
laws, the query is really:
SELECT col1
FROM Table2
WHERE ((col1 <> 1)
AND (col1 <> 2)
AND (col1 <> 3)
AND (col1 <> 4)
AND (col1 <> 5)
AND UNKNOWN);
Look at the truth tables and you will see this always reduces to
UNKNOWN, and an UNKNOWN is always rejected in a search condition in a
WHERE clause.
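The effect is easy to reproduce. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in SQL engine; the sample rows (1 through 4 in Table1, 1 through 5 in Table2) are assumptions made for illustration and are not taken from the text.

```python
import sqlite3

# Assumed sample data: Table1 holds 1..4 and Table2 holds 1..5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (col1 INTEGER)")
conn.execute("CREATE TABLE Table2 (col1 INTEGER)")
conn.executemany("INSERT INTO Table1 (col1) VALUES (?)",
                 [(n,) for n in range(1, 5)])
conn.executemany("INSERT INTO Table2 (col1) VALUES (?)",
                 [(n,) for n in range(1, 6)])

NOT_IN_QUERY = """
    SELECT col1
      FROM Table2
     WHERE col1 NOT IN (SELECT col1 FROM Table1)
"""

# Without a NULL in Table1, the unmatched value comes back.
rows_before = conn.execute(NOT_IN_QUERY).fetchall()

# One NULL in the subquery turns every non-matching test into UNKNOWN.
conn.execute("INSERT INTO Table1 (col1) VALUES (NULL)")
rows_after = conn.execute(NOT_IN_QUERY).fetchall()

print(rows_before)
print(rows_after)
```

With this data the first run returns the single unmatched value and the second run returns no rows at all.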
6.5.2 Standard SQL Solutions
SQL-92 solved some of the 3VL (three-valued logic) problems by adding
a new predicate of the form:
<search condition> IS [NOT] TRUE | FALSE | UNKNOWN
This predicate will let you map any combination of three-valued logic
to two values. For example, ((age < 18) OR (gender = 'Female'))
IS NOT FALSE will return TRUE if (age IS NULL) or (gender IS NULL),
and the remaining condition does not matter.
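As a rough illustration, the sketch below compares a plain search condition with the same condition wrapped in IS NOT FALSE. It uses Python's sqlite3 module, which accepts this predicate in SQLite 3.23 and later; the People table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; age and gender are deliberately NULL-able.
conn.execute("""CREATE TABLE People
                (person_id INTEGER NOT NULL PRIMARY KEY,
                 age INTEGER,
                 gender TEXT)""")
conn.executemany("INSERT INTO People VALUES (?, ?, ?)",
                 [(1, 15, "Male"),     # condition is TRUE
                  (2, 30, "Female"),   # condition is TRUE
                  (3, 30, "Male"),     # condition is FALSE
                  (4, None, "Male"),   # condition is UNKNOWN
                  (5, 30, None)])      # condition is UNKNOWN

# A bare search condition keeps only the TRUE rows.
plain_ids = [row[0] for row in conn.execute(
    """SELECT person_id FROM People
        WHERE (age < 18) OR (gender = 'Female')
        ORDER BY person_id""")]

# IS NOT FALSE maps UNKNOWN to TRUE, so the NULL rows are kept as well.
not_false_ids = [row[0] for row in conn.execute(
    """SELECT person_id FROM People
        WHERE ((age < 18) OR (gender = 'Female')) IS NOT FALSE
        ORDER BY person_id""")]

print(plain_ids, not_false_ids)
```

Only the row whose condition is definitely FALSE is rejected by the second query.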
6.6 Math and NULLs
NULLs propagate when they appear in arithmetic expressions (+, −, *, /)
and return NULL results. See Chapter 3 on numeric data types for more
details.
6.7 Functions and NULLs
Most vendors propagate NULLs in the functions they offer as extensions
of the standard ones required in SQL. For example, the cosine of a NULL
will be NULL. There are two functions that convert between NULLs and
values:
1. NULLIF (V1, V2) returns a NULL when the first parameter
equals the second parameter. The function is equivalent to the
following case specification:
CASE WHEN (V1 = V2)
THEN NULL
ELSE V1 END
2. COALESCE (V1, V2, V3, ..., Vn) processes the list from
left to right and returns the first parameter that is not NULL. If
all the values are NULL, it returns a NULL.
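The propagation rule and the two functions can be checked quickly. This is a small sketch using Python's sqlite3 module as the SQL engine; ABS() stands in for the cosine example because it is available in every SQLite build.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULLs propagate through arithmetic and through ordinary functions.
propagated = conn.execute("SELECT 1 + NULL, ABS(NULL)").fetchone()

# NULLIF turns a matching value into NULL, otherwise returns the first argument.
nullif_row = conn.execute("SELECT NULLIF(5, 5), NULLIF(5, 3)").fetchone()

# COALESCE returns the first non-NULL argument, or NULL if there is none.
coalesce_row = conn.execute(
    "SELECT COALESCE(NULL, NULL, 7), COALESCE(NULL, NULL)").fetchone()

print(propagated, nullif_row, coalesce_row)
```

Python's driver reports an SQL NULL as None, so each NULL result shows up as None in the returned tuples.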
6.8 NULLs and Host Languages
This book does not discuss using SQL statements embedded in any
particular host language. For that information, you will need to pick up a
book for your particular language. However, you should know how
NULLs are handled when they have to be passed to a host program. No
standard host language for which an embedding is defined supports
NULLs, which is another good reason to avoid using them in your
database schema.
Roughly speaking, the programmer mixes SQL statements bracketed
by EXEC SQL and a language-specific terminator (the semicolon in
Pascal and C, END-EXEC in COBOL, and so on) into the host program.
This mixed-language program is run through an SQL preprocessor that
converts the SQL into procedure calls the host language can compile;
then the host program is compiled in the usual way.
There is an EXEC SQL BEGIN DECLARE SECTION, EXEC SQL END
DECLARE SECTION pair that brackets declarations for the host
parameter variables that will get values from the database via CURSORs.
This is the “neutral territory,” where the host and the database pass
information. SQL knows that it is dealing with a host variable, because
these have a colon prefix added to them when they appear in an SQL
statement. A CURSOR is an SQL query statement that executes and
creates a structure that looks like a sequential file. The records in the
CURSOR are returned, one at a time, to the BEGIN DECLARE section of
the host program with the FETCH statement. This avoids the impedance
mismatch between record processing in the host language and SQL’s set
orientation.

NULLs are handled by declaring INDICATOR variables in the host
language BEGIN DECLARE section, which are paired with the host
variables. An INDICATOR is an exact numeric data type with a scale of
zero; that is, some kind of integer in the host language.
The FETCH statement takes one row from the cursor, then converts
each SQL data type into a host-language data type and puts that result
into the appropriate host variable. If the SQL value was a NULL, the
INDICATOR is set to minus one; if no indicator was specified, an
exception condition is raised. As you can see, the host program must be
sure to check the INDICATORs, because otherwise the value of the
parameter will be garbage. If the parameter is passed to the host language
without any problems, the INDICATOR is set to zero. If the value being
passed to the host program is a non-NULL character string and has an
indicator, the indicator is set to the length of the SQL string and can be
used to detect string overflows or to set the length of the parameter.
Other SQL interfaces such as ODBC, JDBC, and so on have similar
mechanisms for telling the host program about NULLs, even though they
might not use cursors.
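To make the indicator convention concrete, here is a minimal sketch in Python. It is not embedded SQL: Python's database interface reports a NULL as None, and the helper fetch_with_indicators() is a hypothetical function written only to mimic the minus-one and zero indicator values described above. The Personnel columns are likewise assumptions.

```python
import sqlite3

def fetch_with_indicators(cursor):
    """Yield (row, indicators) pairs; an indicator is -1 for NULL, else 0.

    Hypothetical helper that imitates embedded SQL INDICATOR variables.
    """
    for row in cursor:
        indicators = tuple(-1 if value is None else 0 for value in row)
        yield row, indicators

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (emp_name TEXT NOT NULL, birth_date TEXT)")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)",
                 [("Alice", "1970-01-31"), ("Bob", None)])

cursor = conn.execute(
    "SELECT emp_name, birth_date FROM Personnel ORDER BY emp_name")
results = list(fetch_with_indicators(cursor))

for row, indicators in results:
    # The host program checks the indicator before trusting the value.
    birth_date = row[1] if indicators[1] == 0 else "(unknown)"
    print(row[0], birth_date)
```

The host code never inspects the fetched value until it has looked at the paired indicator, which is the discipline the text asks for.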
6.9 Design Advice for NULLs
It is a good idea to declare all your base tables with NOT NULL
constraints on all columns whenever possible. NULLs confuse people
who do not know SQL, and NULLs are expensive. NULLs are usually
implemented with an extra bit somewhere in the row where the column
appears, rather than in the column itself. They adversely affect storage
requirements, indexing, and searching.
NULLs are not permitted in PRIMARY KEY columns. Think about
what a PRIMARY KEY that was NULL (or partially NULL) would mean. A
NULL in a key means that the data model does not know what makes the
entities in that table unique from each other. That in turn says that
the DBMS cannot decide whether the PRIMARY KEY does or does not
duplicate a key that is already in the table.
NULLs should be avoided in FOREIGN KEYs. SQL allows this “benefit
of the doubt” relationship, but it can cause a loss of information in
queries that involve joins. For example, given a part number code in
Inventory that is referenced as a FOREIGN KEY by an Orders table, you
will have problems getting a listing of the parts that have a NULL. This is
a mandatory relationship; you cannot order a part that does not exist.
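The loss of information in a join can be seen in a small sketch. It uses Python's sqlite3 module; the column names (part_nbr, part_name, order_nbr) and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.execute("""CREATE TABLE Inventory
                (part_nbr INTEGER NOT NULL PRIMARY KEY,
                 part_name TEXT NOT NULL)""")
# The foreign key column is NULL-able, so "benefit of the doubt" rows get in.
conn.execute("""CREATE TABLE Orders
                (order_nbr INTEGER NOT NULL PRIMARY KEY,
                 part_nbr INTEGER REFERENCES Inventory (part_nbr))""")
conn.execute("INSERT INTO Inventory VALUES (1, 'Widget')")
conn.executemany("INSERT INTO Orders VALUES (?, ?)",
                 [(100, 1), (101, None)])

order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]

# The inner join silently drops the order whose part number is NULL.
joined = conn.execute(
    """SELECT O.order_nbr, I.part_name
         FROM Orders AS O
              INNER JOIN Inventory AS I
              ON I.part_nbr = O.part_nbr
        ORDER BY O.order_nbr""").fetchall()

print(order_count, joined)
```

Two orders are stored, but only one survives the join, and nothing in the result hints that a row went missing.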

An example of an optional foreign key is a Personnel table having a
foreign key to a ParoleOfficer table; obviously a NULL here means the
person does not (currently) have a parole officer. The NULL can be
avoided by forcing the separation of the foreign key into its own table,
such that no row exists for a person who has no parole officer. However,
this degree of normalization is not always possible, nor would it always
be desirable to force the split. There is also the issue of what to return
when a join of the two tables is required to list personnel information
plus the parole officer, if any. Finally, there is the issue of whether,
when multiple such splits have been made, retrieving the consolidated
information will result in extremely slow queries to produce all the
joined data (and to substitute whatever indicator has been chosen to
represent the “missing” data).
NULLs should not be allowed in encoding schemes that are known to
be complete. For example, employees are people and people are either
male or female. On the other hand, if you are recording the gender of
lawful persons (humans, corporations, and other legal entities), you need
the ISO sex codes, which use 0 = unknown, 1 = male, 2 = female, 9 = not
applicable. No, you have not missed a new gender; code 9 is for legal
persons, such as corporations.
The use of all zeros and all nines for “Unknown” and “N/A” is quite
common in numeric encoding schemes. This convention is a leftover
from the old punch card days, when a missing value was left as a field of
blanks (i.e., no punches) that could be punched into the card later.
Likewise, a field of all nines would sort to the end of the file, and it was
easy to hold the “nine” key down when the keypunch machine was in
numeric shift.
However, you have to use NULLs in date fields when a DEFAULT date
does not make sense. For example, if you do not know someone’s
birthdate, a default date does not make sense; if a warranty has no
expiration date, then a NULL can act as an “eternity” symbol.
Unfortunately, you often know relative times, but it is difficult to express
them in a database. For example, a pay raise occurs some time after you
have been hired, not before. A convict serving on death row should
expect a release date resolved by an event: his termination by execution
or by natural causes. This leads to extra columns to hold the status and
to control the transition constraints.

There is a proprietary extension to date values in MySQL. If you know
the year but not the month, you may enter ‘1949-00-00’. If you know the
year and month, but not the day, you may enter ‘1949-09-00’. You
cannot reliably use date arithmetic on these values, but they do help in
some instances, such as sorting people’s birthdates or calculating their
(approximate) age.
For people’s names, you are probably better off using a special
dummy string for unknown values rather than the general NULL. In
particular, you can build a list of ‘John Doe #1’, ‘John Doe #2’, and so
forth to differentiate them; and you cannot do that with NULL. Quantities
have to use a NULL in some cases. There is a difference between an
unknown quantity and a zero quantity; it is the difference between an
empty gas tank and not having a car at all. Using negative numbers to
represent missing quantities does not work, because it makes accurate
calculations too complex.
When programming languages had no DATE data types, this could
have been handled with a character string of
'9999-99-99 23:59:59.999999' for “eternity” or “the end of time.”
When 4GL products with a DATE data type came onto the market,
programmers usually inserted the maximum possible date for “eternity.”
But again, this will show up in calculations and in summary statistics.
The best trick was to use two columns, one for the date and one for a
flag. But this made for fairly complex code in the 4GL.

6.9.1 Avoiding NULLs from the Host Programs
You can avoid putting NULLs into the database from the host programs
with some programming discipline.
1. Initialize in the host program: Initialize all the data elements and
displays on the input screen of a client program before inserting
data into the database. Exactly how you can make sure that
all the programs use the same default values is another problem.
2. Use automatic defaults: The database is the final authority on the
default values.

3. Deduce values: Infer the missing data from the given values. For
example, patients reporting a pregnancy are female; patients
reporting prostate cancer are male. This technique can also be
used to limit choices to valid values for the user.
4. Track missing data: Data is tagged as missing, unknown, in error,
out-of-date, or whatever other condition makes it missing. This
will involve a companion column with special codes.
5. Determine impact of missing data on programming and reporting:
Numeric columns with NULLs are a problem, because queries
using aggregate functions can provide misleading results.
Aggregate functions drop out the NULLs before doing the
math, and the programmer has to trap the SQLSTATE code for
this to make corrections.
6. Prevent missing data: Use a batch process to scan and validate
data elements before they go into the database. In the early
2000s, there was a sudden concern for data quality as CEOs
started going to jail for failing audits. This has led to a niche in
the software trade for data quality tools.
7. Ensure consistency: The data types and their NULL-ability
constraints have to be consistent across databases (e.g., the
chart of accounts should be defined the same way in both the
desktop and enterprise-level databases).
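Item 5 in the list above is worth seeing in numbers. The sketch below, written against Python's sqlite3 module with a hypothetical Readings table, shows how the aggregate functions drop a NULL before doing the math. SQLite does this silently, so there is no warning for the host program to trap here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with one missing amount out of three rows.
conn.execute("""CREATE TABLE Readings
                (reading_id INTEGER NOT NULL PRIMARY KEY,
                 amount INTEGER)""")
conn.executemany("INSERT INTO Readings VALUES (?, ?)",
                 [(1, 10), (2, None), (3, 20)])

# COUNT(*) counts rows; COUNT(amount), SUM and AVG skip the NULL.
row_cnt, amount_cnt, amount_sum, amount_avg = conn.execute(
    """SELECT COUNT(*), COUNT(amount), SUM(amount), AVG(amount)
         FROM Readings""").fetchone()

# Treating the missing amount as zero gives a different average.
avg_with_zero = conn.execute(
    "SELECT AVG(COALESCE(amount, 0)) FROM Readings").fetchone()[0]

print(row_cnt, amount_cnt, amount_sum, amount_avg, avg_with_zero)
```

Neither average is wrong by itself; the point is that a report must say which of the two it is showing.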
6.10 A Note on Multiple NULL Values
In a discussion on CompuServe in July 1996, Carl C. Federl came up
with an interesting idea for multiple missing value tokens in a database.
If you program in embedded SQL, you are used to having to work
with an INDICATOR column. This column is used to pass information to
the host program, mostly about the NULL or NOT NULL status of the
SQL column in the database. What the host program does with the
information is up to the programmer. So why not extend this concept a
bit and provide an indicator column in SQL? Let’s work out a simple
example:
CREATE TABLE Bob
(keycol INTEGER NOT NULL PRIMARY KEY,
 valcol INTEGER NOT NULL,
 multi_indicator INTEGER NOT NULL
   CHECK (multi_indicator IN (0, -- Known value
                              1, -- Not applicable value
                              2, -- Missing value
                              3  -- Approximate value
                             )));
Let’s set up the rules: when all values are known, we do a regular total.
If a value is “not applicable,” then the whole total is “not applicable.” If
we have no “not applicable” values, then “missing value” dominates the
total; if we have no “not applicable” and no “missing” values, then we
give a warning about approximate values. The general form of the queries
will be:
SELECT SUM (valcol),
       (CASE WHEN NOT EXISTS (SELECT multi_indicator
                                FROM Bob
                               WHERE multi_indicator > 0)
             THEN 0
             WHEN EXISTS (SELECT *
                            FROM Bob
                           WHERE multi_indicator = 1)
             THEN 1
             WHEN EXISTS (SELECT *
                            FROM Bob
                           WHERE multi_indicator = 2)
             THEN 2
             WHEN EXISTS (SELECT *
                            FROM Bob
                           WHERE multi_indicator = 3)
             THEN 3
             ELSE NULL END) AS totals_multi_indicator
  FROM Bob;
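The dominance rules can be exercised with a few small data sets. This is a sketch only: it runs the query through Python's sqlite3 module, and the helper totals() and the sample rows are hypothetical.

```python
import sqlite3

TOTALS_QUERY = """
SELECT SUM(valcol),
       (CASE WHEN NOT EXISTS (SELECT * FROM Bob WHERE multi_indicator > 0)
             THEN 0
             WHEN EXISTS (SELECT * FROM Bob WHERE multi_indicator = 1)
             THEN 1
             WHEN EXISTS (SELECT * FROM Bob WHERE multi_indicator = 2)
             THEN 2
             WHEN EXISTS (SELECT * FROM Bob WHERE multi_indicator = 3)
             THEN 3
             ELSE NULL END) AS totals_multi_indicator
  FROM Bob
"""

def totals(rows):
    """Load (keycol, valcol, multi_indicator) rows and return the total row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE Bob
                    (keycol INTEGER NOT NULL PRIMARY KEY,
                     valcol INTEGER NOT NULL,
                     multi_indicator INTEGER NOT NULL
                       CHECK (multi_indicator IN (0, 1, 2, 3)))""")
    conn.executemany("INSERT INTO Bob VALUES (?, ?, ?)", rows)
    return conn.execute(TOTALS_QUERY).fetchone()

all_known = totals([(1, 10, 0), (2, 20, 0)])
approximate = totals([(1, 10, 0), (2, 20, 3)])
missing = totals([(1, 10, 0), (2, 0, 2), (3, 20, 3)])
not_applicable = totals([(1, 10, 1), (2, 0, 2), (3, 20, 3)])

print(all_known, approximate, missing, not_applicable)
```

In every case the sum is computed the same way; only the status code in the second column changes.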
Why would I muck with the valcol total at all? The status is over in
the multi_indicator column, just like it was in the original table. Here is
an exercise for the reader:
1. Make up a set of rules for multiple missing values and write a
query for the SUM(), AVG(), MAX(), MIN(), and COUNT()
functions.
2. Set degrees of approximation (plus or minus five, plus or
minus ten, etc.) in the multi_indicator. Assume the valcol is
always in the middle. Make the multi_indicator handle the
fuzziness of the situation.
CREATE TABLE MultiNull
(groupcol INTEGER NOT NULL,
 keycol INTEGER NOT NULL,
 valcol INTEGER NOT NULL CHECK (valcol >= 0),
 valcol_null INTEGER NOT NULL DEFAULT 0,
 CHECK (valcol_null IN
        (0, -- Known value
         1, -- Not applicable
         2, -- Missing but applicable
         3, -- Approximate within 1%
         4, -- Approximate within 5%
         5, -- Approximate within 25%
         6  -- Approximate over 25% range
        )),
 PRIMARY KEY (groupcol, keycol),
 -- not applicable and missing rows must carry a zero valcol
 CHECK (valcol = 0 OR valcol_null NOT IN (1, 2)));
CREATE VIEW Group_MultiNull
(groupcol, valcol_sum, valcol_avg, valcol_max, valcol_min,
row_cnt, notnull_cnt, na_cnt, missing_cnt, approximate_cnt,
appr_1_cnt, approx_5_cnt, approx_25_cnt, approx_big_cnt)
AS
SELECT groupcol, SUM(valcol), AVG(valcol), MAX(valcol),
MIN(valcol), COUNT(*),
SUM (CASE WHEN valcol_null = 0 THEN 1 ELSE 0 END)
AS notnull_cnt,
SUM (CASE WHEN valcol_null = 1 THEN 1 ELSE 0 END)
AS na_cnt,
SUM (CASE WHEN valcol_null = 2 THEN 1 ELSE 0 END)
AS missing_cnt,
SUM (CASE WHEN valcol_null IN (3,4,5,6) THEN 1 ELSE 0 END)
AS approximate_cnt,
SUM (CASE WHEN valcol_null = 3 THEN 1 ELSE 0 END)
AS appr_1_cnt,
SUM (CASE WHEN valcol_null = 4 THEN 1 ELSE 0 END)
AS approx_5_cnt,
SUM (CASE WHEN valcol_null = 5 THEN 1 ELSE 0 END)
AS approx_25_cnt,

SUM (CASE WHEN valcol_null = 6 THEN 1 ELSE 0 END)
AS approx_big_cnt
FROM MultiNull
GROUP BY groupcol;
SELECT groupcol, valcol_sum, valcol_avg, valcol_max, valcol_min,
(CASE WHEN row_cnt = notnull_cnt
THEN 'All are known'
ELSE 'Not all are known' END) AS warning_message,
row_cnt, notnull_cnt, na_cnt, missing_cnt,
approximate_cnt,
appr_1_cnt, approx_5_cnt, approx_25_cnt, approx_big_cnt
FROM Group_MultiNull;
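A cut-down version of this view can be tried out as follows. The sketch uses Python's sqlite3 module, keeps only a few of the counting columns, and leaves out the cross-column CHECK constraint; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MultiNull
(groupcol INTEGER NOT NULL,
 keycol INTEGER NOT NULL,
 valcol INTEGER NOT NULL CHECK (valcol >= 0),
 valcol_null INTEGER NOT NULL DEFAULT 0
   CHECK (valcol_null IN (0, 1, 2, 3, 4, 5, 6)),
 PRIMARY KEY (groupcol, keycol));

-- Trimmed view: one total plus three of the status counts.
CREATE VIEW Group_MultiNull
AS
SELECT groupcol,
       SUM(valcol) AS valcol_sum,
       COUNT(*) AS row_cnt,
       SUM(CASE WHEN valcol_null = 0 THEN 1 ELSE 0 END) AS notnull_cnt,
       SUM(CASE WHEN valcol_null IN (3, 4, 5, 6) THEN 1 ELSE 0 END)
         AS approximate_cnt
  FROM MultiNull
 GROUP BY groupcol;
""")

conn.executemany("INSERT INTO MultiNull VALUES (?, ?, ?, ?)",
                 [(1, 1, 10, 0), (1, 2, 20, 0),                # all known
                  (2, 1, 5, 0), (2, 2, 7, 4), (2, 3, 0, 2)])   # mixed statuses

report = conn.execute(
    """SELECT groupcol, valcol_sum,
              (CASE WHEN row_cnt = notnull_cnt
                    THEN 'All are known'
                    ELSE 'Not all are known' END) AS warning_message,
              approximate_cnt
         FROM Group_MultiNull
        ORDER BY groupcol""").fetchall()

for line in report:
    print(line)
```

Each group gets its own warning message, so a clean group is not penalized for questionable values elsewhere in the table.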
While this is a bit complex for the typical application, it is not a bad
idea for a “staging area” database that attempts to scrub the data before it
goes to a data warehouse.

CHAPTER 7: Multiple Column Data Elements

The concept of a data element being atomic or scalar is usually taken to
mean that it is represented with a single column in a table. This is not
always true. A data element is atomic when it cannot be decomposed
into independent, meaningful parts. Doing so would result in attribute
splitting, a design flaw that we discussed in Section 1.1.11.
Consider an (x, y) coordinate system. A single x or y value identifies
a line of points, while the pair has to be taken together to give you a
location on the plane. It would be inconvenient to put both
coordinates into one column, so we model them in two columns.

7.1 Distance Functions

Since geographical data is important, you might find it handy to locate
places by their longitude and latitude, then calculate the distances
between two points on the globe. This is not a standard function in
any SQL, but it is handy to know.
Assume that we have values (Latitude1, Longitude1, Latitude2,
Longitude2) that locate the two points, and that they are in radians,
and we have trigonometry functions.
To convert decimal degrees to radians, multiply the number of
degrees by pi/180 = 0.017453293 radians/degree, where pi is
approximately 3.14159265358979:
