292 CHAPTER 14: THE [NOT] IN() PREDICATE
SELECT *
FROM JohnsBook AS J1
WHERE NOT EXISTS
(SELECT *
FROM QualityGuide AS Q1
WHERE Q1.restaurant_name = J1.restaurant_name);
The reason the second version will probably run faster is that it can
test for existence using the indexes on both tables. The NOT IN()
version has to test all the values in the subquery table for inequality.
Many SQL implementations will construct a temporary table from the IN()
predicate subquery, if it has a WHERE clause, but the temporary table
will not have any indexes. The temporary table can also have duplicates
and a random ordering of its rows, so that the SQL engine has to do a
full-table scan.
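As a minimal sketch, the two formulations can be run side by side. It assumes SQLite driven from Python's sqlite3 module, a NOT IN() version written in the obvious way, and sample rows invented for the illustration. It shows only that the two queries agree when no NULLs are involved; which one runs faster depends on your product and its optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE JohnsBook (restaurant_name TEXT NOT NULL PRIMARY KEY);
CREATE TABLE QualityGuide (restaurant_name TEXT NOT NULL PRIMARY KEY);
-- Hypothetical sample rows, invented for this sketch.
INSERT INTO JohnsBook VALUES ('Alpha Grill'), ('Beta Bistro'), ('Gamma Diner');
INSERT INTO QualityGuide VALUES ('Beta Bistro'), ('Delta Cafe');
""")

# Assumed NOT IN() formulation of the query.
not_in_rows = conn.execute("""
SELECT J1.restaurant_name
  FROM JohnsBook AS J1
 WHERE J1.restaurant_name NOT IN
       (SELECT Q1.restaurant_name FROM QualityGuide AS Q1)
 ORDER BY J1.restaurant_name
""").fetchall()

# NOT EXISTS formulation from the text.
not_exists_rows = conn.execute("""
SELECT J1.restaurant_name
  FROM JohnsBook AS J1
 WHERE NOT EXISTS
       (SELECT *
          FROM QualityGuide AS Q1
         WHERE Q1.restaurant_name = J1.restaurant_name)
 ORDER BY J1.restaurant_name
""").fetchall()

print(not_in_rows)
print(not_exists_rows)
```

Both columns are declared NOT NULL here on purpose; once NULLs are allowed in the subquery column, the two forms stop being interchangeable.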
14.2 Replacing ORs with the IN() Predicate
A simple trick that beginning SQL programmers often miss is that an IN()
predicate can often replace a set of ORed predicates. For example:
SELECT *
FROM QualityControlReport
WHERE test_1 = 'passed'
OR test_2 = 'passed'
OR test_3 = 'passed'
OR test_4 = 'passed';
can be rewritten as:
SELECT *
FROM QualityControlReport
WHERE 'passed' IN (test_1, test_2, test_3, test_4);
The reason this is difficult to see is that programmers get used to
thinking of either a subquery or a simple list of constants. They miss the
fact that the IN() predicate list can be a list of expressions. The
optimizer would have handled each of the original predicates separately
in the WHERE clause, but it has to handle the IN() predicate as a single
item, which can change the order of evaluation. This might or might not
be faster than the list of ORed predicates for a particular query. This
formulation might cause the predicate to become nonindexable; you
should check the indexability rules of your particular DBMS.
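The equivalence of the two forms can be checked with a small sketch. It assumes SQLite through Python's sqlite3 module; the batch_nbr key column and the sample rows are invented for the illustration and are not part of the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE QualityControlReport
(batch_nbr INTEGER NOT NULL PRIMARY KEY,  -- hypothetical key column
 test_1 TEXT NOT NULL,
 test_2 TEXT NOT NULL,
 test_3 TEXT NOT NULL,
 test_4 TEXT NOT NULL);
INSERT INTO QualityControlReport VALUES
 (1, 'failed', 'failed', 'failed', 'failed'),
 (2, 'failed', 'passed', 'failed', 'failed'),
 (3, 'passed', 'passed', 'passed', 'passed');
""")

# The original chain of ORed predicates.
ored_rows = conn.execute("""
SELECT batch_nbr
  FROM QualityControlReport
 WHERE test_1 = 'passed'
    OR test_2 = 'passed'
    OR test_3 = 'passed'
    OR test_4 = 'passed'
 ORDER BY batch_nbr
""").fetchall()

# The same test with a constant on the left and columns in the list.
in_rows = conn.execute("""
SELECT batch_nbr
  FROM QualityControlReport
 WHERE 'passed' IN (test_1, test_2, test_3, test_4)
 ORDER BY batch_nbr
""").fetchall()

print(ored_rows)
print(in_rows)
```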
14.3 NULLs and the IN() Predicate
NULLs make some special problems in a NOT IN() predicate with a
subquery. Consider these two tables:
CREATE TABLE Table1 (x INTEGER);
INSERT INTO Table1 VALUES (1), (2), (3), (4);
CREATE TABLE Table2 (x INTEGER);
INSERT INTO Table2 VALUES (1), (NULL), (2);
Now execute the query:
SELECT *
FROM Table1
WHERE x NOT IN (SELECT x FROM Table2);
Let’s work it out step by painful step:
1. Do the subquery:
SELECT *
FROM Table1
WHERE x NOT IN (1, NULL, 2);
2. Convert the NOT IN() to its definition:
SELECT *
FROM Table1
WHERE NOT (x IN (1, NULL, 2));
3. Expand IN() predicate:
SELECT *
FROM Table1
WHERE NOT ((x = 1) OR (x = NULL) OR (x = 2));
4. Apply DeMorgan’s law:
SELECT *
FROM Table1
WHERE ((x <> 1) AND (x <> NULL) AND (x <> 2));
5. Perform the constant logical expression:
SELECT *
FROM Table1
WHERE ((x <> 1) AND UNKNOWN AND (x <> 2));
6. Reduce the AND to a constant; with an UNKNOWN operand it can never be TRUE:
SELECT *
FROM Table1
WHERE UNKNOWN;
7. The results are always empty.
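The derivation can be confirmed by running the query. This is a minimal sketch that assumes SQLite through Python's sqlite3 module; the NOT EXISTS query is added here only for contrast and is not part of the steps above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (x INTEGER);
INSERT INTO Table1 VALUES (1), (2), (3), (4);
CREATE TABLE Table2 (x INTEGER);
INSERT INTO Table2 VALUES (1), (NULL), (2);
""")

# The NULL in Table2 keeps the NOT IN() predicate from ever being TRUE.
not_in_rows = conn.execute("""
SELECT x
  FROM Table1
 WHERE x NOT IN (SELECT x FROM Table2)
 ORDER BY x
""").fetchall()

# For contrast: NOT EXISTS skips the NULL row, because the inner
# comparison against it is UNKNOWN and the row is not returned.
not_exists_rows = conn.execute("""
SELECT x
  FROM Table1
 WHERE NOT EXISTS
       (SELECT * FROM Table2 WHERE Table2.x = Table1.x)
 ORDER BY x
""").fetchall()

print(not_in_rows)      # -> []
print(not_exists_rows)  # -> [(3,), (4,)]
```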
Now try this with another set of tables:
CREATE TABLE Table3 (x INTEGER);
INSERT INTO Table3 VALUES (1), (2), (NULL), (4);
CREATE TABLE Table4 (x INTEGER);
INSERT INTO Table4 VALUES (1), (3), (2);
Let’s work out the same query step by painful step again.
1. Do the subquery:
SELECT *
FROM Table3
WHERE x NOT IN (1, 3, 2);
2. Convert the NOT IN() to a Boolean expression:
SELECT *
FROM Table3
WHERE NOT (x IN (1, 3, 2));
3. Expand IN() predicate:
SELECT *
FROM Table3
WHERE NOT ((x = 1) OR (x = 3) OR (x = 2));
4. Apply DeMorgan’s law:
SELECT *
FROM Table3
WHERE ((x <> 1) AND (x <> 3) AND (x <> 2));
5. Compute the result set; I will show it as a UNION with
substitutions:
SELECT *
FROM Table3
WHERE ((1 <> 1) AND (1 <> 3) AND (1 <> 2)) -- FALSE
UNION ALL
SELECT *
FROM Table3
WHERE ((2 <> 1) AND (2 <> 3) AND (2 <> 2)) -- FALSE
UNION ALL
SELECT *
FROM Table3
WHERE ((CAST(NULL AS INTEGER) <> 1)
AND (CAST(NULL AS INTEGER) <> 3)
AND (CAST(NULL AS INTEGER) <> 2)) -- UNKNOWN
UNION ALL
SELECT *
FROM Table3
WHERE ((4 <> 1) AND (4 <> 3) AND (4 <> 2)); -- TRUE
6. The result is one row = (4).
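The same check can be run for the second pair of tables. This sketch again assumes SQLite through Python's sqlite3 module, which reports TRUE, FALSE, and UNKNOWN as 1, 0, and NULL (None in Python); the second query is added here only to show the truth value computed for each row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table3 (x INTEGER);
INSERT INTO Table3 VALUES (1), (2), (NULL), (4);
CREATE TABLE Table4 (x INTEGER);
INSERT INTO Table4 VALUES (1), (3), (2);
""")

# Only rows for which the predicate is TRUE are returned.
result_rows = conn.execute("""
SELECT x
  FROM Table3
 WHERE x NOT IN (SELECT x FROM Table4)
""").fetchall()

# Truth value of the predicate for each row, in insertion order.
truth_values = conn.execute("""
SELECT x, x NOT IN (SELECT x FROM Table4)
  FROM Table3
 ORDER BY rowid
""").fetchall()

print(result_rows)   # -> [(4,)]
print(truth_values)  # -> [(1, 0), (2, 0), (None, None), (4, 1)]
```

The row with the NULL is rejected because its predicate is UNKNOWN, not because it is FALSE.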
14.4 IN() Predicate and Referential Constraints
One of the most popular uses for the IN() predicate is in a CHECK()
clause on a table. The usual form is a list of values that are legal for a
column, such as:
CREATE TABLE Addresses
(addressee_name CHAR(25) NOT NULL PRIMARY KEY,
street_loc CHAR(25) NOT NULL,
city_name CHAR(20) NOT NULL,
state_code CHAR(2) NOT NULL
CONSTRAINT valid_state_code
CHECK (state_code IN ('AL', 'AK', ...)),
...);
This method works fine with a small list of values, but it has problems
with a longer list. It is very important to arrange the values in the order
that they are most likely to match to the two-letter state_code to speed
up the search.
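A short sketch shows the constraint at work. It assumes SQLite through Python's sqlite3 module; the list of state codes is cut down to two values for the illustration, and the sample addresses are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The IN() list is shortened to two codes for this sketch.
conn.execute("""
CREATE TABLE Addresses
(addressee_name CHAR(25) NOT NULL PRIMARY KEY,
 street_loc CHAR(25) NOT NULL,
 city_name CHAR(20) NOT NULL,
 state_code CHAR(2) NOT NULL
   CONSTRAINT valid_state_code
   CHECK (state_code IN ('AL', 'AK')))
""")

# A legal state code is accepted.
conn.execute(
    "INSERT INTO Addresses VALUES ('Pat Doe', '1 Main St', 'Juneau', 'AK')")

# A code outside the list violates the CHECK() constraint.
rejected = False
try:
    conn.execute(
        "INSERT INTO Addresses VALUES ('Lee Roe', '2 Elm St', 'Paris', 'ZZ')")
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Addresses").fetchone()[0]
print(rejected, row_count)
```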
In Standard SQL a constraint can reference other tables, so you could
write the same constraint as:
CREATE TABLE Addresses
(addressee_name CHAR(25) NOT NULL PRIMARY KEY,
street_loc CHAR(25) NOT NULL,
city_name CHAR(20) NOT NULL,
state_code CHAR(2) NOT NULL,
CONSTRAINT valid_state_code
CHECK (state_code
IN (SELECT state_code
FROM ZipCodes AS Z1
WHERE Z1.state_code = Addresses.state_code)),
...);
The advantage of this is that you can change the ZipCodes table and
thereby change the effect of the constraint on the Addresses table. This
is fine for adding more data in the outer reference (e.g., Quebec joins
the United States and gets the code ‘QB’), but it has a bad effect when
you try to delete data in the outer reference (e.g., California secedes
from the United States and every row with ‘CA’ for a state code is now
invalid).
As a rule of thumb, use the IN() predicate in a CHECK() constraint
when the list is short, static, and unique to one table. When the list is
short, static, but not unique to one table, then use a CREATE DOMAIN
statement, and put the IN() predicate in a CHECK() constraint on the
domain.
Use a REFERENCES clause to a lookup table when the list is long and
dynamic, or when several other schema objects (VIEWs, stored
procedures, etc.) reference the values. A separate table can have an
index, and that makes a big difference in searching and doing joins.
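The lookup-table approach can be sketched as follows, assuming SQLite through Python's sqlite3 module. StateCodes is a hypothetical lookup table invented for the illustration, as are the sample rows and the is_rejected() helper. SQLite enforces REFERENCES only after PRAGMA foreign_keys = ON, and it supports neither CREATE DOMAIN nor subqueries in CHECK(), so only the REFERENCES rule is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
-- StateCodes is a hypothetical lookup table invented for this sketch.
CREATE TABLE StateCodes
(state_code CHAR(2) NOT NULL PRIMARY KEY);
INSERT INTO StateCodes VALUES ('AL'), ('AK'), ('CA');

CREATE TABLE Addresses
(addressee_name CHAR(25) NOT NULL PRIMARY KEY,
 street_loc CHAR(25) NOT NULL,
 city_name CHAR(20) NOT NULL,
 state_code CHAR(2) NOT NULL
   REFERENCES StateCodes (state_code));
INSERT INTO Addresses VALUES ('Pat Doe', '1 Main St', 'Fresno', 'CA');
""")

def is_rejected(sql):
    """Return True when SQLite refuses the statement with a constraint error."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

new_address = "INSERT INTO Addresses VALUES ('Lee Roe', '2 Elm St', 'Quebec', 'QB')"

# 'QB' is not in the lookup table yet, so the row is refused.
insert_rejected = is_rejected(new_address)

# A referenced code cannot be deleted while addresses still use it.
delete_rejected = is_rejected("DELETE FROM StateCodes WHERE state_code = 'CA'")

# Adding the code to the lookup table changes what Addresses accepts.
conn.execute("INSERT INTO StateCodes VALUES ('QB')")
retry_rejected = is_rejected(new_address)

print(insert_rejected, delete_rejected, retry_rejected)
```

With the default referential action, the DELETE is refused outright, so no row in Addresses is left holding an invalid code.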
14.5 IN() Predicate and Scalar Queries
As mentioned before, the list of an IN() predicate can be any scalar
expression. This includes scalar subqueries, but most people do not
seem to know that this is possible. For example, given tables that model
warehouses, trucking centers, and so forth, we can find if we have a
product, identified by its UPC code, somewhere in the enterprise.
SELECT P.upc
FROM Picklist AS P
WHERE P.upc
IN ((SELECT upc FROM Warehouse AS W WHERE W.upc = P.upc),
(SELECT upc FROM TruckCenter AS T WHERE T.upc = P.upc),
(SELECT upc FROM Garbage AS G WHERE G.upc = P.upc));
The empty result sets will become NULLs in the list. The alternative to
this is usually a chain of OUTER JOINs or an ORed list of EXISTS()
predicates.
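A minimal sketch of the scalar subquery list follows, assuming SQLite through Python's sqlite3 module. The single-column tables and the UPC values are invented for the illustration, and upc is assumed to be unique within each table so that every subquery returns at most one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Single-column tables and sample UPC values are invented for this sketch.
CREATE TABLE Picklist (upc TEXT NOT NULL PRIMARY KEY);
CREATE TABLE Warehouse (upc TEXT NOT NULL PRIMARY KEY);
CREATE TABLE TruckCenter (upc TEXT NOT NULL PRIMARY KEY);
CREATE TABLE Garbage (upc TEXT NOT NULL PRIMARY KEY);
INSERT INTO Picklist VALUES ('0001'), ('0002'), ('0003');
INSERT INTO Warehouse VALUES ('0001');
INSERT INTO TruckCenter VALUES ('0002');
""")

# Each list element is a correlated scalar subquery; an empty result
# becomes a NULL in the list and can never produce a match.
found = conn.execute("""
SELECT P.upc
  FROM Picklist AS P
 WHERE P.upc
       IN ((SELECT upc FROM Warehouse AS W WHERE W.upc = P.upc),
           (SELECT upc FROM TruckCenter AS T WHERE T.upc = P.upc),
           (SELECT upc FROM Garbage AS G WHERE G.upc = P.upc))
 ORDER BY P.upc
""").fetchall()

print(found)
```

Product '0003' appears in none of the three tables, so its list is all NULLs, the predicate is UNKNOWN, and the row is not returned.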
CHAPTER 15
EXISTS() Predicate
The EXISTS predicate is very natural. It is a test for a nonempty set. If
there are any rows in its subquery, it is TRUE; otherwise, it is FALSE.
This predicate does not give an UNKNOWN result. The syntax is:
<exists predicate> ::= EXISTS <table subquery>
It is worth mentioning that a <table subquery> is always inside
parentheses to avoid problems in the grammar during parsing.
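The two-valued behavior can be seen in a small sketch, assuming SQLite through Python's sqlite3 module. SQLite reports TRUE and FALSE as 1 and 0 and UNKNOWN as NULL (None in Python), and it allows a SELECT with a WHERE clause but no FROM clause, which keeps the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

truth_values = conn.execute("""
SELECT EXISTS (SELECT 1 WHERE 1 = 1),        -- nonempty subquery
       EXISTS (SELECT 1 WHERE 1 = 2),        -- empty subquery
       EXISTS (SELECT 1 WHERE NULL = NULL),  -- empty: the test is UNKNOWN
       NULL = NULL                           -- the bare comparison
""").fetchone()

print(truth_values)  # -> (1, 0, 0, None)
```

The comparison by itself is UNKNOWN, but the EXISTS wrapped around it still comes out FALSE.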
In SQL-89, the rules stated that the subquery had to have a SELECT
clause with one column or a *. If the SELECT * option was
used, the database engine would (in theory) pick one column and use
it. This fiction was needed because SQL-89 defined subqueries as
having only one column.
Some early SQL implementations would work better with
EXISTS(SELECT <column> ...), EXISTS(SELECT <constant> ...), or
EXISTS(SELECT * ...) versions of the predicate. Today,
there is no difference in the three forms in the major products, so the
EXISTS(SELECT * ...) is the preferred form.
Indexes are very useful for EXISTS() predicates because they can
be searched while the base table is left alone completely. For example,
we want to find all employees who were born on the same day as any
famous person. The query could be:
SELECT P1.emp_name, ' has the same birthday as a famous person!'
FROM Personnel AS P1
WHERE EXISTS
(SELECT *
FROM Celebrities AS C1
WHERE P1.birthday = C1.birthday);
If the table Celebrities has an index on its birthday column, the
optimizer will get the current employee’s birthday P1.birthday and
look up that value in the index. If the value is in the index, the predicate
is TRUE and we do not need to look at the Celebrities table at all.
If it is not in the index, the predicate is FALSE and there is still no
need to look at the Celebrities table. This should be fast, since
indexes are smaller than their tables and are structured for very fast
searching.
However, if Celebrities has no index on its birthday column, the
query may have to look at every row to see if there is a birthday that
matches the current employee’s birthday. There are some tricks that a
good optimizer can use to speed things up in this situation.
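The indexed case can be tried out with a minimal sketch, assuming SQLite through Python's sqlite3 module. The celeb_name column, the index name, the sample rows, and the storage of birthdays as ISO date text are all assumptions made for the illustration. The query plan is printed but not interpreted, because its wording differs from one SQLite release to the next.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names other than emp_name and birthday are hypothetical.
CREATE TABLE Personnel
(emp_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT NOT NULL);
CREATE TABLE Celebrities
(celeb_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT);
CREATE INDEX celebrity_birthdays ON Celebrities (birthday);
INSERT INTO Personnel VALUES ('Alice', '1970-01-15'), ('Bob', '1982-07-04');
INSERT INTO Celebrities VALUES
 ('Famous Fred', '1970-01-15'), ('Starry Sue', '1965-03-09');
""")

query = """
SELECT P1.emp_name
  FROM Personnel AS P1
 WHERE EXISTS
       (SELECT *
          FROM Celebrities AS C1
         WHERE P1.birthday = C1.birthday)
 ORDER BY P1.emp_name
"""

matches = conn.execute(query).fetchall()
print(matches)

# Ask SQLite how it intends to run the query; look for the index name.
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)
```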
15.1 EXISTS and NULLs
A NULL might not be a value, but it does exist in SQL. This is often a
problem for a new SQL programmer who is having trouble with NULLs
and how they behave.
Think of them as being like a brown paper bag—you know that
something is inside because you lifted it, but you do not know exactly
what that something is. For example, we want to find all the employees
who were not born on the same day as a famous person. This can be
answered with the negation of the original query, like this:
SELECT P1.emp_name, ' was born on a day without a famous person!'
FROM Personnel AS P1
WHERE NOT EXISTS
(SELECT *
FROM Celebrities AS C1
WHERE P1.birthday = C1.birthday);
But assume that among the celebrities we have a movie star who will
not admit her age, shown in the row ('Gloria Glamour', NULL). A new SQL
programmer might expect that Ms. Glamour would match to everyone, since
there is a chance that they may match when some tabloid newspaper
finally gets a copy of her birth certificate. Actually, she will not
match to anyone, since a comparison to her NULL birthday can never test
TRUE. Work out the subquery for her row in the usual way to convince
yourself:
WHERE NOT EXISTS
(SELECT *
FROM Celebrities
WHERE P1.birthday = NULL);
becomes:
WHERE NOT EXISTS
(SELECT *
FROM Celebrities
WHERE UNKNOWN);
becomes:
WHERE TRUE;
And you see that the search condition inside the subquery tests to
UNKNOWN because of the NULL comparison, and therefore fails whenever we
look at Ms. Glamour. Her row is never returned by the subquery, so she
cannot make the NOT EXISTS predicate FALSE for any employee.
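What Ms. Glamour's row actually does can be checked with a small sketch, assuming SQLite through Python's sqlite3 module. The celeb_name column, the other sample rows, and the ISO text dates are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names other than emp_name and birthday are hypothetical.
CREATE TABLE Personnel
(emp_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT NOT NULL);
CREATE TABLE Celebrities
(celeb_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT);
INSERT INTO Personnel VALUES ('Alice', '1970-01-15'), ('Bob', '1982-07-04');
INSERT INTO Celebrities VALUES
 ('Famous Fred', '1970-01-15'), ('Gloria Glamour', NULL);
""")

# Employees with no celebrity birthday match.
unmatched = conn.execute("""
SELECT P1.emp_name
  FROM Personnel AS P1
 WHERE NOT EXISTS
       (SELECT *
          FROM Celebrities AS C1
         WHERE P1.birthday = C1.birthday)
 ORDER BY P1.emp_name
""").fetchall()

# The subquery for Bob, restricted to Ms. Glamour's row alone:
# the comparison is UNKNOWN, so no row comes back.
glamour_rows = conn.execute("""
SELECT *
  FROM Celebrities AS C1
 WHERE '1982-07-04' = C1.birthday
   AND C1.celeb_name = 'Gloria Glamour'
""").fetchall()

print(unmatched)     # -> [('Bob',)]
print(glamour_rows)  # -> []
```

Bob is still reported, which shows that the NULL birthday neither matches him nor removes him from the answer.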
Another problem with NULLs is found when you attempt to convert IN
predicates to EXISTS predicates. Using our example of matching our
employees to famous people, the query can be rewritten as:
SELECT P1.emp_name, ' was born on a day without a famous person!'
FROM Personnel AS P1
WHERE P1.birthday NOT IN
(SELECT C1.birthday
FROM Celebrities AS C1);
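The NOT IN version does not behave like the NOT EXISTS version once Ms. Glamour is in the table. This is a minimal sketch, assuming SQLite through Python's sqlite3 module; the celeb_name column and the sample rows are invented for the illustration, and the IS NOT NULL filter in the second query is one common workaround shown as an assumption, not as the only repair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Column names other than emp_name and birthday are hypothetical.
CREATE TABLE Personnel
(emp_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT NOT NULL);
CREATE TABLE Celebrities
(celeb_name TEXT NOT NULL PRIMARY KEY,
 birthday TEXT);
INSERT INTO Personnel VALUES ('Alice', '1970-01-15'), ('Bob', '1982-07-04');
INSERT INTO Celebrities VALUES
 ('Famous Fred', '1970-01-15'), ('Gloria Glamour', NULL);
""")

# The NULL birthday makes NOT IN UNKNOWN for every unmatched employee.
not_in_rows = conn.execute("""
SELECT P1.emp_name
  FROM Personnel AS P1
 WHERE P1.birthday NOT IN
       (SELECT C1.birthday
          FROM Celebrities AS C1)
 ORDER BY P1.emp_name
""").fetchall()

# Removing the NULLs from the subquery restores the NOT EXISTS answer.
filtered_rows = conn.execute("""
SELECT P1.emp_name
  FROM Personnel AS P1
 WHERE P1.birthday NOT IN
       (SELECT C1.birthday
          FROM Celebrities AS C1
         WHERE C1.birthday IS NOT NULL)
 ORDER BY P1.emp_name
""").fetchall()

print(not_in_rows)    # -> []
print(filtered_rows)  # -> [('Bob',)]
```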
However, consider a more complex version of the same query, where
the celebrity has to have been born in New York City. The IN predicate
would be: