Joe Celko's SQL for Smarties - Advanced SQL Programming

172 CHAPTER 5: CHARACTER DATA TYPES IN SQL

5.1.3 Problems of String Grouping

Because the equality test has to pad out the shorter of the two strings,
you will often find doing a GROUP BY on a VARCHAR(n) has
unpredictable results:

CREATE TABLE Foobar (x VARCHAR(5) NOT NULL);
INSERT INTO Foobar VALUES ('a'), ('a '), ('a  '), ('a   ');

Now, execute the query:

SELECT x, CHAR_LENGTH(x)
FROM Foobar
GROUP BY x;

The value for CHAR_LENGTH(x) will vary from product to product. The
most common answers are 1, 4, or 5 in this example. A length of 1 is
returned because it is the length of the shortest string or because it is the
length of the first string physically in the table. A length of 4 is returned
because it is the length of the longest string in the table, and a length of 5
is returned because it is the greatest possible length of a string in the
table.

You might want to add a constraint that makes sure to trim the
trailing blanks to avoid problems.
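One way to follow that advice is a CHECK() constraint on the column. The following is a minimal sketch in Python using the standard sqlite3 module, not a portable recipe: SQLite spells the functions LENGTH() and RTRIM(), and the helper name try_insert is made up for this illustration. Comparing lengths rather than the strings themselves avoids depending on how a product pads strings in an equality test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Foobar
    (x VARCHAR(5) NOT NULL
       CHECK (LENGTH(x) = LENGTH(RTRIM(x))))
    """
)


def try_insert(value):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        conn.execute("INSERT INTO Foobar VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("a"))    # True: no trailing blanks
print(try_insert("a  "))  # False: trailing blanks violate the constraint
```

In a product that follows the Standard syntax, the same idea would presumably be written with CHAR_LENGTH() and TRIM(TRAILING ' ' FROM x).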

5.2 Standard String Functions

SQL-92 defines a set of string functions that appear in most products,
but with vendor-specific syntax. You will probably find that products will
continue to support their own syntax, but will also add the Standard
SQL syntax in new releases. String concatenation is shown with the ||
operator, taken from PL/I.
The SUBSTRING(<string> FROM <start> FOR <length>)
function uses three arguments: the source string, the starting position of
the substring, and the length of the substring to be extracted. Truncation
occurs when the implied starting and ending positions are not both
within the given string.
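The truncation rule is easier to see in code. The sketch below emulates the behavior in Python under the usual reading of the Standard (1-based positions, and a requested window that is clipped to the string). The function name sql_substring is hypothetical, and individual products may differ at the edges.

```python
def sql_substring(source: str, start: int, length: int) -> str:
    """Emulate SUBSTRING(source FROM start FOR length), 1-based."""
    if length < 0:
        raise ValueError("substring length must not be negative")
    end = start + length                 # one past the last requested position
    if start > len(source) or end < 1:
        return ""                        # the window misses the string entirely
    first = max(start, 1)                # clip the window to the string
    last = min(end, len(source) + 1)
    return source[first - 1:last - 1]


print(sql_substring("abcdef", 2, 3))   # bcd
print(sql_substring("abcdef", 5, 10))  # ef (truncated on the right)
print(sql_substring("abcdef", 0, 2))   # a  (truncated on the left)
```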
DB2 and other products have a LEFT and a RIGHT function. The LEFT
function returns a string consisting of the specified number of
leftmost characters of the string expression, and the RIGHT, well, that is
kind of obvious.

The fold functions are a pair of functions for converting all the
lowercase characters in a given string to uppercase, UPPER(<string>),
or all the uppercase ones to lowercase, LOWER(<string>).

TRIM([[<trim specification>] [<trim character>] FROM] <trim source>)
produces a result string that is the source string with an unwanted
character removed. The <trim source> is the original character value
expression. The <trim specification> is either LEADING or TRAILING or
BOTH, and the <trim character> is the single character that is to be
removed.

The TRIM() function removes the leading and/or trailing
occurrences of a character from a string. The default character, if one is
not given, is a space. The SQL-92 version is a very general function, but
you will find that most SQL implementations have a version that works
only with spaces. DB2 instead has two functions: LTRIM for leftmost
(leading) blanks and RTRIM for rightmost (trailing) blanks.
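As a rough illustration of the general form, here is a Python sketch that takes the trim specification and the trim character as arguments. The name sql_trim and the use of BOTH as the default specification are assumptions for this example.

```python
def sql_trim(source: str, spec: str = "BOTH", char: str = " ") -> str:
    """Emulate TRIM([spec] [char] FROM source) for a single trim character."""
    if len(char) != 1:
        raise ValueError("trim character must be a single character")
    spec = spec.upper()
    if spec == "LEADING":
        return source.lstrip(char)
    if spec == "TRAILING":
        return source.rstrip(char)
    if spec == "BOTH":
        return source.strip(char)
    raise ValueError("trim specification must be LEADING, TRAILING, or BOTH")


print(repr(sql_trim("  ab  ")))                 # 'ab'
print(repr(sql_trim("xxabxx", "LEADING", "x")))  # 'abxx'
```

With the default space character, the LEADING and TRAILING cases behave like the LTRIM and RTRIM functions mentioned above.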
A character translation is a function for changing each character of a
given string according to some many-to-one or one-to-one mapping
between two not necessarily distinct character sets. The syntax
TRANSLATE(<string expression> USING <translation>)
assumes that a special schema object, called a translation, has already
been created to hold the rules for doing all of this.
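Standard SQL keeps the mapping in a schema object. The sketch below stands in for that object with a Python translation table, just to show what a many-to-one mapping looks like. The table name TO_PLAIN, its contents, and the function name translate_using are invented for the illustration.

```python
# A many-to-one mapping: several accented letters fold to one plain letter.
TO_PLAIN = str.maketrans({
    "é": "e", "è": "e", "ê": "e",
    "ü": "u", "û": "u",
    "ñ": "n",
})


def translate_using(source: str, translation: dict) -> str:
    """Apply a character-by-character translation table to the string."""
    return source.translate(translation)


print(translate_using("Müller", TO_PLAIN))  # Muller
print(translate_using("élève", TO_PLAIN))   # eleve
```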

CHAR_LENGTH(<string>), also written CHARACTER_LENGTH(<string>),
determines the length of a given character string, as an
integer, in characters. In most current products, this function is usually
expressed as LENGTH(), and the next two functions do not exist at all;
those products assume that the database will only hold ASCII or EBCDIC
characters.

BIT_LENGTH(<string>) determines the length of a given
character string, as an integer, in bits.

OCTET_LENGTH(<string>) determines the length of a given
character string, as an integer, in octets. Octets are units of eight bits that
are used by the one and two (Unicode) octet character sets. This
function is the same as TRUNCATE (BIT_LENGTH (<string>)/8).
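The three lengths only differ once a character can occupy more than one octet. The Python sketch below makes that visible. The function names mirror the SQL ones, but the encoding parameter is an assumption of this example, since in SQL the character set belongs to the column or string and is not passed as an argument.

```python
def char_length(s: str) -> int:
    """Length in characters."""
    return len(s)


def octet_length(s: str, encoding: str = "utf-8") -> int:
    """Length in octets under the given (assumed) encoding."""
    return len(s.encode(encoding))


def bit_length(s: str, encoding: str = "utf-8") -> int:
    """Length in bits: eight per octet."""
    return 8 * octet_length(s, encoding)


word = "naïve"
print(char_length(word))                 # 5
print(octet_length(word))                # 6, the accented letter takes two octets
print(octet_length(word, "utf-16-le"))   # 10, two octets per character
print(bit_length(word) // 8 == octet_length(word))  # True
```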

The POSITION(<search string> IN <source string>) function
determines the first position, if any, at which the <search string>
occurs within the <source string>. If the <search string> is of length
zero, then it occurs at position 1 for any value of the <source string>.
If the <search string> does not occur in the <source string>, zero is
returned. You will also see LOCATE() in DB2 and CHARINDEX() in
SQL Server.
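The three cases (found, empty search string, not found) can be checked with a short Python sketch. The name sql_position is hypothetical; the only trick is converting Python's 0-based find(), which returns -1 on failure, to the 1-based convention with zero for "not found."

```python
def sql_position(search: str, source: str) -> int:
    """Emulate POSITION(search IN source): 1-based, 0 when absent."""
    return source.find(search) + 1


print(sql_position("lk", "Celko"))  # 3
print(sql_position("", "Celko"))    # 1, an empty search string always matches
print(sql_position("x", "Celko"))   # 0, no occurrence
```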

5.3 Common Vendor Extensions

The original SQL-89 standard did not define any functions for CHAR(n)
data types. Standard SQL added the basic functions that have been
common to implementations for years. However, there are other
common or useful functions, and it is worth knowing how to implement
them outside of SQL.

Many vendors also have functions that will format data for display by
converting the internal format to a text string. A vendor whose SQL is
tied to a 4GL is much more likely to have these extensions, simply
because the 4GL can use them. The most common one converts a date
and time to a national format.

These functions generally use either a COBOL-style PICTURE
parameter or a globally set default format. Some of this conversion work
is done with the CAST() function in Standard SQL, but since SQL does
not have any output statements, such things will be vendor extensions
for some time to come.

Vendor extensions are varied, but there are some that are worth
mentioning. The names will be different in different products, but the
functionality will be the same:

SPACE(n) produces a string of (n) spaces.

REPLICATE(<string expression>, n) produces a string of (n)
repetitions of the <string expression>. DB2 calls this one REPEAT(),
and you will see other local names for it.

REPLACE(<target string>, <old string>, <new string>) replaces the
occurrences of the <old string> with the <new string> in the
<target string>.

As an aside, here is a nice trick to reduce several contiguous spaces in
a string to a single space to format text. It assumes that the marker
characters '<' and '>' do not already appear in the text:

UPDATE Foobar
   SET sentence
       = REPLACE(
           REPLACE(
             REPLACE(sentence, SPACE(1), '<>'),
             '><', SPACE(0)),
           '<>', SPACE(1));
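To see why the three nested replacements work, it helps to trace them one at a time. The Python sketch below mirrors the SQL step for step; the function name collapse_spaces is made up, and the same assumption applies that the marker characters do not occur in the input.

```python
def collapse_spaces(text: str) -> str:
    """Reduce each run of spaces to one space using the marker trick."""
    marked = text.replace(" ", "<>")      # every space becomes the pair <>
    squeezed = marked.replace("><", "")   # <><><> shrinks to a single <>
    return squeezed.replace("<>", " ")    # the surviving pair becomes a space


print(collapse_spaces("a   b  c"))  # a b c
```

For the input 'a   b', the intermediate strings are 'a<><><>b' and then 'a<>b'.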

REVERSE(<string expression>) reverses the order of the characters in
a string to make it easier to search. This function is impossible to
write with the standard string operators, because it requires either
iteration or recursion.

FLIP(<string expression>, <pivot>) will locate the pivot character in
the string, then concatenate all the letters to the left of the pivot
onto the end of the string and finally erase the pivot character. This
is used to change the order of names from military format to civilian
format; for example, FLIP('Smith, John', ',') yields John Smith. This
function can be written with the standard string functions, however.

NUMTOWORDS(<numeric expression>) will write out the numeric value as
a set of English words to be used on checks or other documents that
require both numeric and text versions of the same value.
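As a rough sketch of how FLIP() might behave, here is a Python version. The name flip comes from the text, but the details are assumptions: the parts are trimmed and joined with a single space so that the 'Smith, John' example comes out as shown, and a string without the pivot is returned unchanged.

```python
def flip(text: str, pivot: str) -> str:
    """Move everything left of the first pivot to the end; drop the pivot."""
    left, found, right = text.partition(pivot)
    if not found:
        return text  # assumption: no pivot means nothing to flip
    parts = (right.strip(), left.strip())
    return " ".join(part for part in parts if part)


print(flip("Smith, John", ","))  # John Smith
```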

5.3.1 Phonetic Matching

People’s names are a problem for designers of databases. Names are
variable-length, can have strange spellings, and are not unique.
American names have a diversity of ethnic origins, which give us names
pronounced the same way but spelled differently, and vice versa.

Aside from this diversity of names, errors in reading or hearing a
name lead to mutations. Anyone who gets junk mail is aware of this. In
addition to mail addressed to “Celko,” I get mail addressed to “Selco,”
“Selko,” and “Celco,” which are phonetic errors. I also get some letters
with typing errors, such as “Cellro,” “Chelco,” and “Chelko” in my mail
stack. Such errors result in the mailing of multiple copies of the same
item to the same address. To solve this problem, we need phonetic
algorithms that can find similar-sounding names.

Soundex Functions

The Soundex family of algorithms is named after the original algorithm.
A Soundex algorithm takes a person’s name as input and produces a
character string that identifies a set of names that are (roughly)
phonetically alike.

SQL products often have a Soundex algorithm in their library
functions. It is also possible to compute a Soundex in SQL, using string
functions and the CASE expression in Standard SQL. Names that sound
alike do not always have the same Soundex code. For example, “Lee”
and “Leigh” are pronounced alike, but have different Soundex codes
because the silent ‘g’ in “Leigh” is given a code.

Names that sound alike but start with different first letters will
always have different Soundex codes; “Carr” and “Karr,” for example,
get separate codes.

Finally, Soundex is based on English pronunciation, so European and
Asian names may not encode correctly. French surnames like “Beaux”
(with a silent ‘x’) and “Beau” (without it) will result in two different
Soundex codes.

Sometimes names that don’t sound alike have the same Soundex
code. The relatively common names “Powers,” “Pierce,” “Price,” “Perez,”
and “Park” all have the same Soundex code. Yet “Power,” a common way
to spell Powers 100 years ago, has a different Soundex code.

The Original Soundex

Margaret O’Dell and Robert C. Russell patented the original Soundex
algorithm in 1918. The method is based on the phonetic classification of
sounds by how they are made. In case you wanted to know, the six
groups are bilabial, labiodental, dental, alveolar, velar, and glottal.

The algorithm is fairly straightforward to code and requires no
backtracking or multiple passes over the input word. This should not be
too surprising, since it was in use before computers and had to be done
by hand by clerks. Here is the algorithm:

1. Capitalize all letters in the word. Pad the word with rightmost
   blanks as needed during each procedure step.

2. Retain the first letter of the word.

3. Drop all occurrences of the following letters after the first
   position: A, E, H, I, O, U, W, Y.

4. Change letters from the following sets into the corresponding
   digits given:

   1 = B, F, P, V
   2 = C, G, J, K, Q, S, X, Z
   3 = D, T
   4 = L
   5 = M, N
   6 = R

5. Retain only one occurrence of consecutive duplicate digits
   from the string that resulted after step 4.0.

6. Pad the string that resulted from step 5.0 with trailing zeros
   and return only the first four positions, which will be of the
   form <uppercase letter> <digit> <digit> <digit>.

An alternative version of the algorithm, due to Russell, changes the
letters in step 3.0 to 9s and retains them. Then step 5.0 is replaced by
two steps: 5.1, which removes redundant duplicates as before, followed
by 5.2, which removes all 9s and closes up the spaces. This allows pairs
of duplicate digits to appear in the result string. This version has more
granularity and will work better for a larger sample of names.
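The six steps, together with the alternative version that uses 9s as placeholders, can be sketched as follows. This is an illustration in Python rather than SQL, and it follows the steps exactly as listed above, so the first letter is kept as a letter and only the remaining letters are coded. The function name soundex, the keep_separators flag, and the decision to ignore anything that is not a letter from A to Z are assumptions made for the example.

```python
DIGIT_FOR = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}
DROPPED = set("AEHIOUWY")


def soundex(name: str, keep_separators: bool = False) -> str:
    """Code a name by the listed steps; the flag selects the 9s variant."""
    word = [c for c in name.upper() if "A" <= c <= "Z"]     # step 1
    if not word:
        return ""
    first, rest = word[0], word[1:]                         # step 2
    codes = []
    for letter in rest:
        if letter in DROPPED:                               # step 3
            if keep_separators:
                codes.append("9")    # variant: keep a placeholder instead
        else:
            codes.append(DIGIT_FOR[letter])                 # step 4
    deduped = []
    for code in codes:                                      # step 5 (or 5.1)
        if not deduped or deduped[-1] != code:
            deduped.append(code)
    digits = [c for c in deduped if c != "9"]               # step 5.2
    return (first + "".join(digits) + "000")[:4]            # step 6


print(soundex("Celko"), soundex("Selko"))          # C420 S420
print(soundex("Lee"), soundex("Leigh"))            # L000 L200
print(soundex("Ashcraft"))                         # A261
print(soundex("Ashcraft", keep_separators=True))   # A226
```

With the placeholder variant, the S and C in a name like "Ashcraft" are kept apart by the H, so both 2s survive.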

The problem with Soundex is that it was a manual operation used by
the Census Bureau long before computers. The algorithm used was not
always applied uniformly from place to place. Surname prefixes, such as
“La,” “De,” “von,” or “van,” are generally dropped from the last name for
Soundex, but not always.

If you are searching for surnames such as “DiCaprio” or “LaBianca,”
you should try the Soundex codes for the name both with and without the
prefix. Likewise, leading syllables like “Mc,” “Mac,” and “O” were also
dropped.

Then there was a question about dropping H and W along with the
vowels. The United States Census Soundex did it both ways, so a name
like “Ashcraft” could be converted to “Ascrft” in the first pass, and finally
Soundexed to “A261,” as it is in the 1920 New York Census. The
Soundex code for the 1880, 1900, and 1910 censuses followed both
rules. In this case, Ashcraft would be “A226” in some places. The
reliability of Soundex is 95.99%, with a selectivity factor of 0.213% for a
name inquiry.

Metaphone

Metaphone is another improved Soundex that first appeared in Computer
Language magazine (Philips 1990). A Pascal version written by Terry
Smithwick (Smithwick 1991), based on the original C version by
Lawrence Philips, is reproduced with permission here:

CONST
  VowelSet  = ['A', 'E', 'I', 'O', 'U'];
  FrontVSet = ['E', 'I', 'Y'];
  VarSonSet = ['C', 'S', 'T', 'G'];
  { variable sound - modified by following 'h' }

FUNCTION SubStr (A : STRING;
                 Start, Len : INTEGER) : STRING;
BEGIN
  SubStr := Copy (A, Start, Len);
END;

FUNCTION Metaphone (p : STRING) : STRING;
VAR
  i, l, n : BYTE;
  silent, new : BOOLEAN;
  last, this, next, nnext : CHAR;
  m, d : STRING;
BEGIN { Metaphone }
  IF (p = '')
  THEN BEGIN
    Metaphone := '';
    EXIT;
  END;
  { Fold to uppercase }
  FOR i := 1 TO Length (p)
  DO p[i] := UpCase (p[i]);
  { Assume all alphas }
  { initial preparation of string }
  d := SubStr (p, 1, 2);
  IF (d = 'KN') OR (d = 'GN') OR (d = 'PN') OR (d = 'AE') OR (d = 'WR')
  THEN p := SubStr (p, 2, Length (p) - 1);
  IF (p[1] = 'X')
  THEN p := 'S' + SubStr (p, 2, Length (p) - 1);
  IF (d = 'WH')
  THEN p := 'W' + SubStr (p, 3, Length (p) - 2);
  { Set up for Case statement }
  l := Length (p);
  m := '';
  { Initialize the main variable }
  new := TRUE;
  { this variable only used next 10 lines!!! }
  n := 1;
  { Position counter }
  WHILE ((Length (m) < 6) AND (n <= l) )
  DO BEGIN { Set up the 'pointers' for this loop-around }
    IF (n > 1)
    THEN last := p[n-1]
    ELSE last := #0;
    { use a nul terminated string }
    this := p[n];
    IF (n < l)
    THEN next := p[n+1]
    ELSE next := #0;
    IF ((n+1) < l)
    THEN nnext := p[n+2]
    ELSE nnext := #0;
    new := (this = 'C') OR (last <> this);
    { skip a doubled letter, except 'CC' inside word }
    IF (new)
    THEN BEGIN
      IF ((this IN VowelSet) AND (n = 1) )
      THEN m := this;
      CASE this OF
        'B' : IF NOT ((n = l) AND (last = 'M') )
              THEN m := m + 'B';
              { -mb is silent }
        'C' : BEGIN { -sce, i, y = silent }
                IF NOT ((last = 'S') AND (next IN FrontVSet) )
                THEN BEGIN
                  IF (next = 'I') AND (nnext = 'A')
                  THEN m := m + 'X' { -cia- }
                  ELSE IF (next IN FrontVSet)
                  THEN m := m + 'S' { -ce, i, y = 'S' }
                  ELSE IF (next = 'H') AND (last = 'S')
                  THEN m := m + 'K' { -sch- = 'K' }
                  ELSE IF (next = 'H')
                  THEN BEGIN
                    IF (n = 1) AND ((n+2) <= l)
                       AND NOT (nnext IN VowelSet)
                    THEN m := m + 'K'
                    ELSE m := m + 'X';
                  END
                  ELSE m := m + 'K'; { any other 'C' = 'K' }
                END { Else silent }
              END;
              { Case C }
        'D' : IF (next = 'G') AND (nnext IN FrontVSet)
              THEN m := m + 'J'
              ELSE m := m + 'T';
        'G' : BEGIN
                silent := (next = 'H') AND (nnext IN VowelSet);
                IF (n > 1) AND (((n+1) = l) OR ((next = 'N') AND
                   (nnext = 'E') AND (p[n+3] = 'D') AND ((n+3) = l) )
                   { Terminal -gned }
                   AND (last = 'I') AND (next = 'N') )
                THEN silent := TRUE;
                { if not start and near -end or -gned.) }
                IF (n > 1) AND (last = 'D') AND (next IN FrontVSet)
                THEN { -dge, i, y }
                  silent := TRUE;
                IF NOT silent
                THEN IF (next IN FrontVSet)
                     THEN m := m + 'J'
                     ELSE m := m + 'K';
              END;
        'H' : IF NOT ((n = l) OR (last IN VarSonSet) )
                 AND (next IN VowelSet)
              THEN m := m + 'H';
              { else silent (vowel follows) }
        'F', 'J', 'L', 'M', 'N', 'R' : m := m + this;
        'K' : IF (last <> 'C')
              THEN m := m + 'K';
        'P' : IF (next = 'H')
              THEN BEGIN
                m := m + 'F';
                INC (n);
              END { Skip the 'H' }
              ELSE m := m + 'P';
        'Q' : m := m + 'K';
        'S' : IF (next = 'H')
                 OR ((n > 1) AND (next = 'I') AND (nnext IN ['O', 'A']) )
              THEN m := m + 'X'
              ELSE m := m + 'S';
        'T' : IF (n = 1) AND (next = 'H') AND (nnext = 'O')
              THEN m := m + 'T' { Initial Tho- }
              ELSE IF (n > 1) AND (next = 'I') AND (nnext IN ['O', 'A'])
              THEN m := m + 'X'
              ELSE IF (next = 'H')
              THEN m := m + '0'
              ELSE IF NOT ((next = 'C') AND (nnext = 'H') )
              THEN m := m + 'T';
              { -tch = silent }
        'V' : m := m + 'F';
        'W', 'Y' : IF (next IN VowelSet)
                   THEN m := m + this;
                   { else silent }
        'X' : m := m + 'KS';
        'Z' : m := m + 'S';
      END;
      { Case }
    END; { IF new }
    INC (n);
  END; { While }
  Metaphone := m
END; { Metaphone }

NYSIIS Algorithm

The New York State Identification and Intelligence System, or NYSIIS,
algorithm is more reliable and selective than Soundex, especially for
grouped phonetic sounds. It does not perform well with Y groups,
because Y is not translated. NYSIIS yields an alphabetic string key that is
padded or truncated to 10 characters.

(1) Translate first characters of name:
    MAC => MCC
    KN => NN
    K => C
    PH => FF
    PF => FF
    SCH => SSS

(2) Translate last characters of name:
    EE => Y
    IE => Y
    DT,RT,RD,NT,ND => D

(3) The first character of key = first character of name.

(4) Translate remaining characters by following rules,
    scanning one character at a time
    a. EV => AF else A,E,I,O,U => A
    b. Q => G, Z => S, M => N
    c. KN => N else K => C
