Beginning Regular Expressions (2005), Part 4

6. Click the Find First icon and then click the Find Next icon twice, observing each time what
character sequence is or is not matched.
Figure 8-7 shows the appearance after the Find Next icon has been clicked twice. With the
modification of the regular expression, all three occurrences of the character sequence Andrew
now match.
Figure 8-7
7. Now that you know that each occurrence of the relevant string matches, you can modify the
regular expression to create two groups between which you can insert the desired apostrophe
to make the possessive Andrew’s.
Modify the regular expression in the Match tab to (Andrew)(s)(?=\b).
8. Using the Find First and Find Next icons, confirm that all three occurrences of the character
sequence Andrews match the slightly modified pattern.
9. Click the Replace tab. In the lower pane on the Replace tab, type $1’$2.
10. On the Test tab, click the Replace All icon, and inspect the results in the lower pane on the Test
tab. (You may need to adjust the window size to see all the results.)
Figure 8-8 shows the appearance after this step.
207
Lookahead and Lookbehind
11_574892 ch08.qxd 1/7/05 11:02 PM Page 207
Figure 8-8
How It Works
The pattern Andrew((?=s )|(?=s\.)) matches the character sequence Andrew followed by either s
and a space character or by s and a period character.
In the first line, the character sequence Andrew is followed by s and a space character, so it
satisfies the first lookahead constraint. Because the match is successful and the lookahead
constraint is satisfied, there is a match for the whole regular expression.
In the second line, the character sequence Andrew is followed by s and a period character. The
second lookahead constraint is satisfied.
On the third line, the character sequence Andrew is followed by an s and then a question mark.
Because the question mark is in neither lookahead, the lookahead constraint is not satisfied.
When the regular expression pattern is changed to Andrew(?=s\b), the lookahead constraint that
applies once Andrew is matched is an s followed by a word boundary. There is a word boundary
between each Andrews and the character that follows it on all three lines. In Line 1, there is a
word boundary before the space character. In Line 2, there is a word boundary before the period
character. In Line 3, there is a word boundary before the question mark. So each occurrence of
Andrew matches.
208
Chapter 8
11_574892 ch08.qxd 1/7/05 11:02 PM Page 208
When the regular expression is modified to (Andrew)(s)(?=\b), you capture the character sequence
Andrew in $1 and capture the s in $2. The lookahead does not capture any characters. So to insert an
apostrophe, you want $1 (Andrew) to be followed by an apostrophe, which is followed in turn by $2
(a lowercase s).
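The same replacement can be sketched outside RegexBuddy. The following is a minimal Python illustration, not the book's own tooling. The three sample lines are hypothetical stand-ins for the test text, and Python's re module writes the backreferences as \1 and \2 rather than $1 and $2.

```python
import re

# Hypothetical sample lines; the original test text is not reproduced in this section.
lines = [
    "Andrews book is here",
    "The book is Andrews.",
    "Is the book Andrews?",
]

# Group 1 captures Andrew and group 2 captures s.
# The lookahead (?=\b) checks for a word boundary but consumes no characters.
pattern = re.compile(r"(Andrew)(s)(?=\b)")

# \1 and \2 are Python's spelling of the $1 and $2 used in RegexBuddy.
fixed = [pattern.sub(r"\1'\2", line) for line in lines]

for line in fixed:
    print(line)
```

Because the lookahead captures nothing, only the two parenthesized groups are available to the replacement text.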
Lookbehind
Lookbehind tests whether a sequence of characters that is matched is preceded (positive lookbehind) or
not preceded (negative lookbehind) by another sequence of characters.
For example, if you wanted to match the surname Jekyll only if it is preceded by the sequence of
characters Dr. (an uppercase D, a lowercase r, a period, and a space character), you would use a
pattern like this:

(?<=Dr. )Jekyll
The component (?<=Dr. ) indicates the sequence of characters that is tested for as a lookbehind, and
the component Jekyll matches literally. Strictly, the unescaped period in the lookbehind matches
any single character; write \. if only a literal period should be accepted.
Positive Lookbehind
A positive lookbehind is a constraint on matching. Matching occurs only if the pattern to be matched is
preceded by the pattern contained in the lookbehind assertion.
Try It Out Positive Lookbehind
1. Open the Komodo Regular Expression Toolkit, and delete any residual regular expression and
sample text.
2. In the Enter a String to Match Against area, enter the test text, Mr. Hyde and Dr. Jekyll are
characters in a famous novel.
3. In the Enter a Regular Expression area, enter the pattern (?<=Dr. )Jekyll.
4. Inspect the highlighted text in the String to Match Against area and the description of the results
in the gray area below, Match succeeded: 0 groups.
Figure 8-9 shows the appearance. Notice that the sequence of characters Jekyll is highlighted.
5. Edit the regular expression pattern to read (?<=Mr. )Jekyll.
6. Inspect the description of the results in the gray area, No matches found.
7. Edit the regular expression pattern to read ((?<=Mr. )|(?<=Mister ))Hyde. Ensure that
there is a space character after the r of Mister. If that is omitted, there will be no match.
Figure 8-9
8. Inspect the description of the results in the gray area, Match succeeded: 1 group. Also
notice that the character sequence Hyde is highlighted.
9. Edit the Mr. in the test text to read Mister.
10. Inspect the gray area again. Again, the description is Match succeeded: 1 group. Figure 8-10
shows the appearance.
Figure 8-10
How It Works
The following description of how the regular expression engine operates is a conceptual one and may
not reflect the approach taken by any individual regular expression engine. The text matched is the
sequence of characters Jekyll.
Matching starts at the beginning of the test text. The character following the regular expression’s
position is checked to see whether it is an uppercase J. If so, that is matched, and an attempt is
made to match the other characters making up the sequence of characters Jekyll. If any attempt to
match fails, the whole pattern fails, and the regular expression engine moves forward through the
text attempting to match the character sequence Jekyll.
If a match is found for the character sequence Jekyll, the regular expression engine is at the
position immediately before the J of Jekyll. It checks that the immediately preceding character is
a space character. If so, it then tests if the character before that is a period character. If so,
it tests if the character before that is a lowercase r. Finally, it tests if the character before
that is an uppercase D. Because matching of Jekyll was successful, and the constraint that the
character sequence Jekyll be preceded by the character sequence Dr. (including a space character)
was satisfied, the whole regular expression succeeds.
When you edit the pattern to read (?<=Mr. )Jekyll, the character sequence Jekyll is successfully
matched as before. However, when the regular expression engine checks the characters that precede
that character sequence, the constraint fails. Reading backward, the space character, the period
character, and the lowercase r are all present, but the character before them is an uppercase D,
not the uppercase M that the lookbehind requires.
Because the lookbehind constraint is not satisfied, there is no match.
It is possible to express alternatives in lookbehind. The problem definition might read as follows:
Match the character sequence Hyde if it is preceded by EITHER the character sequence Mr.
(including a final space character) OR by the character sequence Mister (including a final space
character).
After changing the pattern to read ((?<=Mr. )|(?<=Mister ))Hyde, the regular expression engine
attempts to match the character sequence Hyde. When it reaches the position immediately before the H
of Hyde it will successfully match that character sequence. It then must also satisfy the constraint
on the sequence of characters that precedes Hyde.
The pattern ((?<=Mr. )|(?<=Mister ))Hyde uses parentheses to group two alternative patterns
that must precede Hyde. The first option, specified by the pattern (?<=Mr. ), requires that the
sequence of four characters M, r, a period, and a space character must precede Hyde. At Step 8,
that four-character sequence matches.
After the edit has been made to the test text, replacing Mr. with Mister, the other alternative
comes into play. The pattern (?<=Mister ) requires that a seven-character sequence (Mister plus a
space character) precedes Hyde.
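The lookbehind behavior described above can be checked with a short script. This is a minimal sketch using Python's re module rather than the Komodo toolkit. It reuses the test text and patterns from the steps; the variable names are illustrative only.

```python
import re

text = "Mr. Hyde and Dr. Jekyll are characters in a famous novel."

# Positive lookbehind: Jekyll matches only when "Dr. " comes immediately before it.
dr_jekyll = re.search(r"(?<=Dr. )Jekyll", text)

# Same target, but the required prefix "Mr. " is not what precedes Jekyll.
mr_jekyll = re.search(r"(?<=Mr. )Jekyll", text)

# Two alternative lookbehinds grouped with parentheses.
# Python requires each lookbehind to have a fixed width, which holds for both.
hyde = re.compile(r"((?<=Mr. )|(?<=Mister ))Hyde")
hyde_mr = hyde.search(text)
hyde_mister = hyde.search(text.replace("Mr.", "Mister"))

print(dr_jekyll.group(), mr_jekyll, hyde_mr.group(), hyde_mister.group())
```

The matched text is only Jekyll or Hyde in each case; the characters tested by the lookbehind are never part of the match.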
The positioning of the lookbehind assertion is important, as you will see in the next example.
Try It Out Positioning of Positive Lookbehind
1. Open RegexBuddy, click the Match tab, and enter the regular expression (?<=like )SQL Server.
2. Click the Test tab, click the Open File icon, and open the Databases.txt file.
3. Click the Find First icon, and inspect the highlighted text in the pane in the Test tab, as shown in
Figure 8-11.
Figure 8-11
4. Edit the regular expression in the Match tab so that it reads SQL Server(?<=like ).
5. Click the Find First icon in the Test tab. Confirm that there is now no highlighted text.
6. Edit the regular expression in the Match tab so that it reads
SQL Server(?<=like SQL Server).
7. Click the Find First icon in the Test tab. Confirm that there is again a match in the test text, as
shown in Figure 8-12.
How It Works
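When the pattern is (?<=like )SQL Server, the lookbehind looks behind, starting from the position
immediately before the S of SQL. Because the character sequence like SQL Server exists in the test
text, there is a match. When the pattern is SQL Server(?<=like ), the lookbehind starts from the
position after the r of Server. Because that position is preceded by Server, not like, and the
lookbehind is attempting to match the character sequence like, there is no match. When the pattern
is SQL Server(?<=like SQL Server), the lookbehind again starts from the position after the r of
Server, but this time the text that precedes that position is like SQL Server, so the constraint is
satisfied and there is a match.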
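The effect of where the lookbehind sits can be sketched in Python. The sentence below is a hypothetical stand-in for Databases.txt, which is not reproduced in this section; the three patterns are the ones used in the steps.

```python
import re

# Hypothetical sentence standing in for the contents of Databases.txt.
text = "We use products like SQL Server every day."

# Lookbehind tested from the position just before the S of SQL.
before = re.search(r"(?<=like )SQL Server", text)

# Lookbehind tested from the position just after the r of Server,
# where the preceding five characters are "erver", not "like ".
after = re.search(r"SQL Server(?<=like )", text)

# Lookbehind tested after Server, but now describing everything that precedes it.
after_full = re.search(r"SQL Server(?<=like SQL Server)", text)

print(before, after, after_full)
```

A lookbehind is always evaluated relative to the position it occupies in the pattern, so moving it changes which characters it inspects.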
Figure 8-12
Negative Lookbehind
Negative lookbehind is a constraint on matching. Matching occurs only if the pattern to be matched is
not preceded by the pattern contained in the lookbehind assertion.
Try It Out Negative Lookbehind
Find occurrences of the character sequence SQL Server that are not preceded by the character sequence
like followed by a space character.
1. Open RegexBuddy, click the Match tab, and enter the regular expression (?<!like )SQL Server.
2. Click the Test tab, click the Open File icon, and open the Databases.txt file.
3. Click the Find First icon, and inspect the highlighted text in the pane in the Test tab, as shown in
Figure 8-13.
4. Look for other matches by clicking the Find Next icon several times. Note which occurrences of
SQL Server match or don’t match.
Figure 8-13
How It Works
When the regular expression engine matches the character sequence SQL Server, it checks whether the
preceding characters correspond to the pattern specified in the lookbehind.
The first occurrence of SQL Server is not preceded by the character sequence like followed by a
space character. The negative lookbehind is, therefore, satisfied. Because the character sequence
SQL Server matches and the negative lookbehind constraint is satisfied, the whole regular expression
matches.
The only occurrence of the character sequence SQL Server that fails to match is the occurrence
preceded by the word like. The occurrence of the character sequence like followed by a space
character does not satisfy the constraint imposed by the lookbehind. Therefore, although the
character sequence SQL Server matches, the failure to satisfy the lookbehind constraint means that
the whole regular expression fails to match.
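A negative lookbehind can be tried the same way. In this minimal Python sketch the sample text is hypothetical (Databases.txt is not shown here) and contains three occurrences of SQL Server, one of them preceded by like and a space.

```python
import re

# Hypothetical text with three occurrences of SQL Server.
text = (
    "SQL Server is a relational database. "
    "Few products are like SQL Server. "
    "Many teams run SQL Server in production."
)

# Match SQL Server only where "like " does NOT immediately precede it.
pattern = re.compile(r"(?<!like )SQL Server")

starts = [m.start() for m in pattern.finditer(text)]
total = len(re.findall(r"SQL Server", text))

print(f"{len(starts)} of {total} occurrences satisfy the negative lookbehind")
```

The occurrence at the very start of the text matches because nothing precedes it, so the lookbehind pattern cannot be found there.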
How to Match Positions
By combining lookahead and lookbehind, it is possible to match positions between characters. For
example, suppose that you wanted to match a position immediately before the Andrew of the following
sample text:
This is Andrews book.
You could state the problem definition as follows:
Match a position that is preceded by the character sequence is and a space character, and that is
followed by the character sequence Andrew.
You could match that position using the following pattern:
(?<=is )(?=Andrew)
Try It Out Matching a Position
1. Open RegexBuddy. On the Match tab, type the regular expression pattern (?<=is )(?=Andrew).
If you used RegexBuddy for the replace example earlier in this chapter, delete the replacement
text on the Replace tab.
2. On the Test tab, enter the sample text This is Andrews book.
3. Click the Find First icon, and inspect the information in the lower pane of the Test tab, as shown
in Figure 8-14. On-screen, you can see the cursor blinking at the position immediately before the
initial A of Andrews.
Figure 8-14
How It Works
The regular expression engine starts at the beginning of the document and tests each position to see
whether both the lookbehind and lookahead constraints are satisfied. In the test text, only the
position immediately before the initial A of Andrews satisfies both constraints. It is, therefore,
the only position that matches.
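A zero-width match of this kind can be observed directly in code. The following minimal Python sketch applies the same pattern to the same sample text and reports where the empty match falls.

```python
import re

text = "This is Andrews book."

# Both assertions are zero-width, so a successful match consumes no characters.
match = re.search(r"(?<=is )(?=Andrew)", text)

position = match.start()
print(match.span())          # start and end are the same index
print(text[position:])       # the text that follows the matched position
```

Note that the position after This and its space is also preceded by is and a space, but it is not followed by Andrew, so it fails the lookahead.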
Adding Commas to Large Numbers
One of the useful ways to apply a combination of lookbehind and lookahead is adding commas to large
numbers.
Assume that the sales for the fictional Star Training Company are $1,234,567. The data would likely be
stored as an integer without any commas. However, for readability, commas are usual in many situa-
tions where financial or other numerical data is presented.
The process of adding commas to a large numeric value is essentially to match the position between the
appropriate numeric digits and replace that position by a comma.
In some European languages, the thousands separator, which is a comma in English, is a period charac-
ter. Such periods can be added to a numeric value by slightly modifying the technique presented below.
First, let’s look at a numeric value of 1234 and how you can add a comma in the appropriate place.
You want to insert the comma at the position between the 1 and the 2. The reason to insert a comma
in that position is that there are three numeric digits between the desired position and the end of
the string.
Try It Out Adding a Comma Separator to a Four-Digit Number
1. Open RegexBuddy. On the Replace tab, enter the pattern (?<=\d)(?=\d\d\d) in the upper pane
and a single comma character in the lower pane.
2. On the Test tab, enter the test text 1234, and click the Find First icon. Confirm that there is
a match, as described in the lower pane on the Test tab.
3. Click the Replace All icon, and check the replacement text shown in the lower pane on the Test
tab (see Figure 8-15). The replacement text is 1,234, which is what you want. The regular
expression pattern works for four-digit numbers.
Figure 8-15
4. Edit the test text in the upper pane of the Test tab to read 1234567.
5. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab.
The replacement text is 1,2,3,4,567, which is not what you want. All the positions that have
at least three numeric digits to the right have had a comma inserted, as shown in Figure 8-16.
6. Edit the pattern to (?<=\d)(?=(\d\d\d)+).
7. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab.
The undesired commas are still there.
Figure 8-16
8. Edit the pattern to (?<=\d)(?=(\d\d\d)+$).
9. Click the Replace All icon, and inspect the replacement text in the lower pane of the Test tab (see
Figure 8-17). This is 1,234,567, which is what you want.
10. Depending on your data source, the pattern (?<=\d)(?=(\d\d\d)+$) may not work. Imagine
that a single character (for example, a period character) follows the last digit of the number to
which you wish to add commas. Edit the test text to read Monthly sales figures are 1234567.
11. Edit the regular expression on the Replace tab to read (?<=\d)(?=(\d\d\d)+\W).
Figure 8-17
How It Works
The pattern (?<=\d)(?=\d\d\d) looks for a position that follows a single numeric digit and precedes
three numeric digits. In the sample text 1234, there is only one position that satisfies both the
lookbehind and lookahead constraints: the position after the numeric digit 1.
When the test text is changed to 1234567, the pattern (?<=\d)(?=\d\d\d) matches several times. For
example, the position following the numeric digit 2 is preceded by a numeric digit and is followed by
three numeric digits. That position therefore satisfies both the lookbehind and lookahead constraints.
You need to group the numeric digits into groups of three to attempt to get rid of the undesired comma
replacements. The pattern (?<=\d)(?=(\d\d\d)+) groups the numeric digits in the lookahead into
threes but fails, as you saw in Figure 8-16, to prevent the unwanted commas. At the position following
the numeric digit 2, there is still a sequence of three digits following that position, so the
position matches. A comma is therefore inserted (although that is not appropriate to formatting
norms for numbers).
When the pattern is edited to (?<=\d)(?=(\d\d\d)+$), you get the results you want. The position
following the numeric digit 2 now fails to satisfy the lookahead constraint. It is followed by five
numeric digits, and five digits cannot be split into complete groups of three that run to the end of
the string, so the pattern (\d\d\d)+$ does not match.
However, the position after the numeric digit 1 still matches. It is followed by six numeric digits,
which matches the pattern (\d\d\d)+$. Similarly, the position after the numeric digit 4 is matched,
because it is followed by three numeric digits, which matches the pattern (\d\d\d)+$. In both those
positions that match, a comma is inserted.
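The progression of patterns in this Try It Out can be reproduced with a few lines of Python. This is a minimal sketch; the helper name add_commas is hypothetical and simply wraps re.sub, replacing each zero-width match with a comma.

```python
import re


def add_commas(text: str, pattern: str) -> str:
    """Insert a comma at every zero-width position matched by pattern."""
    return re.sub(pattern, ",", text)


# Any position with a digit before it and three digits after it: too many commas.
naive = add_commas("1234567", r"(?<=\d)(?=\d\d\d)")

# Grouping into threes without an anchor does not help.
grouped = add_commas("1234567", r"(?<=\d)(?=(\d\d\d)+)")

# Requiring the groups of three to run to the end of the string gives the desired result.
anchored = add_commas("1234567", r"(?<=\d)(?=(\d\d\d)+$)")

# When the number is followed by another character, test for a nonword character instead.
in_sentence = add_commas(
    "Monthly sales figures are 1234567.", r"(?<=\d)(?=(\d\d\d)+\W)"
)

print(naive, grouped, anchored, in_sentence, sep="\n")
```

For the European convention mentioned earlier, passing a period instead of a comma as the replacement would be the only change needed in this sketch.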
Exercises
These exercises allow you to test your understanding of some of the techniques for lookahead and look-
behind that were introduced in this chapter:
1. Specify a pattern that will match a sequence of one or more alphabetic characters only if they
are followed by a comma character.
2. Create a pattern, using lookbehind and lookahead, to match the word sheep. Do not use the
word-boundary metacharacters in your pattern.
9
Sensitivity and Specificity of Regular Expressions
This chapter discusses the issues of sensitivity and specificity of regular expression patterns.
Sensitivity and specificity relate to two fundamental tasks in all uses of regular expressions: trying
to ensure that you match all the text that you want to match and trying to avoid matching text that
you don’t want to match.
Assuming that you typically want to manipulate the data that you match in some way, failing to
match desired data will mean that part of your intended task remains undone. If you don’t have a
good appreciation of your data and the effect on it of the regular expression that you are using,
you can be completely unaware that you have missed some data. At least, you are unaware that
you have missed it until your manager or a customer calls and complains.
Conversely, matching and manipulating undesired data may well corrupt parts of your data.
Whether that data corruption leads to minor typos or more serious problems depends on your
data, what its intended use is, and the extent and severity of the undesired changes you uninten-
tionally make to it. Again, the undesired effects can impact adversely on customer satisfaction. So
sensitivity and specificity are issues to take seriously.
In this chapter, you will learn the following:
❑ What sensitivity and specificity are
❑ How to work out how far you should go in investing time and effort in maximizing sensi-
tivity and/or specificity
❑ How to use regular expression techniques to give an optimal balance of sensitivity and
specificity
❑ How the detail of the data source can affect sensitivity and specificity
❑ How to gain a better balance of sensitivity and specificity in the Star Training Company
example
12_574892 ch09.qxd 1/7/05 11:01 PM Page 221
What Are Sensitivity and Specificity?
Sensitivity is the capacity to match the pattern that you want to match. Specificity is the capacity to limit
the character sequences selected by a pattern to those character sequences that you want to detect.
The definitions given may feel a little abstract, so the following examples are provided to develop a
clearer understanding of the ideas of sensitivity and specificity.
Extreme Sensitivity, Awful Specificity
Suppose that you want to match the character sequence ABC. It is very easy to achieve 100 percent
sensitivity using the following pattern:
.*
It matches any sequence of zero or more characters (in most regular expression engines, any
characters other than a newline), so it matches every line of a document.
A sample document, ABitOfEverything.txt, is shown here:
ABC123
DEF9FR

Mary had a little lamb.
var x = 234 / 1.56;
<html><body></body></html>
<book></book>
This is a random 58#Gooede garbled piece of 8983ju**nk but it is still selected.
Sensitivity and specificity are terms derived from quantitative disciplines such as
statistics and epidemiology. Broadly, sensitivity is a measure of the number of true
hits you find divided by the total number of true hits you ought to find if you match
all occurrences of the relevant character sequences, and specificity is the number of
hits you find that are true hits divided by the total number of hits you find. The
higher the sensitivity, the closer you are, in the context of regular expressions, to
finding all true matches, and the higher the specificity, the closer you are to finding
only true matches.
As you can see, there is a pretty diverse range of content, not all of which is useful. However, if
you apply the regular expression pattern .* you achieve 100 percent sensitivity, because the only
occurrence of the character sequence ABC is matched. However, you also select every other piece of
text in the sample document, as you can see in Figure 9-1 in OpenOffice.org Writer.
Figure 9-1
I introduced this slightly silly example to make an important point. It is possible to create very
sensitive regular expression patterns that achieve nothing useful. Of course, you are unlikely to
use .* as a standalone pattern, but it is important to carefully consider the usefulness of the
regular expression patterns you create when, typically, the issues will be significantly more subtle.
Useful regular expressions keep the 100 percent sensitivity (or something very close to 100 percent)
of the .* pattern but combine it with a high level of specificity.
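The contrast between a maximally sensitive pattern and a specific one can be sketched in Python. The lines below are taken from the ABitOfEverything.txt sample; the use of re.fullmatch and re.search is an illustrative choice, not the OpenOffice.org Writer search the chapter uses.

```python
import re

# A few lines from the sample document ABitOfEverything.txt.
lines = [
    "ABC123",
    "DEF9FR",
    "Mary had a little lamb.",
    "var x = 234 / 1.56;",
    "<html><body></body></html>",
]

# .* accepts every line: perfect sensitivity, no specificity.
matched_by_dot_star = [line for line in lines if re.fullmatch(r".*", line)]

# The literal pattern ABC selects only the line that is actually wanted.
matched_by_abc = [line for line in lines if re.search(r"ABC", line)]

print(len(matched_by_dot_star), "lines matched by .*")
print(matched_by_abc)
```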
Email Addresses Example
Suppose that you have a large number of documents or an email mail file that you need to search for
valid email addresses. The file EmailOrNotEmail.txt illustrates the kind of data that might be
contained in the material you need to search. The content of EmailOrNotEmail.txt is shown here:
@Home
@ttitude



John@
20 @ $10 each
@@@ This is a comment @@@


You will see pretty quickly that some of the character sequences in EmailOrNotEmail.txt are valid
email addresses and some are not.
One approach to matching email addresses would be to use the following regular expression to locate all
email addresses:
.*@.*
If you try that pattern using the findstr utility, you can type the following at the command line:
findstr /N /i .*@.* EmailOrNotEmail.txt
You search a single file, EmailOrNotEmail.txt, for the following regular expression pattern:
.*@.*
The /N switch indicates that the line number of any line containing a character sequence that
matches the regular expression pattern will be displayed. The /i switch, which isn’t essential here,
indicates that the pattern will be applied in a case-insensitive way. Figure 9-2 shows the result of
running the specified command.
Figure 9-2
As the figure shows, all the valid email addresses (which are on lines 4, 5, 9, and 10) are selected. This
gives you 100 percent sensitivity, at least on this test data set. In other words, you have selected every
character sequence that represents a valid email address. But you have, on all the other lines, matched
character sequences that are pretty obviously not email addresses. You need to find a more specific pat-
tern to improve the specificity of matching.
Look a little more carefully at how an email address is structured. Broadly, an email address follows this
structure:
username@somehostname
To achieve a better match, you must find patterns that match the username and the hostname but are
more specific than your previous attempt.
The structure of the username can be simply a sequence of alphabetic characters, as here:

Or it can include a period character, such as the following:

Therefore, you need to allow for the possibility of a period character occurring inside the username
part of the email address. The following pattern matches, at a minimum, a single alphabetic character
(strictly, \w also matches numeric digits and the underscore) due to the \w+ component of the
pattern:
\w*\.?\w+
The \w*\.? allows the mandatory alphabetic character(s) to be preceded by zero or more optional
alphabetic characters followed by a single optional period character.
You probably don’t want to match an email address that begins with a period character, as in the following:


So you could use a lookbehind to allow a match for a period character only when it has been preceded
by at least one alphabetic character. This pattern would allow matching of a period character only when
it is preceded by an alphabetic character:
\w*(?<=\w)\.?\w+
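The effect of the lookbehind on the username part can be checked in isolation. This minimal Python sketch uses three hypothetical usernames (the chapter's sample addresses are not reproduced here) and tests whether the whole string is a valid username under the pattern.

```python
import re

# The username pattern with a lookbehind guarding the optional period.
username = re.compile(r"\w*(?<=\w)\.?\w+")

# Hypothetical usernames: plain, with an internal period, and with a leading period.
candidates = ["john", "john.smith", ".john"]

# fullmatch requires the entire string to fit the pattern.
accepted = [name for name in candidates if username.fullmatch(name)]

print(accepted)
```

fullmatch is used here because an unanchored search would still find john inside .john, which is the same issue the later steps address with the ^ and $ metacharacters.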
Try It Out Email Address
1. Open PowerGrep, and enter the pattern \w*(?<=\w)\.?\w+@.* in the Search text area.
2. Enter the folder name C:\BRegExp\Ch09 in the Folder text box. Amend, as appropriate, if you
downloaded the sample files to a different directory.
3. Enter the filename EmailOrNotEmail.txt in the File Mask text box, and click the Search button.
4. Inspect the results in the Results area. Compare the matches shown in Figure 9-2 with the matches
now shown in Figure 9-3, particularly noting the character sequences that no longer match.
Figure 9-3
This is an improvement. The pattern is more specific. You no longer match the undesired
character sequences on lines 1, 2, 7, and 8. However, the character sequence on Line 3
is not a valid email address.
You can remove that undesired match by making the hostname part of the email address more
specific. How specific you want to be is a matter of judgment. For this example, assume that all
hostnames have a sequence of alphabetic characters, followed by a period character, followed by
three (com, net, org, or biz) or four (info) alphabetic characters. For the purposes of this example
we won’t consider hostnames like example.co.uk. The following pattern would be an appropriate
pattern to match hostnames that correspond to the structure just described:
\w+\.\w{3,4}
The \w+ will match even single character domain names (which are allowed with .com, .net,
and .org domains). The \. metacharacter matches a single period character, and the \w{3,4}
component matches either three or four alphabetic characters.
Combining that pattern with your earlier one gives you the following:
\w*(?<=\w)\.?\w+@\w+\.\w{3,4}
5. Enter the pattern \w*(?<=\w)\.?\w+@\w+\.\w{3,4} in the Search text area, and click the
Search button.
6. Inspect the results. Notice that the undesired match on Line 3 is no longer matched. However, a
problem on Line 6, not mentioned earlier, is brought to the surface. On Line 6, the seeming
email address has two @ characters, which is not allowed.
One way to approach this is to use a lookahead to specify that following the first match for an @
character, another @ character does not occur. If you continue to assume that only alphabetic
characters are allowed in an email address, you can specify that you look ahead from the first @
character matched to the first match for a character that is not an alphabetic character or a
period character.
You can do that using the following pattern:
\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}
7. Edit the pattern in the Search text area to be
\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}, and click the Search button.
8. Inspect the results. Figure 9-4 shows the appearance.
Figure 9-4
Unfortunately, the lookahead has not solved the problem with the undesired matches on lines 3
and 6. You need to specify that the pattern is the whole text on a line. In other words, you add a
^ metacharacter to specify the position at the start of the line and the $ metacharacter to specify
the position at the end of the line.
9. Modify the pattern in the Search area to be
^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$, and click the Search button.
10. Inspect the results. Figure 9-5 shows the appearance.
Figure 9-5
Happily, you have now succeeded in avoiding matching the undesired matches on lines 3 and 6. At least
on this simple test data, you have achieved 100 percent sensitivity and 100 percent specificity.
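The difference between the first and final patterns can be sketched in Python. The lines below are hypothetical stand-ins, since the addresses in EmailOrNotEmail.txt are not reproduced here, and the example.com style hostnames are placeholders chosen for illustration.

```python
import re

# Hypothetical lines in the spirit of EmailOrNotEmail.txt.
lines = [
    "@Home",
    "jsmith@example",
    "jsmith@example.com",
    "john.smith@example.org",
    "john@@example.com",
    "John@",
    "20 @ $10 each",
    ".john@example.net",
]

# First attempt: anything containing an @ character.
loose = re.compile(r".*@.*")

# Final pattern from step 9, anchored to the whole line.
strict = re.compile(r"^\w*(?<=\w)\.?\w+@(?=[\w\.]+\W)\w+\.\w{3,4}$")

loose_hits = [line for line in lines if loose.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print(len(loose_hits), "lines matched by the loose pattern")
print(strict_hits)
```

On this hypothetical data the loose pattern accepts every line, while the anchored pattern accepts only the two well-formed addresses.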
The terms sensitivity and specificity come from quantitative sciences, such as statistics and epidemiol-
ogy. In those contexts, both the sensitivity and specificity are expressed numerically, often as percent-
ages. So for the preceding example, you have a sensitivity of 100 percent because all true email addresses
are detected using your first attempt at a regular expression pattern, and you initially have a specificity
of 40 percent because 6 of the 10 matches are false matches (in the sense that they are not valid email
addresses). By the end of the Try It Out example, the specificity has risen to 100 percent on the test data.
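The percentages quoted above follow from simple ratios. The sketch below is a minimal Python illustration using the chapter's own working definitions; the function names are hypothetical. Note that the chapter's definition of specificity (true hits divided by all hits found) is the quantity that statistics texts more often call precision.

```python
def sensitivity(true_hits_found: int, true_hits_total: int) -> float:
    """True hits found divided by all the true hits that exist in the data."""
    return true_hits_found / true_hits_total


def specificity(true_hits_found: int, hits_found: int) -> float:
    """Specificity as this chapter defines it: true hits among all hits found."""
    return true_hits_found / hits_found


# First attempt (.*@.*): all 4 valid addresses found, but 10 lines matched in total.
first_sens = sensitivity(4, 4)
first_spec = specificity(4, 10)

# Final pattern: the same 4 valid addresses found, and nothing else matched.
final_spec = specificity(4, 4)

print(f"{first_sens:.0%} {first_spec:.0%} {final_spec:.0%}")
```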
Replacing Hyphens Example
This example looks at another problem that can occur if you are not careful in thinking through the
meaning of a regular expression.
Assume that you have a collection of text documents that have to be converted into HTML/XHTML.
This example focuses on the possible need for replacing a line of hyphens with the HTML/XHTML
<hr> element to create a horizontal ruled line.
A simplified sample document, HyphenTest.txt, is used in this example:
something
not much

a little text
Fred


-Fred
A first attempt at expressing the problem definition might be as follows:
Replace any hyphens that occur with the character sequence <hr>.
However, that is too imprecise. For example, the third line would be replaced with the following:
<hr><hr><hr><hr>
A more precise statement of the problem definition would be as follows:
Replace any group of consecutive hyphens with the character sequence <hr>.
Assume that you will omit the end tag of the hr element, because many Web browsers have problems if
you use the empty element tag, <hr/>.
If you use the following regular expression pattern to express the idea of one or more hyphens, you can
run into problems for two reasons:
-*
First, not all regular expression engines interpret that pattern correctly. The pattern -* means “Match
zero or more hyphens,” which means that the occurrence of zero hyphens is a match. Therefore, the text
Fred ought to match, which may not be what you expected. Why does Fred match? Because there are
zero hyphens.
OpenOffice.org Writer implements the -* pattern as you might intuitively expect: it matches
only when at least one hyphen occurs, as shown in Figure 9-6. Strictly, the pattern ought to match
on each line, because each line has zero hyphens at the beginning.
Figure 9-6
The Komodo Regular Expressions Toolkit interprets the regular expression pattern correctly, for
example detecting a match for the text Fred, as you can see in Figure 9-7.
Of course, the pattern -+ is more appropriate because you want at least one hyphen to be present
before you expect a match. However, the fact that the * quantifier matches even the absence of the
character or metacharacter that it refers to can cause confusion in some situations.
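The difference between the * and + quantifiers here is easy to confirm in code. This minimal Python sketch uses Fred and a run of four hyphens as test strings; how other engines treat the empty match may differ, as the text notes.

```python
import re

# -* succeeds on Fred with an empty match at position 0, because zero hyphens is allowed.
zero_or_more = re.search(r"-*", "Fred")

# -+ requires at least one hyphen, so there is no match in Fred.
one_or_more = re.search(r"-+", "Fred")

# Replacing each group of consecutive hyphens gives a single <hr>.
ruled = re.sub(r"-+", "<hr>", "----")

# Replacing every individual hyphen gives one <hr> per hyphen.
per_hyphen = re.sub(r"-", "<hr>", "----")

print(zero_or_more.span(), one_or_more, ruled, per_hyphen)
```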
Figure 9-7
The Sensitivity/Specificity Trade-Off
Sensitivity and specificity are always part of a trade-off. The amount of effort required to get
100 percent sensitivity and 100 percent specificity may not be practical in some situations, and
some undefined “good” specificity may be enough. In the end, only you can judge how much effort is
appropriate for the task that you are using regular expressions to achieve.
How important are sensitivity and specificity? The answer is, “It depends.” There are many times when
you will need high sensitivity, 100 percent sensitivity ideally, and at the same time you also need
high specificity. At other times, one or the other may be less important. This section looks at some
of the factors that influence how much importance you should place on sensitivity and specificity.
It depends to a significant extent on who the customer is. If you are using regular expressions to achieve
something for your own use, you may not worry too much if you miss one or two matches. On the other
hand, if you are conducting a replacement of every occurrence of a company name after a takeover, for
example, it would be serious if sensitivity fell below 100 percent.
How Metacharacters Affect Sensitivity and
Specificity
In general, the more metacharacters you use, the more specific a pattern becomes. The pattern cat
matches that sequence of characters whether they refer to a feline mammal or form character sequences
in words such as cathode and caterpillar.
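As a small illustration of that point, the Python sketch below compares the bare pattern cat with a version that adds word-boundary metacharacters. The sample sentence and the choice of \b are assumptions made for this example, not taken from the chapter.

```python
import re

# Hypothetical sentence containing cat as a word and inside two longer words.
text = "The cat sat near the cathode while a caterpillar crawled by."

# The bare pattern matches the letters wherever they occur.
plain = re.findall(r"cat", text)

# Adding \b on each side restricts the match to the standalone word.
whole_word = re.findall(r"\bcat\b", text)

print(len(plain), len(whole_word))
```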