
Lecture: Advanced Analysis and Design of Algorithms, Chapter 5. Assoc. Prof. Dr. Trần Cao Đệ


Pattern Matching
[Figure: the pattern "abacab" aligned at successive shifts against a text beginning "abacaab", with the character comparisons numbered 1 to 4]
Text processing
Outline and Reading
Strings (§9.1.1)
Pattern matching algorithms
- Brute-force algorithm (§9.1.2)
- Boyer-Moore algorithm (§9.1.3)
- Knuth-Morris-Pratt algorithm (§9.1.4)
Strings
A string is a sequence of characters.
Examples of strings:
- Java program
- HTML document
- DNA sequence
- Digitized image
An alphabet Σ is the set of possible characters for a family of strings.
Examples of alphabets:
- ASCII
- Unicode
- {0, 1}
- {A, C, G, T}
Let P be a string of size m.
- A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j.
- A prefix of P is a substring of the type P[0..i].
- A suffix of P is a substring of the type P[i..m − 1].
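As a small illustration of the substring, prefix and suffix notation, the following Python sketch maps the slide's ranges (taken here to include both end ranks) onto Python slices, which exclude the end index. The helper names `substring`, `prefix` and `suffix` are hypothetical and chosen only for this example.

```python
def substring(P, i, j):
    """Return P[i..j], including both rank i and rank j (assumed inclusive)."""
    return P[i:j + 1]


def prefix(P, i):
    """Return the prefix P[0..i]."""
    return substring(P, 0, i)


def suffix(P, i):
    """Return the suffix P[i..m - 1], where m = len(P)."""
    return substring(P, i, len(P) - 1)


if __name__ == "__main__":
    P = "abacab"  # illustrative string
    print(substring(P, 1, 3), prefix(P, 2), suffix(P, 4))
```

The only point of care is the off-by-one between the inclusive range notation and Python's half-open slices.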
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
Applications:
- Text editors
- Search engines
- Biological research
Brute-Force Algorithm
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
- a match is found, or
- all placements of the pattern have been tried.
Brute-force pattern matching runs in time O(nm).
Example of worst case:
- T = aaa … ah
- P = aaah
This worst case may occur in images and DNA sequences, but is unlikely in English text.
Algorithm BruteForceMatch(T, P)
    Input: text T of size n and pattern P of size m
    Output: starting index of a substring of T equal to P, or −1 if no such substring exists
    for i ← 0 to n − m
        { test shift i of the pattern }
        j ← 0
        while j < m ∧ T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i { match at i }
        { otherwise: mismatch at shift i, try the next shift }
    return −1 { no match anywhere }
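The brute-force pseudocode translates almost line for line into Python. The sketch below is one possible rendering, not the course's reference code; the function name `brute_force_match` is an assumption.

```python
def brute_force_match(T, P):
    """Return the first index i with T[i:i+m] == P, or -1 if there is none."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):          # every possible shift of P against T
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:                       # all m characters matched
            return i
        # otherwise a mismatch occurred at shift i; try the next shift
    return -1


if __name__ == "__main__":
    print(brute_force_match("aaaaaaaah", "aaah"))
```

On the worst-case style input above, almost every shift matches three characters before failing on the fourth, which is where the O(nm) bound comes from.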
Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics:
Looking-glass heuristic: compare P with a subsequence of T moving backwards.
Character-jump heuristic: when a mismatch occurs at T[i] = c,
- if P contains c, shift P to align the last occurrence of c in P with T[i];
- else, shift P to align P[0] with T[i + 1].
Example
[Figure: Boyer-Moore search for the pattern "rithm" in the text "a pattern matching algorithm"; the pattern is shown at 7 successive alignments, with the character comparisons numbered 1 to 11]
Last-Occurrence Function
Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
- the largest index i such that P[i] = c, or
- −1 if no such index exists.
Example:
- Σ = {a, b, c, d}
- P = abacab
The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
c      a   b   c   d
L(c)   4   5   3   −1
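A minimal sketch of the last-occurrence function in Python follows. It uses a dictionary rather than an array indexed by character codes, which is an implementation choice made for this example; the name `last_occurrence_function` mirrors the pseudocode but is otherwise an assumption.

```python
def last_occurrence_function(P, alphabet):
    """Map each character c of the alphabet to the largest i with P[i] == c, or -1."""
    L = {c: -1 for c in alphabet}   # O(s): every character starts at -1
    for i, c in enumerate(P):       # O(m): later occurrences overwrite earlier ones
        L[c] = i
    return L


if __name__ == "__main__":
    print(last_occurrence_function("abacab", "abcd"))
```

The two loops correspond to the O(s) and O(m) terms of the O(m + s) bound.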
The Boyer-Moore Algorithm
Algorithm BoyerMooreMatch(T, P, Σ)
    L ← lastOccurrenceFunction(P, Σ)
    i ← m − 1
    j ← m − 1
    repeat
        if T[i] = P[j]
            if j = 0
                return i { match at i }
            else
                i ← i − 1
                j ← j − 1
        else
            { character-jump }
            l ← L[T[i]]
            i ← i + m − min(j, 1 + l)
            j ← m − 1
    until i > n − 1
    return −1 { no match }
Case 1: j ≤ 1 + l, so i advances by m − j.
[Figure: the last occurrence of the mismatched text character a lies to the right of position j in P]
Case 2: 1 + l ≤ j, so i advances by m − (1 + l).
[Figure: the last occurrence of the mismatched text character a lies to the left of position j in P, and P is shifted to align that occurrence with T[i]]
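The following Python sketch follows the BoyerMooreMatch pseudocode, using only the character-jump heuristic described in these slides. It is a hedged illustration rather than a full Boyer-Moore implementation (there is no good-suffix rule); the name `boyer_moore_match` is an assumption, and the last-occurrence table is built from the pattern alone, with absent characters treated as −1.

```python
def boyer_moore_match(T, P):
    """Return the first index where P occurs in T, or -1 (character-jump only)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    L = {c: k for k, c in enumerate(P)}   # last occurrence of each pattern character
    i = j = m - 1                          # compare from the right end of P
    while i <= n - 1:
        if T[i] == P[j]:
            if j == 0:
                return i                   # match starts at i
            i -= 1
            j -= 1
        else:
            l = L.get(T[i], -1)            # -1 when T[i] does not occur in P
            i += m - min(j, 1 + l)         # covers both Case 1 and Case 2
            j = m - 1
    return -1


if __name__ == "__main__":
    T = "a pattern matching algorithm"
    print(boyer_moore_match(T, "rithm"), T.find("rithm"))
```

The `min(j, 1 + l)` term is what guarantees the pattern always moves forward by at least one position, even when the last occurrence of the mismatched character lies to the right of j.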
Example
[Figure: Boyer-Moore search for P = abacab in T = abacaabadcabacabaabb; the pattern is shown at 6 successive alignments, with the character comparisons numbered 1 to 13]
Analysis
Boyer-Moore's algorithm runs in time O(nm + s).
Example of worst case:
- T = aaa … a
- P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text.
Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text.
[Figure: worst-case run with T = aaaaaaaaa and P = baaaaa; each of the 4 alignments costs 6 comparisons, numbered 1 to 24]
The KMP Algorithm - Motivation
Knuth-Morris-Pratt's algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j].
[Figure: the pattern abaaba has matched abaab in the text and mismatches at position j against the text character x; the shifted pattern is aligned below it. Labels: "No need to repeat these comparisons" and "Resume comparing here"]
KMP Failure Function

Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] with j > 0, we set j ← F(j − 1).
j      0   1   2   3   4   5
P[j]   a   b   a   a   b   a
F(j)   0   0   1   1   2   3
[Figure: after a mismatch between P[j] and the text character x, the pattern abaaba is shifted so that comparison resumes at index F(j − 1) of the pattern]
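To make the definition of F(j) concrete, the sketch below computes it directly from the definition by trying every candidate length. This is a deliberately naive check, much slower than the O(m) construction, and is offered only as an illustration; the function name `failure_by_definition` is hypothetical.

```python
def failure_by_definition(P):
    """F[j] = size of the largest prefix of P[0..j] that is also a suffix of P[1..j]."""
    F = []
    for j in range(len(P)):
        best = 0
        for k in range(1, j + 1):               # k <= j keeps the suffix inside P[1..j]
            if P[:k] == P[j - k + 1:j + 1]:     # prefix of length k vs. last k chars of P[0..j]
                best = k
        F.append(best)
    return F


if __name__ == "__main__":
    print(failure_by_definition("abaaba"))
```

Restricting k to at most j is what excludes the trivial case in which the whole of P[0..j] matches itself.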
The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time.
At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i − j increases by at least one (observe that F(j − 1) < j).
Hence, there are no more than 2n iterations of the while-loop.
Thus, KMP's algorithm runs in optimal time O(m + n).
Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m − 1
                return i − j { match }
            else
                i ← i + 1
                j ← j + 1
        else
            if j > 0
                j ← F[j − 1]
            else
                i ← i + 1
    return −1 { no match }
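A possible Python rendering of KMPMatch is sketched below. To keep the block focused on the matching loop, the failure table F is passed in as an argument instead of being computed inside; that signature, and the name `kmp_match`, are assumptions made for this example. The text and the table in the usage line are illustrative values for the pattern abacab.

```python
def kmp_match(T, P, F):
    """Return the first index where P occurs in T, or -1, given P's failure table F."""
    n, m = len(T), len(P)
    if m == 0:
        return 0
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j          # match: the occurrence starts at i - j
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]              # reuse the longest matched prefix; i stays put
        else:
            i += 1                    # nothing matched at this position; advance the text
    return -1


if __name__ == "__main__":
    # Failure table for "abacab": F = [0, 0, 1, 0, 1, 2]
    print(kmp_match("abacaabaccabacabaabb", "abacab", [0, 0, 1, 0, 1, 2]))
```

Note that i never decreases, which is the key difference from the brute-force loop and the reason the text is scanned only once.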
Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time.
The construction is similar to the KMP algorithm itself.
At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i − j increases by at least one (observe that F(j − 1) < j).
Hence, there are no more than 2m iterations of the while-loop.
Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 chars }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use failure function to shift P }
            j ← F[j − 1]
        else
            F[i] ← 0 { no match }
            i ← i + 1
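The failureFunction pseudocode can be sketched in Python as follows. The function name `failure_function` and the explicit `return F` are additions for this example; the loop body mirrors the pseudocode.

```python
def failure_function(P):
    """Compute the KMP failure table of P in O(m) time."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1          # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]          # fall back to the next shorter candidate prefix
        else:
            F[i] = 0              # no prefix of P ends at position i
            i += 1
    return F


if __name__ == "__main__":
    print(failure_function("abacab"))
```

The pattern is in effect matched against itself, which is why the loop has the same shape as the KMP matching loop.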
Example
[Figure: KMP search for P = abacab in T = abacaabaccabacabaabb; the pattern is shown at 5 successive alignments, with the character comparisons numbered 1 to 19]
j      0   1   2   3   4   5
P[j]   a   b   a   c   a   b
F(j)   0   0   1   0   1   2
Rabin-Karp Algorithm
Let Σ = {0, 1, 2, . . ., 9}.
We can view a string of k consecutive characters as representing a length-k decimal number.
- Let p denote the decimal number for P[1..m].
- Let t_s denote the decimal value of the length-m substring T[s+1..s+m] of T[1..n], for s = 0, 1, . . ., n−m.
- t_s = p if and only if T[s+1..s+m] = P[1..m].
p = P[m] + 10(P[m−1] + 10(P[m−2] + . . . + 10(P[2] + 10 P[1]) . . .))
We can compute p in O(m) time.
Similarly, we can compute t_0 from T[1..m] in O(m) time.
Example
6378 = 8 + 10(7 + 10(3 + 10(6)))
     = 8 + 7 × 10 + 3 × 10^2 + 6 × 10^3
     = 8 + 70 + 300 + 6000
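The nested form in this example is Horner's rule, which evaluates an m-digit number with m multiplications and additions. A short Python sketch, with the hypothetical name `horner_value` and a list of digits given most significant first, could look like this:

```python
def horner_value(digits, d=10):
    """Evaluate a digit sequence (most significant digit first) in radix d."""
    value = 0
    for digit in digits:
        value = d * value + digit   # shift the running value left by one digit, then add
    return value


if __name__ == "__main__":
    print(horner_value([6, 3, 7, 8]))
```

Each loop iteration peels off one level of the nested parentheses, starting from the innermost.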
Computing t_s
t_{s+1} can be computed from t_s in constant time:
t_{s+1} = 10(t_s − 10^{m−1} T[s+1]) + T[s+m+1]
Example: T = 314152, t_s = 31415, s = 0, m = 5 and T[s+m+1] = 2
t_{s+1} = 10(31415 − 10000 × 3) + 2 = 14152
Thus p and t_0, t_1, . . ., t_{n−m} can all be computed in O(n+m) time, and all occurrences of the pattern P[1..m] in the text T[1..n] can be found in time O(n+m).
However, p and t_s may be too large to work with conveniently.
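For readers who want the missing step, the update rule follows from writing each window value as a sum of weighted digits. The derivation below is a sketch using the 1-based indexing of these slides; the summation index k is introduced only for this derivation.

```latex
\begin{aligned}
t_s &= \sum_{k=1}^{m} 10^{m-k}\, T[s+k], \\
t_{s+1} &= \sum_{k=1}^{m} 10^{m-k}\, T[s+1+k]
         = 10 \sum_{k=2}^{m} 10^{m-k}\, T[s+k] + T[s+m+1] \\
        &= 10\bigl(t_s - 10^{m-1}\, T[s+1]\bigr) + T[s+m+1].
\end{aligned}
```

The last step uses the fact that dropping the k = 1 term from the sum for t_s removes exactly the high-order digit's contribution, 10^{m−1} T[s+1].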
Computation of p and t_0 using modulus q
With a d-ary alphabet {0, 1, …, d−1}, q is chosen such that d×q fits within a computer word.
The recurrence equation can be rewritten as
t_{s+1} = (d(t_s − T[s+1]h) + T[s+m+1]) mod q,
where h = d^{m−1} (mod q) is the value of the digit "1" in the high-order position of an m-digit text window.
Note that t_s ≡ p (mod q) does not imply that t_s = p.
However, if t_s is not equivalent to p mod q, then t_s ≠ p, and the shift s is invalid.
We use t_s ≡ p (mod q) as a fast heuristic test to rule out the invalid shifts.
Further testing is done to eliminate spurious hits: test to check whether P[1..m] = T[s+1..s+m].
Example
t_{s+1} = (d(t_s − T[s+1]h) + T[s+m+1]) mod q
h = d^{m−1} (mod q)
Here d = 10, alphabet = {0, …, 9}
T = 31415; P = 26, n = 5, m = 2, q = 11
We have:
p = 26 mod 11 = 4
t_0 = 31 mod 11 = 9
t_1 = (10(9 − (3 × 10 mod 11)) + 4) mod 11
    = (10(9 − 8) + 4) mod 11 = 14 mod 11 = 3
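The arithmetic in this example can be checked with a few lines of Python. The sketch below computes every window hash for a digit string with the modular recurrence; the name `window_hashes` is hypothetical, indices are 0-based (so `digits[s]` plays the role of T[s+1]), and the text is assumed to be at least m digits long.

```python
def window_hashes(T, m, d=10, q=11):
    """Return [t_0, t_1, ..., t_{n-m}] modulo q for the digit string T."""
    digits = [int(c) for c in T]
    n = len(digits)
    h = pow(d, m - 1, q)                 # value of a "1" in the high-order position, mod q
    t = 0
    for digit in digits[:m]:             # Horner's rule for the first window
        t = (d * t + digit) % q
    hashes = [t]
    for s in range(n - m):               # roll the window one digit to the right
        t = (d * (t - digits[s] * h) + digits[s + m]) % q
        hashes.append(t)
    return hashes


if __name__ == "__main__":
    print(window_hashes("31415", 2))     # compare with p = 26 % 11
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction inside the recurrence needs no extra correction here; in languages where `%` can be negative, adding q before reducing would be needed. With these values the last window, 15, hashes to 4, the same as p = 26 mod 11, even though 15 ≠ 26: a spurious hit that the explicit comparison must reject.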
Rabin-Karp Implementation
Procedure RABIN-KARP-MATCHER(T, P, d, q)
Input: text T, pattern P, radix d (which is typically |Σ|), and the prime q.
Output: the valid shifts s at which P occurs in T.
n ← length[T]; m ← length[P];
h ← d^{m−1} mod q; p ← 0; t_0 ← 0;
for i ← 1 to m do {
    p ← (d × p + P[i]) mod q;
    t_0 ← (d × t_0 + T[i]) mod q;
}
for s ← 0 to n−m do {
    if (p = t_s)
        if (P[1..m] = T[s+1..s+m])
            print "pattern occurs with shift" s;
    if (s < n−m)
        t_{s+1} ← (d(t_s − T[s+1]h) + T[s+m+1]) mod q;
}
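A runnable Python sketch of the matcher is given below, restricted to digit strings so that each character's numeric value is simply `int(c)`. The name `rabin_karp_matcher`, the 0-based shifts and the default values d = 10, q = 11 are assumptions for this example. The hash update is performed at every shift, whether or not the hashes matched, since the next window's hash is always needed.

```python
def rabin_karp_matcher(T, P, d=10, q=11):
    """Return all 0-based shifts s with T[s:s+m] == P, for digit strings T and P."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                         # d^(m-1) mod q
    p = t = 0
    for i in range(m):                           # hash the pattern and the first window
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s + m] == P:           # explicit check rules out spurious hits
            shifts.append(s)
        if s < n - m:                            # always roll the hash to the next window
            t = (d * (t - int(T[s]) * h) + int(T[s + m])) % q
    return shifts


if __name__ == "__main__":
    print(rabin_karp_matcher("31415", "26"), rabin_karp_matcher("31415", "15"))
```

A small q such as 11 produces many spurious hits; it is used here only to match the worked example, and a practical choice would be a much larger prime.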