CPS222 Lecture: Pattern Matching in Strings - Last revised 4/07/2015
OBJECTIVES:
1. To become familiar with the Knuth-Morris-Pratt algorithm for pattern
matching in strings.
2. To become familiar with the basic wildcard matching algorithm for strings.
MATERIALS:
1. Projectable of brute force matching algorithm
2. Projectable of applying brute force algorithm to a pathological case
3. Projectable of KMP algorithm
4. Projectable of applying KMP to some example strings
5. Projectable of computation of KMP failure function
6. Projectable of applying this to some example patterns
7. Projectable of wildcard matching algorithm
I. Introduction
- ------------
A. When we speak of a string, we are in general speaking of a (possibly
empty) sequence of symbols drawn from some alphabet. There are, therefore,
as many different types of strings as there are possible alphabets:
1. Bit strings are drawn from the alphabet {0,1}. Ultimately, all
data is represented in memory as bit strings - but some hardware
systems and some programming languages provide facilities for
manipulating bit strings of arbitrary length (not necessarily
8 or 32 or whatever the word size happens to be).
2. An individual's DNA can be represented by a string on the alphabet
{ A, C, G, T } - symbolic names for four chemical bases.
3. Character strings are drawn from the alphabet {c | c is of type char }.
Most often, when we speak of a string without further qualification, this
is what we mean. We will focus our discussion on character strings.
B. The key problem in implementing strings in any language is variable
length. Over the course of a program's execution, the space
needed by a given string variable may vary widely - especially when
working with character strings.
C. C++ has two different facilities for working with character strings:
1. One facility is inherited from C - hence such strings are often
called "C strings". A C string is an array of characters, with the
end of the string (and hence its length) marked by a null character ('\0').
a. When the C++ compiler sees a sequence of characters enclosed in
quotes, it interprets it as a C string.
b. Example: the string "hello" has internal representation
_______________________________
| h | e | l | l | o | \0 |
-------------------------------
(Note that, although the string is 5 characters long, the representation
needs six characters to allow for the terminating null character)
c. When declaring a variable to hold a C string, the declared size
of the array must be big enough to hold the largest length
string the variable will ever hold. This will, of course, be 1
more than the number of characters.
When the variable holds a smaller value, some number of elements
in the array will be unused.
d. Because an array name converts to a pointer to its first element
in C, such a string can be regarded equivalently as being of type
char [] or of type char *.
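The points above about C strings can be checked with a short sketch. The helper names unusedElements and cStringExamples are made up for this illustration; only strlen and sizeof come from the language and its library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of the lecture): how many array elements go
// unused when `value` is stored in a char array with `capacity` elements.
std::size_t unusedElements(const char *value, std::size_t capacity)
{
    std::size_t needed = std::strlen(value) + 1;   // +1 for the terminating '\0'
    assert(needed <= capacity);                    // the value must fit in the array
    return capacity - needed;
}

// Checks the representation of "hello" described above.
void cStringExamples()
{
    char greeting[10] = "hello";        // declared size 10; the value uses 6 elements
    const char *p = greeting;           // the array name converts to char *

    assert(std::strlen(greeting) == 5); // the length excludes the null character
    assert(sizeof("hello") == 6);       // the literal itself occupies 6 chars
    assert(p[5] == '\0');               // the terminator follows the 'o'
    assert(unusedElements(greeting, sizeof greeting) == 4);
}
```

Calling cStringExamples() runs the checks; with a 10-element array holding "hello", four elements are left unused.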
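2. The other facility - which is unique to C++ - is the library string
class. This facility supports variable length strings. The internal
details vary from library to library; one older reference-counted
implementation (used by earlier versions of the GNU library) works as
follows. (Libraries conforming to C++11 or later do not share
representations between strings.)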
a. A string object contains a single field, which points to a
representation that looks like this:
current length
space reserved (can be larger than current length)
# of references
"selfish"
data (array of characters at least big enough to hold current value)
b. Operations on a string that would change its length (e.g.
assignment of a new value, inserting or appending characters) may
result in a new representation being created and the pointer being
reset, if the total space available is less than the new needed
length.
c. For efficiency, two or more strings can share the same
representation structure. The reference count keeps track of
the number of strings sharing a representation - when it goes
to zero the representation is deleted.
d. The selfish flag indicates that a representation is subject to
being modified (is being accessed by a non const method)
and hence cannot be shared.
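Whatever representation a library uses, the distinction between current length and space reserved is visible through the standard size() and capacity() members. The sketch below assumes nothing about the internal layout; the helper names appendGrowsCapacity and copiesAreIndependent are made up for this illustration, and the exact growth policy is left to the implementation.

```cpp
#include <cassert>
#include <string>

// Appends `extra` to `s` and reports whether the reserved space had to grow.
// How much it grows by is implementation-defined.
bool appendGrowsCapacity(std::string &s, const std::string &extra)
{
    std::string::size_type before = s.capacity();
    s += extra;                          // may move the value to a larger representation
    assert(s.capacity() >= s.size());    // reserved space always covers the current value
    return s.capacity() > before;
}

// Whether or not representations are shared internally, copies behave as
// independent values: changing one never changes the other.
bool copiesAreIndependent()
{
    std::string a = "hello";
    std::string b = a;
    b[0] = 'j';
    return a == "hello" && b == "jello";
}
```

Appending a long string to a short one should report growth, since the short string's reserved space cannot hold the result.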
II. Pattern matching algorithms
-- ------- -------- ----------
A. From an efficiency standpoint, the most challenging string operation to
implement is pattern matching: given a pattern string and a subject
string, determine whether the pattern occurs in the subject and, if so,
where the match begins.
1. This is an important operation.
a. For example, it is what string methods such as find() in the C++ string class do.
b. But pattern matching is also used for other kinds of strings -
e.g. matching DNA sequences in a database of DNA samples.
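As a point of reference, the C++ library string class already provides this operation as find(), which returns string::npos when there is no match. The wrapper below is a minimal sketch; the name findFirst and the choice to return -1 for "no match" are assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical wrapper around std::string::find: returns the position of the
// first occurrence of pattern in subject, or -1 if there is none.
int findFirst(const std::string &pattern, const std::string &subject)
{
    std::string::size_type pos = subject.find(pattern);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

For example, findFirst("lo", "hello") yields 3, and the same call works on a DNA-style string such as findFirst("GATT", "ACGATTACA"), which yields 2.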
2. A fairly straightforward approach is to use brute force as follows.
We use one variable to point to a character in the subject and another
to point to a character in the pattern. The code below calls them s and
p; the traces that follow call them i and j. At each iteration of the
loop, we compare subject[i] to pattern[j]:
// Return first position where pattern matches subject or -1 if no match
int match(const string & pattern, const string & subject)
{
    int p = 0; int s = 0;   // Positions in pattern and subject
    while (p < pattern.length() && s < subject.length())
    {
        if (pattern[p] == subject[s])
        {
            p ++;           // Characters agree - advance in both strings
            s ++;
        }
        else
        {
            s = s - p + 1;  // Restart one past where this attempt began
            p = 0;
        }
    }
    if (p >= pattern.length())
        return s - p;       // Whole pattern matched; s - p is where it began
    else
        return -1;
}
PROJECT
Example: search for 'lo' in 'hello'
                                          i   j
    Initialize                            0   0
    if fails ('h' != 'l')
      Reset i and j                       1   0
    if fails ('e' != 'l')
      Reset i and j                       2   0
    if succeeds ('l' = 'l')
      increment i, j                      3   1
    if fails ('l' != 'o')
      Reset i and j                       3   0
    if succeeds ('l' = 'l')
      increment i, j                      4   1
    if succeeds ('o' = 'o')
      increment i, j                      5   2
    exit while loop - j >= length('lo')
    declare match starting at (5 - 2) = 3 and exit
3. If we let n be the length of the subject and m of the pattern, then
in most cases, this algorithm has performance close to O(n+m).
a. When pattern and subject match, i is incremented by 1
b. When pattern and subject don't match, i can be decremented. However,
observe that if j = 0, i is in fact INCREMENTED, and if j = 1, i
is left alone. Since mismatches generally occur early with
typical subjects and patterns, we expect that most iterations of
the while loop will result in increasing i by 1, and i will rarely
be decremented. Since the while loop exits when i >= n, we expect
slightly more than n iterations of the loop.
c. We arrive at the approximation O(n+m) by arguing that we will have
O(n) "false starts" and will match m characters when we finally
succeed.
4. However, this algorithm has worst case performance O(n^2). To
see this, consider an (admittedly contrived) example:
search for 'aaaab' in 'aaaaaaaaab'
                                  i    j
    Initialize                    0    0
    'a' = 'a' - increment         1    1
    'a' = 'a' - increment         2    2
    'a' = 'a' - increment         3    3
    'a' = 'a' - increment         4    4
    'a' != 'b' - back up          1    0
    'a' = 'a' - increment         2    1
    'a' = 'a' - increment         3    2
    'a' = 'a' - increment         4    3
    'a' = 'a' - increment         5    4
    'a' != 'b' - back up          2    0
    'a' = 'a' - increment         3    1
    'a' = 'a' - increment         4    2
    'a' = 'a' - increment         5    3
    'a' = 'a' - increment         6    4
    'a' != 'b' - back up          3    0
    'a' = 'a' - increment         4    1
    'a' = 'a' - increment         5    2
    'a' = 'a' - increment         6    3
    'a' = 'a' - increment         7    4
    'a' != 'b' - back up          4    0
    'a' = 'a' - increment         5    1
    'a' = 'a' - increment         6    2
    'a' = 'a' - increment         7    3
    'a' = 'a' - increment         8    4
    'a' != 'b' - back up          5    0
    'a' = 'a' - increment         6    1
    'a' = 'a' - increment         7    2
    'a' = 'a' - increment         8    3
    'a' = 'a' - increment         9    4
    'b' = 'b' - increment         10   5
    Exit reporting match starting at (10 - 5) = 5
PROJECT
What in fact happens here is we have a series of 5 trial matches that
match all the characters in the pattern except the last, followed by
a 6th trial match that succeeds. That is, we compare
(n - m + 1)(m) = nm - m^2 + m characters (here 6 x 5 = 30) = O(mn).
That is, the worst case for our algorithm is O(mn); and since the
pattern can be of length comparable to the subject, this becomes
O(n^2).
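The comparison count can be checked by instrumenting the brute force algorithm. This is a sketch only: the function name matchCounting and the comparisons out-parameter are additions made for this illustration, not part of the algorithm itself.

```cpp
#include <cassert>
#include <string>

// Brute force search that also counts character comparisons.
// Returns the first match position, or -1 if there is no match.
int matchCounting(const std::string &pattern, const std::string &subject,
                  long &comparisons)
{
    std::string::size_type p = 0, s = 0;   // positions in pattern and subject
    comparisons = 0;
    while (p < pattern.length() && s < subject.length())
    {
        ++comparisons;                     // one pattern[p] vs subject[s] test
        if (pattern[p] == subject[s])
        {
            ++p;
            ++s;
        }
        else
        {
            s = s - p + 1;                 // restart one past the previous start
            p = 0;
        }
    }
    return p >= pattern.length() ? static_cast<int>(s - p) : -1;
}
```

For 'aaaab' in 'aaaaaaaaab' the counter ends at 30, agreeing with (n - m + 1)(m) for n = 10 and m = 5. Scaling the same shape up to a pattern of 500 a's plus b and a subject of 999 a's plus b gives 500 x 501 = 250500 comparisons.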
B. The pathological behavior of the algorithm in cases like the above
can be avoided (and a guaranteed O(n+m) search can be achieved) by
taking advantage of information acquired up to the time a match
fails.
1. In the above example, our brute force algorithm backs up all the
way to the beginning of the pattern when it discovers the mismatch
between the "b" in the pattern and the "a" in the subject. This
is not necessary, since we know that the pattern begins with a
series of 4 a's, all of which already matched in the subject.
When we go to consider a possible match at the next starting
position in the subject, three of the subject a's are already known
to match the corresponding pattern a's, so we could resume by comparing
the fourth a in the pattern with the corresponding character in the
subject.
That is, instead of backing up from j = 4 to j = 0 (4 positions),
we could simply back up from j = 4 to j = 3 (1 position), leaving i
where it is.
First trial match:   a a a a a a a a a b
                     a a a a b
                     (we consider all five positions)
Second trial match:  a a a a a a a a a b
                       a a a a b
                             ^ ^
                             | |
                     (we only need to consider these positions)
Third trial match:   a a a a a a a a a b
                         a a a a b
                               ^ ^
                               | |
                     (we only need to consider these positions)
...
Sixth trial match:   a a a a a a a a a b
                               a a a a b
                                     ^ ^
                                     | |
                     (we only need to consider these positions)
The total number of comparisons needed now is five on the first trial
and two each on the second through sixth trials - or 15 total = n + m.
2. Now consider another slightly different example. Suppose the pattern
is as before, but the subject is a a a a c a a a a b. What happens
when we discover the mismatch between b and c at position 4 in the
subject and pattern?
a. In this case, because the pattern begins with four a's, the c
in the subject at position 4 allows us to conclude that no
match can possibly start at positions 1, 2, 3, or 4 in the
subject.
b. Thus, after failing to find a match starting at position 0 in the
subject, we can skip ahead to consider a possible match starting
at position 5 in the subject - which of course succeeds.
c. Here, we should be able to complete the search with only 10
comparisons (though the algorithm we will consider will, in
fact, use 14 - close to the previous case.)
3. These observations are incorporated into an algorithm known as the
Knuth-Morris-Pratt algorithm.
a. The basic idea is this: before starting the actual matching process,
we compute a table next[j] with one entry for each position in the
pattern, defined as follows. (The book calls this function f
instead of next):
next[j] = length of the longest prefix of the pattern that is also a
suffix of pattern[1..j] (the part of the pattern starting at
position 1 and ending at position j), with special case
next[0] = 0
Example: for our pattern aaaab, we compute next[] as follows
    j    next[j]    Rationale
    0    0          Special case
    1    1          pattern[0..0] = pattern[1..1]
    2    2          pattern[0..1] = pattern[1..2]
    3    3          pattern[0..2] = pattern[1..3]
    4    0          No suffix of pattern[1..4] matches any
                    prefix of the pattern
(We will discuss the mechanics of computing this table shortly.)
b. Now we use this table in the search as follows:
i. Suppose that we have matched j characters of the pattern
with the subject, but pattern[j] fails to match subject[i].
ii. In the brute force algorithm, we would continue the search by
comparing pattern[0] with subject[i-j+1] - i.e. we start the
whole matching process over beginning 1 beyond where the one that
failed started. Now, instead, we proceed as follows:
start with i = 0, j = 0;
while i < length of subject and j < length of pattern
    if subject[i] == pattern[j]
        increment both i and j
    else if j > 0
        set j = next[j-1] and leave i alone
        (i.e. look for a match in which the current subject
        character matches the pattern at position next[j-1]
        since it didn't match at position j, though previous
        characters are already known to match.)
    else
        set i = i + 1. (No match can begin with subject[i],
        so the first possible match would begin at subject[i+1]).
if j >= length of pattern then
    declare match found beginning at position i - j
else
    declare no match
PROJECT
iii. Example: match aaaab with aaaaaaaaab
    i    j    Action
              Initialize
    0    0    'a' == 'a' - increase i,j
    1    1    'a' == 'a' - increase i,j
    2    2    'a' == 'a' - increase i,j
    3    3    'a' == 'a' - increase i,j
    4    4    'a' != 'b'.
              Since next[3] = 3, set j = 3 - leave i alone
    4    3    'a' == 'a' - increase i,j
    5    4    'a' != 'b'.
              Since next[3] = 3, set j = 3 - leave i alone
    5    3    'a' == 'a' - increase i,j
    6    4    'a' != 'b'.
              Since next[3] = 3, set j = 3 - leave i alone
    6    3    'a' == 'a' - increase i,j
    7    4    'a' != 'b'.
              Since next[3] = 3, set j = 3 - leave i alone
    7    3    'a' == 'a' - increase i,j
    8    4    'a' != 'b'.
              Since next[3] = 3, set j = 3 - leave i alone
    8    3    'a' == 'a' - increase i,j
    9    4    'b' == 'b' - increase i,j
    10   5    Exit loop - announce match beginning at
              position 10-5 = 5
              (15 comparisons total)
iv. Another example: match a a a a b with a a a a c a a a a b.
    i    j    Action
              Initialize
    0    0    'a' == 'a' - increase i,j
    1    1    'a' == 'a' - increase i,j
    2    2    'a' == 'a' - increase i,j
    3    3    'a' == 'a' - increase i,j
    4    4    'c' != 'b'.
              Since next[3] = 3, set j = 3 - leave i alone
    4    3    'c' != 'a'.
              Since next[2] = 2, set j = 2 - leave i alone
    4    2    'c' != 'a'.
              Since next[1] = 1, set j = 1 - leave i alone
    4    1    'c' != 'a'.
              Since next[0] = 0, set j = 0 - leave i alone
    4    0    'c' != 'a'.
              Since j == 0, leave j at 0 and increment i
    5    0    'a' == 'a' - increase i,j
    6    1    'a' == 'a' - increase i,j
    7    2    'a' == 'a' - increase i,j
    8    3    'a' == 'a' - increase i,j
    9    4    'b' == 'b' - increase i,j
    10   5    Exit loop - announce match beginning at
              position 10-5 = 5
              (14 comparisons total)
PROJECT
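The main loop traced above can be sketched in C++ as follows. To keep the focus on the search itself, this version takes the next[] table as a parameter (for aaaab it is 0 1 2 3 0). The function name kmpSearch and the comparisons out-parameter are assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// KMP main loop, following the pseudocode above. `next` must have one entry
// per pattern position. Returns the first match position, or -1 if none.
int kmpSearch(const std::string &pattern, const std::string &subject,
              const std::vector<int> &next, int &comparisons)
{
    assert(next.size() == pattern.size());
    std::string::size_type i = 0, j = 0;   // positions in subject and pattern
    comparisons = 0;
    while (i < subject.length() && j < pattern.length())
    {
        ++comparisons;                     // one subject[i] vs pattern[j] test
        if (subject[i] == pattern[j])
        {
            ++i;
            ++j;
        }
        else if (j > 0)
        {
            // Keep i; fall back to the longest prefix still known to match
            j = static_cast<std::string::size_type>(next[j - 1]);
        }
        else
        {
            ++i;                           // no match can begin at subject[i]
        }
    }
    return j >= pattern.length() ? static_cast<int>(i - j) : -1;
}
```

With next = {0, 1, 2, 3, 0}, searching aaaaaaaaab uses 15 comparisons and searching aaaacaaaab uses 14, matching the two traces; both report a match at position 5.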
c. Of course, we still have the problem of computing the table next[j].
This must be done as a preliminary step before matching begins,
and is done by matching the pattern against itself, as follows:
i. Set i = 1, j = 0, next[0] = 0
ii. As long as i < the length of the pattern:
    if pattern[i] == pattern[j] then
        set next[i] = j+1 and increment both i and j
    else if j > 0
        set j = next[j-1] but leave i alone
    else
        set next[i] = 0 and increment i but leave j at 0
PROJECT
iii. Example: for the pattern a a a a b:
    i    j                           next[0..4]
    1    0    Initial                0 ? ? ? ?
    1    0    'a' == 'a'             0 1 ? ? ?
    2    1    'a' == 'a'             0 1 2 ? ?
    3    2    'a' == 'a'             0 1 2 3 ?
    4    3    'b' != 'a', j > 0      0 1 2 3 ?
    4    2    'b' != 'a', j > 0      0 1 2 3 ?
    4    1    'b' != 'a', j > 0      0 1 2 3 ?
    4    0    'b' != 'a', j == 0     0 1 2 3 0
    5    0    Done
iv. Example: for the pattern a b a b a:
    i    j                           next[0..4]
    1    0    Initial                0 ? ? ? ?
    1    0    'b' != 'a', j == 0     0 0 ? ? ?
    2    0    'a' == 'a'             0 0 1 ? ?
    3    1    'b' == 'b'             0 0 1 2 ?
    4    2    'a' == 'a'             0 0 1 2 3
    5    3    Done
PROJECT
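The table computation traced above translates almost line for line into C++. This is a sketch under the same definition of next[] used in this section; the function name computeNext is an assumption made for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the KMP table by matching the pattern against itself.
// next[i] = length of the longest prefix of the pattern that is also a
// suffix of pattern[1..i]; next[0] = 0 by special case.
std::vector<int> computeNext(const std::string &pattern)
{
    std::vector<int> next(pattern.length(), 0);
    std::string::size_type i = 1, j = 0;
    while (i < pattern.length())
    {
        if (pattern[i] == pattern[j])
        {
            next[i] = static_cast<int>(j + 1);   // prefix of length j+1 ends at i
            ++i;
            ++j;
        }
        else if (j > 0)
        {
            j = static_cast<std::string::size_type>(next[j - 1]);  // try a shorter prefix
        }
        else
        {
            next[i] = 0;                         // no prefix ends at position i
            ++i;
        }
    }
    return next;
}
```

computeNext("aaaab") produces 0 1 2 3 0 and computeNext("ababa") produces 0 0 1 2 3, the final rows of the two traces.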
d. To see that the Knuth-Morris-Pratt algorithm is in fact O(n+m), observe:
i. In the main loop of the matching algorithm, every iteration either
increments i or decreases j, and we never decrease i. Since the loop
exits once i reaches n, we increment i at most n times.
ii. It is possible to have a series of one or more steps where
we don't increment i. Each such step decreases j by at least 1
(since next[j-1] < j). But j starts at 0, never becomes negative,
and increases (by 1) only on steps that also increment i.
Thus, the number of times we don't increment i cannot exceed
the number of times we do.
iii. Thus, the total number of loop iterations is <= 2n, and
the main match is O(n).
iv. By similar argument, the preliminary process for creating
next[] is O(m), so the total process is O(n+m).
4. There are several other ways to improve the matching process. For example,
one algorithm developed by Boyer and Moore relies on working BACKWARD through
the pattern, and can be very fast in those cases where characters near the end
of the pattern do not occur elsewhere in the pattern.
5. Still other techniques are based on HASHING, rather than direct
comparison. We will not discuss these now - see a book on Algorithms.
C. Another interesting situation arises when we allow the use of wildcard characters
in the pattern.
1. For example, C shells allow the use of the wildcard characters ? and * in
arguments to commands. ? matches any one character, and * matches any
sequence of characters (including an empty sequence).
If a directory contains the files
foe
foo
foreign
fo? would match the first two, but not foreign
foe* would match foe but not the other two
fo* would match all three
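The file name examples above can be checked with a small whole-name matcher. This is a sketch, not how any particular shell implements matching: the function name globMatch is made up for this illustration, and it requires the entire name to match the entire pattern (which is what the directory example assumes).

```cpp
#include <cassert>
#include <string>

// Whole-name wildcard match: true only if all of `name` matches all of
// `pattern`, where ? matches any one character and * matches any sequence
// of characters (including an empty one).
bool globMatch(const std::string &pattern, const std::string &name,
               std::string::size_type p = 0, std::string::size_type n = 0)
{
    if (p == pattern.length())
        return n == name.length();      // pattern used up: name must be too
    if (pattern[p] == '*')
    {
        // Either * matches nothing more, or it absorbs one more character
        return globMatch(pattern, name, p + 1, n) ||
               (n < name.length() && globMatch(pattern, name, p, n + 1));
    }
    if (n < name.length() && (pattern[p] == '?' || pattern[p] == name[n]))
        return globMatch(pattern, name, p + 1, n + 1);
    return false;
}
```

With the three files above, globMatch("fo?", ...) accepts foe and foo but rejects foreign, globMatch("foe*", ...) accepts only foe, and globMatch("fo*", ...) accepts all three.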
2. The simple pattern matching we looked at earlier can easily be extended to
handle wildcards like these by using a recursive auxiliary function, using
symbolic names WILDCARD_SINGLE and WILDCARD_MANY for the wildcards:
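// Recursive auxiliary - return true if pattern starting at position p matches
// subject starting at position s (the match need not use up the whole subject)
bool matchAux(const string & pattern, const string & subject, int p, int s)
{
    while (p < pattern.length() && s < subject.length())
    {
        if (pattern[p] == WILDCARD_MANY)
        {
            // Either the wildcard absorbs subject[s], or it matches nothing more
            return matchAux(pattern, subject, p, s+1) ||
                   matchAux(pattern, subject, p+1, s);
        }
        else if (pattern[p] == subject[s] || pattern[p] == WILDCARD_SINGLE)
        {
            p ++; s ++;
        }
        else
        {
            return false;
        }
    }
    // Subject used up: any WILDCARD_MANY left in the pattern can match nothing
    while (p < pattern.length() && pattern[p] == WILDCARD_MANY)
        p ++;
    return p >= pattern.length();
}
// Return first position where pattern matches subject or -1 if no match
int match(const string & pattern, const string & subject)
{
    // start == subject.length() is tried too, so that a pattern which can
    // match the empty string is found even at the very end of the subject
    for (int start = 0; start <= subject.length(); start ++)
        if (matchAux(pattern, subject, 0, start))
            return start;
    return -1;
}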
PROJECT
3. The extension to handle WILDCARD_SINGLE was easy. The extension
to handle WILDCARD_MANY is the tricky part.
Notice how we consider two possibilities:
a. The wildcard matches at least the current character in the subject,
and maybe more - so we consider what happens if we increment the position
in the subject but not in the pattern.
b. The wildcard matches no more characters in the subject, so we move on to
the next character in the pattern while remaining at the same position in
the subject.
c. Because either possibility might lead to a match, we have to be able to try
both - which is why we need a recursive auxiliary.
4. It is possible - though we won't pursue it here - to extend an algorithm like
KMP to allow the use of wildcards while still preserving O(n) behavior.
D. A final issue concerns allowing generalized regular expressions to be used in the
pattern.
1. Both the Java and C++ standard libraries provide support for this - the package
java.util.regex and the standard header <regex> (added in C++11).
2. However, since regular expressions are a topic in CPS320 we won't discuss
them further here.