Recent Articles



































Knuth-Morris-Pratt algorithm



         


The Knuth-Morris-Pratt string searching algorithm locates a pattern of characters within a string. It uses the fact that after finding the first mismatch between pattern and string we already know the characters compared before to improve the efficiency of the search. We therefore will have to compare the pattern to itself to determine how many positions we have to move to the left.

The algorithm was invented by Knuth and Pratt and independently by J. H. Morris in 1976.

[Top]

Overview of Knuth-Morris-Pratt Algorithm

[Top]

Example

String: ABC ABC ABC ABDAB ABCDABCDABDE pattern: ABCDABD
[Top]

Building 'next' table


index 0 1 2 3 4 5 6 7
char A B C D A B D -
pos -1 0 0 0 0 1 2 0


This is the precomputed table of offsets generated by the algorithm. If we look at the table, we can see that these numbers are the maximum number of characters matchable before the nth position on the string (without matching the entire prefix of n characters).

For example take position 6

ABCDABD ABCDABD ^

The pointer is at position 6. As seen above, the maximum number of characters that can matched before position 6 is 2. Thus, the array entry of 6 is 2.

[Top]

Processing text

We now have the table so we can use the algorithm to check the string for matches:

ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^

We find that the first three match, but the fourth does not, so we look up the fourth character (or index 3) in the table, this gives us zero. We should now try matching position zero on the pattern with the fourth character on the string:

ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^

This time the match fails on the first character (or index 0) on the pattern, which means we go to the lookup table, the table is -1 therefore we now try the fifth character on the main string:

ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^

We continue to increment in this way, sometimes matching, sometimes not, until we reach the end of 'ABCDAB'.

ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^

This last 'D' does not match, so we look up in the table, find 2 so we start from position 2:

ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^

So we match the remaining characters and when we reach the end of the pattern we know it is a match.

We check the last piece of the array (index 7) since we have had a match, it contains position zero. So we check the last 'E' against 'A', it is not a match and we have finished searching the string.

[Top]

Notes on the Algorithm

This algorithm builds the next array as a finite state machine and uses it to dictate to the algorithm what to do at each point.

The complexity of processing the pattern is o(m), where m is the number of characters in the string.

The complexity of searching the string for the given pattern is o(n) where n is the length of the main string.







  View Live Article   This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License