| |||||||||
The Knuth-Morris-Pratt string searching algorithm locates a pattern of characters within a string. It uses the fact that after finding the first mismatch between pattern and string we already know the characters compared before to improve the efficiency of the search. We therefore will have to compare the pattern to itself to determine how many positions we have to move to the left.
The algorithm was invented by Knuth and Pratt and independently by J. H. Morris in 1976.
| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| char | A | B | C | D | A | B | D | - |
| pos | -1 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
This is the precomputed table of offsets generated by the algorithm. If we look at the table, we can see that these numbers are the maximum number of characters matchable before the nth position on the string (without matching the entire prefix of n characters).
For example take position 6
The pointer is at position 6. As seen above, the maximum number of characters that can matched before position 6 is 2. Thus, the array entry of 6 is 2.
We now have the table so we can use the algorithm to check the string for matches:
We find that the first three match, but the fourth does not, so we look up the fourth character (or index 3) in the table, this gives us zero. We should now try matching position zero on the pattern with the fourth character on the string:
This time the match fails on the first character (or index 0) on the pattern, which means we go to the lookup table, the table is -1 therefore we now try the fifth character on the main string:
We continue to increment in this way, sometimes matching, sometimes not, until we reach the end of 'ABCDAB'.
This last 'D' does not match, so we look up in the table, find 2 so we start from position 2:
So we match the remaining characters and when we reach the end of the pattern we know it is a match.
We check the last piece of the array (index 7) since we have had a match, it contains position zero. So we check the last 'E' against 'A', it is not a match and we have finished searching the string.
This algorithm builds the next array as a finite state machine and uses it to dictate to the algorithm what to do at each point.
The complexity of processing the pattern is o(m), where m is the number of characters in the string.
The complexity of searching the string for the given pattern is o(n) where n is the length of the main string.