paint-brush
Understanding and Applying the Knuth-Morris-Pratt Algorithm: From String Matching to Stream Searchby@vpinchuk
559 reads
559 reads

Understanding and Applying the Knuth-Morris-Pratt Algorithm: From String Matching to Stream Search

by Vadym PinchukMay 9th, 2023
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

The Knuth-Morris-Pratt (KMP) algorithm is a string-matching algorithm. It is used to find all occurrences of a pattern string in a text string. The KMP algorithm has many practical applications, such as in text editors, search engines, and DNA sequencing.
featured image - Understanding and Applying the Knuth-Morris-Pratt Algorithm: From String Matching to Stream Search
Vadym Pinchuk HackerNoon profile picture


Algorithm Explanation

The Knuth-Morris-Pratt (KMP) algorithm is a string-matching algorithm that is used to find all occurrences of a pattern string in a text string. It was developed by Donald Knuth, Vaughan Pratt, and James Morris in 1977 and is named after them.


The KMP algorithm is a linear-time algorithm, which means that it can find all occurrences of a pattern string in a text string in O(n+m) time, where n is the length of the text string and m is the length of the pattern string.


This makes it much faster than naive algorithms, which have a worst-case time complexity of O(nm).


The KMP algorithm works by preprocessing the pattern string to generate a partial match table (also called the failure function or the next array). This table contains information about the pattern string that allows the algorithm to skip over unnecessary comparisons during the search.


During the search, the KMP algorithm compares each character in the text string with the corresponding character in the pattern string. If a mismatch is found, the algorithm uses the information from the partial match table to determine where to start the next comparison.


The KMP algorithm has many practical applications, such as in text editors, search engines, and DNA sequencing. It is also used in other algorithms, such as the Boyer-Moore algorithm, which is a faster string-matching algorithm that is based on the KMP algorithm.


The “Algorithm Explanation” section was written by an AI.

Stream-Based KMP Algorithm

The KMP substring search algorithm can be modified to search for a pattern string within a stream of strings. This modification can be useful in various scenarios, such as searching for a particular keyword in a log file or searching for a sequence of characters in a video stream.


To understand this modified algorithm, it is recommended to first have a good grasp of the original KMP pattern search algorithm. A great resource for learning KMP is Tushar Roy’s video guide, which provides an amazing explanation of the algorithm.


The modified algorithm starts with checking for some edge cases and immediately returns false if the pattern or input is empty.


Then, the algorithm creates a pattern table that defines how many characters can be safely ignored (not checked again) while the characters at the current position are not matched.


Next, the algorithm iterates over the list of strings, representing the stream, and searches for the pattern. If the pattern is found in any of the strings, the algorithm returns true; otherwise, it continues searching until the end of the stream is reached.


The modified version of the Knuth-Morris-Pratt (KMP) algorithm requires an additional check on the index inside each string from the stream. However, all the other parts of the algorithm remain the same.


To see how this modified version differs from the original KMP algorithm, you can find the complete code on the author’s GitHub page.


By comparing the two versions of the algorithm, you can gain a better understanding of how KMP can be applied in different scenarios.


The complete code for this modified algorithm can be found in the provided link.

Complete Code in Dart

/// Time Complexity O(m + n), Space complexity O(n)
/// m - input, n - pattern length
bool searchSubstring(List<String> input, String pattern) {
  if (input.isEmpty || pattern.isEmpty) {
    return false;
  }
  /// table to build a pattern
  final List<int> patternTable = List<int>.filled(pattern.length, 0);

  int j = 0;
  int i = 1;
  final int patMaxIndex = pattern.length - 1;

  /// length of prefix which is suffix also
  /// Generate safe check positions for optimization
  /// It defines how much chars can be safely ignored (not checked again)
  /// while chars at current position not matched
  /// And return to position after safe prefix
  /// and check again with same char in the input
  while (i <= patMaxIndex) {
    if (pattern[i] == pattern[j]) {
      patternTable[i] = j + 1;
      j++;
      i++;
    } else if (j > 0) {
      j = patternTable[j - 1];
    } else {
      i += 1;
    }
  }

  int patIdx = 0; // for iteration over pattern;
  for (final String line in input) {
    int lineIdx = 0;
    while (lineIdx < line.length) {
      /// If it is last char in pattern - return start position of this pattern
      if (patIdx == pattern.length) {
        return true;
      }

      /// If chars match - iterate to the next
      if (line[lineIdx] == pattern[patIdx]) {
        patIdx++;
        lineIdx++;
      } else if (patIdx > 0) {
        /// if Value is not zero - means that we could return to pattern position
        /// safely ignore some prefix equal to suffix and check chars again
        patIdx = patternTable[patIdx - 1];
      } else {
        /// If value is zero - move to next char in line
        lineIdx++;
      }
    }
  }
  return false;
}


Please refer to the kmp_mod_substring_search and kmp_substring_search files from my GitHub for a complete code comparison.