
String Matching Algorithms: BF, KMP, RK, BM, and Sunday, the Ultimate Toolkit!


String Matching Algorithms: Hunting for Treasure in Text

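In the vast ocean of computer science, string matching is a gem that guides us to the treasure buried in text. It plays a vital role in text processing, data mining, and bioinformatics, and it is a basic, indispensable task.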

The Algorithm Family: Five Tools for String Matching

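When searching a text for a specific pattern, we can choose among five classic string matching algorithms. Like a set of keys, each has its own strengths and suits different scenarios.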

1. BF Algorithm: Plain and Simple Linear Search

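The BF algorithm, also known as brute-force matching, is the simplest string matching algorithm. It follows the most direct strategy: slide the pattern across the text one position at a time and compare it, character by character, with the substring at each position. Like a diligent little bee, it moves through the flowers of the text until it finds the one that matches the pattern. This naive strategy pays a price in time complexity: each of the n - m + 1 alignments may cost up to m character comparisons, so the worst case is O(mn), where m is the length of the pattern and n is the length of the text.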

Code example:

def brute_force(text, pattern):
    """
    Brute Force String Matching Algorithm

    Args:
        text: The text to be searched.
        pattern: The pattern to be found.

    Returns:
        The index of the first occurrence of the pattern in the text, or -1 if not found.
    """

    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            return i

    return -1
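
To make the O(mn) worst case concrete, here is a minimal sketch that compares character by character and counts the comparisons. The function name `brute_force_count` and the sample input are illustrative assumptions, not part of the original example.

```python
def brute_force_count(text, pattern):
    """Character-by-character brute force (hypothetical helper).

    Returns a tuple (index, comparisons), where index is the first match
    position or -1, and comparisons is the number of character comparisons.
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


# Unfavorable input: every alignment fails only at its last character.
index, comparisons = brute_force_count("a" * 20 + "b", "aaaab")
print(index, comparisons)
```

On inputs like this one, nearly every alignment costs m comparisons, which is where the product of m and n comes from.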

2. KMP Algorithm: Making Clever Use of the Failure Function

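The KMP algorithm is an upgrade of the BF algorithm that introduces the concept of a failure function. Like an experienced detective, the failure function records, for each prefix of the pattern, the length of the longest proper prefix that is also a suffix of that prefix. With this secret weapon, KMP never moves backward in the text after a mismatch; it shifts the pattern by reusing what has already been matched, which greatly reduces the number of comparisons. Its time complexity is O(m+n), a significant improvement over the O(mn) worst case of BF.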

Code example:

def knuth_morris_pratt(text, pattern):
    """
    Knuth-Morris-Pratt String Matching Algorithm

    Args:
        text: The text to be searched.
        pattern: The pattern to be found.

    Returns:
        The index of the first occurrence of the pattern in the text, or -1 if not found.
    """

    # An empty pattern matches at index 0 (and would raise IndexError below).
    if not pattern:
        return 0

    # Preprocess the pattern to compute the failure function.
    failure_function = compute_failure_function(pattern)

    # Match the pattern with the text.
    i = 0
    j = 0
    while i < len(text):
        if pattern[j] == text[i]:
            i += 1
            j += 1

        if j == len(pattern):
            return i - j

        elif i < len(text) and pattern[j] != text[i]:
            if j != 0:
                j = failure_function[j - 1]
            else:
                i += 1

    return -1

def compute_failure_function(pattern):
    """
    Compute the failure function for the given pattern.

    Args:
        pattern: The pattern to compute the failure function for.

    Returns:
        The failure function for the given pattern.
    """

    failure_function = [0] * len(pattern)

    i = 1
    j = 0
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            failure_function[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = failure_function[j - 1]
        else:
            failure_function[i] = 0
            i += 1

    return failure_function
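
As a worked illustration of what the failure function contains, the sketch below recomputes the table in a compact form and prints it for a few patterns. The name `prefix_table` and the sample patterns are assumptions chosen for illustration.

```python
def prefix_table(pattern):
    """table[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it (hypothetical helper)."""
    table = [0] * len(pattern)
    k = 0  # length of the border currently being extended
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


for p in ("ababaca", "aaaa", "abcd"):
    print(p, prefix_table(p))
# ababaca [0, 0, 1, 2, 3, 0, 1]
# aaaa [0, 1, 2, 3]
# abcd [0, 0, 0, 0]
```

For "ababaca", a mismatch after matching "ababa" lets the search resume with "aba" already matched (table value 3) instead of starting over.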

3. RK Algorithm: Putting Hashing to Clever Use

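The RK algorithm relies on a hash function. It converts the pattern and every length-m substring of the text into numbers, as if issuing each substring its own ID number, and then compares those numbers. Because different strings can share a hash value, a matching hash is followed by a character-by-character comparison to confirm the match. With a rolling hash, each window's hash is derived from the previous one in constant time, so the expected running time is O(m+n); if collisions are frequent, the worst case degrades to O(mn). RK is therefore sensitive to the choice of hash function, and a good one greatly improves its efficiency.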

Code example:

def rabin_karp(text, pattern):
    """
    Rabin-Karp String Matching Algorithm

    Args:
        text: The text to be searched.
        pattern: The pattern to be found.

    Returns:
        The index of the first occurrence of the pattern in the text, or -1 if not found.
    """

    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Polynomial rolling hash parameters (the built-in hash() cannot be rolled).
    base, mod = 256, 1_000_000_007
    # Weight of the leading character of a window: base^(m - 1) mod mod.
    high = pow(base, m - 1, mod)

    # Compute the hash value of the pattern and of the first window of the text.
    pattern_hash = window_hash = 0
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % mod
        window_hash = (window_hash * base + ord(text[k])) % mod

    # Iterate over every window, including the first one at index 0.
    for i in range(n - m + 1):
        # On a hash match, compare the characters to rule out a collision.
        if window_hash == pattern_hash and text[i:i + m] == pattern:
            return i

        # Roll the window: drop text[i] and append text[i + m].
        if i < n - m:
            window_hash = ((window_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod

    return -1
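
The key property behind RK is that the hash of the next window can be derived from the hash of the current one. The following sketch checks that property for a simple polynomial hash; `BASE`, `MOD`, `poly_hash`, and `roll` are illustrative assumptions, and other bases or moduli would work as well.

```python
BASE = 256
MOD = 1_000_000_007


def poly_hash(s):
    """Polynomial hash of s computed from scratch."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def roll(h, outgoing, incoming, high):
    """Update hash h when the window drops `outgoing` and gains `incoming`.

    `high` must equal BASE ** (window_length - 1) % MOD.
    """
    return ((h - ord(outgoing) * high) * BASE + ord(incoming)) % MOD


text, m = "abracadabra", 4
high = pow(BASE, m - 1, MOD)
h = poly_hash(text[:m])
for i in range(1, len(text) - m + 1):
    h = roll(h, text[i - 1], text[i + m - 1], high)
    # The rolled value agrees with a hash recomputed from scratch.
    assert h == poly_hash(text[i:i + m])
print("rolling hash agrees with direct hash for every window")
```

Each update costs constant time regardless of m, which is what keeps the expected running time linear.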

4. BM Algorithm: Efficient Character Matching

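The BM algorithm is known for its clever handling of character comparisons. At each alignment it starts from the last character of the pattern and compares it with the text character aligned with it. If they are equal, it moves on to the preceding character; if not, it shifts the pattern to the right by a distance determined by where the mismatched text character occurs in the pattern (the bad character rule) and by the suffix already matched (the good suffix rule), then starts comparing again from the end of the pattern. This skips many unnecessary comparisons, and on typical text BM inspects only a fraction of the characters. Without further refinements, however, the worst case is still O(mn). BM is also sensitive to the distribution of characters in the pattern: when the pattern contains many repeated characters, its efficiency drops.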

Code example:

def boyer_moore(text, pattern):
    """
    Boyer-Moore String Matching Algorithm

    Args:
        text: The text to be searched.
        pattern: The pattern to be found.

    Returns:
        The index of the first occurrence of the pattern in the text, or -1 if not found.
    """

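    # Preprocess the pattern to compute the bad character table and the good suffix table.
    bad_character_table, good_suffix_table = preprocess_pattern(pattern)

    # Match the pattern with the text, comparing from right to left.
    n, m = len(text), len(pattern)
    i = 0
    while i <= n - m:
        j = m - 1

        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1

        if j == -1:
            return i

        # Shift right by the larger of the two safe shifts (always at least 1).
        bad_character_shift = j - bad_character_table.get(text[i + j], -1)
        i += max(good_suffix_table[j + 1], bad_character_shift)

    return -1

def preprocess_pattern(pattern):
    """
    Preprocess the pattern to compute the bad character table and the good suffix table.

    Args:
        pattern: The pattern to preprocess.

    Returns:
        A tuple of the bad character table and the good suffix table.
    """

    m = len(pattern)

    # Bad character rule: last index at which each character occurs in the pattern.
    bad_character_table = {ch: k for k, ch in enumerate(pattern)}

    # Good suffix rule: good_suffix_table[j + 1] is the shift after a mismatch at index j.
    good_suffix_table = [0] * (m + 1)
    border = [0] * (m + 1)  # border[k]: start of the widest border of pattern[k:]

    # Case 1: the matched suffix occurs elsewhere, preceded by a different character.
    i, j = m, m + 1
    border[i] = j
    while i > 0:
        while j <= m and pattern[i - 1] != pattern[j - 1]:
            if good_suffix_table[j] == 0:
                good_suffix_table[j] = j - i
            j = border[j]
        i -= 1
        j -= 1
        border[i] = j

    # Case 2: only a prefix of the pattern matches a suffix of the matched part.
    j = border[0]
    for i in range(m + 1):
        if good_suffix_table[i] == 0:
            good_suffix_table[i] = j
        if i == j:
            j = border[j]

    return bad_character_table, good_suffix_table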
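
To isolate how a mismatch turns into a skip, the sketch below applies the bad character rule alone and counts how many alignments it tries. This is a simplified variant for illustration, not the full BM algorithm; the function name `bad_character_search` and the sample text are assumptions.

```python
def bad_character_search(text, pattern):
    """Right-to-left matching using only the bad character rule.

    Returns a tuple (index, alignments_tried); index is -1 if not found.
    """
    n, m = len(text), len(pattern)
    # Last index at which each character occurs in the pattern.
    last = {ch: k for k, ch in enumerate(pattern)}
    i = 0
    alignments = 0
    while i <= n - m:
        alignments += 1
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            return i, alignments
        # Line up the mismatched text character with its last occurrence in
        # the pattern; never shift by less than one position.
        i += max(1, j - last.get(text[i + j], -1))
    return -1, alignments


print(bad_character_search("here is a simple example", "example"))
```

With this input the match at index 17 is reached after only 5 alignments, whereas a brute-force scan would try 18.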

5. Sunday Algorithm: Matching Character by Character from Front to Back

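The Sunday algorithm is also based on character comparison. At each alignment it starts from the first character of the pattern and compares it with the text character aligned with it. If they are equal, it moves on to the next character; if not, it shifts the pattern to the right and compares again from the start of the pattern. The shift distance is determined by the text character immediately after the current window: the pattern is moved so that the rightmost occurrence of that character in the pattern lines up with it, or past it entirely if the character does not occur in the pattern. This strategy skips many unnecessary alignments, although the worst-case time complexity is still O(mn). Like the BM algorithm,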