LeetCode 1661: Design and Implementation of a Data Structure for Fuzzy Word Search

2024-02-09 02:37:25

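To solve LeetCode problem 1661, we need a data structure that stores words and efficiently supports three operations: adding a new word, checking whether a string matches any previously added word, and running fuzzy queries.

Data structure design

We will use a trie (also called a prefix tree). A trie is a tree in which each node represents one letter and each child of a node represents a possible next letter. A word is represented by the path that starts at the root and follows the edges letter by letter down to the node for the word's last letter. Words that share a prefix share the nodes along that prefix.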
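To make the shape of the tree concrete, here is a minimal sketch that builds a trie out of plain nested dictionaries and prints it. The helper name `build_trie` and the `"$"` end-of-word marker are assumptions made for this illustration only; the sketch also assumes that stored words never contain `"$"`.

```python
# Hypothetical illustration: a trie represented as nested dicts.
END = "$"  # assumed end-of-word marker, chosen only for this sketch


def build_trie(words):
    """Return a nested-dict trie containing every word in words."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # setdefault reuses an existing child or creates an empty one.
            node = node.setdefault(letter, {})
        node[END] = True  # mark that a complete word ends here
    return root


trie = build_trie(["car", "cat"])
print(trie)
# {'c': {'a': {'r': {'$': True}, 't': {'$': True}}}}
```

The two words share the nodes for "c" and "a" and only branch at the last letter.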

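Adding a new word

To add a new word, we start at the root and process the word one letter at a time. If the current node already has a child for the letter, we reuse that child; otherwise we create a new node and attach it to the current node. We then move to that child and continue with the next letter. After the last letter, we mark the final node as the end of a word.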
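The reuse of existing nodes can be observed directly by counting how many nodes each insertion creates. The sketch below assumes a node with a `children` dictionary and an `is_word` flag; the names `Node` and `insert_counting` are hypothetical and exist only for this example.

```python
class Node:
    """Minimal trie node for this sketch: child links plus an end-of-word flag."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def insert_counting(root, word):
    """Insert word below root and return how many new nodes were created."""
    created = 0
    current = root
    for letter in word:
        if letter not in current.children:
            current.children[letter] = Node()
            created += 1
        current = current.children[letter]
    current.is_word = True
    return created


root = Node()
print(insert_counting(root, "apple"))  # 5: every letter needs a new node
print(insert_counting(root, "apply"))  # 1: "appl" is reused, only "y" is new
print(insert_counting(root, "app"))    # 0: the path exists, only the flag is set
```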

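Checking whether a string matches a previously added string

To check whether a string matches a previously added string, we start at the root and follow the edges letter by letter. If a required child is missing at any point, the string was never added. If we reach the node for the last letter, the string matches only if that node is marked as the end of a word; otherwise the string is merely a prefix of some stored word.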
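The difference between a full match and a prefix match is easy to miss, so here is a small self-contained sketch of it. It uses a nested-dict trie with an assumed `"$"` end-of-word marker, and the helper names `insert`, `walk`, `has_prefix`, and `has_word` are hypothetical; words are assumed not to contain `"$"`.

```python
END = "$"  # assumed end-of-word marker, chosen only for this sketch


def insert(root, word):
    """Add word to the nested-dict trie rooted at root."""
    node = root
    for letter in word:
        node = node.setdefault(letter, {})
    node[END] = True


def walk(root, text):
    """Follow text letter by letter; return the node reached, or None."""
    node = root
    for letter in text:
        if letter not in node:
            return None
        node = node[letter]
    return node


def has_prefix(root, text):
    """True if some stored word starts with text."""
    return walk(root, text) is not None


def has_word(root, text):
    """True only if text itself was added as a complete word."""
    node = walk(root, text)
    return node is not None and END in node


root = {}
insert(root, "apple")
print(has_prefix(root, "app"))  # True: the path a -> p -> p exists
print(has_word(root, "app"))    # False: no word ends at that node
print(has_word(root, "apple"))  # True
```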

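Fuzzy queries

A fuzzy query looks for strings that are similar, but not necessarily identical, to a given string. To implement it, we can use dynamic programming to compute the edit distance between two strings. The edit distance is the minimum number of edit operations needed to turn one string into the other, where an edit operation is an insertion, a deletion, or a substitution of a single character.

The smaller the edit distance, the more similar the two strings are. We treat two strings as similar when their edit distance does not exceed a chosen threshold.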
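As a worked example of the definition, the sketch below computes the edit distance while keeping only two rows of the dynamic programming table in memory. The function name `edit_distance` is an assumption for this illustration, and the two-row layout is one possible variant rather than the only way to organize the computation.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, using two rows of the DP table."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))  # turning "" into b[:j] takes j insertions
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("apple", "aple"))      # 1
print(edit_distance("", "dog"))            # 3
```

For "kitten" and "sitting", the three edits are: substitute "k" with "s", substitute "e" with "i", and insert "g" at the end.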

Code implementation

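class TrieNode:
    """A single trie node: child links keyed by letter plus an end-of-word flag."""

    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    """Prefix tree supporting insertion, exact lookup, and prefix lookup."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add word, creating nodes only for letters not already on the path."""
        current = self.root
        for letter in word:
            if letter not in current.children:
                current.children[letter] = TrieNode()
            current = current.children[letter]
        current.is_word = True

    def search(self, word):
        """Return True only if word was added as a complete word."""
        current = self.root
        for letter in word:
            if letter not in current.children:
                return False
            current = current.children[letter]
        return current.is_word

    def starts_with(self, prefix):
        """Return True if at least one stored word starts with prefix."""
        current = self.root
        for letter in prefix:
            if letter not in current.children:
                return False
            current = current.children[letter]
        return True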

def fuzzy_search(word1, word2):
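    """
    Compute the edit distance between two strings.

    The edit distance is the minimum number of edit operations needed to
    turn one string into the other. An edit operation is an insertion,
    a deletion, or a substitution of a single character.

    Args:
        word1: The first string.
        word2: The second string.

    Returns:
        The edit distance between the two strings.
    """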
    m, n = len(word1), len(word2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        dp[i][0] = i

    for j in range(1, n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if word1[i - 1] == word2[j - 1]:
                cost = 0
            else:
                cost = 1

            dp[i][j] = min(
                dp[i - 1][j] + 1,  # deletion
                dp[i][j - 1] + 1,  # insertion
                dp[i - 1][j - 1] + cost  # substitution (or a match when cost is 0)
            )

    return dp[m][n]

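def find_similar_words(trie, word, max_edit_distance):
    """
    Find stored words that are similar to the given string.

    Two strings are similar when their edit distance is less than or
    equal to the given maximum edit distance.

    Args:
        trie: The trie to search.
        word: The query string.
        max_edit_distance: The maximum edit distance allowed.

    Returns:
        A list of stored words similar to the query string.
    """
    similar_words = []

    def dfs(node, current_word, prev_row):
        # prev_row[j] is the edit distance between current_word and word[:j].
        if node.is_word and prev_row[-1] <= max_edit_distance:
            similar_words.append(current_word)

        for letter, child_node in node.children.items():
            # Extend the DP table by one row for the prefix current_word + letter.
            row = [prev_row[0] + 1]
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == letter else 1
                row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
            # Prune: if every entry exceeds the limit, no longer word can recover.
            if min(row) <= max_edit_distance:
                dfs(child_node, current_word + letter, row)

    dfs(trie.root, "", list(range(len(word) + 1)))

    return similar_words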


if __name__ == "__main__":
    trie = Trie()
    trie.insert("apple")
    trie.insert("banana")
    trie.insert("cherry")
    trie.insert("dog")
    trie.insert("elephant")

    print(trie.search("apple"))  # True
    print(trie.search("banana"))  # True
    print(trie.search("cherry"))  # True
    print(trie.search("dog"))  # True
    print(trie.search("elephant"))  # True
    print(trie.search("cat"))  # False

    print(trie.starts_with("app"))  # True
    print(trie.starts_with("ban"))  # True
    print(trie.starts_with("che"))  # True
    print(trie.starts_with("dog"))  # True
    print(trie.starts_with("ele"))  # True
    print(trie.starts_with("ca"))  # False

    similar_words = find_similar_words(trie, "apple", 1)
    print(similar_words)  # ['apple']

    similar_words = find_similar_words(trie, "banana", 2)
    print(similar_words)  # ['banana']