Data Structures: Exploring and Applying a Hand-Written Trie

2024-01-11 00:37:14

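Introduction
A trie, also known as a prefix tree or word-lookup tree, is a tree-shaped data structure for storing a set of keys. It is typically used to count and sort large numbers of strings (though it is not limited to strings), which is why search engines often use it for word-frequency statistics. Its main advantage is that keys with a common prefix share nodes, which avoids redundant string comparisons: looking up a key of length L takes O(L) steps no matter how many keys are stored. For exact lookups that is comparable to a hash table, and unlike a hash table a trie also answers prefix queries and can list its keys in sorted order.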
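As a rough illustration of the word-frequency use case, the sketch below counts words with a small trie and then lists them in lexicographic order. The names `CountingTrieNode`, `add_word`, and `words_in_order` are hypothetical helpers introduced only for this example, and the sample sentence is made up.

```python
class CountingTrieNode:
    def __init__(self):
        self.children = {}  # character -> CountingTrieNode
        self.count = 0      # how many times a word ending at this node was added


def add_word(root, word):
    """Walk (and create) the path for `word`, then bump its counter."""
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = CountingTrieNode()
        node = node.children[ch]
    node.count += 1


def words_in_order(node, prefix=""):
    """Yield (word, count) pairs in lexicographic order via a depth-first walk."""
    if node.count:
        yield prefix, node.count
    for ch in sorted(node.children):
        yield from words_in_order(node.children[ch], prefix + ch)


root = CountingTrieNode()
for w in "the cat saw the other cat and the dog".split():
    add_word(root, w)

frequencies = list(words_in_order(root))
print(frequencies)
# [('and', 1), ('cat', 2), ('dog', 1), ('other', 1), ('saw', 1), ('the', 3)]
```

Visiting children in sorted order is what makes the output come out sorted; the counting itself only needs the per-node counter.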

How it works

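A trie is a multi-way tree that stores a set of strings. Every node other than the root represents one character, and the root represents the empty string, so the path from the root to a node spells out a prefix. Whenever two strings begin with the same prefix, they share the nodes along that prefix.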

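For example, the words "apple" and "apricot" share the prefix "ap", so both pass through the same "a" and "p" nodes and branch apart only below them. A word such as "banana", which has no prefix in common with them, starts its own branch at the root.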
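To make prefix sharing concrete, here is a minimal sketch that builds a trie out of nested dicts and counts its character nodes. The helpers `build` and `count_char_nodes`, the word list, and the `"$"` end marker are assumptions made for this illustration only.

```python
def build(words):
    """Build a nested-dict trie; each key is one character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker; "$" is assumed not to occur in keys
    return root


def count_char_nodes(node):
    """Count character nodes below `node`, skipping end-of-word markers."""
    return sum(1 + count_char_nodes(child)
               for key, child in node.items() if key != "$")


words = ["apple", "apricot", "banana"]
trie = build(words)

total_chars = sum(len(w) for w in words)
node_count = count_char_nodes(trie)
print(total_chars, node_count)   # 18 16
print(sorted(trie["a"]["p"]))    # ['p', 'r']
```

The three words contain 18 characters in total but need only 16 nodes, because "apple" and "apricot" reuse the "a" and "p" nodes and split into "p" and "r" below them.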

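Lookups are efficient because a search follows a single path from the root, one node per character of the key. Every stored string that begins with a given prefix lies in the subtree below that prefix's node, so one walk down the prefix is enough to reach all of them.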
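The following sketch shows, under simple assumptions, that the number of steps in a lookup depends on the key's length and not on how many words are stored. The `insert` and `search` functions, the `END` marker, and the generated filler words are all hypothetical and exist only for this demonstration.

```python
from itertools import product

END = "$"  # end-of-word marker; assumed never to appear inside a key


def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True


def search(root, word):
    """Return (found, steps), where steps is the number of child links followed."""
    node = root
    steps = 0
    for ch in word:
        if ch not in node:
            return False, steps
        node = node[ch]
        steps += 1
    return END in node, steps


root = {}
for w in ["apple", "apricot", "banana"]:
    insert(root, w)
before = search(root, "apple")

# Add 81 unrelated four-letter words and look up the same key again.
for letters in product("xyz", repeat=4):
    insert(root, "".join(letters))
after = search(root, "apple")

print(before, after)           # (True, 5) (True, 5)
print(search(root, "app"))     # (False, 3)
print(search(root, "cherry"))  # (False, 0)
```

A missing key stops the walk at the first character that has no matching child, which is why "cherry" is rejected without following any link.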

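Applications

Tries have many applications, including:

Autocompletion in text editors: as the user types, a trie can be used to suggest the words that start with what has been typed so far.
Spell checking in search engines: when the user enters a query, a trie of known words can be used to check whether each word in the query is spelled correctly.
Spam filtering: known spam sender addresses are stored in a trie, and the sender address of each incoming message is checked against it.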
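The autocompletion use case can be sketched as follows. This is one possible approach, not a definitive implementation: the class `AutocompleteTrie`, its `complete` method, and the `limit` parameter are assumptions introduced for this example.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.is_end = False  # True if a stored word ends at this node


class AutocompleteTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.is_end = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored words starting with `prefix`, in sorted order."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored word has this prefix
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_end:
                results.append(text)
            # Push children in reverse order so the smallest character is popped first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


suggestions = AutocompleteTrie()
for w in ["app", "apple", "apply", "apt", "banana"]:
    suggestions.insert(w)

print(suggestions.complete("ap"))           # ['app', 'apple', 'apply', 'apt']
print(suggestions.complete("ap", limit=2))  # ['app', 'apple']
print(suggestions.complete("c"))            # []
```

The same exact-match lookup that backs a spell checker or an address blocklist is the `search` operation shown in the implementations later in this article.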

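Implementation

Each node's children can be stored either in a fixed-size array or in a linked, dynamically sized container; the second implementation below uses a Python dict for this. The array version is simpler but only handles a fixed alphabet (here the lowercase letters a-z) and reserves 26 slots in every node, while the linked version is more flexible: it accepts any character and only stores the children that actually exist.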
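To give a feel for the trade-off, the sketch below counts how many child slots each layout holds after inserting the same two words. The node classes and helper functions are hypothetical and written only for this comparison, and the figures are slot counts, not measured memory.

```python
class ArrayNode:
    def __init__(self):
        self.children = [None] * 26  # one slot per lowercase letter
        self.is_end = False


class DictNode:
    def __init__(self):
        self.children = {}           # only children that exist are stored
        self.is_end = False


def insert_array(root, word):
    node = root
    for ch in word:
        i = ord(ch) - ord("a")       # assumes `word` contains only a-z
        if node.children[i] is None:
            node.children[i] = ArrayNode()
        node = node.children[i]
    node.is_end = True


def insert_dict(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = DictNode()
        node = node.children[ch]
    node.is_end = True


def array_slots(node):
    """Total child slots allocated in this array-based subtree."""
    return len(node.children) + sum(
        array_slots(child) for child in node.children if child is not None)


def dict_slots(node):
    """Total child entries stored in this dict-based subtree."""
    return len(node.children) + sum(
        dict_slots(child) for child in node.children.values())


array_root, dict_root = ArrayNode(), DictNode()
for w in ["apple", "banana"]:
    insert_array(array_root, w)
    insert_dict(dict_root, w)

print(array_slots(array_root), dict_slots(dict_root))  # 312 11

# The dict layout also accepts characters outside a-z.
insert_dict(dict_root, "naïve")
```

The two words create 12 nodes including the root, so the array layout holds 12 × 26 = 312 slots while the dict layout holds 11 entries. A Python dict has its own per-object overhead, so the real memory difference is smaller than these counts suggest.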

Array implementation

class TrieNode:
    def __init__(self):
        self.children = [None]*26
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = self.getNode()

    def getNode(self):
        return TrieNode()

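    def _charToIndex(self,ch):
        # Map 'a'..'z' to 0..25; any other character has no slot in the array
        index = ord(ch)-ord('a')
        if not 0 <= index < 26:
            raise ValueError("key must contain only lowercase letters a-z")
        return index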

    def insert(self,key):
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])

            # If the current character has no child node yet, create one
            if not pCrawl.children[index]:
                pCrawl.children[index] = self.getNode()
            pCrawl = pCrawl.children[index]

        # Mark the node of the last character as the end of a word
        pCrawl.isEndOfWord = True

    def search(self, key):
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])
            if not pCrawl.children[index]:
                return False
            pCrawl = pCrawl.children[index]

        # The key is present only if its path ends on a node marked as a word
        return pCrawl.isEndOfWord

# Using the trie
trie = Trie()
trie.insert("apple")
trie.insert("banana")
print(trie.search("apple"))  # True
print(trie.search("app"))  # False
print(trie.search("banana"))  # True

Linked (dict-based) implementation

class TrieNode:
    def __init__(self):
        self.children = {}
        self.isEndOfWord = False

class Trie:
    def __init__(self):
        self.root = self.getNode()

    def getNode(self):
        return TrieNode()

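    def _charToIndex(self,ch):
        # Every character maps to a distinct integer key, so the dict of
        # children is not limited to 'a'..'z'
        return ord(ch)-ord('a')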

    def insert(self,key):
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])

            # If the current character has no child node yet, create one
            if not pCrawl.children.get(index):
                pCrawl.children[index] = self.getNode()
            pCrawl = pCrawl.children[index]

        # Mark the node of the last character as the end of a word
        pCrawl.isEndOfWord = True

    def search(self, key):
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])
            if not pCrawl.children.get(index):
                return False
            pCrawl = pCrawl.children[index]

        # The key is present only if its path ends on a node marked as a word
        return pCrawl.isEndOfWord

# Using the trie
trie = Trie()
trie.insert("apple")
trie.insert("banana")
print(trie.search("apple"))  # True
print(trie.search("app"))  # False
print(trie.search("banana"))  # True