Non-UTF-8 Characters: How Do You Remove Them from a String?

php

2024-03-19 10:34:33

The Ultimate Guide to Removing Non-UTF-8 Characters from a String

Problem

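Have you ever had a string containing bytes that are not valid UTF-8 and therefore fail to display correctly? Such bytes are usually shown in hexadecimal, for example 0x97 0x61 0x6C 0x6F. In this sequence only 0x97 is the problem: it is a continuation byte with no lead byte before it, while 0x61 0x6C 0x6F are the ASCII letters "alo". When this happens, the invalid bytes have to be removed or replaced before the text can be processed reliably.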
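
Before cleaning anything, it can help to confirm that the data really is invalid UTF-8 and to see where it breaks. The following is a minimal sketch, assuming the data is available as a Python bytes object; the helper name first_invalid_utf8_offset is hypothetical and only used for illustration.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None if the data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the decoder could not handle
        return exc.start
    return None


raw = bytes([0x97, 0x61, 0x6C, 0x6F])
print(first_invalid_utf8_offset(raw))     # 0, the lone 0x97 byte
print(first_invalid_utf8_offset(b"alo"))  # None, plain ASCII is valid UTF-8
```

A result of None means there is nothing to clean at the byte level, and any display problem probably has another cause.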

Solution

There are two common ways to remove such characters from a string:

Regular expression

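A regular expression can match and delete unwanted characters. Note that the pattern below matches every character outside the ASCII range (U+0000 to U+007F), not only invalid UTF-8, so valid non-ASCII text such as accented letters or CJK characters is removed as well: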

import re

pattern = re.compile(r'[^\x00-\x7F]')  # matches any character outside the ASCII range
cleaned_string = pattern.sub('', string)  # replace each match with an empty string
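
To make the scope of that pattern concrete, here is a small sketch showing that it strips valid non-ASCII text along with any stray characters. The function name strip_non_ascii and the sample strings are assumptions chosen for illustration.

```python
import re

# Same character class as above: anything outside U+0000..U+007F
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def strip_non_ascii(text: str) -> str:
    """Delete every character that is not plain ASCII."""
    return NON_ASCII.sub('', text)


# "café 你好 ok": the accented letter and both CJK characters are removed
print(strip_non_ascii("caf\u00e9 \u4f60\u597d ok"))  # caf  ok
# U+0097 followed by "alo": only the stray character is removed
print(strip_non_ascii("\x97alo"))                    # alo
```

If the text legitimately contains non-ASCII characters, this approach is likely too aggressive.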

Built-in functions

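Python's unicodedata module provides a function named normalize. It does not convert a string to UTF-8; it applies a Unicode normalization form. Combined with an ASCII encode and decode step, it can be used to reduce a string to its ASCII content.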

import unicodedata

cleaned_string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('ascii')

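This approach normalizes the string to compatibility decomposition form (NFKD), then encodes it as ASCII while ignoring every character that ASCII cannot represent. Finally, the resulting bytes are decoded back into a str. The result contains only ASCII characters, so this also removes valid non-ASCII text, not just invalid UTF-8.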
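
The effect of the NFKD step is easiest to see on a few inputs. The sketch below wraps the same one-liner in a hypothetical helper named to_ascii; the sample strings are illustrative assumptions.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose with NFKD, then drop everything ASCII cannot represent."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# é decomposes into "e" plus a combining accent; the accent is dropped, the "e" survives
print(to_ascii("caf\u00e9"))      # cafe
# The ligature U+FB01 has a compatibility decomposition to "fi"
print(to_ascii("\ufb01le"))       # file
# CJK characters have no ASCII decomposition and disappear entirely
print(to_ascii("\u4f60\u597d"))   # prints an empty line
```

Compared with the regular expression, this keeps the base letter of accented characters instead of deleting them outright.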

Example

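Suppose we have a string containing a character outside the ASCII range, here U+0097 written with the escape \x97 and followed by the letters "alo". The value must be written as an escape: spelling it out as the text 0x97 0x61 0x6C 0x6F would only produce ordinary ASCII characters that neither method removes.

string = "This is a string with non-UTF-8 characters: \x97alo"

Using the regular expression:

import re

pattern = re.compile(r'[^\x00-\x7F]')
cleaned_string = pattern.sub('', string)
print(cleaned_string)  # Output: This is a string with non-UTF-8 characters: alo

Using the built-in functions:

import unicodedata

cleaned_string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('ascii')
print(cleaned_string)  # Output: This is a string with non-UTF-8 characters: alo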
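
Both methods above work on an already decoded str and remove everything outside ASCII. When the input is still raw bytes, an alternative not covered above is to let the UTF-8 decoder drop or replace only the invalid bytes, which keeps valid non-ASCII text intact. This is a minimal sketch, assuming the data arrives as bytes; the helper name clean_utf8 and its replace parameter are hypothetical.

```python
def clean_utf8(data: bytes, replace: bool = False) -> str:
    """Decode bytes as UTF-8, dropping invalid bytes or replacing them with U+FFFD."""
    errors = "replace" if replace else "ignore"
    return data.decode("utf-8", errors=errors)


# "café" encoded as valid UTF-8, then a stray 0x97 byte, then "alo"
raw = b"caf\xc3\xa9 \x97alo"

print(clean_utf8(raw))                # café alo
print(clean_utf8(raw, replace=True))  # café, a space, U+FFFD, then alo
```

Using "replace" instead of "ignore" leaves a visible marker where data was lost, which may be preferable when the cleaned text is reviewed later.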