Python Functions for Data Cleaning: Improve Data Quality and Optimize Modeling

2024-03-14 07:56:59

Cleaning Data with Python Functions

Introduction

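Data cleaning is an important step in data analysis and machine learning: it transforms raw data into a format better suited to modeling and analysis. Python offers a rich set of functions for text-cleaning tasks, including removing numbers, stop words, and punctuation, and performing lemmatization.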

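Removing Numbers

Numbers in text data are often not relevant features. We can use a regular expression to match runs of digits and replace them with the empty string. The following function does this: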

import re

def remove_numbers(text):
    return re.sub(r'\d+', '', text)  # delete every run of digits
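
As a quick check of the behaviour, the sketch below applies the same substitution to a sample string. Deleting digits leaves the surrounding spaces in place, so a second helper, `remove_numbers_and_squeeze` (a hypothetical name, not from the original article), also collapses the leftover whitespace.

```python
import re

def remove_numbers(text):
    # Delete every run of one or more digits.
    return re.sub(r'\d+', '', text)

def remove_numbers_and_squeeze(text):
    # Hypothetical variant: also collapse the whitespace left behind.
    without_digits = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', ' ', without_digits).strip()

sample = "Order 66 shipped in 2024"
print(repr(remove_numbers(sample)))              # 'Order  shipped in '
print(repr(remove_numbers_and_squeeze(sample)))  # 'Order shipped in'
```

Whether to squeeze whitespace at this stage is a design choice; the later steps in this article split on whitespace, which tolerates the extra spaces.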

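Removing Stop Words

Stop words are common words that carry little information, such as "the", "is", and "of". Removing them shrinks the vocabulary and reduces noise, which can make models more efficient. Python's NLTK library provides stop-word lists. The following function uses NLTK to remove stop words: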

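from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')

def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    # The NLTK list is lowercase, so compare lowercased tokens.
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])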
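
One pitfall is worth illustrating: NLTK's English list is all lowercase, so a capitalized token such as "The" passes a plain membership test. The sketch below uses a tiny hand-written set, `DEMO_STOP_WORDS` (an assumption made so the example runs without NLTK or its corpus download), and compares case-sensitive and case-insensitive filtering. Both function names are hypothetical.

```python
# Tiny illustrative stop-word set; NOT the NLTK list.
DEMO_STOP_WORDS = {"the", "is", "of", "a"}

def remove_stopwords_exact(text, stop_words=DEMO_STOP_WORDS):
    # Case-sensitive: "The" is kept because only "the" is in the set.
    return ' '.join(w for w in text.split() if w not in stop_words)

def remove_stopwords_ignore_case(text, stop_words=DEMO_STOP_WORDS):
    # Case-insensitive: compare the lowercased token against the set.
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

sample = "The price of a ticket is high"
print(remove_stopwords_exact(sample))        # The price ticket high
print(remove_stopwords_ignore_case(sample))  # price ticket high
```

Tokens that still carry punctuation, such as "is)." or "(the,", will not match in either version, so the order of the cleaning steps matters.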

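Removing Punctuation

Punctuation matters for human readers but is often of little use to a machine learning model. We can delete it with str.translate, using a translation table built from string.punctuation (the ASCII punctuation characters). The following function does this: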

import string

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))
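
`string.punctuation` covers only the 32 ASCII punctuation characters, so curly quotes, ellipses, or CJK punctuation survive the translation table. Below is a minimal sketch of a Unicode-aware alternative, assuming it is acceptable to drop every character whose Unicode general category starts with `P`; the helper name `remove_unicode_punctuation` is hypothetical.

```python
import string
import unicodedata

def remove_punctuation(text):
    # ASCII-only: deletes the 32 characters in string.punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_unicode_punctuation(text):
    # Hypothetical variant: also drop any character whose Unicode
    # general category starts with "P" (Pd, Po, Pi, Pf, ...).
    return ''.join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith('P')
    )

sample = "“Hello”, world… 你好！"
print(remove_punctuation(sample))          # “Hello” world… 你好！
print(remove_unicode_punctuation(sample))  # Hello world 你好
```

For purely ASCII English text the two functions behave the same, and the `str.translate` version is the simpler choice.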

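Lemmatization

Lemmatization reduces a word to its base (dictionary) form, the lemma. Unlike stemming, which chops off suffixes and may produce non-words, a lemmatizer returns a real word. This groups inflected forms, such as plurals or different tenses, under a single token, which shrinks the vocabulary and can help model accuracy. NLTK provides lemmatization through WordNetLemmatizer; note that its lemmatize method treats every word as a noun by default, so verb forms are only reduced when a part-of-speech tag is passed. The following function uses NLTK to perform lemmatization: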

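from nltk.stem import WordNetLemmatizer  # requires a one-time nltk.download('wordnet')

def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    # lemmatize() treats each word as a noun unless a pos argument is given.
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])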
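
To make the role of the part-of-speech tag concrete without requiring NLTK or the WordNet corpus, here is a toy lookup-table lemmatizer. The table `TOY_LEMMAS` and the function `toy_lemmatize` are invented for illustration and are not how WordNet works internally; the point is only that the same surface form can map to different lemmas depending on its part of speech.

```python
# Hand-made (word, part-of-speech) -> lemma table; purely illustrative.
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("running", "v"): "run",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the table has no entry.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("leaves", "n"))   # leaf
print(toy_lemmatize("leaves", "v"))   # leave
print(toy_lemmatize("running"))       # running (no noun entry in the table)
print(toy_lemmatize("running", "v"))  # run
```

With NLTK, the analogous call is `lemmatizer.lemmatize(word, pos='v')`; supplying the tag for each token usually means running a part-of-speech tagger first.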

Example

We can combine these functions into a complete text-cleaning pipeline. The following example shows how to use them to clean a piece of text:

text = "This is a sample sentence with numbers (123) and stop words (the, is). Let's clean it up!"

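# Apply the cleaning functions. Punctuation is stripped before stop words
# so that tokens such as "(the," can match the stop-word list.
cleaned_text = remove_numbers(text)
cleaned_text = remove_punctuation(cleaned_text)
cleaned_text = remove_stopwords(cleaned_text)
cleaned_text = lemmatize(cleaned_text)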

print(cleaned_text)

Output:
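
The chain of reassignments can also be written as a reusable pipeline. The sketch below is self-contained and uses only standard-library steps, so it runs without NLTK; `run_pipeline` and `normalize_whitespace` are hypothetical helpers added for illustration, not part of the original article.

```python
import re
import string
from functools import reduce

def remove_numbers(text):
    return re.sub(r'\d+', '', text)

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def normalize_whitespace(text):
    # Hypothetical extra step: collapse runs of whitespace.
    return ' '.join(text.split())

def run_pipeline(text, steps):
    # Feed the output of each step into the next one, in order.
    return reduce(lambda acc, step: step(acc), steps, text)

steps = [remove_numbers, remove_punctuation, normalize_whitespace]
print(run_pipeline("Room 101: ready, set, go!", steps))  # Room ready set go
```

Because the steps are plain functions in a list, reordering them or adding another one (for example an NLTK-based step) only means editing the list.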