Machine Learning Code Snippet Handbook: Data Processing, Feature Engineering, and Modeling in One Place

2024-01-31 19:39:25

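As data scientists and machine learning engineers, we work with a large number of code snippets every day. Collecting the ones we use most often into a single handbook saves time and improves code quality. This article gathers the snippets I use most frequently in data analysis and machine learning projects, covering each stage from data processing through feature engineering to modeling. I hope it serves as a useful reference for fellow practitioners.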

Data Processing

Pandas Settings

import pandas as pd

# Show all columns
pd.set_option('display.max_columns', None)

# Show all rows
pd.set_option('display.max_rows', None)

# Display floats with two decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)
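The settings above change the display globally for the rest of the session. When the change should only apply to a single printout, pandas also offers `pd.option_context`. The following is a minimal sketch; the frame `df` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical demo frame; the column names are illustrative only.
df = pd.DataFrame({"price": [1.23456, 2.34567], "qty": [10, 20]})

# The options apply only inside the `with` block and are restored afterwards.
with pd.option_context(
    "display.max_columns", None,
    "display.float_format", "{:.2f}".format,
):
    scoped_repr = repr(df)

print(scoped_repr)
```

This keeps notebooks and shared scripts from inheriting display settings they did not ask for.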

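Visualization

import matplotlib.pyplot as plt

# Plot a histogram
plt.hist(data, bins=20)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Data')
plt.show()

# Plot a scatter plot (plt.show() above keeps it off the histogram's axes)
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot of x and y')
plt.show()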
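To place both plots side by side in one figure, matplotlib's object-oriented interface is often easier to control. The sketch below assumes synthetic data generated with NumPy (the arrays `data`, `x`, and `y` are stand-ins, not real project data) and renders to an in-memory buffer so it also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data (an assumption for this sketch).
rng = np.random.default_rng(0)
data = rng.normal(size=500)
x = rng.uniform(size=100)
y = 2 * x + rng.normal(scale=0.1, size=100)

# One figure with two axes: histogram on the left, scatter plot on the right.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

counts, bin_edges, _ = ax_hist.hist(data, bins=20)
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Histogram of Data")

ax_scatter.scatter(x, y)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter Plot of x and y")

fig.tight_layout()

# Render to an in-memory PNG; pass a file path to savefig to write to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```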

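Jieba Word Segmentation

import jieba

# Segment the text (jieba.cut returns a generator of tokens)
words = jieba.cut(text)

# Load the stopword list, one word per line.
# jieba.load_userdict adds custom vocabulary and returns None, so it cannot load stopwords.
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}

# Filter out stopwords
words = [word for word in words if word not in stopwords]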
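The loading and filtering steps can be wrapped in two small helpers. This is only a sketch: `load_stopwords` and `remove_stopwords` are hypothetical names, the stopword file is created in a temporary directory for the demo, and the `tokens` list is a hand-written stand-in for the output of `jieba.cut`.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are non-blank and not in the stopword set."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Demo stopword file written to a temporary directory (illustrative content).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stopwords.txt"
    path.write_text("的\n了\n是\n", encoding="utf-8")
    stopwords = load_stopwords(path)

# Hand-written tokens standing in for list(jieba.cut(text)).
tokens = ["我", "喜欢", "的", "是", "机器", "学习"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Storing the stopwords in a set keeps each membership check constant-time, which matters when filtering long documents.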

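Missing Value Handling

# Fill missing values in numeric columns with the column mean
# (numeric_only=True avoids an error when the frame also has string columns)
data.fillna(data.mean(numeric_only=True), inplace=True)

# Fill missing values in string columns (fills any values still missing)
data.fillna('Unknown', inplace=True)

# Drop rows that contain missing values
data.dropna(inplace=True)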
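When a frame mixes numeric and string columns, it can be clearer to impute each group explicitly rather than relying on the order of two `fillna` calls. A minimal sketch, assuming a small made-up frame with one numeric and one string column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one string column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "city": ["Beijing", None, "Shanghai"],
})

# Split the columns by dtype.
numeric_cols = df.select_dtypes(include="number").columns
other_cols = df.select_dtypes(exclude="number").columns

# Numeric columns get their own column mean; the rest get a placeholder.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
df[other_cols] = df[other_cols].fillna("Unknown")

print(df)
```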

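Feature Distribution

import seaborn as sns

# Count the occurrences of each value
data['feature'].value_counts()

# Plot the distribution (recent seaborn versions expect the column as a keyword argument)
sns.countplot(x='feature', data=data)
plt.title('Distribution of Feature')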
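Two `value_counts` parameters are worth knowing when inspecting a distribution: `dropna=False` includes missing values in the count, and `normalize=True` returns proportions instead of counts. A short sketch on a made-up series:

```python
import pandas as pd

# Hypothetical categorical feature with one missing value.
feature = pd.Series(["a", "b", "a", None, "a", "b"], name="feature")

# Counts including the missing value as its own bucket.
counts = feature.value_counts(dropna=False)

# Proportions among non-missing values (missing values are dropped by default).
shares = feature.value_counts(normalize=True)

print(counts)
print(shares)
```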

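Feature Engineering

Data Normalization (Standardization)

from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
# (it standardizes each feature to zero mean and unit variance;
# use MinMaxScaler instead to rescale features to the [0, 1] range)
scaler = StandardScaler()

# Scale the data (returns a NumPy array, not a DataFrame)
data = scaler.fit_transform(data)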
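In a modeling workflow, the scaler is usually fitted on the training split only and then applied to the test split, so that test statistics do not leak into training. The sketch below assumes synthetic data and an arbitrary 75/25 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn mean and standard deviation from the training split only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test split.
X_test_scaled = scaler.transform(X_test)
```

The test split will generally not have exactly zero mean and unit variance after scaling, which is expected.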

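Upsampling and Downsampling

from sklearn.utils import resample

# Split by class (assumes label 1 is the majority class)
majority = data[data['label']==1]
minority = data[data['label']==0]

# Upsample: draw from the minority class with replacement, up to the majority size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority))

# Downsample: draw from the majority class without replacement, down to the minority size
majority_downsampled = resample(majority, replace=False, n_samples=len(minority))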
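After resampling, the resampled class still has to be recombined with the other class to form the balanced dataset. A minimal sketch of the upsampling case, assuming a made-up frame with a 90/10 class imbalance and label 1 as the majority class:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced frame: 90 rows of class 1, 10 rows of class 0.
df = pd.DataFrame({"feature": range(100), "label": [1] * 90 + [0] * 10})

majority = df[df["label"] == 1]
minority = df[df["label"] == 0]

# Draw minority rows with replacement until they match the majority count.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```

Resampling is normally applied to the training split only, so that duplicated rows do not end up in the evaluation data.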

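Modeling

Regression Model

from sklearn.linear_model import LinearRegression

# Instantiate LinearRegression
model = LinearRegression()

# Train the model (X and y are the training features and targets)
model.fit(X, y)

# Predict on the test features
y_pred = model.predict(X_test)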
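The snippet above leaves out how the training and test sets are produced and how the predictions are scored. One possible end-to-end sketch follows, assuming synthetic data generated from a known line (slope 3, intercept 2) plus noise; the split ratio and metrics are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + 2 plus Gaussian noise (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score the predictions on held-out data.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"coef={model.coef_[0]:.3f} intercept={model.intercept_:.3f}")
print(f"MSE={mse:.3f} R^2={r2:.3f}")
```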

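Classification Model

from sklearn.linear_model import LogisticRegression

# Instantiate LogisticRegression
model = LogisticRegression()

# Train the model (X and y are the training features and labels)
model.fit(X, y)

# Predict class labels for the test features
y_pred = model.predict(X_test)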
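For classification, a stratified split and class probabilities are often useful alongside the predicted labels. The following is a sketch on a synthetic dataset from `make_classification`; the dataset parameters and `max_iter=1000` are assumptions made for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)        # hard class labels
proba = model.predict_proba(X_test)   # one probability column per class
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy={accuracy:.3f}")
```

Each row of `proba` sums to 1, and the column order follows `model.classes_`.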

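The snippets above are only a selection; real projects will call for many more. I hope this handbook helps you write machine learning code faster and with better quality, so that data analysis and machine learning tasks get done more efficiently.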