One Step a Day: Berkeley CS61A, a Python Project
2023-09-05 02:27:22
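Hi everyone, this is One Step a Day, and I'm Liang Tang. Today we pick up where we left off with UC Berkeley's public course CS61A. This time it's the second big project of the course.
Much like the first project, this one walks you step by step through using a machine learning algorithm to cluster the restaurants around Berkeley into a handful of groups.
The dataset therefore holds information on many restaurants, with a variety of features such as name, location, and cuisine, plus the most important one of all: the restaurant's rating.
Let's see what the project actually asks of us. It breaks down into roughly three steps: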
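- Step one: read in the data.
  Load the data with the Pandas library and convert it to a NumPy array so that the later computations are easy to run.
- Step two: cluster the restaurants with the K-Means algorithm.
  We need to choose the number of clusters K: try several values, see how the clustering quality changes, and then pick the best one.
- Step three: apply PCA to the data.
  This is a dimensionality-reduction technique that lowers the cost of the computation.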
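To make step three a little more concrete, here is a minimal sketch of PCA using scikit-learn's `PCA` class. The data is synthetic (five correlated columns built from two hidden factors), not the restaurant dataset, and the choice of two components is only an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a numeric feature matrix: 200 rows, 5 columns.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))   # two underlying factors
mixing = rng.normal(size=(2, 5))     # how the factors spread over 5 columns
pca_X = hidden @ mixing + 0.01 * rng.normal(size=(200, 5))

# Project the 5 columns down to 2 principal components.
pca = PCA(n_components=2)
pca_reduced = pca.fit_transform(pca_X)

print(pca_reduced.shape)                     # rows are kept, columns shrink
print(pca.explained_variance_ratio_.sum())   # share of variance retained
```

Because the five columns here really come from two factors, two components keep almost all of the variance; on real data the retained share has to be checked rather than assumed.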
That's the whole project. Let's get started by typing the first lines of code into IDLE:
import pandas as pd
import numpy as np
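To explain: the first line imports Pandas, which we use to read the data; the second imports NumPy, which handles the arrays that all the later computation relies on.
Now let's read the data in from a file called restaurants.csv, which can be found on GitHub. The code is: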
data = pd.read_csv('restaurants.csv')
Then let's take a look at what the data looks like:
data.head()
The result:
name cuisine style price rating neighborhood review_count
0 BuaLoy Thai Canteen Thai NaN $ NaN NaN
1 Dumpling Kitchen Chinese NaN NaN NaN NaN
2 Gott's Roadside American NaN $ NaN NaN
3 Super Duper Burgers Burgers NaN $ NaN NaN
4 Banh Mi Ba Le Vietnamese NaN $ NaN NaN
As you can see, the data includes each restaurant's name, cuisine, style, price, rating, neighborhood, and review count.
data.shape
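The result is (10000, 8). In other words, the dataset has 10,000 rows, one per restaurant, and 8 columns.
Next, we convert the data to a NumPy array for the steps that follow. The code is: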
data = data.values
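K-Means works on numbers, so at some point the text columns need a numeric encoding and the missing values need handling. The sketch below shows one possible way to do that. The four rows are taken from this post's sample output, the column names follow the `data.head()` header, and both the "count the dollar signs" encoding and the median fill are assumptions for illustration, not something the project prescribes.

```python
import numpy as np
import pandas as pd

# A tiny stand-in for restaurants.csv (four rows from this post's output).
sample = pd.DataFrame({
    "name": ["Top Dog", "House of Prime Rib", "Dumpling Kitchen", "Flour + Water"],
    "price": ["$", "$$$$", None, "$$$"],
    "rating": [4.2, 4.7, np.nan, 4.4],
    "review_count": [133, 1381, np.nan, 290],
})

features = pd.DataFrame({
    # Encode price as the number of dollar signs; missing prices stay NaN.
    "price_level": sample["price"].str.len(),
    "rating": sample["rating"],
    "review_count": sample["review_count"],
})

# Fill each missing value with its column's median (one choice among many).
features = features.fillna(features.median())

# A float matrix with no NaN, ready for a clustering algorithm.
numeric_X = features.to_numpy(dtype=float)
print(numeric_X)
```

Dropping incomplete rows instead of filling them is an equally reasonable choice; which one is better depends on how many rows are missing data.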
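Next up is clustering the restaurants with the K-Means algorithm. Let's first look at how K-Means works; it's actually quite simple.
Step 1: randomly pick K points as the initial cluster centers.
Step 2: compute the distance from every data point to each of the K centers, and assign each point to the cluster whose center is nearest.
Step 3: recompute each cluster's center as the mean of all the data points in that cluster.
Step 4: repeat steps 2 and 3 until the centers stop changing or the maximum number of iterations is reached.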
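The four steps above can be sketched directly in NumPy. This is a minimal illustration under simple assumptions (Euclidean distance, initial centers drawn from the data points, an empty cluster keeps its old center); the function name `kmeans_sketch` and the toy data are made up for this example, and scikit-learn's own implementation is more elaborate.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-Means following the four steps described above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignments

# Toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
toy_X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
toy_centers, toy_labels = kmeans_sketch(toy_X, k=2)
print(toy_centers)
```

On data this cleanly separated, the two centers settle near the middle of each blob within a few iterations, whichever points were drawn in step 1.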
With that groundwork in place, we can start using K-Means. The code is:
from sklearn.cluster import KMeans
With K-Means imported, we can use it to cluster the data.
model = KMeans(n_clusters=5)
model.fit(data)
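Here we set the number of clusters to 5 and then call fit() to train the model on the data. Note that KMeans expects a purely numeric array with no missing values, so text columns such as name and cuisine have to be encoded or dropped, and NaN entries filled or removed, before this call will run.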
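The value 5 is just one choice. Step two of the project asks for trying several values and comparing them, and one common way to do that is to look at the model's `inertia_` (the sum of squared distances from each point to its nearest center) for each K. The sketch below does this on synthetic data with three obvious groups; the data, the range of K, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic numeric data with three clear groups (a stand-in, not the real dataset).
rng = np.random.default_rng(0)
elbow_X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(60, 2))
    for center in (0.0, 5.0, 10.0)
])

inertias = {}
for k in range(1, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0)
    candidate.fit(elbow_X)
    # inertia_: sum of squared distances from each point to its nearest center.
    inertias[k] = candidate.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia keeps shrinking as K grows, so the lowest number is not the goal. The usual heuristic is to pick the K after which the improvement flattens out, which on this toy data happens at K = 3.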
Now we can get the cluster that each data point belongs to.
labels = model.labels_
Next, let's look at which restaurants ended up in each cluster.
for i in range(5):
    print("Cluster {}:".format(i))
    print(data[labels == i][:10])
The result:
Cluster 0:
[['BuaLoy Thai Canteen' 'Thai' nan nan nan nan]
['Dumpling Kitchen' 'Chinese' nan nan nan nan]
['Gott's Roadside' 'American' nan '$' nan nan]
['Super Duper Burgers' 'Burgers' nan '$' nan nan]
['Banh Mi Ba Le' 'Vietnamese' nan '$' nan nan]
['House of Prime Rib' 'American' 'Steakhouse' '$$$$' 4.7 1381]
['The Cheesecake Factory' 'American' 'Casual Dining' '$$' 4.2 602]
['The Slanted Door' 'Vietnamese' 'Modern Vietnamese' '$$$' 4.4 1051]
['Commis' 'American' 'Modern American' '$$$' 4.3 235]
['Gary Danko' 'American' 'Fine Dining' '$$$$' 4.8 399]]
Cluster 1:
[['Zuni Cafe' 'Californian' 'New American' '$$$' 4.2 427]
['Kokkari Estiatorio' 'Greek' 'Mediterranean' '$$$' 4.5 192]
['State Bird Provisions' 'American' 'New American' '$$$' 4.2 960]
['Bar Tartine' 'American' 'Modern American' '$$$' 4.2 719]
['SPQR' 'Italian' 'Roman' '$$$' 4.3 383]
['Saison' 'American' 'New American' '$$$$' 4.6 1113]
['Sons & Daughters' 'American' 'New American' '$$$' 4.2 161]
['Flour + Water' 'Italian' 'Neapolitan' '$$$' 4.4 290]
['Rich Table' 'American' 'Modern American' '$$$' 4.1 483]
['Atelier Crenn' 'French' 'Fine Dining' '$$$$' 4.7 473]]
Cluster 2:
[['Top Dog' 'American' 'Hot Dogs' '$' 4.2 133]
['The Little Chihuahua' 'Mexican' 'Tacos' '$' 4.3 102]
['Taqueria Cancun' 'Mexican' 'Tacos' '$' 4.2 240]
['La Taqueria' 'Mexican' 'Tacos' '$' 4.3 566]
['El Farolito' 'Mexican' 'Burritos' '$' 4.2 709]
['Tacos El Gordo' 'Mexican' 'Tacos' '$' 4.4 325]
['El Torito' 'Mexican' 'Mexican' '$' 4.2 213]
['Pancho Villa Taqueria' 'Mexican' 'Tacos' '$' 4.3 363]
['La Palma Mexicatessen' 'Mexican' 'Mexican' '$' 4.2 248]
['Taqueria La Cumbre' 'Mexican' 'Tacos' '$' 4.2 224]]
Cluster 3:
[['John's Grill' 'American' 'Traditional American' '$$$' 4.2 265]
['Swan Oyster Depot' 'American' 'Seafood' '$$$' 4.4 227]
['Foreign Cinema' 'American' 'New American' '$$$' 4.3 296]
['Café Jacqueline' 'French' 'French' '$$$' 4.3 97]
['Lers Ros Thai' 'Thai' 'Thai' '$$' 4.4 124]
['Tosca Cafe' 'Italian' 'Italian' '$$$' 4.2 538]
['Brenda's French Soul Food' 'American' 'Soul Food' '$$' 4.2 162]
['Sotto Mare Oysteria & Seafood' 'Italian' 'Seafood' '$$$' 4.2 283]
['The Plant' 'American' 'New American' '$$$' 4.3 302]
['The Progress' 'American' 'New American' '$$$' 4.3 242]]
Cluster 4:
[['La Folie' 'French' 'French' '$$$$' 4.6 259]
['Boulevard' 'French' 'French' '$$$$' 4.6 7