
One Step a Day: Berkeley CS61A, Python Project

Small Talk

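Hi everyone, this is Liang Tang with One Step a Day. Today we pick up where we left off with CS61A, UC Berkeley's open course. This time it is the course's second major project.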

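Much like the first project, this one walks you step by step through using machine learning algorithms to group the restaurants around Berkeley into a handful of small clusters.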

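The dataset therefore holds information on many restaurants, with a variety of features such as name, location, and cuisine, plus the most important one of all: the restaurant's rating.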

Let's see what the project actually asks us to do. It breaks down into roughly three steps:

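  • One: load the data.
    Use the Pandas library to load the data, then convert it to a NumPy array so the later computations are easier to carry out.

  • Two: use the K-Means algorithm to cluster the restaurants.
    We need to decide on the number of clusters: try several values, see how the clustering quality changes, and then pick the best one.

  • Three: apply the PCA algorithm to the data.
    PCA is a technique for reducing the dimensionality of the data, which lowers the cost of the computations that follow.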
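
To make step three a bit more concrete, here is a minimal sketch of PCA using scikit-learn's PCA class. The data is made up for illustration (it is not the restaurant dataset), and the choice of 2 components is an assumption, not something the project prescribes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples with 5 numeric features, where the last
# three features are noisy linear combinations of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 3))
X = np.hstack([base, base @ mix + 0.01 * rng.normal(size=(200, 3))])

# Project the 5-dimensional points onto their top 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
# Fraction of the total variance kept by the 2 components.
print(pca.explained_variance_ratio_.sum())
```

Because the made-up data really only varies along two directions, two components keep almost all of the variance. On real data, how many components to keep is a judgment call.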

That is the whole project. Let's get started by entering the first lines of code in IDLE:

import pandas as pd
import numpy as np

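To explain: the first line imports Pandas, which we use to read the data; the second imports NumPy, which handles the arrays that all the later computation relies on.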

Now let's read in the data from a file called restaurants.csv, which can be found on GitHub. The code:

data = pd.read_csv('restaurants.csv')

Then let's take a look at what the data looks like:

data.head()

The result:

   name        cuisine style        price  rating neighborhood review_count
0  BuaLoy Thai Canteen  Thai         NaN     $         NaN             NaN
1  Dumpling Kitchen    Chinese      NaN    NaN         NaN             NaN
2  Gott's Roadside     American     NaN     $          NaN             NaN
3  Super Duper Burgers  Burgers      NaN     $          NaN             NaN
4  Banh Mi Ba Le      Vietnamese   NaN     $          NaN             NaN

As you can see, the data includes each restaurant's name, cuisine, style, price, rating, neighborhood, and review count.

data.shape

The result is (10000, 8). In other words, the dataset has 10000 restaurants, each with 8 features.
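
Beyond head() and shape, it can help to check column types and missing values early, since the preview above already shows several NaN entries. The sketch below uses a tiny made-up CSV string as a stand-in for restaurants.csv; the rows and the reduced column set are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for restaurants.csv (not the real file).
csv_text = """name,cuisine,price,rating,review_count
Cafe A,Thai,$,4.2,133
Cafe B,Mexican,$$,,240
Cafe C,Italian,,4.4,
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # text columns show up with a non-numeric dtype
print(sample.isna().sum())  # number of missing values in each column
```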

Next we convert the data to a NumPy array for the steps that follow. The code:

data = data.values
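
One thing worth knowing about this conversion: when a DataFrame mixes text and numbers, the resulting array has dtype object, which numeric algorithms generally cannot use directly. Here is a rough sketch of one way to get a purely numeric array instead. The frame is made up, and filling gaps with the column mean is just one simple choice among many, not necessarily what the project intends.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names mirror some of those printed above.
frame = pd.DataFrame({
    "name": ["Cafe A", "Cafe B", "Cafe C"],
    "rating": [4.2, np.nan, 4.4],
    "review_count": [133.0, 240.0, np.nan],
})

mixed = frame.to_numpy()  # same result as frame.values
numeric = frame[["rating", "review_count"]]
# Fill each missing entry with the mean of its column.
features = numeric.fillna(numeric.mean()).to_numpy()

print(mixed.dtype)     # object, because of the text column
print(features.dtype)  # float64
print(features)
```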

Next up is clustering the restaurants with K-Means. Let's first look at how the algorithm works; it is actually quite simple.

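Step one: randomly pick K points as the initial cluster centers.

Step two: compute the distance from every data point to each of the K centers, and assign each point to the cluster whose center is nearest.

Step three: recompute the center of each cluster as the mean of all the data points in it.

Step four: repeat steps two and three until the centers stop changing or the maximum number of iterations is reached.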
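
The four steps above can be sketched in plain NumPy as follows. This is an illustrative toy version, not the scikit-learn implementation; the function name kmeans, its parameters, and the two-blob test data are all made up for this example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means following the four steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two hypothetical, well-separated blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
print(np.bincount(labels, minlength=2))  # points per cluster
print(centers)
```

The result depends on the random starting centers, which is why library implementations usually run several initializations and keep the best one.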

With that groundwork in place, we can start using K-Means. The code:

from sklearn.cluster import KMeans

This imports the K-Means implementation; now we can use it to cluster the data.

model = KMeans(n_clusters=5)
model.fit(data)

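Here we set the number of clusters to 5 and call fit() to train the model on the data. Keep in mind that KMeans works on a purely numeric array with no missing values, so text columns such as name and cuisine, and any NaN entries, have to be encoded, filled in, or dropped before fitting.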
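
The value 5 is only a starting point. One common way to compare candidate values is to fit the model for several cluster counts and look at inertia_, the within-cluster sum of squared distances, watching for the point where it stops dropping sharply (often called the elbow method). The sketch below uses made-up three-blob data rather than the restaurant features, and the range 1 to 6 is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric features: three blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(60, 2))
    for c in ([0, 0], [4, 0], [2, 4])
])

inertias = {}
for k in range(1, 7):
    model_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model_k.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, inertia falls steeply up to the true number of blobs and then flattens out. Real data rarely gives such a clean elbow, so treat it as one signal rather than a rule.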

Now we can get the cluster that each data point belongs to.

labels = model.labels_
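
A quick sanity check on the labels is to count how many points landed in each cluster and to summarize a numeric feature per cluster. The sketch below uses six made-up ratings and labels instead of the real model output.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and cluster labels for six restaurants.
ratings = np.array([4.2, 4.4, 4.7, 4.1, 4.8, 4.3])
labels = np.array([0, 1, 0, 2, 1, 0])

sizes = np.bincount(labels)  # number of restaurants in each cluster
# Average rating within each cluster, indexed by cluster label.
mean_rating = pd.Series(ratings).groupby(labels).mean()

print(sizes)
print(mean_rating)
```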

Next, let's see which restaurants are in each cluster.

for i in range(5):
    print("Cluster {}:".format(i))
    print(data[labels == i][:10])

The result:

Cluster 0:
[['BuaLoy Thai Canteen' 'Thai' nan nan nan nan]
 ['Dumpling Kitchen' 'Chinese' nan nan nan nan]
 ['Gott's Roadside' 'American' nan '$' nan nan]
 ['Super Duper Burgers' 'Burgers' nan '$' nan nan]
 ['Banh Mi Ba Le' 'Vietnamese' nan '$' nan nan]
 ['House of Prime Rib' 'American' 'Steakhouse' '$$$$' 4.7 1381]
 ['The Cheesecake Factory' 'American' 'Casual Dining' '$$' 4.2 602]
 ['The Slanted Door' 'Vietnamese' 'Modern Vietnamese' '$$$' 4.4 1051]
 ['Commis' 'American' 'Modern American' '$$$' 4.3 235]
 ['Gary Danko' 'American' 'Fine Dining' '$$$$' 4.8 399]]
Cluster 1:
[['Zuni Cafe' 'Californian' 'New American' '$$$' 4.2 427]
 ['Kokkari Estiatorio' 'Greek' 'Mediterranean' '$$$' 4.5 192]
 ['State Bird Provisions' 'American' 'New American' '$$$' 4.2 960]
 ['Bar Tartine' 'American' 'Modern American' '$$$' 4.2 719]
 ['SPQR' 'Italian' 'Roman' '$$$' 4.3 383]
 ['Saison' 'American' 'New American' '$$$$' 4.6 1113]
 ['Sons & Daughters' 'American' 'New American' '$$$' 4.2 161]
 ['Flour + Water' 'Italian' 'Neapolitan' '$$$' 4.4 290]
 ['Rich Table' 'American' 'Modern American' '$$$' 4.1 483]
 ['Atelier Crenn' 'French' 'Fine Dining' '$$$$' 4.7 473]]
Cluster 2:
[['Top Dog' 'American' 'Hot Dogs' '$' 4.2 133]
 ['The Little Chihuahua' 'Mexican' 'Tacos' '$' 4.3 102]
 ['Taqueria Cancun' 'Mexican' 'Tacos' '$' 4.2 240]
 ['La Taqueria' 'Mexican' 'Tacos' '$' 4.3 566]
 ['El Farolito' 'Mexican' 'Burritos' '$' 4.2 709]
 ['Tacos El Gordo' 'Mexican' 'Tacos' '$' 4.4 325]
 ['El Torito' 'Mexican' 'Mexican' '$' 4.2 213]
 ['Pancho Villa Taqueria' 'Mexican' 'Tacos' '$' 4.3 363]
 ['La Palma Mexicatessen' 'Mexican' 'Mexican' '$' 4.2 248]
 ['Taqueria La Cumbre' 'Mexican' 'Tacos' '$' 4.2 224]]
Cluster 3:
[['John's Grill' 'American' 'Traditional American' '$$$' 4.2 265]
 ['Swan Oyster Depot' 'American' 'Seafood' '$$$' 4.4 227]
 ['Foreign Cinema' 'American' 'New American' '$$$' 4.3 296]
 ['Café Jacqueline' 'French' 'French' '$$$' 4.3 97]
 ['Lers Ros Thai' 'Thai' 'Thai' '$$' 4.4 124]
 ['Tosca Cafe' 'Italian' 'Italian' '$$$' 4.2 538]
 ['Brenda's French Soul Food' 'American' 'Soul Food' '$$' 4.2 162]
 ['Sotto Mare Oysteria & Seafood' 'Italian' 'Seafood' '$$$' 4.2 283]
 ['The Plant' 'American' 'New American' '$$$' 4.3 302]
 ['The Progress' 'American' 'New American' '$$$' 4.3 242]]
Cluster 4:
[['La Folie' 'French' 'French' '$$$$' 4.6 259]
 ['Boulevard' 'French' 'French' '$$$$' 4.6 7