Behind the Swordsmith Village Subtitles: Building a Word Cloud Visualization

2022-11-22 02:24:03

Visualizing the Swordsmith Village Subtitles: From Retrieval to Word Cloud

1. Fetching the Subtitles

The first step in visualizing the Swordsmith Village subtitles is fetching them. We use the Python requests library for this. First, install it:

pip install requests

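Next, get the ID of the Swordsmith Village video. Go to the bilibili website, find the video, copy its link, and locate the ID in the address bar (the first number in the video link).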
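Copying the ID by hand is easy to get wrong, so it can help to pull it out of the link programmatically. The following is a minimal sketch, assuming the ID is the path segment immediately after /video/ in the link; the helper name extract_video_id and the sample URL are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def extract_video_id(url: str) -> str:
    """Return the path segment that follows /video/ in a video page URL."""
    # Split the path into non-empty segments, e.g. ['video', '123456789'].
    segments = [s for s in urlparse(url).path.split('/') if s]
    if len(segments) < 2 or segments[0] != 'video':
        raise ValueError(f'not a video page URL: {url}')
    return segments[1]


# Hypothetical link; the query string is ignored by urlparse(...).path.
print(extract_video_id('https://www.bilibili.com/video/123456789?from=search'))
```

Using urlparse rather than string slicing means trailing slashes and query parameters in the copied link do not end up inside the ID.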

With the video ID in hand, we can send the request using requests:

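import requests

# Placeholder ID; replace it with the ID copied from the video link.
video_id = '123456789'

# A browser-like User-Agent; the value is elided in the original.
headers = {
    'User-Agent': 'Mozilla/5.0 ... Safari/537.36'
}

# Fetch the video page and fail early on an HTTP error status.
response = requests.get(f'https://www.bilibili.com/video/{video_id}', headers=headers, timeout=10)
response.raise_for_status()

# Note: response.text is the raw body of the page request (HTML),
# so it may need further extraction before it is clean subtitle text.
subtitle_text = response.text

with open('subtitle.txt', 'w', encoding='utf-8') as f:
    f.write(subtitle_text)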

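The response is now saved in subtitle.txt. Because the request targets the video page itself, the file holds the page's raw content, which may contain HTML markup alongside any subtitle text.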
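Since the request above fetches a web page, the saved text is likely to include tags, scripts, and styles that would pollute the word cloud. The following is a minimal sketch of stripping that markup with the standard library's html.parser. The class VisibleTextParser and the function html_to_text are hypothetical names, the sample HTML is made up, and whether the page actually embeds the subtitle text is an assumption the original does not confirm.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)


# Made-up sample page used only for illustration.
sample = ('<html><head><style>p{color:red}</style></head><body>'
          '<p>锻刀村</p><script>var a = 1;</script><p>字幕</p></body></html>')
print(html_to_text(sample))
```

Running html_to_text on the contents of subtitle.txt before segmentation would keep tag names and script code out of the token list.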

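2. Word Segmentation

The next step is word segmentation: splitting the subtitle text into individual words. We use jieba, a popular Chinese word segmentation library: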

pip install jieba

With common stopwords filtered out, the segmentation code looks like this:

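import jieba

# Load one stopword per line from stopwords.txt.
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}

# Read back the text saved in the previous step.
with open('subtitle.txt', 'r', encoding='utf-8') as f:
    subtitle_text = f.read()

# Precise mode (cut_all=False) returns a generator of tokens.
words = jieba.cut(subtitle_text, cut_all=False)

# Drop stopwords and whitespace-only tokens.
filtered_words = [word for word in words if word.strip() and word not in stopwords]

with open('words.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(filtered_words))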

We now have a clean words.txt containing all the filtered tokens.
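A segmenter such as jieba also returns punctuation marks and whitespace as separate tokens, and a stopword list does not always cover them. The following is a minimal sketch of an extra cleaning pass, assuming that tokens made up only of punctuation or symbol characters are unwanted; is_noise and clean_tokens are hypothetical helper names, and the sample tokens are made up.

```python
import unicodedata


def is_noise(token: str) -> bool:
    """True for blank tokens and tokens made only of punctuation or symbols."""
    stripped = token.strip()
    if not stripped:
        return True
    # Unicode categories starting with 'P' are punctuation, 'S' are symbols.
    return all(unicodedata.category(ch)[0] in ('P', 'S') for ch in stripped)


def clean_tokens(tokens, stopwords):
    """Keep tokens that are neither noise nor stopwords."""
    return [t.strip() for t in tokens
            if not is_noise(t) and t.strip() not in stopwords]


# Made-up tokens standing in for the output of a segmenter.
sample_tokens = ['锻刀', '村', '，', ' ', '的', '刀匠', '！', '\n']
print(clean_tokens(sample_tokens, stopwords={'的'}))
```

Checking Unicode categories handles both ASCII and full-width Chinese punctuation without maintaining a hand-written list.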

3. Word Cloud

As the final step, we draw a word cloud to visualize the segmentation results. The wordcloud library handles this:

pip install wordcloud

The code is as follows:

from wordcloud import WordCloud

# words.txt holds one token per line; split() drops the newlines.
with open('words.txt', 'r', encoding='utf-8') as f:
    words = f.read().split()

cloud = WordCloud(
    background_color='white',
    font_path='msyh.ttf',  # a font with Chinese glyphs is needed to render the words
    width=1000,
    height=800,
    max_words=200
)
cloud.generate(' '.join(words))

cloud.to_file('wordcloud.png')
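Joining the tokens with spaces and calling generate lets the library split and count the text again with its own rules. A possible alternative, sketched below, is to count the tokens ourselves and pass the counts to generate_from_frequencies, so the segmentation from the previous step is used as-is. The helper names count_words and render_from_frequencies and the output file name wordcloud_freq.png are hypothetical; the font path is carried over from the code above.

```python
from collections import Counter


def count_words(lines):
    """Count tokens given one token per line, ignoring blank lines."""
    counts = Counter(line.strip() for line in lines)
    counts.pop('', None)  # drop the entry produced by blank lines, if any
    return dict(counts)


def render_from_frequencies(frequencies, font_path='msyh.ttf',
                            out_path='wordcloud_freq.png'):
    """Draw a word cloud sized by the given {word: count} mapping."""
    # Imported here so count_words stays usable without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(
        background_color='white',
        font_path=font_path,
        width=1000,
        height=800,
        max_words=200
    )
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Example usage, assuming words.txt from the segmentation step exists:
# with open('words.txt', 'r', encoding='utf-8') as f:
#     render_from_frequencies(count_words(f))
print(count_words(['刀\n', '刀\n', '\n', '村\n']))
```

Having the counts in a plain dictionary also makes it easy to inspect the most frequent words before rendering.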