Python x Jin Yong = A Code Journey Through the Wuxia World

2024-02-11 05:00:57

Introduction

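If you are a wuxia fan, you have probably imagined yourself roaming the jianghu of Jin Yong's novels, a world of swift justice and settled scores. With Python, that world is a little closer. This post walks through exploring Jin Yong's novels with Python in three steps: scraping a website, organizing the data, and matching patterns with regular expressions.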

1. Web Scraping: Starting the Jin Yong Data Journey

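We start by fetching data from a Jin Yong novel website. BeautifulSoup is a good fit here, since it makes parsing HTML documents straightforward. Install it first, together with requests and pandas, which the code below also uses:

pip install requests beautifulsoup4 pandas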

The code is as follows:

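import requests
from bs4 import BeautifulSoup

url = 'https://www.jinyongwang.com/'
# Fetch the home page; fail fast on timeouts and HTTP errors
response = requests.get(url, timeout=10)
response.raise_for_status()
# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(response.text, 'html.parser')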

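With the page parsed, the next step is to locate the data. BeautifulSoup's select method takes a CSS selector (BeautifulSoup itself does not support XPath), which is enough to pinpoint the novel links: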

novel_list = soup.select('div.novellist_1 ul li a')
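
To make the selector concrete, here is a minimal sketch that runs it against a small inline HTML fragment instead of the live site. The fragment, its hrefs, and the names sample_html and sample_soup are made up for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, shaped to match the selector used in this post.
sample_html = """
<div class="novellist_1">
  <ul>
    <li><a href="/fei/">飞狐外传</a></li>
    <li><a href="/xue/">雪山飞狐</a></li>
  </ul>
</div>
<div class="other">
  <ul><li><a href="/about/">About</a></li></ul>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Descendant combinator: every <a> inside <li> inside <ul> inside div.novellist_1.
links = sample_soup.select("div.novellist_1 ul li a")

titles = [a.get_text(strip=True) for a in links]
hrefs = [a["href"] for a in links]

print(titles)
print(hrefs)
```

The link inside div.other is not selected, because the selector requires an ancestor div with the class novellist_1.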

2. Organizing the Data: Building Your Jin Yong Knowledge Base

Once the data is scraped, we organize it with Pandas:

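import pandas as pd

# Collect each link's title and target into a list of records
novels = []
for novel in novel_list:
    novels.append({
        'name': novel.text.strip(),
        'link': novel['href']
    })

# One row per novel, with the columns 'name' and 'link'
df_novels = pd.DataFrame(novels)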
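
Scraped href values are often relative paths, and link lists can contain duplicates. The sketch below shows one way to tidy such a table, assuming hypothetical rows and a base_url equal to the site root used in this post; it is an illustration, not a description of what the real site returns.

```python
from urllib.parse import urljoin

import pandas as pd

base_url = "https://www.jinyongwang.com/"

# Hypothetical scraped rows: stray whitespace, relative links, one duplicate.
rows = [
    {"name": " 飞狐外传 ", "link": "/fei/"},
    {"name": "雪山飞狐", "link": "/xue/"},
    {"name": "雪山飞狐", "link": "/xue/"},
]

df = pd.DataFrame(rows)

# Trim whitespace around the titles.
df["name"] = df["name"].str.strip()

# Turn relative paths into absolute URLs; absolute URLs pass through unchanged.
df["link"] = df["link"].map(lambda href: urljoin(base_url, href))

# Keep one row per distinct link.
df = df.drop_duplicates(subset="link").reset_index(drop=True)

print(df)
```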

3. Regex Matching: Teasing Out the Character Data

Next, we extract the character data from each novel's page with a regular expression:

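import re
from urllib.parse import urljoin

# Assumes character names are wrapped in <span class="character"> on each page
pattern = r'class="character">(.+?)</span>'
characters = []
for novel in df_novels['link']:
    # Resolve relative links against the site root before requesting them
    response = requests.get(urljoin(url, novel), timeout=10)
    # Search the raw HTML: the pattern contains markup, which soup.text strips out
    characters.extend(re.findall(pattern, response.text))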
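
One detail worth illustrating: this pattern contains HTML markup, so it can only match raw HTML, not text from which the tags have already been removed. The sketch below uses a made-up one-line fragment and a crude re.sub as a stand-in for tag stripping; the class name character comes from the pattern in this post and has not been checked against the real site.

```python
import re

# Hypothetical fragment using the markup that the pattern expects.
sample_html = (
    '<p><span class="character">胡斐</span> meets '
    '<span class="character">苗人凤</span>.</p>'
)

pattern = r'class="character">(.+?)</span>'

# Against raw HTML the markup is present, so the pattern matches.
from_raw = re.findall(pattern, sample_html)

# Crude stand-in for tag stripping: remove everything between < and >.
text_only = re.sub(r"<[^>]+>", "", sample_html)

# With the tags gone there is nothing left for the pattern to match.
from_text = re.findall(pattern, text_only)

print(from_raw)
print(text_only)
print(from_text)
```

The non-greedy (.+?) matters here: a greedy (.+) would run from the first opening span to the last closing one and return a single oversized match.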

4. Code Example: Your Python Wuxia Guide

Now let's put the pieces together into one complete example:

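import re
from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.jinyongwang.com/'
# Fetch and parse the home page
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# CSS selector for the novel links
novel_list = soup.select('div.novellist_1 ul li a')

# Collect each link's title and target into a list of records
novels = []
for novel in novel_list:
    novels.append({
        'name': novel.text.strip(),
        'link': novel['href']
    })

df_novels = pd.DataFrame(novels)

# Assumes character names are wrapped in <span class="character"> on each page
pattern = r'class="character">(.+?)</span>'
characters = []
for novel in df_novels['link']:
    # Resolve relative links against the site root before requesting them
    response = requests.get(urljoin(url, novel), timeout=10)
    # Search the raw HTML: the pattern contains markup, which soup.text strips out
    characters.extend(re.findall(pattern, response.text))

print(characters)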