Python Scraping Magic: Parsing Qidian Novels with re + BeautifulSoup, Fast!

2022-12-05 12:57:24

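Exploring the Secrets of Web Crawlers: Scraping Qidian Novels with Python

Preface

A web crawler, also called a web robot, is an automated tool that collects and extracts information from the vast sea of data on the internet. Python's libraries and tools make crawler development straightforward. This article looks at how to scrape Qidian novels using regular expressions and the BeautifulSoup library.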

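Regular Expressions: A Sharp Tool for Text Processing

A regular expression is a pattern-matching tool for finding and extracting specific patterns in strings. Python's re module provides a rich set of functions and methods for working with them. For example, you can compile a pattern with re.compile() and then call the compiled pattern's search() method to find the first matching substring. search() returns None when nothing matches.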
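
As a small illustration of compile() and search(), here is a minimal sketch. The sample text and the pattern are made up for this example and are not taken from any real page.

```python
import re

# Hypothetical sample text; the pattern captures the chapter number and title separately
text = 'Chapter 12: The Auction'
chapter_pattern = re.compile(r'Chapter (\d+): (.+)')

match = chapter_pattern.search(text)
if match is not None:  # search() returns None when nothing matches
    number = int(match.group(1))  # first capture group: the digits
    title = match.group(2)        # second capture group: the rest of the line
    print(number, title)
```

Checking for None before calling group() avoids an AttributeError on input that does not match.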

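BeautifulSoup: A Handy HTML Parser

BeautifulSoup is a Python library that simplifies HTML parsing. It turns an HTML page into a tree structure, so you can easily reach page elements such as titles, paragraphs, and links. With BeautifulSoup you can focus on extracting the information you need instead of dealing with the details of the raw HTML.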
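
To make the tree structure concrete, the sketch below parses a tiny HTML snippet and reads a title, a paragraph, and the links. The snippet and its class name are hypothetical and serve only as an illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = """
<html>
  <head><title>Sample Book</title></head>
  <body>
    <p class="intro">A short introduction.</p>
    <a href="/chapter/1">Chapter 1</a>
    <a href="/chapter/2">Chapter 2</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

page_title = soup.title.string                     # text of the <title> element
intro = soup.find('p', class_='intro').get_text()  # first <p> with class "intro"
links = [a['href'] for a in soup.find_all('a')]    # href of every <a> element

print(page_title, intro, links)
```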

Scraping Qidian Novels

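1. Install the required libraries

The re module is part of the Python standard library and needs no installation. Before starting, use pip to install the requests and BeautifulSoup libraries:

pip install requests
pip install beautifulsoup4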
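
One way to confirm that the libraries are importable before running the rest of the script is sketched below. The helper name missing_modules is hypothetical. Note that the BeautifulSoup package is imported under the name bs4.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means everything this article uses is available
print(missing_modules(['re', 'requests', 'bs4']))
```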

2. Get the Qidian novel URL

Taking 《斗破苍穹》 as an example, its Qidian URL is https://book.qidian.com/info/1004214333.

3. Send the request and get the response

Use the requests library to send an HTTP request and get the response:

import requests

url = 'https://book.qidian.com/info/1004214333'
response = requests.get(url)
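
A bare requests.get() call has no timeout and does not check the HTTP status code. The sketch below wraps the request in a hypothetical helper, fetch_html, that adds both. The User-Agent value is an assumption for illustration; some sites reject requests that do not send a browser-like header.

```python
import requests

# Hypothetical header value, chosen only for illustration
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; example-scraper/0.1)'}

def fetch_html(url, session=None, timeout=10):
    """Fetch a page and return its decoded HTML, raising on HTTP errors."""
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

It could then be called as fetch_html('https://book.qidian.com/info/1004214333'). Accepting an optional session makes it possible to reuse one connection across many chapter requests.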

4. Parse the HTML page

Use BeautifulSoup to parse the HTML page:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

5. Extract chapter titles and content

Use regular expressions to extract the chapter titles and content:

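import re

# Chapter title pattern (the class name may not match the live page markup)
chapter_title_pattern = re.compile(r'<h3 class="j_chapterName">(.*?)</h3>', re.S)

# Chapter content pattern; re.S lets "." match across line breaks
chapter_content_pattern = re.compile(r'<div id="content">(.*?)</div>', re.S)

# Extract chapter titles and content
chapter_titles = []
chapter_contents = []

for chapter in soup.find_all('div', class_='chapter'):
    # Match against the raw HTML: chapter.text strips the tags the patterns rely on
    chapter_html = str(chapter)
    title_match = chapter_title_pattern.search(chapter_html)
    content_match = chapter_content_pattern.search(chapter_html)
    if title_match is None or content_match is None:
        continue  # skip blocks that lack the expected structure

    chapter_titles.append(title_match.group(1).strip())
    chapter_contents.append(content_match.group(1).strip())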
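
As an alternative to regular expressions, the same fields can be read directly from the parsed tree. This is a minimal sketch, not a drop-in scraper: the sample markup simply mirrors the class names and id used in this article, the real Qidian page structure may differ, and the helper name extract_chapters is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the class names and id used in this article
SAMPLE_HTML = """
<div class="chapter">
  <h3 class="j_chapterName">Chapter 1</h3>
  <div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
<div class="chapter">
  <h3 class="j_chapterName">Chapter 2</h3>
</div>
"""

def extract_chapters(html):
    """Return (title, content) pairs for chapter blocks that have both parts."""
    soup = BeautifulSoup(html, 'html.parser')
    chapters = []
    for block in soup.find_all('div', class_='chapter'):
        title_tag = block.find('h3', class_='j_chapterName')
        content_tag = block.find('div', id='content')
        if title_tag is None or content_tag is None:
            continue  # skip incomplete blocks instead of raising AttributeError
        title = title_tag.get_text(strip=True)
        # Join the text pieces with newlines so paragraphs stay separated
        content = content_tag.get_text('\n', strip=True)
        chapters.append((title, content))
    return chapters

chapters = extract_chapters(SAMPLE_HTML)
print(chapters)
```

The second sample block has no content element, so it is skipped rather than stopping the whole run.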

6. Save the chapter information

Save the chapter titles and content to a file:

with open('斗破苍穹.txt', 'w', encoding='utf-8') as f:
    for chapter_title, chapter_content in zip(chapter_titles, chapter_contents):
        f.write(chapter_title + '\n')
        f.write(chapter_content + '\n')
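
The saving step can also be wrapped in a small reusable function. The sketch below assumes a hypothetical helper name, save_chapters, and adds a blank line between chapters so they are easier to tell apart in the output file.

```python
def save_chapters(path, titles, contents):
    """Write each chapter as a title line, its content, and a blank separator line.

    Returns the number of chapters written. zip() stops at the shorter list,
    so unmatched titles or contents are silently ignored.
    """
    written = 0
    with open(path, 'w', encoding='utf-8') as f:
        for title, content in zip(titles, contents):
            f.write(f'{title}\n{content}\n\n')
            written += 1
    return written
```

For example, save_chapters('斗破苍穹.txt', chapter_titles, chapter_contents) would write the lists collected in step 5.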