A First Look at the requests-html Library + Fixing an Undocumented Bug, I/O error : encoder error (Python Crawler Example 30)

2023-10-05 08:12:13

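requests-html is a very convenient library that can parse web pages directly, with no need for extra tools such as BeautifulSoup. Its main feature is that it exposes the HTML as an object you can work with directly, which makes handling HTML tags more convenient and writing crawlers more efficient.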

Installing requests-html

pip install requests-html

Basic usage of requests-html

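Basic usage of requests-html is similar to the requests library: first create an HTMLSession object, then send the request with the session's get method, and finally parse the page content through the HTML object available as response.html.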

The following shows how to use requests-html to fetch the title of a web page.

import requests_html

# Create the session object
session = requests_html.HTMLSession()

# Send the request
response = session.get('https://www.python.org/')

# Parse the HTML content
html = response.html

# Get the title
title = html.find('title', first=True).text

# Print the title
print(title)
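
The steps above can also be wrapped in a small helper. The following is a minimal sketch, not part of requests-html itself: `fetch_title` and `StubSession` are hypothetical names made up for this example, and the stub only stands in for `HTMLSession` so that the sketch runs offline. It also guards against pages without a `<title>` tag, where `find(..., first=True)` returns `None`.

```python
from types import SimpleNamespace


def fetch_title(session, url, headers=None):
    """Return the text of the page's <title>, or None if the page has none.

    `session` is expected to behave like requests_html.HTMLSession.
    """
    response = session.get(url, headers=headers)
    title = response.html.find('title', first=True)
    return title.text if title is not None else None


class StubSession:
    """Offline stand-in for HTMLSession (hypothetical, for demonstration only)."""

    def __init__(self, title):
        self._title = title

    def get(self, url, headers=None):
        # Mimic only the small part of the response used by fetch_title
        element = None if self._title is None else SimpleNamespace(text=self._title)
        html = SimpleNamespace(find=lambda selector, first=False: element)
        return SimpleNamespace(html=html)


print(fetch_title(StubSession('Example title'), 'https://example.invalid/'))
print(fetch_title(StubSession(None), 'https://example.invalid/'))
```

With the real library, passing `requests_html.HTMLSession()` as `session` should give the same behaviour as the script above, with the difference that a missing title yields `None` rather than an AttributeError.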

Fixing a bug with no documentation

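While scraping pages with requests-html, we ran into a bug: the error I/O error : encoder error.
After some digging, we found that the cause is that requests-html automatically encodes the text of a page while parsing it, and the text on some pages is not encoded as UTF-8, which leads to the encoding error.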
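
The kind of encoding mismatch described here can be reproduced with the standard library alone. This is only an illustration of what happens when non-UTF-8 bytes are treated as UTF-8, not the code path requests-html itself takes; the sample string and the `try_decode` helper are made up for the example.

```python
def try_decode(raw_bytes, encoding):
    """Decode raw_bytes with the given encoding, or return None on failure."""
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return None


# Bytes as a GBK-encoded page would deliver them
raw = '中文网页'.encode('gbk')

as_utf8 = try_decode(raw, 'utf-8')  # wrong assumption about the encoding
as_gbk = try_decode(raw, 'gbk')     # the encoding the bytes were written in

print(as_utf8)
print(as_gbk)
```

Decoding succeeds only when the assumed encoding matches the one the page was actually written in.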

To work around this problem, we pass the headers parameter when sending the request and set 'User-Agent' to 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'.

import requests_html

# Create the session object
session = requests_html.HTMLSession()

# Set the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'}

# Send the request
response = session.get('https://www.python.org/', headers=headers)

# Parse the HTML content
html = response.html

# Get the title
title = html.find('title', first=True).text

# Print the title
print(title)
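
If every request in a crawl should carry the same User-Agent, one option is to set it once on the session instead of passing headers to each call. The sketch below assumes the session exposes the `headers` mapping that `HTMLSession` inherits from `requests.Session`; `with_browser_user_agent` and `StubSession` are hypothetical names, and the stub is there only so the example runs offline.

```python
BROWSER_UA = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'
)


def with_browser_user_agent(session, user_agent=BROWSER_UA):
    """Set the User-Agent on the session so later requests send it by default."""
    session.headers.update({'User-Agent': user_agent})
    return session


class StubSession:
    """Offline stand-in exposing only a `headers` mapping (hypothetical)."""

    def __init__(self):
        self.headers = {}


session = with_browser_user_agent(StubSession())
print(session.headers['User-Agent'])
```

With the real library, the same helper could be applied to `requests_html.HTMLSession()`, after which `session.get(url)` no longer needs an explicit headers argument.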