Crawler pain points, one at a time: fixing garbled UTF-8/GBK/GB2312 text

2024-02-16 11:36:53

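As a crawler engineer, garbled text caused by character encoding is one of the thorniest problems you will run into. Converting between UTF-8, GBK, and GB2312 can be maddening. Don't worry: this article walks through a dependable way to detect a page's encoding and decode it correctly, so garbled output stops being a recurring problem.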

Why garbled text happens

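When you extract data from a web page, the server sends you bytes produced with a specific encoding. If your crawler interprets those bytes with a different encoding, the text comes out garbled. For example, if the server encodes the page as UTF-8 and your crawler decodes it as GBK, you will see garbled characters.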
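The mismatch can be reproduced without any network access. The following is a minimal sketch; the sample string and the variable names are illustrative assumptions, not taken from a real page.

```python
text = "中文编码"

# The bytes a server would send for a UTF-8 encoded page
wire_bytes = text.encode("utf-8")

# Correct: decode with the same encoding the server used
right = wire_bytes.decode("utf-8")

# Wrong: read the same bytes as GBK. errors="replace" keeps decoding
# past any byte pairs that GBK cannot map instead of raising.
wrong = wire_bytes.decode("gbk", errors="replace")

print(right == text)  # True
print(wrong == text)  # False: the bytes were misinterpreted
print(wrong)          # garbled characters
```

The bytes never change; only the interpretation does. That is why the fix is always to find the right encoding and decode again from the raw bytes, not to repair the garbled string.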

The cure for garbled text

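The key to fixing garbled text is to detect the page's encoding correctly and then decode the raw bytes with it. Here is how to do that with Python's Requests library (using chardet for detection) and with Go's golang.org/x/text packages: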

Python Requests library

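import chardet
import requests

# Fetch the page (the URL is a placeholder)
response = requests.get('https://example.com', timeout=10)

# Detect the encoding from the raw bytes; chardet reports None when it
# cannot guess, so fall back to UTF-8 in that case
encoding = chardet.detect(response.content)['encoding'] or 'utf-8'

# Decode the raw bytes into a Python (Unicode) string
content = response.content.decode(encoding, errors='replace')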
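Statistical detection can guess wrong on short pages, so it may help to try explicit hints first. The sketch below uses only the standard library; `sniff_meta_charset` and `decode_html` are hypothetical helper names, and the candidate order (declared charset, then the HTML meta tag, then UTF-8, then GB18030) is an assumption rather than a rule. GB2312 and GBK labels are decoded as GB18030, which is a superset of both.

```python
import re
from typing import Optional

# Matches declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
_META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE)


def sniff_meta_charset(raw: bytes, limit: int = 2048) -> Optional[str]:
    """Return the charset named in the first `limit` bytes of HTML, if any."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode page bytes, trying the most explicit hints first."""
    candidates = [declared, sniff_meta_charset(raw), "utf-8", "gb18030"]
    for name in candidates:
        if not name:
            continue
        # Pages labeled GB2312 or GBK often contain characters outside
        # that set, so decode them with the GB18030 superset instead
        if name.lower() in ("gb2312", "gbk"):
            name = "gb18030"
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort: keep the readable parts and mark the rest
    return raw.decode("utf-8", errors="replace")


page = '<meta charset="gb2312"><p>中文编码</p>'.encode("gbk")
print(decode_html(page))
```

With Requests, `raw` would be `response.content`, and `declared` could come from the charset in the Content-Type header when the server sends one.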

Go: golang.org/x/text packages

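package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

func main() {
	// Fetch the page (placeholder URL; this example assumes a GBK-encoded page)
	response, err := http.Get("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()
	// Decode the GBK bytes into UTF-8 while reading the body
	utf8Reader := transform.NewReader(response.Body, simplifiedchinese.GBK.NewDecoder())
	utf8Bytes, err := io.ReadAll(utf8Reader)
	if err != nil {
		log.Fatal(err)
	}
	// The decoded bytes are UTF-8, so they convert directly to a Go string
	fmt.Println(string(utf8Bytes))
}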