Python Web Crawlers for Beginners: A Getting-Started Guide and Quick-Start Tips

2023-08-20 22:40:04

I. Introduction

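In the digital age, data has become one of the most valuable resources, and web crawling, an important way of collecting data from the web, is attracting growing attention from developers. Python is a natural choice for writing crawlers thanks to its simple, easy-to-learn syntax and its strong library support. For beginners, learning to write crawlers in Python makes it easier to collect and analyze data, and it can also strengthen your professional skill set.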

II. Basic Libraries for Python Crawlers

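Before writing a crawler, you need to be familiar with a few basic Python libraries. Together they cover the whole process, from sending network requests to parsing HTML content.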

1. Requests

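Requests is a library for sending HTTP requests. It lets you send GET and POST requests with very little code and read the server's response. A crawler's requests are ordinary HTTP requests, so Requests handles them in exactly the same way.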

Example code:

import requests

url = 'https://example.com'
response = requests.get(url)

print(response.status_code)  # Print the response status code
print(response.text)         # Print the response body
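
The paragraph above mentions both GET and POST, while the example only sends a GET. The following is a minimal sketch of how the two differ, built with prepared requests so that nothing is sent over the network. The /search and /login paths and the field names are hypothetical and used only for illustration.

```python
import requests

# Hypothetical endpoints on example.com, used only for illustration.
SEARCH_URL = 'https://example.com/search'
LOGIN_URL = 'https://example.com/login'

# GET: query parameters are appended to the URL.
get_req = requests.Request('GET', SEARCH_URL, params={'q': 'python', 'page': 2}).prepare()
print(get_req.method, get_req.url)

# POST: form fields are URL-encoded and placed in the request body.
post_req = requests.Request('POST', LOGIN_URL, data={'user': 'alice'}).prepare()
print(post_req.method, post_req.url)
print(post_req.headers['Content-Type'])
print(post_req.body)
```

To actually send such requests, the usual calls are requests.get(SEARCH_URL, params=..., timeout=10) and requests.post(LOGIN_URL, data=..., timeout=10).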

2. BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML documents. It offers a simple API for extracting the data you need from complex HTML documents.

Example code:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extract the title
title = soup.title.string
print(title)  # Output: The Dormouse's story

# Extract all links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
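
For comparison, the link extraction above can be approximated with only the standard library, which gives a sense of the work BeautifulSoup saves you. This is a rough sketch and not how BeautifulSoup is implemented. The LinkCollector class, the sample_html snippet, and the base URL are hypothetical names chosen for illustration. It also resolves relative links against a base URL, which crawlers often need.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:
            self.links.append(urljoin(self.base_url, href))


# Hypothetical snippet and base URL, used only for illustration.
sample_html = '<p><a href="/elsie">Elsie</a> and <a href="http://example.com/lacie">Lacie</a></p>'

collector = LinkCollector('http://example.com/')
collector.feed(sample_html)
print(collector.links)
```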

3. urllib

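urllib is Python's built-in URL handling library. It is less convenient than Requests, but it can be the better fit in some cases, for example when you want to avoid third-party dependencies.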

Example code:

from urllib.request import urlopen

url = 'https://example.com'
# Use a with block so the connection is closed after reading
with urlopen(url) as response:
    html = response.read().decode('utf-8')

print(html)
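
Two things the short example leaves out are query strings and failed requests. The sketch below shows one way to handle both with the standard library. The helper names build_url and fetch, the /search path, and the parameters are hypothetical and used only for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + '?' + urlencode(params)


def fetch(url, timeout=10):
    """Return the decoded body, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            # Prefer the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except HTTPError as exc:   # the server answered with an error status
        print('HTTP error:', exc.code)
    except URLError as exc:    # DNS failure, refused connection, and similar
        print('Connection problem:', exc.reason)
    return None


# Hypothetical endpoint and parameters, used only for illustration.
search_url = build_url('https://example.com/search', {'q': 'python', 'page': 1})
print(search_url)
```

Calling fetch(search_url) would then perform the request. HTTPError is caught first because it is a subclass of URLError.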

III. Making Requests Look Like a Browser

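Many websites inspect the User-Agent header of incoming requests to judge whether a request comes from a real browser. To reduce the chance of being identified as a crawler, you can send a browser-like User-Agent request header, for example through the headers parameter of Requests.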

Example code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

url = 'https://example.com'
response = requests.get(url, headers=headers)

print(response.status_code)
print(response.text)
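
To see what a server actually receives, the following sketch starts a small local server that echoes back the User-Agent header, then queries it once with the default header and once with a custom one. It uses urllib rather than Requests so that it runs without third-party packages. Requests behaves the same way, except that its default User-Agent starts with python-requests/. The EchoUserAgent class, the seen_user_agent helper, and the demo User-Agent string are hypothetical names chosen for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class EchoUserAgent(BaseHTTPRequestHandler):
    """Replies with the User-Agent header the client sent."""

    def do_GET(self):
        body = self.headers.get('User-Agent', '').encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def seen_user_agent(url, headers=None):
    """Fetch url and return the User-Agent string the server reported."""
    opener = build_opener(ProxyHandler({}))  # ignore any system proxy for localhost
    request = Request(url, headers=headers or {})
    with opener.open(request, timeout=5) as response:
        return response.read().decode('utf-8')


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), EchoUserAgent)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

default_ua = seen_user_agent(url)
custom_ua = seen_user_agent(url, {'User-Agent': 'Mozilla/5.0 (demo)'})
print('default:', default_ua)  # starts with "Python-urllib/"
print('custom: ', custom_ua)

server.shutdown()
server.server_close()
```

The default value makes it obvious to the server that a script is calling, which is why crawlers commonly override it.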

IV. Encapsulating the Crawler Code

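To make the code easier to maintain and extend, it is a good idea to wrap the crawler in a class. You can then define separate methods for sending requests, parsing pages, and downloading data, which keeps the code clear and manageable.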
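
One possible shape for such a class is sketched below, with one method per stage: fetch, parse, and save. It relies only on the standard library (urllib and html.parser instead of Requests and BeautifulSoup) so that it runs anywhere. The Spider, TitleParser, and OfflineSpider classes, the choice of h2 as the title tag, and the CSV output are all assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._in_h2 = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'h2':
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


class Spider:
    """One method per stage: fetch the page, parse it, save the result."""

    def __init__(self, url, user_agent='Mozilla/5.0'):
        self.url = url
        self.user_agent = user_agent

    def fetch(self):
        request = Request(self.url, headers={'User-Agent': self.user_agent})
        with urlopen(request, timeout=10) as response:
            return response.read().decode('utf-8', errors='replace')

    def parse(self, html):
        parser = TitleParser()
        parser.feed(html)
        return [title.strip() for title in parser.titles]

    def save(self, titles, out):
        # Write one title per row to any file-like object.
        writer = csv.writer(out)
        writer.writerow(['title'])
        writer.writerows([title] for title in titles)

    def run(self, out):
        self.save(self.parse(self.fetch()), out)


class OfflineSpider(Spider):
    """Overrides fetch() with canned HTML so the pipeline runs without a network."""

    def fetch(self):
        return '<h2> First post </h2><p>text</p><h2>Second post</h2>'


buffer = io.StringIO()
OfflineSpider('https://example.com/').run(buffer)
print(buffer.getvalue())
```

Because each stage is its own method, a subclass can replace just one of them, as OfflineSpider does with fetch, without touching the parsing or saving logic.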

V. A Beginner's Python Crawler Example

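Next, a simple crawler example will help consolidate what you have learned. It fetches the titles of CSDN blog articles. The h4 tag with class title is the selector this example assumes, and the site's actual markup may differ: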

import requests
from bs4 import BeautifulSoup

class CSDNSpider:
    def __init__(self):
        self.url = 'https://blog.csdn.net/'

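    def fetch_data(self):
        # Time out instead of hanging, and raise on HTTP error statuses
        response = requests.get(self.url, timeout=10)
        response.raise_for_status()
        return response.text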

    def parse_data(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        titles = soup.find_all('h4', class_='title')
        return [title.text for title in titles]

    def run(self):
        html = self.fetch_data()
        titles = self.parse_data(html)
        for title in titles:
            print(title)

if __name__ == '__main__':
    spider = CSDNSpider()
    spider.run()