Distributed Crawling: Using Celery to Scrape Douban Book Tag Data

2024-02-11 16:17:52

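A distributed crawler runs crawl tasks on several machines at the same time, which raises throughput compared with a single-machine crawler. Celery is a distributed task queue: a producer places tasks on a message broker, and worker processes on any number of machines pull and execute them, which makes it a good fit for managing crawl tasks.

This article shows how to build a distributed crawler with Celery to collect Douban book tag data. It covers installing and configuring Celery, managing crawl tasks through a Celery task queue, parsing the tag data out of the page HTML, and storing the result in a database.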

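Installing Celery

Install Celery before you start. The configuration in this article uses Redis as both the message broker and the result backend, and the crawl task uses requests and BeautifulSoup, so install them together (a Redis server must also be running on localhost:6379):

pip install "celery[redis]" requests beautifulsoup4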

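Next, create a configuration module in the project directory. Name it celery_app.py rather than celery.py: a top-level file called celery.py shadows the installed celery package, so "from celery import Celery" would import the file itself. Its contents are as follows:

from celery import Celery

# Create the Celery application.
# broker:  where task messages are queued
# backend: where task results are stored
# include: modules the worker imports at startup so their tasks get registered
app = Celery(
    'tasks',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/0',
    include=['tasks'],
)

# Serialization and worker settings (lowercase setting names, Celery 4.0 and later)
app.conf.update(
    task_serializer='json',
    result_serializer='json',
    accept_content=['json'],
    worker_concurrency=4,  # number of worker processes per worker instance
)

# Start a worker from the project directory with:
#   celery -A celery_app worker --loglevel=info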

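Using Celery

With Celery installed and configured, you can use it to manage the distributed crawl. Start by writing a task function, which holds the code each worker executes. In this example the task fetches a Douban book tag page and extracts the tag names. Save the following as tasks.py: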

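import requests
from bs4 import BeautifulSoup
from celery import shared_task

# Many sites may reject the default requests User-Agent, so send a
# browser-like one (this header value is an assumption; adjust as needed).
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}


# shared_task registers the task with whichever Celery app the worker loads,
# so this module does not create a second, unconfigured app.
# The explicit name is what callers pass to send_task().
@shared_task(name='tasks.crawl_douban_book_tags')
def crawl_douban_book_tags(url):
    # Request the Douban book tag page; fail fast on timeouts and HTTP errors
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()

    # Parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the text of every <a class="tag"> element
    tags = [a.get_text(strip=True) for a in soup.find_all('a', class_='tag')]

    # Return the tag names (a plain list, so it serializes to JSON)
    return tags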

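Once the task is defined and a worker is running, any process that can reach the broker can submit work. The following example sends the task by name and waits for the result:

from celery import Celery

# Client app: it only needs the same broker and result backend as the workers
client = Celery(
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/0',
)

# Submit the task by its registered name (module name + function name)
result = client.send_task(
    'tasks.crawl_douban_book_tags',
    args=('https://book.douban.com/tag/',),
)

# Block until a worker returns the result (raises on timeout or task failure)
tags = result.get(timeout=30)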
Parsing the Douban Book Tag HTML

The task extracts tags from the page markup. The examples in this article assume the tag links look like the following simplified snippet; check the live page before relying on it, because the site's actual markup may differ:

<div class="tag-list">
  <a href="https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4" class="tag">小说</a>
  <a href="https://book.douban.com/tag/%E6%95%B0%E5%AD%A6" class="tag">数学</a>
  <a href="https://book.douban.com/tag/%E5%8E%86%E5%8F%B2" class="tag">历史</a>
  <a href="https://book.douban.com/tag/%E5%93%B2%E5%AD%A6" class="tag">哲学</a>
  <a href="https://book.douban.com/tag/%E6%96%87%E5%AD%A6" class="tag">文学</a>
</div>
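The href values in this snippet are the percent-encoded UTF-8 form of the tag names. As a small sketch, the Python standard library can convert in both directions, which is handy for generating the URLs handed to the crawl task. The helper names tag_url and tag_from_url are illustrative, and the URL pattern is taken from the snippet above.

```python
from urllib.parse import quote, unquote

BASE_URL = "https://book.douban.com/tag/"


def tag_url(tag_name):
    """Build the tag page URL by percent-encoding the tag name as UTF-8."""
    return BASE_URL + quote(tag_name, safe="")


def tag_from_url(url):
    """Recover the tag name from a tag page URL."""
    if not url.startswith(BASE_URL):
        raise ValueError("not a tag page URL: " + url)
    return unquote(url[len(BASE_URL):])


print(tag_url("小说"))  # https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4
print(tag_from_url("https://book.douban.com/tag/%E6%95%B0%E5%AD%A6"))  # 数学
```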

BeautifulSoup can parse this markup directly. The following standalone example extracts the tag names from the snippet:

from bs4 import BeautifulSoup

html = """
<div class="tag-list">
  <a href="https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4" class="tag">小说</a>
  <a href="https://book.douban.com/tag/%E6%95%B0%E5%AD%A6" class="tag">数学</a>
  <a href="https://book.douban.com/tag/%E5%8E%86%E5%8F%B2" class="tag">历史</a>
  <a href="https://book.douban.com/tag/%E5%93%B2%E5%AD%A6" class="tag">哲学</a>
  <a href="https://book.douban.com/tag/%E6%96%87%E5%AD%A6" class="tag">文学</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <a class="tag"> element
tags = [a.get_text(strip=True) for a in soup.find_all('a', class_='tag')]

print(tags)  # ['小说', '数学', '历史', '哲学', '文学']
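Scraped text often contains stray whitespace or repeated entries, for example when the same tag appears in more than one section of a page. The following is a minimal cleanup sketch to run before storage, assuming the input is a list of strings like the one printed above. The function name clean_tags is illustrative.

```python
def clean_tags(raw_tags):
    """Strip whitespace, drop empty strings, and remove duplicates,
    keeping the order in which tags were first seen."""
    seen = set()
    cleaned = []
    for tag in raw_tags:
        tag = tag.strip()
        if tag and tag not in seen:
            seen.add(tag)
            cleaned.append(tag)
    return cleaned


print(clean_tags([" 小说", "数学", "", "小说 ", "历史"]))  # ['小说', '数学', '历史']
```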

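Storing the Data in a Database

After parsing, the tags can be written to a database. Any database works, for example MySQL, PostgreSQL, or MongoDB. The following example stores the tags in MySQL using the mysql-connector-python package; it assumes the douban_book_tags database already exists and that tags is the list produced in the previous step: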

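import mysql.connector

# tags is the list of tag names produced by the parsing step

# Open the database connection (adjust the credentials for your environment)
conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='password',
    database='douban_book_tags'
)

# Create a cursor
cursor = conn.cursor()

try:
    # Create the table if it does not exist yet
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS tags (
      id INT AUTO_INCREMENT PRIMARY KEY,
      tag VARCHAR(255) NOT NULL
    )
    """)

    # Insert all tags with one parameterized statement
    cursor.executemany(
        "INSERT INTO tags (tag) VALUES (%s)",
        [(tag,) for tag in tags]
    )

    # Commit the transaction
    conn.commit()
finally:
    # Close the cursor and the connection even if an insert fails
    cursor.close()
    conn.close()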
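Running the crawler repeatedly with the code above inserts the same tags again on every run. One common remedy is a unique constraint combined with an insert that skips rows that already exist. The sketch below demonstrates the idea with SQLite from the Python standard library instead of MySQL, purely so that it runs without a database server. The function name save_tags is an assumption. In MySQL the analogous pieces would be a UNIQUE index on the tag column and INSERT IGNORE.

```python
import sqlite3


def save_tags(conn, tags):
    """Insert tags, silently skipping any that are already stored."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tags (
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          tag TEXT NOT NULL UNIQUE
        )
        """
    )
    conn.executemany(
        "INSERT OR IGNORE INTO tags (tag) VALUES (?)",
        [(tag,) for tag in tags],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
save_tags(conn, ["小说", "数学"])
save_tags(conn, ["数学", "历史"])  # "数学" is already stored and is skipped
stored = [row[0] for row in conn.execute("SELECT tag FROM tags ORDER BY id")]
print(stored)  # ['小说', '数学', '历史']
conn.close()
```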