Build Your Own WeChat Group Morning-News Crawler
Casual Talk
2023-11-28 21:01:03
The Value of a WeChat Group Morning Report
In today's age of information overload, people have to process a huge amount of information every day and can hardly keep up with all the news. A WeChat group morning report addresses this: it gathers news, articles, and the weather forecast into a single digest and sends it to the group at a fixed time every morning, so members can quickly catch up on the day's important information.
How do you send a morning news report to a WeChat group with a Python crawler?
1. Background
The earliest version of this project scraped headlines from a few news sites, did some simple data cleaning, and pushed the result to the target groups via itchat. That approach had several problems:
- The news was incomplete: headlines only, no body text.
- The sources were limited, so the report lacked variety.
- There was no way to send the report on a schedule.
To fix these problems, the crawler needs a few improvements.
2. Scraping the Data
First, we scrape the news from a news site. BeautifulSoup parses the page and lets us pull out the title, body, author, publish time, and other fields.
import requests
from bs4 import BeautifulSoup

url = 'https://www.sina.com.cn/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Fetch the page and parse it with BeautifulSoup
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract title, body, author and publish time from each news block
news_list = []
for news in soup.find_all('div', class_='news-item'):
    title = news.find('h3', class_='news-title').text
    content = news.find('p', class_='news-content').text
    author = news.find('span', class_='news-author').text
    publish_time = news.find('span', class_='news-publish-time').text
    news_list.append({
        'title': title,
        'content': content,
        'author': author,
        'publish_time': publish_time
    })
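Note that the class names above have to match whatever the target page actually uses; if a block is missing one of the expected elements, calling .text on None raises an AttributeError. A slightly more defensive version of the same loop (a sketch under the same assumed selectors) checks the HTTP status first and skips incomplete blocks:

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

news_list = []
for news in soup.find_all('div', class_='news-item'):
    fields = {
        'title': news.find('h3', class_='news-title'),
        'content': news.find('p', class_='news-content'),
        'author': news.find('span', class_='news-author'),
        'publish_time': news.find('span', class_='news-publish-time'),
    }
    if any(node is None for node in fields.values()):
        continue  # skip blocks that are missing an expected element
    news_list.append({key: node.text.strip() for key, node in fields.items()})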
3. Data Cleaning
The scraped data can contain unwanted content such as ads and recommendation blocks, plus stray whitespace and line breaks, so we clean each field before using it.
import re

def clean_data(news_list):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    for news in news_list:
        news['title'] = re.sub(r'\s+', ' ', news['title'])
        news['content'] = re.sub(r'\s+', ' ', news['content'])
        news['author'] = re.sub(r'\s+', ' ', news['author'])
        news['publish_time'] = re.sub(r'\s+', ' ', news['publish_time'])
    return news_list
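A quick check of what clean_data does (the sample item below is made up):

sample = [{'title': ' Breaking \n News ', 'content': 'First line.\n\tSecond line.',
           'author': ' Reporter ', 'publish_time': '2023-11-28\n08:00'}]
cleaned = clean_data(sample)
print(cleaned[0]['title'])    # ' Breaking News '
print(cleaned[0]['content'])  # 'First line. Second line.'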
4. Turning It into a Service
To deliver the report on a schedule, we wrap the crawler in a service. FastAPI lets us build a small RESTful API around it.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Response schema for a single news item
class News(BaseModel):
    title: str
    content: str
    author: str
    publish_time: str

app = FastAPI()

@app.get('/news', response_model=list[News])
async def get_news():
    # crawl_news() wraps the scraping code from step 2 (see the full example below)
    news_list = clean_data(crawl_news())
    return news_list
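Assuming the code lives in a file named main.py (a placeholder name) and crawl_news()/clean_data() are defined as in the full example below, the service can be started and tested locally like this:

uvicorn main:app --reload
curl http://localhost:8000/news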
5. Sending the Morning Report on a Schedule
Finally, we need a scheduled task that runs every morning and pushes the report to the group. We use Celery for the scheduling.
from celery import Celery
import itchat

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def send_morning_news():
    # Reuse crawl_news()/clean_data() directly; the FastAPI endpoint above is async
    news_list = clean_data(crawl_news())
    for news in news_list:
        itchat.send(news['title'] + '\n' + news['content'], toUserName='@123456789')

# Start a worker from the command line, e.g.: celery -A tasks worker -B --loglevel=info
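The task above still needs something to trigger it every morning. One way (a sketch; the 07:30 time, the group name, and the assumption that this code lives in tasks.py are all placeholders) is Celery's beat schedule, plus an itchat login and a group lookup to resolve the hard-coded toUserName:

from celery.schedules import crontab

# Run send_morning_news every day at 07:30; the task name assumes this module is tasks.py
app.conf.beat_schedule = {
    'send-morning-news': {
        'task': 'tasks.send_morning_news',
        'schedule': crontab(hour=7, minute=30),
    },
}

# Log in once (scan the QR code) and look up the target group by its display name,
# then use the returned UserName instead of the hard-coded '@123456789'.
itchat.auto_login(hotReload=True)
rooms = itchat.search_chatrooms(name='My Morning News Group')
if rooms:
    group_username = rooms[0]['UserName']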
Complete Example Code
import requests
from bs4 import BeautifulSoup
import re
import itchat
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from celery import Celery

app = FastAPI()

class News(BaseModel):
    title: str
    content: str
    author: str
    publish_time: str

@app.get('/news', response_model=list[News])
async def get_news():
    news_list = clean_data(crawl_news())
    return news_list

def crawl_news():
    url = 'https://www.sina.com.cn/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    news_list = []
    for news in soup.find_all('div', class_='news-item'):
        title = news.find('h3', class_='news-title').text
        content = news.find('p', class_='news-content').text
        author = news.find('span', class_='news-author').text
        publish_time = news.find('span', class_='news-publish-time').text
        news_list.append({
            'title': title,
            'content': content,
            'author': author,
            'publish_time': publish_time
        })
    return news_list

def clean_data(news_list):
    for news in news_list:
        news['title'] = re.sub(r'\s+', ' ', news['title'])
        news['content'] = re.sub(r'\s+', ' ', news['content'])
        news['author'] = re.sub(r'\s+', ' ', news['author'])
        news['publish_time'] = re.sub(r'\s+', ' ', news['publish_time'])
    return news_list

# Give the Celery app its own name so it does not overwrite the FastAPI app above
celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@celery_app.task
def send_morning_news():
    # Reuse crawl_news()/clean_data() directly instead of calling the async endpoint
    news_list = clean_data(crawl_news())
    for news in news_list:
        itchat.send(news['title'] + '\n' + news['content'], toUserName='@123456789')
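To run the full example (assuming it is saved as main.py, Redis is running locally, and itchat has been logged in as sketched in step 5), you would typically start the API plus a Celery worker with the embedded beat scheduler:

# Serve GET /news
uvicorn main:app

# Worker plus embedded beat scheduler for the daily task
celery -A main:celery_app worker -B --loglevel=info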