Build Your Own WeChat Group Morning-News Crawler
Casual Talk
2023-11-28 21:01:03
The Value of a WeChat Group Morning Report
In today's age of information overload, people have to process a huge amount of information every day and can hardly keep up with all the news. A WeChat group morning report addresses this: it gathers news, articles, and the weather forecast into a single digest and sends it to the group at a fixed time every morning, so members can quickly catch up on the day's important information.
How do you send a morning news report to a WeChat group with a Python crawler?
1. Background
The earliest version of this project scraped headlines from a few news sites, did some simple data cleaning, and pushed the result to the target groups via itchat. That approach had several problems:
- The news was incomplete: headlines only, no body text.
- The sources were limited, so the report lacked variety.
- There was no way to send the report on a schedule.
To fix these problems, the crawler needs a few improvements.
2. Scraping the Data
First, we scrape the news from a news site. BeautifulSoup parses the page and lets us pull out the title, body, author, publish time, and other fields.
import requests
from bs4 import BeautifulSoup

url = 'https://www.sina.com.cn/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Fetch the page and parse it with BeautifulSoup
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract title, body, author and publish time from each news block
news_list = []
for news in soup.find_all('div', class_='news-item'):
    title = news.find('h3', class_='news-title').text
    content = news.find('p', class_='news-content').text
    author = news.find('span', class_='news-author').text
    publish_time = news.find('span', class_='news-publish-time').text
    news_list.append({
        'title': title,
        'content': content,
        'author': author,
        'publish_time': publish_time
    })
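Note that the class names above have to match whatever the target page actually uses; if a block is missing one of the expected elements, calling .text on None raises an AttributeError. A slightly more defensive version of the same loop (a sketch under the same assumed selectors) checks the HTTP status first and skips incomplete blocks:

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

news_list = []
for news in soup.find_all('div', class_='news-item'):
    fields = {
        'title': news.find('h3', class_='news-title'),
        'content': news.find('p', class_='news-content'),
        'author': news.find('span', class_='news-author'),
        'publish_time': news.find('span', class_='news-publish-time'),
    }
    if any(node is None for node in fields.values()):
        continue  # skip blocks that are missing an expected element
    news_list.append({key: node.text.strip() for key, node in fields.items()})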
3. Data Cleaning
The scraped data can contain unwanted content such as ads and recommendation blocks, plus stray whitespace and line breaks, so we clean each field before using it.
import re

def clean_data(news_list):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
    for news in news_list:
        news['title'] = re.sub(r'\s+', ' ', news['title'])
        news['content'] = re.sub(r'\s+', ' ', news['content'])
        news['author'] = re.sub(r'\s+', ' ', news['author'])
        news['publish_time'] = re.sub(r'\s+', ' ', news['publish_time'])
    return news_list
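A quick check of what clean_data does (the sample item below is made up):

sample = [{'title': ' Breaking \n News ', 'content': 'First line.\n\tSecond line.',
           'author': ' Reporter ', 'publish_time': '2023-11-28\n08:00'}]
cleaned = clean_data(sample)
print(cleaned[0]['title'])    # ' Breaking News '
print(cleaned[0]['content'])  # 'First line. Second line.'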
4. Turning It into a Service
To deliver the report on a schedule, we wrap the crawler in a service. FastAPI lets us build a small RESTful API around it.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Response schema for a single news item
class News(BaseModel):
    title: str
    content: str
    author: str
    publish_time: str

app = FastAPI()

@app.get('/news', response_model=list[News])
async def get_news():
    # crawl_news() wraps the scraping code from step 2 (see the full example below)
    news_list = clean_data(crawl_news())
    return news_list
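Assuming the code lives in a file named main.py (a placeholder name) and crawl_news()/clean_data() are defined as in the full example below, the service can be started and tested locally like this:

uvicorn main:app --reload
curl http://localhost:8000/news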
5. Sending the Morning Report on a Schedule
Finally, we need a scheduled task that runs every morning and pushes the report to the group. We use Celery for the scheduling.
from celery import Celery
import itchat

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def send_morning_news():
    # Reuse crawl_news()/clean_data() directly; the FastAPI endpoint above is async
    news_list = clean_data(crawl_news())
    for news in news_list:
        itchat.send(news['title'] + '\n' + news['content'], toUserName='@123456789')

# Start a worker from the command line, e.g.: celery -A tasks worker -B --loglevel=info
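The task above still needs something to trigger it every morning. One way (a sketch; the 07:30 time, the group name, and the assumption that this code lives in tasks.py are all placeholders) is Celery's beat schedule, plus an itchat login and a group lookup to resolve the hard-coded toUserName:

from celery.schedules import crontab

# Run send_morning_news every day at 07:30; the task name assumes this module is tasks.py
app.conf.beat_schedule = {
    'send-morning-news': {
        'task': 'tasks.send_morning_news',
        'schedule': crontab(hour=7, minute=30),
    },
}

# Log in once (scan the QR code) and look up the target group by its display name,
# then use the returned UserName instead of the hard-coded '@123456789'.
itchat.auto_login(hotReload=True)
rooms = itchat.search_chatrooms(name='My Morning News Group')
if rooms:
    group_username = rooms[0]['UserName']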
Complete Example Code
import requests
from bs4 import BeautifulSoup
import re
import itchat
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from celery import Celery

app = FastAPI()

class News(BaseModel):
    title: str
    content: str
    author: str
    publish_time: str

@app.get('/news', response_model=list[News])
async def get_news():
    news_list = clean_data(crawl_news())
    return news_list

def crawl_news():
    url = 'https://www.sina.com.cn/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    news_list = []
    for news in soup.find_all('div', class_='news-item'):
        title = news.find('h3', class_='news-title').text
        content = news.find('p', class_='news-content').text
        author = news.find('span', class_='news-author').text
        publish_time = news.find('span', class_='news-publish-time').text
        news_list.append({
            'title': title,
            'content': content,
            'author': author,
            'publish_time': publish_time
        })
    return news_list

def clean_data(news_list):
    for news in news_list:
        news['title'] = re.sub(r'\s+', ' ', news['title'])
        news['content'] = re.sub(r'\s+', ' ', news['content'])
        news['author'] = re.sub(r'\s+', ' ', news['author'])
        news['publish_time'] = re.sub(r'\s+', ' ', news['publish_time'])
    return news_list

# Give the Celery app its own name so it does not overwrite the FastAPI app above
celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@celery_app.task
def send_morning_news():
    # Reuse crawl_news()/clean_data() directly instead of calling the async endpoint
    news_list = clean_data(crawl_news())
    for news in news_list:
        itchat.send(news['title'] + '\n' + news['content'], toUserName='@123456789')
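To run the full example (assuming it is saved as main.py, Redis is running locally, and itchat has been logged in as sketched in step 5), you would typically start the API plus a Celery worker with the embedded beat scheduler:

# Serve GET /news
uvicorn main:app

# Worker plus embedded beat scheduler for the daily task
celery -A main:celery_app worker -B --loglevel=info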