Scraping Static Websites with Python

Casual Notes

2023-12-24 09:33:16

A Complete Guide to Web Scraping with Python

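Environment Setup

Before you start scraping, make sure the following software and libraries are installed:

Python 3.6 or later: the foundation for everything that follows.
Requests: for sending HTTP requests.
BeautifulSoup: for parsing HTML and XML (installed as the beautifulsoup4 package, imported as bs4).
lxml: a fast parser that BeautifulSoup can use in place of the built-in html.parser.
Selenium (optional): needed for scraping dynamic websites.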
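
Whether these libraries are present can be checked from Python itself. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`) and the PyPI distribution names `requests`, `beautifulsoup4`, `lxml`, and `selenium`; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# PyPI distribution names (BeautifulSoup is published as "beautifulsoup4")
REQUIRED = ["requests", "beautifulsoup4", "lxml", "selenium"]

for name, ver in installed_versions(REQUIRED).items():
    status = ver if ver is not None else f"not installed (pip install {name})"
    print(f"{name}: {status}")
```

Anything reported as missing can be installed with pip before moving on.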

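Scraping Static Websites

A static website serves the same HTML regardless of how the user interacts with it. Such sites can be scraped in the following steps:

1. Fetch the HTML content: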

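import requests

# URL to scrape
url = 'https://example.com'

# Send a GET request; the timeout keeps the call from hanging indefinitely
response = requests.get(url, timeout=10)

# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()

# Extract the HTML content
html = response.text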
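
Network requests can fail outright or come back with an error status, so it can help to wrap the fetch step in a function that handles both cases. The sketch below is one possible shape, not a definitive implementation: the function name `fetch_html`, the `User-Agent` string, and the choice to return `None` on failure are all assumptions made for illustration.

```python
import requests

# Hypothetical User-Agent value; some sites reject the library default
HEADERS = {"User-Agent": "my-scraper/0.1 (+https://example.com/contact)"}


def fetch_html(url, timeout=10.0):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Turn 4xx/5xx status codes into requests.HTTPError
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for timeouts, connection
        # errors, and HTTP errors raised by raise_for_status()
        print(f"Request to {url} failed: {exc}")
        return None
    return response.text
```

A caller would then write `html = fetch_html('https://example.com')` and skip the parsing steps when the result is `None`.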

2. Parse the HTML content:

from bs4 import BeautifulSoup

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Locate the elements of interest (find() returns None if nothing matches)
title = soup.find('title')
body = soup.find('body')
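
Because `find()` returns `None` when nothing matches, reading `.text` directly on its result can raise `AttributeError` on pages that lack the element. A minimal sketch of a safer lookup follows; the sample HTML and the helper name `text_or_default` are made up for illustration, and `select_one()` is shown as an alternative that takes a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up sample document so the example runs without network access
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="headline">Hello</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""


def text_or_default(element, default=""):
    """Return the stripped text of a Tag, or `default` if the lookup found nothing."""
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title = text_or_default(soup.find("title"))
headline = text_or_default(soup.select_one("h1.headline"))  # CSS selector lookup
missing = text_or_default(soup.find("footer"), default="(none)")

print(title, headline, missing, sep=" | ")
```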

3. Extract the data:

# Extract the text of the title
title_text = title.text

# Extract the text of the body
body_text = body.text
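
The raw text of a page body often includes script and style content as well as stray whitespace. One way to get cleaner output is sketched below, assuming that dropping `<script>` and `<style>` elements is acceptable for the task; the function name `visible_text` and the sample input are hypothetical.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the page text without script/style content.

    Each text fragment is stripped and the fragments are joined by single spaces.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code rather than readable text
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


# Made-up sample input
sample = "<body><h1>Hello</h1>\n<script>var x = 1;</script><p>World</p></body>"
print(visible_text(sample))
```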

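Scraping Dynamic Websites

A dynamic website builds or changes its content in the browser, usually with JavaScript, when the page loads or the user interacts with it. Because a plain HTTP request returns only the initial HTML, scraping such sites calls for a browser automation tool such as Selenium:

1. Load the page with Selenium: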

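from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Create a WebDriver object (launches Chrome)
driver = webdriver.Chrome()

try:
    # Load the target URL
    driver.get('https://example.com')

    # Wait up to 10 seconds for the <body> element to be present.
    # implicitly_wait() only affects element lookups, so it does not
    # wait for the page itself.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )

    # Get the rendered HTML
    html = driver.page_source
finally:
    # Close the browser even if an error occurred
    driver.quit()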

2. Parse the HTML content:

from bs4 import BeautifulSoup

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Locate the elements of interest (find() returns None if nothing matches)
title = soup.find('title')
body = soup.find('body')

3. Extract the data:

# Extract the text of the title
title_text = title.text

# Extract the text of the body
body_text = body.text
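
The parsing and extraction steps are the same whether the HTML came from Requests or from Selenium, so they can be factored into one helper. The sketch below also tries the lxml parser first and falls back to the built-in one when lxml is not installed; the function names `make_soup` and `title_and_text` are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is unavailable
        return BeautifulSoup(html, "html.parser")


def title_and_text(html):
    """Return (title, body text) for HTML from either response.text or driver.page_source."""
    soup = make_soup(html)
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = soup.body.get_text(" ", strip=True) if soup.body else ""
    return title, body
```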

Examples

To reinforce the steps above, here are a few examples:

- **Get the site title:**
```python
import requests
from bs4 import BeautifulSoup

# Target URL
url = 'https://example.com'

# Fetch the HTML content
response = requests.get(url, timeout=10)
html = response.text

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Locate the title element and read its text
title = soup.find('title').text

# Print the title text
print(title)
```

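- **Get all links on the site:**
```python
import requests
from bs4 import BeautifulSoup

# Target URL
url = 'https://example.com'

# Fetch the HTML content
response = requests.get(url, timeout=10)
html = response.text

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Locate all links that have an href attribute
# (indexing link['href'] on an <a> without one raises KeyError)
links = soup.find_all('a', href=True)

# Iterate over the links and print each URL
for link in links:
    print(link['href'])
```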
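
The `href` values on a page are often relative (for example `/about`), so they usually need to be resolved against the page URL before they can be fetched. A minimal sketch using the standard library's `urljoin` follows; the function name `absolute_links` and the sample HTML are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return the unique absolute URLs of all <a href> elements, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative paths against the URL the page was fetched from
        url = urljoin(base_url, anchor["href"])
        if url not in seen:
            seen.append(url)
    return seen


# Made-up sample input: one relative link (repeated), one absolute, one without href
sample = (
    '<a href="/about">About</a>'
    '<a href="https://example.org/">External</a>'
    '<a>No href</a>'
    '<a href="/about">About again</a>'
)
print(absolute_links(sample, "https://example.com/index.html"))
```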

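- **Get data from a dynamic website with Selenium:**

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# Create a WebDriver object (launches Chrome)
driver = webdriver.Chrome()

try:
    # Load the target URL
    driver.get('https://example.com')

    # Wait up to 10 seconds for the <body> element to be present
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )

    # Get the rendered HTML
    html = driver.page_source
finally:
    # Close the browser even if an error occurred
    driver.quit()

# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Locate the title element and read its text
title = soup.find('title').text

# Print the title text
print(title)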