How to Scrape Web Table Data with Python and Selenium

2024-07-12 05:22:36

How to Scrape Web Table Data Using Selenium and Python

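Data analysis and information gathering often require pulling data from web pages. Tables are a common way to present data on the web and often hold valuable information. Copying and pasting by hand is slow and error-prone, so web scraping is used to automate the task. Selenium is a browser automation tool that can be driven from Python to extract table data from a page.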

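This article uses a page containing a responsive table as its example. It shows how to extract the table data with Selenium and Python, and how to handle the dynamic-loading problem you are likely to run into in practice.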

The Challenge of Scraping Web Table Data

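Many websites use responsive tables to present data. The HTML behind such tables is often fairly complex, which makes extraction harder. You might locate the table element with Selenium and read the text of its cells, only to end up with a pile of empty lists or incomplete data.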

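Take the following HTML as an example. It shows a typical responsive table structure (parts of the code have been altered for privacy, but the structure matches the real page):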

<table class="items">
    <tbody>
        <tr class="odd">
            <td class="centered">1</td>
            <td class="centered no-border-right">
                <a title="company 1" name="" href="/company1/year_id/1970"> <img src="https://company_1.com/logo.png"> </a>
            </td>
            <td class="mainlink no-border-links">
                <a title="company 1" name="" href="/company1/year_id/1970">company 1</a>
            </td>
            <td class="rights mainlink redtext">$270k</td>
            <td class="centered">
                <a href="/company 1/purchase/year_id/1970">5</a>
            </td>
            <td class="rights mainlink greentext">- </td>
            <td class="centered">
                <a href="/company 1/purchase/year_id/1970">4</a>
            </td>
            <td class="rights mainlink">
                <span class="redtext">$-270k</span>
            </td>
        </tr>
        <tr class="even">
            <!-- cells omitted; the other 24 rows have a similar structure -->
        </tr>
    </tbody>
</table>

Suppose you try to extract the data with the following Python code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from fake_useragent import UserAgent
from pandas import DataFrame

option = webdriver.ChromeOptions()
option.add_argument("--headless")
ua = UserAgent()
option.add_argument(f"user-agent={ua.chrome}")
driver = webdriver.Chrome(options=option)

table_class='items'

url_expenditure = 'https://target_website.com'
driver.get(url_expenditure)
driver.implicitly_wait(5)

table_element = driver.find_element(By.CLASS_NAME, table_class)

table_data = []
for row in table_element.find_elements(By.TAG_NAME, "tr"):
    row_data = [cell.text.strip() for cell in row.find_elements(By.TAG_NAME, "td")]  
    table_data.append(row_data)

driver.quit()

print(table_data)

The result is likely to be:

[[], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', ''], ['', '', '', '', '', '', '', '']]

This is far from the expected result ([['1','','company 1','$270k','5','-','4','$-270k'],[#next row of data]...]).

Handling Dynamic Loading to Extract the Data Accurately

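The likely cause is that the page loads the table data dynamically with JavaScript. Selenium reads the table as soon as the page itself has loaded, but at that point the data has not finished loading. Because a WebElement's .text returns only the text that is currently rendered and visible, cells that have not been filled in yet come back as empty strings.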
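Before changing the wait logic, it can help to check whether the cells already contain text that is simply not visible yet. The sketch below is one way to do that, under the assumption that `cell` is a Selenium WebElement: it reads `.text` first and falls back to the `textContent` DOM property, which holds an element's text whether or not it is displayed. The helper name `cell_text` is made up for this example and is not part of Selenium.

```python
def cell_text(cell):
    """Return a cell's visible text, falling back to its raw DOM text.

    `cell` is assumed to be a Selenium WebElement (any object exposing
    `.text` and `.get_attribute()` works). Illustrative helper only.
    """
    visible = (cell.text or "").strip()
    if visible:
        return visible
    # `.text` is empty for elements that are in the DOM but not displayed;
    # `textContent` holds the text regardless of visibility.
    raw = cell.get_attribute("textContent")
    return (raw or "").strip()
```

If this helper returns values where `.text` returned '', the text was already in the DOM and the cells were probably hidden rather than missing, in which case a longer wait alone may not be enough.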

To solve this, we use WebDriverWait to wait until the data has loaded before extracting it.
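Conceptually, an explicit wait is a polling loop: call a condition repeatedly until it returns something truthy or a deadline passes. The function below is a simplified stand-in written for this article to illustrate the idea. It is not Selenium's implementation, and the name `wait_until` and its parameters are assumptions (WebDriverWait itself polls every 0.5 seconds by default and raises TimeoutException rather than TimeoutError).

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Simplified illustration of an explicit wait, not Selenium's code.
    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition produced
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

The key difference from a fixed `time.sleep(10)` is that the loop returns as soon as the condition holds, so a fast page is not penalized.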

The revised code is as follows:

import sys

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from fake_useragent import UserAgent

# Configure Chrome: headless mode and a Chrome user-agent string
option = webdriver.ChromeOptions()
option.add_argument("--headless")
ua = UserAgent()
option.add_argument(f"user-agent={ua.chrome}")
driver = webdriver.Chrome(options=option)

# Target URL and the class name of the table
url_expenditure = 'https://target_website.com'
table_class = 'items'

# Open the page
driver.get(url_expenditure)

# Wait up to 10 seconds for the table to appear in the DOM
try:
    table_element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, table_class))
    )
except TimeoutException:
    print("Timed out waiting for the table to load!")
    driver.quit()
    sys.exit(1)

# Extract the text of every <td> cell in every row
table_data = []
for row in table_element.find_elements(By.TAG_NAME, "tr"):
    row_data = [cell.text.strip() for cell in row.find_elements(By.TAG_NAME, "td")]
    table_data.append(row_data)

# Close the browser
driver.quit()

# Print the extracted data
print(table_data)

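This code uses WebDriverWait together with the expected_conditions module to wait for the table element. WebDriverWait(driver, 10) waits for at most 10 seconds, and EC.presence_of_element_located((By.CLASS_NAME, table_class)) waits until an element with the class 'items' is present in the DOM. If the element does not appear in time, until() raises TimeoutException, which the script catches so that it can close the browser and exit.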
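If the table element shows up before its rows are filled in, waiting for presence alone may still produce empty cells. WebDriverWait.until() accepts any callable that takes the driver, so a stricter condition can be written by hand. The sketch below is one possible version: the name `table_has_text` is hypothetical, and the locator strings assume Selenium's By constants ("tag name" is the value of By.TAG_NAME).

```python
def table_has_text(by, value):
    """Build a wait condition that passes once the table has a non-empty cell.

    `by` and `value` form a Selenium locator, e.g. (By.CLASS_NAME, "items").
    Hypothetical helper for this article; it is not part of Selenium.
    """
    def condition(driver):
        tables = driver.find_elements(by, value)
        if not tables:
            return False  # table not in the DOM yet
        # "tag name" is the string behind By.TAG_NAME
        for cell in tables[0].find_elements("tag name", "td"):
            if cell.text.strip():
                return tables[0]  # until() returns this truthy value
        return False  # table exists but every cell is still empty
    return condition
```

It would be used in place of the presence check, for example `table_element = WebDriverWait(driver, 10).until(table_has_text(By.CLASS_NAME, table_class))`.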

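With this approach, Selenium extracts the data only after the table has appeared, which gives the correct result when the rows arrive together with the table. Keep in mind that presence_of_element_located only checks that the element exists in the DOM; if the cells are filled in later, you need to wait on the cell content instead.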
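Once the rows come back populated, the nested list is usually easier to work with as labeled records. The sketch below skips rows with no usable cells (such as the leading `[]`, which is what a row without any `<td>` cells produces) and pairs each remaining row with column names. The column names are placeholders invented for this example, since the original page's headers are not shown, and `rows_to_records` is a hypothetical helper.

```python
def rows_to_records(table_data, columns):
    """Turn scraped rows into dicts, skipping rows with no usable cells."""
    records = []
    for row in table_data:
        if not any(row):
            continue  # rows without <td> cells, or rows of empty strings
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} cells, got {len(row)}: {row}")
        records.append(dict(zip(columns, row)))
    return records


# Placeholder column names chosen for this example, not taken from the site.
COLUMNS = ["rank", "logo", "name", "col_4", "col_5", "col_6", "col_7", "col_8"]

sample = [[], ["1", "", "company 1", "$270k", "5", "-", "4", "$-270k"]]
records = rows_to_records(sample, COLUMNS)
print(records[0]["name"])  # company 1
```

If pandas is installed, the resulting list of dicts can be passed straight to `pandas.DataFrame(records)` for further analysis.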