Information Technology: Scraping the Full Text of Jianbiaoku (建标库) Standards into a Local Word Document (Selenium + python-docx + Tesseract)

2023-12-14 00:05:18

Preface

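Jianbiaoku (建标库) is a library of engineering construction standards published by the Ministry of Housing and Urban-Rural Development (住建部). It covers fields such as building, municipal works, water conservancy, transportation, and electric power. This post shows how to save the full text of one standard to a local Word document: Selenium opens the page and captures the body of the standard as an image, Tesseract recognizes the text in that image, and python-docx writes the result to a .docx file.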

Implementation steps

1. Environment setup

Python 3.6+
Selenium
python-docx
Tesseract OCR
pytesseract
Pillow (provides the PIL module imported in the code)
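
Before running the scraper, it can help to confirm that everything in the list is actually available. The following is a minimal sketch, not part of the original script: the helper name check_environment and the REQUIRED_MODULES table are hypothetical, and the check only tests whether each module can be found and whether a tesseract executable is on PATH.

```python
import importlib.util
import shutil

# Maps the pip package name to the module name used in "import ...".
# This table is an illustrative assumption based on the imports in the script.
REQUIRED_MODULES = {
    "selenium": "selenium",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
}


def check_environment():
    """Return a dict mapping each requirement to True if it looks available."""
    status = {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in REQUIRED_MODULES.items()
    }
    # pytesseract only wraps the tesseract executable, which it looks up
    # on PATH by default, so the binary is checked separately.
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

If the tesseract binary is installed somewhere that is not on PATH, pytesseract can instead be pointed at it by setting pytesseract.pytesseract.tesseract_cmd.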

2. Code

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from docx import Document
import pytesseract
from PIL import Image

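def main():
    # Create the Selenium browser driver
    driver = webdriver.Chrome()
    try:
        # Open the Jianbiaoku website
        driver.get("http://jbk.mohurd.gov.cn/")
        wait = WebDriverWait(driver, 10)

        # Wait for the search box, then enter the standard number
        search_input = wait.until(
            EC.presence_of_element_located((By.ID, "standard_no"))
        )
        search_input.send_keys("GB 50352-2019")

        # Locate the search button and click it
        search_button = driver.find_element(By.CLASS_NAME, "btn-search")
        search_button.click()

        # Wait for the detail page to load and get the body of the standard
        standard_content = wait.until(
            EC.presence_of_element_located((By.ID, "standard_content"))
        )

        # Save the body of the standard as a local image
        standard_content.screenshot("standard_content.png")
    finally:
        # Quit the driver even if a step above fails;
        # quit() closes every window and ends the browser session
        driver.quit()

    # Recognize the text in the image with Tesseract OCR.
    # lang="chi_sim" assumes the Simplified Chinese language data is installed.
    text = pytesseract.image_to_string(
        Image.open("standard_content.png"), lang="chi_sim"
    )

    # Create the docx document
    document = Document()

    # Add the recognized text to the docx document
    document.add_paragraph(text)

    # Save the docx document
    document.save("standard.docx")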

if __name__ == "__main__":
    main()
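
The script above writes the whole OCR result into a single paragraph. A possible refinement, sketched here under assumptions, is to split the recognized text into paragraphs before writing it. The helpers split_paragraphs and save_paragraphs are hypothetical names and not part of the original script; the sketch assumes that blank lines in the OCR output mark paragraph breaks and that wrapped lines of Chinese text should be joined without a space.

```python
import re


def split_paragraphs(ocr_text, joiner=""):
    """Split raw OCR text into cleaned paragraphs.

    Blank lines are treated as paragraph breaks. Lines inside a paragraph
    are joined with `joiner` (empty by default, which suits Chinese text),
    and runs of spaces or tabs are collapsed to a single space.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", ocr_text):
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in block.splitlines()]
        joined = joiner.join(line for line in lines if line)
        if joined:
            paragraphs.append(joined)
    return paragraphs


def save_paragraphs(paragraphs, path, title=None):
    """Write each paragraph to a .docx file, with an optional heading."""
    # Imported here so split_paragraphs() can be used without python-docx.
    from docx import Document

    document = Document()
    if title:
        document.add_heading(title, level=1)
    for paragraph in paragraphs:
        document.add_paragraph(paragraph)
    document.save(path)
```

In main(), the single document.add_paragraph(text) call could then be replaced with save_paragraphs(split_paragraphs(text), "standard.docx", title="GB 50352-2019"). For English text, passing joiner=" " keeps words on adjacent lines from running together.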