PyMuPDF

2024-03-25 23:51:20

Fixing the "xref = 137" Error in PyMuPDF: A Complete Guide to Extracting Text and Images from a PDF

Introduction

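When extracting text and images from a PDF document with the PyMuPDF library, a common failure is the "xref = 137" error. It means that the number passed as the image cross reference (xref) is not a valid image reference, so the image cannot be extracted. This article looks at the cause of the error and gives a step-by-step guide to extracting the text and images of a PDF correctly.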

Root cause

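The "xref = 137" error comes from passing a value that is not an xref to doc.extract_image(). Code that triggers it typically does img = block['image'] followed by xref = img[0]. In the dictionary returned by page.get_text("dict"), however, the "image" key of an image block holds the raw image bytes, not a tuple that starts with an xref, so block['image'][0] is simply the first byte of the image data. For a PNG that byte is 0x89, which is 137 in decimal, hence the number in the error. The xref = img[0] pattern is only valid when img is an entry returned by page.get_images(), whose first element is the xref. For image blocks no xref is needed at all: write block['image'] to disk directly and use block['ext'] as the file extension.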
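As a quick check of where the number 137 comes from, the sketch below indexes the standard 8-byte PNG signature the same way the failing code indexes block['image']. The constant name PNG_SIGNATURE is only illustrative, and no PyMuPDF call is involved.

```python
# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Indexing a bytes object returns an int (one byte), not a tuple element.
first_byte = PNG_SIGNATURE[0]
print(first_byte)  # 137
```

By the same logic, image data that starts with a different byte (a JPEG starts with 0xFF) would yield a different bogus number, so the value in the error can vary from file to file.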

Step-by-step guide

The steps below resolve the "xref = 137" error and extract both text and images from a PDF:

Import the library:
```
import fitz
```
Open the PDF document:
```
doc = fitz.open(pdf_path)
```

Loop over the pages:
```
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
```

Get the text and image blocks:
```
blocks = page.get_text("dict")["blocks"]
```

Loop over the blocks:
```
for block in blocks:
```
Identify image blocks:
```
if block['type'] == 1:
```
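Read the image bytes and file extension straight from the block (no cross reference is involved):
```
image_bytes = block['image']  # raw image bytes, not an (xref, ...) tuple
image_ext = block['ext']  # file extension such as "png" or "jpeg"
```

Save the image:
```
image_filename = f"image_{image_counter}.{image_ext}"
with open(image_filename, "wb") as img_file:
    img_file.write(image_bytes)
```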

Add an image label to the extracted text:
```
image_label = f"<<<image_{image_counter}>>>"
full_text += f"{image_label}\n"
```

Close the PDF document:
```
doc.close()
```
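To see what the blocks handled in these steps actually contain, a small inspection helper can be useful while debugging. The sketch below is hypothetical (the name describe_blocks and the hand-made sample data are not part of PyMuPDF); it only assumes the block layout described above, where text blocks carry "lines" and "spans" and image blocks carry "ext" and a bytes object under "image".

```python
def describe_blocks(blocks):
    """Summarize each block of a page.get_text("dict")["blocks"] list."""
    summaries = []
    for block in blocks:
        if block["type"] == 0:  # text block: count its spans
            span_count = sum(len(line["spans"]) for line in block["lines"])
            summaries.append(f"text: {span_count} span(s)")
        elif block["type"] == 1:  # image block: "image" is a bytes object
            data = block["image"]
            summaries.append(
                f"image: ext={block['ext']}, {len(data)} bytes, first byte={data[0]}"
            )
    return summaries


# Hand-made stand-ins for real blocks, so the sketch runs without a PDF.
sample_blocks = [
    {"type": 0, "lines": [{"spans": [{"text": "Hello"}, {"text": "world"}]}]},
    {"type": 1, "ext": "png", "image": b"\x89PNG\r\n\x1a\n"},
]
for summary in describe_blocks(sample_blocks):
    print(summary)
```

With a real document, the list returned by page.get_text("dict")["blocks"] can be passed in place of sample_blocks; a "first byte" of 137 on a PNG image block is the same number that appears in the error.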

Complete example code

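The complete example below extracts the text of a PDF document and saves its images by writing each image block's bytes directly, without going through an xref: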

```
import fitz

def extract_text_and_save_images(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    image_counter = 1  # Initialize the image counter before iterating through pages

    for page_num in range(len(doc)):  # Iterate through each page of the pdf document
        page = doc.load_page(page_num)  # Load the pdf page
        blocks = page.get_text("dict")["blocks"]  # The list of block dictionaries

        for block in blocks:  # Iterate through each block
            if block['type'] == 0:  # If the block is a text block
                for line in block["lines"]:  # Iterate through lines in the block
                    for span in line["spans"]:  # Iterate through spans in the line
                        full_text += span["text"] + " "  # Append text to full_text
                full_text += "\n"  # Add newline after each block

            elif block['type'] == 1:  # If the block is an image block
                image_label = f"<<<image_{image_counter}>>>"  # Label to insert in the extracted text in place of the corresponding image
                full_text += f"{image_label}\n"  # Insert image label at the image location
                image_bytes = block['image']  # Raw image bytes; an image block carries no xref
                image_filename = f"image_{image_counter}.{block['ext']}"  # Use the extension reported by the block

                with open(image_filename, "wb") as img_file:  # Save the image
                    img_file.write(image_bytes)

                image_counter += 1  # Increment counter for next image

    doc.close()  # Close the pdf document
    return full_text

pdf_path = "path_to_your_pdf_file.pdf"
extracted_text = extract_text_and_save_images(pdf_path)
print(extracted_text)
```
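If the xref itself is needed, for example to look the image up again later, page.get_images() is the place where img[0] really is an xref. The following is a minimal sketch under that assumption: the function name save_images_by_xref and the output file naming are hypothetical, and doc and page are expected to be an open fitz document and one of its pages. Note that this route lists the images a page references but does not say where they sit in the text flow, so it does not replace the block-based labelling.

```python
import os


def save_images_by_xref(doc, page, out_dir="."):
    """Save every image referenced by `page`, looked up through its xref.

    Only page.get_images() and doc.extract_image() are used, so `doc`
    and `page` are assumed to be a fitz document and one of its pages.
    """
    saved = []
    for img in page.get_images(full=True):
        xref = img[0]  # here img is a tuple and its first item is the xref
        base_image = doc.extract_image(xref)  # dict with "image" bytes and "ext"
        filename = os.path.join(out_dir, f"image_xref_{xref}.{base_image['ext']}")
        with open(filename, "wb") as img_file:
            img_file.write(base_image["image"])
        saved.append(filename)
    return saved
```

With a real file this might be called as save_images_by_xref(doc, doc.load_page(0)) after doc = fitz.open(pdf_path). Because the file name is derived from the xref, an image shared by several pages is written to the same file each time instead of being duplicated.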