Python: Basic Workflow for Recognizing PDF Content with OCR

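PDF content recognition works in three steps:

Load the PDF
Convert it to images
Convert the image content to strings (based on the trained language data)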

Corresponding Python packages (all installable with pip):

pdfplumber
pillow
pytesseract
pdf2image (imported by the test code below)

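The workflow also needs two external components, Poppler and Tesseract:

1. Install Poppler

1.1 Download the latest Poppler release and extract it to a directory of your choice;
1.2 Add its bin directory to the system PATH environment variable, for example D:\Program Files (x86)\poppler\poppler-23.11.0\Library\bin;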
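
Once the environment variable is set, a quick way to confirm it took effect is to look up a Poppler executable on PATH. The sketch below is a minimal check using only the standard library. The helper name `find_missing_tools` is hypothetical, and the check assumes `pdftoppm` (one of the Poppler command-line tools that pdf2image invokes) is the binary that needs to be reachable.

```python
import shutil


def find_missing_tools(tool_names, which=shutil.which):
    """Return the tool names that cannot be located on PATH.

    `which` is injectable so the check can be exercised without
    touching the real environment.
    """
    return [name for name in tool_names if which(name) is None]


if __name__ == "__main__":
    # pdftoppm ships in Poppler's bin directory.
    missing = find_missing_tools(["pdftoppm"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler is reachable.")
```

If the tool is reported as missing right after editing the environment variable, opening a new terminal usually helps, since running shells keep the old PATH.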

2. Install Tesseract: from the GitHub tesseract-ocr/tessdoc page, follow the "Windows - Tesseract at UB Mannheim" link to download it.

(Figure: the three Tesseract programs offered.)

2.1 Installation:

Install Tesseract at UB Mannheim into a directory of your choice (for example D:\Program Files\).

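2.2 Download the trained data that OCR needs from the tessdoc page "Traineddata Files for Version 4.00 +".

If you only need to recognize a particular language, you can instead download just that language's *.traineddata file further down the same page.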
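
To make the language choice concrete: pytesseract's `image_to_string` accepts a `lang` argument, and Tesseract accepts several languages joined with `+`. The sketch below checks that the matching `*.traineddata` files are present before building that argument. The helper names and the example codes `chi_sim` and `eng` are illustrative assumptions, not part of the original setup.

```python
import os


def missing_traineddata(tessdata_dir, languages, is_file=os.path.isfile):
    """Return the language codes with no <code>.traineddata file in tessdata_dir.

    `is_file` is injectable so the logic can be checked without real files.
    """
    return [
        code
        for code in languages
        if not is_file(os.path.join(tessdata_dir, code + ".traineddata"))
    ]


def build_lang_arg(languages):
    """Join language codes the way Tesseract expects, e.g. 'chi_sim+eng'."""
    return "+".join(languages)
```

With the files in place, the result can be passed along as `pytesseract.image_to_string(image, lang=build_lang_arg(["chi_sim", "eng"]))`.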

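Add the Tesseract directory to the system PATH environment variable, for example D:\Program Files (x86)\Tesseract-OCR (use your actual install directory if it differs);
Find pytesseract.py and change tesseract_cmd = 'tesseract' to tesseract_cmd = r'D:\Program Files (x86)\Tesseract-OCR\tesseract.exe' (the path to tesseract.exe in the Tesseract-OCR directory installed earlier);
Download the OCR trained data and extract the files into D:\Program Files (x86)\Tesseract-OCR\tessdata;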

(Figure: the trained data files.)
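
As an alternative to editing pytesseract.py inside the installed package (a change that is lost whenever the package is reinstalled or upgraded), the executable path can be assigned from your own script through `pytesseract.pytesseract.tesseract_cmd`. This is a minimal sketch, not the setup described above: `pick_tesseract_cmd` and `configure_pytesseract` are hypothetical helper names, and the candidate paths are assumptions based on the example directories in this article.

```python
import os

# Candidate locations. These are assumptions derived from the example
# directories in this article; replace them with your real install path.
CANDIDATES = [
    r"D:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
    r"D:\Program Files\Tesseract-OCR\tesseract.exe",
]


def pick_tesseract_cmd(candidates, is_file=os.path.isfile):
    """Return the first candidate that exists, else 'tesseract' (PATH lookup)."""
    for path in candidates:
        if is_file(path):
            return path
    return "tesseract"


def configure_pytesseract(candidates=CANDIDATES):
    """Point pytesseract at the chosen executable and return the path used."""
    import pytesseract  # imported lazily so pick_tesseract_cmd works without it

    cmd = pick_tesseract_cmd(candidates)
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling `configure_pytesseract()` once at the top of a script, before any `image_to_string` call, is enough.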

Test code:

import glob
import pdfplumber
import pytesseract
from pdf2image import convert_from_path

# Use glob to collect the paths of the PDF files to process
pdf_files = glob.glob("path_to_your_pdf_file.pdf")

# Loop over every PDF file
for pdf_file in pdf_files:
    # Open the PDF file
    with pdfplumber.open(pdf_file) as pdf:
        # Start with an empty string to hold the extracted text
        extracted_text = ""

        # Loop over every page in the PDF
        for page in pdf.pages:
            # Extract the embedded text of the current page (None if there is none)
            text = page.extract_text()
            if text:
                extracted_text += text + "\n"

        # Convert the PDF pages to PIL images (pdf2image needs Poppler for this)
        images = convert_from_path(pdf_file)
        # Preview the first page to confirm the conversion worked
        images[0].show()
        # images[0].save('1.jpg')

        # Run OCR on each page image with pytesseract
        # (Tesseract must be installed and configured as described above)
        recognized_text = ""
        for image in images:
            recognized_text += pytesseract.image_to_string(image) + "\n"

    # Print the extracted text and the OCR-recognized text
    print(f"Extracted Text from {pdf_file}:")
    print(extracted_text)

    print(f"\nRecognized Text from {pdf_file}:")
    print(recognized_text)
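
The test code runs both extraction paths on every page. In practice one might want OCR only for pages that have no embedded text layer, such as scanned pages. The following is a sketch of that variation, not the method used above: `text_or_ocr` is a hypothetical helper, and `ocr` stands for any callable that turns an image into a string, such as `pytesseract.image_to_string`.

```python
def text_or_ocr(embedded_texts, images, ocr):
    """Return one string per page, using OCR only where no text was extracted.

    embedded_texts: results of page.extract_text() (may be None or empty)
    images:         page images in the same order
    ocr:            callable taking an image and returning a string
    """
    results = []
    for text, image in zip(embedded_texts, images):
        if text and text.strip():
            # The page already has a usable text layer.
            results.append(text)
        else:
            # No text layer: fall back to OCR on the page image.
            results.append(ocr(image))
    return results
```

With the variables from the test code, and while the PDF is still open, this could be called as `text_or_ocr([p.extract_text() for p in pdf.pages], images, pytesseract.image_to_string)`.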
