[Python] Converting a PDF to text with Python OCR #15

Open
@yangruihan

Converting a PDF to text with Python OCR

Test platform

OS: macOS 10.14.6
Python: Python 3.8.5

Prerequisites

  • Install tesseract: brew install tesseract

  • Install poppler: brew install poppler

  • Install pytesseract: pip3 install pytesseract

  • Install pdf2image: pip3 install pdf2image

  • Install numpy: pip3 install numpy

  • Install pillow: pip3 install pillow
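
Before running the OCR script, it can help to confirm that the command-line tools installed above are actually reachable. The following is a minimal sketch using only the standard library. The helper names (missing_tools, parse_langs, installed_langs) are hypothetical and not part of any library, and the sketch assumes that pdftoppm is the poppler binary pdf2image relies on and that `tesseract --list-langs` prints one language code per line after a header line.

```python
import shutil
import subprocess


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def parse_langs(output):
    """Extract language codes from the output of `tesseract --list-langs`.

    Assumption: language codes contain no spaces, while the header line
    ("List of available languages ...") does, so lines with spaces are skipped.
    """
    lines = (line.strip() for line in output.splitlines())
    return {line for line in lines if line and " " not in line}


def installed_langs():
    """Ask the local tesseract binary which language models it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Some Tesseract versions print the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Installed Tesseract languages:", sorted(installed_langs()))
```

If a language you intend to pass to pytesseract is not in the printed list, its language data probably still needs to be installed.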

Code

import numpy as np
import pytesseract
from pdf2image import convert_from_path
import time

def pdf_ocr(fname, **kwargs):
    """
    Convert a PDF to text using OCR.
    fname: path to the PDF (string)
    kwargs: extra keyword arguments passed through to convert_from_path
    """

    # Convert the PDF into a list of images, one per page
    images = convert_from_path(fname, **kwargs)
    
    # The recognized text is accumulated in this variable
    text = ''

    images_cnt = len(images)
    sum_time = 0
    
    for i, img in enumerate(images):
        # Time how long recognition takes for this page
        print(f'start {i + 1} / {images_cnt}...')
        start_time = time.time()

        img = np.array(img)

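        # Recognize the text in the image. Tesseract has no 'chi' language code;
        # Simplified Chinese is 'chi_sim' (Traditional is 'chi_tra'), and its
        # language data must be installed (for example via brew install tesseract-lang)
        text += pytesseract.image_to_string(img, lang='eng+chi_sim')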

        # Print the time taken for this page
        end_time = time.time()
        print(f'done {i + 1} / {images_cnt} use time: {end_time - start_time}\n')
        sum_time += end_time - start_time

    print(f'sum use time: {sum_time}')
    return text

fname = 'test.pdf'
text = pdf_ocr(fname)

# Write the result to a file (UTF-8 so that Chinese text is saved correctly)
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(text)
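
The script above converts every page to an image before OCR starts, which can use a lot of memory on long PDFs. One possible variation, sketched below, processes the PDF a few pages at a time using the first_page and last_page arguments of convert_from_path. The names page_batches and pdf_ocr_batched, the batch size, the dpi value and the 'eng+chi_sim' language string are assumptions made for illustration, not part of the original script.

```python
def page_batches(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering all pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def pdf_ocr_batched(fname, batch_size=10, lang='eng+chi_sim', dpi=200):
    """OCR a PDF in batches so only batch_size page images are held at once."""
    # Imported here so page_batches stays usable without the OCR dependencies.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    # pdfinfo_from_path reads the PDF metadata, including the page count.
    total_pages = pdfinfo_from_path(fname)["Pages"]

    parts = []
    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(
            fname, dpi=dpi, first_page=first, last_page=last
        )
        for img in images:
            # pytesseract accepts PIL images directly.
            parts.append(pytesseract.image_to_string(img, lang=lang))
    return ''.join(parts)
```

Calling pdf_ocr_batched('test.pdf') would return the same kind of string as pdf_ocr. Collecting the pieces in a list and joining them once at the end also avoids repeatedly growing one large string.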
