[Python] Converting a PDF to Text with Python OCR

# Converting a PDF to Text with Python OCR

## Test environment

- OS: macOS 10.14.6
- Python: 3.8.5

## Prerequisites

- Install [tesseract](https://github.com/tesseract-ocr/tesseract): `brew install tesseract`
- Install [poppler](https://poppler.freedesktop.org/): `brew install poppler`
- Install [pytesseract](https://github.com/madmaze/pytesseract): `pip3 install pytesseract`
- Install [pdf2image](https://github.com/Belval/pdf2image): `pip3 install pdf2image`
- Install [numpy](https://github.com/numpy/numpy): `pip3 install numpy`
- Install [pillow](https://github.com/python-pillow/Pillow): `pip3 install pillow`
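
Before running the script, it can help to confirm that the prerequisites above are actually visible to Python. The following is a minimal sketch, assuming the Homebrew installs put `tesseract` and poppler's `pdftoppm` on the `PATH`; `check_environment` is a hypothetical helper name and is not part of any of these libraries.

```python
import importlib.util
import shutil


def check_environment():
    """Report which command-line tools and Python packages can be found."""
    # Command-line tools: tesseract performs the OCR, and pdftoppm (shipped
    # with poppler) is one of the programs pdf2image uses to render pages.
    tools = {name: shutil.which(name) is not None
             for name in ("tesseract", "pdftoppm")}
    # Importable module names (note that Pillow is imported as "PIL").
    modules = {name: importlib.util.find_spec(name) is not None
               for name in ("pytesseract", "pdf2image", "numpy", "PIL")}
    return {"tools": tools, "modules": modules}


if __name__ == "__main__":
    report = check_environment()
    for group, items in report.items():
        for name, found in items.items():
            print(f"{group[:-1]} {name}: {'ok' if found else 'MISSING'}")
```

To see which OCR language packs are installed, run `tesseract --list-langs` in a terminal; recognizing Chinese requires the matching language data (for example `chi_sim`) to appear in that list.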

## Code

```python
import numpy as np
import pytesseract
from pdf2image import convert_from_path
import time

def pdf_ocr(fname, **kwargs):
    """
    Convert a PDF to text using OCR.
    fname: path to the PDF (string)
    kwargs: extra keyword arguments passed through to convert_from_path
    """

    # Convert the PDF pages to images
    images = convert_from_path(fname, **kwargs)
    
    # Accumulates the recognized text
    text = ''

    images_cnt = len(images)
    sum_time = 0
    
    for i, img in enumerate(images):
        # Time how long recognition takes for this page
        print(f'start {i + 1} / {images_cnt}...')
        start_time = time.time()

        img = np.array(img)

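        # Recognize the text in the image.
        # Tesseract has no 'chi' language code: Simplified Chinese is 'chi_sim'
        # (Traditional is 'chi_tra'), and its language data must be installed.
        text += pytesseract.image_to_string(img, lang='eng+chi_sim')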

        # Record and print the elapsed time
        end_time = time.time()
        print(f'done {i + 1} / {images_cnt} use time: {end_time - start_time}\n')
        sum_time += end_time - start_time

    print(f'sum use time: {sum_time}')
    return text

fname = 'test.pdf'
text = pdf_ocr(fname)

# Write the result to a file (UTF-8 so that Chinese text is saved correctly)
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```
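
Because `pdf_ocr` forwards its keyword arguments to `convert_from_path`, options such as `dpi`, `first_page`, and `last_page` can be passed straight through. For long PDFs, rendering every page at once can use a lot of memory, so one possible variation is to process the document a few pages at a time. The sketch below is only an illustration under stated assumptions: `page_ranges` and `pdf_ocr_batched` are hypothetical names, the batch size of 10 is arbitrary, and the page count is read with pdf2image's `pdfinfo_from_path`.

```python
def page_ranges(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def pdf_ocr_batched(fname, batch_size=10, lang="eng+chi_sim", **kwargs):
    """OCR a PDF a few pages at a time (hypothetical variant of pdf_ocr)."""
    # Imported lazily so page_ranges can be used without these packages.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(fname)["Pages"]
    parts = []
    for first, last in page_ranges(total_pages, batch_size):
        # first_page / last_page are 1-based and inclusive in pdf2image.
        images = convert_from_path(
            fname, first_page=first, last_page=last, **kwargs
        )
        # Only the images of the current batch are held in memory here.
        parts.extend(
            pytesseract.image_to_string(img, lang=lang) for img in images
        )
    return "".join(parts)
```

For example, `pdf_ocr_batched('test.pdf', batch_size=5, dpi=300)` would render and recognize five pages per step. Do not pass `first_page` or `last_page` in `kwargs` here, since the function already sets them for each batch.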
