fix & refactor & docs:update ocr logic and installation guides (#88)
* refactor(extract_pdf): When converting a PDF to a list of images, do not perform a BGR channel conversion upfront.
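
For context, the idea in a minimal, self-contained form (a stand-in image, not the module's actual code): keep page images in RGB as rendered and convert to BGR only where a consumer such as OpenCV needs it.

```python
import numpy as np
from PIL import Image

page = Image.new("RGB", (4, 4), "white")   # stand-in for a rendered PDF page
rgb = np.array(page)                       # shape (H, W, 3), RGB channel order
bgr = rgb[:, :, ::-1]                      # convert on demand, only for BGR consumers
```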

* feat(self_modify): refine text and formula detection box updating logic

Update the logic for merging and refining detection boxes in self_modify module.
Replace hardcoded checks with dynamic calculations for determining overlapping regions,
resulting in more accurate detection box merging when formulae are identified within texts.
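
A minimal sketch of what such a dynamic overlap check can look like; the box format, helper names, and threshold are illustrative assumptions, not the module's actual API.

```python
def overlap_ratio(box_a, box_b):
    """Intersection area divided by the area of box_a; boxes are [x0, y0, x1, y1]."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0
    inter = (x1 - x0) * (y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    return inter / area_a


def formulas_inside_text(text_box, formula_boxes, threshold=0.8):
    """Collect formula boxes that mostly fall inside a given text box."""
    return [fb for fb in formula_boxes if overlap_ratio(fb, text_box) >= threshold]
```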

* fix(pdf_extract): optimize batch size and worker count for DataLoader

Reduce the batch size from 128 to 64 and set the number of workers to 0 in the DataLoader to improve stability and performance on systems with limited resources.
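
Expressed as a self-contained sketch (the placeholder dataset below stands in for the project's page-image dataset, which is not shown here):

```python
from torch.utils.data import DataLoader, Dataset

class PageImages(Dataset):
    """Placeholder dataset; the real one yields preprocessed PDF page images."""
    def __init__(self, images):
        self.images = images
    def __len__(self):
        return len(self.images)
    def __getitem__(self, idx):
        return self.images[idx]

# Smaller batches and in-process loading favour stability on low-resource machines.
dataloader = DataLoader(PageImages([]), batch_size=64, num_workers=0)
```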

refactor(pdf_extract): refactor ocr and table recognition logic

Refactor the OCR and table recognition logic to enhance readability and maintainability. This includes adjusting the formula recognition coordinates relative to the cropped image and streamlining the handling of OCR results and table recognition.
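
A sketch of the kind of coordinate adjustment described above; the function and argument names are assumptions for illustration only.

```python
def to_crop_coords(box, crop_left, crop_top, paste_x=0, paste_y=0):
    """Shift a page-level [x0, y0, x1, y1] box into a cropped image's frame.

    paste_x / paste_y account for any margin added when the crop is pasted
    onto a larger canvas before OCR.
    """
    x0, y0, x1, y1 = box
    return [x0 - crop_left + paste_x,
            y0 - crop_top + paste_y,
            x1 - crop_left + paste_x,
            y1 - crop_top + paste_y]
```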

* refactor(pdf_extract): optimize image processing and table recognition

- Rename loop variable 'idx' to 'pdf_idx' for clarity.
- Adjust image pasting and coordinate handling during OCR processing.
- Add comments for improved code understanding.
- Ensure proper rendering of images during PDF visualization.
- Refactor logging and utility imports in self_modify module.

The changes include improvements to image processing routines, better variable naming,
and streamlined table recognition logic. Also, the visualization process has been tweaked
to handle images more accurately. Additionally, redundant logging and utility imports have been cleaned up in the self_modify module to declutter the codebase.

* refactor(pdf_extract): remove hardcoded paste values in crop_img function

The crop_img function now accepts `crop_paste_x` and `crop_paste_y` as parameters
instead of using hardcoded values. This change makes the function more flexible and easier to adjust for different use cases.
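
A sketch of the pattern: only the parameter names `crop_paste_x` and `crop_paste_y` come from this commit; the real function's signature and behaviour may differ.

```python
from PIL import Image

def crop_img(image, box, crop_paste_x=0, crop_paste_y=0):
    """Crop `box` from `image` and paste it onto a white canvas with a margin."""
    x0, y0, x1, y1 = box
    cropped = image.crop((x0, y0, x1, y1))
    canvas = Image.new("RGB",
                       (cropped.width + 2 * crop_paste_x,
                        cropped.height + 2 * crop_paste_y),
                       "white")
    canvas.paste(cropped, (crop_paste_x, crop_paste_y))
    return canvas
```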

* fix(extract_pdf): prevent overscaling of large images

Adjust the condition to prevent images from being enlarged beyond a width or
height of 9000 pixels, ensuring large images do not become overly large when
processed. This change avoids unnecessary resource consumption and potential
performance issues when handling scaled images.

* docs: update installation guides and requirements

- Update the installation guides for macOS and Windows with new commands and simplified dependency installation.
- Add new installation guide for Linux.
- Modify requirements for CPU and GPU environments, including updates to
  `unimernet`, `matplotlib`, and `paddlepaddle`.
- Provide precompiled wheels for `detectron2` in the installation process.

* docs(windows_en): update config guidance for windows

* Update func description in self_modify.py

* change parameter name in pdf_extract.py, update padding size in ocr

* update some instructions in Install_in_Windows_en.md

* update some instructions in Install_in_Windows_zh_cn.md

* Update README.md

* Update README-zh_CN.md

---------

Co-authored-by: Fan Wu <34300920+wufan-tb@users.noreply.github.com>
myhloli and wufan-tb authored Aug 14, 2024
1 parent 4a2fc68 commit 74a5e17
Showing 13 changed files with 239 additions and 145 deletions.
12 changes: 3 additions & 9 deletions README-zh_CN.md
@@ -229,16 +229,10 @@ conda create -n pipeline python=3.10

pip install -r requirements.txt

pip install --extra-index-url https://miropsota.github.io/torch_packages_builder detectron2==0.6+pt2.3.1cu121
pip install https://github.com/opendatalab/PDF-Extract-Kit/raw/main/assets/whl/detectron2-0.6-cp310-cp310-linux_x86_64.whl
```

After setting up the environment, you may encounter version conflicts that change package versions. If you run into version-related errors, try the following command to reinstall the specified library version.

```bash
pip install pillow==8.4.0
```

Besides version conflicts, you may also encounter errors where torch cannot be invoked. Uninstall the following library first, then reinstall cuda12 and cudnn.
After setting up the environment, you may also encounter errors where torch cannot be invoked. Uninstall the following library first, then reinstall cuda12 and cudnn.

```bash
pip uninstall nvidia-cusparse-cu12
@@ -260,7 +254,7 @@
## Run Extraction Script

```bash
python pdf_extract.py --pdf data/pdfs/ocr_1.pdf
python pdf_extract.py --pdf assets/examples/example.pdf
```

Parameter explanations:
14 changes: 4 additions & 10 deletions README.md
@@ -214,23 +214,17 @@ The formula recognition we used is based on the weights downloaded from [UniMERN
The table recognition we used is based on the weights downloaded from [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy), a solution that converts images of Table into LaTeX. Compared to the table recognition capability of PP-StructureV2, StructEqTable demonstrates stronger recognition performance, delivering good results even with complex tables, which may currently be best suited for data within research papers. There is also significant room for improvement in terms of speed, and we are continuously iterating and optimizing. Within a week, we will update the table recognition capability to [MinerU](https://github.com/opendatalab/MinerU).


## Installation Guide
## Installation Guide (Linux)

```bash
conda create -n pipeline python=3.10

pip install -r requirements.txt

pip install --extra-index-url https://miropsota.github.io/torch_packages_builder detectron2==0.6+pt2.3.1cu121
pip install https://github.com/opendatalab/PDF-Extract-Kit/raw/main/assets/whl/detectron2-0.6-cp310-cp310-linux_x86_64.whl
```

After installation, you may encounter some version conflicts leading to version changes. If you encounter version-related errors, you can try the following commands to reinstall specific versions of the libraries.

```bash
pip install pillow==8.4.0
```

In addition to version conflicts, you may also encounter errors where torch cannot be invoked. First, uninstall the following library and then reinstall cuda12 and cudnn.
After installation, you may also encounter errors where torch cannot be invoked. First, uninstall the following library and then reinstall cuda12 and cudnn.

```bash
pip uninstall nvidia-cusparse-cu12
@@ -255,7 +249,7 @@ If you intend to experience this project on Google Colab, please <a href="https:
## Run Extraction Script

```bash
python pdf_extract.py --pdf data/pdfs/ocr_1.pdf
python pdf_extract.py --pdf assets/examples/example.pdf
```

Parameter explanations:
Binary file not shown.
26 changes: 16 additions & 10 deletions docs/Install_in_Windows_en.md
@@ -13,9 +13,9 @@ To run the project smoothly on Windows, perform the following preparations:
- Install ImageMagick:
- https://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-on-windows
- Modify configurations:
- PDF-Extract-Kit/pdf_extract.py:148
- PDF-Extract-Kit/pdf_extract.py:L148 Adjust `batch_size` to suit your GPU memory; specifically, try lowering `batch_size` when you encounter an out-of-memory (OOM) error.
```python
dataloader = DataLoader(dataset, batch_size=128, num_workers=0)
dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
```

## Using in CPU Environment
@@ -36,19 +36,19 @@ pip install https://github.com/opendatalab/PDF-Extract-Kit/raw/main/assets/whl/d

### 3.Modify Configurations for CPU Inference

PDF-Extract-Kit/configs/model_configs.yaml:2
PDF-Extract-Kit/configs/model_configs.yaml:L2
```yaml
device: cpu
```
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:72
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:L72
```yaml
DEVICE: cpu
```
### 4.Run the Application
```bash
python pdf_extract.py --pdf demo/demo1.pdf
python pdf_extract.py --pdf assets/examples/example.pdf
```

## Using in GPU Environment
@@ -60,7 +60,7 @@
https://developer.nvidia.com/cuda-11-8-0-download-archive
- cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x
https://developer.nvidia.com/rdp/cudnn-archive
- Ensure your GPU has adequate memory, with a minimum of 6GB recommended; ideally, 16GB or more is preferred.
- Ensure your GPU has adequate memory, with a minimum of 8GB recommended; ideally, 16GB or more is preferred.
- If the GPU memory is less than 16GB, adjust the `batch_size` in the [Preprocessing](#Preprocessing) section as needed, lowering it to "64" or "32" appropriately.


@@ -82,19 +82,25 @@ pip install https://github.com/opendatalab/PDF-Extract-Kit/blob/main/assets/whl/
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
```

### 3.Modify Configurations for CUDA Inference
### 3.Modify Configurations for CUDA Inference (Layout & Formula)

PDF-Extract-Kit/configs/model_configs.yaml:2
PDF-Extract-Kit/configs/model_configs.yaml:L2
```yaml
device: cuda
```
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:72
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:L72
```yaml
DEVICE: cuda
```
### 4.Run the Application
```bash
python pdf_extract.py --pdf demo/demo1.pdf
python pdf_extract.py --pdf assets/examples/example.pdf
```

### 5.When VRAM is 16GB or more, OCR acceleration can be enabled.
When you confirm that your VRAM is 16GB or more, you can install paddlepaddle-gpu using the following command, which will automatically enable OCR acceleration after installation:
```bash
pip install paddlepaddle-gpu==2.6.1
```
26 changes: 16 additions & 10 deletions docs/Install_in_Windows_zh_cn.md
@@ -14,9 +14,9 @@
- Install ImageMagick
- https://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-on-windows
- Configurations to modify
- PDF-Extract-Kit/pdf_extract.py:148
- PDF-Extract-Kit/pdf_extract.py:L148 Adjust `batch_size` according to your GPU memory; if you encounter an out-of-memory error, try lowering `batch_size`.
```python
dataloader = DataLoader(dataset, batch_size=128, num_workers=0)
dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
```

## Using in CPU Environment
@@ -37,19 +37,19 @@ pip install https://github.com/opendatalab/PDF-Extract-Kit/raw/main/assets/whl/d

### 3.Modify Configurations for CPU Inference

PDF-Extract-Kit/configs/model_configs.yaml:2
PDF-Extract-Kit/configs/model_configs.yaml:L2
```yaml
device: cpu
```
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:72
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:L72
```yaml
DEVICE: cpu
```
### 4.Run the Application
```bash
python pdf_extract.py --pdf demo/demo1.pdf
python pdf_extract.py --pdf assets/examples/example.pdf
```

## Using in GPU Environment
@@ -61,7 +61,7 @@
https://developer.nvidia.com/cuda-11-8-0-download-archive
- cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x
https://developer.nvidia.com/rdp/cudnn-archive
- Make sure the GPU has enough memory: at least 6GB, with 16GB or more recommended
- Make sure the GPU has enough memory: at least 8GB, with 16GB or more recommended
- If GPU memory is less than 16GB, lower the `batch_size` in the configurations under [Preprocessing](#预处理) to "64" or "32" as appropriate


@@ -82,19 +82,25 @@ pip install https://github.com/opendatalab/PDF-Extract-Kit/blob/main/assets/whl/
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
```

### 3.Modify Configurations for CUDA Inference
### 3.Modify Configurations for CUDA Inference (Layout & Formula)

PDF-Extract-Kit/configs/model_configs.yaml:2
PDF-Extract-Kit/configs/model_configs.yaml:L2
```yaml
device: cuda
```
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:72
PDF-Extract-Kit/modules/layoutlmv3/layoutlmv3_base_inference.yaml:L72
```yaml
DEVICE: cuda
```
### 4.Run the Application
```bash
python pdf_extract.py --pdf demo/demo1.pdf
python pdf_extract.py --pdf assets/examples/example.pdf
```

### 5.When VRAM is 16GB or more, OCR acceleration can be enabled
Once you have confirmed that your VRAM is 16GB or more, you can install paddlepaddle-gpu with the following command; OCR acceleration is enabled automatically after installation:
```bash
pip install paddlepaddle-gpu==2.6.1
```
3 changes: 1 addition & 2 deletions docs/Install_in_macOS_en.md
@@ -30,8 +30,7 @@ Use either venv or conda, with Python version recommended as 3.10.
### 2.Install Dependencies

```bash
pip install unimernet==0.1.0
pip install -r requirements-without-unimernet+cpu.txt
pip install -r requirements+cpu.txt

# For detectron2, compile it yourself as per https://github.com/facebookresearch/detectron2/issues/5114
# Or use our precompiled wheel
3 changes: 1 addition & 2 deletions docs/Install_in_macOS_zh_cn.md
@@ -29,8 +29,7 @@
### 2.Install Dependencies

```bash
pip install unimernet==0.1.0
pip install -r requirements-without-unimernet+cpu.txt
pip install -r requirements+cpu.txt

# detectron2 needs to be built from source; see https://github.com/facebookresearch/detectron2/issues/5114 to compile it yourself
# Or use our precompiled whl package directly
15 changes: 7 additions & 8 deletions modules/extract_pdf.py
@@ -9,16 +9,15 @@ def load_pdf_fitz(pdf_path, dpi=72):
doc = fitz.open(pdf_path)
for i in range(len(doc)):
page = doc[i]
pix = page.get_pixmap(matrix=fitz.Matrix(dpi/72, dpi/72))
image = Image.frombytes('RGB', (pix.width, pix.height), pix.samples)
mat = fitz.Matrix(dpi / 72, dpi / 72)
pm = page.get_pixmap(matrix=mat, alpha=False)

# if width or height > 3000 pixels, don't enlarge the image
if pix.width > 3000 or pix.height > 3000:
pix = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
image = Image.frombytes('RGB', (pix.width, pix.height), pix.samples)
# If the width or height exceeds 9000 after scaling, do not scale further.
if pm.width > 9000 or pm.height > 9000:
pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

# images.append(image)
images.append(np.array(image)[:,:,::-1])
img = Image.frombytes("RGB", (pm.width, pm.height), pm.samples)
images.append(np.array(img))
return images

