1. Add latex_ocr formula recognition to the ppstructure pipeline; 2. Add PDF-to-Markdown conversion #13868

ztyf-lq · 2024-09-13T10:45:26Z

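Preface

Dear ppocr maintainers, I am a user of the ppocr project. I use ppocr regularly in my work and studies and am deeply impressed by how powerful it is, and I would very much like to contribute to it. I hope you can find time to check whether the code I have written is something ppocr needs.

The changes are as follows:

1. Add latex_ocr formula recognition to the ppstructure pipeline:
   a. Modify the StructureSystem class in ppstructure/predict_system.py to add the latex_ocr model and the handling of regions whose layout type is formula;
   b. Because docx does not support inserting LaTeX formulas, skip LaTeX formulas in the convert_info_docx function in ppstructure/recovery/recovery_to_doc.py;
   c. Skip LaTeX formulas when visualizing OCR results in the draw_structure_result function in ppstructure/utility.py.
2. Add PDF-to-Markdown conversion:
   a. Add recovery_to_markdown.py under ppstructure/recovery; it converts ppstructure recognition results into a Markdown file. Two methods are currently provided for text regions. The first treats two leading spaces as the marker that starts each paragraph. The second handles paragraphs with no leading spaces, relying on the fact that the last line of a paragraph is usually not a "full" line but leaves some empty space;
   b. Call the function that converts ppstructure recognition results to a Markdown file from ppstructure/predict_system.py.
3. Add the necessary command-line options:
   a. Add the parameters required by the latex_ocr formula recognition model;
   b. Add a recovery_to_markdown option to enable or disable converting ppstructure recognition results to a Markdown file;
   c. Add a formula option to enable or disable LaTeX formula recognition.

If my code turns out to be what ppocr needs, I will follow up on the maintainers' suggestions and add a PDF-to-Markdown tutorial to the layout recovery documentation.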
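
The two paragraph-splitting approaches described above for text regions can be sketched roughly as follows. This is only an illustration, not the code in recovery_to_markdown.py: the function names, the `full_ratio` threshold, and the use of character counts as a stand-in for rendered line width are all assumptions.

```python
from typing import List


def split_by_indent(lines: List[str]) -> List[str]:
    """Method 1 (assumed form): two leading spaces open a new paragraph."""
    paragraphs: List[List[str]] = []
    for line in lines:
        if line.startswith("  ") or not paragraphs:
            paragraphs.append([line.strip()])
        else:
            paragraphs[-1].append(line.strip())
    # Lines are joined with a space; Chinese text would likely use "" instead.
    return [" ".join(p) for p in paragraphs]


def split_by_short_last_line(lines: List[str], full_ratio: float = 0.9) -> List[str]:
    """Method 2 (assumed form): a line clearly shorter than the widest line
    is treated as the last line of a paragraph."""
    if not lines:
        return []
    max_len = max(len(line) for line in lines)
    paragraphs: List[str] = []
    current: List[str] = []
    for line in lines:
        current.append(line.strip())
        if len(line) < full_ratio * max_len:  # not a "full" line
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing lines that never hit a short line
        paragraphs.append(" ".join(current))
    return paragraphs
```

The second heuristic misfires when a paragraph happens to end on a full line, so a real implementation would probably want the actual bounding-box widths rather than character counts.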

CLAassistant · 2024-09-13T10:45:32Z

All committers have signed the CLA.

GreatV · 2024-09-13T10:57:13Z

Thanks for the contribution!

GreatV · 2024-09-13T11:10:09Z

@liuhongen1234567 Could you please review this PR?

GreatV · 2024-09-14T05:16:05Z

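I suggest updating the documentation to explain how to use this. Because our documentation site is still being migrated, it needs to be updated in two places: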

ppstructure

docs

ztyf-lq · 2024-09-20T06:07:31Z

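Hello, I do plan to update the documentation. I may be using ppocr to reproduce other projects in the near term, so the documentation update will come in October at the latest.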

jzhang533 · 2024-09-27T06:25:19Z

ppstructure/predict_system.py

@@ -78,6 +80,13 @@ def __init__(self, args):
                    )
                else:
                    self.table_system = TableSystem(args)
+            if args.formula:
+                args_fomula = deepcopy(args)


There is a typo: args_fomula should be args_formula.

OK, I will fix it.

I have fixed it.
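
For context, copying `args` before customizing it keeps sub-model overrides from leaking into the settings shared with the other models. The sketch below shows that pattern only; the diff excerpt does not show which fields the PR actually overrides, so the assignment to `rec_model_dir` is an assumption, and `make_formula_args` is a hypothetical helper.

```python
from argparse import Namespace
from copy import deepcopy


def make_formula_args(args: Namespace) -> Namespace:
    """Return an independent copy of args configured for the formula model."""
    args_formula = deepcopy(args)
    # Assumed override, for illustration only: point the recognizer
    # option at the formula model directory.
    args_formula.rec_model_dir = args.formula_model_dir
    return args_formula
```

Because the copy is independent, the original `args.rec_model_dir` is left untouched for the regular text recognizer.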

jzhang533 · 2024-09-27T06:34:31Z

You may need to sign the updated CLA, and I think we can leave the documentation as future work.

ztyf-lq · 2024-09-27T06:48:22Z

I have updated the CLA.

GreatV · 2024-09-27T06:58:40Z

How is this feature used? Please provide an example so we can verify it.

ztyf-lq · 2024-09-27T07:21:21Z

Example run:
cd ppstructure && python predict_system.py --image_dir=data/math.pdf --det_model_dir=models/ch_PP-OCRv4_det_infer --rec_model_dir=models/ch_PP-OCRv4_rec_infer --table_model_dir=models/ch_ppstructure_mobile_v2.0_SLANet_infer --formula_model_dir=models/rec_latex_ocr_infer --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt --layout_model_dir=models/picodet_lcnet_x1_0_fgd_layout_cdla_infer --layout_dict_path=../ppocr/utils/dict/layout_dict/layout_cdla_dict.txt --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt --vis_font_path=../doc/fonts/simfang.ttf --formula=True --recovery=True --recovery_to_markdown=True --output=../output/
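Here, --formula determines whether formula regions found by layout detection are recognized. --recovery_to_markdown determines whether the recognition results are converted to a Markdown file, and it only takes effect when --recovery=True. Formula recognition only works with the layout_cdla layout model.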
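
The dependencies between these options can be made explicit with a small check. The sketch below is a hypothetical helper, not part of predict_system.py: `check_options` and its messages are assumptions, and testing for "cdla" in `layout_dict_path` is only an assumed way to detect the CDLA layout model.

```python
from argparse import Namespace
from typing import List


def check_options(args: Namespace) -> List[str]:
    """Return warnings for option combinations that would silently do nothing."""
    warnings: List[str] = []
    if getattr(args, "recovery_to_markdown", False) and not getattr(args, "recovery", False):
        warnings.append("--recovery_to_markdown has no effect unless --recovery=True")
    layout_dict = getattr(args, "layout_dict_path", "") or ""
    if getattr(args, "formula", False) and "cdla" not in layout_dict.lower():
        warnings.append("--formula requires the layout_cdla layout model")
    return warnings
```

With the options from the example run (`--formula=True --recovery=True --recovery_to_markdown=True` and the layout_cdla dictionary) the function returns an empty list.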

One of my conversion examples is shown below:

GreatV · 2024-09-27T09:19:03Z

Looks good, very nice work. The two-column handling still seems to have some problems, though.

where u E Rn is the input signal, E Rh is the internal state,and y E Rm is the output. Here, we are letting n > 1, m > 1,which yields a multiple-input, multiple-output (MIMO) state-space model. For the remainder of this paper, we will ignorethe Du term as we do not use it.

The state-space model in its original form describes acontinuous-time system, but in the field of digital signalprocessing, there are standard recipes for discretizing sucha system into a discrete-time state-space model. One suchmethod that we use in this work is the zero-order hold (ZOH),which gives us the discrete-time state-space matrices A andB as follows:

$$\overline{{{A}}}=\exp(\Delta A)$$

$$\overline{{B}}=(\Delta A)^{-1}\cdot(\exp(\Delta A)-1)\cdot\Delta B.;;(2)$$

The discrete state-space model is then given byc[t+1]=Ax[t]+Bu[t]，

y[t]=Cx[t]

We use an hourglass network with long-range skip connec-tions, similar in form to the Sashimi network [24] for audiogeneration. However, unlike previous works using state-spacemodels for audio processing [24], [29], [30], our networkdirectly takes in raw audio waveforms in the -1 to +1 rangeand outputs raw waveforms as well, with no one-hot encodingor spectral processing (e.g. STFT or iSTFT). Furthermore,we retain causality as much as possible for sake of real-timeinference, meaning that we eschew any form of bidirectionalstate-space layers. See Fig. 1 for a schematic drawing.

As with typical auto-encoder networks, the audio featuresare down-sampled in the encoder and then up-sampled inthe decoder. For the re-sampling operation, we use a simpleIThis is a prior not required, as technically we can configure our networkas complex-valued to handle complex features.However, we do not explorethis configuration in this work.

2The size of the internal state h,can be interpreted as the degree ofparametrization of a basis temporal kernel, or some implicit (dilated)“kernelsize" in the frequency domain. We explore this in a future work.

(3)

In the context of recurrent neural networks (RNNs),this isessentially a linear RNN layer, which allows for efficientonline inference and generation (in our case real-time speechenhancement), but at the same time efficient parallelizationduring training.

It is straightforward to check that the discrete-time impulseresponse is given as

$$k[\tau]=C,\overline{{{A}}}^{\prime},\overline{{{B}}}.$$

(4)

where T denotes the kernel timestep. During training, k canbe considered the “full” long 1D convolutional kernel withshape (output channels, input channels, length), in the sensethat the output y can be computed via the long convolutionyj = > u *kij. By the convolution theorem, we can performthis operation in the frequency domain, which becomes apoint-wise product gjf = >, ukijf. The hat symbol denotesthe Fourier transform of the signal (with the index f denoting

Fig. 1. A schematic drawing of the network architecture, with only 2 encoderand decoder blocks shown for simplicity. The actual model has 6 encoder anddecoder blocks. Note that there is no (spectral) processing on the input andoutput waveforms.
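
To make the two-column concern concrete, here is a minimal sketch of one way layout regions could be put into reading order on a two-column page. It is not the ordering logic used by ppstructure; `two_column_order`, the `(x0, y0, x1, y1)` box format, and the split at half the page width are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed format


def two_column_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Order boxes left column first, then right column, each top to bottom."""
    mid = page_width / 2

    def center_x(box: Box) -> float:
        return (box[0] + box[2]) / 2

    left = [b for b in boxes if center_x(b) < mid]
    right = [b for b in boxes if center_x(b) >= mid]
    return sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
```

This sketch ignores regions that span both columns (titles, wide figures, footnotes), which is exactly where output like the sample above tends to interleave text from different parts of the page.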

ztyf-lq · 2024-09-27T10:17:27Z

Yes, in some places my method does not handle the text well, for example:
(4)

where T denotes the kernel timestep. During training, k canbe considered the “full” long 1D convolutional kernel withshape (output channels, input channels, length), in the sensethat the output y can be computed via the long convolutionyj = > u *kij. By the convolution theorem, we can performthis operation in the frequency domain, which becomes apoint-wise product gjf = >, ukijf. The hat symbol denotesthe Fourier transform of the signal (with the index f denoting

This passage contains extra \n characters. I am working on a new approach to avoid these problems.
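
One possible direction, sketched here purely as an assumption and not as the author's planned fix, is to collapse the line breaks inside a text region into spaces and to merge words that were hyphenated at a line break. `join_wrapped_lines` is a hypothetical helper.

```python
def join_wrapped_lines(region_text: str) -> str:
    """Collapse the line breaks inside one text region into a single line."""
    joined = ""
    for raw in region_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # drop empty lines left by stray "\n" characters
        if joined.endswith("-") and line[0].islower():
            # Word split across lines: "connec-" + "tions" -> "connections"
            joined = joined[:-1] + line
        elif joined:
            joined += " " + line
        else:
            joined = line
    return joined
```

The hyphen rule would also merge genuine compounds that happen to break at a line end (for example "long-" followed by "range"), so it is only a starting point.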

GreatV

LGTM
This can be merged first; the documentation and the processing logic can be improved afterwards.

luotao1 · 2024-10-15T07:40:59Z

@ztyf-lq Thanks for your contribution! You will receive a beautiful PaddlePaddle gift. Please provide your mailing address by filling out the following questionnaire before October 18th.

Looking forward to the future, we will walk further together in the world of open source!
Click here: https://paddle.wjx.cn/vm/h4On9gJ.aspx#

ztyf-lq added 2 commits September 13, 2024 11:38

Add formula recognition in ppstructure,Convert PDF to markdown file

05d75dd

Fix bug in converting to doc in formula recognition

5567967

GreatV requested review from GreatV, jzhang533 and UserWangZz September 13, 2024 10:57

GreatV reviewed Sep 13, 2024

ztyf-lq added 2 commits September 14, 2024 12:31

Merge branch 'main' of https://github.com/PaddlePaddle/PaddleOCR into…

d05a723

… new_branch

modify time

3dfca04

jzhang533 reviewed Sep 27, 2024

Correct spelling errors in args_formula

b99e965

GreatV approved these changes Sep 27, 2024

jzhang533 merged commit 269e5b8 into PaddlePaddle:main Sep 29, 2024
3 checks passed

github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2024
