Add TODO list
wangbinDL committed Jul 19, 2024
1 parent cac61d8 commit 33d24b0
Showing 2 changed files with 19 additions and 0 deletions.
10 changes: 10 additions & 0 deletions README-zh_CN.md
@@ -261,6 +261,16 @@ python pdf_extract.py --pdf data/pdfs/ocr_1.pdf
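> This project focuses on using models for `high-quality` content extraction from `diverse` documents. It does not cover reassembling the extracted content into new documents, such as PDF-to-Markdown conversion. For that, please refer to our other GitHub project: [MinerU](https://github.com/opendatalab/MinerU)

## TODO List

- [ ] **Table Parsing**: Develop a feature to convert table images into the corresponding LaTeX/Markdown source code.
- [ ] **Chemical Equation Detection**: Implement automatic detection of chemical equations.
- [ ] **Chemical Equation/Diagram Recognition**: Develop a model to recognize and parse chemical equations.
- [ ] **Reading Order Sorting Model**: Build a model to determine the correct reading order of text in documents.

**PDF-Extract-Kit** aims to provide high-quality extraction capabilities for PDF files. We encourage the community to propose specific, valuable requirements, and we welcome everyone to help improve the PDF-Extract-Kit tool and advance scientific research and industrial development.


## License

The code in this repository is open-sourced under the [Apache-2.0](LICENSE) license.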
9 changes: 9 additions & 0 deletions README.md
@@ -253,6 +253,15 @@ Parameter explanations:

> This project is dedicated to using models for high-quality content extraction from diverse documents. It does not involve reassembling the extracted content into new documents, such as converting PDFs to Markdown. For those needs, please refer to our other GitHub project: [MinerU](https://github.com/opendatalab/MinerU)
## TODO List

- [ ] **Table Parsing**: Develop a feature to convert table images into corresponding LaTeX/Markdown format source code.
- [ ] **Chemical Equation Detection**: Implement automatic detection of chemical equations.
- [ ] **Chemical Equation/Diagram Recognition**: Develop a model to recognize and parse chemical equations and diagrams.
- [ ] **Reading Order Sorting Model**: Build a model to determine the correct reading order of text in documents.

**PDF-Extract-Kit** aims to provide high-quality PDF extraction capabilities. We encourage the community to propose specific, valuable requirements, and we welcome everyone to help improve the PDF-Extract-Kit tool and advance scientific research and industrial development.

## License

This repository is licensed under the [Apache-2.0 License](LICENSE).