PDFInterpreter

This is a Python script for parsing PDF files and interpreting and analyzing their text using OpenAI GPT. The script first reads the PDF file, breaking its content down into individual pages. Then, it processes the text of each page, breaking it down into paragraphs. Next, it sends the paragraphs to OpenAI GPT to obtain interpretations and analyses of the text. Finally, it converts the GPT responses to Markdown format and turns them back into PDF files. You can place PDF files you want to parse in the input folder, and the processed PDF files will be saved in the output folder.

这是一个用于解析PDF文件并使用OpenAI GPT进行文本解释和分析的Python脚本。脚本首先读取PDF文件，将其内容分解为单独的页面。然后，对每个页面的文本进行处理，将其分解为段落。接下来，将段落发送到OpenAI GPT以获得对文本的解释和分析。最后，将GPT的响应转换为Markdown格式，并将其转换回PDF文件。您可以在 input 文件夹中放置需要解析的PDF文件，处理后的PDF文件将保存在 output 文件夹中。

Project Structure

input/ - Folder to place your input PDF files.
output/ - Folder where the processed PDF files will be saved.
main.py - Main script to run the project.
PDFReader.html - A PDF reader to read two PDFs at the same time.

Set Up

Download the repository
Install the requirements

pip install -r requirements.txt

Install wkhtmltopdf

Go to https://wkhtmltopdf.org/downloads.html to download wkhtmltopdf

Change your openai api key in env.txt (You can get your key on https://platform.openai.com/account/api-keys)
Change your target language and model in env.txt (Chinese and gpt-3.5-turbo are set as default)
Rename the file 'env.txt' to '.env'

Usage

Copy your PDF files to input/ folder
Start

python main.py

Read your PDFs

By running PDFReader.html, you can read two PDFs at the same time, which is convenient for comparing PDFs.

General errors

Please make sure you have installed wkhtmltopdf
If you encounter a wkhtmltopdf error, please add the path to your wkhtmltopdf path in the .env file
For MacOS, the default path is /usr/local/bin/wkhtmltopdf. You can get your wkhtmltopdf path by

which wkhtmltopdf

For Windows, the default path is C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe. You can get your wkhtmltopdf path by

where wkhtmltopdf

注意事项/已知限制

由于尚未获得GPT-4.0的访问权限，该项目目前无法读取PDF中的图片。会持续关注GPT-4.0的更新，并在获得访问权限后更新项目以支持图片解析。
本项目在处理大型PDF文件时可能会遇到性能瓶颈。将继续优化代码以提高处理速度和效率。

Known Limitations/Notes

As we currently do not have access to GPT-4.0, this project is unable to read images within PDFs. Will continue to monitor updates on GPT-4.0 and update the project to support image interpretation once access is granted.
The project may encounter performance bottlenecks when processing large PDF files. Will continue to optimize the code to improve processing speed and efficiency.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDFInterpreter

Project Structure

Set Up

Usage

Read your PDFs

General errors

注意事项/已知限制

Known Limitations/Notes

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
input		input
output		output
LICENSE		LICENSE
PDFReader.html		PDFReader.html
README.md		README.md
env.txt		env.txt
main.py		main.py
requirements.txt		requirements.txt

License

Hongbin-Bao/PDFInterpreter

Folders and files

Latest commit

History

Repository files navigation

PDFInterpreter

Project Structure

Set Up

Usage

Read your PDFs

General errors

注意事项/已知限制

Known Limitations/Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages