IngesterConverter

Document Conversion Tool

This script is a document conversion tool that converts PDF and Word documents (.docx) to text files (.txt). The script also tracks processed files using a SQLite database, allowing for skipping previously processed files with no modifications. Multithreading is used to process multiple files concurrently for improved performance.

Features

Convert PDF and Word documents (.docx) to text files
Keep track of processed files with a SQLite database
Use multithreading for concurrent file processing
Log output for easy tracking and debugging

Requirements

Python 3.6 or higher
pdfplumber
textract
python-docx
SQLite (included in Python standard library)

Installation

Clone the repository or download the script.
Install the required dependencies using pip:

pip install pdfplumber textract python-docx

Usage

Create a folder named "input_folder" in the same directory as the script.
Place the PDF and/or Word documents you want to convert in the "input_folder".
Run the script:

python3 convert.py

The script will process the documents and create an "output_folder" with the extracted text files.

Notes

Currently, the script only supports PDF and .docx files. You can modify the script to support other file formats.
The script uses multithreading to improve performance, but it may not be ideal for extremely large files or a very high number of files. Adjust the max_workers parameter in ThreadPoolExecutor according to your needs and system capabilities.
You can adjust the number of threads by modifying the max_workers parameter of the ThreadPoolExecutor. The current value is set to os.cpu_count(), which uses the number of CPU cores as the maximum number of worker threads.
- To change the number of threads, simply replace os.cpu_count() with the desired number. For example, if you want to use 8 threads, you would modify the line as follows:

with ThreadPoolExecutor(max_workers=8) as executor:

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.gitattributes		.gitattributes
README.md		README.md
file_history.db		file_history.db
ingesterconverter.py		ingesterconverter.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

IngesterConverter

Document Conversion Tool

Features

Requirements

Installation

Usage

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Languages

jacobyoby/ingesterconverter

Folders and files

Latest commit

History

Repository files navigation

IngesterConverter

Document Conversion Tool

Features

Requirements

Installation

Usage

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages