File Converter and Viewer

Description

The File Converter and Viewer is a Streamlit application for uploading, processing, and viewing several file types: PDFs, images, CSV, Excel, and Word documents. It uses the marker-pdf library to convert PDFs and images to markdown, so every supported file type can be viewed in a single interface.

Features

Upload multiple files of different types (PDF, JPG, PNG, CSV, XLSX, DOCX)
Convert images to PDF for unified processing
Process PDFs and images using the marker-pdf library
Convert CSV and Excel files to markdown format
Extract text from Word documents
Display processed content in an interactive Streamlit interface
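
The feature list implies that each uploaded file is routed by type to a different processing step. A minimal sketch of that routing follows. The route names and the `ROUTES` table are hypothetical and are not taken from app.py; only the file extensions come from the list above.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a processing route.
# The extensions are the ones listed under Features; the route names are
# illustrative labels, not identifiers from the actual application.
ROUTES = {
    ".pdf": "marker",
    ".jpg": "image_to_pdf",
    ".png": "image_to_pdf",
    ".csv": "table_to_markdown",
    ".xlsx": "table_to_markdown",
    ".docx": "word_text",
}


def route_for(filename: str) -> str:
    """Return the processing route for a file name, or 'unsupported'."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return ROUTES.get(suffix, "unsupported")


def group_by_route(filenames):
    """Group a batch of file names by the route each one would take."""
    grouped = {}
    for name in filenames:
        grouped.setdefault(route_for(name), []).append(name)
    return grouped
```

Grouping a batch up front makes it easy to report unsupported files before any heavy processing starts.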

Requirements

Python 3.7+
Streamlit
pandas
openpyxl
python-docx
Pillow
reportlab
marker-pdf
PyTorch (CPU or GPU version)
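
To check which of these packages are already importable in the active environment, a small standard-library script along the following lines may help. The import names (for example `docx` for python-docx, `PIL` for Pillow, `marker` for marker-pdf) are stated here as assumptions about how each package is imported; adjust them if your installed versions differ.

```python
from importlib.util import find_spec

# pip package name -> module name used at import time.
# The module names are assumptions about each package's import name.
REQUIRED = {
    "streamlit": "streamlit",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
    "python-docx": "docx",
    "Pillow": "PIL",
    "reportlab": "reportlab",
    "marker-pdf": "marker",
    "torch": "torch",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages appear to be installed.")
```

The check only confirms that a module can be located; it does not verify versions.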

Installation

Clone the repository:

```
git clone https://github.com/taofiqsulayman/pdf2markdown.git
cd pdf2markdown
```

Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the required packages:
```
pip install -r requirements.txt
```

Install PyTorch and related packages:

For CPU:

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

For GPU (CUDA 11.8):

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

For the PyTorch build that matches your hardware, see https://pytorch.org/get-started/locally/

Install marker-pdf:

```
pip install marker-pdf
```

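Note for Mac users: if you run into issues with pdftext or marker_pdf, install these specific versions instead:

```
pip install pdftext==0.3.7 marker_pdf==0.2.6
```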

Usage

Ensure you're in the project directory and your virtual environment is activated.
Run the Streamlit app:
```
streamlit run app.py
```
Open your web browser and go to the URL provided by Streamlit (usually http://localhost:8501).
Use the file uploader to select one or more files of supported types.
Click the "Process Files" button to convert and view the files.
Explore the processed content in the expandable sections below.

Use Cases

Document Conversion: Quickly convert PDFs and images to markdown format for easy viewing and sharing.
Data Analysis: Upload CSV or Excel files to view their contents in a formatted markdown table.
Text Extraction: Extract and view text content from Word documents.
Batch Processing: Process multiple files of different types in a single operation.
Content Aggregation: Combine content from various file types into a single, easy-to-navigate interface.
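
To illustrate the Data Analysis case, here is a rough sketch of turning CSV text into a markdown table. It uses only the standard library as a stand-in; the application itself lists pandas as a requirement and may build its tables differently. The function name `csv_to_markdown` is hypothetical.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Convert CSV text into a markdown table; the first row is the header."""
    # Skip completely empty lines so they do not become blank table rows.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(cells):
        # Pad or trim each row to the header width and escape pipe characters,
        # which would otherwise be read as column separators.
        cells = (list(cells) + [""] * width)[:width]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

In a Streamlit app, a string like this could be passed to `st.markdown` for display.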

Troubleshooting

If you encounter issues with PDF processing, ensure that the marker command is properly installed and accessible in your system's PATH.
For image processing problems, check that you have the necessary dependencies for image-to-PDF conversion (Pillow and reportlab).
If you're using a GPU and experiencing CUDA errors, make sure you've installed the correct version of PyTorch for your CUDA version.
For Mac users experiencing issues with pdftext or marker_pdf, try the specific versions mentioned in the installation notes.
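
The first and third checks above can be scripted. The sketch below looks for the marker command on PATH and reports whether PyTorch can see a CUDA device. The second command name, `marker_single`, is an assumption about what marker-pdf installs; the README itself only mentions `marker`.

```python
import shutil


def find_marker_cli(names=("marker", "marker_single")):
    """Return {command: resolved path or None} for each candidate on PATH.

    'marker_single' is an assumed command name, not one this README confirms.
    """
    return {name: shutil.which(name) for name in names}


def torch_status() -> str:
    """Describe the installed PyTorch build, if any."""
    try:
        import torch  # imported lazily so the script runs without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return (
            f"PyTorch {torch.__version__}, "
            f"CUDA build {torch.version.cuda}, GPU visible"
        )
    return f"PyTorch {torch.__version__}, no usable CUDA device (CPU mode)"


if __name__ == "__main__":
    for command, path in find_marker_cli().items():
        print(f"{command}: {path or 'not found on PATH'}")
    print(torch_status())
```

Run it inside the activated virtual environment, since both PATH and the installed packages depend on it.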

Contact

For any questions, issues, or suggestions, please open an issue on the GitHub repository or contact the maintainer at sulaymantaofiq@gmail.com.
