pjfsu/split-pdf-bookmarks

split-pdf-bookmarks

Extract bookmarks from PDFs and split documents using a companion CSV—served as a FastAPI web service and shipped as a Podman container.

Features

  • Bookmark Export
    Parse PDF outlines and generate per-level CSV files containing start/end page ranges.

  • PDF Splitting
    Accept a CSV with the columns split,name,from,to and split the original PDF into one fragment per selected row.

  • Containerized API
    A lightweight container runs the FastAPI app on port 8080.

  • CLI Client (Bash)
    Includes a script to upload PDFs, retrieve ZIPs, and save results to a dedicated per-file directory.

Project Layout

split-pdf-bookmarks/
├── app
│   ├── __init__.py
│   ├── main.py
│   ├── routers
│   │   ├── bookmarks.py
│   │   ├── split.py
│   │   └── utils.py
│   └── services
│       ├── bookmarks.py
│       ├── exceptions.py
│       ├── split.py
│       └── utils.py
├── LICENSE
├── podman
│   ├── app
│   │   ├── Containerfile
│   │   ├── entrypoint.sh
│   │   └── requirements.txt
│   └── tests
│       ├── Containerfile
│       └── requirements.txt
├── README.md
├── split-pdf-bookmarks.sh
└── tests
    ├── conftest.py
    ├── __init__.py
    └── test_bookmarks_zip.py

Example Workflow

0. Create a symlink

ln -s "$(realpath split-pdf-bookmarks.sh)" ~/.local/bin/split-pdf-bookmarks

1. Start the container

podman run -d -p 8080:8080 docker.io/pjfsu/split-pdf-bookmarks:latest

To use a different host port, change the first number in -p; the container port stays 8080.

2. Export bookmarks

split-pdf-bookmarks "Effective DevOps.pdf"
ls -1 "Effective DevOps"/
bookmarks.zip

3. Unzip bookmarks

unzip "Effective DevOps"/bookmarks.zip -d "Effective DevOps"/
ls -1 "Effective DevOps"/
bookmarks_level_0.csv
bookmarks_level_1.csv
bookmarks_level_2.csv
bookmarks_level_3.csv
bookmarks.zip

4. Edit CSV to select bookmarks

Set "split" to "y" for the entries you want to extract:

vim "Effective DevOps"/bookmarks_level_1.csv
"split","name","from","to"
"n","Introducing Effective Devops",22,22
...
"y","Chapter 1. The Big Picture",33,42
"y","Chapter 2. What Is Devops?",43,48
...
"n","Chapter 20. Further Resources",387,392
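
If you would rather not open an editor, the same selection can be scripted. The following is a minimal sketch and not part of the repository: mark_rows is a hypothetical helper, and it assumes that exported rows start with a quoted "n" in the first column, as in the listing above. The pattern is pasted into a sed expression, so it must not contain a "/".

```shell
# Hypothetical helper (not shipped with the repository).
# Reads a bookmarks CSV on stdin and flips the leading "n" to "y" on every
# row whose name field starts with the given extended regex ($1).
mark_rows() {
  sed -E "s/^\"n\",(\"$1)/\"y\",\1/"
}

# Demo on rows taken from the example above: chapters 1 and 2 are selected,
# the header and chapter 20 pass through unchanged.
mark_rows 'Chapter [12]\. ' <<'EOF'
"split","name","from","to"
"n","Chapter 1. The Big Picture",33,42
"n","Chapter 2. What Is Devops?",43,48
"n","Chapter 20. Further Resources",387,392
EOF
```

In practice you would redirect a real file through it, for example mark_rows 'Chapter [12]\. ' < "Effective DevOps"/bookmarks_level_1.csv > selected.csv (selected.csv is an arbitrary name), and pass the result to the split step.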

5. Split bookmarks

split-pdf-bookmarks "Effective DevOps.pdf" "Effective DevOps/bookmarks_level_1.csv"
ls -1 "Effective DevOps"/*zip
'Effective DevOps/bookmarks.zip'
'Effective DevOps/pdfs.zip'

6. Unzip PDFs

unzip "Effective DevOps"/pdfs.zip -d "Effective DevOps"
ls -1 "Effective DevOps"/Chapter*
'Effective DevOps/Chapter 1. The Big Picture.pdf'
'Effective DevOps/Chapter 2. What Is Devops.pdf'

API Reference

/api/bookmarks/zip

POST a pdf; returns a ZIP with one bookmarks CSV per outline level.

/api/split

POST a pdf and a csvfile; returns a ZIP of PDF fragments.
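
The two endpoints can also be called directly with curl. This is a sketch under assumptions: the multipart field names pdf and csvfile are guessed from the wording above, the base URL assumes the container from step 1 on localhost:8080, and the function names and the API_URL default are illustrative. Check the service's request schema before relying on it.

```shell
# Base URL of the running service; the localhost default is an assumption.
API_URL="${API_URL:-http://localhost:8080}"

# Upload a PDF and save the ZIP of per-level bookmark CSVs.
# $1 = input PDF, $2 = output ZIP path
# Assumption: the upload field is called "pdf".
export_bookmarks() {
  curl --fail --silent --show-error \
    -F "pdf=@$1" \
    -o "$2" \
    "$API_URL/api/bookmarks/zip"
}

# Upload a PDF plus an edited CSV and save the ZIP of PDF fragments.
# $1 = input PDF, $2 = edited CSV, $3 = output ZIP path
# Assumption: the upload fields are called "pdf" and "csvfile".
split_pdf() {
  curl --fail --silent --show-error \
    -F "pdf=@$1" \
    -F "csvfile=@$2" \
    -o "$3" \
    "$API_URL/api/split"
}
```

For example, export_bookmarks book.pdf bookmarks.zip followed by split_pdf book.pdf bookmarks_level_1.csv pdfs.zip. If you mapped a different host port, set API_URL accordingly before calling either function.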

CSV Format for Splitting

split,name,from,to
  • split: "y" to use the row, "n" to skip it
  • name: filename for the generated PDF
  • from, to: start/end page (inclusive)
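
Because both ends of the range are inclusive, a row with from=33 and to=42 covers 10 pages. A small pre-flight check can make that visible before uploading. This is a hypothetical client-side sketch (check_split_csv is not part of the repository, and the service may validate differently); it assumes from and to are always the last two fields of a row.

```shell
# Hypothetical pre-flight check (not part of the repository).
# Reads a split CSV on stdin, reports the page span of each selected row,
# and exits non-zero if a selected row has "to" smaller than "from".
check_split_csv() {
  awk -F',' '
    NR == 1 { next }                      # skip the header row
    {
      flag = $1
      gsub(/"/, "", flag)                 # accept both y and "y"
      if (flag != "y") next               # only selected rows matter
      from = $(NF - 1) + 0                # from/to are the last two fields,
      to   = $NF + 0                      # so commas inside the name are harmless
      if (to < from) {
        printf "line %d: invalid range %s-%s\n", NR, $(NF - 1), $NF
        bad = 1
        next
      }
      # Both ends are inclusive, hence the +1.
      printf "line %d: pages %d-%d (%d pages)\n", NR, from, to, to - from + 1
    }
    END { exit bad + 0 }
  '
}

# Demo with rows from the workflow example.
check_split_csv <<'EOF'
"split","name","from","to"
"n","Introducing Effective Devops",22,22
"y","Chapter 1. The Big Picture",33,42
"y","Chapter 2. What Is Devops?",43,48
EOF
# prints:
# line 3: pages 33-42 (10 pages)
# line 4: pages 43-48 (6 pages)
```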

Bash Client

Use split-pdf-bookmarks.sh to send requests locally:

  • Automatically detects the running container
  • Determines the endpoint based on the arguments
  • Creates a dedicated output folder named after the input PDF

Usage:

./split-pdf-bookmarks.sh book.pdf               # Export bookmarks
./split-pdf-bookmarks.sh book.pdf bookmarks.csv # Split PDF
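
The endpoint choice and the output folder naming described above could look roughly like this. It is an illustrative sketch, not the actual contents of split-pdf-bookmarks.sh: both function names are hypothetical, and it assumes the endpoint depends only on the number of arguments and that the folder name is the PDF file name without its .pdf suffix.

```shell
# Hypothetical sketch of the client's dispatch logic (not the real script).

# One argument (a PDF) means "export bookmarks"; two (PDF + CSV) mean "split".
endpoint_for() {
  case "$#" in
    1) echo "/api/bookmarks/zip" ;;
    2) echo "/api/split" ;;
    *) echo "usage: split-pdf-bookmarks.sh FILE.pdf [FILE.csv]" >&2; return 2 ;;
  esac
}

# The output folder is named after the input PDF, minus the .pdf suffix.
output_dir_for() {
  basename -- "$1" .pdf
}

endpoint_for "Effective DevOps.pdf"      # prints /api/bookmarks/zip
output_dir_for "Effective DevOps.pdf"    # prints Effective DevOps
```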

Tests

podman network create testnet
podman build -t split-pdf-bookmarks-tests -f ./podman/tests/Containerfile .
podman run -d --rm --network testnet -p 8080:8080 --name split-pdf-bookmarks docker.io/pjfsu/split-pdf-bookmarks:latest
podman run --rm --network testnet -e API_URL=http://split-pdf-bookmarks:8080 split-pdf-bookmarks-tests:latest

License

GPLv3 License. See LICENSE for terms.

Future Ideas

  • Web UI front-end for preview and interaction

EOR (End Of Repository)

I hope this program is useful to you. Thank you very much for visiting this repository!