Info: | See the tutorials & documentation for more information. |
---|---|
Author & Maintainer: | Maksym Polshcha <maxp@sterch.net> |
See GitHub for the latest source.
- pdfreader is a Pythonic API for:
  - extracting text, images, and other data from PDF documents (plain or protected)
  - accessing different objects within PDF documents
- pdfreader is NOT a tool (though maybe one day it will become one!):
  - to create or update PDF files
  - to split PDF files into pages or other pieces
  - to convert PDFs to any other format
Nevertheless, it can be used as part of such tools.
See Tutorials & Documentation.
- Extracts text (plain text and formatted text objects)
- Extracts PDF form data (pure strings and formatted text objects)
- Supports all PDF encodings, CMaps, and predefined CMaps
- Extracts images and image masks as Pillow/PIL Images
- Supports encrypted and password-protected PDF documents
- Lets you browse any document objects and resources, and extract any data you need (fonts, annotations, metadata, multimedia, etc.)
- Follows the PDF-1.7 specification
- Lazy object access makes it possible to process huge PDF documents quite fast
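As a rough illustration of the text extraction feature listed above, here is a minimal sketch. It assumes that `SimplePDFViewer` accepts an open binary file (plus an optional `password` keyword for protected documents), that iterating over the viewer yields one rendered canvas per page, and that each canvas exposes its decoded text as `canvas.strings`. Please check these against the tutorials before relying on them. The helper names `join_page_strings` and `extract_text_by_page` are hypothetical and not part of pdfreader.

```python
def join_page_strings(strings):
    """Concatenate the decoded strings of one page into a single text."""
    return "".join(strings)


def extract_text_by_page(pdf_path, password=None):
    """Return a list with one text string per page of the PDF at pdf_path."""
    # Imported lazily so this sketch loads even where pdfreader is not installed.
    from pdfreader import SimplePDFViewer

    # Pass the password only for protected documents (assumed keyword argument).
    kwargs = {"password": password} if password else {}
    texts = []
    with open(pdf_path, "rb") as fd:
        viewer = SimplePDFViewer(fd, **kwargs)
        # Assumption: iterating the viewer renders each page and yields its canvas.
        for canvas in viewer:
            texts.append(join_page_strings(canvas.strings))
    return texts
```

With a local file (here the hypothetical `example.pdf`), `extract_text_by_page("example.pdf")` would return one string per page. Joining with an empty separator keeps the strings exactly as they were decoded; depending on the document, you may prefer a space or a newline between them.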
pdfreader can be installed with pip:
$ python -m pip install pdfreader
Or with easy_install from setuptools:
$ python -m easy_install pdfreader
You can also download the project source and do:
$ python setup.py install
Tutorial, real-life examples and documentation
pdfreader uses GitHub issues to keep track of bugs, feature requests, etc.
- Document management - Portable document format - PDF 1.7
- Adobe CMap and CIDFont Files Specification
- PostScript Language Reference Manual
- Adobe CMap resources
- Adobe glyph list specification (AGL)
If this project is helpful, you can treat me to coffee :-)