An all-in-one converter to make your files LLM-understandable
For example, a sample docx file will be converted to a convenient JSON data structure:
```json
{
"extracted": [
{
"metainfo": {
"doi": "",
"title": "Accessible Document Sample",
"document_type": "Technical Guide",
"search_description": "Accessible Document Sample, headings, lists, images, tables, columns, screen readers, web accessibility",
"abstract": "This document demonstrates accessibility techniques for headings, lists, images, tables, and columns. It includes eight section headings, ordered and unordered lists, links to different locations and downloadable documents, and images with alternate text. The document also features simple and complex tables, including one with merged cells. Additionally, it showcases the use of columns and explains how to create them correctly for accessibility. The document is designed to be completely accessible using assistive technologies such as screen readers.",
"authors": [],
"pub_date": "",
"mentioned_date": "",
"language": "en",
"detected_language": "en",
"file_features": {
"filename": "sample3.docx",
"file": null,
"is_scanned": null
},
"npages": null,
"thumbnail": "<base64 encoded thumbnail>",
"extra_metainfo": null,
"ai_metadata": ""
},
"chunks": [
{
"page": null,
"coordinates": null,
"section": "",
"prefix": "",
"non_embeddable_content": "",
"text": " # Sample Document\nThis document was created using accessibility techniques for headings, lists, image alternate text, tables, and columns. It should\nbe completely accessible using assistive technologies such as screen readers.\n## Headings\nThere are eight section headings in this document. At the beginning, \\\"Sample Document\\\" is a level 1 heading. The main section\nheadings, such as \\\"Headings\\\" and \\\"Lists\\\" are level 2 headings. The Tables section contains two sub-headings, \\\"Simple Table\\\"\nand \\\"Complex Table,\\\" which are both level 3 headings.\n## Lists\nThe following outline of the sections of this document is an ordered (numbered) list with six items. The fifth item, \\\"Tables,\\\"\ncontains a nested unordered (bulleted) list with two items.\n1. Headings\n2. Lists 3. Links\n4. Images\n5. Tables\n- Simple Tables\n- Complex Tables\n6. Columns\n## Links\nIn web documents, links can point different locations on the page, different pages, or even downloadable documents, such as Word\ndocuments or PDFs:\n[Top of this Page](#sample-document)\\\n[Sample Document](http://www.dhs.state.il.us/page.aspx?item=67072)\\\n[Sample Document (docx)](http://www.dhs.state.il.us/OneNetLibrary/27897/documents/Initiatives/IITAA/Sample-Document.docx) ## Images\nFor example, there is an image of the web accessibility symbol to the left of this paragraph. Its alternate text is \\\"Web Access\nSymbol\\\".\nAlt text should communicate what an image means, not how it looks.\ngraphs, require long descriptions, but not all document types allow that. In web pages, long descriptions may b",
"suffix": "e provided in\nseveral ways: on the page below the image, via a link below the image, or via a link o",
"locked": false,
"chunk_type": "text"
},
{
"page": null,
"coordinates": null,
"section": "",
"prefix": "uire long descriptions, but not all document types allow that. In web pages, long descriptions may b",
"non_embeddable_content": "",
"text": "e provided in\nseveral ways: on the page below the image, via a link below the image, or via a link on the image.\n## Tables\n### Simple Tables\nSimple tables have a uniform number of columns and rows, without any merged cells:\n ----------------------------------------------------------------------------------------------------------------------------------\n **Screen Reader** **Responses** **Share**\n ------------------------------------------------------- ------------------------------------------- ------------------------------ JAWS 853 49%\n NVDA 238 14%\n Window-Eyes 214 12%\n System Access 181 10%\n VoiceOver 159 9%\n ----------------------------------------------------------------------------------------------------------------------------------\n### Complex Tables\nThe following is a complex table, using merged cells as headers for sections within the table. This can\\'t be made accessible in\nall types of documents:\n ----------------------------------------------------------------------------------------------------------------------------------",
"suffix": "--\n **May 2012** **September 2010",
"locked": false,
"chunk_type": "text"
},
{
"page": null,
"coordinates": null,
"section": "",
"prefix": "----------------------------------------------------------------------------------------------------",
"non_embeddable_content": "",
"text": "--\n **May 2012** **September 2010** \n ------------------------ -------------------------- -------------------------- -------------------------- -------------------------- **Screen Reader** **Responses** **Share** **Responses** **Share**\n JAWS 853 49% 727 59%\n NVDA 238 14% 105 9%\n Window-Eyes 214 12% 138 11%\n System Access 181 10% 58 5%\n VoiceOver 159 9% 120 10%\n ------------------------------------------------------------------------------------------------------------------------------------\n## Columns\nThis is an example of columns. With columns, the page is split into two or more horizontal sections. Unlike tables, in which you\nusually read across a row and then down to the next, in columns, you read down a column and then across to the next.\\\nWhen columns are not created correctly, screen readers may run lines together, reading the first line of the first column, then\nthe first line of the second column, then the second line of the first column, and so on. Obviously, that is not accessible",
"suffix": "",
"locked": false,
"chunk_type": "text"
}
],
"full_text": "# Sample Document\nThis document was created using accessibility techniques for headings, lists, image alternate text, tables, and columns. It should\nbe completely accessible using assistive technologies such as screen readers.\n## Headings\nThere are eight section headings in this document. At the beginning, \\\"Sample Document\\\" is a level 1 heading. The main section\nheadings, such as \\\"Headings\\\" and \\\"Lists\\\" are level 2 headings. The Tables section contains two sub-headings, \\\"Simple Table\\\"\nand \\\"Complex Table,\\\" which are both level 3 headings.\n## Lists\nThe following outline of the sections of this document is an ordered (numbered) list with six items. The fifth item, \\\"Tables,\\\"\ncontains a nested unordered (bulleted) list with two items.\n1. Headings\n2. Lists\n3. Links\n4. Images\n5. Tables\n- Simple Tables\n- Complex Tables\n6. Columns\n## Links\nIn web documents, links can point different locations on the page, different pages, or even downloadable documents, such as Word\ndocuments or PDFs:\n[Top of this Page](#sample-document)\\\n[Sample Document](http://www.dhs.state.il.us/page.aspx?item=67072)\\\n[Sample Document (docx)](http://www.dhs.state.il.us/OneNetLibrary/27897/documents/Initiatives/IITAA/Sample-Document.docx)\n## Images\nFor example, there is an image of the web accessibility symbol to the left of this paragraph. Its alternate text is \\\"Web Access\nSymbol\\\".\nAlt text should communicate what an image means, not how it looks.\ngraphs, require long descriptions, but not all document types allow that. In web pages, long descriptions may be provided in\nseveral ways: on the page below the image, via a link below the image, or via a link on the image.\n## Tables\n### Simple Tables\nSimple tables have a uniform number of columns and rows, without any merged cells:\n ----------------------------------------------------------------------------------------------------------------------------------\n **Screen Reader** **Responses** **Share**\n ------------------------------------------------------- ------------------------------------------- ------------------------------\n JAWS 853 49%\n NVDA 238 14%\n Window-Eyes 214 12%\n System Access 181 10%\n VoiceOver 159 9%\n ----------------------------------------------------------------------------------------------------------------------------------\n### Complex Tables\nThe following is a complex table, using merged cells as headers for sections within the table. This can\\'t be made accessible in\nall types of documents:\n ------------------------------------------------------------------------------------------------------------------------------------\n **May 2012** **September 2010** \n ------------------------ -------------------------- -------------------------- -------------------------- --------------------------\n **Screen Reader** **Responses** **Share** **Responses** **Share**\n JAWS 853 49% 727 59%\n NVDA 238 14% 105 9%\n Window-Eyes 214 12% 138 11%\n System Access 181 10% 58 5%\n VoiceOver 159 9% 120 10%\n ------------------------------------------------------------------------------------------------------------------------------------\n## Columns\nThis is an example of columns. With columns, the page is split into two or more horizontal sections. 
Unlike tables, in which you\nusually read across a row and then down to the next, in columns, you read down a column and then across to the next.\\\nWhen columns are not created correctly, screen readers may run lines together, reading the first line of the first column, then\nthe first line of the second column, then the second line of the first column, and so on. Obviously, that is not accessible.\n"
}
],
"errors": []
}
```
It works with office documents, scientific articles, technical drawings, images (e.g. scans) and many other file formats. It can be used to convert files to plain text, extract metadata, generate thumbnails and chunk the text into smaller pieces. The library is designed to be extensible, allowing for easy addition of new file formats and processing methods.
While there are multiple solutions for converting various file formats to plain text (e.g. unstructured), all of them lack some of the features identified as "must have" for Quiddity AI:
- Transformation of complex file formats, including tables, PDFs, and scans, into plain text
- Handling of metadata, not just file content (e.g. to make files sortable by date)
- Built-in chunking
- Convenient extending to support more file formats
- Generation of thumbnails
PDFerret satisfies all these requirements, while also being efficient in terms of compute time and LLM token usage.
Due to the numerous dependencies, a containerized installation is highly recommended. Run the pre-built container with:

```bash
docker compose up -d
```

Alternatively, build the container from scratch with:

```bash
docker compose -f docker-compose-build.yml up -d
```

Both compose files will download the required dependencies. The container will be available at `localhost:58080`.
The API provides an endpoint to process multiple document files and extract structured information. There is a single endpoint available:

- `/process_files_by_stream`: allows you to send multiple files in a single request and receive the processed results in a single response.
Additionally, see `localhost:58080/docs` for the Swagger UI, which provides an interactive interface for testing the API.
Below is an example of how to use the `/process_files_by_stream` endpoint:

```http
POST /process_files_by_stream
accept: application/json
```
Supported parameters:

- `vision_model`: The name of the vision model in LLMonkey to use for processing (e.g., `Mistral_Pixtral`).
- `text_model`: The name of the text model in LLMonkey to use for processing (e.g., `Nebius_Llama_3_1_70B_fast`).
- `lang`: The default language for processing (e.g., `en`). Optional.
- `return_images`: Whether to include thumbnails in the response (`true` or `false`) as base64-encoded images. Optional.
- `perfile_settings`: A dictionary of file-specific settings, such as language or additional metadata.
The `perfile_settings` values should match the following Pydantic model:

```python
from typing import Literal

from pydantic import BaseModel


class PerFileSettings(BaseModel):
    lang: Literal["", "en", "de"] = ""
    extra_metainfo: dict[str, str] = {}
```
It allows you to specify the language and additional metadata for each file. Additional metadata can include any important information as key-value pairs; for example, author information, document type, or any other relevant details. It will be processed by the LLM and included in the output if any relevant information is found in this field.
The request body should include:
- Files: A list of document files to process, sent as multipart form data.
- Params: A JSON object containing the parameters described above.
Example `params` object:

```json
{
"vision_model": "Mistral_Pixtral",
"text_model": "Nebius_Llama_3_1_70B_fast",
"lang": "en",
"return_images": true,
"perfile_settings": {
"test_de.doc": {"lang": "de"},
"test.doc": {
"lang": "en",
"extra_metainfo": {"Author information": "John Doe"}
}
}
}
```
See `tests/test_api.py` for an example of usage. Note that the other tests besides `test_api.py` are obsolete.
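As a quick illustration, a client call could look like the sketch below. This is a hedged sketch, not the canonical client: the multipart field names (`files`, `params`) and the JSON serialization of `params` are assumptions, so check `tests/test_api.py` for the authoritative request format.

```python
import json

import requests

# Hypothetical client sketch: the form field names ("files", "params") and the
# JSON encoding of params are assumptions; see tests/test_api.py for the
# authoritative request format.
url = "http://localhost:58080/process_files_by_stream"

params = {
    "vision_model": "Mistral_Pixtral",
    "text_model": "Nebius_Llama_3_1_70B_fast",
    "lang": "en",
    "return_images": True,
    "perfile_settings": {"test.doc": {"lang": "en"}},
}

with open("test.doc", "rb") as f:
    response = requests.post(
        url,
        files=[("files", ("test.doc", f))],
        data={"params": json.dumps(params)},
    )

response.raise_for_status()
result = response.json()
print(len(result["extracted"]), "documents,", len(result["errors"]), "errors")
```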
The response will follow this structure:
```
{
"extracted": [
<list of extracted documents (PDFDoc objects) in the same order as they were sent>
],
"errors": [
<list of errors, if any occurred during processing>
]
}
```
See `src/pdferret/datamodels.py` for the definition of the `PDFDoc` object and other data models used in the library.
Example of a `PDFDoc` object. It will contain the following fields:

```
{
"metainfo": {
"doi": "",
"title": "",
"document_type": "",
"search_description": "",
"abstract": "",
"authors": [],
"pub_date": "",
"mentioned_date": "",
"language": "",
"detected_language": "",
"file_features": {
"filename": "",
"file": null,
"is_scanned": null
},
"npages": null,
"thumbnail": "<base64 encoded thumbnail>",
"extra_metainfo": null,
"ai_metadata": ""
},
"chunks": [
<list of chunks>
],
"full_text": "<full text of the document>"
}
```
The `chunks` field will contain a list of chunks, each with the following fields:

```
{
"page": null, // page number of the chunk
"coordinates": null, // coordinates of the chunk in the document, not implemented yet
"section": "", // section name of the chunk, not implemented yet
"prefix": "", // prefix of the chunk
"non_embeddable_content": "", // non-embeddable content of the chunk e.g. images
"text": "<chunk text>", // the text of the chunk
"suffix": "<chunk suffix>", // the suffix of the chunk
"locked": false, // shows if the chunk can be concatenated with the next chunk, only used under the hood
"chunk_type": "<type of chunk>"
}
```
The `chunk_type` field can be one of the following:

- `text`: Regular text chunk
- `figure`: Image or figure chunk
- `table`: Table chunk
- `equation`: Equation chunk
- `other`: Other type of chunk
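For instance, to reassemble readable passages from a parsed response, you can walk the chunk dicts shown above. A minimal sketch, assuming `result` is the parsed JSON response from the request example earlier:

```python
# Minimal sketch: walk the chunks of the first extracted document.
# Assumes `result` is the parsed JSON response from the request example above.
doc = result["extracted"][0]

for chunk in doc["chunks"]:
    if chunk["chunk_type"] != "text":
        continue  # skip figures, tables, equations, etc.
    # prefix and suffix carry the overlapping context around the chunk text
    passage = chunk["prefix"] + chunk["text"] + chunk["suffix"]
    print(passage[:80])
```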
- To install the package, use `pip install .` in the source folder, which will install the package with all dependencies.
- On minimal Ubuntu systems (e.g. in a Python Docker image), `sudo apt install libgl1` might be needed for OpenCV.
- PDFerret relies on Tika for processing general documents. This requires an up-to-date Tika (tested on `apache/tika:3.0.0.0-BETA2-full`) up and running on `localhost:9998`. You can override the Tika server address by setting the env var `PDFERRET_TIKA_SERVER_URL`. Please note that the Python `tika` package used as a client in this library can download and run its own version of Tika if the server is not found, which can lead to unpredictable results. In this case it might help to set `TIKA_CLIENT_ONLY=1` in the docker-compose file.
The following env variables are supported to configure PDFerret:

- `PDFERRET_GROBID_URL` - sets the URL of GROBID, used by extractors
- `PDFERRET_NPROC` - sets the number of processes used for parallel processing for both metainfo and text extractors
- `PDFERRET_BATCH_SIZE` - sets the batch size for parallel processing, i.e. how many items are processed between fork and join. Must be at least `PDFERRET_NPROC`, but shouldn't have a strong influence on performance otherwise
- `PDFERRET_MAX_PAGES` - all PDFs will be cropped to the first MAX_PAGES pages. WARNING! Currently not implemented
- `PDFERRET_TIKA_SERVER_URL` - address of the Tika server
- `PDFERRET_TIKA_OCR_STRATEGY` - controls how Tika will handle PDFs without text. Must be one of 'AUTO', 'OCR_ONLY', 'NO_OCR', 'OCR_AND_TEXT_EXTRACTION'; defaults to 'NO_OCR'
- `PDFERRET_VISUAL_MAX_PAGES` - sets how many pages will be used for extracting information with the vision model. Defaults to 3.
- `PDFERRET_MAX_CHUNK_LEN` - maximum length of a chunk for the chunking algorithm
- `PDFERRET_CHUNK_OVERLAP` - overlap of chunks for the chunking algorithm

LLMonkey API keys are also required for some extractors; see the llmonkey documentation for more information.
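In a containerized setup these variables are typically set in the compose file. When experimenting with the library directly from Python, an equivalent sketch (assuming the variables are read from the environment when PDFerret initializes) is:

```python
import os

# Illustrative values only; set these before initializing PDFerret,
# assuming the library reads them from the environment at startup.
os.environ["PDFERRET_NPROC"] = "4"
os.environ["PDFERRET_BATCH_SIZE"] = "8"
os.environ["PDFERRET_TIKA_SERVER_URL"] = "http://localhost:9998"
os.environ["PDFERRET_TIKA_OCR_STRATEGY"] = "NO_OCR"
```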
Create a credentials file before building the Docker container using:

```bash
gcloud auth application-default login
```

It will then be mounted into the container.
Probably the most important part to update is the recipes in `pdferret/recipes`. They define how to extract information from different types of documents. Optionally, new processors can be created by subclassing `pdferret.base.BaseProcessor` and implementing the `process_single` method, as sketched below. The `process_single` method will be parallelized depending on the `parallel` attribute of the processor, which can be set to `thread`, `process` or `none`. Alternatively, if different parallelization is needed, the `_process_batch` method can be implemented.
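A minimal sketch of such a processor follows. The class name, the `process_single` argument, and the return value are illustrative assumptions; consult `pdferret/base.py` and the existing processors for the actual signatures.

```python
from pdferret.base import BaseProcessor


class MyFormatProcessor(BaseProcessor):
    """Hypothetical processor for a new file format (names are illustrative)."""

    # Controls how process_single is parallelized: "thread", "process" or "none".
    parallel = "thread"

    def process_single(self, item):
        # Extract text/metadata from a single input item and return the result;
        # the exact input and output types are defined in pdferret's data models.
        raise NotImplementedError
```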
Most of the tests are not yet updated to v2, so they will not work with the current version of the library. However, the tests in `tests/test_api.py` should work. To run them, use `pytest tests/test_api.py`.
The library still contains a lot of unused code from the previous version, including the Grobid and Unstructured extractors. They are not used in the current version and should probably be removed in the future.