OneOffTech Parse client

Parse client is a PHP library for interacting with the OneOffTech Parse service. OneOffTech Parse extracts text from PDF files while preserving the structure of the document, to improve interaction with Large Language Models (LLMs).

OneOffTech Parse is based on the Parxy extractor. The client can also connect to self-hosted instances of Parxy.

Note

The Parse client package is under development and is not ready for production use.

Installation

You can install the package via Composer:

composer require oneofftech/parse-client

Usage

The Parse client can connect to self-hosted instances of the Parxy extractor service or to the cloud-hosted OneOffTech Parse service.

Use with self-hosted instance

A running instance of Parxy is required. Once it is running, instantiate the connector by passing the URL on which the extractor service is listening.

use OneOffTech\Parse\Client\Connectors\ParseConnector;

$client = new ParseConnector(baseUrl: "http://localhost:5000");

/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
$document = $client->parse("https://domain.internal/document.pdf");

Note

The URL of the document must be accessible without authentication.
The document is downloaded only for the duration of processing, and the file is deleted immediately afterwards.

Use the cloud hosted service

Important

The cloud-hosted service is currently in private beta. Drop us a message to request access.

Go to parse.oneofftech.de to obtain an access token. Then instantiate the client with the token and pass the URL of a PDF document.

use OneOffTech\Parse\Client\Connectors\ParseConnector;

$client = new ParseConnector("token");

/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
$document = $client->parse("https://domain.internal/document.pdf");

Note

The URL of the document must be accessible without authentication.
The document is downloaded only for the duration of processing, and the file is deleted immediately afterwards.

Specify the preferred extraction method

The Parse service supports several processors: pymupdf, pdfact, unstructured and llamaparse. You can specify the preferred processor for each request.

use OneOffTech\Parse\Client\ParseOption;
use OneOffTech\Parse\Client\DocumentProcessor;
use OneOffTech\Parse\Client\Connectors\ParseConnector;

$client = new ParseConnector("token");

/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
$document = $client->parse(
    url: "https://domain.internal/document.pdf", 
    options: new ParseOption(DocumentProcessor::PYMUPDF)
);

PDFAct vs PyMuPDF

PDFAct offers more flexibility than PyMuPDF. Evaluate which extraction method best suits your application. The following table compares the two methods.

Feature	PDFAct	PyMuPDF
Text extraction	✅	✅
Pagination	✅	✅
Headings identification	✅	-
Text styles (e.g. bold or italic)	✅	-
Page header	✅	-
Page footer	✅	-
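
As a rough illustration of how the comparison could drive a choice, the sketch below encodes the table as a plain PHP array and returns the processors that cover a set of required features. It does not use the Parse client. The feature keys (`text`, `headings`, and so on) and the `processorsSupporting` helper are hypothetical names introduced only for this example.

```php
<?php

// Feature support transcribed from the comparison table above.
// The feature keys are illustrative labels, not identifiers from the Parse client.
const PROCESSOR_FEATURES = [
    'pdfact' => ['text', 'pagination', 'headings', 'text_styles', 'page_header', 'page_footer'],
    'pymupdf' => ['text', 'pagination'],
];

/**
 * Return the processors that support every required feature.
 *
 * @param  string[]  $required  Feature keys the application needs.
 * @return string[]  Names of the processors covering all of them.
 */
function processorsSupporting(array $required): array
{
    $matches = [];

    foreach (PROCESSOR_FEATURES as $processor => $features) {
        // array_diff returns the required features this processor lacks.
        if (array_diff($required, $features) === []) {
            $matches[] = $processor;
        }
    }

    return $matches;
}

// Example: an application that needs headings in addition to plain text.
print_r(processorsSupporting(['text', 'headings']));
```

If only text and pagination are needed, both processors qualify and the decision can rest on other criteria, such as speed or output quality on your own documents.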

Document structure

Parse is designed to preserve the document's structure, so the content is returned as a hierarchy.

Document
 ├─Page
 │  ├─Text (category: heading)
 │  └─Text (category: body)
 └─Page
    ├─Text (category: heading)
    └─Text (category: body)

For a more in-depth explanation of the structure, see the Parse Document Model.
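
To make the hierarchy concrete, here is a minimal sketch that walks a Document > Page > Text tree and collects the headings. It uses plain arrays as a stand-in and does not call the Parse client. The array keys (`pages`, `texts`, `category`, `content`), the sample content and the `collectByCategory` helper are assumptions made for illustration. The actual `DocumentDto` may expose its data differently, so check the Parse Document Model before relying on any of these names.

```php
<?php

// Hypothetical stand-in for a parsed document: plain arrays that mirror the
// Document > Page > Text hierarchy. The real client returns a DocumentDto,
// whose properties may be named differently.
$document = [
    'pages' => [
        ['texts' => [
            ['category' => 'heading', 'content' => 'Introduction'],
            ['category' => 'body', 'content' => 'First paragraph.'],
        ]],
        ['texts' => [
            ['category' => 'heading', 'content' => 'Methods'],
            ['category' => 'body', 'content' => 'Second paragraph.'],
        ]],
    ],
];

/**
 * Collect all text nodes of a given category, with their 1-based page number.
 */
function collectByCategory(array $document, string $category): array
{
    $found = [];

    foreach ($document['pages'] as $index => $page) {
        foreach ($page['texts'] as $text) {
            if ($text['category'] === $category) {
                $found[] = ['page' => $index + 1, 'content' => $text['content']];
            }
        }
    }

    return $found;
}

$headings = collectByCategory($document, 'heading');

foreach ($headings as $heading) {
    echo "Page {$heading['page']}: {$heading['content']}", PHP_EOL;
}
```

Because headings stay attached to their pages, this kind of traversal can be used to split a document into sections before passing them to an LLM.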

Testing

The Parse client is tested using PEST. Tests run on every commit and pull request.

To execute the test suite, run:

composer test

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Thank you for considering contributing to the Parse client! The contribution guide can be found in the CONTRIBUTING.md file.

Security Vulnerabilities

Please review our security policy on how to report security vulnerabilities.

Credits

Supporters

The project is provided and supported by OneOff-Tech (UG).

License

The MIT License (MIT). Please see License File for more information.
