Merge pull request #65 from enoch3712/64-new-readme
new read me with new examples
enoch3712 authored Nov 13, 2024
2 parents 600aaad + 57c4781 commit 8415907
# ExtractThinker

ExtractThinker is a flexible document intelligence tool that leverages Large Language Models (LLMs) to extract and classify structured data from documents, functioning like an ORM for seamless document processing workflows.

**TL;DR Document Intelligence for LLMs**

## 🚀 Key Features

- **Flexible Document Loaders**: Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and more.
- **Customizable Contracts**: Define custom extraction contracts using Pydantic models for precise data extraction.
- **Advanced Classification**: Classify documents or document sections using custom classifications and strategies.
- **Asynchronous Processing**: Utilize asynchronous processing for efficient handling of large documents.
- **Multi-format Support**: Seamlessly work with various document formats like PDFs, images, spreadsheets, and more.
- **ORM-style Interaction**: Interact with documents and LLMs in an ORM-like fashion for intuitive development.
- **Splitting Strategies**: Implement lazy or eager splitting strategies to process documents page by page or as a whole.
- **Integration with LLMs**: Easily integrate with different LLM providers like OpenAI, Anthropic, Cohere, and more.
- **Community-driven Development**: Inspired by the LangChain ecosystem with a focus on intelligent document processing.
![image](https://github.com/user-attachments/assets/844b425c-0bb7-4abc-9d08-96e4a736d096)

## 📦 Installation

Install ExtractThinker using pip:

```bash
pip install extract_thinker
```

## 🛠️ Usage

### Basic Extraction Example

Here's a quick example to get you started with ExtractThinker. This example demonstrates how to load a PDF using the PyPdf document loader and extract specific fields defined in a contract.

```python
import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Path to the document you want to process
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model

# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)

print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
```

### Classification Example

ExtractThinker allows you to classify documents or parts of documents using custom classifications:

```python
import os
from dotenv import load_dotenv
from extract_thinker import (
    Extractor, Classification, Process, ClassificationStrategy,
    DocumentLoaderPyPdf, Contract
)

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

class DriverLicenseContract(Contract):
    name: str
    license_number: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

# Classify the document directly using the extractor
result = extractor.classify(
    "path_to_your_document.pdf",  # Can be a file path or IO stream
    classifications,
    image=True,  # Set to True for image-based classification
)

# The result is a ClassificationResponse object with 'name' and 'confidence' fields
print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")
```
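When several models (or several runs) disagree, a classification strategy reduces to combining the individual votes. Here's a toy majority-vote sketch in plain Python — illustrative only, not the library's `ClassificationStrategy` implementation:

```python
from collections import Counter

def consensus(labels):
    # Majority label among classifier runs, with the vote share as a
    # rough confidence value.
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

label, confidence = consensus(["Invoice", "Invoice", "Driver License"])
print(label, round(confidence, 2))  # Invoice 0.67
```

The vote share maps naturally onto a confidence value: unanimous agreement yields 1.0, a bare majority yields correspondingly less.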

### Splitting Files Example

ExtractThinker allows you to split and process documents using different strategies. Here's how you can split a document and extract data based on classifications.

```python
from dotenv import load_dotenv
from extract_thinker import (
    Extractor,
    Process,
    Classification,
    ImageSplitter,
    DocumentLoaderPyPdf,
    Contract,
    SplittingStrategy,
)

load_dotenv()

class DriverLicenseContract(Contract):
    name: str
    license_number: str

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
]

# Initialize the process and load the splitter
process = Process()
process.load_document_loader(DocumentLoaderPyPdf())
process.load_splitter(ImageSplitter(model="gpt-4o-mini"))

# Load and process the document
path_to_document = "path_to_your_multipage_document.pdf"
split_content = (
    process.load_file(path_to_document)
    .split(classifications, strategy=SplittingStrategy.LAZY)
    .extract()
)

# Process the extracted content as needed
for item in split_content:
    if isinstance(item, InvoiceContract):
        print("Extracted Invoice:")
        print("Invoice Number:", item.invoice_number)
        print("Invoice Date:", item.invoice_date)
    elif isinstance(item, DriverLicenseContract):
        print("Extracted Driver License:")
        print("Name:", item.name)
        print("License Number:", item.license_number)
```

### Batch Processing Example

# Process the split_content as needed
You can also perform batch processing of documents:

```python
import asyncio
from extract_thinker import Extractor, Contract

class ReceiptContract(Contract):
    store_name: str
    total_amount: float

extractor = Extractor()
extractor.load_llm("gpt-4o-mini")

# A single file path or stream (a list of sources also works)
document = "receipt1.jpg"

batch_job = extractor.extract_batch(
    source=document,
    response_model=ReceiptContract,
    vision=True,
)

async def main():
    # Monitor the batch job status
    print("Batch Job Status:", await batch_job.get_status())

    # Retrieve results once processing is complete
    results = await batch_job.get_result()
    for result in results.parsed_results:
        print("Store Name:", result.store_name)
        print("Total Amount:", result.total_amount)

asyncio.run(main())
```
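Because `get_status()` and `get_result()` are awaitable, a caller typically polls until the job finishes. The pattern looks like this, sketched with a plain-`asyncio` stand-in rather than the library's batch job object:

```python
import asyncio

class ToyBatchJob:
    """Stand-in job whose status flips to 'completed' after a few polls."""
    def __init__(self, ticks_to_finish=3):
        self._ticks = 0
        self._ticks_to_finish = ticks_to_finish

    async def get_status(self):
        self._ticks += 1
        return "completed" if self._ticks >= self._ticks_to_finish else "processing"

async def wait_until_done(job, interval=0.01):
    # Poll the job, backing off between checks, until it reports completion.
    while (status := await job.get_status()) != "completed":
        await asyncio.sleep(interval)
    return status

print(asyncio.run(wait_until_done(ToyBatchJob())))  # completed
```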

### Local LLM Integration Example

ExtractThinker supports custom LLM integrations. Here's how you can use a custom LLM:

```python
import os
from extract_thinker import Extractor, LLM, DocumentLoaderTesseract, Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))

# Load a custom LLM (e.g., Ollama)
llm = LLM("ollama/phi3", api_base="http://localhost:11434")
extractor.load_llm(llm)

# Extract data
result = extractor.extract("invoice.png", InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
```

## 📚 Documentation and Resources

- **Examples**: Check out the examples directory for Jupyter notebooks and scripts demonstrating various use cases.
- **Medium Articles**: Read articles about ExtractThinker on the author's Medium page.
- **Test Suite**: Explore the test suite in the tests/ directory for more advanced usage examples and test cases.

## 🧩 Integration with LLM Providers

ExtractThinker supports integration with multiple LLM providers:

- **OpenAI**: Use models like gpt-3.5-turbo, gpt-4, etc.
- **Anthropic**: Integrate with Claude models.
- **Cohere**: Utilize Cohere's language models.
- **Azure OpenAI**: Connect with Azure's OpenAI services.
- **Local Models**: Ollama compatible models.
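In practice you pick a provider simply by the model-name string passed to `load_llm` (e.g. `"gpt-4o-mini"`, `"claude-3-haiku-20240307"`, `"ollama/phi3"`). Conceptually, that amounts to matching a name prefix to a provider — a toy illustration with made-up prefix rules, not the library's actual dispatch logic:

```python
def provider_for(model: str) -> str:
    # Illustrative prefix table only; real routing happens inside the LLM layer.
    prefixes = {
        "gpt-": "OpenAI",
        "claude-": "Anthropic",
        "command": "Cohere",
        "ollama/": "Local (Ollama)",
    }
    for prefix, provider in prefixes.items():
        if model.startswith(prefix):
            return provider
    return "unknown"

print(provider_for("gpt-4o-mini"))   # OpenAI
print(provider_for("ollama/phi3"))   # Local (Ollama)
```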

## ⚙️ How It Works

ExtractThinker uses a modular architecture inspired by the LangChain ecosystem:

- **Document Loaders**: Responsible for loading and preprocessing documents from various sources and formats.
- **Extractors**: Orchestrate the interaction between the document loaders and LLMs to extract structured data.
- **Splitters**: Implement strategies to split documents into manageable chunks for processing.
- **Contracts**: Define the expected structure of the extracted data using Pydantic models.
- **Classifications**: Classify documents or document sections to apply appropriate extraction contracts.
- **Processes**: Manage the workflow of loading, classifying, splitting, and extracting data from documents.
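How these components fit together can be sketched with plain-Python stand-ins — a conceptual toy, not the library's actual classes:

```python
from dataclasses import dataclass, fields

# Conceptual sketch: a contract declares the fields you want, a loader
# turns a source into text, and an extractor fills the contract.

@dataclass
class InvoiceContract:
    invoice_number: str
    invoice_date: str

class EchoLoader:
    def load(self, source: str) -> str:
        # A real document loader would OCR or parse a file here.
        return source

class ToyExtractor:
    def __init__(self, loader):
        self.loader = loader

    def extract(self, source, contract):
        text = self.loader.load(source)
        # A real extractor prompts an LLM with the contract schema;
        # this toy just parses "key: value" lines.
        parsed = dict(line.split(": ", 1) for line in text.splitlines())
        return contract(**{f.name: parsed[f.name] for f in fields(contract)})

extractor = ToyExtractor(EchoLoader())
result = extractor.extract(
    "invoice_number: INV-001\ninvoice_date: 2024-11-13",
    InvoiceContract,
)
print(result.invoice_number)  # INV-001
```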

![image](https://github.com/user-attachments/assets/b12ba937-20a8-47da-a778-c126bc1748b3)

## 📝 Why Use ExtractThinker?

While general frameworks like LangChain offer a broad range of functionalities, ExtractThinker is specialized for Intelligent Document Processing (IDP). It simplifies the complexities associated with IDP by providing:

- **Specialized Components**: Tailored tools for document loading, splitting, and extraction.
- **High Accuracy with LLMs**: Leverages the power of LLMs to improve the accuracy of data extraction and classification.
- **Ease of Use**: Intuitive APIs and ORM-style interactions reduce the learning curve.
- **Community Support**: Active development and support from the community.

## 🤝 Contributing

We welcome contributions from the community! To contribute:

1. Fork the repository
2. Create a new branch for your feature or bugfix
3. Write tests for your changes
4. Run tests to ensure everything is working correctly
5. Submit a pull request with a description of your changes

## 🌟 Community and Support

Stay updated and connect with the community:

- [Claude 3.5 — The King of Document Intelligence](https://medium.com/gitconnected/claude-3-5-the-king-of-document-intelligence-f57bea1d209d?sk=124c5abb30c0e7f04313c5e20e79c2d1)
- [Classification Tree for LLMs](https://medium.com/gitconnected/classification-tree-for-llms-32b69015c5e0?sk=8a258cf74fe3483e68ab164e6b3aaf4c)
- [Advanced Document Classification with LLMs](https://medium.com/gitconnected/advanced-document-classification-with-llms-8801eaee3c58?sk=f5a22ee72022eb70e112e3e2d1608e79)
- [Phi-3 and Azure: PDF Data Extraction | ExtractThinker](https://medium.com/towards-artificial-intelligence/phi-3-and-azure-pdf-data-extraction-extractthinker-cb490a095adb?sk=7be7e625b8f9932768442f87dd0ebcec)
- [ExtractThinker: Document Intelligence for LLMs](https://medium.com/towards-artificial-intelligence/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef)

- [GitHub Issues](https://github.com/enoch3712/Open-DocLLM/issues)

## 📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for more details.

## Contact

For any questions or issues, please open an issue on the GitHub repository or reach out via email.
