Merge pull request #65 from enoch3712/64-new-readme
new read me with new examples
enoch3712 authored Nov 13, 2024
2 parents 600aaad + 57c4781 commit 8415907
# ExtractThinker

ExtractThinker is a flexible document intelligence tool that leverages Large Language Models (LLMs) to extract and classify structured data from documents, functioning like an ORM for seamless document processing workflows.

**TL;DR Document Intelligence for LLMs**

## 🚀 Key Features

- **Flexible Document Loaders**: Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and more.
- **Customizable Contracts**: Define custom extraction contracts using Pydantic models for precise data extraction.
- **Advanced Classification**: Classify documents or document sections using custom classifications and strategies.
- **Asynchronous Processing**: Utilize asynchronous processing for efficient handling of large documents.
- **Multi-format Support**: Seamlessly work with various document formats like PDFs, images, spreadsheets, and more.
- **ORM-style Interaction**: Interact with documents and LLMs in an ORM-like fashion for intuitive development.
- **Splitting Strategies**: Implement lazy or eager splitting strategies to process documents page by page or as a whole.
- **Integration with LLMs**: Easily integrate with different LLM providers like OpenAI, Anthropic, Cohere, and more.
- **Community-driven Development**: Inspired by the LangChain ecosystem with a focus on intelligent document processing.
![image](https://github.com/user-attachments/assets/844b425c-0bb7-4abc-9d08-96e4a736d096)

## 📦 Installation

Install ExtractThinker using pip:

```bash
pip install extract_thinker
```

## 🛠️ Usage

### Basic Extraction Example

Here's a quick example to get you started with ExtractThinker. This example demonstrates how to load a PDF using the PyPdf document loader and extract specific fields defined in a contract.

```python
import os
from dotenv import load_dotenv
from extract_thinker import Extractor, DocumentLoaderPyPdf, Contract

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Path to the document you want to process
test_file_path = os.path.join("path_to_your_files", "invoice.pdf")

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")  # or any other supported model

# Extract data from the document
result = extractor.extract(test_file_path, InvoiceContract)

print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
```

### Classification Example

ExtractThinker allows you to classify documents or parts of documents using custom classifications:

```python
import os
from dotenv import load_dotenv
from extract_thinker import (
    Extractor, Classification, Process, ClassificationStrategy,
    DocumentLoaderPyPdf, Contract
)

load_dotenv()

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

class DriverLicenseContract(Contract):
    name: str
    license_number: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
]

# Classify the document directly using the extractor
result = extractor.classify(
    "path_to_your_document.pdf",  # Can be a file path or IO stream
    classifications,
    image=True,  # Set to True for image-based classification
)

# The result is a ClassificationResponse object with 'name' and 'confidence' fields
print(f"Document classified as: {result.name}")
print(f"Confidence level: {result.confidence}")
```
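When several models (or several runs) disagree, a classification strategy reduces to combining the individual votes. Here's a toy majority-vote sketch in plain Python — illustrative only, not the library's `ClassificationStrategy` implementation:

```python
from collections import Counter

def consensus(labels):
    # Majority label among classifier runs, with the vote share as a
    # rough confidence value.
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

label, confidence = consensus(["Invoice", "Invoice", "Driver License"])
print(label, round(confidence, 2))  # Invoice 0.67
```

The vote share maps naturally onto a confidence value: unanimous agreement yields 1.0, a bare majority yields correspondingly less.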

### Splitting Files Example

ExtractThinker allows you to split and process documents using different strategies. Here's how you can split a document and extract data based on classifications.

```python
from dotenv import load_dotenv
from extract_thinker import (
    Extractor,
    Process,
    Classification,
    ImageSplitter,
    DocumentLoaderPyPdf,
    Contract,
    SplittingStrategy,
)

load_dotenv()

class DriverLicenseContract(Contract):
    name: str
    license_number: str

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor and load the document loader
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o-mini")

# Define classifications
classifications = [
    Classification(
        name="Driver License",
        description="A driver's license document",
        contract=DriverLicenseContract,
        extractor=extractor,
    ),
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract,
        extractor=extractor,
    ),
]

# Initialize the process and load the splitter
process = Process()
process.load_document_loader(DocumentLoaderPyPdf())
process.load_splitter(ImageSplitter(model="gpt-4o-mini"))

# Load and process the document
path_to_document = "path_to_your_multipage_document.pdf"
split_content = (
    process.load_file(path_to_document)
    .split(classifications, strategy=SplittingStrategy.LAZY)
    .extract()
)

# Process the extracted content as needed
for item in split_content:
    if isinstance(item, InvoiceContract):
        print("Extracted Invoice:")
        print("Invoice Number:", item.invoice_number)
        print("Invoice Date:", item.invoice_date)
    elif isinstance(item, DriverLicenseContract):
        print("Extracted Driver License:")
        print("Name:", item.name)
        print("License Number:", item.license_number)
```

### Batch Processing Example

# Process the split_content as needed
You can also perform batch processing of documents:

```python
import asyncio
from extract_thinker import Extractor, Contract

class ReceiptContract(Contract):
    store_name: str
    total_amount: float

extractor = Extractor()
extractor.load_llm("gpt-4o-mini")

# A single file path or stream (a list of sources also works)
document = "receipt1.jpg"

batch_job = extractor.extract_batch(
    source=document,
    response_model=ReceiptContract,
    vision=True,
)

async def main():
    # Monitor the batch job status
    print("Batch Job Status:", await batch_job.get_status())

    # Retrieve results once processing is complete
    results = await batch_job.get_result()
    for result in results.parsed_results:
        print("Store Name:", result.store_name)
        print("Total Amount:", result.total_amount)

asyncio.run(main())
```
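Because `get_status()` and `get_result()` are awaitable, a caller typically polls until the job finishes. The pattern looks like this, sketched with a plain-`asyncio` stand-in rather than the library's batch job object:

```python
import asyncio

class ToyBatchJob:
    """Stand-in job whose status flips to 'completed' after a few polls."""
    def __init__(self, ticks_to_finish=3):
        self._ticks = 0
        self._ticks_to_finish = ticks_to_finish

    async def get_status(self):
        self._ticks += 1
        return "completed" if self._ticks >= self._ticks_to_finish else "processing"

async def wait_until_done(job, interval=0.01):
    # Poll the job, backing off between checks, until it reports completion.
    while (status := await job.get_status()) != "completed":
        await asyncio.sleep(interval)
    return status

print(asyncio.run(wait_until_done(ToyBatchJob())))  # completed
```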

### Local LLM Integration Example

ExtractThinker supports custom LLM integrations. Here's how you can use a custom LLM:

```python
import os
from extract_thinker import Extractor, LLM, DocumentLoaderTesseract, Contract

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

# Initialize the extractor
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))

# Load a custom LLM (e.g., Ollama)
llm = LLM("ollama/phi3", api_base="http://localhost:11434")
extractor.load_llm(llm)

# Extract data
result = extractor.extract("invoice.png", InvoiceContract)
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
```

## 📚 Documentation and Resources

- **Examples**: Check out the examples directory for Jupyter notebooks and scripts demonstrating various use cases.
- **Medium Articles**: Read articles about ExtractThinker on the author's Medium page.
- **Test Suite**: Explore the test suite in the tests/ directory for more advanced usage examples and test cases.

## 🧩 Integration with LLM Providers

ExtractThinker supports integration with multiple LLM providers:

- **OpenAI**: Use models like gpt-3.5-turbo, gpt-4, etc.
- **Anthropic**: Integrate with Claude models.
- **Cohere**: Utilize Cohere's language models.
- **Azure OpenAI**: Connect with Azure's OpenAI services.
- **Local Models**: Ollama compatible models.
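In practice you pick a provider simply by the model-name string passed to `load_llm` (e.g. `"gpt-4o-mini"`, `"claude-3-haiku-20240307"`, `"ollama/phi3"`). Conceptually, that amounts to matching a name prefix to a provider — a toy illustration with made-up prefix rules, not the library's actual dispatch logic:

```python
def provider_for(model: str) -> str:
    # Illustrative prefix table only; real routing happens inside the LLM layer.
    prefixes = {
        "gpt-": "OpenAI",
        "claude-": "Anthropic",
        "command": "Cohere",
        "ollama/": "Local (Ollama)",
    }
    for prefix, provider in prefixes.items():
        if model.startswith(prefix):
            return provider
    return "unknown"

print(provider_for("gpt-4o-mini"))   # OpenAI
print(provider_for("ollama/phi3"))   # Local (Ollama)
```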

## ⚙️ How It Works

ExtractThinker uses a modular architecture inspired by the LangChain ecosystem:

- **Document Loaders**: Responsible for loading and preprocessing documents from various sources and formats.
- **Extractors**: Orchestrate the interaction between the document loaders and LLMs to extract structured data.
- **Splitters**: Implement strategies to split documents into manageable chunks for processing.
- **Contracts**: Define the expected structure of the extracted data using Pydantic models.
- **Classifications**: Classify documents or document sections to apply appropriate extraction contracts.
- **Processes**: Manage the workflow of loading, classifying, splitting, and extracting data from documents.
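How these components fit together can be sketched with plain-Python stand-ins — a conceptual toy, not the library's actual classes:

```python
from dataclasses import dataclass, fields

# Conceptual sketch: a contract declares the fields you want, a loader
# turns a source into text, and an extractor fills the contract.

@dataclass
class InvoiceContract:
    invoice_number: str
    invoice_date: str

class EchoLoader:
    def load(self, source: str) -> str:
        # A real document loader would OCR or parse a file here.
        return source

class ToyExtractor:
    def __init__(self, loader):
        self.loader = loader

    def extract(self, source, contract):
        text = self.loader.load(source)
        # A real extractor prompts an LLM with the contract schema;
        # this toy just parses "key: value" lines.
        parsed = dict(line.split(": ", 1) for line in text.splitlines())
        return contract(**{f.name: parsed[f.name] for f in fields(contract)})

extractor = ToyExtractor(EchoLoader())
result = extractor.extract(
    "invoice_number: INV-001\ninvoice_date: 2024-11-13",
    InvoiceContract,
)
print(result.invoice_number)  # INV-001
```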

![image](https://github.com/user-attachments/assets/b12ba937-20a8-47da-a778-c126bc1748b3)

## 📝 Why Use ExtractThinker?

While general frameworks like LangChain offer a broad range of functionalities, ExtractThinker is specialized for Intelligent Document Processing (IDP). It simplifies the complexities associated with IDP by providing:

- **Specialized Components**: Tailored tools for document loading, splitting, and extraction.
- **High Accuracy with LLMs**: Leverages the power of LLMs to improve the accuracy of data extraction and classification.
- **Ease of Use**: Intuitive APIs and ORM-style interactions reduce the learning curve.
- **Community Support**: Active development and support from the community.

## 🤝 Contributing

We welcome contributions from the community! To contribute:

1. Fork the repository
2. Create a new branch for your feature or bugfix
3. Write tests for your changes
4. Run tests to ensure everything is working correctly
5. Submit a pull request with a description of your changes

## 🌟 Community and Support

Stay updated and connect with the community:

- [Claude 3.5 — The King of Document Intelligence](https://medium.com/gitconnected/claude-3-5-the-king-of-document-intelligence-f57bea1d209d?sk=124c5abb30c0e7f04313c5e20e79c2d1)
- [Classification Tree for LLMs](https://medium.com/gitconnected/classification-tree-for-llms-32b69015c5e0?sk=8a258cf74fe3483e68ab164e6b3aaf4c)
- [Advanced Document Classification with LLMs](https://medium.com/gitconnected/advanced-document-classification-with-llms-8801eaee3c58?sk=f5a22ee72022eb70e112e3e2d1608e79)
- [Phi-3 and Azure: PDF Data Extraction | ExtractThinker](https://medium.com/towards-artificial-intelligence/phi-3-and-azure-pdf-data-extraction-extractthinker-cb490a095adb?sk=7be7e625b8f9932768442f87dd0ebcec)
- [ExtractThinker: Document Intelligence for LLMs](https://medium.com/towards-artificial-intelligence/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef)

- [GitHub Issues](https://github.com/enoch3712/Open-DocLLM/issues)

## 📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for more details.

## Contact

For any questions or issues, please open an issue on the GitHub repository or reach out via email.
