| 1 | +# World-class Document Processing Pipeline with Ground X |
| 2 | + |
| 3 | +This application demonstrates how to build a document processing pipeline for complex documents containing tables, figures, and dense text, using Ground X's parsing technology. Users upload a document and receive extracted text, summaries, keywords, and an interactive AI-powered chat for querying the document. |
| 4 | + |
| 5 | +We use: |
| 6 | + |
| 7 | +- Ground X for SOTA document processing and X-Ray analysis |
| 8 | +- Streamlit for the UI |
| 9 | +- Ollama for serving the LLM locally |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## Setup and Installation |
| 14 | + |
| 15 | +Ensure you have Python 3.8.1 or later installed on your system. |
| 16 | + |
| 17 | +Install dependencies: |
| 18 | + |
| 19 | +```bash |
| 20 | +uv sync |
| 21 | +``` |
| 22 | + |
| 23 | +Copy `.env.example` to `.env` and set the following environment variable: |
| 24 | + |
| 25 | +``` |
| 26 | +GROUNDX_API_KEY=your_groundx_api_key_here |
| 27 | +``` |
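How `app.py` reads this variable is not shown here. As a minimal, stdlib-only sketch (the helper names `load_env_file` and `get_groundx_api_key` are hypothetical, not part of the project), the key could be loaded and validated before any Ground X call is made:

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_groundx_api_key(env_file: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GROUNDX_API_KEY") or load_env_file(env_file).get(
        "GROUNDX_API_KEY", ""
    )
    # Reject the placeholder copied from .env.example as well as an empty value.
    if not key or key == "your_groundx_api_key_here":
        raise RuntimeError(
            "GROUNDX_API_KEY is missing or still set to the placeholder value"
        )
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later during document upload.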
| 28 | + |
| 29 | +```bash |
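| 30 | +# Install Ollama from https://ollama.ai/ |
| 31 | +# Start the Ollama service (skip if it is already running; otherwise keep it running in a separate terminal) |
| 32 | +ollama serve |
| 33 | +# Pull the required model (requires the service to be running) |
| 34 | +ollama pull phi3:mini |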
| 35 | +``` |
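Before launching the app, it can help to confirm that the Ollama service is reachable and that `phi3:mini` has been pulled. The sketch below assumes Ollama is listening on its default local address (`http://localhost:11434`) and uses its `/api/tags` endpoint, which lists locally available models. The function names are illustrative and not part of this project.

```python
import json
import urllib.request

# Assumption: Ollama runs on its default host and port.
OLLAMA_URL = "http://localhost:11434"


def model_is_listed(tags_response: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [m.get("name", "") for m in tags_response.get("models", [])]
    return model in names


def ollama_has_model(model: str = "phi3:mini", base_url: str = OLLAMA_URL) -> bool:
    """Query the local Ollama server; return False if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            return model_is_listed(json.load(resp), model)
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers malformed JSON.
        return False
```

Calling `ollama_has_model()` from a startup check would let the app show a hint such as "run `ollama pull phi3:mini`" instead of failing on the first chat request.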
| 36 | + |
| 37 | +Run the Streamlit app: |
| 38 | + |
| 39 | +```bash |
| 40 | +streamlit run app.py |
| 41 | +``` |
| 42 | + |
| 43 | +## Project Structure |
| 44 | + |
| 45 | +``` |
| 46 | +groundX-doc-pipeline/ |
| 47 | +├── app.py # Main Streamlit application (uses groundx_utils.py) |
| 48 | +├── groundx_utils.py # Utility functions for Ground X operations |
| 49 | +├── .env # Environment variables (create from .env.example) |
| 50 | +├── file/ # Folder containing files for running evaluation |
| 51 | +└── README.md # This file |
| 52 | + |
| 53 | +📁 Evaluation Tools: |
| 54 | +├── evaluation_geval.py # GEval framework evaluation |
| 55 | +└── run_evaluation_cli.py # CLI evaluation runner |
| 56 | +``` |
| 57 | + |
| 58 | +## Usage |
| 59 | + |
| 60 | +1. Upload a document using the sidebar (supports PDF, PNG, JPG, JPEG, DOCX) |
| 61 | +2. Wait for the document to be processed by Ground X |
| 62 | +3. Explore the X-Ray analysis results in different tabs: |
| 63 | + - JSON Output: Raw analysis data |
| 64 | + - Narrative Summary: Extracted narratives |
| 65 | + - File Summary: Document overview |
| 66 | + - Suggested Text: AI-suggested content |
| 67 | + - Extracted Text: Raw text extraction |
| 68 | + - Keywords: Document keywords |
| 69 | +4. Use the chat interface to ask questions about your document |
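The list of supported upload types in step 1 can be enforced with a small check before a file is sent for processing. This is a sketch only; `SUPPORTED_EXTENSIONS` and `is_supported_upload` are hypothetical names, and the app itself may rely on the uploader widget to do this filtering.

```python
from pathlib import Path

# Extensions taken from the supported formats listed in step 1.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".docx"}


def is_supported_upload(filename: str) -> bool:
    """Case-insensitive check of the file extension against the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

In a Streamlit UI, passing the same extensions to the `type` argument of `st.file_uploader` restricts the file picker in a similar way. A server-side check like the one above remains useful when files arrive by another route, such as the CLI evaluation runner.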
| 70 | + |
| 71 | +## Features |
| 72 | + |
| 73 | +The app implements the following document processing workflow: |
| 74 | + |
| 75 | +- **Ground X Bucket Management**: Automatic bucket creation and document organization |
| 76 | +- **Document Ingestion**: Support for PDF, Word docs, images, and more |
| 77 | +- **X-Ray Analysis**: Rich structured data with summaries, page chunks, keywords, and metadata |
| 78 | +- **Context Engineering**: Intelligent context preparation for LLM queries |
| 79 | +- **AI Chat Interface**: Interactive Q&A powered by a local LLM |
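To illustrate what the context engineering step might involve, the sketch below assembles an LLM prompt from X-Ray style results under a character budget. The dictionary keys (`summary`, `keywords`, `chunks`) and both function names are assumptions made for illustration; the real X-Ray JSON schema and the logic in `groundx_utils.py` may differ.

```python
def build_context(xray: dict, max_chars: int = 4000) -> str:
    """Concatenate summary, keywords, and chunks until the budget is reached."""
    parts = []
    # Assumed keys: "summary" (str), "keywords" (list of str), "chunks" (list of str).
    if xray.get("summary"):
        parts.append(f"Summary:\n{xray['summary']}")
    if xray.get("keywords"):
        parts.append("Keywords: " + ", ".join(xray["keywords"]))
    for i, chunk in enumerate(xray.get("chunks", []), start=1):
        parts.append(f"Chunk {i}:\n{chunk}")

    context = ""
    for part in parts:
        candidate = part if not context else context + "\n\n" + part
        if len(candidate) > max_chars:
            break  # Stop before exceeding the budget; earlier parts take priority.
        context = candidate
    return context


def build_prompt(question: str, xray: dict, max_chars: int = 4000) -> str:
    """Wrap the assembled context and the user question in a simple instruction."""
    context = build_context(xray, max_chars)
    return (
        "Answer the question using only the document context below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Placing the summary and keywords first means they survive truncation. That ordering is a design choice suited to a small local model with a limited context window, not something the source prescribes.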
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +## 📬 Stay Updated with Our Newsletter! |
| 84 | + |
| 85 | +**Get a FREE Data Science eBook** 📖 with 150+ essential lessons in Data Science when you subscribe to our newsletter! Stay in the loop with the latest tutorials, insights, and exclusive resources. [Subscribe now!](https://join.dailydoseofds.com) |
| 86 | + |
| 89 | +--- |
| 90 | + |
| 91 | +## Contribution |
| 92 | + |
| 93 | +Contributions are welcome! Please fork the repository and submit a pull request with your improvements. |