One-liner:
AI-powered backend system that extracts structured data from invoices (PDFs and images, including scanned and multi-page documents) at scale, flags errors, and stores results efficiently.
- Features
- High-Level Architecture
- Dead Letter Queue (DLQ)
- Retry Strategy with Exponential Backoff
- Tech Stack
- Installation & Setup
- Scripts
- Environment Variables
- API Endpoints
- Sample Output
- Validation & Error Handling
- Scaling Workers & Concurrency
- Token Efficiency
- Postman Collection
- License
- Batch & Async Processing: Upload multiple invoices; files queued in RabbitMQ.
- High Concurrency: Single worker can process ~100 files in parallel; scale horizontally with multiple workers.
- GenAI-Powered Extraction: Extracts vendor details, invoice metadata, line items, totals, and payment info.
- Error Detection: Flags mismatches in subtotal, sales tax, shipping, and total.
- Multi-format Support: PDFs, images, scanned, and multi-page invoices.
- Token-Efficient YAML Output: Saves 20–30% of AI token costs; converted to JSON for DB storage.
Workflow Summary:
- Users upload invoices via `/api/file/upload`.
- Files queued in RabbitMQ.
- Workers fetch file IDs from the queue and run GenAI extraction (parallel processing).
- YAML output generated → converted to JSON → stored in MongoDB.
- Validation checks applied; errors flagged.
- Processed results accessible via API.
- Invoices or jobs that fail processing (e.g., unreadable files, persistent GenAI errors) are automatically routed to a Dead Letter Queue (DLQ).
- DLQ allows you to review failed jobs, retry them manually, or trigger alerts.
- Keeps the main RabbitMQ queue unblocked and ensures smooth processing of other invoices.
- Common causes:
  - Corrupted PDF or image
  - Unsupported invoice format
  - Persistent AI extraction failure
  - Any other unrecoverable error
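As a rough illustration of the DLQ routing described above, the sketch below builds queue options for a main queue that dead-letters rejected messages. The queue and exchange names are hypothetical (the real ones are not shown in this README), and the wiring assumes the `amqplib` client, which may differ from what the project actually uses.

```javascript
// Hypothetical names, chosen only for this sketch.
const MAIN_QUEUE = 'invoice.files';
const DEAD_LETTER_EXCHANGE = 'invoice.dlx';
const DEAD_LETTER_QUEUE = 'invoice.files.dlq';

// Options for the main queue: messages rejected without requeue are
// republished by RabbitMQ to the dead letter exchange.
function mainQueueOptions() {
  return {
    durable: true,
    deadLetterExchange: DEAD_LETTER_EXCHANGE,
    deadLetterRoutingKey: DEAD_LETTER_QUEUE,
  };
}

// With amqplib, the topology could be declared roughly like this:
//   await channel.assertExchange(DEAD_LETTER_EXCHANGE, 'direct', { durable: true });
//   await channel.assertQueue(DEAD_LETTER_QUEUE, { durable: true });
//   await channel.bindQueue(DEAD_LETTER_QUEUE, DEAD_LETTER_EXCHANGE, DEAD_LETTER_QUEUE);
//   await channel.assertQueue(MAIN_QUEUE, mainQueueOptions());
// A worker that gives up on a job would then reject it without requeueing:
//   channel.nack(msg, false, false);
```

Keeping the dead-letter settings on the queue itself means workers only need to reject a message; RabbitMQ handles moving it out of the main queue.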
- Failed jobs are retried automatically using an exponential backoff strategy.
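- Wait time increases after each failed attempt to prevent overloading the system. The example schedule 60s → 120s → 180s → 240s grows by a fixed 60s step; a strictly exponential schedule would double each time (60s → 120s → 240s → 480s).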
- Jobs exceeding the maximum retry count are sent to the Dead Letter Queue (DLQ) for manual review or alerts.
- Ensures smooth processing while handling temporary errors gracefully.
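The retry rules above can be sketched as a small pure function. The 60s base delay comes from the example schedule; the maximum retry count of 4 and all function names are assumptions made for illustration. Since the example schedule (60s → 120s → 180s → 240s) grows linearly while the strategy is described as exponential, both variants are shown.

```javascript
const BASE_DELAY_MS = 60_000; // 60s, taken from the example schedule
const MAX_RETRIES = 4;        // assumption: the real limit is not stated

// Linear growth: 60s, 120s, 180s, 240s for attempts 1..4.
function linearDelay(attempt) {
  return BASE_DELAY_MS * attempt;
}

// Exponential growth (doubling): 60s, 120s, 240s, 480s for attempts 1..4.
function exponentialDelay(attempt) {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

// `attempt` is the 1-based count of failures so far for this job.
function nextStep(attempt, delayFn = exponentialDelay) {
  if (attempt > MAX_RETRIES) {
    return { action: 'dead-letter' };
  }
  return { action: 'retry', delayMs: delayFn(attempt) };
}
```

For example, `nextStep(3)` schedules a retry after 240000 ms with doubling, while `nextStep(3, linearDelay)` waits 180000 ms; `nextStep(5)` routes the job to the DLQ.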
| Layer | Technology |
|---|---|
| Backend Framework | Express.js |
| Database | MongoDB |
| Queue / Async Jobs | RabbitMQ |
| Rate Limiting | Redis |
| Containerization | Docker & Docker Compose |
| AI / Data Extraction | Generative AI (Gemini) |
⚠️ Ensure `.env` includes your Gemini API key.
# Build and start containers
docker compose up --build -d
# Access the app
http://localhost:3000

# Start required services
docker compose up -d
# Start development server
npm run dev
# Start background worker
npm run worker
# OR start with PM2 for production
npm run start

"scripts": {
"dev": "node --env-file=.env --watch index.js",
"worker": "node --env-file=.env worker/file.worker.js",
"queue:flush": "node --env-file=.env scripts/amqp.flush.js",
"start": "pm2 start ecosystem.config.cjs"
}

GEMINI_API_KEY=<your-api-key>
AMQP_URL=amqp://user:password@localhost:5672
MONGODB_URI=mongodb://localhost:27017/invoices
REDIS_HOST=localhost
REDIS_PORT=6379
MAX_FILE_SIZE=200
PORT=3000
WORKER_CONCURRENCY=100

- Endpoint: `/api/file/upload`
- Method: POST
- Request: Multipart/form-data (multiple files)
- Form Key: files
- Response:
{
"message": "Files uploaded and queued",
"file_ids": ["<file-id-1>", "<file-id-2>"]
}

- Endpoint: `/api/file`
- Method: GET
- Request Body Example:
- filters (status): one of pending, processing, processed, error
- flag: true or false (set to true to fetch files flagged for calculation or other errors)
{
"flag": true,
"status": "processed"
}

- Endpoint: `/api/file/{invoice_id}`
- Method: GET
- FormData Example: Include file(s) to process if needed
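A minimal client sketch for the upload endpoint, assuming Node 18+ (where `fetch`, `FormData`, and `Blob` are globals). The helper names and the `{ name, bytes, type }` input shape are hypothetical; only the path `/api/file/upload`, the POST method, and the `files` form key come from the description above.

```javascript
// Build a multipart form with every file under the "files" key.
function buildUploadForm(files) {
  const form = new FormData();
  for (const { name, bytes, type } of files) {
    form.append('files', new Blob([bytes], { type }), name);
  }
  return form;
}

// POST the form to the upload endpoint and return the parsed JSON response.
async function uploadInvoices(baseUrl, files) {
  const res = await fetch(`${baseUrl}/api/file/upload`, {
    method: 'POST',
    body: buildUploadForm(files), // fetch sets the multipart boundary header
  });
  if (!res.ok) {
    throw new Error(`Upload failed with HTTP ${res.status}`);
  }
  return res.json();
}

// Usage (requires the server to be running):
//   const bytes = await require('node:fs/promises').readFile('invoice.pdf');
//   const result = await uploadInvoices('http://localhost:3000', [
//     { name: 'invoice.pdf', bytes, type: 'application/pdf' },
//   ]);
//   console.log(result.file_ids);
```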
items:
  - name: "Activator CREAM 5-gal 5 gallons/pailt"
    quantity: 2
    rate: 269.00
  - name: "AEB2461 PTFE 10\"x36YDS 5mil Glass Cloth Fabric / No Adh."
    quantity: 5
    rate: 64.00
  - name: "Permabond 106 Cyanoacrylate 1-oz 10 bottles/case"
    quantity: 7
    rate: 139.00
  - name: "Araldite 2014 HT Epoxy Paste GRAY 50ml 2:1 6 cartridges/box | 120 cartridges/case"
    quantity: 1
    rate: 719.00
payment_info:
  subtotal: 1187
  sales_tax_percentage: 8
  shipping_handling_cost: 50
  total: 1330.75
errors:
  - "Subtotal mismatch: calculated 2550.00 vs declared 1187.00"
  - "Sales Tax amount mismatch: calculated 204.00 vs declared 94.75 for sales tax rate 8%"
  - "Total mismatch: calculated 2804.00 vs declared 1330.75"

Stored in MongoDB as JSON (YAML converted to JSON internally).
- Flags subtotal, sales tax, shipping, and total mismatches.
- Invalid invoices are stored with an `errors` array for review.
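The checks above can be approximated with a short function. This is a sketch, not the project's validator: it assumes total = subtotal + sales tax + shipping (which matches the calculated figures in the sample output), uses a hypothetical half-cent tolerance, and skips the sales tax amount check because the declared tax amount field is not shown in the sample.

```javascript
const TOLERANCE = 0.005; // assumption: differences under half a cent are ignored
const money = (n) => n.toFixed(2);

function validateInvoice(invoice) {
  const errors = [];
  const { subtotal, sales_tax_percentage, shipping_handling_cost, total } =
    invoice.payment_info;

  // Recompute the figures from the line items.
  const calcSubtotal = invoice.items.reduce(
    (sum, item) => sum + item.quantity * item.rate,
    0
  );
  const calcTax = (calcSubtotal * sales_tax_percentage) / 100;
  const calcTotal = calcSubtotal + calcTax + shipping_handling_cost;

  if (Math.abs(calcSubtotal - subtotal) > TOLERANCE) {
    errors.push(
      `Subtotal mismatch: calculated ${money(calcSubtotal)} vs declared ${money(subtotal)}`
    );
  }
  if (Math.abs(calcTotal - total) > TOLERANCE) {
    errors.push(
      `Total mismatch: calculated ${money(calcTotal)} vs declared ${money(total)}`
    );
  }
  return errors;
}
```

Run against the sample invoice (line items summing to 2550.00, 8% tax, 50 shipping), this yields the subtotal and total mismatch messages shown in the sample output.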
- Worker Concurrency: Default 100 files per worker.
- Multiple Workers: Run multiple workers to scale horizontally.
- File Size Limit: Default 200MB; configurable.
- Rate Limiting: Redis-backed rate limiting (via ioredis) keeps API usage stable.
- Tips: Monitor CPU/memory and RabbitMQ queues for optimal throughput.
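To illustrate what a per-worker concurrency cap means in practice, here is a minimal sketch of a bounded task runner. It is not the project's worker code; `runWithLimit` is a hypothetical helper, and a RabbitMQ consumer might instead cap in-flight messages with the channel prefetch setting.

```javascript
// Run async task functions with at most `limit` in flight at once.
// Results are returned in the same order as the input tasks.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next task to start

  // Each lane pulls the next unstarted task until none remain.
  async function lane() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  const laneCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage, reading the cap from the environment (defaults to 100):
//   const limit = Number(process.env.WORKER_CONCURRENCY ?? 100);
//   const outputs = await runWithLimit(fileIds.map((id) => () => processFile(id)), limit);
// `fileIds` and `processFile` are placeholders for the worker's own logic.
```

If any task rejects, `Promise.all` rejects as well, so a real worker would likely wrap each task in its own error handling to feed the retry and DLQ logic.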
- YAML formatting reduces AI token usage by ~20–30%.
- In the current scenario, YAML output has about 30% fewer characters than the equivalent JSON, which also means faster output.
- Read more: How I Saved Millions in GenAI Token Costs
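One way to sanity-check the character savings on your own data is to compare the two serializations directly. The sketch below uses a tiny made-up invoice and a hand-written YAML string, so it shows only how to measure; character counts are a proxy, and actual token savings depend on the model's tokenizer.

```javascript
// Fraction of characters saved by `candidate` relative to `baseline`.
function charReduction(baseline, candidate) {
  return (baseline.length - candidate.length) / baseline.length;
}

// Made-up example data, not taken from a real invoice.
const invoice = {
  items: [{ name: 'Widget', quantity: 2, rate: 269 }],
  payment_info: { subtotal: 538, total: 538 },
};

const asJson = JSON.stringify(invoice, null, 2);
const asYaml = [
  'items:',
  '  - name: Widget',
  '    quantity: 2',
  '    rate: 269',
  'payment_info:',
  '  subtotal: 538',
  '  total: 538',
].join('\n');

const saved = charReduction(asJson, asYaml);
console.log(`YAML uses ${(saved * 100).toFixed(1)}% fewer characters than pretty-printed JSON`);
```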
- Import Postman collection from `postman_collection.json`.
- Base URL: `http://localhost:3000`
- Test endpoints: upload invoices, fetch all files, fetch file by ID.
This project is licensed under the MIT License.