HADES-Lab transforms long-form technical corpora into a context-preserving knowledge graph stored in ArangoDB. The stack is built around the Conveyance Framework, enforcing late chunking, Unix-socket HTTP/2 transports, and reproducible benchmarks for every workflow.
- Late chunking extraction and embedding pipeline (Docling + Jina/SentenceTransformers) for PDFs, code, and hybrid corpora.
- Hardened ArangoDB HTTP/2 clients with dedicated RO/RW Unix-socket proxies at `/run/hades/readonly/arangod.sock` and `/run/hades/readwrite/arangod.sock`.
- Modular workflows in `core/workflows` with state management, resumable ingest, and Conveyance-aware orchestration.
- Observability and benchmarking via `core/monitoring`, `tests/benchmarks`, and structured reports in `benchmarks/reports/`.
- Legacy prototypes (PostgreSQL metadata, MCP servers) archived under `Acheron/` and kept out of the production path.
HADES-Lab/
├── core/ # Production modules enforcing the Conveyance pipeline
│ ├── config/ # YAML-first configuration loader and defaults
│ ├── extractors/ # Docling/Tree-sitter powered content extraction (late chunking inputs)
│ ├── embedders/ # Jina & SentenceTransformers backends with chunk orchestration guarantees
│ ├── processors/ # High-level processors wiring extractors, embedders, and storage
│ ├── logging/ # Structured logging helpers emitting Conveyance metrics (writes to `core/logs/`)
│ ├── monitoring/ # Throughput, latency, and progress instrumentation utilities
│ ├── database/arango/ # HTTP/2 clients, Go proxies, PHP Unix bridge
│ ├── workflows/ # Late-chunking orchestrators and CLI entry points
│ └── tools/ # Domain utilities (ArXiv ingestion, RAG helpers, etc.)
├── tests/ # pytest suites mirroring core packages and workflows
├── docs/ # Conveyance framework, PRDs, benchmarks, deployment notes
├── setup/ # Environment bootstrap & storage verification scripts (Postgres removed)
├── dev-utils/ # Operator utilities and ingest monitoring helpers
├── benchmarks/ # Captured benchmark artefacts (`reports/` JSON)
├── AGENTS.md # Contributor guide and coding conventions
├── Acheron/ # Archived experiments and legacy implementations (read-only)
└── …
- `core/extractors` isolates document parsing, OCR, and structural enrichment. Docling-backed extractors emit full-document payloads before late chunking, while Tree-sitter utilities surface code symbols.
- `core/embedders` implements the late chunking vector generators (Jina V4, SentenceTransformers variants) and the factory wiring batch sizes, devices, and fp16 modes.
- `core/processors` composes extractors, embedders, and database adapters into reusable document processors consumed by workflows.
- `core/logging` centralizes structured logging and Conveyance-specific log fields so telemetry lands consistently in `core/logs/` or downstream sinks.
- `core/monitoring` provides progress/throughput trackers, performance collectors, and metrics surfaces used by workflows, CLI monitors, and regression tests.
- `core/database/arango` ships the optimized HTTP/2 memory client and Unix-socket proxies; use it instead of legacy HTTP bridges.
- `core/workflows` exposes the CLI entry points (`workflow_pdf.py`, `workflow_arxiv_initial_ingest.py`, etc.) that orchestrate the subsystems end to end.
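The extractor → embedder → storage composition can be sketched as a minimal pipeline. The function shapes below are illustrative assumptions for the sketch, not the actual `core/processors` API:

```python
from typing import Callable, List, Tuple

# Illustrative only: the real interfaces live in core/extractors, core/embedders,
# and core/processors; every name and signature here is an assumption.
Chunk = Tuple[str, List[float]]  # (chunk text, embedding vector)

def process_document(
    extract: Callable[[bytes], str],            # full-document extraction (Docling-style)
    embed_late: Callable[[str], List[Chunk]],   # late chunking: encode whole doc, then split
    store: Callable[[List[Chunk]], int],        # storage adapter (ArangoDB-style)
    raw: bytes,
) -> int:
    text = extract(raw)        # 1. extract the full document (no early chunking)
    chunks = embed_late(text)  # 2. chunk boundaries are applied after encoding
    return store(chunks)       # 3. persist chunks with their context-aware vectors
```

The point of the ordering is that chunking happens strictly after full-document encoding, which is the invariant the real processors enforce.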
- Install dependencies: `poetry install`.
- Configure environment: `cp .env.example .env` and set `ARANGO_PASSWORD`, `ARANGO_RO_SOCKET`, `ARANGO_RW_SOCKET`, GPU flags, and any pipeline overrides.
- Build proxies: `go build ./core/database/arango/proxies/...` (binaries in `cmd/{roproxy,rwproxy}`); run with `LISTEN_SOCKET=/run/hades/readonly/arangod.sock UPSTREAM_SOCKET=/run/arangodb3/arangod.sock go run ./cmd/roproxy` (repeat for the RW proxy).
- Verify tooling: `poetry run python setup/verify_environment.py` (MCP readiness output is legacy and can be ignored) and `poetry run python setup/verify_storage.py` once ArangoDB is reachable.
- Optional automation: `bash setup/setup_local.sh` performs the checks above and validates GPU availability.
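As a rough illustration of the environment check behind these steps, a minimal sketch might look like the following. The variable list mirrors the configuration step above; the actual `verify_environment.py` may check more:

```python
import os
from typing import List, Mapping, Optional

# Variables named in the setup steps above; the real script may check more.
REQUIRED_VARS = ("ARANGO_PASSWORD", "ARANGO_RO_SOCKET", "ARANGO_RW_SOCKET")

def missing_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running such a check early turns a cryptic mid-ingest connection failure into an actionable list of missing settings.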
- `poetry run python -m compileall core` – quick syntax sweep after editing Python modules.
- `poetry run ruff check` / `poetry run ruff format` – lint and auto-format to project standards.
- `poetry run pytest [-k pattern]` – execute unit and integration tests aligned with `tests/`.
- `go build ./core/database/arango/proxies/...` – ensure RO/RW proxy binaries remain buildable.
- `poetry run python tests/benchmarks/conveyance_logger.py --help` – emit Conveyance evidence for benchmark runs.
The optimized memory client (`core/database/arango/memory_client.py`) negotiates HTTP/2 over Unix sockets and prefers the hardened proxies:

- Read-only socket: `/run/hades/readonly/arangod.sock`
- Read-write socket: `/run/hades/readwrite/arangod.sock`
Override the sockets with the environment variables `ARANGO_RO_SOCKET` and `ARANGO_RW_SOCKET`, or bypass the proxies entirely via `ARANGO_SOCKET`. Proxy binaries accept `LISTEN_SOCKET` and `UPSTREAM_SOCKET` overrides; create the socket directories ahead of time and keep permissions at 0640 (RO) / 0600 (RW). Regression tests and latency sampling live in `tests/benchmarks/arango_connection_test.py`.
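The override order and the permission expectations can be expressed as small helpers. This is a sketch of the documented behaviour, not code lifted from `memory_client.py`:

```python
import os
import stat
from typing import Mapping, Optional

def resolve_socket(readonly: bool = True, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a socket path per the documented override order (sketch):
    ARANGO_SOCKET bypasses the proxies; otherwise use the RO/RW proxy defaults."""
    env = os.environ if env is None else env
    direct = env.get("ARANGO_SOCKET")
    if direct:
        return direct
    if readonly:
        return env.get("ARANGO_RO_SOCKET", "/run/hades/readonly/arangod.sock")
    return env.get("ARANGO_RW_SOCKET", "/run/hades/readwrite/arangod.sock")

def socket_mode_ok(path: str, expected: int) -> bool:
    """Check that path is a Unix socket with the documented mode (0o640 RO, 0o600 RW)."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return stat.S_ISSOCK(st.st_mode) and stat.S_IMODE(st.st_mode) == expected
```

A pre-flight call such as `socket_mode_ok(resolve_socket(readonly=True), 0o640)` catches a mis-permissioned proxy socket before the first query does.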
Late chunking is mandatory across every workflow. Key entry points:
- `core/workflows/workflow_arxiv_initial_ingest.py` – CLI for large-scale ingest (`poetry run python core/workflows/workflow_arxiv_initial_ingest.py --help`) combining Docling extraction, Jina V4 embeddings, and HTTP/2 storage writes.
- `core/workflows/workflow_pdf_batch.py` / `workflow_pdf.py` – reusable batch/single PDF orchestrators with checkpointing support.
- `core/workflows/workflow_arxiv_single_pdf.py` – programmable single-paper Conveyance bundle generator for agent pipelines.
Supporting modules live in `core/extractors`, `core/processors`, `core/embedders`, and `core/database`. Compose new workflows via the factories in those packages while preserving the late chunking guarantee.
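To make the guarantee concrete, here is a toy contrast between late and naive chunking. The two-dimensional "encoder" below is a stand-in for a transformer, chosen only so that a token's vector depends on its neighbour; it is not the Jina V4 pipeline:

```python
from typing import List

def encode(tokens: List[str]) -> List[List[float]]:
    """Toy contextual encoder: each token's vector also reflects its left
    neighbour, mimicking a transformer's full-document attention."""
    out = []
    for i, tok in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else ""
        out.append([float(len(tok)), float(len(left))])
    return out

def late_chunk(tokens: List[str], spans: List[range]) -> List[List[float]]:
    """Late chunking: encode the WHOLE document once, then mean-pool per span."""
    vecs = encode(tokens)
    return [[sum(c) / len(c) for c in zip(*(vecs[i] for i in span))] for span in spans]

def naive_chunk(tokens: List[str], spans: List[range]) -> List[List[float]]:
    """Naive chunking: encode each span in isolation, losing cross-chunk context."""
    pooled = []
    for span in spans:
        vecs = encode([tokens[i] for i in span])
        pooled.append([sum(c) / len(c) for c in zip(*vecs)])
    return pooled
```

With `tokens = ["alpha", "beta", "gamma", "delta"]` and spans `[range(0, 2), range(2, 4)]`, the second chunk's vector differs between the two strategies because only late chunking lets "gamma" see "beta" across the chunk boundary.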
Benchmark evidence for Arango transports and workflow throughput resides in `docs/benchmarks/`, with JSON artefacts tracked under `benchmarks/reports/`. Use `tests/benchmarks/arango_connection_test.py` for transport latency sampling and the scripts in `dev-utils/` (for example `monitor_workflow_logs.py`) for ingest telemetry. After deployments, run `poetry run python setup/verify_storage.py` to confirm collection/index health.
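When eyeballing latency samples from a transport run, a summary along these lines is handy. The field names are illustrative, not the schema of the JSON artefacts in `benchmarks/reports/`:

```python
import statistics
from typing import Dict, List

def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize latency samples with the percentiles benchmark reports tend to
    track (field names are assumptions, not the project's report schema)."""
    ordered = sorted(samples_ms)
    idx95 = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[idx95],
        "max_ms": ordered[-1],
    }
```

Tracking p95 rather than the mean keeps a single slow proxy round-trip from hiding inside an otherwise healthy average.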
- `AGENTS.md` – contributor guidelines, naming conventions, and testing expectations.
- `docs/CONVEYANCE_FRAMEWORK.md` – theoretical baseline for Conveyance and late chunking.
- `docs/prd/` – product requirements and design decisions (completed PRDs live in `docs/prd/completed/`).
- `docs/deploy/` & `setup/` – runbooks for proxy/systemd configuration and environment validation.
Legacy PostgreSQL and MCP artifacts remain only for historical context (`Acheron/`, `CLAUDE.md`) and are not part of the supported architecture.
Licensed under the Apache License 2.0 (LICENSE).