Merged

Changes from all commits · 21 commits
60ff21c
fix: Critical Jina embedder performance and config issues (Issue #45)
r3d91ll Sep 17, 2025
c28ea36
fix: Exclude test files from CodeRabbit review path filters
r3d91ll Sep 17, 2025
ba173e3
refactor: Remove redundant SentenceTransformersEmbedder (Issue #46)
r3d91ll Sep 17, 2025
9a17dca
fix: Critical workflow issues preventing data loss and crashes (Issue…
r3d91ll Sep 17, 2025
dd42c43
refactor: Update .gitignore to include additional environment and cre…
r3d91ll Sep 17, 2025
8645ccd
Merge remote-tracking branch 'origin/main' into feature/issue-46-remo…
r3d91ll Sep 17, 2025
7227c48
chore: Move tests and dev-utils to local-only, reorganize setup scripts
r3d91ll Sep 17, 2025
5a4dc62
Remove deprecated setup scripts and test files
r3d91ll Sep 17, 2025
6042fd6
Refactor type hints and improve code documentation across multiple mo…
r3d91ll Sep 18, 2025
f077583
Remove the ArXiv Size-Sorted Processing Workflow (Simplified) script,…
r3d91ll Sep 18, 2025
4c82dd6
feat: Add ArXiv initial ingest workflow and PHP ArangoDB bridge
r3d91ll Sep 18, 2025
afd4e64
refactor: Improve connection handling and enhance bulk insert safety …
r3d91ll Sep 18, 2025
a90a8c9
feat: update dependencies and add grpcio packages
r3d91ll Sep 20, 2025
f24a1ed
chore: remove obsolete benchmark report files and update .gitignore t…
r3d91ll Sep 20, 2025
3f516e5
Refactor storage backend imports and error handling
r3d91ll Sep 20, 2025
2284909
fix: correct spelling in comments and update test file exclusion note
r3d91ll Sep 20, 2025
ae03738
Refactor ArXiv ingestion workflow and database schema
r3d91ll Sep 20, 2025
ce4f964
Refactor ArangoDB collections and workflows for arXiv ingestion
r3d91ll Sep 20, 2025
9b401fc
chore: update .gitignore to exclude Go cache files and remove obsolet…
r3d91ll Sep 20, 2025
ac06da3
Merge pull request #53 from r3d91ll/pr52
r3d91ll Sep 20, 2025
c318fac
Implement Virtual Context Management via ArangoDB and vLLM Model Mana…
r3d91ll Sep 21, 2025
17 changes: 16 additions & 1 deletion .coderabbit.yaml
@@ -52,6 +52,7 @@ reviews:
- "!**/*.pyo"
- "!**/*.egg-info/**"
- "!**/Acheron/**" # Our deprecated code archive
- "!**/tests/**" # Exclude tests from review scope

  # Path-specific instructions - apply to all paths
  path_instructions:
@@ -99,6 +100,20 @@
        - Zero-propagation check
        - Evidence links

- path: "core/workflows/**/*.py"
instructions: |
Embedding strategy expectations:
- Prefer JinaV4Embedder with late chunking for high-fidelity workflows.
- When SentenceTransformersEmbedder is selected for throughput, document the trade-off explicitly.
- Confirm late chunking paths never discard tail tokens.

- path: "core/embedders/**/*.py"
instructions: |
Embedders must honor the standard EmbeddingConfig interface:
- Accept EmbeddingConfig or equivalent dict input.
- Ensure fallback embeddings match EMBEDDING_DIM.
- Preserve late chunking semantics (encode once, pool later).

Comment on lines +103 to +116

Contributor
⚠️ Potential issue

🧩 Analysis chain

Embedding path instructions LGTM; ensure code conforms.

Guidance matches learnings (late chunking, JinaV4 vs ST trade-offs). Verify core/workflows actually document trade-offs and never drop tail tokens.


🏁 Script executed:

#!/bin/bash
rg -n -g 'core/workflows/**/*.py' -C2 -P '(JinaV4Embedder|SentenceTransformersEmbedder|late[_\s-]?chunk)'

Length of output: 7257


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "1) Search workflows for trade-off keywords..."
rg -n -S -g 'core/workflows/**/*.py' -C3 -P '\b(throughput|fidelity|trade|trade-off|tradeoffs|trade-offs)\b' || true
echo
echo "2) Search workflows for token/tail/truncate/drop/discard..."
rg -n -S -g 'core/workflows/**/*.py' -C3 -P '\b(tail|token(s)?|truncate|truncat|discard|drop)\b' || true
echo
echo "3) Search embedders for embed_with_late_chunking and embedder classes/keywords..."
rg -n -S -g 'core/embedders/**/*.py' -C5 -P '(class\s+\w*Embedder|def\s+embed_with_late_chunking|embed_with_late_chunking|late[_\s-]?chunk|tokenizer|stride|truncate|truncat|tail|token(s)?|EMBEDDING_DIM|EmbeddingConfig)' || true
echo
echo "4) Files across core/ that reference embed_with_late_chunking..."
rg -n -S -g 'core/**/*.py' -C5 -P 'embed_with_late_chunking' || true
echo
echo "5) Print file heads and occurrences for files that implement embed_with_late_chunking..."
files=$(rg -l -S -g 'core/**/*.py' --hidden 'embed_with_late_chunking' || true)
if [ -n "$files" ]; then
  for f in $files; do
    echo
    echo "---- $f ----"
    rg -n -C5 -P 'embed_with_late_chunking' "$f" || true
    echo "----- file head (first 240 lines) -----"
    sed -n '1,240p' "$f" || true
  done
else
  echo "No files implementing embed_with_late_chunking found."
fi

Length of output: 75273


Embedding path instructions LGTM — critical fixes required

  • Workflows warn about SentenceTransformers deprecation but do not explicitly state the throughput vs. fidelity trade-off; add explicit trade-off messaging where embedder_type is handled (core/workflows/workflow_pdf.py:234-239; core/workflows/workflow_pdf_batch.py:444-448; core/workflows/workflow_arxiv_initial_ingest.py:91-95).
  • Late-chunking can discard tail tokens: core/embedders/embedders_jina.py::encode_full_document truncates to MAX_TOKENS (32,768) via tokenizer truncation, and workflows call embed_with_late_chunking directly (core/workflows/workflow_pdf.py:449-451; core/workflows/workflow_pdf_batch.py:506-507). Require change: ensure embed_with_late_chunking automatically falls back to process_long_document for texts > MAX_TOKENS, or update workflows to detect long documents and call process_long_document.
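
A minimal sketch of the requested fallback, assuming the existing `embed_with_late_chunking`, `process_long_document`, `encode_full_document`, and `MAX_TOKENS` names from `embedders_jina.py`; the tokenizer call and `_pool_chunks` helper are illustrative assumptions, not the repository's actual internals:

```python
def embed_with_late_chunking(self, text: str):
    # Count tokens without truncation so the true length is visible.
    token_count = len(self.tokenizer.encode(text, add_special_tokens=False))

    if token_count > self.MAX_TOKENS:  # 32,768 for Jina v4
        # Long input: fall back to the sliding-window path so tail
        # tokens are embedded instead of silently truncated.
        return self.process_long_document(text)

    # Short input: encode the whole document once, then derive chunks,
    # so chunk vectors keep document-level context (late chunking).
    full_encoding = self.encode_full_document(text)
    return self._pool_chunks(full_encoding)  # _pool_chunks is hypothetical
```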
🤖 Prompt for AI Agents
In .coderabbit.yaml around lines 103 to 116, update the embedder/workflow
instructions so they mandate two fixes: (1) In core/workflows/workflow_pdf.py
(lines ~234-239), core/workflows/workflow_pdf_batch.py (lines ~444-448), and
core/workflows/workflow_arxiv_initial_ingest.py (lines ~91-95) add explicit
messaging when embedder_type == SentenceTransformersEmbedder that documents the
throughput vs. fidelity trade-off (high throughput but lower embedding fidelity
and potential deprecation) where the embedder selection is handled; (2) In
core/embedders/embedders_jina.py ensure encode_full_document and
embed_with_late_chunking do not lose tail tokens by implementing an automatic
fallback: if tokenized length > MAX_TOKENS (32768) have embed_with_late_chunking
call the existing process_long_document flow (or perform streaming/chunking that
preserves tails) instead of relying on tokenizer truncation, or alternatively
update the workflows (core/workflows/workflow_pdf.py lines ~449-451 and
core/workflows/workflow_pdf_batch.py lines ~506-507) to detect long documents
and call process_long_document before invoking embed_with_late_chunking.

  # Pre-merge checks for PR descriptions
  pre_merge_checks:
    custom_checks:
@@ -113,4 +128,4 @@ reviews:

# Chat settings
chat:
  auto_reply: true
15 changes: 11 additions & 4 deletions core/.env.example → .env.example
@@ -4,12 +4,19 @@
# =========================
# Database Configuration
# =========================
# Using ARANGO_ prefix for compatibility with existing scripts
ARANGO_HOST=192.168.1.69
ARANGO_PORT=8529
# ArangoDB Authentication
ARANGO_USERNAME=root
ARANGO_PASSWORD=your_password_here
ARANGO_DATABASE=academy_store
ARANGO_DATABASE=arxiv_repository

# Socket Configuration (HADES uses Unix sockets only)
ARANGO_USE_PROXIES=true
ARANGO_RO_SOCKET=/run/hades/readonly/arangod.sock
ARANGO_RW_SOCKET=/run/hades/readwrite/arangod.sock

# Direct socket for development/debugging only
# ARANGO_USE_PROXIES=false
# ARANGO_SOCKET=/run/arangodb3/arangodb.sock

# Alternative: Use HADES_ prefix (takes priority if both are set)
# HADES_DB_HOST=192.168.1.69
28 changes: 27 additions & 1 deletion .gitignore
@@ -3,6 +3,9 @@ __pycache__/
*.py[cod]
*$py.class

# PHP dependencies (Composer)
vendor/

# C extensions
*.so

@@ -207,9 +210,25 @@ credentials/
config.ini
config.yaml
config.yml
*_config.json
secrets*.json
*.env
*.env.local
*.env.production
!config.example.*
!config.sample.*
!config.template.*
!*.example.env

# Database passwords and API keys
*password*.txt
*credential*.json
*secret*.yml
arxiv_repository.env
arango_password.txt
!*example*
!*template*
!*sample*

# Data files (customize as needed)
/data/
@@ -388,4 +407,11 @@ arxiv_pipeline_v2_results_*.json
tools/arxiv/REORGANIZATION_NOTICE.md
tools/arxiv/SCRIPT_REORGANIZATION_ANALYSIS.md
tools/arxiv/utils/SCRIPT_ANALYSIS.md
scratch/*
dev-utils/*
core/database/arango/proxies/.gocache/
*.gocache/
**/.gocache/
core/database/arango/proxies/roproxy
core/database/arango/proxies/rwproxy
benchmarks/reports/*
60 changes: 60 additions & 0 deletions AGENTS.md
@@ -0,0 +1,60 @@
# Repository Guidelines

## Project Structure & Module Organization

- `core/` holds Python source, including database clients (`core/database`), embedders (`core/embedders`), and workflows (`core/workflows`).
- `core/database/arango/proxies/` contains the Go RO/RW proxy sources (`cmd/roproxy`, `cmd/rwproxy`).
- `docs/` captures product requirements, benchmarks, and deployment runbooks; `setup/` contains automation scripts for local/bootstrap installs.
- Tests live under `tests/` and align with the package structure; use matching module paths when adding new suites.

## Build, Test, and Development Commands

- `poetry install` – create/update the virtualenv with project dependencies.
- `poetry run python -m compileall core` – quick syntax/bytecode sweep; run after touching Python modules.
- `poetry run pytest` – execute the Python test suite; add `-k` to target specific modules.
- `go build ./core/database/arango/proxies/...` – verify the RO/RW proxy binaries build cleanly.
- `poetry run ruff check` / `poetry run ruff format` – lint and auto-format Python code before committing.
- `poetry run python tests/benchmarks/conveyance_logger.py …` – translate benchmark JSON into Conveyance log entries (see docs/benchmarks/arango_phase4_summary.md for examples).

## Coding Style & Naming Conventions

- Python: 4-space indentation, type hints required for new public APIs, and docstrings for modules/classes/functions exposed outside a file.
- Follow PEP 8 naming (snake_case for functions/vars, PascalCase for classes). Keep module names lowercase.
- Go: leverage `gofmt` and conventional CamelCase identifiers; ensure proxy allowlists remain alphabetical.
- Avoid committing secrets; configuration belongs in `.env` or systemd drop-ins (e.g., `ARANGO_PASSWORD`, `ARANGO_RO_SOCKET`).

## Testing Guidelines

- Prefer `pytest`-style tests named `test_<feature>.py` with functions `test_<behavior>`.
- Mock external systems (ArangoDB, gRPC) and exercise both RO and RW paths of the memory client; a minimal sketch follows this list.
- Aim for ≥80% docstring/type coverage (mirrors CI expectations) and capture evidence for Conveyance calculations when adding benchmarks.
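
A minimal sketch of that shape, assuming the `DatabaseFactory.get_arango_memory_service()` and `execute_query()` interface shown in the README; the patched target and fake client are illustrative:

```python
# Illustrative sketch: the monkeypatched internals are assumptions.
from unittest.mock import MagicMock

import pytest

from core.database.database_factory import DatabaseFactory


@pytest.fixture
def memory_client(monkeypatch):
    fake = MagicMock()
    fake.execute_query.return_value = [{"_key": "0704_0001"}]
    # Patch the factory so no real ArangoDB/gRPC connection is made.
    monkeypatch.setattr(
        DatabaseFactory, "get_arango_memory_service", lambda *a, **kw: fake
    )
    return DatabaseFactory.get_arango_memory_service()


def test_read_only_query(memory_client):
    docs = memory_client.execute_query(
        "FOR doc IN @@collection LIMIT 1 RETURN doc",
        {"@collection": "arxiv_metadata"},
    )
    assert docs[0]["_key"] == "0704_0001"
    # A sibling test should exercise the RW path (inserts/updates) the same way.
```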

## Commit & Pull Request Guidelines

- Commit messages follow the repository’s style: imperative mood with a concise summary (e.g., “Add HTTP/2 proxy rollback guard”), optionally referencing issues (`#51`).
- Each PR should include: linked issue, Conveyance Summary, W/R/H/T mapping, Performance Evidence, and Tests & Compatibility sections.
- Attach benchmark artefacts under `benchmarks/reports/` when changing performance-sensitive code, and note any required manual follow-up steps (e.g., rebuilding systemd units).

## Conveyance Framework Expectations

- Frame design and review decisions using the efficiency view: `C = (W·R·H / T) · Ctx^α`, where `Ctx = wL·L + wI·I + wA·A + wG·G` and α is typically 1.7.
- Always report which factors improved (W, R, H, T, or Ctx) and include a zero-propagation check (if any base factor is 0 or T → ∞, declare C = 0); a sketch of this check follows the list.
- Benchmark notes must map measurements to W/R/H/T and cite context scores (L/I/A/G) so reviewers can recompute Conveyance.
- Late chunking is mandatory: encode documents once, then derive context-aware chunks. Choose embedders per workload (SentenceTransformers for high throughput, JinaV4 for fidelity) and document the trade-off in PRs.
- Further details on the Conveyance Framework can be found in `docs/CONVEYANCE_FRAMEWORK.md`.
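
A minimal sketch of the efficiency view with the zero-propagation check made explicit; the function and weight defaults are ours for illustration, not a repository API:

```python
import math

def conveyance(W, R, H, T, L, I, A, G,
               weights=(0.25, 0.25, 0.25, 0.25), alpha=1.7):
    """Efficiency view: C = (W*R*H / T) * Ctx**alpha."""
    # Zero-propagation: a dead base factor or unbounded time yields C = 0.
    if min(W, R, H) <= 0 or T <= 0 or math.isinf(T):
        return 0.0
    wL, wI, wA, wG = weights
    ctx = wL * L + wI * I + wA * A + wG * G
    return (W * R * H / T) * ctx ** alpha
```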

## Security & Configuration Tips

- Never hard-code credentials. Use environment variables (`ARANGO_RO_SOCKET`, `ARANGO_RW_SOCKET`, `ARANGO_HTTP_BASE_URL`) and keep `.env` out of version control.
- Run `setup/verify_storage.py` after deployments to confirm collection/index health before ingest workflows.

## CRITICAL: Late Chunking Principle

**MANDATORY**: All text chunking MUST use late chunking. Never use naive chunking.

### Why Late Chunking is Required

From the Conveyance Framework: **C = (W·R·H/T)·Ctx^α**

- **Naive chunking** breaks context awareness → Ctx approaches 0 → **C = 0** (zero-propagation)
- **Late chunking** preserves full document context → Ctx remains high → **C is maximized** (a minimal sketch follows)
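
A minimal late-chunking sketch, assuming a generic Hugging Face encoder; the model name and mean-pooling are illustrative stand-ins for the JinaV4Embedder internals (which use jina_embeddings_v4 and a 32k window):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative encoder; assumes the input fits its position limit.
tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def late_chunk_embeddings(text: str, chunk_size: int = 128) -> torch.Tensor:
    # 1) Encode the FULL document once so every token attends to the
    #    whole context: the "late" in late chunking.
    enc = tok(text, return_tensors="pt", truncation=False)
    with torch.no_grad():
        token_states = model(**enc).last_hidden_state[0]  # (seq_len, dim)

    # 2) Only afterwards split into spans and mean-pool each one, so
    #    chunk vectors inherit document-level context.
    spans = token_states.split(chunk_size, dim=0)
    return torch.stack([span.mean(dim=0) for span in spans])
```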
69 changes: 36 additions & 33 deletions CLAUDE.md
@@ -2,6 +2,9 @@

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

***THIS PROJECT USES jina_embeddings_v4***
***ALL CONFIGURATION SCRIPTS AND YAML FILES GO IN THE core/config/ DIRECTORY***

## HADES — Conveyance Framework (System Prompt)

**Mission:**
@@ -100,8 +103,8 @@ For each run/condition:

From the Conveyance Framework: **C = (W·R·H/T)·Ctx^α**

- **Naive chunking** breaks context awareness → Ctx approaches 0 → **C = 0** (zero-propagation)
- **Late chunking** preserves full document context → Ctx remains high → **C is maximized**
* **Naive chunking** breaks context awareness → Ctx approaches 0 → **C = 0** (zero-propagation)
* **Late chunking** preserves full document context → Ctx remains high → **C is maximized**

### Implementation Requirements

@@ -121,9 +124,9 @@ chunks = create_chunks_with_context(full_encoding, chunk_size=512) # Context-aw

### Embedder Selection

- **High throughput (48+ papers/sec)**: Use `SentenceTransformersEmbedder`
- **Sophisticated processing**: Use `JinaV4Embedder` (transformers)
- **Both MUST use late chunking**: This is non-negotiable
* **High throughput (48+ papers/sec)**: Use `SentenceTransformersEmbedder`
* **Sophisticated processing**: Use `JinaV4Embedder` (transformers)
* **Both MUST use late chunking**: This is non-negotiable

## 🚨 CRITICAL: Development Cycle

@@ -251,7 +254,7 @@ python workflow_pdf_batch.py \
--num-workers 32

# Run ArXiv metadata workflow
python workflow_arxiv_metadata.py \
python workflow_arxiv_initial_ingest.py \
--config ../config/workflows/arxiv_metadata_default.yaml
```

@@ -289,30 +292,30 @@ tail -f core/logs/*.log
The HADES system implements a parallel processing architecture optimized for the Conveyance Framework equation **C = (W·R·H/T)·Ctx^α**:

1. **Workflow Layer** (`core/workflows/`)
- `workflow_base.py`: Abstract base class for all workflows
- `workflow_arxiv_parallel.py`: Production multi-GPU parallel processing
- `workflow_pdf_batch.py`: Direct PDF processing without database dependencies
- `workflow_arxiv_memory.py`: Memory-optimized processing for large documents
* `workflow_base.py`: Abstract base class for all workflows
* `workflow_arxiv_parallel.py`: Production multi-GPU parallel processing
* `workflow_pdf_batch.py`: Direct PDF processing without database dependencies
* `workflow_arxiv_memory.py`: Memory-optimized processing for large documents

2. **Embedder Layer** (`core/embedders/`)
- `JinaV4Embedder`: 2048-dimensional embeddings with 32k context window
- `SentenceTransformersEmbedder`: High-throughput embeddings
- All embedders implement late chunking (mandatory)
* `JinaV4Embedder`: 2048-dimensional embeddings with 32k context window
* `SentenceTransformersEmbedder`: High-throughput embeddings
* All embedders implement late chunking (mandatory)

3. **Storage Layer** (`core/database/`)
- **ArangoDB**: Graph database for processed documents and embeddings
- **PostgreSQL**: Complete ArXiv metadata (2.7M+ papers)
- **LMDB**: High-performance key-value storage for caching
* **ArangoDB**: Graph database for processed documents and embeddings
* **PostgreSQL**: Complete ArXiv metadata (2.7M+ papers)
* **LMDB**: High-performance key-value storage for caching

4. **Monitoring Layer** (`core/monitoring/`)
- Real-time progress tracking
- Performance metrics collection
- GPU utilization monitoring
- Memory usage tracking
* Real-time progress tracking
* Performance metrics collection
* GPU utilization monitoring
* Memory usage tracking

### Module Organization

```
```dir
HADES-Lab/
├── core/ # Core infrastructure (reusable)
│ ├── workflows/ # Processing workflows
@@ -334,20 +337,20 @@ HADES-Lab/

### Key Technical Features

- **Parallel GPU Processing**: Multi-worker architecture with GPU isolation
- **Late Chunking**: Preserves context across chunk boundaries (mandatory)
- **Atomic Transactions**: All-or-nothing database operations
- **Memory Optimization**: Streaming processing for large documents
- **Error Recovery**: Checkpoint-based resumption
- **Phase Separation**: Extraction → Embedding pipeline
* **Parallel GPU Processing**: Multi-worker architecture with GPU isolation
* **Late Chunking**: Preserves context across chunk boundaries (mandatory)
* **Atomic Transactions**: All-or-nothing database operations
* **Memory Optimization**: Streaming processing for large documents
* **Error Recovery**: Checkpoint-based resumption
* **Phase Separation**: Extraction → Embedding pipeline

### Performance Characteristics

- **Throughput**: 40+ papers/second with parallel processing
- **GPU Memory**: 7-8GB per worker with fp16
- **Batch Sizes**: 1000 records (I/O), 128 embeddings (GPU)
- **Context Window**: 32k tokens (Jina v4)
- **Embedding Dimensions**: 2048 (Jina v4), 768 (Sentence Transformers)
* **Throughput**: 40+ papers/second with parallel processing
* **GPU Memory**: 7-8GB per worker with fp16
* **Batch Sizes**: 1000 records (I/O), 128 embeddings (GPU)
* **Context Window**: 32k tokens (Jina v4)
* **Embedding Dimensions**: 2048 (Jina v4), 768 (Sentence Transformers)

## Acheron Protocol - Code Preservation

@@ -358,4 +361,4 @@
mv old_file.py Acheron/module_name/old_file_2025-01-20_14-30-25.py
```

This preserves the archaeological record of development decisions.
82 changes: 82 additions & 0 deletions README.md
@@ -0,0 +1,82 @@
# ArangoDB Optimized HTTP/2 Client

## Benchmarks

| Operation | PHP Subprocess | HTTP/2 (direct) | HTTP/2 via proxies |
|---------------------------------------------|----------------|-----------------|--------------------|
| GET single doc (hot cache, p50) | ~100 ms | ~0.4 ms | ~0.6 ms |
| GET single doc (hot cache, p95 target) | n/a | 1.0 ms | 1.0 ms |
| Insert 1000 docs (waitForSync=false, p50) | ~400–500 ms | ~6 ms | ~7 ms |
| Query (LIMIT 1000, batch size 1000, p50) | ~200 ms | ~0.7 ms | ~0.8 ms |

## Usage

### Client
```python
from core.database.arango.optimized_client import ArangoHttp2Client, ArangoHttp2Config

config = ArangoHttp2Config(
    database="arxiv_repository",
    socket_path="/run/hades/readonly/arangod.sock",
    username="arxiv_reader",
    password="...",
)
with ArangoHttp2Client(config) as client:
    doc = client.get_document("arxiv_metadata", "0704_0003")
    print(doc)
```

### Workflow Integration
```python
from core.database.database_factory import DatabaseFactory

memory_client = DatabaseFactory.get_arango_memory_service()
try:
    documents = memory_client.execute_query(
        "FOR doc IN @@collection LIMIT 5 RETURN doc",
        {"@collection": "arxiv_metadata"},
    )
finally:
    memory_client.close()
```

### Proxy Binaries
1. Build: `cd core/database/arango/proxies && go build ./...`
2. Run RO proxy: `go run ./cmd/roproxy`
3. Run RW proxy: `go run ./cmd/rwproxy`

Sockets default to `/run/hades/readonly/arangod.sock` and `/run/hades/readwrite/arangod.sock` (systemd-managed). Ensure permissions (0640/0600) and adjust via env vars `LISTEN_SOCKET`, `UPSTREAM_SOCKET`.

### Benchmark CLI (Phase 4)

`tests/benchmarks/arango_connection_test.py` now supports:

- TTFB and E2E timing (full body consumption).
- Cache-busting via multiple `--key` values or varying bind variables.
- Adjustable payload size (`--doc-bytes`), `waitForSync`, and concurrency (`--concurrency`).
- JSON report emission (`--report-json`) for regression tracking.

Example:

```bash
poetry run python tests/benchmarks/arango_connection_test.py \
--socket /run/hades/readonly/arangod.sock \
--database arxiv_repository \
--collection arxiv_metadata \
--key 0704_0001 --key 0704_0002 \
--iterations 20 --concurrency 4 \
--report-json reports/get_hot.json
```

### Testing Infrastructure

- The HTTP/2 memory client is now the default access path for automated tests.
- Run `poetry run pytest tests/core/database/test_memory_client_config.py` for a quick sanity check.
- Future regression suites should share the proxy-aware fixtures so workflows exercise the same transport stack.

### Production Hardening Notes

- Treat the RO (`/run/hades/readonly/arangod.sock`) and RW (`/run/hades/readwrite/arangod.sock`) proxies as the security boundary. Plan to ship them via systemd socket units with explicit `SocketUser`/`SocketGroup` assignments and 0640/0600 modes.
- Arango HTTP responses are enforced to negotiate HTTP/2; mismatches raise immediately.
- Reference benchmark summary: see `docs/benchmarks/arango_phase4_summary.md` for the latest latency table.
- Systemd templates for the proxies live in `docs/deploy/arango_proxy_systemd.md`.
5 changes: 5 additions & 0 deletions composer.json
@@ -0,0 +1,5 @@
{
    "require": {
        "triagens/arangodb": "^3.8"
    }
}