Merged

Changes from all commits · 21 commits
60ff21c
fix: Critical Jina embedder performance and config issues (Issue #45)
r3d91ll Sep 17, 2025
c28ea36
fix: Exclude test files from CodeRabbit review path filters
r3d91ll Sep 17, 2025
ba173e3
refactor: Remove redundant SentenceTransformersEmbedder (Issue #46)
r3d91ll Sep 17, 2025
9a17dca
fix: Critical workflow issues preventing data loss and crashes (Issue…
r3d91ll Sep 17, 2025
dd42c43
refactor: Update .gitignore to include additional environment and cre…
r3d91ll Sep 17, 2025
8645ccd
Merge remote-tracking branch 'origin/main' into feature/issue-46-remo…
r3d91ll Sep 17, 2025
7227c48
chore: Move tests and dev-utils to local-only, reorganize setup scripts
r3d91ll Sep 17, 2025
5a4dc62
Remove deprecated setup scripts and test files
r3d91ll Sep 17, 2025
6042fd6
Refactor type hints and improve code documentation across multiple mo…
r3d91ll Sep 18, 2025
f077583
Remove the ArXiv Size-Sorted Processing Workflow (Simplified) script,…
r3d91ll Sep 18, 2025
4c82dd6
feat: Add ArXiv initial ingest workflow and PHP ArangoDB bridge
r3d91ll Sep 18, 2025
afd4e64
refactor: Improve connection handling and enhance bulk insert safety …
r3d91ll Sep 18, 2025
a90a8c9
feat: update dependencies and add grpcio packages
r3d91ll Sep 20, 2025
f24a1ed
chore: remove obsolete benchmark report files and update .gitignore t…
r3d91ll Sep 20, 2025
3f516e5
Refactor storage backend imports and error handling
r3d91ll Sep 20, 2025
2284909
fix: correct spelling in comments and update test file exclusion note
r3d91ll Sep 20, 2025
ae03738
Refactor ArXiv ingestion workflow and database schema
r3d91ll Sep 20, 2025
ce4f964
Refactor ArangoDB collections and workflows for arXiv ingestion
r3d91ll Sep 20, 2025
9b401fc
chore: update .gitignore to exclude Go cache files and remove obsolet…
r3d91ll Sep 20, 2025
ac06da3
Merge pull request #53 from r3d91ll/pr52
r3d91ll Sep 20, 2025
c318fac
Implement Virtual Context Management via ArangoDB and vLLM Model Mana…
r3d91ll Sep 21, 2025
17 changes: 16 additions & 1 deletion .coderabbit.yaml
@@ -52,6 +52,7 @@ reviews:
- "!**/*.pyo"
- "!**/*.egg-info/**"
- "!**/Acheron/**" # Our deprecated code archive
- "!**/tests/**" # Exclude tests from review scope

  # Path-specific instructions - apply to all paths
  path_instructions:
@@ -99,6 +100,20 @@
        - Zero-propagation check
        - Evidence links

- path: "core/workflows/**/*.py"
instructions: |
Embedding strategy expectations:
- Prefer JinaV4Embedder with late chunking for high-fidelity workflows.
- When SentenceTransformersEmbedder is selected for throughput, document the trade-off explicitly.
- Confirm late chunking paths never discard tail tokens.

- path: "core/embedders/**/*.py"
instructions: |
Embedders must honor the standard EmbeddingConfig interface:
- Accept EmbeddingConfig or equivalent dict input.
- Ensure fallback embeddings match EMBEDDING_DIM.
- Preserve late chunking semantics (encode once, pool later).

Comment on lines +103 to +116

Contributor
⚠️ Potential issue

🧩 Analysis chain

Embedding path instructions LGTM; ensure code conforms.

Guidance matches learnings (late chunking, JinaV4 vs ST trade-offs). Verify core/workflows actually document trade-offs and never drop tail tokens.


🏁 Script executed:

#!/bin/bash
rg -n -g 'core/workflows/**/*.py' -C2 -P '(JinaV4Embedder|SentenceTransformersEmbedder|late[_\s-]?chunk)'

Length of output: 7257


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "1) Search workflows for trade-off keywords..."
rg -n -S -g 'core/workflows/**/*.py' -C3 -P '\b(throughput|fidelity|trade|trade-off|tradeoffs|trade-offs)\b' || true
echo
echo "2) Search workflows for token/tail/truncate/drop/discard..."
rg -n -S -g 'core/workflows/**/*.py' -C3 -P '\b(tail|token(s)?|truncate|truncat|discard|drop)\b' || true
echo
echo "3) Search embedders for embed_with_late_chunking and embedder classes/keywords..."
rg -n -S -g 'core/embedders/**/*.py' -C5 -P '(class\s+\w*Embedder|def\s+embed_with_late_chunking|embed_with_late_chunking|late[_\s-]?chunk|tokenizer|stride|truncate|truncat|tail|token(s)?|EMBEDDING_DIM|EmbeddingConfig)' || true
echo
echo "4) Files across core/ that reference embed_with_late_chunking..."
rg -n -S -g 'core/**/*.py' -C5 -P 'embed_with_late_chunking' || true
echo
echo "5) Print file heads and occurrences for files that implement embed_with_late_chunking..."
files=$(rg -l -S -g 'core/**/*.py' --hidden 'embed_with_late_chunking' || true)
if [ -n "$files" ]; then
  for f in $files; do
    echo
    echo "---- $f ----"
    rg -n -C5 -P 'embed_with_late_chunking' "$f" || true
    echo "----- file head (first 240 lines) -----"
    sed -n '1,240p' "$f" || true
  done
else
  echo "No files implementing embed_with_late_chunking found."
fi

Length of output: 75273


Embedding path instructions LGTM — critical fixes required

  • Workflows warn about SentenceTransformers deprecation but do not explicitly state the throughput vs. fidelity trade-off; add explicit trade-off messaging where embedder_type is handled (core/workflows/workflow_pdf.py:234-239; core/workflows/workflow_pdf_batch.py:444-448; core/workflows/workflow_arxiv_initial_ingest.py:91-95).
  • Late-chunking can discard tail tokens: core/embedders/embedders_jina.py::encode_full_document truncates to MAX_TOKENS (32,768) via tokenizer truncation, and workflows call embed_with_late_chunking directly (core/workflows/workflow_pdf.py:449-451; core/workflows/workflow_pdf_batch.py:506-507). Require change: ensure embed_with_late_chunking automatically falls back to process_long_document for texts > MAX_TOKENS, or update workflows to detect long documents and call process_long_document.
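
A minimal sketch of the requested fallback, assuming the existing `embed_with_late_chunking`, `process_long_document`, `encode_full_document`, and `MAX_TOKENS` names from `embedders_jina.py`; the tokenizer call and `_pool_chunks` helper are illustrative assumptions, not the repository's actual internals:

```python
def embed_with_late_chunking(self, text: str):
    # Count tokens without truncation so the true length is visible.
    token_count = len(self.tokenizer.encode(text, add_special_tokens=False))

    if token_count > self.MAX_TOKENS:  # 32,768 for Jina v4
        # Long input: fall back to the sliding-window path so tail
        # tokens are embedded instead of silently truncated.
        return self.process_long_document(text)

    # Short input: encode the whole document once, then derive chunks,
    # so chunk vectors keep document-level context (late chunking).
    full_encoding = self.encode_full_document(text)
    return self._pool_chunks(full_encoding)  # _pool_chunks is hypothetical
```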
🤖 Prompt for AI Agents
In .coderabbit.yaml around lines 103 to 116, update the embedder/workflow
instructions so they mandate two fixes: (1) In core/workflows/workflow_pdf.py
(lines ~234-239), core/workflows/workflow_pdf_batch.py (lines ~444-448), and
core/workflows/workflow_arxiv_initial_ingest.py (lines ~91-95) add explicit
messaging when embedder_type == SentenceTransformersEmbedder that documents the
throughput vs. fidelity trade-off (high throughput but lower embedding fidelity
and potential deprecation) where the embedder selection is handled; (2) In
core/embedders/embedders_jina.py ensure encode_full_document and
embed_with_late_chunking do not lose tail tokens by implementing an automatic
fallback: if tokenized length > MAX_TOKENS (32768) have embed_with_late_chunking
call the existing process_long_document flow (or perform streaming/chunking that
preserves tails) instead of relying on tokenizer truncation, or alternatively
update the workflows (core/workflows/workflow_pdf.py lines ~449-451 and
core/workflows/workflow_pdf_batch.py lines ~506-507) to detect long documents
and call process_long_document before invoking embed_with_late_chunking.

  # Pre-merge checks for PR descriptions
  pre_merge_checks:
    custom_checks:
@@ -113,4 +128,4 @@ reviews:

# Chat settings
chat:
  auto_reply: true
15 changes: 11 additions & 4 deletions core/.env.example → .env.example
@@ -4,12 +4,19 @@
# =========================
# Database Configuration
# =========================
# Using ARANGO_ prefix for compatibility with existing scripts
ARANGO_HOST=192.168.1.69
ARANGO_PORT=8529
# ArangoDB Authentication
ARANGO_USERNAME=root
ARANGO_PASSWORD=your_password_here
ARANGO_DATABASE=academy_store
ARANGO_DATABASE=arxiv_repository

# Socket Configuration (HADES uses Unix sockets only)
ARANGO_USE_PROXIES=true
ARANGO_RO_SOCKET=/run/hades/readonly/arangod.sock
ARANGO_RW_SOCKET=/run/hades/readwrite/arangod.sock

# Direct socket for development/debugging only
# ARANGO_USE_PROXIES=false
# ARANGO_SOCKET=/run/arangodb3/arangodb.sock

# Alternative: Use HADES_ prefix (takes priority if both are set)
# HADES_DB_HOST=192.168.1.69
28 changes: 27 additions & 1 deletion .gitignore
@@ -3,6 +3,9 @@ __pycache__/
*.py[cod]
*$py.class

# PHP dependencies (Composer)
vendor/

# C extensions
*.so

@@ -207,9 +210,25 @@ credentials/
config.ini
config.yaml
config.yml
*_config.json
secrets*.json
*.env
*.env.local
*.env.production
!config.example.*
!config.sample.*
!config.template.*
!*.example.env

# Database passwords and API keys
*password*.txt
*credential*.json
*secret*.yml
arxiv_repository.env
arango_password.txt
!*example*
!*template*
!*sample*

# Data files (customize as needed)
/data/
@@ -388,4 +407,11 @@ arxiv_pipeline_v2_results_*.json
tools/arxiv/REORGANIZATION_NOTICE.md
tools/arxiv/SCRIPT_REORGANIZATION_ANALYSIS.md
tools/arxiv/utils/SCRIPT_ANALYSIS.md
scratch/*
dev-utils/*
core/database/arango/proxies/.gocache/
*.gocache/
**/.gocache/
core/database/arango/proxies/roproxy
core/database/arango/proxies/rwproxy
benchmarks/reports/*
60 changes: 60 additions & 0 deletions AGENTS.md
@@ -0,0 +1,60 @@
# Repository Guidelines

## Project Structure & Module Organization

- `core/` holds Python source, including database clients (`core/database`), embedders (`core/embedders`), and workflows (`core/workflows`).
- `core/database/arango/proxies/` contains the Go RO/RW proxy sources (`cmd/roproxy`, `cmd/rwproxy`).
- `docs/` captures product requirements, benchmarks, and deployment runbooks; `setup/` contains automation scripts for local/bootstrap installs.
- Tests live under `tests/` and align with the package structure; use matching module paths when adding new suites.

## Build, Test, and Development Commands

- `poetry install` – create/update the virtualenv with project dependencies.
- `poetry run python -m compileall core` – quick syntax/bytecode sweep; run after touching Python modules.
- `poetry run pytest` – execute the Python test suite; add `-k` to target specific modules.
- `go build ./core/database/arango/proxies/...` – verify the RO/RW proxy binaries build cleanly.
- `poetry run ruff check` / `poetry run ruff format` – lint and auto-format Python code before committing.
- `poetry run python tests/benchmarks/conveyance_logger.py …` – translate benchmark JSON into Conveyance log entries (see docs/benchmarks/arango_phase4_summary.md for examples).

## Coding Style & Naming Conventions

- Python: 4-space indentation, type hints required for new public APIs, and docstrings for modules/classes/functions exposed outside a file.
- Follow PEP 8 naming (snake_case for functions/vars, PascalCase for classes). Keep module names lowercase.
- Go: leverage `gofmt` and conventional CamelCase identifiers; ensure proxy allowlists remain alphabetical.
- Avoid committing secrets; configuration belongs in `.env` or systemd drop-ins (e.g., `ARANGO_PASSWORD`, `ARANGO_RO_SOCKET`).

## Testing Guidelines

- Prefer `pytest`-style tests named `test_<feature>.py` with functions `test_<behavior>`.
- Mock external systems (ArangoDB, gRPC) and exercise both RO and RW paths of the memory client; a minimal sketch follows this list.
- Aim for ≥80% docstring/type coverage (mirrors CI expectations) and capture evidence for Conveyance calculations when adding benchmarks.
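
A minimal sketch of that shape, assuming the `DatabaseFactory.get_arango_memory_service()` and `execute_query()` interface shown in the README; the patched target and fake client are illustrative:

```python
# Illustrative sketch: the monkeypatched internals are assumptions.
from unittest.mock import MagicMock

import pytest

from core.database.database_factory import DatabaseFactory


@pytest.fixture
def memory_client(monkeypatch):
    fake = MagicMock()
    fake.execute_query.return_value = [{"_key": "0704_0001"}]
    # Patch the factory so no real ArangoDB/gRPC connection is made.
    monkeypatch.setattr(
        DatabaseFactory, "get_arango_memory_service", lambda *a, **kw: fake
    )
    return DatabaseFactory.get_arango_memory_service()


def test_read_only_query(memory_client):
    docs = memory_client.execute_query(
        "FOR doc IN @@collection LIMIT 1 RETURN doc",
        {"@collection": "arxiv_metadata"},
    )
    assert docs[0]["_key"] == "0704_0001"
    # A sibling test should exercise the RW path (inserts/updates) the same way.
```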

## Commit & Pull Request Guidelines

- Commit messages follow the repository’s style: imperative mood with a concise summary (e.g., “Add HTTP/2 proxy rollback guard”), optionally referencing issues (`#51`).
- Each PR should include: linked issue, Conveyance Summary, W/R/H/T mapping, Performance Evidence, and Tests & Compatibility sections.
- Attach benchmark artefacts under `benchmarks/reports/` when changing performance-sensitive code, and note any required manual follow-up steps (e.g., rebuilding systemd units).

## Conveyance Framework Expectations

- Frame design and review decisions using the efficiency view: `C = (W·R·H / T) · Ctx^α`, where `Ctx = wL·L + wI·I + wA·A + wG·G` and α is typically 1.7.
- Always report which factors improved (W, R, H, T, or Ctx) and include a zero-propagation check (if any base factor is 0 or T → ∞, declare C = 0); a sketch of this check follows the list.
- Benchmark notes must map measurements to W/R/H/T and cite context scores (L/I/A/G) so reviewers can recompute Conveyance.
- Late chunking is mandatory: encode documents once, then derive context-aware chunks. Choose embedders per workload (SentenceTransformers for high throughput, JinaV4 for fidelity) and document the trade-off in PRs.
- Further details on the Conveyance Framework can be found in `docs/CONVEYANCE_FRAMEWORK.md`.
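
A minimal sketch of the efficiency view with the zero-propagation check made explicit; the function and weight defaults are ours for illustration, not a repository API:

```python
import math

def conveyance(W, R, H, T, L, I, A, G,
               weights=(0.25, 0.25, 0.25, 0.25), alpha=1.7):
    """Efficiency view: C = (W*R*H / T) * Ctx**alpha."""
    # Zero-propagation: a dead base factor or unbounded time yields C = 0.
    if min(W, R, H) <= 0 or T <= 0 or math.isinf(T):
        return 0.0
    wL, wI, wA, wG = weights
    ctx = wL * L + wI * I + wA * A + wG * G
    return (W * R * H / T) * ctx ** alpha
```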

## Security & Configuration Tips

- Never hard-code credentials. Use environment variables (`ARANGO_RO_SOCKET`, `ARANGO_RW_SOCKET`, `ARANGO_HTTP_BASE_URL`) and keep `.env` out of version control.
- Run `setup/verify_storage.py` after deployments to confirm collection/index health before ingest workflows.

## CRITICAL: Late Chunking Principle

**MANDATORY**: All text chunking MUST use late chunking. Never use naive chunking.

### Why Late Chunking is Required

From the Conveyance Framework: **C = (W·R·H/T)·Ctx^α**

- **Naive chunking** breaks context awareness → Ctx approaches 0 → **C = 0** (zero-propagation)
- **Late chunking** preserves full document context → Ctx remains high → **C is maximized** (a minimal sketch follows)
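
A minimal late-chunking sketch, assuming a generic Hugging Face encoder; the model name and mean-pooling are illustrative stand-ins for the JinaV4Embedder internals (which use jina_embeddings_v4 and a 32k window):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative encoder; assumes the input fits its position limit.
tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def late_chunk_embeddings(text: str, chunk_size: int = 128) -> torch.Tensor:
    # 1) Encode the FULL document once so every token attends to the
    #    whole context: the "late" in late chunking.
    enc = tok(text, return_tensors="pt", truncation=False)
    with torch.no_grad():
        token_states = model(**enc).last_hidden_state[0]  # (seq_len, dim)

    # 2) Only afterwards split into spans and mean-pool each one, so
    #    chunk vectors inherit document-level context.
    spans = token_states.split(chunk_size, dim=0)
    return torch.stack([span.mean(dim=0) for span in spans])
```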
69 changes: 36 additions & 33 deletions CLAUDE.md
@@ -2,6 +2,9 @@

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

***THIS PROJECT USES jina_embeddings_v4***
***ALL CONFIGURATION SCRIPTS AND YAML FILES GO IN THE core/config/ DIRECTORY***

## HADES — Conveyance Framework (System Prompt)

**Mission:**
@@ -100,8 +103,8 @@ For each run/condition:

From the Conveyance Framework: **C = (W·R·H/T)·Ctx^α**

- **Naive chunking** breaks context awareness → Ctx approaches 0 → **C = 0** (zero-propagation)
- **Late chunking** preserves full document context → Ctx remains high → **C is maximized**
* **Naive chunking** breaks context awareness → Ctx approaches 0 → **C = 0** (zero-propagation)
* **Late chunking** preserves full document context → Ctx remains high → **C is maximized**

### Implementation Requirements

@@ -121,9 +124,9 @@ chunks = create_chunks_with_context(full_encoding, chunk_size=512) # Context-aw

### Embedder Selection

- **High throughput (48+ papers/sec)**: Use `SentenceTransformersEmbedder`
- **Sophisticated processing**: Use `JinaV4Embedder` (transformers)
- **Both MUST use late chunking**: This is non-negotiable
* **High throughput (48+ papers/sec)**: Use `SentenceTransformersEmbedder`
* **Sophisticated processing**: Use `JinaV4Embedder` (transformers)
* **Both MUST use late chunking**: This is non-negotiable

## 🚨 CRITICAL: Development Cycle

@@ -251,7 +254,7 @@ python workflow_pdf_batch.py \
--num-workers 32

# Run ArXiv metadata workflow
python workflow_arxiv_metadata.py \
python workflow_arxiv_initial_ingest.py \
--config ../config/workflows/arxiv_metadata_default.yaml
```

@@ -289,30 +292,30 @@ tail -f core/logs/*.log
The HADES system implements a parallel processing architecture optimized for the Conveyance Framework equation **C = (W·R·H/T)·Ctx^α**:

1. **Workflow Layer** (`core/workflows/`)
- `workflow_base.py`: Abstract base class for all workflows
- `workflow_arxiv_parallel.py`: Production multi-GPU parallel processing
- `workflow_pdf_batch.py`: Direct PDF processing without database dependencies
- `workflow_arxiv_memory.py`: Memory-optimized processing for large documents
* `workflow_base.py`: Abstract base class for all workflows
* `workflow_arxiv_parallel.py`: Production multi-GPU parallel processing
* `workflow_pdf_batch.py`: Direct PDF processing without database dependencies
* `workflow_arxiv_memory.py`: Memory-optimized processing for large documents

2. **Embedder Layer** (`core/embedders/`)
- `JinaV4Embedder`: 2048-dimensional embeddings with 32k context window
- `SentenceTransformersEmbedder`: High-throughput embeddings
- All embedders implement late chunking (mandatory)
* `JinaV4Embedder`: 2048-dimensional embeddings with 32k context window
* `SentenceTransformersEmbedder`: High-throughput embeddings
* All embedders implement late chunking (mandatory)

3. **Storage Layer** (`core/database/`)
- **ArangoDB**: Graph database for processed documents and embeddings
- **PostgreSQL**: Complete ArXiv metadata (2.7M+ papers)
- **LMDB**: High-performance key-value storage for caching
* **ArangoDB**: Graph database for processed documents and embeddings
* **PostgreSQL**: Complete ArXiv metadata (2.7M+ papers)
* **LMDB**: High-performance key-value storage for caching

4. **Monitoring Layer** (`core/monitoring/`)
- Real-time progress tracking
- Performance metrics collection
- GPU utilization monitoring
- Memory usage tracking
* Real-time progress tracking
* Performance metrics collection
* GPU utilization monitoring
* Memory usage tracking

### Module Organization

```
```dir
HADES-Lab/
├── core/ # Core infrastructure (reusable)
│ ├── workflows/ # Processing workflows
@@ -334,20 +337,20 @@ HADES-Lab/

### Key Technical Features

- **Parallel GPU Processing**: Multi-worker architecture with GPU isolation
- **Late Chunking**: Preserves context across chunk boundaries (mandatory)
- **Atomic Transactions**: All-or-nothing database operations
- **Memory Optimization**: Streaming processing for large documents
- **Error Recovery**: Checkpoint-based resumption
- **Phase Separation**: Extraction → Embedding pipeline
* **Parallel GPU Processing**: Multi-worker architecture with GPU isolation
* **Late Chunking**: Preserves context across chunk boundaries (mandatory)
* **Atomic Transactions**: All-or-nothing database operations
* **Memory Optimization**: Streaming processing for large documents
* **Error Recovery**: Checkpoint-based resumption
* **Phase Separation**: Extraction → Embedding pipeline

### Performance Characteristics

- **Throughput**: 40+ papers/second with parallel processing
- **GPU Memory**: 7-8GB per worker with fp16
- **Batch Sizes**: 1000 records (I/O), 128 embeddings (GPU)
- **Context Window**: 32k tokens (Jina v4)
- **Embedding Dimensions**: 2048 (Jina v4), 768 (Sentence Transformers)
* **Throughput**: 40+ papers/second with parallel processing
* **GPU Memory**: 7-8GB per worker with fp16
* **Batch Sizes**: 1000 records (I/O), 128 embeddings (GPU)
* **Context Window**: 32k tokens (Jina v4)
* **Embedding Dimensions**: 2048 (Jina v4), 768 (Sentence Transformers)

## Acheron Protocol - Code Preservation

@@ -358,4 +361,4 @@
mv old_file.py Acheron/module_name/old_file_2025-01-20_14-30-25.py
```

This preserves the archaeological record of development decisions.
82 changes: 82 additions & 0 deletions README.md
@@ -0,0 +1,82 @@
# ArangoDB Optimized HTTP/2 Client

## Benchmarks

| Operation | PHP Subprocess | HTTP/2 (direct) | HTTP/2 via proxies |
|---------------------------------------------|----------------|-----------------|--------------------|
| GET single doc (hot cache, p50) | ~100 ms | ~0.4 ms | ~0.6 ms |
| GET single doc (hot cache, p95 target) | n/a | 1.0 ms | 1.0 ms |
| Insert 1000 docs (waitForSync=false, p50) | ~400–500 ms | ~6 ms | ~7 ms |
| Query (LIMIT 1000, batch size 1000, p50) | ~200 ms | ~0.7 ms | ~0.8 ms |

## Usage

### Client
```python
from core.database.arango.optimized_client import ArangoHttp2Client, ArangoHttp2Config

config = ArangoHttp2Config(
    database="arxiv_repository",
    socket_path="/run/hades/readonly/arangod.sock",
    username="arxiv_reader",
    password="...",
)
with ArangoHttp2Client(config) as client:
    doc = client.get_document("arxiv_metadata", "0704_0003")
    print(doc)
```

### Workflow Integration
```python
from core.database.database_factory import DatabaseFactory

memory_client = DatabaseFactory.get_arango_memory_service()
try:
    documents = memory_client.execute_query(
        "FOR doc IN @@collection LIMIT 5 RETURN doc",
        {"@collection": "arxiv_metadata"},
    )
finally:
    memory_client.close()
```

### Proxy Binaries
1. Build: `cd core/database/arango/proxies && go build ./...`
2. Run RO proxy: `go run ./cmd/roproxy`
3. Run RW proxy: `go run ./cmd/rwproxy`

Sockets default to `/run/hades/readonly/arangod.sock` and `/run/hades/readwrite/arangod.sock` (systemd-managed). Ensure permissions (0640/0600) and adjust via env vars `LISTEN_SOCKET`, `UPSTREAM_SOCKET`.

### Benchmark CLI (Phase 4)

`tests/benchmarks/arango_connection_test.py` now supports:

- TTFB and E2E timing (full body consumption).
- Cache-busting via multiple `--key` values or varying bind variables.
- Adjustable payload size (`--doc-bytes`), `waitForSync`, and concurrency (`--concurrency`).
- JSON report emission (`--report-json`) for regression tracking.

Example:

```bash
poetry run python tests/benchmarks/arango_connection_test.py \
--socket /run/hades/readonly/arangod.sock \
--database arxiv_repository \
--collection arxiv_metadata \
--key 0704_0001 --key 0704_0002 \
--iterations 20 --concurrency 4 \
--report-json reports/get_hot.json
```

### Testing Infrastructure

- The HTTP/2 memory client is now the default access path for automated tests.
- Run `poetry run pytest tests/core/database/test_memory_client_config.py` for a quick sanity check.
- Future regression suites should share the proxy-aware fixtures so workflows exercise the same transport stack.

### Production Hardening Notes

- Treat the RO (`/run/hades/readonly/arangod.sock`) and RW (`/run/hades/readwrite/arangod.sock`) proxies as the security boundary. Plan to ship them via systemd socket units with explicit `SocketUser`/`SocketGroup` assignments and 0640/0600 modes.
- Arango HTTP responses are enforced to negotiate HTTP/2; mismatches raise immediately.
- Reference benchmark summary: see `docs/benchmarks/arango_phase4_summary.md` for the latest latency table.
- Systemd templates for the proxies live in `docs/deploy/arango_proxy_systemd.md`.
5 changes: 5 additions & 0 deletions composer.json
@@ -0,0 +1,5 @@
{
    "require": {
        "triagens/arangodb": "^3.8"
    }
}