
Conversation


@r3d91ll r3d91ll commented Sep 20, 2025

Title: Integrate HTTP/2 Arango client and proxies (Phase 3–4)

Summary

  • add ArangoMemoryClient + optimized_client (HTTP/2 over UDS) and migrate ingestion/workflows to use it
  • create buildable RO/RW Go proxies with socket allowlists (now managed by systemd under /run/hades/…) and update docs
  • capture Phase 4 bench results (hot/cold) + workflow fixes (embed_batch_with_late_chunking, cursor PUT/DELETE allowlist, monitor defaults)

Tests

  • poetry run python -m compileall core
  • ingest workflow on host (poetry run python core/workflows/workflow_arxiv_initial_ingest.py …)
  • proxy binaries rebuilt + systemd restart
  • Phase 4 benchmarks recorded in docs/benchmarks/arango_phase4_summary.md

Summary by CodeRabbit

  • New Features

    • Optimized ArangoDB HTTP/2 client and memory-backed service, RO/RW Unix-socket proxies, PHP bridge, and ArXiv Initial Ingest workflow with GPU workers and batch storage.
  • Changes

    • Standardized on Jina embeddings v4 and config-object usage; updated ArangoDB env/templates and expanded .gitignore; added Composer & Go module artifacts; improved CLI/setup scripts.
  • Documentation

    • Added benchmarks, systemd proxy guides, PRDs, Conveyance Framework, workflow guides, and system requirements.
  • Removals

    • Removed SentenceTransformers embedder, legacy ArXiv workflows, and multiple dev monitoring/setup utilities.

r3d91ll and others added 13 commits September 17, 2025 13:28
- Fix O(N²) redundant encoding in _chunk_with_context_windows
  - Was encoding document N times for N chunks
  - Now redirects to efficient embed_with_late_chunking
  - Massive performance improvement for chunked documents

- Fix config extraction to extract ALL fields
  - chunk_size_tokens no longer hardcoded to 1000
  - chunk_overlap_tokens no longer hardcoded to 200
  - Now properly extracts all EmbeddingConfig fields

- Add CUDA availability check
  - Automatically falls back to CPU if CUDA not available
  - Prevents crashes on CPU-only systems

- Add truncation warning for documents > 32k tokens
  - Warns users when documents will be truncated
  - Suggests using process_long_document() for longer texts

- Add type annotations for mypy compliance

- Create embedder config YAML files
  - /core/config/embedders/jina_v4.yaml with full configuration
  - /core/config/embedders/default.yaml pointing to Jina v4

This resolves all critical issues identified in Issue #45 code review.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Archived SentenceTransformersEmbedder to Acheron/ with timestamp
  - Was hardcoded for Jina v4 despite being 'generic'
  - 600+ lines of redundant code duplicating JinaV4Embedder

- Updated embedder factory
  - Default now 'jina' instead of 'sentence'
  - Added comprehensive documentation for extending
  - Redirects 'sentence' requests to JinaV4Embedder with warning

- Updated all imports and references
  - test_benchmark_embedders.py now uses JinaV4Embedder
  - workflow_arxiv_metadata.py references updated

This completes the standardization on Jina v4 as the single embedder,
following the 'one model, one RAG' architecture while preserving
extensibility for future models.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…44)

🔴 CRITICAL FIXES (Data Loss/Crashes):

1. Storage Drain Bug (Line 364)
   - Storage thread now continues draining queue after stop_event
   - Prevents data loss on shutdown

2. GPU Assignment Bug (Line 256)
   - Uses actual GPU count instead of worker count
   - Prevents crashes when workers > GPUs

3. Worker Error Masking (Line 58)
   - Initialize batches_processed before try block
   - Prevents UnboundLocalError masking real failures

🟡 HIGH PRIORITY FIXES:

4. GPU Detection (Lines 237-249)
   - Detects available GPUs before starting workers
   - Exits with clear error if workers > 0 but no GPUs

5. Spawn Context (Lines 252, 259)
   - Uses spawn context consistently for CUDA safety
   - Prevents CUDA initialization corruption

6. Embedder Config (Lines 65-67, 294-296)
   - Worker now receives and uses CLI parameters
   - Respects --embedding-batch-size, chunk sizes

🟢 QUALITY IMPROVEMENTS:

7. Password Handling (Lines 175-180)
   - Warns if password env var not set
   - Clearer error messages

8. Metadata Error Handling (Lines 341-344, 360-363)
   - Checks and logs import_bulk errors
   - No more silent failures

9. Type Annotations
   - Added throughout for mypy compliance
   - Better code documentation

This completes all items in Issue #44, making the workflow production-ready.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…dential files; remove unused ArangoDB Unix socket client
- Added tests/ and dev-utils/ to .gitignore (local development only)
- Removed all test files from git tracking
- Removed all dev-utils scripts from git tracking
- Moved setup scripts to setup/ directory with proper path handling
- Fixed setup scripts to work from setup/ but install to project root

These directories contain local development and testing files that should not be tracked in the repository.
- Deleted setup_database.config.example and setup_database.sh as they are no longer needed.
- Removed setup_local.sh, setup_private_repo.sh, setup_python_env.sh, and related test files (test_chunking_fix.py, test_simple_chunking.py) to streamline the repository.
- Eliminated verify_environment.py and verify_storage.py scripts to focus on essential functionalities.
…dules

- Updated type hints to use Optional and other specific types for better clarity.
- Enhanced docstrings for methods in `FileSystemDocumentProvider` and `UniversalBibliographyExtractor`.
- Introduced helper methods in `StateManager` to ensure nested sections are dictionaries.
- Improved error handling and logging in `StateManager` and `CheckpointManager`.
- Adjusted imports in `__init__.py` for LocalStorage to use StorageManager.
- Refined type annotations in workflow classes for better type safety.
- Updated `DocumentTask` dataclass to use field defaults for metadata.
- Modified mypy configuration for a more permissive initial setup, allowing gradual improvements.
… including all associated functionality for processing ArXiv abstracts, GPU worker management, and database interactions.
- Implemented `check_resume_state.py` to monitor ArangoDB state and last processed records.
- Created `workflow_arxiv_initial_ingest.py` for processing ArXiv dataset with size-sorted strategy.
- Documented the workflow in `workflow_arxiv_initial_ingest.md`.
- Established PHP bridge for ArangoDB operations in `php_unix_bridge.php`.
- Updated setup scripts to include PHP and Composer checks.
- Added system requirements documentation for HADES-Lab.
- Added grpcio and grpcio-tools version 1.75.0 to the project dependencies.
- Introduced h2 version 4.3.0 and hpack version 4.1.0 as new dependencies.
- Updated the protobuf dependency to include it in both main and dev groups.
- Removed the tools package from the installable packages list.
- Added an exclusion for a specific directory in the ruff configuration.

coderabbitai bot commented Sep 20, 2025

Walkthrough

Replaces legacy ArangoDB Unix/TCP clients and transactional manager with a minimal HTTP/2 Unix-socket client, a memory-oriented Arango client, and RO/RW Go proxies; standardizes embedders on Jina v4 (removes SentenceTransformers); adds new initial ingest workflow and many docs; removes several legacy workflows and dev utilities; updates configs, types, and exports.
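
For orientation, a minimal sketch of how a caller might obtain the new memory client through the factory. Parameter names follow the get_arango_memory_service signature quoted later in this review; the socket paths, collection name, and query are illustrative, and execute_query's exact signature may differ.

# Illustrative sketch only - acquire the HTTP/2-over-UDS memory client via the factory.
from core.database.database_factory import DatabaseFactory

client = DatabaseFactory.get_arango_memory_service(
    database="arxiv_repository",
    read_socket="/run/hades/readonly/arangod.sock",    # RO proxy socket (systemd-managed)
    write_socket="/run/hades/readwrite/arangod.sock",  # RW proxy socket (systemd-managed)
    use_proxies=True,
)

# Reads go through the RO proxy, writes/imports through the RW proxy.
keys = client.execute_query("FOR doc IN arxiv_metadata LIMIT 5 RETURN doc._key")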

Changes

Cohort / File(s) Summary
Repo & meta
/.coderabbit.yaml, /.gitignore, pyproject.toml
Review/path filters updated; expanded ignore patterns; added h2/defusedxml/grpc deps and mypy/ruff excludes; composer.json added.
Environment & examples
/.env.example, core/config/arxiv_repository.env, config/.gitignore
New Arango env template, env var adjustments, added ARANGO_USERNAME/PASSWORD and proxy/socket vars; removed some credential ignores in config/.gitignore.
Config system
core/config/__init__.py, core/config/config_loader.py, core/config/README.md, core/config/embedders/*.yaml
Typing refinements (Optional/Tuple), get_config signature typing, docs updated, new embedder YAML defaults targeting Jina v4.
Arango HTTP/2 & memory client
core/database/arango/optimized_client.py, core/database/arango/memory_client.py, core/database/arango/__init__.py, core/database/database_factory.py, core/database/__init__.py
New ArangoHttp2Client/ArangoHttp2Config/ArangoHttpError and ArangoMemoryClient with config resolution, error mapping, bulk/query/transaction helpers; DatabaseFactory gains get_arango_memory_service; exports updated.
Legacy Arango removals
core/database/arango/arango_client.py, core/database/arango/arango_unix_client.py, core/database/arango_unix_client.py
Removed ArangoDBManager, retry_with_backoff, and prior Unix-socket Python client implementations.
Proxies & bridge
core/database/arango/proxies/*, core/database/arango/php_unix_bridge.php, core/database/arango/proxies/go.mod, core/database/arango/proxies/cmd/*
Adds Go RO/RW HTTP/2 Unix-socket reverse proxies and entrypoints, proxy_common utilities, PHP Unix-bridge script, and related go.mod.
Embedders (Jina v4)
core/embedders/embedders_jina.py, core/embedders/embedders_factory.py, core/embedders/__init__.py, core/embedders/README.md, core/config/embedders/*.yaml
Introduces EmbeddingConfig, JinaV4Embedder signature/typing updates, factory defaults to jina, late-chunking strategy updates, docs switched to jina-embeddings-v4.
Embedders removed
core/embedders/embedders_sentence.py
Deletes SentenceTransformers embedder and ChunkWithEmbedding dataclass.
Workflows added
core/workflows/workflow_arxiv_initial_ingest.py, core/workflows/workflow_arxiv_initial_ingest.md, core/workflows/README.md
New ArxivInitialIngestWorkflow using memory client + proxies, worker_process_with_gpu, CLI options and docs.
Workflows removed
core/tools/arxiv/arxiv_pipeline.py, core/workflows/workflow_arxiv_metadata.py, core/workflows/workflow_arxiv_parallel.py, core/workflows/workflow_arxiv_sorted.py, core/workflows/arxiv_ingestion.md
Removes multiple legacy ingestion/parallel/sorted pipelines and related docs and CLI scripts.
Workflows state & storage
core/workflows/state/state_manager.py, core/workflows/storage/__init__.py, core/workflows/workflow_base.py, core/workflows/workflow_pdf*.py
Adds nested-section helpers, typed checkpoint sets, LocalStorage alias rebinding, lazy getattr for backends, and minor annotations; workflow_pdf now re-exports document processor types.
Processors & chunking
core/processors/__init__.py, core/processors/text/chunking_strategies.py, core/processors/document_processor.py
New DocumentProcessor and dataclasses; exports updated; token/sliding window chunking added, FixedSizeChunking removed; richer metadata and boundary handling.
Extractors
core/extractors/*
Broadened input types to Union[str, Path], optional Tree-sitter wiring, small factory symbol rename (LaTeX), and added annotations.
Monitoring & metrics
core/monitoring/*
Typing updates, create_performance_monitor signature change, numeric guards in conveyance calculation, progress tracker signature Optional types, and internal restructuring.
Tools & managers
core/tools/arxiv/*, core/tools/rag_utils/academic_citation_toolkit.py
Switched managers to memory client, type refinements, collection renames to arxiv_metadata/arxiv_abstract_chunks/arxiv_abstract_embeddings, and ArXiv API client improvements.
Dev-utils: removed/added
dev-utils/*, core/database/utils/check_resume_state.py, setup/verify_storage.py
Removed many monitoring/setup utilities; added check_resume_state and verify_storage helpers.
Docs & PRDs
docs/*, setup/php_arango_setup.md, docs/benchmarks/*, docs/prd/*, AGENTS.md, CONVEYANCE_FRAMEWORK.md
New benchmark summaries, PRDs (HTTP/2, gRPC memory service, model manager), PHP bridge setup, conveyance framework, agents guide, and many doc additions/updates.
Setup scripts
setup/setup_local.sh, setup/setup_python_env.sh, setup/setup_database.sh
Improved project-root resolution, PHP/Composer checks, changed default socket paths, memory-client-based verification and adjusted invocations.
Logging
core/logging/__init__.py, core/logging/conveyance.py
New conveyance module with ConveyanceContext, load_metric/build_record/log_conveyance helpers; top-level logging exports.
Minor typing/formatting
Many modules (monitoring, extractors, embedders, workflows, config_loader, etc.)
Local type annotations, Optional[...] clarifications, improved typing for locals, and newline/format fixes across many files.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant WF as ArxivInitialIngestWorkflow
  participant DF as DatabaseFactory
  participant MC as ArangoMemoryClient
  participant RW as RW Proxy (Unix)
  participant RO as RO Proxy (Unix)
  participant ADB as ArangoDB

  WF->>DF: get_arango_memory_service(...)
  DF->>DF: resolve_memory_config(...)
  DF->>MC: instantiate ArangoMemoryClient(config)

  WF->>MC: create_collections(defs)
  MC->>RW: HTTP/2 write requests
  RW->>ADB: forward write
  ADB-->>RW: response
  RW-->>MC: response

  WF->>MC: bulk_insert(docs)
  MC->>RW: HTTP/2 import
  RW->>ADB: forward import
  ADB-->>RW: result
  RW-->>MC: result

  WF->>MC: execute_query(AQL)
  MC->>RO: HTTP/2 read cursor
  RO->>ADB: forward read
  ADB-->>RO: cursor batches
  RO-->>MC: aggregated results
sequenceDiagram
  autonumber
  participant Client as PythonClient
  participant RO as RO Proxy
  participant RW as RW Proxy
  participant ADB as ArangoDB

  rect rgb(245,250,255)
  note right of RO: RO enforces read-only (GET/HEAD/OPTIONS, safe cursor POST)
  end

  Client->>RO: GET /_api/document/...
  RO->>ADB: proxy (HTTP/2 over unix)
  ADB-->>RO: 200/JSON
  RO-->>Client: 200/JSON

  Client->>RW: POST /_api/import (NDJSON)
  RW->>ADB: proxy (HTTP/2 over unix)
  ADB-->>RW: 201/result
  RW-->>Client: 201/result
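
The same split seen from the client side, as a sketch built on the ArangoHttp2Config fields shown in the README snippet reviewed below; the module path comes from the walkthrough table, while the socket paths, usernames, and passwords are placeholders.

from core.database.arango.optimized_client import ArangoHttp2Config

ro_config = ArangoHttp2Config(
    database="arxiv_repository",
    socket_path="/run/hades/readonly/arangod.sock",   # RO proxy: GET/HEAD/OPTIONS + safe cursor POST
    username="arxiv_reader",
    password="...",
)
rw_config = ArangoHttp2Config(
    database="arxiv_repository",
    socket_path="/run/hades/readwrite/arangod.sock",  # RW proxy: imports, inserts, transactions
    username="arxiv_writer",
    password="...",
)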

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Possibly related issues

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.57%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
Pr Description Sections ⚠️ Warning The PR description states the integration scope, lists the tests that were run, and references benchmark artifacts (docs/benchmarks/arango_phase4_summary.md), which satisfies Performance Evidence. However, it lacks an explicit Conveyance Summary, a W/R/H/T Mapping per the Conveyance Framework, and a Compatibility statement (breaking changes, migration notes, supported environments). Amend the PR description to add: (1) Conveyance Summary: one paragraph stating the view (efficiency vs capability), the alpha used, and headline C/C_cap outcomes; (2) W/R/H/T Mapping: define what, where, who, and time for this change and how each was measured; (3) Performance Evidence: one or two sentences summarizing p95/p99 gains with a link to docs/benchmarks/arango_phase4_summary.md; (4) Tests & Compatibility: enumerate tested commands/platforms and call out compatibility/migration notes (removal of ArangoDBManager/Unix client, new ArangoMemoryClient/HTTP2 path, proxy/systemd requirements, default sockets), plus any breaking API or config changes and their upgrade steps. Once added, this check will pass.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly identifies the core change—adding an Arango HTTP/2 client—and directly matches the PR objective of integrating an HTTP/2 Arango client and proxies; it is concise and relevant to the changeset. It uses a branch-style prefix ("Feature/") and has minor capitalization/formatting inconsistencies, but those are stylistic and do not make the title misleading.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/arango-http2-client

Comment @coderabbitai help to get the list of available commands and usage tips.


r3d91ll commented Sep 20, 2025

@coderabbitai please review this PR


coderabbitai bot commented Sep 20, 2025

@r3d91ll I'll review your PR that integrates the HTTP/2 Arango client and proxies for Phase 3-4. This looks like a significant enhancement involving ArangoMemoryClient optimization, Go proxy implementation, and workflow migrations.

Let me examine all the changes in detail to provide you with comprehensive feedback.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 20

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (21)
core/processors/text/chunking_strategies.py (1)

503-523: Don’t drop trailing tokens; data is sacrosanct.

The remainder gate len(remaining_tokens) >= self.step_size can silently discard tail content. Always emit a final remainder chunk if any tokens remain.

-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
+                if remaining_tokens:  # Always preserve tail
                     chunk_text = ' '.join(remaining_tokens)
                     start_char = len(' '.join(tokens[:last_index]))

Optionally compute start_char via the same prefix_chars to keep complexity linear:

-                    start_char = len(' '.join(tokens[:last_index]))
+                    start_char = prefix_chars[last_index] if last_index < len(prefix_chars) else prefix_chars[-1]
core/tools/arxiv/arxiv_manager.py (3)

101-113: Version suffix not actually ignored in validation (uses arxiv_id instead of base_id).

IDs like 2308.12345v2 can fail validation.

-        if cls.ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.ARXIV_ID_PATTERN.match(base_id):
             return True, None
 ...
-        if cls.OLD_ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.OLD_ARXIV_ID_PATTERN.match(base_id):
             return True, None

345-377: Async method blocks the event loop by calling a sync processor.

Run CPU/IO-heavy process_document in a thread executor.

-        processing_result = self.processor.process_document(
-            pdf_path=paper_info.pdf_path,
-            latex_path=paper_info.latex_path,
-            document_id=arxiv_id
-        )
+        loop = asyncio.get_running_loop()
+        processing_result = await loop.run_in_executor(
+            None,
+            self.processor.process_document,
+            paper_info.pdf_path,
+            paper_info.latex_path,
+            arxiv_id
+        )

31-31: Migrate arxiv_manager to the new Arango HTTP/2 memory client

core/tools/arxiv/arxiv_manager.py currently imports ArangoDBManager from core.database.arango.arango_client (line 31). The repo now exposes ArangoMemoryClient and ArangoHttp2Client under core.database.arango and provides a factory at core/database/database_factory.py — replace the legacy import with a call to the factory or import ArangoMemoryClient (core.database.arango) and instantiate the configured HTTP/2 client.
core/tools/arxiv/arxiv_manager.py:31

setup/verify_storage.py (3)

29-44: Outdated collection names and invalid consistency check.

New pipeline stores: arxiv_papers, arxiv_embeddings, arxiv_structures. Comparing counts of papers vs embeddings is meaningless.

-    metadata_count = db.collection('arxiv_metadata').count()
-    chunks_count = db.collection('arxiv_abstract_chunks').count()
-    embeddings_count = db.collection('arxiv_abstract_embeddings').count()
+    papers_count = db.collection('arxiv_papers').count()
+    embeddings_count = db.collection('arxiv_embeddings').count()
+    structures_count = db.collection('arxiv_structures').count()
 ...
-    print(f"  arxiv_metadata:            {metadata_count:,}")
-    print(f"  arxiv_abstract_chunks:     {chunks_count:,}")
-    print(f"  arxiv_abstract_embeddings: {embeddings_count:,}")
+    print(f"  arxiv_papers:              {papers_count:,}")
+    print(f"  arxiv_embeddings:          {embeddings_count:,}")
+    print(f"  arxiv_structures:          {structures_count:,}")
 ...
-    if abs(metadata_count - embeddings_count) <= 10:
-        print(f"  ✅ Counts are aligned (diff: {abs(metadata_count - embeddings_count)})")
-    else:
-        print(f"  ⚠️  Counts differ by {abs(metadata_count - embeddings_count)}")
+    print("  ✅ Collections present (counts above). Embeddings >> papers is expected.")

49-76: Field names and shape checks don’t match new schema.

Use processing_timestamp; compute embedding dim via LENGTH; query embeddings by arxiv_id.

-        cursor = db.aql.execute('''
-            FOR doc IN arxiv_metadata
-                SORT doc.processed_at DESC
-                LIMIT 1
-                RETURN doc
-        ''')
+        cursor = db.aql.execute('''
+            FOR doc IN arxiv_papers
+                SORT doc.processing_timestamp DESC
+                LIMIT 1
+                RETURN doc
+        ''')
 ...
-            print(f"  Processed: {recent_meta.get('processed_at', 'N/A')}")
+            print(f"  Processed: {recent_meta.get('processing_timestamp', 'N/A')}")
 ...
-            cursor = db.aql.execute('''
-                FOR doc IN arxiv_abstract_embeddings
-                    FILTER doc.arxiv_id == @id
-                    LIMIT 1
-                    RETURN doc
-            ''', bind_vars={'id': arxiv_id})
-
-            embedding = next(cursor, None)
-            if embedding:
-                print(f"  ✅ Has embedding (dim: {embedding.get('embedding_dim', 'N/A')})")
+            cursor = db.aql.execute('''
+                FOR doc IN arxiv_embeddings
+                    FILTER doc.arxiv_id == @id
+                    LIMIT 1
+                    RETURN LENGTH(doc.embedding)
+            ''', bind_vars={'id': arxiv_id})
+            emb_dim = next(cursor, None)
+            if emb_dim:
+                print(f"  ✅ Has embedding (dim: {emb_dim})")
             else:
                 print(f"  ❌ No embedding found")
-
-            # Check for chunks
-            cursor = db.aql.execute('''
-                FOR doc IN arxiv_abstract_chunks
-                    FILTER doc.arxiv_id == @id
-                    RETURN doc
-            ''', bind_vars={'id': arxiv_id})
-
-            chunks = list(cursor)
-            print(f"  ✅ Has {len(chunks)} chunks")
+            # Count chunks (one per embedding)
+            cursor = db.aql.execute('''
+                RETURN LENGTH(
+                  FOR doc IN arxiv_embeddings
+                  FILTER doc.arxiv_id == @id
+                  RETURN 1
+                )
+            ''', bind_vars={'id': arxiv_id})
+            emb_count = next(cursor, 0)
+            print(f"  ✅ Has {emb_count} chunks/embeddings")

19-23: Replace Unix-socket Arango client with Arango memory/HTTP2 client

Replace the call in setup/verify_storage.py:
db = DatabaseFactory.get_arango(database='academy_store', username='root', use_unix=True)
with:
db = DatabaseFactory.get_arango_memory_service(database='academy_store')

Signature (core/database/database_factory.py):
def get_arango_memory_service(cls, *, database: str = "arxiv_repository", username: str = "root", password: Optional[str] = None, socket_path: Optional[str] = None, read_socket: Optional[str] = None, write_socket: Optional[str] = None, use_proxies: Optional[bool] = None, base_url: Optional[str] = None, connect_timeout: Optional[float] = None, read_timeout: Optional[float] = None, write_timeout: Optional[float] = None) -> ArangoMemoryClient

If you relied on use_unix=True, pass socket_path or read_socket/write_socket to the memory service.

core/workflows/README.md (1)

7-8: Correct the narrative to divide by Time (T), not multiply.

Text says “W×R×H×T” which conflicts with the mandated C = (W·R·H/T)·Ctx^α.

-Workflows orchestrate ... through all dimensions (W×R×H×T) while maintaining Context coherence.
+Workflows orchestrate ... across W, R, H while minimizing T (C = (W·R·H/T)·Ctx^α), maintaining Context coherence.
core/workflows/workflow_pdf.py (2)

481-519: Zero-vector fallback may have wrong dimensionality (defaults to 2048).

If the model dimension differs, documents get inconsistent vectors.

-            if embedding is not None:
+            if embedding is not None:
                 # Convert to numpy array and ensure it's 1D float32
                 embedding_array = np.asarray(embedding, dtype=np.float32)
                 if embedding_array.ndim != 1:
                     embedding_array = embedding_array.reshape(-1)
             else:
-                # Create zero embedding if something went wrong
-                embedding_array = np.zeros(self.config.embedding_dim, dtype=np.float32)
+                # Create zero embedding with the model's dimension if available
+                model_dim = None
+                if 'embedding_array' in locals():
+                    model_dim = int(embedding_array.shape[0]) if embedding_array is not None else None
+                if model_dim is None:
+                    model_dim = int(getattr(self.embedder, "embedding_dim", self.config.embedding_dim))
+                embedding_array = np.zeros(model_dim, dtype=np.float32)

Also consider batching:

  • Collect texts = [c['text'] for c in chunks], call embed_texts once, then map results back by index for a 10–30% throughput bump.
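
A rough sketch of that batching pattern; the embed_texts call and the chunk dict fields are assumptions taken from the bullet above, not the workflow's actual API.

import numpy as np

def embed_chunks_batched(embedder, chunks):
    """Embed every chunk text in one batched call, then map results back by index."""
    texts = [c['text'] for c in chunks]
    vectors = embedder.embed_texts(texts)  # single batched call instead of one call per chunk
    for chunk, vector in zip(chunks, vectors):
        chunk['embedding'] = np.asarray(vector, dtype=np.float32)
    return chunks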

213-218: Pass the model via the embedder config (key: "model_name"); do not add a top-level model= kwarg.

JinaV4Embedder.__init__ accepts a single config (EmbeddingConfig or dict) and reads model_name from it; it does not accept model/top-level kwargs — current calls like JinaV4Embedder(device=..., use_fp16=...) are incorrect and will raise. Change instantiations to pass a config/dict with 'model_name' (or update the constructor to accept model/**kwargs).

Files to fix:

  • core/embedders/embedders_jina.py (constructor at ~lines 66–76)
  • core/workflows/workflow_pdf.py (instantiation at lines 213–218)
  • core/workflows/workflow_pdf_batch.py (instantiation at ~line 424)

Suggested minimal fix: pass a dict/config, e.g. JinaV4Embedder({'model_name': self.config.embedding_model, 'device': self.config.device, 'use_fp16': self.config.use_fp16, 'chunk_size_tokens': self.config.chunk_size_tokens, 'chunk_overlap_tokens': self.config.chunk_overlap_tokens})

core/embedders/README.md (1)

294-309: Fix cache example: currently embeds the MD5 hash, not the original text.

super().embed([text_hash]) will generate embeddings for the hex hash string. Cache by hash, embed the original text.

Apply:

 class CachedEmbedder(JinaV4Embedder):
@@
-    @lru_cache(maxsize=10000)
-    def _cached_embed(self, text_hash: str) -> np.ndarray:
-        """Cache embeddings by text hash."""
-        return super().embed([text_hash])[0]
+    @lru_cache(maxsize=10000)
+    def _cached_embed(self, text: str) -> np.ndarray:
+        """Cache embeddings keyed by text content (or a stable hash)."""
+        return super().embed([text])[0]
@@
-        for text in texts:
-            text_hash = hashlib.md5(text.encode()).hexdigest()
-            embeddings.append(self._cached_embed(text_hash))
+        for text in texts:
+            embeddings.append(self._cached_embed(text))
         return np.array(embeddings)
.gitignore (1)

412-416: Don’t ignore the entire tests/ directory.

This drops test sources from VCS and CI. Keep artifact-specific ignores only.

-scratch/*
-dev-utils/*
-tests/
-core/database/arango/proxies/.gocache/
-*.gocache/*
-benchmarks/*
+scratch/*
+dev-utils/*
+# keep test sources tracked; ignore artifacts only (see lines 394–398)
+core/database/arango/proxies/.gocache/
+*.gocache/*
+benchmarks/*
core/monitoring/progress_tracker.py (1)

566-571: Fix Conveyance formula: TIME currently double-inverted (penalizes fast runs).

You normalize TIME as inverse duration (time_score ≈ 1/T). Dividing by time_score yields W·R·H·T. Multiply by time_score instead to satisfy C = (W·R·H/T)·Ctx^α.

-            if time_score > 0:
-                conveyance_score = (where_score * what_score * who_score / time_score) * \
-                                 (context_score ** alpha)
-            else:
-                conveyance_score = 0.0
+            if time_score > 0:
+                conveyance_score = (where_score * what_score * who_score * time_score) * \
+                                   (context_score ** alpha)
+            else:
+                conveyance_score = 0.0
core/tools/arxiv/arxiv_api_client.py (2)

273-288: Avoid per‑entry HEAD probe; it doubles traffic and hits rate limits.

Defer LaTeX availability to download time or make it opt‑in (e.g., has_latex=None until checked). This reduces API calls and respects rate limits.
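
One possible shape for the opt-in variant, as a sketch; the class, field, and function names are hypothetical, and the e-print URL is arXiv's standard source endpoint.

from dataclasses import dataclass
from typing import Optional

import requests

@dataclass
class ArxivEntry:
    arxiv_id: str
    has_latex: Optional[bool] = None  # None means "not checked yet"

def latex_available(entry: ArxivEntry, timeout: float = 10.0) -> bool:
    """HEAD-probe the source archive only when a caller needs the answer, then cache it."""
    if entry.has_latex is None:
        resp = requests.head(f"https://arxiv.org/e-print/{entry.arxiv_id}",
                             timeout=timeout, allow_redirects=True)
        entry.has_latex = resp.status_code == 200
    return entry.has_latex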


119-139: Add ArXiv-compliant User‑Agent header, drop unreachable raise, prefer HTTPS

File: core/tools/arxiv/arxiv_api_client.py (lines 119-139)

  • Add a descriptive User‑Agent (with contact) in headers and pass to requests.get to avoid throttling/blocks.
  • Remove the final unreachable raise requests.exceptions.RequestException("Max retries exceeded") (the loop re-raises on last attempt).
  • Prefer HTTPS for the endpoint if not already used.
  • Ensure import os is present at the top of the file.
 def _make_request(self, url: str, params: Optional[Dict[str, Any]] = None) -> requests.Response:
     """Make HTTP request with retries and error handling"""
     self._enforce_rate_limit()
 
     for attempt in range(self.max_retries):
         try:
             logger.debug(f"Request attempt {attempt + 1}: {url}")
-            response = requests.get(url, params=params, timeout=self.timeout)
+            headers = {
+                "User-Agent": os.environ.get(
+                    "HADES_USER_AGENT",
+                    "HADES-Lab/ArXivAPIClient (+set-contact@example.com)"
+                )
+            }
+            response = requests.get(url, params=params, timeout=self.timeout, headers=headers)
             response.raise_for_status()
             return response
             
         except requests.exceptions.RequestException as e:
             logger.warning(f"Request failed (attempt {attempt + 1}): {e}")
             if attempt < self.max_retries - 1:
                 # Exponential backoff
                 delay = (2 ** attempt) * self.rate_limit_delay
                 time.sleep(delay)
             else:
                 raise
-    
-    raise requests.exceptions.RequestException("Max retries exceeded")
core/config/arxiv_repository.env (1)

15-18: Remove hard-coded ARXIV passwords and rotate immediately.

core/config/arxiv_repository.env — plaintext ARXIV_ADMIN/WRITER/READER_PASSWORD on lines 15–17.

  • Rotate/revoke these credentials now.
  • Purge secrets from git history (git filter-repo or BFG) and replace the file with a template; add runtime-only loading to .gitignore.
  • Use a secret manager or runtime env/systemd drop‑ins; stop committing plaintext.
- export ARXIV_ADMIN_PASSWORD=P0Fzu3J5TvF8Y+kq9gVw0AwLZzj8djU7
- export ARXIV_WRITER_PASSWORD=EZDPQYy9KfikGJ2m7NeipCBkFWDHIv/o
- export ARXIV_READER_PASSWORD=KIBJpNXQL0eoVjqUG5Bo2Uq+xYBPhrhJ
+ # export ARXIV_ADMIN_PASSWORD=__SET_AT_RUNTIME__
+ # export ARXIV_WRITER_PASSWORD=__SET_AT_RUNTIME__
+ # export ARXIV_READER_PASSWORD=__SET_AT_RUNTIME__
core/monitoring/__init__.py (1)

282-315: Update calculate_conveyance to use spec-aligned α and clamp time

  • In core/monitoring/__init__.py, change alpha: float = 1.8 to alpha: float = 1.7 to match the spec’s default (e.g. core/workflows/workflow_arxiv_initial_ingest.py).
  • Remove time == 0 from the zero-propagation gate and instead clamp time via t = max(time, 1e-9) before dividing to avoid divide-by-zero while preserving conveyance semantics.
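
A sketch of the suggested guard; the signature is assumed from the description above, and the formula follows C = (W·R·H/T)·Ctx^α with α defaulting to 1.7.

def calculate_conveyance(what: float, where: float, who: float,
                         time: float, context: float, alpha: float = 1.7) -> float:
    """C = (W * R * H / T) * Ctx^alpha; zero-propagate on W/R/H/Ctx, clamp T instead of gating on it."""
    if min(what, where, who, context) <= 0.0:
        return 0.0  # a hard zero in any of these dimensions nullifies conveyance
    t = max(time, 1e-9)  # avoid divide-by-zero without treating fast runs as failures
    return (what * where * who / t) * (context ** alpha)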
core/workflows/workflow_pdf_batch.py (3)

545-553: Join key inconsistency: chunks use sanitized_id in document_id.

paper_doc stores document_id (raw), but chunks store document_id = sanitized_id. This breaks joins by document_id.

Apply this diff to use raw document_id for relations while keeping sanitized _key:

-                    txn_db.collection(collections['chunks']).insert({
-                        '_key': chunk_key,
-                        'document_id': sanitized_id,
+                    txn_db.collection(collections['chunks']).insert({
+                        '_key': chunk_key,
+                        'document_id': document_id,
                         'text': chunk.text,
                         'chunk_index': i,
                         'start_char': chunk.start_char,
                         'end_char': chunk.end_char
                     }, overwrite=True)

557-563: Same inconsistency for embeddings: document_id should be raw.

Embeddings currently record document_id = sanitized_id. Align with paper_doc.

-                    embedding_doc = {
+                    embedding_doc = {
                         '_key': f"{chunk_key}_emb",
-                        'document_id': sanitized_id,
+                        'document_id': document_id,
                         'chunk_id': chunk_key,
                         'vector': chunk.embedding.tolist(),
                         'model': 'jina-v4'
                     }

388-418: Replace legacy ArangoDBManager with ArangoMemoryClient (use resolve_memory_config)

Workflow still constructs ArangoDBManager — migrate to ArangoMemoryClient to avoid mixing DB stacks and proxy/HTTP2 drift.

  • core/workflows/workflow_pdf_batch.py — replace from core.database.arango_db_manager import ArangoDBManager (≈line 414) and WORKER_DB_MANAGER = ArangoDBManager(arango_config) (≈line 432) with using resolve_memory_config(...) and ArangoMemoryClient(config); keep the collections override.
  • core/tools/arxiv/arxiv_manager.py — replace from core.database.arango.arango_client import ArangoDBManager (≈line 31) and self.db_manager = ArangoDBManager(arango_config) if arango_config else None (≈line 254) with the memory-client pattern.
  • Use the Memory client for reads; for writes, encapsulate in a transaction helper (add a thin transaction API atop the memory/HTTP2 client if needed).
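
A minimal sketch of that migration pattern; resolve_memory_config and ArangoMemoryClient are named in this PR, but the helper's exact import path and keyword arguments are assumptions here.

from core.database.arango import ArangoMemoryClient
from core.database.arango.memory_client import resolve_memory_config  # assumed module path

memory_config = resolve_memory_config(
    database="arxiv_repository",
    read_socket="/run/hades/readonly/arangod.sock",
    write_socket="/run/hades/readwrite/arangod.sock",
)
db_client = ArangoMemoryClient(memory_config)  # replaces ArangoDBManager(arango_config)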
core/embedders/embedders_jina.py (1)

450-543: Late chunking currently re-encodes context per chunk; violates “encode whole doc first” mandate

Guidelines require: encode full document once, then derive chunk embeddings. Current windowed approach scales O(N) model calls and breaks the “never chunk before encoding” rule.

Apply this change to base late chunking on the already-implemented encode_full_document + embed_chunks_from_tokens:

@@ def embed_with_late_chunking(self,
-        # For longer texts, create overlapping context windows
-        chunks = []
-        chars_per_token = 4  # Rough estimate
-        chunk_size_chars = self.chunk_size_tokens * chars_per_token
-        overlap_chars = self.chunk_overlap_tokens * chars_per_token
-        # We'll use larger context windows for encoding
-        context_size_chars = min(chunk_size_chars * 3, self.MAX_TOKENS * chars_per_token)
-        chunk_index = 0
-        chunk_start = 0
-        while chunk_start < len(text):
-            ...
-        return chunks
+        # Proper late chunking: encode once, slice token embeddings
+        token_embeddings, meta = self.encode_full_document(text, task=task)
+        if token_embeddings.numel() == 0:
+            return []
+        return self.embed_chunks_from_tokens(
+            token_embeddings=token_embeddings,
+            metadata=meta,
+            text=text,
+            chunk_size_tokens=self.chunk_size_tokens,
+            chunk_overlap_tokens=self.chunk_overlap_tokens,
+        )
🧹 Nitpick comments (68)
core/processors/text/chunking_strategies.py (4)

173-201: Type hint is good; avoid O(n^2) start_char computation in loop.

len(' '.join(tokens[:i])) re-joins increasingly large prefixes per iteration. Precompute prefix character offsets once and index them to keep the loop O(n).

@@
-        chunks: List[TextChunk] = []
+        chunks: List[TextChunk] = []
         stride = self.chunk_size - self.chunk_overlap
+        # Precompute start-char offsets for each token index to avoid repeated joins
+        if tokens:
+            prefix_chars = [0, len(tokens[0])]
+            for idx in range(1, len(tokens)):
+                prefix_chars.append(prefix_chars[-1] + 1 + len(tokens[idx]))
+        else:
+            prefix_chars = [0]
@@
-            # Calculate character positions (approximate)
-            start_char = len(' '.join(tokens[:i]))
+            # Calculate character positions (approximate)
+            start_char = prefix_chars[i] if i < len(prefix_chars) else prefix_chars[-1]
             end_char = start_char + len(chunk_text)

277-343: Type hint LGTM; consider stabilizing char offsets.

current_chunk_start advances by len(current_chunk_text)+1 while you sometimes join with "\n\n". For more consistent offsets (and fewer split/join side-effects), track a running cursor over the cleaned source or precompute paragraph start offsets.


481-492: Type hint LGTM; also eliminate repeated joins in SlidingWindow.

Same O(n^2) pattern when computing start_char. Precompute once.

@@
-        chunks: List[TextChunk] = []
+        chunks: List[TextChunk] = []
+        # Precompute start-char offsets for each token index
+        if tokens:
+            prefix_chars = [0, len(tokens[0])]
+            for idx in range(1, len(tokens)):
+                prefix_chars.append(prefix_chars[-1] + 1 + len(tokens[idx]))
+        else:
+            prefix_chars = [0]
@@
-            start_char = len(' '.join(tokens[:i]))
+            start_char = prefix_chars[i] if i < len(prefix_chars) else prefix_chars[-1]

568-569: Add debug trace for factory creations.

Improves observability/ownership without changing behavior.

-        return strategy_class(**kwargs)
+        logger.debug("Creating chunking strategy '%s' with kwargs=%s", strategy_type, kwargs)
+        return strategy_class(**kwargs)
core/workflows/storage/__init__.py (2)

11-11: Avoid exporting None for LocalStorage; fail fast with a clear error.

Returning None defers failure to a later AttributeError. Prefer a lazy, actionable ImportError.

Apply or adapt:

 except ImportError:
-    LocalStorage = None  # type: ignore[misc]
+    LocalStorage = None  # type: ignore[misc]
+
+# Fail-fast, informative fallback
+def __getattr__(name: str):
+    if name == "LocalStorage" and LocalStorage is None:
+        raise ImportError(
+            "LocalStorage backend is unavailable. Install optional deps or select a supported backend."
+        )
+    raise AttributeError(name)

Sort __all__ to satisfy Ruff (RUF022).

Keeps exports predictable and quiets the linter.

 __all__ = [
-    'StorageBase',
-    'LocalStorage',
-    'S3Storage',
-    'RamFSStorage',
+    'LocalStorage',
+    'RamFSStorage',
+    'S3Storage',
+    'StorageBase',
 ]
core/processors/__init__.py (1)

16-22: Consider alphabetically sorting __all__ for consistency.

The static analysis tool suggests sorting the __all__ list alphabetically. Alphabetical ordering is the widely adopted convention (PEP 8 and common practice): sorted exports are quicker to read and search, easier to verify, and less prone to duplicate entries.

Apply this diff to sort the __all__ list alphabetically:

 __all__ = [
     'ChunkingStrategy',
     'ChunkingStrategyFactory',
     'SemanticChunking',
-    'TokenBasedChunking',
     'SlidingWindowChunking',
+    'TokenBasedChunking',
 ]
.coderabbit.yaml (2)

55-55: Fix misleading comment for tests exclusion.

The pattern excludes tests, but the comment mentions lock files.

-    - "!**/tests/**"             # Lock files (e.g. poetry.lock)
+    - "!**/tests/**"             # Exclude tests from review

58-103: Add path-specific rule: enforce dual-embedder with late chunking for workflows/embedders.

Codifies the learned policy so reviews consistently check it.

   path_instructions:
     - path: "**/*"
       instructions: |
         # HADES — Conveyance Framework (Efficiency view default)
         ...
+    - path: "core/workflows/**/*.py"
+      instructions: |
+        Embedder selection by workload:
+        - Use SentenceTransformersEmbedder for high throughput (48+ papers/sec).
+        - Use JinaV4Embedder for sophisticated processing.
+        - Both must use late chunking.
+    - path: "core/embedders/**/*.py"
+      instructions: |
+        Enforce late chunking and support both SentenceTransformersEmbedder and JinaV4Embedder per workload.
core/tools/arxiv/arxiv_manager.py (2)

327-343: Be robust to metadata schema variations (authors/categories).

The OAI snapshot varies: authors can be a list of dicts or a comma/semicolon-delimited string; categories often space-delimited string.

-            authors=metadata.get('authors', []).split(', ') if isinstance(metadata.get('authors'), str) else metadata.get('authors', []),
+            authors=(
+                metadata.get('authors', [])
+                if isinstance(metadata.get('authors'), list)
+                else [a.strip() for a in re.split(r'[;,]\s*', metadata.get('authors', '')) if a.strip()]
+            ),
-            categories=metadata.get('categories', '').split() if isinstance(metadata.get('categories'), str) else metadata.get('categories', []),
+            categories=(
+                metadata.get('categories', [])
+                if isinstance(metadata.get('categories'), list)
+                else metadata.get('categories', '').split()
+            ),

Would you confirm the exact authors/categories shapes in your snapshot file?


494-514: Optional: parallelize batch with bounded concurrency.

Sequential awaits underutilize the pipeline; consider bounded gather.

-        results = []
-        
-        for arxiv_id in arxiv_ids:
-            try:
-                result = await self.process_arxiv_paper(arxiv_id, store_in_db)
-                results.append(result)
-            except Exception as e:
-                logger.error(f"Failed to process {arxiv_id}: {e}")
-                # Create failed result
-                from core.workflows.workflow_pdf import ProcessingResult, ExtractionResult
-                failed_result = ProcessingResult(
-                    extraction=ExtractionResult(full_text=""),
-                    chunks=[],
-                    processing_metadata={'arxiv_id': arxiv_id, 'error': str(e)},
-                    total_processing_time=0,
-                    extraction_time=0,
-                    chunking_time=0,
-                    embedding_time=0,
-                    success=False,
-                    errors=[str(e)]
-                )
-                results.append(failed_result)
-        
-        return results
+        async def _process_one(aid: str):
+            try:
+                return await self.process_arxiv_paper(aid, store_in_db)
+            except Exception as e:
+                logger.error(f"Failed to process {aid}: {e}")
+                from core.workflows.workflow_pdf import ProcessingResult, ExtractionResult
+                return ProcessingResult(
+                    extraction=ExtractionResult(full_text=""),
+                    chunks=[],
+                    processing_metadata={'arxiv_id': aid, 'error': str(e)},
+                    total_processing_time=0,
+                    extraction_time=0,
+                    chunking_time=0,
+                    embedding_time=0,
+                    success=False,
+                    errors=[str(e)]
+                )
+
+        sem = asyncio.Semaphore(min(8, max(1, len(arxiv_ids))))
+        async def _guarded(aid: str):
+            async with sem:
+                return await _process_one(aid)
+        return await asyncio.gather(*[_guarded(a) for a in arxiv_ids])
setup/verify_storage.py (1)

90-107: Recent activity query should target arxiv_papers and processing_timestamp.

Small alignment with new schema.

-        one_min_ago = (datetime.now() - timedelta(minutes=1)).isoformat()
+        one_min_ago = (datetime.now() - timedelta(minutes=1)).isoformat()
         cursor = db.aql.execute('''
-            FOR doc IN arxiv_metadata
-                FILTER doc.processed_at >= @time
+            FOR doc IN arxiv_papers
+                FILTER doc.processing_timestamp >= @time
                 COLLECT WITH COUNT INTO count
                 RETURN count
         ''', bind_vars={'time': one_min_ago})
setup/system_requirements.md (2)

110-114: Prefer Composer install via the committed composer.json.

Keep dependency management declarative; avoid ad-hoc require in prod docs.

-   composer require triagens/arangodb
+   composer install  # uses composer.json at repo root

85-88: Add language tag to fenced code block (markdownlint MD040).

Small lint fix.

-```
+```env
 ARANGO_PASSWORD=your_password
 CUDA_VISIBLE_DEVICES=0,1

core/embedders/README.md (2)

11-17: Add language to fenced block (markdownlint MD040).

Directory tree block should be tagged as text.

-```
+```text
 embedders/
 ├── embedders_base.py     # Abstract base class and interfaces
 ├── embedders_jina.py     # Jina V3/V4 implementation with late chunking
 ├── embedders_factory.py  # Factory pattern for embedder instantiation
 └── __init__.py          # Public API exports

110-118: Header/model mismatch: “Jina V3 (1024)” but snippet uses JinaV4Embedder + v4 model.

Align the heading or the example to avoid confusion.

-### Jina V3 (1024 dimensions)
+### Jina V4 (1024‑dimension mode)
@@
-embedder = JinaV4Embedder(
-    model_name="jinaai/jina-embeddings-v4",
-    max_seq_length=8192,
-    dimensions=1024
-)
+embedder = JinaV4Embedder(model_name="jinaai/jina-embeddings-v4",
+                          max_seq_length=8192, dimensions=1024)
setup/setup_python_env.sh (1)

62-66: Project-root resolution LGTM; make it symlink-safe.

Prefer resolving symlinks for robustness.

-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+_src="${BASH_SOURCE[0]}"
+SCRIPT_DIR="$(cd "$(dirname "${_src}")" && pwd)"
+# Try to resolve symlinks (portable fallback if readlink -f unavailable)
+if command -v readlink >/dev/null 2>&1; then
+  real="$(readlink -f "${_src}" 2>/dev/null || true)"
+  [ -n "$real" ] && SCRIPT_DIR="$(cd "$(dirname "$real")" && pwd)"
+fi
 PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
 cd "$PROJECT_ROOT"
setup/setup_database.sh (1)

228-233: Quote/array-expand script args to avoid word-splitting.

$PYTHON_ARGS as a flat string risks mis-splitting options containing spaces.

-# Run the simplified Python setup script from project root
-if python3 "$PROJECT_ROOT/dev-utils/setup_arxiv_database_simple.py" $PYTHON_ARGS; then
+# Run the simplified Python setup script from project root
+read -r -a _py_args <<< "$PYTHON_ARGS"
+if python3 "$PROJECT_ROOT/dev-utils/setup_arxiv_database_simple.py" "${_py_args[@]}"; then
README.md (3)

61-69: Add language to fenced block (markdownlint MD040).

Mark the CLI example as bash.

-```
+```bash
 poetry run python tests/benchmarks/arango_connection_test.py \
@@
 --report-json reports/get_hot.json

18-27: Avoid using root in examples; show least-privilege user.

Use the RO user for GETs to reinforce security posture.

 config = ArangoHttp2Config(
     database="arxiv_repository",
     socket_path="/tmp/arango_ro_proxy.sock",
-    username="root",
-    password="...",
+    username="arxiv_reader",
+    password="…",
 )

48-49: Align default socket paths with systemd deployment (/run/hades).

Docs say proxies ship under /run/hades; reflect that as default.

-Sockets default to `/tmp/arango_ro_proxy.sock` and `/tmp/arango_rw_proxy.sock`. Ensure permissions (0660) and adjust via env vars `LISTEN_SOCKET`, `UPSTREAM_SOCKET`.
+Sockets default to `/run/hades/readonly/arangod.sock` and `/run/hades/readwrite/arangod.sock` (systemd-managed). Ensure permissions (0640/0600) and adjust via env vars `LISTEN_SOCKET`, `UPSTREAM_SOCKET`.
docs/benchmarks/arango_phase4_summary.md (1)

1-31: Add a Conveyance Scorecard summary to this doc.

Append a short scorecard: Time stance (Efficiency), W/R/H/T summary, L/I/A/G in [0,1], α (default 1.7), zero‑propagation check, and links to raw evidence.

core/tools/arxiv/arxiv_api_client.py (2)

352-371: Stream large downloads to disk.

Use stream=True and iter_content to avoid loading full PDFs into memory. If you keep _make_request generic, call requests.get here directly with headers from above.
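
For example, a sketch rather than the client's actual method; the User-Agent mirrors the header suggested in the previous comment.

import requests

def download_pdf(url: str, dest_path: str, timeout: float = 30.0) -> None:
    """Stream a large PDF to disk instead of holding the whole response body in memory."""
    headers = {"User-Agent": "HADES-Lab/ArXivAPIClient (+set-contact@example.com)"}
    with requests.get(url, headers=headers, timeout=timeout, stream=True) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                if chunk:
                    fh.write(chunk)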


171-183: Add User-Agent to API query as well.

Mirror the header addition here to comply with ArXiv policy.

core/config/config_loader.py (1)

281-301: Environment load typing LGTM; minor enhancement optional.

Consider JSON decoding arrays/objects (when value starts with “[” or “{”) to allow complex env overrides.
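
Something like the following sketch; the env var name is made up for illustration.

import json
from typing import Any

def parse_env_value(raw: str) -> Any:
    """Decode JSON-looking env values so lists/objects can override nested config keys."""
    stripped = raw.strip()
    if stripped[:1] in ("[", "{"):
        try:
            return json.loads(stripped)
        except json.JSONDecodeError:
            pass  # not valid JSON; keep the raw string
    return raw

# e.g. HADES_GPU_DEVICES='[0, 1]' parses to [0, 1]; ordinary values stay plain strings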

.env.example (2)

7-11: .env key grouping/order nits (optional).

To satisfy dotenv‑linter and improve readability, group as DATABASE → AUTH, and reorder as DATABASE, USERNAME, PASSWORD.

-# ArangoDB Authentication
-ARANGO_USERNAME=root
-ARANGO_PASSWORD=your_password_here
-ARANGO_DATABASE=arxiv_repository
+# ArangoDB
+ARANGO_DATABASE=arxiv_repository
+ARANGO_USERNAME=root
+ARANGO_PASSWORD=your_password_here

12-19: Declare sockets before the flag; match proxy paths.

Minor ordering plus consistency with proxy locations.

-# Socket Configuration (HADES uses Unix sockets only)
-ARANGO_USE_PROXIES=true
-ARANGO_RO_SOCKET=/run/hades/readonly/arangod.sock
-ARANGO_RW_SOCKET=/run/hades/readwrite/arangod.sock
+# Socket Configuration (HADES uses Unix sockets only)
+ARANGO_RO_SOCKET=/run/hades/readonly/arangod.sock
+ARANGO_RW_SOCKET=/run/hades/readwrite/arangod.sock
+ARANGO_USE_PROXIES=true
core/workflows/workflow_arxiv_initial_ingest.md (3)

38-42: Update architecture diagram to remove PHP bridge.

-    ArangoDB (via PHP bridge)
+    ArangoDB (via ArangoMemoryClient → HTTP/2 over UDS proxies)

239-241: Refresh Last updated/version to this change set.

Use an absolute date (e.g., 2025‑09‑20) and increment version to reflect HTTP/2 migration.


200-210: Conveyance section: keep α only on Ctx; add scorecard.

You use the correct formula; please add a short Conveyance Scorecard block (Time stance, W/R/H/T, L/I/A/G, α=1.7, zero‑prop).

docs/prd/go_grpc_neural_pathway_prd.md (1)

114-123: Tie SLOs to benchmark doc and clarify conditions.

Explicitly reference Phase 4 benchmarks (doc link) and specify “hot cache, single host, HTTP/2 over UDS via proxies” to avoid ambiguity.

pyproject.toml (1)

66-90: Consider the implications of relaxed typing configuration.

The mypy configuration has been significantly relaxed from strict typing to permissive. While this may be necessary during the HTTP/2 migration, consider the long-term maintainability implications:

  • ignore_missing_imports = true may hide legitimate import issues
  • allow_untyped_defs/calls = true reduces type safety benefits
  • check_untyped_defs = false disables valuable type checking

Consider creating a gradual migration plan to re-enable stricter typing after the HTTP/2 refactoring is complete. This could involve:

  1. Creating a separate [tool.mypy.strict] configuration for new modules
  2. Gradually enabling checks on a per-module basis using [[tool.mypy.overrides]]
  3. Setting target dates for re-enabling stricter checks
 # Can gradually enable stricter checking later
-# strict = true  # Uncomment when ready
+# TODO: Re-enable strict typing by Q2 2025
+# Target modules: core/database/arango/* (new HTTP/2 clients)
core/config/embedders/default.yaml (1)

1-10: Add missing newline at end of file.

The static analysis correctly identified that the file is missing a trailing newline, which is a POSIX standard requirement.

Apply this fix:

 # Override any specific settings here if needed
-# (Currently using all Jina v4 defaults)
\ No newline at end of file
+# (Currently using all Jina v4 defaults)
core/database/database_factory.py (1)

79-82: Clarify debugging-only network access policy.

The warning message and policy enforcement look good, but consider making the debugging limitation more explicit in production deployments.

Consider adding an environment variable check to make the debugging policy more explicit:

# Network connection - ONLY for human debugging, never for HADES
if host != "localhost":
    if os.environ.get("HADES_ALLOW_NETWORK_DEBUG") != "true":
        raise RuntimeError("Network connections not allowed in HADES. Set HADES_ALLOW_NETWORK_DEBUG=true for debugging only.")
    logger.warning("Network connection requested - this should only be used for debugging")
core/database/arango/proxies/rw_proxy.go (1)

28-31: Avoid logging sensitive payloads.

Ensure logRequests does not log request bodies or auth headers; redact query strings. Cursor POST bodies can contain AQL.

core/database/arango/proxies/ro_proxy.go (1)

60-63: Harden cursor path check.

strings.Contains can match unintended paths. Use an anchored check like ^(/_db/[^/]+)?/_api/cursor(/\d+)?$ after URL‑decoding.
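
The proxy is Go, but the intent of the anchored check is easy to illustrate in Python; this is a behavioral sketch, not the proxy's code.

import re
from urllib.parse import unquote

CURSOR_PATH = re.compile(r'^(/_db/[^/]+)?/_api/cursor(/\d+)?$')

def is_cursor_path(raw_path: str) -> bool:
    """Anchored match after URL-decoding, so look-alike or nested paths are rejected."""
    return CURSOR_PATH.match(unquote(raw_path)) is not None

assert is_cursor_path("/_db/arxiv_repository/_api/cursor")
assert is_cursor_path("/_api/cursor/12345")
assert not is_cursor_path("/_api/cursor/../import")  # a plain substring check would allow this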

setup/setup_local.sh (5)

46-51: Version check not portable (sort -V).

BSD sort (macOS) lacks -V, so the version comparison breaks there and the script can abort. Use Python to compare, or fall back to dpkg --compare-versions when available.

Apply:

-if [ "$(printf '%s\n' "$required_version" "$python_version" | sort -V | head -n1)" = "$required_version" ]; then
+if python3 - <<'PY' "$python_version" "$required_version"; then
+import sys
+inst, req = sys.argv[1], sys.argv[2]
+def to_t(t): return tuple(map(int, t.split('.')[:2]))
+sys.exit(0 if to_t(inst) >= to_t(req) else 1)
+PY

53-79: PHP/composer detection uses grep -P and brittle checks.

Use PHP itself for version and composer to verify the package is installed; avoid GNU grep -P dependency.

Apply:

-php_version=$(php --version 2>&1 | grep -Po '(?<=PHP )\d+\.\d+' | head -1)
+php_version=$(php -r 'echo PHP_MAJOR_VERSION.".".PHP_MINOR_VERSION;')
@@
-        if [ -f "composer.json" ] && grep -q "triagens/arangodb" composer.json; then
-            print_success "ArangoDB PHP driver is installed"
+        if composer show -i 2>/dev/null | grep -q '^triagens/arangodb '; then
+            print_success "ArangoDB PHP driver is installed"

93-97: PROJECT_ROOT resolution LGTM; add guard for missing path.

Add a quick sanity check before cd.

Apply:

 cd "$PROJECT_ROOT"
+if [ $? -ne 0 ]; then
+  print_error "Failed to cd to PROJECT_ROOT ($PROJECT_ROOT)"; exit 1;
+fi

81-91: Avoid curl|python installer without checksum.

At minimum, show hash verification to reduce supply‑chain risk.

I can add sha256 verification for install.python-poetry.org bootstrap.


127-133: Parameterize ARANGO_HOST via .env before probing.

Defaulting to 192.168.1.69 is surprising; prefer .env if present.

Apply:

-    arango_host=${ARANGO_HOST:-192.168.1.69}
+    [ -f .env ] && set -a && . ./.env && set +a
+    arango_host=${ARANGO_HOST:-127.0.0.1}
docs/issues/46-remove-sentence-transformers.md (2)

3-3: Fix markdownlint MD034 (bare URL).

Wrap the URL as a link.

Apply:

-- Source: https://github.com/r3d91ll/HADES-Lab/issues/46
+- Source: <https://github.com/r3d91ll/HADES-Lab/issues/46>

19-29: Reconcile with prior learning: keep late‑chunking and workload‑based embedder choice.

Long‑term learning notes prefer ST for high throughput and Jina v4 for sophistication. If consolidating on Jina v4, explicitly deprecate ST with rationale and keep late‑chunking guarantees.

I can draft an “Impact & Migration” section and update embedders_factory docstrings accordingly.

setup/php_arango_setup.md (3)

47-50: Clarify status of unix_client.py.

Doc says “attempted Unix socket connection (doesn't work)”; if the file was removed in this PR, note that and link to the new HTTP/2 client and proxies as the path forward.


101-104: Qualify the “~40% faster” claim.

Add a benchmark reference or mark as anecdotal; otherwise readers may assume a hard guarantee.


68-81: Tie UDS path to systemd‑managed sockets.

Recommend using /run/hades/… sockets and group permissions per PR summary; add chmod/chgrp guidance.

docs/prd/grpc_database_service_prd.md (2)

137-162: Inconsistent transport: “Unix sockets only” vs “Port 50051”.

Replace “Port 50051” with a UDS endpoint (e.g., unix:///run/hades/hippocampus.sock) or clarify both modes and default to UDS.

Apply:

-                  │ Port 50051
+                  │ unix:///run/hades/hippocampus.sock

334-358: Add a Conveyance Scorecard to the PRD.

Per guidelines, include Time stance, W/R/H/T summary, L/I/A/G in [0,1], α (default 1.7), zero‑propagation check, and evidence links for the listed targets.

core/database/arango/php_unix_bridge.php (2)

194-223: Stats payload likely uses wrong figure keys.

figures() returns driver‑specific structure; accessing ['figures']['memory']['count']/['disk']['count'] seems invalid. Prefer exposing raw figures or documented fields (e.g., documentsSize, indexes.count).

I can adjust to safe subset: exists, count, figures['documentsSize'], figures['indexes']['count'] when present.


24-38: Make endpoint configurable and prefer UDS when available.

Honor ARANGO_ENDPOINT env (e.g., unix:///run/hades/arangodb.sock) and fall back to tcp://localhost:8529; reflect in outputs to avoid confusion.

Apply:

-$connectionOptions = [
-    // TODO: Fix Unix socket permissions, for now use TCP
-    // ConnectionOptions::OPTION_ENDPOINT => 'unix:///tmp/arangodb.sock',
-    ConnectionOptions::OPTION_ENDPOINT => 'tcp://localhost:8529',
+$endpoint = getenv('ARANGO_ENDPOINT') ?: 'tcp://localhost:8529';
+$connectionOptions = [
+    // Prefer UDS when ARANGO_ENDPOINT is set to unix://…
+    ConnectionOptions::OPTION_ENDPOINT => $endpoint,
@@
-                'endpoint' => 'tcp://localhost:8529'  // TODO: Switch to Unix socket
+                'endpoint' => $endpoint
docs/prd/arango_optimized_connection_prd.md (1)

32-38: Observability: specify trace propagation and metric names.

Define the exact header key (e.g., x-hades-trace), logging format, and minimal metric set (in‑flight streams, queue depth, WAL/cache) with sampling rules. Link these to your benchmark harness.

If you have a header name differing from x-hades-trace in proxies/client, please confirm and I’ll align the PRD text.
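If useful, here is a minimal sketch of the client side, assuming an httpx-based UDS client; the x-hades-trace key is the example name above, and the socket path and logger name are placeholders rather than the repo's actual wiring:

# Illustrative sketch only: header key, socket path, and logger name are assumptions.
import logging
import uuid

import httpx

logger = logging.getLogger("hades.arango")

def _attach_trace(request: httpx.Request) -> None:
    # Propagate an existing trace id or mint one per request.
    request.headers.setdefault("x-hades-trace", uuid.uuid4().hex)

def _log_response(response: httpx.Response) -> None:
    # Minimal structured line; extend with in-flight streams / queue depth gauges.
    logger.info(
        "arango trace=%s method=%s path=%s status=%s",
        response.request.headers.get("x-hades-trace"),
        response.request.method,
        response.request.url.path,
        response.status_code,
    )

client = httpx.Client(
    transport=httpx.HTTPTransport(uds="/run/hades/readonly/arangod.sock"),
    base_url="http://localhost",
    event_hooks={"request": [_attach_trace], "response": [_log_response]},
)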

core/workflows/workflow_pdf_batch.py (2)

424-429: Config‑select the embedder per workload.

Guidelines: SentenceTransformersEmbedder for high throughput; JinaV4Embedder for sophistication. Make this selectable from config while keeping late chunking.

-    from core.embedders import JinaV4Embedder
+    from core.embedders import JinaV4Embedder, SentenceTransformersEmbedder
@@
-    WORKER_EMBEDDER = JinaV4Embedder(
+    embedder_type = (os.environ.get("EMBEDDER", "") or
+                     embedding_config.get("type", "jina_v4")).lower()
+    if embedder_type == "sentence_transformers":
+        WORKER_EMBEDDER = SentenceTransformersEmbedder(device='cuda', use_fp16=True,
+                                                      chunk_size_tokens=1000, chunk_overlap_tokens=200)
+    else:
+        WORKER_EMBEDDER = JinaV4Embedder(
             device='cuda',
             use_fp16=True,
             chunk_size_tokens=1000,
             chunk_overlap_tokens=200
-    )
+        )

Confirm SentenceTransformersEmbedder implements embed_with_late_chunking; if not, I’ll add it.


190-193: Broad Exception catches hide actionable failure modes.

Narrow exceptions around known operations (I/O, AQL/HTTP, CUDA) or at least log type/traceback to avoid silent data loss. Keep phase‑level isolation.

-                    except Exception as e:
+                    except (TimeoutError, OSError, RuntimeError) as e:
                         results['failed'].append(task.document_id)
-                        logger.error(f"Extraction error for {task.document_id}: {e}")
+                        logger.exception(f"Extraction error for {task.document_id}: {e}")
@@
-    except Exception as e:
+    except (OSError, RuntimeError, ValueError) as e:
         return {
             'success': False,
             'error': str(e)
         }

Also applies to: 602-607

core/database/arango/__init__.py (1)

8-10: Surface change note is good; add migration pointer.

Consider linking to Acheron path or a MIGRATION.md so downstreams replacing ArangoDBManager/unix client know where to look.

CLAUDE.md (1)

201-214: Update env guidance for new socket/proxy path.

Add ARANGO_RO_SOCKET/ARANGO_RW_SOCKET (and optional ARANGO_HTTP_BASE_URL) to reflect UDS + proxies.

 export ARANGO_PASSWORD="your-arango-password"
-export ARANGO_HOST="localhost"
+export ARANGO_HTTP_BASE_URL="http://localhost"
+export ARANGO_RO_SOCKET="/run/hades/readonly/arangod.sock"
+export ARANGO_RW_SOCKET="/run/hades/readwrite/arangod.sock"
core/database/utils/check_resume_state.py (3)

1-1: Shebang vs execution bit.

Either make the file executable (chmod +x) or drop the shebang. Current state triggers linter EXE001.

-#!/usr/bin/env python3
+# Run with: python core/database/utils/check_resume_state.py

13-22: Avoid blind Exception catch on DB connect; print helpful diagnostics.

Catch driver‑specific errors and include server/DB info to aid triage.

+from arango.exceptions import ArangoServerError, ServerConnectionError  # add near the other imports
@@
-except Exception as e:
+except (ServerConnectionError, ArangoServerError) as e:
     print(f"ERROR: Failed to connect to ArangoDB: {e}", file=sys.stderr)

56-67: Set‑materialization can be heavy; consider streaming diff.

For large embeddings sets, building set(cursor) loads all IDs. If memory spikes, stream and stop after printing first 10 unprocessed.
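A sketch of the streaming variant; it keeps only the smaller embeddings-ID set and stops early on the metadata side. Collection names and the execute_query call follow this review's wording and may differ in the script:

# Sketch only: collection names and the client's query API are assumed from this review.
def print_unprocessed(client, limit: int = 10) -> None:
    embedded = set()
    for arxiv_id in client.execute_query(
        "FOR e IN arxiv_abstract_embeddings RETURN e.arxiv_id"
    ):
        embedded.add(arxiv_id)

    shown = 0
    for arxiv_id in client.execute_query(
        "FOR doc IN arxiv_metadata RETURN doc.arxiv_id"
    ):
        if arxiv_id not in embedded:
            print(f"  unprocessed: {arxiv_id}")
            shown += 1
            if shown >= limit:
                # Stop streaming once the first N unprocessed IDs are shown.
                break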

docs/deploy/arango_proxy_systemd.md (2)

35-37: Install sockets under sockets.target, not multi-user.target

Socket units are typically WantedBy=sockets.target so they come up early and independently of runlevel ordering.

-[Install]
-WantedBy=multi-user.target
+[Install]
+WantedBy=sockets.target

Also applies to: 52-54


45-51: Tighten permissions and paths; align with code defaults

  • Consider moving UPSTREAM_SOCKET from /tmp to /run/arangodb3/arangodb.sock (tmp is world-writable; your client defaults to /run paths).
  • For the RO socket, 0640 is fine; for RW, 0600 is good. Ensure groups “inference” and “consolidation” exist and that your processes actually run under those UIDs/GIDs; otherwise systemd will chown to non-existent principals at start.

Also applies to: 85-91, 102-109, 115-121

core/database/arango/optimized_client.py (2)

97-109: Avoid building huge NDJSON payloads in memory

For large batches this joins the entire iterable into a single bytes object. Consider chunked streaming (iterable of bytes) or bounded batch sizes to cap peak memory.
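A bounded-batch sketch; bulk_import stands in for whatever method wraps the NDJSON endpoint and is not the client's real signature:

# Sketch only: flush bounded NDJSON batches instead of joining the whole iterable at once.
import json
from itertools import islice
from typing import Any, Iterable, Mapping

def insert_in_batches(client: Any, collection: str,
                      docs: Iterable[Mapping[str, Any]],
                      batch_size: int = 5000) -> int:
    it = iter(docs)
    inserted = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # Peak memory is capped at one batch worth of serialized lines.
        payload = b"\n".join(json.dumps(doc).encode("utf-8") for doc in batch)
        client.bulk_import(collection, payload)  # hypothetical NDJSON wrapper
        inserted += len(batch)
    return inserted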


171-178: Preserve server error diagnostics, but trim message duplication

TRY003 nit: long f-strings in raises. Minor stylistic; optional.

core/embedders/embedders_jina.py (2)

497-498: Remove unused variable

context_size_chars is assigned but unused.

-        context_size_chars = min(chunk_size_chars * 3, self.MAX_TOKENS * chars_per_token)

482-483: Prefer logging.exception for unexpected errors

When catching broad Exceptions, log stack traces with logging.exception.

-            logger.error(f"Worker {worker_id} error: {e}")
+            logger.exception(f"Worker {worker_id} error")

Also applies to: 551-552

core/database/arango/memory_client.py (1)

122-188: Environment resolution: robust defaults and unused arg

  • use_proxies is unused; either wire it to pick /run/hades/* vs direct /run/arangodb3/* sockets (see the sketch below), or remove it.
  • Consider honoring ARANGO_PASSWORD_ENV to mirror the CLI path, or ensure callers pass password=None to trigger env resolution (avoid passing empty strings).
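Something along these lines would cover both points; env var names and default paths mirror the deploy docs and are a sketch, not the module's actual fields:

# Sketch only: helper names, env vars, and defaults are assumptions to illustrate the wiring.
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class SocketPair:
    read_only: str
    read_write: str

def resolve_sockets(use_proxies: bool = True) -> SocketPair:
    # use_proxies selects the allowlisted /run/hades sockets; otherwise fall back to the direct arangod socket.
    if use_proxies:
        return SocketPair(
            read_only=os.environ.get("ARANGO_RO_SOCKET", "/run/hades/readonly/arangod.sock"),
            read_write=os.environ.get("ARANGO_RW_SOCKET", "/run/hades/readwrite/arangod.sock"),
        )
    direct = os.environ.get("ARANGO_UNIX_SOCKET", "/run/arangodb3/arangodb.sock")
    return SocketPair(read_only=direct, read_write=direct)

def resolve_password(explicit: Optional[str] = None) -> Optional[str]:
    # Mirror the CLI: explicit value wins, then the env var named by ARANGO_PASSWORD_ENV,
    # then ARANGO_PASSWORD; empty strings count as unset.
    if explicit:
        return explicit
    env_name = os.environ.get("ARANGO_PASSWORD_ENV", "ARANGO_PASSWORD")
    return os.environ.get(env_name) or os.environ.get("ARANGO_PASSWORD") or None
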
core/workflows/workflow_arxiv_initial_ingest.py (3)

468-481: Queue-drain loop: Empty() handling is okay; prefer logging.exception on errors

Minor polish; keeps diagnostics when storage fails.

-            except Exception as e:
-                logger.error(f"Storage error: {e}")
+            except Exception:
+                logger.exception("Storage error")

551-554: Use logging.exception for batch store failures

Same rationale—retain traceback.

-        except Exception as e:
-            logger.error(f"Failed to store batch: {e}")
+        except Exception:
+            logger.exception("Failed to store batch")

589-595: Password env handling: avoid empty-string fallthroughs

In main, setting ARANGO_PASSWORD to '' can be confusing. You already guard in init; consider skipping the env set if the source env is unset to avoid transient empty values.

-    elif not os.environ.get('ARANGO_PASSWORD'):
-        # Try to get from the specified env var
-        os.environ['ARANGO_PASSWORD'] = os.environ.get(args.password_env, '')
+    elif not os.environ.get('ARANGO_PASSWORD') and os.environ.get(args.password_env):
+        os.environ['ARANGO_PASSWORD'] = os.environ[args.password_env]

Also applies to: 184-191

Comment on lines 6 to 10
export ARANGO_UNIX_SOCKET=/tmp/arangodb.sock
export USE_UNIX_SOCKET=true
# Fallback HTTP connection
export ARXIV_DB_HOST=http://localhost:8529


⚠️ Potential issue

Socket path conflicts with proxy design.

Use the systemd‑managed allowlisted sockets under /run/hades/{readonly,readwrite}/arangod.sock rather than /tmp. Align these envs with .env.example.

🤖 Prompt for AI Agents
In core/config/arxiv_repository.env around lines 6–10, the ARANGO_UNIX_SOCKET is
pointing at /tmp which conflicts with the proxy design; change it to use the
systemd‑managed allowlisted socket (e.g. /run/hades/readwrite/arangod.sock — or
/run/hades/readonly/arangod.sock if the service is read‑only), keep
USE_UNIX_SOCKET=true, and remove/adjust the fallback ARXIV_DB_HOST so the file
matches .env.example (no localhost:/8529 HTTP fallback when using the systemd
socket).

Comment on lines +36 to +38
database: str = "_system"
socket_path: str = "/tmp/arangodb.sock"
base_url: str = "http://localhost"

⚠️ Potential issue

Default socket under /run, not /tmp

Using /tmp/arangodb.sock is a security smell and conflicts with your deployment docs that place sockets under /run/hades/* or /run/arangodb3. Switch the default to a root-run directory.

-    socket_path: str = "/tmp/arangodb.sock"
+    socket_path: str = "/run/arangodb3/arangodb.sock"
📝 Committable suggestion


Suggested change
database: str = "_system"
socket_path: str = "/tmp/arangodb.sock"
base_url: str = "http://localhost"
database: str = "_system"
socket_path: str = "/run/arangodb3/arangodb.sock"
base_url: str = "http://localhost"
🧰 Tools
🪛 Ruff (0.13.1)

37-37: Probable insecure usage of temporary file or directory: "/tmp/arangodb.sock"

(S108)

🤖 Prompt for AI Agents
In core/database/arango/optimized_client.py around lines 36 to 38, the default
socket_path currently uses /tmp/arangodb.sock; change it to a root-run location
used in deploy docs (for example /run/arangodb3/arangodb.sock or
/run/hades/arangodb.sock) so the default aligns with deployments, and update any
tests/configs or comments that reference the old path; ensure the chosen path is
consistent across the repo and restart/permission notes updated where
appropriate.

@@ -0,0 +1,7 @@
module github.com/r3d91ll/HADES-Lab/core/database/arango/proxies

go 1.23.4

⚠️ Potential issue

Patch version in the ‘go’ directive limits toolchain compatibility.

Full versions such as go 1.23.4 are only accepted by Go 1.21+ toolchains; older toolchains reject the file. Use major.minor unless you intend to require that exact toolchain.

-go 1.23.4
+go 1.23
📝 Committable suggestion


Suggested change
go 1.23.4
go 1.23
🤖 Prompt for AI Agents
In core/database/arango/proxies/go.mod around line 3, the go directive pins a
patch version ("go 1.23.4"), which pre-1.21 toolchains reject. Replace the
directive with the major.minor form, e.g. "go 1.23", by editing line 3 so the
file is accepted across toolchains.

Comment on lines +28 to +33
server := &http.Server{Handler: logRequests(proxy)}

log.Printf("Read-only proxy listening on %s -> %s", listenSocket, upstreamSocket)
if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
return fmt.Errorf("proxy server error: %w", err)
}

⚠️ Potential issue

Add HTTP server timeouts (RO proxy).

Same hardening as RW to mitigate slowloris/idle connections.

Apply:

@@
-import (
+import (
 	"fmt"
 	"log"
 	"net"
 	"net/http"
 	"strings"
+	"time"
 )
@@
-	server := &http.Server{Handler: logRequests(proxy)}
+	server := &http.Server{
+		Handler:           logRequests(proxy),
+		ReadHeaderTimeout: 2 * time.Second,
+		ReadTimeout:       5 * time.Second,
+		WriteTimeout:      15 * time.Second,
+		MaxHeaderBytes:    1 << 20,
+	}
📝 Committable suggestion


Suggested change
server := &http.Server{Handler: logRequests(proxy)}
log.Printf("Read-only proxy listening on %s -> %s", listenSocket, upstreamSocket)
if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
	return fmt.Errorf("proxy server error: %w", err)
}
import (
	"fmt"
	"log"
	"net"
	"net/http"
	"strings"
	"time"
)
server := &http.Server{
	Handler:           logRequests(proxy),
	ReadHeaderTimeout: 2 * time.Second,
	ReadTimeout:       5 * time.Second,
	WriteTimeout:      15 * time.Second,
	MaxHeaderBytes:    1 << 20,
}
log.Printf("Read-only proxy listening on %s -> %s", listenSocket, upstreamSocket)
if err := server.Serve(listener); err != nil && err != http.ErrServerClosed {
	return fmt.Errorf("proxy server error: %w", err)
}
🤖 Prompt for AI Agents
In core/database/arango/proxies/ro_proxy.go around lines 28 to 33, the HTTP
server is created without timeouts; add the same hardening as the RW proxy by
constructing the http.Server with explicit timeout fields (e.g. ReadTimeout,
ReadHeaderTimeout, WriteTimeout, IdleTimeout) and keep Handler:
logRequests(proxy); use the same timeout values used in the RW proxy
implementation so slowloris/idle connections are mitigated while keeping the
rest of the Serve logic unchanged.

Comment on lines +168 to +176
// Direct Unix socket connection
config := arango.Config{
Endpoints: []string{"unix:///tmp/arangodb.sock"},
// Connection pooling
MaxIdleConns: 100,
MaxOpenConns: 100,
ConnMaxLifetime: 0, // Never close (local socket)
}
```

⚠️ Potential issue

Use proxy sockets, not direct /tmp/arangodb.sock.

All service traffic must flow through RO/RW proxies under /run/hades/… to enforce allowlists. Update endpoint accordingly.

-    Endpoints: []string{"unix:///tmp/arangodb.sock"},
+    Endpoints: []string{"unix:///run/hades/readwrite/arangod.sock"},
📝 Committable suggestion


Suggested change
// Direct Unix socket connection
config := arango.Config{
	Endpoints: []string{"unix:///tmp/arangodb.sock"},
	// Connection pooling
	MaxIdleConns:    100,
	MaxOpenConns:    100,
	ConnMaxLifetime: 0, // Never close (local socket)
}
```
// Direct Unix socket connection
config := arango.Config{
	Endpoints: []string{"unix:///run/hades/readwrite/arangod.sock"},
	// Connection pooling
	MaxIdleConns:    100,
	MaxOpenConns:    100,
	ConnMaxLifetime: 0, // Never close (local socket)
}
🤖 Prompt for AI Agents
In docs/prd/go_grpc_neural_pathway_prd.md around lines 168 to 176, the example
connects directly to /tmp/arangodb.sock; update the Endpoints to use the Hades
proxy socket under /run/hades/ instead (for example: Endpoints:
[]string{"unix:///run/hades/arangodb-proxy.sock"}) and note to choose the
appropriate RO or RW proxy (e.g. arangodb-ro.sock or arangodb-rw.sock) depending
on the client role; ensure the example uses the proxy path and mention verifying
the proxy socket exists and has correct permissions.

Comment on lines +27 to +36
### PHP (Required for ArangoDB Bridge)
- PHP 8.3+ CLI
- PHP Extensions:
- php8.3-cli
- php8.3-curl
- php8.3-mbstring
- php8.3-zip
- Composer (PHP package manager)
- ArangoDB PHP driver (`triagens/arangodb`)


⚠️ Potential issue

Reframe PHP bridge as optional/legacy.

The new HTTP/2 client + proxies path should be the default; PHP bridge only if needed.

-### PHP (Required for ArangoDB Bridge)
+### PHP (Optional legacy ArangoDB bridge)
 ...
-- ArangoDB PHP driver (`triagens/arangodb`)
+- ArangoDB PHP driver (`triagens/arangodb`) — only if using the legacy PHP bridge
📝 Committable suggestion


Suggested change
### PHP (Required for ArangoDB Bridge)
- PHP 8.3+ CLI
- PHP Extensions:
- php8.3-cli
- php8.3-curl
- php8.3-mbstring
- php8.3-zip
- Composer (PHP package manager)
- ArangoDB PHP driver (`triagens/arangodb`)
### PHP (Optional legacy ArangoDB bridge)
- PHP 8.3+ CLI
- PHP Extensions:
- php8.3-cli
- php8.3-curl
- php8.3-mbstring
- php8.3-zip
- Composer (PHP package manager)
- ArangoDB PHP driver (`triagens/arangodb`) — only if using the legacy PHP bridge

Comment on lines +37 to +40
### Database
- ArangoDB 3.11+ (Graph database)
- Configured for Unix socket at `/tmp/arangodb.sock`


⚠️ Potential issue

Socket path and access method need update (HTTP/2 UDS proxies).

Docs should reflect RO/RW Go proxies and systemd-managed allowlists under /run/hades/… per PR.

-- ArangoDB 3.11+ (Graph database)
-- Configured for Unix socket at `/tmp/arangodb.sock`
+- ArangoDB 3.11+ (Graph database)
+- Access via HTTP/2 over Unix Domain Sockets through Go proxies:
+  - RO proxy and RW proxy sockets under `/run/hades/…` (systemd-managed allowlists)
📝 Committable suggestion


Suggested change
### Database
- ArangoDB 3.11+ (Graph database)
- Configured for Unix socket at `/tmp/arangodb.sock`
### Database
- ArangoDB 3.11+ (Graph database)
- Access via HTTP/2 over Unix Domain Sockets through Go proxies:
- RO proxy and RW proxy sockets under `/run/hades/…` (systemd-managed allowlists)
🤖 Prompt for AI Agents
In setup/system_requirements.md around lines 37 to 40, the Database section
still references a Unix socket at `/tmp/arangodb.sock`; update it to describe
using HTTP/2 UDS proxies instead of direct socket access, specify the RO/RW Go
proxy setup (read-only and read-write proxy roles), and document that
systemd-managed allowlists live under /run/hades/... with the recommended paths
and permissions; adjust the example paths from `/tmp/arangodb.sock` to the
/run/hades/ layout and add a short note about using HTTP/2 over UDS for proxy
communication and systemd-managed access control.

Comment on lines +69 to +77
### Python Dependencies
Managed via Poetry (`pyproject.toml`):
- torch 2.0+
- transformers
- sentence-transformers
- python-arango
- numpy, pandas
- And more...


⚠️ Potential issue

Python deps list is stale vs PR direction.

You removed sentence-transformers and standardized on Jina v4 in this PR; update the list accordingly.

-### Python Dependencies
-Managed via Poetry (`pyproject.toml`):
-- torch 2.0+
-- transformers
-- sentence-transformers
-- python-arango
-- numpy, pandas
-- And more...
+### Python Dependencies (managed via Poetry; see `pyproject.toml`)
+- torch 2.0+
+- transformers
+- jina-ai clients / JinaV4 embedder deps
+- numpy, pandas
+
+Removed: sentence-transformers (now optional) per dual-embedder policy.
📝 Committable suggestion


Suggested change
### Python Dependencies
Managed via Poetry (`pyproject.toml`):
- torch 2.0+
- transformers
- sentence-transformers
- python-arango
- numpy, pandas
- And more...
### Python Dependencies (managed via Poetry; see `pyproject.toml`)
- torch 2.0+
- transformers
- jina-ai clients / JinaV4 embedder deps
- numpy, pandas

Removed: sentence-transformers (now optional) per dual-embedder policy.
🤖 Prompt for AI Agents
In setup/system_requirements.md around lines 69 to 77, the Python dependency
list is stale: remove sentence-transformers from the enumerated list and
replace/add the standardized Jina dependency to reflect Jina v4 (e.g.,
"jina==4.x" or "jina>=4.0.0"). Keep other entries (torch, transformers,
python-arango, numpy, pandas) as needed and ensure the bullet list matches
pyproject.toml/Poetry dependencies.

Comment on lines +126 to +130
## Notes

- PHP is required because python-arango cannot use Unix sockets
- The PHP bridge (`core/database/arango/php_unix_bridge.php`) handles all database operations
- See `setup/php_arango_setup.md` for detailed PHP setup instructions

⚠️ Potential issue

Outdated statement about python-arango and UDS.

With HTTP/2 UDS proxies, this is no longer true; clarify both supported paths.

-- PHP is required because python-arango cannot use Unix sockets
-- The PHP bridge (`core/database/arango/php_unix_bridge.php`) handles all database operations
+- Default: use HTTP/2 over UDS via Go proxies (no PHP required)
+- Optional: PHP bridge (`core/database/arango/php_unix_bridge.php`) if you must talk to UDS directly without proxies
📝 Committable suggestion


Suggested change
## Notes
- PHP is required because python-arango cannot use Unix sockets
- The PHP bridge (`core/database/arango/php_unix_bridge.php`) handles all database operations
- See `setup/php_arango_setup.md` for detailed PHP setup instructions
## Notes
- Default: use HTTP/2 over UDS via Go proxies (no PHP required)
- Optional: PHP bridge (`core/database/arango/php_unix_bridge.php`) if you must talk to UDS directly without proxies
- See `setup/php_arango_setup.md` for detailed PHP setup instructions
🤖 Prompt for AI Agents
In setup/system_requirements.md around lines 126 to 130, the note claiming "PHP
is required because python-arango cannot use Unix sockets" is outdated; update
the paragraph to state both supported options: (1) python-arango can now
communicate over Unix Domain Sockets when routed through an HTTP/2 UDS proxy,
and (2) the existing PHP bridge (core/database/arango/php_unix_bridge.php)
remains supported for direct PHP-based UDS handling. Replace the single-line
assertion with two bullets that briefly describe each supported path, keep the
reference to setup/php_arango_setup.md for PHP setup, and add a short pointer to
documentation or a new note about configuring an HTTP/2 UDS proxy for
python-arango.

- Added dynamic attribute retrieval for LocalStorage, S3Storage, and RamFSStorage in storage module, raising ImportError when backends are unavailable.
- Improved error logging in workflow_arxiv_initial_ingest.py to use logger.exception for better stack trace visibility.
- Enhanced document processing in workflow_pdf.py to support late chunking and embedding configuration.
- Updated workflow_pdf_batch.py to replace WORKER_DB_MANAGER with WORKER_DB_CLIENT and adjusted related logic for embedding and database operations.
- Added transaction handling for embedding and chunk storage in workflow_pdf_batch.py, ensuring atomic operations.
- Updated documentation to reflect changes in deployment and connection configurations for ArangoDB.
- Improved setup scripts for better error handling and user feedback during environment setup.
- Refined verify_storage.py to align with new database structure and improve record verification logic.
- Updated ArXivManager to store papers, chunks, embeddings, and structures in new collections.
- Changed collection names in ArxivMetadataConfig and metadata_pipeline.yaml to reflect new schema.
- Modified workflow_arxiv_initial_ingest to handle new data structure and embedding configurations.
- Enhanced PDF workflows to support new embedding factory and deprecated old embedder types.
- Added logging for embedding initialization and processing.
- Updated setup scripts and documentation to align with new collection names and structures.
- Introduced a new Conveyance Framework for structured logging and performance evaluation.
- Improved database verification script to check for embeddings and chunk counts.
- Updated collection names in the academic citation toolkit and workflows to reflect changes from `arxiv_chunks` to `arxiv_abstract_chunks` and from `arxiv_papers` to `arxiv_metadata`.
- Modified the workflow initialization to use the new collection names.
- Adjusted documentation and README files to align with the new collection structure.
- Enhanced the database setup scripts to create the updated collections.
- Implemented a new benchmark test for ArangoDB HTTP/2 client, including GET, INSERT, and QUERY operations.
- Added a CLI tool for logging benchmark results with context parameters.
- Created unit tests for the memory client configuration to ensure proper socket resolution and client instantiation.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (19)
setup/setup_python_env.sh (3)

26-38: Fix uv install PATH and harden installer fetch (prevents immediate failure of next step).

  • uv installer typically drops binaries into ~/.local/bin, not ~/.cargo/bin. Without adding ~/.local/bin to PATH in this session, the subsequent "uv tool install poetry" can fail.
  • Add a curl check, enforce HTTPS defaults, and ensure temp file cleanup on error.
 if command -v uv &> /dev/null; then
     echo "✓ uv is already installed: $(uv --version)"
 else
     echo "Installing uv..."
-    install_script="$(mktemp)"
-    curl -LsSf https://astral.sh/uv/install.sh -o "$install_script"
-    sh "$install_script"
-    rm -f "$install_script"
+    if ! command -v curl >/dev/null 2>&1; then
+        echo "✗ curl is required to install uv. Please install curl." >&2
+        exit 1
+    fi
+    install_script="$(mktemp)"
+    trap 'rm -f "$install_script"' EXIT
+    curl -fsSL --proto '=https' --tlsv1.2 https://astral.sh/uv/install.sh -o "$install_script"
+    sh "$install_script"
 
     # Add to PATH for current session
-    export PATH="$HOME/.cargo/bin:$PATH"
+    export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
 
     # Add to shell profile if not already there
-    if ! grep -q "/.cargo/bin" ~/.bashrc; then
-        echo 'export PATH="$HOME/.cargo/bin:$PATH"' >> ~/.bashrc
-        echo "Added uv to PATH in ~/.bashrc"
-    fi
+    for rc in ~/.bashrc ~/.zshrc; do
+        [ -f "$rc" ] || continue
+        if ! grep -q '/.local/bin' "$rc"; then
+            echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$rc"
+            echo "Added ~/.local/bin to PATH in $rc"
+        fi
+        if ! grep -q '/.cargo/bin' "$rc"; then
+            echo 'export PATH="$HOME/.cargo/bin:$PATH"' >> "$rc"
+            echo "Added ~/.cargo/bin to PATH in $rc"
+        fi
+    done

12-18: Enforce Python 3.11+ requirement (currently only echoed, not validated).

Parsing and checking the version prevents subtle failures later.

-if command -v python3 &> /dev/null; then
-    PYTHON_VERSION=$(python3 --version)
-    echo "✓ Python found: $PYTHON_VERSION"
-else
-    echo "✗ Python3 not found. Please install Python 3.11 or higher"
-    exit 1
-fi
+if command -v python3 >/dev/null 2>&1; then
+    PY_VERSION_STR="$(python3 --version | awk '{print $2}')"
+    MAJ="$(echo "$PY_VERSION_STR" | cut -d. -f1)"
+    MIN="$(echo "$PY_VERSION_STR" | cut -d. -f2)"
+    if [ "$MAJ" -lt 3 ] || { [ "$MAJ" -eq 3 ] && [ "$MIN" -lt 11 ]; }; then
+        echo "✗ Python 3.11+ is required. Found $PY_VERSION_STR." >&2
+        exit 1
+    fi
+    echo "✓ Python found: $PY_VERSION_STR"
+else
+    echo "✗ python3 not found. Please install Python 3.11 or higher." >&2
+    exit 1
+fi

95-98: Avoid mixing Poetry and uv installs; this can pollute the wrong environment.

This installs outside the Poetry venv unless explicitly targeted. Prefer a single tool (Poetry) for the project env.

-    # Also ensure uv can manage packages
-    echo -e "\nSetting up uv for package management..."
-    if [ -f "requirements.txt" ]; then
-        echo "Installing from requirements.txt with uv..."
-        uv pip install -r requirements.txt
-    fi
+    # Poetry manages the project venv; avoid duplicating installs with uv here.
+    # If a requirements.txt is needed for downstream tools, export it:
+    # poetry export --without-hashes -f requirements.txt -o requirements.txt
core/tools/arxiv/arxiv_metadata_config.py (2)

254-259: Default config path is wrong (core/core double-prefix).

Path(__file__).parent.parent.parent / "core/config/workflows/..." resolves to core/core/... from this file’s location. Load from core/config/... with parents[2].

Apply:

-        default_path = Path(__file__).parent.parent.parent / "core/config/workflows/arxiv_metadata_default.yaml"
+        default_path = Path(__file__).parents[2] / "config/workflows/arxiv_metadata_default.yaml"

231-238: Respect configured num_workers in workflow dict.

Hard-coding num_workers: 1 ignores the validated field above and surprises callers.

Apply:

-            "num_workers": 1,  # Single-threaded for streaming
+            "num_workers": self.num_workers,
core/tools/arxiv/README.md (1)

7-13: Remove or restore arxiv_pipeline.py and update README/usage

core/tools/arxiv/arxiv_pipeline.py is not present; either restore it or remove/update all references.

  • If removed: delete the arxiv_pipeline.py entry from core/tools/arxiv/README.md (tree + usage example) and fix example commands to the actual module/path. (README refs at core/tools/arxiv/README.md lines ~11 and ~25.)
  • If restored: prefer module invocation: python -m core.tools.arxiv.arxiv_pipeline ....
  • Update other references: setup/verify_environment.py:136, setup/setup_local.sh:229, .gitignore:406, core/monitoring/__init__.py:109.
setup/setup_database.sh (1)

257-271: Fix .env variable mismatch for DB name.

You write ARXIV_DB_NAME using an unset ARXIV_DB_NAME var; should use DB_NAME’s value. This breaks consumer env.

Apply this diff:

-            sed -i "s/^ARXIV_DB_NAME=.*/ARXIV_DB_NAME=$ARXIV_DB_NAME/" .env
+            sed -i "s/^ARXIV_DB_NAME=.*/ARXIV_DB_NAME=$DB_NAME/" .env
@@
-            echo "ARXIV_DB_NAME=$ARXIV_DB_NAME" >> .env
+            echo "ARXIV_DB_NAME=$DB_NAME" >> .env
core/embedders/README.md (4)

65-74: Fix constructor usage: JinaV4Embedder expects a config, not device kwarg.

Samples pass device as a named arg; the class takes a single “config” param. Provide EmbeddingConfig or dict.

Apply this diff:

-from core.embedders import JinaV4Embedder
-
-# Initialize with configuration
-embedder = JinaV4Embedder(
-    model_name="jinaai/jina-embeddings-v4",
-    device="cuda",
-    use_fp16=True,  # Memory efficient
-    trust_remote_code=True
-)
+from core.embedders import JinaV4Embedder
+from core.embedders.embedders_base import EmbeddingConfig
+
+config = EmbeddingConfig(
+    model_name="jinaai/jina-embeddings-v4",
+    device="cuda",
+    use_fp16=True,
+    batch_size=32,
+)
+embedder = JinaV4Embedder(config)

Also applies to: 68-73


75-86: Align method names: use embed_texts/embed_single instead of embed/embed_batch.

API in code exposes embed_texts and embed_single. Update examples accordingly.

Apply this diff:

-vector = embedder.embed([text])[0]  # Shape: (1024,) or (2048,)
+vector = embedder.embed_texts([text])[0]  # Shape: (2048,)
@@
-embedder = JinaV4Embedder(device="cuda")
+embedder = JinaV4Embedder(EmbeddingConfig(device="cuda", model_name="jinaai/jina-embeddings-v4"))
@@
-embeddings = embedder.embed_batch(
+embeddings = embedder.embed_texts(
     texts,
-    batch_size=32,  # GPU memory dependent
+    batch_size=32,  # GPU memory dependent
     show_progress=True
 )

Also applies to: 186-198, 203-215


417-429: Remove non-existent “dimensions” ctor arg; document how to select 1024 vs 2048 properly.

The class doesn’t accept dimensions; selecting adapters is model-specific. Update example or add a config flag wired in code.

Apply this diff to fix the sample and note adapter selection:

-embedder = JinaV4Embedder(
-    model_name="jinaai/jina-embeddings-v4",
-    dimensions=1024
-)
+embedder = JinaV4Embedder(EmbeddingConfig(model_name="jinaai/jina-embeddings-v4"))
+# Note: 1024 vs 2048 depends on the selected adapter; see embedders_jina.py for adapter config.
@@
-embedder = JinaV4Embedder(
-    model_name="jinaai/jina-embeddings-v4",
-    dimensions=2048
-)
+embedder = JinaV4Embedder(EmbeddingConfig(model_name="jinaai/jina-embeddings-v4"))

241-246: Update database example to new memory client (avoid ArangoClient import).

Examples reference a non-existent ArangoClient; switch to DatabaseFactory.get_arango_memory_service.

Apply this diff:

-from core.database import ArangoClient
+from core.database.database_factory import DatabaseFactory
@@
-db = ArangoClient()
+db = DatabaseFactory.get_arango_memory_service()

Also applies to: 543-547

core/tools/rag_utils/academic_citation_toolkit.py (3)

151-156: Use consistent key (arxiv_id) for chunks; “paper_id” likely stale.

Elsewhere you join on arxiv_id; this mismatch will silently return no chunks.

Apply this diff:

-            FOR chunk IN arxiv_abstract_chunks
-                FILTER chunk.paper_id == @paper_id
+            FOR chunk IN arxiv_abstract_chunks
+                FILTER chunk.arxiv_id == @paper_id
                 SORT chunk.chunk_index ASC
                 RETURN chunk.text

Also applies to: 143-148


113-135: Coupling to python-arango blocks memory client adoption.

Provider/Storage call arango_client.db(...).aql.execute; refactor to accept ArangoMemoryClient (execute_query) as well.

I can draft a compatibility shim that detects the client type and routes queries accordingly. Want me to push a patch?

Also applies to: 247-269
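Rough shape of the shim; execute_query is the memory-client method named in this review, and the python-arango path uses db(...).aql.execute as quoted above, so verify both against the code:

# Rough shim: route AQL through whichever client was passed in.
from typing import Any, Dict, List, Optional

def run_aql(client: Any, query: str,
            bind_vars: Optional[Dict[str, Any]] = None,
            database: str = "arxiv_repository") -> List[Any]:
    bind_vars = bind_vars or {}
    if hasattr(client, "execute_query"):
        # ArangoMemoryClient-style API (HTTP/2 over UDS).
        return list(client.execute_query(query, bind_vars=bind_vars))
    # Legacy python-arango client.
    db = client.db(database)
    return list(db.aql.execute(query, bind_vars=bind_vars))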


270-309: Wrap bibliography upsert in a transaction (atomicity).

Per guidelines, database writes should be all-or-nothing. Current per-doc insert/update can partially succeed.

I can provide an execute_transaction-based upsert for ArangoMemoryClient, or a python-arango Transaction for legacy paths. Proceed?
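For the legacy python-arango path, the upsert could look roughly like this; the collection name and overwrite_mode choice are assumptions, and the ArangoMemoryClient variant would go through its execute_transaction instead:

# Sketch only: all bibliography writes commit together or not at all.
def store_bibliography_atomically(db, entries, collection: str = "bibliography"):
    txn_db = db.begin_transaction(write=[collection])
    try:
        col = txn_db.collection(collection)
        for entry in entries:
            # overwrite_mode="update" gives per-document upsert semantics.
            col.insert(entry, overwrite_mode="update")
        txn_db.commit_transaction()
    except Exception:
        txn_db.abort_transaction()
        raise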

core/embedders/embedders_jina.py (1)

886-899: Ensure chunk embeddings match EMBEDDING_DIM (contract).

Mean-pooled token embeddings may not be 2048; downstream expects EMBEDDING_DIM. Provide a projection/fallback.

Apply this diff:

-            # Convert to numpy and normalize
-            chunk_embedding_np = chunk_embedding.numpy().astype(np.float32, copy=False)
+            # Convert to numpy and normalize
+            chunk_embedding_np = chunk_embedding.numpy().astype(np.float32, copy=False)
             norm = np.linalg.norm(chunk_embedding_np)
             if norm > 0:
                 chunk_embedding_np = chunk_embedding_np / norm
+
+            # Guarantee dimensionality matches EMBEDDING_DIM
+            if chunk_embedding_np.shape[0] != self.EMBEDDING_DIM:
+                # Fallback: use model's pooled embedding for the chunk text to get 2048-D
+                # (keeps contract; acceptable cost for correctness)
+                safe_text = text[int(start_char):int(end_char)] or text
+                pooled = self.embed_texts([safe_text], task="retrieval")[0]
+                chunk_embedding_np = pooled.astype(np.float32, copy=False)

Also applies to: 892-897

core/processors/text/chunking_strategies.py (2)

292-296: Semantic chunking bug: _clean_text destroys paragraph boundaries.

You collapse newlines to spaces before splitting paragraphs, so _split_paragraphs sees a single paragraph.

Fix by preserving newlines for semantic chunking:

-        text = self._clean_text(text)
+        # For semantic splitting, preserve paragraph boundaries; only strip NULs/edges
+        text = text.replace('\x00', '').strip()

389-421: Duplicate chunk_index values in forced splits.

_force_split_text starts indices at 0 every time; when extending the outer list, indices collide.

Pass a starting index or compute from the outer list:

-    def _force_split_text(self, text: str, start_char: int) -> List[TextChunk]:
+    def _force_split_text(self, text: str, start_char: int, start_index: int = 0) -> List[TextChunk]:
@@
-        for i in range(0, len(words), self.max_chunk_size):
+        for i in range(0, len(words), self.max_chunk_size):
@@
-            chunks.append(TextChunk(
+            chunks.append(TextChunk(
                 text=chunk_text,
                 start_char=start_char,
                 end_char=start_char + len(chunk_text),
-                chunk_index=len(chunks),
+                chunk_index=start_index + len(chunks),
                 metadata={
                     'strategy': 'semantic_forced_split',
                     'token_count': len(chunk_words)
                 }
             ))

And update the call site:

-                    chunks.extend(self._force_split_text(para, current_chunk_start))
+                    chunks.extend(self._force_split_text(para, current_chunk_start, start_index=len(chunks)))
core/workflows/__init__.py (1)

36-50: Potential backward-compat break: ArxivMetadataWorkflow removed from exports

If external code imports ArxivMetadataWorkflow, this is a breaking change. Provide a temporary alias to preserve API.

 __all__ = [
@@
 ]
 
 if PDFWorkflow is not None:
     __all__.append('PDFWorkflow')
@@
 if ArxivInitialIngestWorkflow is not None:
     __all__.append('ArxivInitialIngestWorkflow')
+
+# Temporary backward-compat alias (to be deprecated in a future release)
+if ArxivInitialIngestWorkflow is not None:
+    ArxivMetadataWorkflow = ArxivInitialIngestWorkflow  # type: ignore[assignment]
+    __all__.append('ArxivMetadataWorkflow')
core/workflows/workflow_pdf.py (1)

1-581: Restore a real SentenceTransformers throughput path or document its intentional removal

Verified: Jina Embeddings v4 is a distinct model and cannot load Sentence‑Transformers checkpoints (e.g., all‑mpnet‑base‑v2); ST models require a separate SentenceTransformers pipeline. The current code deprecates ST and forces a JinaV4 fallback, which breaks the guideline allowing SentenceTransformers for high‑throughput paths.

  • Option A — Reintroduce a true ST path: implement/retain a SentenceTransformersEmbedder (or separate service) and select it when config.embedder_type in {'sentence','sentence-transformers'}; ensure EmbedderFactory.create or the __init__ branch explicitly chooses the ST embedder. (core/workflows/workflow_pdf.py::__init__ — embedder selection block)
  • Option B — Intentionally remove ST: update guidelines/docs and remove/adjust ST-related config, warnings, and misleading defaulting logic in the same embedder selection block.
🧹 Nitpick comments (46)
setup/setup_python_env.sh (5)

65-78: Make project-root resolution portable and bash‑only explicit.

readlink -f is absent on macOS; relying on BASH_SOURCE is fine—simplify and resolve symlinks with pwd -P. Also guard that the script runs under bash.

-# Get the directory where this script is located, then go to parent (project root)
-_src="${BASH_SOURCE[0]}"
-if command -v readlink >/dev/null 2>&1; then
-    resolved="$(readlink -f "${_src}" 2>/dev/null || true)"
-    if [ -n "$resolved" ]; then
-        _src="$resolved"
-    fi
-fi
-SCRIPT_DIR="$(cd "$(dirname "${_src}")" && pwd)"
-PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
-if ! cd "$PROJECT_ROOT"; then
+# Get the directory where this script is located (bash-only), then go to parent (project root)
+: "${BASH_VERSION:?This script must be run with bash}"
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P)"
+PROJECT_ROOT="$(cd -- "$SCRIPT_DIR/.." && pwd -P)"
+if ! cd "$PROJECT_ROOT"; then
     echo "✗ Failed to enter project root: $PROJECT_ROOT" >&2
     exit 1
 fi

137-147: Don’t run “poetry shell” inside a script; source the venv instead.

poetry shell spawns an interactive subshell and is brittle when invoked from a non-interactive script.

 cat > activate_env.sh << 'EOF'
 #!/bin/bash
 # Activate HADES-Lab Python environment
 
-# Ensure uv and poetry are in PATH
-export PATH="$HOME/.cargo/bin:$HOME/.local/bin:$PATH"
-
-# Activate Poetry virtual environment
-echo "Activating Poetry virtual environment..."
-poetry shell
+# Ensure uv and poetry are in PATH
+export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
+
+# Activate Poetry virtual environment
+VENV_ACTIVATE="$(poetry env info -p 2>/dev/null)/bin/activate"
+if [ -f "$VENV_ACTIVATE" ]; then
+  # shellcheck disable=SC1090
+  source "$VENV_ACTIVATE"
+else
+  echo "Poetry venv not found. Creating it..."
+  poetry install
+  # shellcheck disable=SC1091
+  source "$(poetry env info -p)/bin/activate"
+fi
 EOF

108-126: Add UDS/HTTP2 proxy env hints to .env.template to match Phase 4.

Expose RW/RO socket paths managed under /run/hades to align with the new proxies.

 ARANGO_PASSWORD=your-arango-password
 ARANGO_HOST=localhost
 ARANGO_PORT=8529
+# If using HTTP/2 over Unix sockets via proxies (Phase 4), set these:
+# ARANGO_RW_SOCKET=/run/hades/arango-rw.sock
+# ARANGO_RO_SOCKET=/run/hades/arango-ro.sock

4-4: Harden shell safety flags.

Prefer -Eeuo pipefail and a safe IFS for predictable failures.

-set -e
+set -Eeuo pipefail
+IFS=$'\n\t'

172-172: Prefer “poetry run python” (or “uv run …”), not plain “python”.

Plain python may bypass the project venv.

-echo "   python script.py"
+echo "   poetry run python script.py"
pyproject.toml (2)

58-59: Dev-only gRPC deps OK; consider isolating to a separate “codegen” group.

Optional: move grpcio/grpcio-tools to [tool.poetry.group.codegen.dependencies] to avoid pulling them for general dev envs that don’t regenerate stubs.

+[tool.poetry.group.codegen.dependencies]
+grpcio = "^1.75.0"
+grpcio-tools = "^1.75.0"

67-84: Permissive mypy baseline is fine; gate new Arango code stricter.

Recommend raising the bar for new/changed modules (e.g., core/database/arango/**) while keeping global permissive defaults.

+[[tool.mypy.overrides]]
+module = "core.database.arango.*"
+disallow_untyped_defs = true
+check_untyped_defs = true
+warn_return_any = true
core/tools/arxiv/README.md (1)

63-63: Broken config filename in README.

The repo contains configs/metadata_pipeline.yaml; README points to configs/acid_pipeline_phased.yaml. Align to avoid 404s.

Apply:

-See `configs/acid_pipeline_phased.yaml` for pipeline configuration options.
+See `configs/metadata_pipeline.yaml` for pipeline configuration options.
.coderabbit.yaml (1)

55-55: Excluding tests from review hides regressions.

Consider narrowing instead of blanket exclusion (e.g., heavy data/long-running tests only) so test diffs still get lint/eyes.

Apply:

-    - "!**/tests/**"             # Exclude tests from review scope
+    # - "!**/tests/**"           # Consider keeping tests in scope; exclude only heavy suites if needed
setup/verify_storage.py (2)

41-44: Remove pointless f-strings (Ruff F541).

These prints have no interpolations; drop the leading f for clarity and to satisfy linters.

Apply this diff:

-    print(f"\nCollection Counts:")
+    print("\nCollection Counts:")
-    print(f"\nConsistency Check:")
+    print("\nConsistency Check:")
-    print(f"\nSample Recent Record:")
+    print("\nSample Recent Record:")
-    print(f"\nRecent Processing Activity:")
+    print("\nRecent Processing Activity:")

Also applies to: 47-48, 55-56, 107-108


103-105: Narrow exception handling and import the service error.

Catching bare Exception hides actionable failures. Prefer the client’s error type and log details.

Apply this diff:

+from core.database.arango import MemoryServiceError
@@
-    except Exception as e:
-        print(f"  Error checking recent record: {e}")
+    except MemoryServiceError as e:
+        print(f"  Error checking recent record: {e}", file=sys.stderr)
+    except Exception as e:
+        print(f"  Unexpected error checking recent record: {e}", file=sys.stderr)
@@
-    except Exception as e:
-        print(f"  Could not check recent activity: {e}")
+    except MemoryServiceError as e:
+        print(f"  Could not check recent activity: {e}", file=sys.stderr)
+    except Exception as e:
+        print(f"  Unexpected error checking recent activity: {e}", file=sys.stderr)

Also applies to: 121-123, 15-16

docs/benchmarks/arango_phase4_summary.md (2)

15-24: Surround tables with blank lines (MD058).

Add a blank line before and after each table block.

Apply this diff (illustrative for first table; repeat for others):

-## GET Latency (ms)
-| Scenario | Concurrency | TTFB p95 | TTFB p99 | E2E p95 | E2E p99 |
+## GET Latency (ms)
+
+| Scenario | Concurrency | TTFB p95 | TTFB p99 | E2E p95 | E2E p99 |
@@
-| Cold concurrency 128 (`get_cold_c128.json`) | 128 | 32.349 | 32.694 | 61.206 | 62.886 |
+| Cold concurrency 128 (`get_cold_c128.json`) | 128 | 32.349 | 32.694 | 61.206 | 62.886 |
+

Also applies to: 25-31, 32-43


3-14: Explicitly state “Time stance” and α in Scorecard text.

Add a “Time stance: Efficiency” row and α value per guidelines to the scorecard body (not only notes).

Example:

 | **Ctx** | 0.90 | L=I=A=G=0.90 via observability + isolation controls. |
-| **Conveyance** | **0.70** | α = 1.7 ⇒ (W·R·H/T)·Ctx^α. |
+| **α** | 1.7 | Exponent applied only to Ctx. |
+| **Time stance** | Efficiency | T used as divisor (lower is better). |
+| **Conveyance** | **0.70** | (W·R·H/T)·Ctx^α. |
setup/setup_database.sh (2)

173-180: Remove unnecessary python-arango install in memory-client path.

The setup flow now uses the HTTP/2 memory client; installing python-arango is no longer required here.

Apply this diff:

-print_info "Checking Python environment..."
-if ! python3 -c "import arango" 2>/dev/null; then
-    print_warning "python-arango not installed"
-    print_info "Installing with pip..."
-    pip install python-arango
-fi
+# Memory client path does not require python-arango

216-219: Guard jq dependency or avoid it.

COLLECTIONS_JSON uses jq; add a check or pure bash/printf fallback to avoid runtime failures.

Would you like me to provide a jq-less fallback for COLLECTIONS_JSON generation?

core/database/arango/optimized_client.py (1)

150-158: Optional: proactively delete cursor when done.

Arango auto-closes on exhaustion, but issuing DELETE reduces server bookkeeping under churn.

Example patch (after loop completes and cursor_id remains):

-        while data.get("hasMore") and cursor_id:
+        while data.get("hasMore") and cursor_id:
             follow_path = f"/_db/{self._config.database}/_api/cursor/{cursor_id}"
             data = self._handle_response(self._client.put(follow_path))
             results.extend(data.get("result", []))
             cursor_id = data.get("id")
 
+        if cursor_id:
+            try:
+                self._client.delete(f"/_db/{self._config.database}/_api/cursor/{cursor_id}")
+            except Exception:
+                pass
core/tools/rag_utils/academic_citation_toolkit.py (1)

768-773: Avoid hard-coded host; use env or config.

Leaking 192.168.* in repo is brittle. Read ARANGO_BASE_URL or sockets from env.

Apply this diff:

-    client = ArangoClient(hosts='http://192.168.1.69:8529')
+    client = ArangoClient(hosts=os.getenv('ARANGO_BASE_URL', 'http://localhost:8529'))
core/embedders/embedders_jina.py (1)

693-701: Clarify truncation behavior and provide a long-doc path.

You warn of truncation; recommend process_long_document() explicitly when estimated_tokens > MAX_TOKENS.

Example:

-            logger.warning(
+            logger.warning(
                 f"Document will be truncated from ~{estimated_tokens} to {self.MAX_TOKENS} tokens. "
                 f"Consider using process_long_document() for documents > 32k tokens."
             )
+            # Early exit suggestion:
+            # return self.process_long_document(text, task)
core/tools/arxiv/arxiv_api_client.py (3)

420-431: Use logging.exception and avoid blind except.

Preserve tracebacks for diagnosis and restrict catches to RequestException/IOError here.

Apply this diff:

-        except Exception as e:
-            logger.error(f"Failed to download PDF for {arxiv_id}: {e}")
+        except (requests.exceptions.RequestException, IOError) as e:
+            logger.exception("Failed to download PDF for %s", arxiv_id)

464-472: Same for LaTeX path: structured exceptions + traceback.

Mirror the PDF branch for consistency.

Apply this diff:

-            except Exception as e:
-                logger.warning(f"Failed to download LaTeX for {arxiv_id}: {e}")
+            except (requests.exceptions.RequestException, IOError) as e:
+                logger.exception("Failed to download LaTeX for %s", arxiv_id)

84-116: Optional: leverage requests Retry instead of custom loop.

Mount an HTTPAdapter with Retry(backoff_factor, status_forcelist) to simplify _make_request and centralize backoff.

I can provide a minimal Retry config (429/5xx with exponential backoff) if you want to drop the manual retry loop.

Also applies to: 131-166

core/logging/__init__.py (1)

3-11: Remove unused noqa and export cleanup.

  • The # noqa: F401 directives are flagged as unused (RUF100). They’re unnecessary if F401 isn’t enabled, and re-exports via __all__ are acceptable.
  • Sort __all__ to satisfy RUF022.

Apply:

-from .logging import LogManager  # noqa: F401
-from .conveyance import (  # noqa: F401
+from .logging import LogManager
+from .conveyance import (
     ConveyanceContext,
     TIME_UNITS,
     build_record,
     compute_conveyance,
     load_metric,
     log_conveyance,
 )
 
-__all__ = [
-    "LogManager",
-    "ConveyanceContext",
-    "TIME_UNITS",
-    "build_record",
-    "compute_conveyance",
-    "load_metric",
-    "log_conveyance",
-]
+__all__ = [
+    "ConveyanceContext",
+    "LogManager",
+    "TIME_UNITS",
+    "build_record",
+    "compute_conveyance",
+    "load_metric",
+    "log_conveyance",
+]
README.md (1)

61-69: Specify language for fenced code block.

Add a language to satisfy MD040 and improve highlighting.

-```
+```bash
 poetry run python tests/benchmarks/arango_connection_test.py \
     --socket /run/hades/readonly/arangod.sock \
     --database arxiv_repository \
     --collection arxiv_metadata \
     --key 0704_0001 --key 0704_0002 \
     --iterations 20 --concurrency 4 \
     --report-json reports/get_hot.json

CLAUDE.md (1)

127-129: Align embedder guidance with current standardization.

PR claims standardizing on Jina v4. If SentenceTransformers is deprecated, update:

-* **High throughput (48+ papers/sec)**: Use `SentenceTransformersEmbedder`
-* **Sophisticated processing**: Use `JinaV4Embedder` (transformers)
-* **Both MUST use late chunking**: This is non-negotiable
+* **Primary**: `JinaV4Embedder` (32k context, 2048 dims). Late chunking is mandatory.
+* Note: SentenceTransformers path is deprecated; remove or gate behind docs for legacy runs.
core/processors/text/chunking_strategies.py (2)

20-20: Remove unused NumPy import.

numpy as np is unused; drop it to reduce deps and lint noise.

-import numpy as np

188-191: Reduce log verbosity in hot paths.

These INFO logs trigger per-document. Drop to DEBUG.

-            logger.info("Created 0 token-based chunks from 0 tokens")
+            logger.debug("Created 0 token-based chunks from 0 tokens")
@@
-        logger.info(f"Created {len(chunks)} token-based chunks from {len(tokens)} tokens")
+        logger.debug("Created %d token-based chunks from %d tokens", len(chunks), len(tokens))
@@
-        logger.info(f"Created {len(chunks)} sliding window chunks with {self.step_size}/{self.window_size} step/window")
+        logger.debug("Created %d sliding-window chunks (step/window=%d/%d)", len(chunks), self.step_size, self.window_size)

Also applies to: 241-242, 574-575

core/database/arango/php_unix_bridge.php (2)

224-229: Return available operations based on a single source of truth.

Optional: define $ops = [...] once and reuse for routing and help to avoid drift.


1-248: Consider archiving this bridge.

The repo now routes via ArangoMemoryClient + proxies. Move this script to Acheron or gate behind docs-only usage to avoid divergence.

core/database/database_factory.py (2)

33-49: Default database mismatch (“academy_store” vs project default).

Elsewhere defaults use arxiv_repository. Align to prevent accidental misconfig on callers that rely on defaults.

-        database: str = "academy_store",
+        database: str = "arxiv_repository",

141-143: Log full traceback on DB connection errors.

Use logger.exception for PostgreSQL failures.

-            logger.error(f"Failed to connect to PostgreSQL: {e}")
+            logger.exception("Failed to connect to PostgreSQL")
core/monitoring/__init__.py (1)

282-318: Validate alpha and docstring language to match framework.

  • Enforce 1.5 ≤ alpha ≤ 2.0 (match logging.conveyance).
  • Replace the literal “α” to satisfy linters and improve portability.
-def calculate_conveyance(where: float, what: float, who: float, time: float,
-                        context: float, alpha: float = 1.7) -> float:
+def calculate_conveyance(where: float, what: float, who: float, time: float,
+                        context: float, alpha: float = 1.7) -> float:
@@
-    # Efficiency view: C = (W·R·H/T)·Ctx^α
+    # Efficiency view: C = (W·R·H/T)·Ctx^alpha
+    if not (1.5 <= float(alpha) <= 2.0):
+        raise ValueError("alpha must be between 1.5 and 2.0 (inclusive)")
docs/CONVEYANCE_FRAMEWORK.md (1)

81-89: Add a “Conveyance Scorecard” output block

The coding guidelines require an explicit scorecard (time stance, W/R/H/T, L/I/A/G in [0,1], α, zero‑propagation check, evidence links). Suggest adding an example template after this section.

  For each run/condition:
@@
   * Protocol: model/decoding params, retrieval policy, steps/halting, dataset split.
 
 **All analyses, critiques, and designs must conform to this framework and explicitly state which view (efficiency vs capability) is used and why.**
+
+### Conveyance Scorecard (template)
+- Time stance: Efficiency|Capability
+- W/R/H/T: W=…, R=…, H=…, T=… (units)
+- Ctx: L=…, I=…, A=…, G=… (weights wL/wI/wA/wG; Ctx=…)
+- α: 1.7 (default)  [applied to Ctx only]
+- Zero‑propagation: passed|failed (reason if failed)
+- Evidence: links to logs, configs, datasets, metrics dashboards
core/workflows/__init__.py (1)

19-31: Avoid type: ignore on import; prefer TYPE_CHECKING or importlib

Using “type: ignore[assignment]” on import is noisy. Consider this pattern instead.

-from .workflow_pdf import PDFWorkflow  # type: ignore[assignment]
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from .workflow_pdf import PDFWorkflow as _PDFWorkflow
+try:
+    from .workflow_pdf import PDFWorkflow
+except ImportError:
+    PDFWorkflow = None  # type: ignore[assignment]

(Apply similarly for BatchPDFWorkflow and ArxivInitialIngestWorkflow.)

setup/setup_local.sh (2)

58-85: PHP/composer check likely obsolete; gate behind a switch

This PR standardizes on the HTTP/2 memory client; the PHP bridge check adds noise for most setups. Gate it behind NEED_PHP_BRIDGE or remove.

-# Check PHP version (required for ArangoDB bridge)
-print_info "Checking PHP version..."
-if command -v php &> /dev/null; then
+if [[ "${NEED_PHP_BRIDGE:-0}" -eq 1 ]]; then
+  print_info "Checking PHP version..."
+  if command -v php &> /dev/null; then
     ...
-else
-    print_warning "PHP not installed"
-    echo "  PHP is required for ArangoDB Unix socket connections"
-    echo "  Install with: apt install php8.3-cli php8.3-curl php8.3-mbstring php8.3-zip"
-    echo "  See setup/php_arango_setup.md for full instructions"
-fi
+  else
+    print_warning "PHP not installed"
+    echo "  See setup/php_arango_setup.md for full instructions"
+  fi
+fi

193-205: Template/default mismatch: ARANGO_HOST

Runtime default is 127.0.0.1 but template sets 192.168.1.69. Align the template.

-ARANGO_HOST=192.168.1.69
+ARANGO_HOST=127.0.0.1
core/database/utils/check_resume_state.py (4)

22-34: Over‑broad exception catch

Catching Exception masks actionable errors. Narrow to specific exceptions and re‑raise unknowns.

-    except Exception as exc:  # pragma: no cover - diagnostics
-        print(f"[warn] Unexpected error counting '{collection}': {exc}")
-        return 0
+    except (TypeError, ValueError) as exc:  # pragma: no cover - diagnostics
+        print(f"[warn] Unexpected error counting '{collection}': {exc}")
+        return 0

36-67: Assumes processing_timestamp exists; add fallbacks

Many pipelines don't set doc.processing_timestamp. Fall back to embedding.created_at or update_date.

-                SORT doc.processing_timestamp DESC NULLS LAST
+                LET ts = FIRST(
+                    FOR e IN arxiv_abstract_embeddings
+                        FILTER e.arxiv_id == doc.arxiv_id
+                        SORT e.created_at DESC NULLS LAST
+                        LIMIT 1
+                        RETURN e.created_at
+                )
+                SORT ts DESC NULLS LAST
@@
-                    processing_timestamp: doc.processing_timestamp,
+                    processing_timestamp: NOT_NULL(ts, doc.processing_timestamp, doc.update_date),

85-106: Recent activity filter depends on missing timestamp

Use embedding created_at as the activity clock if processing_timestamp is absent.

-            FOR doc IN arxiv_metadata
-                FILTER DATE_TIMESTAMP(doc.processing_timestamp) >= @ts
+            FOR e IN arxiv_abstract_embeddings
+                FILTER DATE_TIMESTAMP(e.created_at) >= @ts
                 COLLECT WITH COUNT INTO count
                 RETURN count

118-134: Backlog query can be heavy with NOT IN on large arrays

Prefer anti‑join pattern used elsewhere to leverage indexes.

-            LET processed = (
-                FOR emb IN arxiv_abstract_embeddings
-                    RETURN DISTINCT emb.arxiv_id
-            )
-                FOR doc IN arxiv_metadata
-                    FILTER doc.abstract != null
-                    FILTER doc.abstract_length >= @min_len
-                    FILTER doc.arxiv_id NOT IN processed
+            FOR doc IN arxiv_metadata
+                FILTER doc.abstract != null
+                FILTER doc.abstract_length >= @min_len
+                LET has_emb = FIRST(
+                    FOR emb IN arxiv_abstract_embeddings
+                        FILTER emb.arxiv_id == doc.arxiv_id
+                        LIMIT 1
+                        RETURN 1
+                )
+                FILTER has_emb == null
                 SORT doc.abstract_length ASC
                 LIMIT 10
                 RETURN {id: doc.arxiv_id, length: doc.abstract_length}
core/workflows/workflow_pdf.py (2)

516-554: Batch embeddings for throughput and fix zero‑vector dim

Embedding per chunk is slow; also, fallback zero‑vector should match the actual embedder dimension.

-embedded_chunks = []
-
-for i, chunk in enumerate(chunks):
-    embeddings = self.embedder.embed_texts([chunk['text']], batch_size=1)
-    embedding = embeddings[0] if len(embeddings) > 0 else None
-    ...
-    embedded_chunk = ChunkWithEmbedding(
-        ...
-    )
-    embedded_chunks.append(embedded_chunk)
-
-return embedded_chunks
+texts = [c['text'] for c in chunks]
+embs = self.embedder.embed_texts(texts, batch_size=max(1, self.config.batch_size))
+# Determine output dimension once
+out_dim = None
+for e in embs:
+    if e is not None:
+        _arr = np.asarray(e, dtype=np.float32).reshape(-1)
+        out_dim = _arr.shape[0]
+        break
+if out_dim is None:
+    out_dim = getattr(self.embedder, "output_dimension", self.config.embedding_dim)
+
+embedded_chunks: List[ChunkWithEmbedding] = []
+for i, (chunk, e) in enumerate(zip(chunks, embs)):
+    if e is not None:
+        arr = np.asarray(e, dtype=np.float32).reshape(-1)
+    else:
+        arr = np.zeros(out_dim, dtype=np.float32)
+    embedded_chunks.append(ChunkWithEmbedding(
+        text=chunk['text'],
+        embedding=arr,
+        start_char=chunk.get('start_char', 0),
+        end_char=chunk.get('end_char', len(chunk['text'])),
+        start_token=chunk.get('start_token', 0),
+        end_token=chunk.get('end_token', 0),
+        chunk_index=i,
+        total_chunks=len(chunks),
+        context_window_used=len(chunk['text'].split())
+    ))
+return embedded_chunks

1-581: Add tests for late‑chunking “no tail loss” and embedding dim consistency

I can draft unit tests to assert last chunk coverage and consistent dims across batches.
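
A rough pytest sketch of what those tests could assert (helper names and import path are assumptions; adjust to the actual embedder API):

```python
# Hypothetical tests -- assumes JinaV4Embedder exposes embed_with_late_chunking(text)
# returning chunks with .end_char and .embedding attributes.
import numpy as np
import pytest

from core.embedders import JinaV4Embedder  # assumed import path


@pytest.fixture(scope="module")
def embedder():
    return JinaV4Embedder()


def test_late_chunking_covers_tail(embedder):
    text = "word " * 5000  # long enough to force multiple chunks
    chunks = embedder.embed_with_late_chunking(text)
    assert chunks, "expected at least one chunk"
    # The final chunk must reach the end of the input, i.e. no tail loss.
    assert max(c.end_char for c in chunks) >= len(text.rstrip())


def test_embedding_dims_consistent(embedder):
    chunks = embedder.embed_with_late_chunking("short text " * 200)
    dims = {np.asarray(c.embedding).reshape(-1).shape[0] for c in chunks}
    assert len(dims) == 1, f"inconsistent embedding dims: {dims}"
```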

core/workflows/workflow_pdf_batch.py (2)

256-256: Consider catching specific exceptions instead of broad Exception.

The static analyzer correctly flags this. Consider catching specific exceptions like TimeoutError, ProcessorError, or other expected failure modes to avoid masking unexpected errors.

-                    except Exception as e:
+                    except (TimeoutError, RuntimeError) as e:

389-458: Consider validating embedding configuration values.

The embedding worker initialization has good fallback logic, but consider adding validation for the numeric parameters (batch_size, chunk_size_tokens, chunk_overlap_tokens) to ensure they're within reasonable ranges.

     embed_config = EmbeddingConfig(
         model_name=model_name,
         device=embed_settings.get('device', 'cuda'),
-        batch_size=embed_settings.get('batch_size', 128),
+        batch_size=min(max(embed_settings.get('batch_size', 128), 1), 512),
         use_fp16=embed_settings.get('use_fp16', True),
-        chunk_size_tokens=embed_settings.get('chunk_size_tokens', 1000),
-        chunk_overlap_tokens=embed_settings.get('chunk_overlap_tokens', 200),
+        chunk_size_tokens=min(max(embed_settings.get('chunk_size_tokens', 1000), 100), 8192),
+        chunk_overlap_tokens=min(max(embed_settings.get('chunk_overlap_tokens', 200), 0), 1000),
     )
core/logging/conveyance.py (1)

248-255: Sort the __all__ list for consistency.

The static analyzer suggests sorting the exports alphabetically.

 __all__ = [
+    "build_record",
+    "compute_conveyance",
     "ConveyanceContext",
-    "TIME_UNITS",
-    "build_record",
-    "compute_conveyance",
+    "load_metric",
+    "log_conveyance",
-    "load_metric",
-    "log_conveyance",
+    "TIME_UNITS",
 ]
core/workflows/workflow_arxiv_initial_ingest.py (1)

442-442: Consider catching specific database exceptions.

Replace broad Exception catch with specific database-related exceptions for better error handling.

-                        except Exception as e:
+                        except (ArangoHttpError, MemoryServiceError) as e:

Also applies to: 456-456

core/database/arango/memory_client.py (1)

143-143: Consider using a custom exception class.

Instead of raising a generic ValueError with a long message, consider creating a specific exception class for configuration errors.

class ArangoConfigError(ValueError):
    """Raised when ArangoDB configuration is invalid or incomplete."""
    pass

# Then use it as:
raise ArangoConfigError("ArangoDB password required (set ARANGO_PASSWORD env var)")

Comment on lines +103 to +116
- path: "core/workflows/**/*.py"
instructions: |
Embedding strategy expectations:
- Prefer JinaV4Embedder with late chunking for high-fidelity workflows.
- When SentenceTransformersEmbedder is selected for throughput, document the trade-off explicitly.
- Confirm late chunking paths never discard tail tokens.
- path: "core/embedders/**/*.py"
instructions: |
Embedders must honor the standard EmbeddingConfig interface:
- Accept EmbeddingConfig or equivalent dict input.
- Ensure fallback embeddings match EMBEDDING_DIM.
- Preserve late chunking semantics (encode once, pool later).

⚠️ Potential issue

🧩 Analysis chain

Embedding path instructions LGTM; ensure code conforms.

Guidance matches learnings (late chunking, JinaV4 vs ST trade-offs). Verify core/workflows actually document trade-offs and never drop tail tokens.


🏁 Script executed:

#!/bin/bash
rg -n -g 'core/workflows/**/*.py' -C2 -P '(JinaV4Embedder|SentenceTransformersEmbedder|late[_\s-]?chunk)'

Length of output: 7257


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "1) Search workflows for trade-off keywords..."
rg -n -S -g 'core/workflows/**/*.py' -C3 -P '\b(throughput|fidelity|trade|trade-off|tradeoffs|trade-offs)\b' || true
echo
echo "2) Search workflows for token/tail/truncate/drop/discard..."
rg -n -S -g 'core/workflows/**/*.py' -C3 -P '\b(tail|token(s)?|truncate|truncat|discard|drop)\b' || true
echo
echo "3) Search embedders for embed_with_late_chunking and embedder classes/keywords..."
rg -n -S -g 'core/embedders/**/*.py' -C5 -P '(class\s+\w*Embedder|def\s+embed_with_late_chunking|embed_with_late_chunking|late[_\s-]?chunk|tokenizer|stride|truncate|truncat|tail|token(s)?|EMBEDDING_DIM|EmbeddingConfig)' || true
echo
echo "4) Files across core/ that reference embed_with_late_chunking..."
rg -n -S -g 'core/**/*.py' -C5 -P 'embed_with_late_chunking' || true
echo
echo "5) Print file heads and occurrences for files that implement embed_with_late_chunking..."
files=$(rg -n -S -g 'core/**/*.py' -P 'embed_with_late_chunking' --hidden --no-line-number --files || true)
if [ -n "$files" ]; then
  for f in $files; do
    echo
    echo "---- $f ----"
    rg -n -C5 -P 'embed_with_late_chunking' "$f" || true
    echo "----- file head (first 240 lines) -----"
    sed -n '1,240p' "$f" || true
  done
else
  echo "No files implementing embed_with_late_chunking found."
fi

Length of output: 75273


Embedding path instructions LGTM — critical fixes required

  • Workflows warn about SentenceTransformers deprecation but do not explicitly state the throughput vs. fidelity trade-off; add explicit trade-off messaging where embedder_type is handled (core/workflows/workflow_pdf.py:234-239; core/workflows/workflow_pdf_batch.py:444-448; core/workflows/workflow_arxiv_initial_ingest.py:91-95).
  • Late-chunking can discard tail tokens: core/embedders/embedders_jina.py::encode_full_document truncates to MAX_TOKENS (32,768) via tokenizer truncation, and workflows call embed_with_late_chunking directly (core/workflows/workflow_pdf.py:449-451; core/workflows/workflow_pdf_batch.py:506-507). Require change: ensure embed_with_late_chunking automatically falls back to process_long_document for texts > MAX_TOKENS, or update workflows to detect long documents and call process_long_document.
🤖 Prompt for AI Agents
In .coderabbit.yaml around lines 103 to 116, update the embedder/workflow
instructions so they mandate two fixes: (1) In core/workflows/workflow_pdf.py
(lines ~234-239), core/workflows/workflow_pdf_batch.py (lines ~444-448), and
core/workflows/workflow_arxiv_initial_ingest.py (lines ~91-95) add explicit
messaging when embedder_type == SentenceTransformersEmbedder that documents the
throughput vs. fidelity trade-off (high throughput but lower embedding fidelity
and potential deprecation) where the embedder selection is handled; (2) In
core/embedders/embedders_jina.py ensure encode_full_document and
embed_with_late_chunking do not lose tail tokens by implementing an automatic
fallback: if tokenized length > MAX_TOKENS (32768) have embed_with_late_chunking
call the existing process_long_document flow (or perform streaming/chunking that
preserves tails) instead of relying on tokenizer truncation, or alternatively
update the workflows (core/workflows/workflow_pdf.py lines ~449-451 and
core/workflows/workflow_pdf_batch.py lines ~506-507) to detect long documents
and call process_long_document before invoking embed_with_late_chunking.
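
One possible shape for that fallback, sketched under the assumption that the embedder exposes a tokenizer, a MAX_TOKENS constant, and process_long_document (names taken from the comment above; actual signatures may differ):

```python
# Hypothetical sketch for JinaV4Embedder -- not the current implementation.
def embed_with_late_chunking(self, text: str):
    # Count tokens without truncation so overlong inputs are detected up front.
    n_tokens = len(self.tokenizer.encode(text, add_special_tokens=False))
    if n_tokens > self.MAX_TOKENS:
        # Route to the sliding-window path so tail tokens are preserved
        # instead of being silently dropped by tokenizer truncation.
        return self.process_long_document(text)
    return self._embed_single_window(text)  # existing single-window path (name assumed)
```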

Comment on lines +25 to +38
$connectionOptions = [
    // TODO: Fix Unix socket permissions, for now use TCP
    // ConnectionOptions::OPTION_ENDPOINT => 'unix:///tmp/arangodb.sock',
    ConnectionOptions::OPTION_ENDPOINT => 'tcp://localhost:8529',
    ConnectionOptions::OPTION_DATABASE => 'arxiv_repository',
    ConnectionOptions::OPTION_AUTH_TYPE => 'Basic',
    ConnectionOptions::OPTION_AUTH_USER => 'root',
    ConnectionOptions::OPTION_AUTH_PASSWD => getenv('ARANGO_PASSWORD') ?: '',
    ConnectionOptions::OPTION_CONNECTION => 'Keep-Alive',
    ConnectionOptions::OPTION_TIMEOUT => 30,
    ConnectionOptions::OPTION_RECONNECT => true,
    ConnectionOptions::OPTION_CREATE => false,
    ConnectionOptions::OPTION_UPDATE_POLICY => UpdatePolicy::LAST,
];

⚠️ Potential issue

Unify endpoint and prefer Unix sockets via proxies.

Endpoint reporting and stats show inconsistent transports. Default to Unix sockets from env (RO/RW proxies), with TCP as explicit fallback.

-$connectionOptions = [
-    // TODO: Fix Unix socket permissions, for now use TCP
-    // ConnectionOptions::OPTION_ENDPOINT => 'unix:///tmp/arangodb.sock',
-    ConnectionOptions::OPTION_ENDPOINT => 'tcp://localhost:8529',
+// Prefer systemd-managed Unix sockets; fall back to TCP if not set
+$listenSocket = getenv('ARANGO_RO_SOCKET') ?: getenv('LISTEN_SOCKET');
+$endpoint = $listenSocket ? ("unix://".$listenSocket) : 'tcp://localhost:8529';
+$connectionOptions = [
+    ConnectionOptions::OPTION_ENDPOINT => $endpoint,
@@
-            echo json_encode([
+            echo json_encode([
                 'status' => 'success',
                 'message' => 'Connected successfully!',
                 'collections_found' => count($collections),
-                'endpoint' => 'tcp://localhost:8529'  // TODO: Switch to Unix socket
+                'endpoint' => $endpoint
             ], JSON_PRETTY_PRINT) . "\n";
@@
-            $stats = [
-                'connection' => 'unix:///tmp/arangodb.sock',
+            $stats = [
+                'connection' => $endpoint,
                 'database' => 'arxiv_repository',

Also applies to: 53-55, 194-203

🤖 Prompt for AI Agents
In core/database/arango/php_unix_bridge.php around lines 25-38 (and also update
the similar occurrences at 53-55 and 194-203), the endpoint is hardcoded to TCP
and endpoints are inconsistent; change the logic to choose the endpoint from
environment configuration preferring a Unix-socket proxy when available (e.g.
ARANGO_ENDPOINT or ARANGO_RW_PROXY/ARANGO_RO_PROXY), falling back to a TCP URL
only if no unix socket proxy env is set; implement a single helper/decision
block to build ConnectionOptions::OPTION_ENDPOINT either as 'unix:///path.sock'
(for proxies) or the explicit 'tcp://host:port' fallback, and update all other
occurrences to use this helper so transports are unified and configurable via
env.

Comment on lines +149 to +185
// Process in batches for efficiency
foreach (array_chunk($documents, 1000) as $batch) {
    $batchDocs = [];
    foreach ($batch as $doc) {
        $document = new Document();
        foreach ($doc as $key => $value) {
            $document->set($key, $value);
        }
        $batchDocs[] = $document;
    }

    // ACID compliant atomic bulk insert using transaction
    $trx = new \ArangoDBClient\Transaction($connection);
    $trx->addWrite($collectionName);

    // Build batch insert action
    $docsJson = json_encode(array_map(function($doc) {
        return $doc->getAll();
    }, $batchDocs));

    $action = "function () {
        var db = require('@arangodb').db;
        var docs = $docsJson;
        return db.$collectionName.save(docs);
    }";

    $trx->setAction($action);

    try {
        // Execute atomic transaction - all or nothing
        $result = $trx->execute();
        $inserted += count($batch);
    } catch (\Exception $e) {
        // Transaction failed - nothing was inserted
        throw new \Exception("ACID transaction failed: " . $e->getMessage());
    }
}

⚠️ Potential issue

Transaction action is unsafe/incorrect; parameterize and use _collection.

  • String interpolating $collectionName into JS is brittle and risks breakage.
  • db.$collectionName.save(docs) is not valid JS; use db._collection(name) and iterate docs.

Apply:

-            foreach (array_chunk($documents, 1000) as $batch) {
-                $batchDocs = [];
-                foreach ($batch as $doc) {
-                    $document = new Document();
-                    foreach ($doc as $key => $value) {
-                        $document->set($key, $value);
-                    }
-                    $batchDocs[] = $document;
-                }
-
-                // ACID compliant atomic bulk insert using transaction
-                $trx = new \ArangoDBClient\Transaction($connection);
-                $trx->addWrite($collectionName);
-
-                // Build batch insert action
-                $docsJson = json_encode(array_map(function($doc) {
-                    return $doc->getAll();
-                }, $batchDocs));
-
-                $action = "function () {
-                    var db = require('@arangodb').db;
-                    var docs = $docsJson;
-                    return db.$collectionName.save(docs);
-                }";
-
-                $trx->setAction($action);
+            foreach (array_chunk($documents, 1000) as $batch) {
+                // ACID compliant atomic bulk insert using transaction
+                $trx = new \ArangoDBClient\Transaction($connection);
+                $trx->setCollections([ 'write' => [$collectionName] ]);
+                $trx->setParams([
+                    'collection' => $collectionName,
+                    'docs' => array_values($batch),
+                ]);
+                $trx->setAction(<<<'JS'
+function (params) {
+  const db = require('@arangodb').db;
+  const coll = db._collection(params.collection);
+  const docs = params.docs;
+  for (const d of docs) { coll.insert(d); }
+  return docs.length;
+}
+JS
+                );

This ensures correct collection resolution and avoids JS string interpolation.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Process in batches for efficiency
foreach (array_chunk($documents, 1000) as $batch) {
$batchDocs = [];
foreach ($batch as $doc) {
$document = new Document();
foreach ($doc as $key => $value) {
$document->set($key, $value);
}
$batchDocs[] = $document;
}
// ACID compliant atomic bulk insert using transaction
$trx = new \ArangoDBClient\Transaction($connection);
$trx->addWrite($collectionName);
// Build batch insert action
$docsJson = json_encode(array_map(function($doc) {
return $doc->getAll();
}, $batchDocs));
$action = "function () {
var db = require('@arangodb').db;
var docs = $docsJson;
return db.$collectionName.save(docs);
}";
$trx->setAction($action);
try {
// Execute atomic transaction - all or nothing
$result = $trx->execute();
$inserted += count($batch);
} catch (\Exception $e) {
// Transaction failed - nothing was inserted
throw new \Exception("ACID transaction failed: " . $e->getMessage());
}
}
// Process in batches for efficiency
foreach (array_chunk($documents, 1000) as $batch) {
// ACID compliant atomic bulk insert using transaction
$trx = new \ArangoDBClient\Transaction($connection);
$trx->setCollections([ 'write' => [$collectionName] ]);
$trx->setParams([
'collection' => $collectionName,
'docs' => array_values($batch),
]);
$trx->setAction(<<<'JS'
function (params) {
const db = require('@arangodb').db;
const coll = db._collection(params.collection);
const docs = params.docs;
for (const d of docs) { coll.insert(d); }
return docs.length;
}
JS
);
try {
// Execute atomic transaction - all or nothing
$result = $trx->execute();
$inserted += count($batch);
} catch (\Exception $e) {
// Transaction failed - nothing was inserted
throw new \Exception("ACID transaction failed: " . $e->getMessage());
}
}

Comment on lines +16 to +18
from core.database.database_factory import DatabaseFactory
from core.database.arango import MemoryServiceError


⚠️ Potential issue

Import path for MemoryServiceError may be wrong

MemoryServiceError is defined in memory_client.py; unless it is re-exported in core.database.arango.__init__, this import will raise ImportError.

-from core.database.arango import MemoryServiceError
+from core.database.arango.memory_client import MemoryServiceError
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from core.database.database_factory import DatabaseFactory
from core.database.arango import MemoryServiceError
from core.database.database_factory import DatabaseFactory
from core.database.arango.memory_client import MemoryServiceError
🤖 Prompt for AI Agents
In core/database/utils/check_resume_state.py around lines 16 to 18, the code
imports MemoryServiceError from core.database.arango but that exception is
actually defined in memory_client.py; update the import to reference the correct
module (e.g., from core.database.arango.memory_client import MemoryServiceError) or
alternatively add a re-export in core.database.arango.__init__ so the symbol is
available from core.database.arango; ensure tests/imports still pass after the
change.

Comment on lines +508 to +536
transaction_action = """
function (params) {
const db = require('@arangodb').db;
const papers = params.papers || [];
const chunks = params.chunks || [];
const embeddings = params.embeddings || [];
const structures = params.structures || [];
if (papers.length) {
db.arxiv_metadata.insert(papers, { overwriteMode: 'replace' });
}
if (chunks.length) {
db.arxiv_abstract_chunks.insert(chunks, { overwriteMode: 'replace' });
}
if (embeddings.length) {
db.arxiv_abstract_embeddings.insert(embeddings, { overwriteMode: 'replace' });
}
if (structures.length) {
db.arxiv_structures.insert(structures, { overwriteMode: 'replace' });
}
return {
papers: papers.length,
chunks: chunks.length,
embeddings: embeddings.length,
structures: structures.length
};
}
"""

🛠️ Refactor suggestion

🧩 Analysis chain

Verify the ArangoDB internal reference.

The transaction uses require('@arangodb').db while other workflows use require('internal').db. Ensure this is the correct module for your ArangoDB version.


🏁 Script executed:

#!/bin/bash
# Check which require pattern is used across the codebase
rg "require\('@arangodb'\)|require\('internal'\)" --type py

Length of output: 153


Use require('@arangodb').db for server-side transactions. ArangoDB’s server-side JS (transactions/Foxx/arangosh) exposes the db object via @arangodb; transactions examples use require("@arangodb").db. (docs.arangodb.com)

  • Found inconsistent usage: core/tools/arxiv/arxiv_manager.py — uses require('@arangodb').db
  • core/workflows/workflow_pdf_batch.py — uses require('internal').db

Change the latter to require('@arangodb').db inside the transaction action to match docs and avoid runtime errors.

🤖 Prompt for AI Agents
In core/workflows/workflow_pdf_batch.py around the transaction action (lines
referenced by the review), the transaction's server-side JS uses
require('internal').db which is incorrect for transactions and can cause runtime
errors; update the transaction action to use require('@arangodb').db instead
(replace any require('internal').db occurrences within the transaction
function), and verify any other server-side transaction references in that file
follow the same pattern so the db object is sourced via require('@arangodb') as
per ArangoDB docs.

Comment on lines 486 to +625
sanitized_id = document_id.replace('.', '_').replace('/', '_')

# Acquire lock with timeout (1 minute should be sufficient for lock acquisition)
if not WORKER_DB_MANAGER.acquire_lock(document_id, timeout_minutes=1):
logger.warning(f"Could not acquire lock for {document_id} within 60 seconds")
return {
'success': False,
'error': 'Lock acquisition timeout - document may be processing elsewhere'

paper_doc = {
'_key': sanitized_id,
'document_id': document_id,
'paper_key': sanitized_id,
'metadata': doc.get('metadata', {}),
'status': 'PROCESSED',
'processing_date': datetime.now().isoformat(),
'num_chunks': len(chunks),
'has_latex': extracted.get('has_latex', False)
}

if doc.get('metadata', {}).get('repo'):
paper_doc['repository'] = doc['metadata']['repo']

if extracted.get('symbols'):
paper_doc['symbols'] = extracted['symbols']
paper_doc['symbol_hash'] = extracted.get('symbol_hash', '')
paper_doc['code_metrics'] = extracted.get('code_metrics', {})
paper_doc['code_structure'] = extracted.get('code_structure', {})
paper_doc['language'] = extracted.get('language')
paper_doc['has_tree_sitter'] = True
else:
paper_doc['has_tree_sitter'] = False

chunk_docs = []
embedding_docs = []
for i, chunk in enumerate(chunks):
chunk_key = f"{sanitized_id}_chunk_{i}"
chunk_docs.append({
'_key': chunk_key,
'document_id': document_id,
'paper_key': sanitized_id,
'text': chunk.text,
'chunk_index': i,
'start_char': chunk.start_char,
'end_char': chunk.end_char
})

embedding_record = {
'_key': f"{chunk_key}_emb",
'document_id': document_id,
'paper_key': sanitized_id,
'chunk_id': chunk_key,
'vector': chunk.embedding.tolist(),
'model': 'jinaai/jina-embeddings-v4'
}

try:
# Begin transaction
collections = WORKER_DB_MANAGER.collections
write_collections = [
collections['papers'],
collections['chunks'],
collections['embeddings']
]

txn_db = WORKER_DB_MANAGER.begin_transaction(
write_collections=write_collections,
lock_timeout=5
)

try:
# Store paper metadata with Tree-sitter symbols if available
paper_doc = {
'_key': sanitized_id,
'document_id': document_id,
'metadata': doc.get('metadata', {}),
'status': 'PROCESSED',
'processing_date': datetime.now().isoformat(),
'num_chunks': len(chunks),
'has_latex': extracted.get('has_latex', False)

if extracted.get('symbols'):
embedding_record['has_symbols'] = True
embedding_record['language'] = extracted.get('language')
embedding_record['symbol_context'] = {
'total_functions': len(extracted['symbols'].get('functions', [])),
'total_classes': len(extracted['symbols'].get('classes', [])),
'total_imports': len(extracted['symbols'].get('imports', [])),
'complexity': extracted.get('code_metrics', {}).get('complexity', 0)
}

# Add repository field if this is GitHub data
if doc.get('metadata', {}).get('repo'):
paper_doc['repository'] = doc['metadata']['repo']

# Add Tree-sitter symbol data if available
if extracted.get('symbols'):
paper_doc['symbols'] = extracted['symbols']
paper_doc['symbol_hash'] = extracted.get('symbol_hash', '')
paper_doc['code_metrics'] = extracted.get('code_metrics', {})
paper_doc['code_structure'] = extracted.get('code_structure', {})
paper_doc['language'] = extracted.get('language')
paper_doc['has_tree_sitter'] = True
else:
paper_doc['has_tree_sitter'] = False

txn_db.collection(collections['papers']).insert(paper_doc, overwrite=True)

# Store chunks and embeddings
for i, chunk in enumerate(chunks):
chunk_key = f"{sanitized_id}_chunk_{i}"

# Store chunk
txn_db.collection(collections['chunks']).insert({
'_key': chunk_key,
'document_id': sanitized_id,
'text': chunk.text,
'chunk_index': i,
'start_char': chunk.start_char,
'end_char': chunk.end_char
}, overwrite=True)

# Store embedding with symbol metadata for code files
embedding_doc = {
'_key': f"{chunk_key}_emb",
'document_id': sanitized_id,
'chunk_id': chunk_key,
'vector': chunk.embedding.tolist(),
'model': 'jina-v4'
}

# Add symbol metadata if this is code with Tree-sitter data
if extracted.get('symbols'):
embedding_doc['has_symbols'] = True
embedding_doc['language'] = extracted.get('language')
# Store a summary of symbols for this chunk's context
# This helps the Jina v4 coding LoRA understand the code context
embedding_doc['symbol_context'] = {
'total_functions': len(extracted['symbols'].get('functions', [])),
'total_classes': len(extracted['symbols'].get('classes', [])),
'total_imports': len(extracted['symbols'].get('imports', [])),
'complexity': extracted.get('code_metrics', {}).get('complexity', 0)
}

txn_db.collection(collections['embeddings']).insert(embedding_doc, overwrite=True)

# Commit
txn_db.commit_transaction()

except Exception as e:
txn_db.abort_transaction()
raise

finally:
WORKER_DB_MANAGER.release_lock(document_id)


embedding_docs.append(embedding_record)

structures_doc = None
if extracted.get('tables') or extracted.get('equations') or extracted.get('images'):
structures_doc = {
'_key': sanitized_id,
'document_id': document_id,
'paper_key': sanitized_id,
'tables': extracted.get('tables', []),
'equations': extracted.get('equations', []),
'images': extracted.get('images', []),
'figures': extracted.get('figures', [])
}

action = """
function (params) {
const db = require('internal').db;
const opts = { overwriteMode: "replace" };
const cols = params.collections;
db._collection(cols.papers).insert(params.paper, opts);
params.chunks.forEach(function (chunk) {
db._collection(cols.chunks).insert(chunk, opts);
});
params.embeddings.forEach(function (emb) {
db._collection(cols.embeddings).insert(emb, opts);
});
if (params.structures) {
db._collection(cols.structures).insert(params.structures, opts);
}
}
"""

write_collections = [
WORKER_COLLECTIONS['papers'],
WORKER_COLLECTIONS['chunks'],
WORKER_COLLECTIONS['embeddings'],
]
if structures_doc:
write_collections.append(WORKER_COLLECTIONS['structures'])

transaction_request = {
'collections': {'write': write_collections},
'params': {
'paper': paper_doc,
'chunks': chunk_docs,
'embeddings': embedding_docs,
'structures': structures_doc,
'collections': WORKER_COLLECTIONS,
},
'action': action,
}

WORKER_DB_CLIENT._write_client.request(
'POST',
f"/_db/{WORKER_DB_CLIENT._config.database}/_api/transaction",
json=transaction_request,
)


🛠️ Refactor suggestion

🧩 Analysis chain

Verify transactional consistency in the new memory-based storage flow.

The transition to JavaScript-based transactional writes looks correct, but ensure the transaction properly handles failures mid-operation. Consider adding a comment explaining why overwriteMode: "replace" is used.


🏁 Script executed:

#!/bin/bash
# Check for other usages of similar transaction patterns to ensure consistency
rg -A 10 "execute_transaction|_write_client\.request.*transaction" --type py

Length of output: 1648


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== files with execute_transaction ==="
rg -n "execute_transaction" -S || true

echo
echo "=== core/database/arango/memory_client.py (first 400 lines) ==="
sed -n '1,400p' core/database/arango/memory_client.py || true

echo
echo "=== core/tools/arxiv/arxiv_manager.py (first 240 lines) ==="
sed -n '1,240p' core/tools/arxiv/arxiv_manager.py || true

echo
echo "=== occurrences of _write_client.request ==="
rg -n "_write_client\.request" -S || true

echo
echo "=== occurrences of WORKER_DB_CLIENT and WORKER_COLLECTIONS ==="
rg -n "WORKER_DB_CLIENT" -S || true
rg -n "WORKER_COLLECTIONS" -S || true

Length of output: 26111


Use the execute_transaction wrapper and document overwriteMode='replace'

  • Replace the direct call to WORKER_DB_CLIENT._write_client.request(...) in core/workflows/workflow_pdf_batch.py (around line 620, within lines ~509–625) with WORKER_DB_CLIENT.execute_transaction(...) so the write benefits from execute_transaction's payload shaping, waitForSync support, and consistent error-wrapping (see core/database/arango/memory_client.py — def execute_transaction at ~line 280).
  • Add a one-line inline comment beside opts = { overwriteMode: "replace" } explaining the intended semantics (why existing documents are replaced / idempotent write behavior) and confirm that this choice and the transaction atomicity/failure semantics meet expectations.
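
Roughly what the replacement call could look like (a sketch only; verify keyword names against execute_transaction in memory_client.py before adopting):

```python
# Sketch: route the batched write through the client wrapper instead of the raw
# HTTP request, so payload shaping, waitForSync, and error wrapping stay consistent.
result = WORKER_DB_CLIENT.execute_transaction(
    write=write_collections,   # assumed keyword; check the wrapper's signature
    action=action,
    params={
        'paper': paper_doc,
        'chunks': chunk_docs,
        'embeddings': embedding_docs,
        'structures': structures_doc,
        'collections': WORKER_COLLECTIONS,
    },
    wait_for_sync=True,        # assumed option; drop if the wrapper does not expose it
)
```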
🤖 Prompt for AI Agents
In core/workflows/workflow_pdf_batch.py around lines 509 to 625, replace the
direct low-level POST to WORKER_DB_CLIENT._write_client.request(...) with a call
to WORKER_DB_CLIENT.execute_transaction(...) passing the constructed
transaction_request so the transaction payload is shaped, waitForSync honored
and errors wrapped consistently; also add a one-line inline comment next to opts
= { overwriteMode: "replace" } stating that overwriteMode: "replace" is used to
make the write idempotent by replacing existing documents (ensures
deterministic/replace-on-retry semantics) and confirm that atomic transaction
semantics are acceptable for failure behavior.

Comment on lines 213 to 245
embedder_type = (self.config.embedder_type or 'jina').lower()
model_name = self.config.embedding_model or 'jinaai/jina-embeddings-v4'
if model_name.lower() in {'jina-v4', 'jinaai/jina-v4'}:
    model_name = 'jinaai/jina-embeddings-v4'

if embedder_type in {'sentence', 'sentence-transformers'} and 'sentence-transformers' not in model_name:
    logger.warning(
        "SentenceTransformers embedder requested without an explicit model name; "
        "defaulting to sentence-transformers/all-mpnet-base-v2."
    )
    model_name = 'sentence-transformers/all-mpnet-base-v2'

embed_config = EmbeddingConfig(
    model_name=model_name,
    device=self.config.device or 'cuda',
    batch_size=self.config.batch_size,
    use_fp16=self.config.use_fp16,
    chunk_size_tokens=self.config.chunk_size_tokens,
    chunk_overlap_tokens=self.config.chunk_overlap_tokens
    chunk_overlap_tokens=self.config.chunk_overlap_tokens,
)

if embedder_type in {'sentence', 'sentence-transformers'}:
    logger.warning(
        "SentenceTransformersEmbedder is deprecated and will be removed after migration; "
        "routing to JinaV4Embedder fallback."
    )
elif embedder_type not in {'jina', 'transformer'}:
    logger.warning("Unknown embedder_type '%s'; defaulting to JinaV4Embedder", embedder_type)

self.embedder = EmbedderFactory.create(model_name=model_name, config=embed_config)
self.embedding_model = model_name
logger.info("Initialized embedder '%s' with model '%s'", embedder_type, model_name)


⚠️ Potential issue

Model/driver mismatch when embedder_type='sentence'

You set model_name to a Sentence‑Transformers model, but the factory redirects “sentence” to JinaV4Embedder, which can’t load ST models. Force a Jina model when falling back; also auto‑select CPU when CUDA is unavailable.

-embedder_type = (self.config.embedder_type or 'jina').lower()
-model_name = self.config.embedding_model or 'jinaai/jina-embeddings-v4'
-if model_name.lower() in {'jina-v4', 'jinaai/jina-v4'}:
-    model_name = 'jinaai/jina-embeddings-v4'
+embedder_type = (self.config.embedder_type or 'jina').lower()
+model_name = self.config.embedding_model or 'jinaai/jina-embeddings-v4'
+if model_name.lower() in {'jina-v4', 'jinaai/jina-v4'}:
+    model_name = 'jinaai/jina-embeddings-v4'
@@
-if embedder_type in {'sentence', 'sentence-transformers'} and 'sentence-transformers' not in model_name:
-    logger.warning(
-        "SentenceTransformers embedder requested without an explicit model name; "
-        "defaulting to sentence-transformers/all-mpnet-base-v2."
-    )
-    model_name = 'sentence-transformers/all-mpnet-base-v2'
+if embedder_type in {'sentence', 'sentence-transformers'}:
+    logger.warning(
+        "SentenceTransformers path is deprecated; routing to JinaV4Embedder."
+    )
+    # Force a compatible Jina model to avoid loader mismatch
+    model_name = 'jinaai/jina-embeddings-v4'
@@
-embed_config = EmbeddingConfig(
-    model_name=model_name,
-    device=self.config.device or 'cuda',
+# Device auto-detect
+_device = self.config.device
+if not _device:
+    try:
+        import torch  # local import
+        _device = 'cuda' if torch.cuda.is_available() else 'cpu'
+    except Exception:
+        _device = 'cpu'
+embed_config = EmbeddingConfig(
+    model_name=model_name,
+    device=_device,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
embedder_type = (self.config.embedder_type or 'jina').lower()
model_name = self.config.embedding_model or 'jinaai/jina-embeddings-v4'
if model_name.lower() in {'jina-v4', 'jinaai/jina-v4'}:
model_name = 'jinaai/jina-embeddings-v4'
if embedder_type in {'sentence', 'sentence-transformers'} and 'sentence-transformers' not in model_name:
logger.warning(
"SentenceTransformers embedder requested without an explicit model name; "
"defaulting to sentence-transformers/all-mpnet-base-v2."
)
model_name = 'sentence-transformers/all-mpnet-base-v2'
embed_config = EmbeddingConfig(
model_name=model_name,
device=self.config.device or 'cuda',
batch_size=self.config.batch_size,
use_fp16=self.config.use_fp16,
chunk_size_tokens=self.config.chunk_size_tokens,
chunk_overlap_tokens=self.config.chunk_overlap_tokens
chunk_overlap_tokens=self.config.chunk_overlap_tokens,
)
if embedder_type in {'sentence', 'sentence-transformers'}:
logger.warning(
"SentenceTransformersEmbedder is deprecated and will be removed after migration; "
"routing to JinaV4Embedder fallback."
)
elif embedder_type not in {'jina', 'transformer'}:
logger.warning("Unknown embedder_type '%s'; defaulting to JinaV4Embedder", embedder_type)
self.embedder = EmbedderFactory.create(model_name=model_name, config=embed_config)
self.embedding_model = model_name
logger.info("Initialized embedder '%s' with model '%s'", embedder_type, model_name)
embedder_type = (self.config.embedder_type or 'jina').lower()
model_name = self.config.embedding_model or 'jinaai/jina-embeddings-v4'
if model_name.lower() in {'jina-v4', 'jinaai/jina-v4'}:
model_name = 'jinaai/jina-embeddings-v4'
if embedder_type in {'sentence', 'sentence-transformers'}:
logger.warning(
"SentenceTransformers path is deprecated; routing to JinaV4Embedder."
)
# Force a compatible Jina model to avoid loader mismatch
model_name = 'jinaai/jina-embeddings-v4'
# Device auto-detect
_device = self.config.device
if not _device:
try:
import torch # local import
_device = 'cuda' if torch.cuda.is_available() else 'cpu'
except Exception:
_device = 'cpu'
embed_config = EmbeddingConfig(
model_name=model_name,
device=_device,
batch_size=self.config.batch_size,
use_fp16=self.config.use_fp16,
chunk_size_tokens=self.config.chunk_size_tokens,
chunk_overlap_tokens=self.config.chunk_overlap_tokens,
)
if embedder_type in {'sentence', 'sentence-transformers'}:
logger.warning(
"SentenceTransformersEmbedder is deprecated and will be removed after migration; "
"routing to JinaV4Embedder fallback."
)
elif embedder_type not in {'jina', 'transformer'}:
logger.warning("Unknown embedder_type '%s'; defaulting to JinaV4Embedder", embedder_type)
self.embedder = EmbedderFactory.create(model_name=model_name, config=embed_config)
self.embedding_model = model_name
logger.info("Initialized embedder '%s' with model '%s'", embedder_type, model_name)
🤖 Prompt for AI Agents
In core/workflows/workflow_pdf.py around lines 213 to 245, the code sets a
Sentence‑Transformers model_name while still routing embedder_type 'sentence' to
the JinaV4Embedder (which cannot load ST models) and always selects 'cuda' if
device is unset; change the fallback so that when embedder_type is 'sentence' or
'sentence-transformers' you either (a) ensure model_name is a Jina model when
routing to JinaV4Embedder (e.g., override to a Jina v4 embeddings model) or (b)
route to a SentenceTransformersEmbedder if you intend to keep the ST model —
pick one consistent mapping and update the logger accordingly; additionally,
detect CUDA availability before defaulting to 'cuda' (use
torch.cuda.is_available() or equivalent) and set device to 'cpu' when CUDA is
not available; finally, ensure EmbedderFactory.create receives the corrected
model_name and device in embed_config and update warning/info messages to
reflect the chosen mapping.

psycopg = {version = "^3.1.18", extras = ["binary"]}
sqlalchemy = "^2.0.31"
alembic = "^1.13.2"
h2 = "^4.3.0"

⚠️ Potential issue

🧩 Analysis chain

Verify HTTP/2 stack compatibility (httpx 0.24.x + h2 4.3.x).

Ensure version constraints won’t introduce HTTP/2 quirks across httpx/httpcore/h2 when using UDS.


🌐 Web query:

httpx 0.24 http2 support and compatible httpcore/h2 versions

💡 Result:

Short answer

  • HTTP/2 in httpx 0.24.x is optional (enable with httpx[http2] or AsyncClient(http2=True)). [1][2]
  • Compatible dependency ranges for the 0.24.x release:
    • httpcore >= 0.15.0, < 0.18.0. [3]
    • h2 >= 3, < 5 (the optional HTTP/2 backend). [2]

Sources

  1. httpx 0.24.0 PyPI / docs (HTTP/2 is an optional extra). [1]
  2. httpx 0.24.0 dependency metadata showing h2 constraint (<5,>=3). [2]
  3. FreeBSD port / packaging showing httpcore >=0.15.0,<0.18.0 for httpx 0.24.0. [3]

🌐 Web query:

python-h2 4.3 compatibility notes with httpx/httpcore

💡 Result:

Short answer — watch header validation.

  • h2 4.3.0 (released Aug 23, 2025) adds stricter header validation (rejects header names/values with characters illegal under RFC 9113) and tighter event typing. This can cause previously-accepted HTTP/2 responses/requests to be rejected by h2 when used as the HTTP/2 codec. [1][2]

  • httpcore/httpx use h2 (as an optional extra) for HTTP/2. If your httpx/httpcore stack installs/uses h2 4.3, you may see failures when upstream servers or proxies emit headers that h2 now rejects. httpcore’s HTTP/2 integration relies on h2’s H2Configuration and can be configured to relax inbound header validation if required. [3][4]

Recommended actions

  1. Test your httpx/httpcore-based clients against h2 4.3 in staging to detect header-rejection errors. [1][3]
  2. If you must accept non‑conformant headers, configure h2 to be permissive (e.g. H2Configuration(validate_inbound_headers=False) when constructing the H2Connection used by httpcore/httpx wrappers) or stick to a h2 version you’ve validated. [2][4]
  3. Pin dependency ranges in production until you’ve verified behavior (e.g. pin httpcore/httpx and, if installed directly, h2). Run integration tests that exercise HTTP/2 paths (including proxies). [3]

Sources

  1. h2 4.3.0 release notes / changelog (notes about RFC 9113 header validation). [1]
  2. h2 configuration docs (H2Configuration and validate_inbound_headers option). [2]
  3. httpcore HTTP/2 documentation / notes about h2 being the optional extra used by httpcore. [3]
  4. h2 PyPI release metadata (h2-4.3.0 upload date). [4]

Would you like a short test script to detect header-validation failures with your current httpx/httpcore/h2 combination?


Verify HTTP/2 stack compatibility — h2 4.3 strict header validation

h2 4.3 tightens RFC 9113 header validation and can cause httpx/httpcore HTTP/2 (UDS) traffic to be rejected; pyproject.toml currently allows h2 = "^4.3.0" (pyproject.toml:45).

  • Immediate: pin h2 below 4.3 (e.g. change to h2 = "<4.3.0") to avoid pulling 4.3.x until tested.
  • Alternative: if upgrading to 4.3 is required, ensure the httpcore/httpx H2 integration constructs H2Connection with H2Configuration(validate_inbound_headers=False) (or equivalent) and validate in staging.
  • Action: run full HTTP/2 integration tests (UDS, proxies) and pin httpx/httpcore versions (httpx 0.24.x ↔ httpcore >=0.15,<0.18) until compatibility is confirmed.
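
A small smoke test for that stack (socket path and credentials are assumptions; adjust to the local proxy setup):

```python
# Sketch: exercise httpx/httpcore/h2 over the Unix-socket proxy and report the
# negotiated protocol. http1=False forces HTTP/2 prior knowledge over cleartext.
import os
import httpx

transport = httpx.HTTPTransport(
    uds="/run/hades/readonly/arangod.sock",  # assumed socket path
    http1=False,
    http2=True,
)
with httpx.Client(transport=transport, base_url="http://arangodb") as client:
    resp = client.get(
        "/_api/version",
        auth=("root", os.environ.get("ARANGO_PASSWORD", "")),
    )
    resp.raise_for_status()
    print(resp.http_version, resp.json())  # expect "HTTP/2" on success
```

If h2 4.3's stricter header validation rejects a response here, the failure surfaces as an httpx protocol error rather than a silent downgrade.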
🤖 Prompt for AI Agents
In pyproject.toml around line 45, the dependency h2 = "^4.3.0" may pull h2 4.3.x
which enforces stricter RFC9113 header validation and can break HTTP/2 (UDS)
traffic; change the constraint to pin h2 below 4.3 (for example h2 = "<4.3.0")
to avoid pulling 4.3.x until tested, or if you must upgrade to 4.3.x update the
httpcore/httpx H2 integration code to construct H2Connection with
H2Configuration(validate_inbound_headers=False) (or equivalent), pin compatible
httpx/httpcore versions (e.g. httpx 0.24.x ↔ httpcore >=0.15,<0.18), and run
full HTTP/2 integration tests (UDS, proxies) in staging before merging.

Comment on lines +85 to +91
# Exclude directories that don't need typing
exclude = [
"tests/",
"dev-utils/",
"Acheron/",
"setup/"
]

⚠️ Potential issue

Mypy “exclude” is mis-specified (array). Must be a single regex string.

Current list form is ignored or errors; directories won’t be excluded. Replace with a regex and include the proto path you’re excluding elsewhere.

-[tool.mypy]
-# Exclude directories that don't need typing
-exclude = [
-    "tests/",
-    "dev-utils/",
-    "Acheron/",
-    "setup/"
-]
+[tool.mypy]
+# Exclude directories that don't need typing (regex)
+exclude = '(^(tests|dev-utils|Acheron|setup)/|^core/database/arango/go_service/python/proto/)'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Exclude directories that don't need typing
exclude = [
"tests/",
"dev-utils/",
"Acheron/",
"setup/"
]
[tool.mypy]
# Exclude directories that don't need typing (regex)
exclude = '(^(tests|dev-utils|Acheron|setup)/|^core/database/arango/go_service/python/proto/)'
🤖 Prompt for AI Agents
In pyproject.toml around lines 85 to 91, the mypy "exclude" setting is currently
a TOML array which mypy expects as a single regex string; replace the array with
one regex string that matches all directories to ignore (e.g. combine tests/,
dev-utils/, Acheron/, setup/ and the proto path into a single regex using
alternation and proper anchors), update the exclude value to that regex string,
and verify formatting follows TOML string rules so mypy will actually exclude
those paths.

Comment on lines +145 to 151
if [ -f .env ]; then
set -a
. ./.env
set +a
fi
arango_host=${ARANGO_HOST:-127.0.0.1}
if curl -s -u root:"$ARANGO_PASSWORD" "http://$arango_host:8529/_api/version" > /dev/null 2>&1; then

⚠️ Potential issue

Bug: .env is sourced only when ARANGO_PASSWORD is already set

If ARANGO_PASSWORD is defined only in .env, the current logic never sources it. Source .env before checking the variable.

-print_info "Checking ArangoDB connection..."
-if [ -z "$ARANGO_PASSWORD" ]; then
-    print_warning "ARANGO_PASSWORD not set"
-    echo "Please set: export ARANGO_PASSWORD='your_password'"
-else
-    # Try to connect to ArangoDB
-    if [ -f .env ]; then
-        set -a
-        . ./.env
-        set +a
-    fi
+print_info "Checking ArangoDB connection..."
+# Load env first so ARANGO_PASSWORD/ARANGO_HOST can be picked up
+if [ -f .env ]; then
+    set -a
+    . ./.env
+    set +a
+fi
+if [ -z "$ARANGO_PASSWORD" ]; then
+    print_warning "ARANGO_PASSWORD not set"
+    echo "Please set: export ARANGO_PASSWORD='your_password' or define it in .env"
+else
     arango_host=${ARANGO_HOST:-127.0.0.1}
     if curl -s -u root:"$ARANGO_PASSWORD" "http://$arango_host:8529/_api/version" > /dev/null 2>&1; then
         print_success "Connected to ArangoDB at $arango_host:8529"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [ -f .env ]; then
set -a
. ./.env
set +a
fi
arango_host=${ARANGO_HOST:-127.0.0.1}
if curl -s -u root:"$ARANGO_PASSWORD" "http://$arango_host:8529/_api/version" > /dev/null 2>&1; then
print_info "Checking ArangoDB connection..."
# Load env first so ARANGO_PASSWORD/ARANGO_HOST can be picked up
if [ -f .env ]; then
set -a
. ./.env
set +a
fi
if [ -z "$ARANGO_PASSWORD" ]; then
print_warning "ARANGO_PASSWORD not set"
echo "Please set: export ARANGO_PASSWORD='your_password' or define it in .env"
else
arango_host=${ARANGO_HOST:-127.0.0.1}
if curl -s -u root:"$ARANGO_PASSWORD" "http://$arango_host:8529/_api/version" > /dev/null 2>&1; then
print_success "Connected to ArangoDB at $arango_host:8529"
🤖 Prompt for AI Agents
In setup/setup_local.sh around lines 145 to 151, the script only sources .env
inside a conditional that runs after checking ARANGO_PASSWORD, so
ARANGO_PASSWORD values defined only in .env are never loaded; move the .env
sourcing to before any use of ARANGO_PASSWORD (i.e., source .env with set -a/.
./.env/set +a immediately at the top of this block) so the subsequent
arango_host and curl check can read ARANGO_PASSWORD from the sourced file, and
ensure the logic still preserves existing behavior when .env is absent.

…ger Service

- Added documentation for Virtual Context Management using ArangoDB, detailing architecture, goals, and performance metrics.
- Introduced vLLM Model Manager Service documentation, outlining API contracts, deployment plans, and security measures.
- Created test configuration to filter out noisy third-party warnings during testing.
- Developed unit tests for ArxivSinglePDFWorkflow, including input validation and execution payload generation.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (19)
core/processors/__init__.py (1)

18-27: Sort all to satisfy Ruff (RUF022).

Apply alphabetical ordering.

 __all__ = [
-    "DocumentProcessor",
-    "ProcessingConfig",
-    "ProcessingResult",
-    "ExtractionResult",
-    "ChunkingStrategy",
-    "ChunkingStrategyFactory",
-    "SemanticChunking",
-    "TokenBasedChunking",
-    "SlidingWindowChunking",
+    "ChunkingStrategy",
+    "ChunkingStrategyFactory",
+    "DocumentProcessor",
+    "ExtractionResult",
+    "ProcessingConfig",
+    "ProcessingResult",
+    "SemanticChunking",
+    "SlidingWindowChunking",
+    "TokenBasedChunking",
 ]
core/processors/document_processor.py (6)

204-207: Harden RAMFS staging directory permissions.

Ensure 0700 to avoid world/group access on shared hosts.

         if self.config.use_ramfs_staging:
             self.staging_dir = Path(self.config.staging_dir)
-            self.staging_dir.mkdir(parents=True, exist_ok=True)
+            self.staging_dir.mkdir(parents=True, exist_ok=True)
+            try:
+                import os, stat
+                os.chmod(self.staging_dir, 0o700)
+            except Exception:  # best-effort; log but continue
+                logger.warning("Could not set permissions on %s", self.staging_dir)

307-321: Use logging.exception and narrow the catch if possible.

Preserves traceback for observability (TRY400); consider catching expected errors explicitly.

-        except Exception as exc:
-            logger.error("Failed to process document %s: %s", doc_id, exc)
+        except Exception as exc:
+            logger.exception("Failed to process document %s: %s", doc_id, exc)
             errors.append(str(exc))
             return ProcessingResult(

116-154: Include token offsets in to_dict chunks.

Downstream consumers often rely on start/end token; you already compute/store them elsewhere.

                 {
                     "text": chunk.text,
                     "embedding": np.asarray(chunk.embedding).tolist(),
                     "start_char": chunk.start_char,
                     "end_char": chunk.end_char,
+                    "start_token": getattr(chunk, "start_token", 0),
+                    "end_token": getattr(chunk, "end_token", 0),
                     "chunk_index": chunk.chunk_index,
                     "total_chunks": chunk.total_chunks,
                     "context_window_used": chunk.context_window_used,
                 }

410-436: Batch embeddings instead of per-chunk calls.

Current N× single-item calls add overhead; batch by config.batch_size.

-    def _embed_chunks(self, chunks: List[Dict[str, Any]]) -> List[ChunkWithEmbedding]:
-        embedded_chunks: List[ChunkWithEmbedding] = []
-
-        for i, chunk in enumerate(chunks):
-            embeddings = self.embedder.embed_texts([chunk["text"]], batch_size=1)
-            embedding = embeddings[0] if embeddings else None
+    def _embed_chunks(self, chunks: List[Dict[str, Any]]) -> List[ChunkWithEmbedding]:
+        embedded_chunks: List[ChunkWithEmbedding] = []
+        texts = [c["text"] for c in chunks]
+        emb_arr = self.embedder.embed_texts(texts, batch_size=self.config.batch_size)
+        # Normalize shapes to (N, D)
+        if emb_arr is None or getattr(emb_arr, "shape", (0,))[:1] == (0,):
+            emb_arr = np.zeros((0, self.config.embedding_dim), dtype=np.float32)
+        if emb_arr.ndim == 1:
+            emb_arr = emb_arr.reshape(1, -1)
+        # Pad missing embeddings with zeros if any
+        if len(emb_arr) < len(chunks):
+            pad = np.zeros((len(chunks) - len(emb_arr), emb_arr.shape[1]), dtype=np.float32)
+            emb_arr = np.vstack([emb_arr, pad])
+        for i, (chunk, vec) in enumerate(zip(chunks, emb_arr)):
+            embedding_array = np.asarray(vec, dtype=np.float32).reshape(-1)
-            if embedding is not None:
-                embedding_array = np.asarray(embedding, dtype=np.float32)
-                if embedding_array.ndim != 1:
-                    embedding_array = embedding_array.reshape(-1)
-            else:
-                embedding_array = np.zeros(self.config.embedding_dim, dtype=np.float32)
             embedded_chunks.append(
                 ChunkWithEmbedding(
                     text=chunk["text"],
                     embedding=embedding_array,
                     start_char=chunk.get("start_char", 0),
                     end_char=chunk.get("end_char", len(chunk["text"])),
                     start_token=chunk.get("start_token", 0),
                     end_token=chunk.get("end_token", 0),
                     chunk_index=i,
                     total_chunks=len(chunks),
                     context_window_used=len(chunk["text"].split()),
                 )
             )
         return embedded_chunks

21-41: Scope warning filters more narrowly.

Broad, unconditional filters can hide important warnings from unrelated code paths.

Consider using module-specific filters or action="once" where feasible.
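A minimal sketch of a narrowed filter; the module regex is an assumption about where the noise originates and should be adjusted to the actual emitter:

import warnings

# Show each distinct warning once, and only for the assumed noisy package.
warnings.filterwarnings(
    "once",
    category=UserWarning,
    module=r"transformers(\..*)?",
)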


455-460: Sort all (RUF022).

 __all__ = [
-    "DocumentProcessor",
-    "ProcessingConfig",
-    "ProcessingResult",
-    "ExtractionResult",
+    "DocumentProcessor",
+    "ExtractionResult",
+    "ProcessingConfig",
+    "ProcessingResult",
 ]
docs/prd/codex_interface_prd.md (2)

138-138: Fix bare URL (MD034).

-- Codex repo: https://github.com/openai/codex
+- Codex repo: <https://github.com/openai/codex>

70-80: Acceptance checks: add explicit “no TCP listener” CI probe.

You mention it elsewhere; surface here as a must-pass criterion for the gateway.

 - Acceptance: 400 `TOKEN_BUDGET_EXCEEDED` responses pass through unmodified with echoed correlation headers; missing or altered headers trigger alerts/CI failures.
+- Acceptance: CI must fail if any TCP listener is detected for the gateway process (loopback or otherwise).
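If useful, a hedged Python sketch of such a probe; the process names are assumptions and should match the gateway's actual service processes:

import subprocess

def assert_no_tcp_listener(process_names=("python", "uvicorn")) -> None:
    """Fail the CI job if any assumed gateway process exposes a TCP listener."""
    listing = subprocess.run(
        ["ss", "-lntp"], capture_output=True, text=True, check=True
    ).stdout
    offenders = [
        line for line in listing.splitlines()
        if any(name in line for name in process_names)
    ]
    if offenders:
        raise SystemExit("TCP listener detected:\n" + "\n".join(offenders))

if __name__ == "__main__":
    assert_no_tcp_listener()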
docs/prd/model_manager_vllm_prd.md (1)

52-56: Make the “no TCP listeners” guard testable.

Document the exact CI check so ops can wire it easily.

 - Startup/CI guard fails if `ss -lntp` shows `python`/`uvicorn` bound to TCP, if `stat -c %a /run/hades/vllm` ≠ `700`, or if the socket lacks `660` permissions.
+ - Startup/CI guard:
+   - `! ss -lntp | grep -E '(python|uvicorn)'`
+   - `test "$(stat -c %a /run/hades/vllm)" = "700"`
+   - `test "$(stat -c %a /run/hades/vllm/api.sock)" = "660"`
core/workflows/workflow_pdf.py (1)

16-21: Sort all (RUF022).

 __all__ = [
-    "DocumentProcessor",
-    "ProcessingConfig",
-    "ProcessingResult",
-    "ExtractionResult",
+    "DocumentProcessor",
+    "ExtractionResult",
+    "ProcessingConfig",
+    "ProcessingResult",
 ]
core/workflows/workflow_arxiv_single_pdf.py (4)

1-1: Shebang requires executable permissions.

The file has a shebang line but lacks executable permissions. Either remove the shebang or make the file executable.

-#!/usr/bin/env python3

Or set executable permissions:

chmod +x core/workflows/workflow_arxiv_single_pdf.py

75-77: Consider centralizing error messages.

The long error message could be moved to a constant for better maintainability.

+_ARXIV_DEPENDENCY_ERROR = (
+    "ArXivAPIClient dependency is unavailable. Ensure optional dependencies are installed."
+)
+
 if api_client is not None:
     self.api_client = api_client
 else:
     if ArXivAPIClient is None:
-        raise ImportError(
-            "ArXivAPIClient dependency is unavailable. Ensure optional dependencies are installed."
-        ) from _ARXIV_IMPORT_ERROR
+        raise ImportError(_ARXIV_DEPENDENCY_ERROR) from _ARXIV_IMPORT_ERROR

229-245: Consider using dataclasses.asdict() consistently.

The _serialize_metadata method has redundant logic for dataclass handling since asdict already handles dataclass objects correctly. The second condition checking __dict__ may not be necessary.

 def _serialize_metadata(self, metadata: Any) -> Dict[str, Any]:
     if hasattr(metadata, "__dataclass_fields__"):
         serialized = asdict(metadata)
-    elif hasattr(metadata, "__dict__"):
-        serialized = dict(metadata.__dict__)
     else:
-        serialized = {}
+        serialized = metadata.__dict__ if hasattr(metadata, "__dict__") else {}

189-197: Metadata OK; add optional defensive validation.
ArXivMetadata in core/tools/arxiv/arxiv_api_client.py defines the expected fields, and ArxivSinglePDFWorkflow._serialize_metadata in core/workflows/workflow_arxiv_single_pdf.py already serializes dataclasses and converts published/updated to ISO strings, so a downstream KeyError is unlikely. If consumers require guaranteed keys, add explicit schema validation or default keys in _serialize_metadata (or validate and populate fields in ArXivAPIClient._parse_entry); a sketch of the default-keys option follows.
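A small sketch of that option; the key list is hypothetical and should mirror whatever fields consumers actually rely on:

from dataclasses import asdict
from typing import Any, Dict

# Hypothetical required keys; align with the real ArXivMetadata fields.
_REQUIRED_METADATA_KEYS = ("arxiv_id", "title", "authors", "categories", "published", "updated")

def serialize_metadata_with_defaults(metadata: Any) -> Dict[str, Any]:
    if hasattr(metadata, "__dataclass_fields__"):
        serialized = asdict(metadata)
    else:
        serialized = dict(getattr(metadata, "__dict__", {}))
    for key in _REQUIRED_METADATA_KEYS:
        serialized.setdefault(key, None)  # guarantee keys so consumers never hit KeyError
    return serialized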

core/tools/arxiv/arxiv_manager.py (1)

109-110: Pattern match results aren't capturing the version suffix properly.

The validation runs against the stripped base_id, and neither regex exposes the version suffix in a match group, so version information is lost during validation even though other call sites need it. Validate the full arxiv_id and capture the version explicitly:

 # Check new format (YYMM.NNNNN)
-if cls.ARXIV_ID_PATTERN.match(base_id):
+if cls.ARXIV_ID_PATTERN.match(arxiv_id):
     return True, None
 
 # Check old format (category/YYMMNNN)
-if cls.OLD_ARXIV_ID_PATTERN.match(base_id):
+if cls.OLD_ARXIV_ID_PATTERN.match(arxiv_id):
     return True, None

Also applies to: 113-114
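A sketch of version-aware patterns; the constant names mirror the ones used above, but the exact grouping is illustrative rather than the repo's current definition:

import re

ARXIV_ID_PATTERN = re.compile(r"^(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$")
OLD_ARXIV_ID_PATTERN = re.compile(r"^(?P<base>[a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v(?P<version>\d+))?$")

def validate_arxiv_id(arxiv_id: str):
    """Return (is_valid, version); version is None when no suffix is present."""
    for pattern in (ARXIV_ID_PATTERN, OLD_ARXIV_ID_PATTERN):
        match = pattern.match(arxiv_id)
        if match:
            version = match.group("version")
            return True, int(version) if version else None
    return False, None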

core/workflows/workflow_arxiv_initial_ingest.py (3)

1-1: File has shebang but is not executable.

Either remove the shebang line or make the file executable.

-#!/usr/bin/env python3

Or make executable:

chmod +x core/workflows/workflow_arxiv_initial_ingest.py

169-169: Consider using ASCII alpha in docstring.

The α character in the docstring may cause rendering issues in some documentation tools.

-    - α: Default 1.7 for context amplification
+    - alpha: Default 1.7 for context amplification
 
-    Conveyance C = (W · R · H / T) · Ctx^α measured via throughput metrics.
+    Conveyance C = (W · R · H / T) · Ctx^alpha measured via throughput metrics.

214-217: Extract error message to constant.

Long error messages should be defined as constants for better maintainability.

+_PASSWORD_ERROR = (
+    "ARANGO_PASSWORD not set. Please create a .env file from .env.example "
+    "or set the ARANGO_PASSWORD environment variable."
+)
+
 password = os.environ.get('ARANGO_PASSWORD')
 if not password:
-    raise ValueError(
-        "ARANGO_PASSWORD not set. Please create a .env file from .env.example "
-        "or set the ARANGO_PASSWORD environment variable."
-    )
+    raise ValueError(_PASSWORD_ERROR)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac06da3 and c318fac.

⛔ Files ignored due to path filters (2)
  • tests/conftest.py is excluded by !**/tests/** and included by **/*
  • tests/workflows/test_workflow_arxiv_single_pdf.py is excluded by !**/tests/** and included by **/*
📒 Files selected for processing (10)
  • core/processors/__init__.py (1 hunks)
  • core/processors/document_processor.py (1 hunks)
  • core/tools/arxiv/arxiv_manager.py (8 hunks)
  • core/workflows/__init__.py (2 hunks)
  • core/workflows/workflow_arxiv_initial_ingest.py (1 hunks)
  • core/workflows/workflow_arxiv_single_pdf.py (1 hunks)
  • core/workflows/workflow_pdf.py (1 hunks)
  • docs/prd/codex_interface_prd.md (1 hunks)
  • docs/prd/memgpt_virtual_context_arango_prd.md (1 hunks)
  • docs/prd/model_manager_vllm_prd.md (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/prd/memgpt_virtual_context_arango_prd.md
🧰 Additional context used
📓 Path-based instructions (3)
core/{workflows,embedders}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Select embedder by workload: use SentenceTransformersEmbedder for high throughput (48+ papers/sec) and JinaV4Embedder for sophisticated processing; both must use late chunking

Files:

  • core/workflows/workflow_arxiv_single_pdf.py
  • core/workflows/__init__.py
  • core/workflows/workflow_pdf.py
  • core/workflows/workflow_arxiv_initial_ingest.py
**/*

⚙️ CodeRabbit configuration file

**/*: # HADES — Conveyance Framework (Efficiency view default)

CRITICAL: Correct Formula

C = (W · R · H / T) · Ctx^α

NOT "WHERE × WHAT × CONVEYANCE × TIME" (this is WRONG)

  • W = What (signal/content quality)
  • R = Where (relational/topological positioning)
  • H = Who (agent/capability leverage)
  • T = Time to converge (DIVIDES, not multiplies!)
  • Ctx = wL·L + wI·I + wA·A + wG·G (weights default 0.25)
  • α ∈ [1.5, 2.0], default 1.7, applies ONLY to Ctx

Red flags to request changes:

  • TIME multiplied rather than divided (look for "× TIME")
  • "CONVEYANCE" used as an input factor (C is the outcome, not input)
  • Missing WHO (H) in the efficiency equation
  • α applied to anything other than Ctx
  • Evidence of double-counting time

Review Focus Areas

  1. Todd's 3 Laws:

    • Data is sacrosanct: validate inputs, no silent data loss
    • Reliability: controlled failure modes
    • Ownership: comprehensive logging
  2. Factory patterns & interfaces (H, R)

  3. Backward compatibility (R, T)

  4. Performance evidence (T, W)

  5. Observability (H, T)

  6. Security (W, R)

Conveyance Scorecard Output

  • Time stance: Efficiency (default) or Capability (if T fixed)
  • W/R/H/T summary
  • L/I/A/G scores (each in [0,1])
  • α value (default 1.7)
  • Zero-propagation check
  • Evidence links

Files:

  • core/workflows/workflow_arxiv_single_pdf.py
  • core/workflows/__init__.py
  • core/tools/arxiv/arxiv_manager.py
  • docs/prd/model_manager_vllm_prd.md
  • core/processors/document_processor.py
  • core/workflows/workflow_pdf.py
  • core/workflows/workflow_arxiv_initial_ingest.py
  • core/processors/__init__.py
  • docs/prd/codex_interface_prd.md
core/workflows/**/*.py

⚙️ CodeRabbit configuration file

core/workflows/**/*.py: Embedding strategy expectations:

  • Prefer JinaV4Embedder with late chunking for high-fidelity workflows.
  • When SentenceTransformersEmbedder is selected for throughput, document the trade-off explicitly.
  • Confirm late chunking paths never discard tail tokens.

Files:

  • core/workflows/workflow_arxiv_single_pdf.py
  • core/workflows/__init__.py
  • core/workflows/workflow_pdf.py
  • core/workflows/workflow_arxiv_initial_ingest.py
🧠 Learnings (2)
📚 Learning: 2025-09-17T21:07:06.930Z
Learnt from: r3d91ll
PR: r3d91ll/HADES-Lab#47
File: core/workflows/workflow_arxiv_sorted_simple.py:260-277
Timestamp: 2025-09-17T21:07:06.930Z
Learning: The ArxivSortedWorkflow in core/workflows/workflow_arxiv_sorted_simple.py is designed for processing 2.8 million documents and requires GPU acceleration - CPU fallback would be computationally prohibitive and should not be implemented for this workflow.

Applied to files:

  • core/workflows/workflow_arxiv_single_pdf.py
  • core/workflows/workflow_arxiv_initial_ingest.py
📚 Learning: 2025-09-17T22:48:18.435Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-17T22:48:18.435Z
Learning: Applies to core/{workflows,embedders}/**/*.py : Select embedder by workload: use SentenceTransformersEmbedder for high throughput (48+ papers/sec) and JinaV4Embedder for sophisticated processing; both must use late chunking

Applied to files:

  • core/workflows/workflow_pdf.py
🧬 Code graph analysis (7)
core/workflows/workflow_arxiv_single_pdf.py (3)
core/workflows/workflow_base.py (4)
  • WorkflowBase (67-211)
  • WorkflowConfig (29-38)
  • WorkflowResult (42-64)
  • name (181-183)
core/processors/document_processor.py (4)
  • ProcessingConfig (51-82)
  • ProcessingResult (102-154)
  • process_document (210-321)
  • to_dict (116-154)
core/tools/arxiv/arxiv_api_client.py (4)
  • ArXivAPIClient (75-548)
  • DownloadResult (64-72)
  • ArXivMetadata (33-60)
  • download_paper (325-480)
core/workflows/__init__.py (2)
core/workflows/workflow_arxiv_initial_ingest.py (1)
  • ArxivInitialIngestWorkflow (158-644)
core/workflows/workflow_arxiv_single_pdf.py (1)
  • ArxivSinglePDFWorkflow (37-268)
core/tools/arxiv/arxiv_manager.py (2)
core/processors/document_processor.py (5)
  • DocumentProcessor (157-452)
  • ProcessingConfig (51-82)
  • ProcessingResult (102-154)
  • process_document (210-321)
  • ExtractionResult (86-98)
core/database/arango/memory_client.py (3)
  • ArangoMemoryClient (194-369)
  • resolve_memory_config (122-191)
  • execute_transaction (280-309)
core/processors/document_processor.py (6)
core/embedders/embedders_factory.py (1)
  • EmbedderFactory (31-169)
core/embedders/embedders_base.py (1)
  • EmbeddingConfig (22-30)
core/embedders/embedders_jina.py (3)
  • ChunkWithEmbedding (33-50)
  • embed_with_late_chunking (453-512)
  • embed_texts (156-287)
core/extractors/extractors_docling.py (1)
  • DoclingExtractor (27-414)
core/extractors/extractors_latex.py (1)
  • LaTeXExtractor (25-494)
core/extractors/extractors_base.py (1)
  • ExtractorConfig (41-52)
core/workflows/workflow_pdf.py (1)
core/processors/document_processor.py (4)
  • DocumentProcessor (157-452)
  • ProcessingConfig (51-82)
  • ProcessingResult (102-154)
  • ExtractionResult (86-98)
core/workflows/workflow_arxiv_initial_ingest.py (5)
core/database/arango/memory_client.py (6)
  • CollectionDefinition (100-106)
  • drop_collections (311-319)
  • close (209-215)
  • create_collections (321-353)
  • bulk_insert (242-272)
  • execute_query (224-240)
core/database/database_factory.py (1)
  • get_arango_memory_service (146-199)
core/embedders/embedders_factory.py (1)
  • EmbedderFactory (31-169)
core/embedders/embedders_jina.py (1)
  • embed_batch_with_late_chunking (514-548)
core/database/arango/optimized_client.py (2)
  • close (80-81)
  • query (131-158)
core/processors/__init__.py (1)
core/processors/document_processor.py (4)
  • DocumentProcessor (157-452)
  • ProcessingConfig (51-82)
  • ProcessingResult (102-154)
  • ExtractionResult (86-98)
🪛 Ruff (0.13.1)
core/workflows/workflow_arxiv_single_pdf.py

1-1: Shebang is present but file is not executable

(EXE001)


75-77: Avoid specifying long messages outside the exception class

(TRY003)

core/processors/document_processor.py

82-82: Probable insecure usage of temporary file or directory: "/dev/shm/document_staging"

(S108)


307-307: Do not catch blind exception: Exception

(BLE001)


308-308: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


371-371: Avoid specifying long messages outside the exception class

(TRY003)


455-460: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

core/workflows/workflow_pdf.py

16-21: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

core/workflows/workflow_arxiv_initial_ingest.py

1-1: Shebang is present but file is not executable

(EXE001)


167-167: Docstring contains ambiguous α (GREEK SMALL LETTER ALPHA). Did you mean a (LATIN SMALL LETTER A)?

(RUF002)


169-169: Docstring contains ambiguous α (GREEK SMALL LETTER ALPHA). Did you mean a (LATIN SMALL LETTER A)?

(RUF002)


214-217: Avoid specifying long messages outside the exception class

(TRY003)


275-275: Consider moving this statement to an else block

(TRY300)


277-277: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


292-292: Dictionary key literal "arxiv_metadata" repeated

Remove repeated key literal "arxiv_metadata"

(F601)


375-378: Avoid specifying long messages outside the exception class

(TRY003)


411-411: Avoid specifying long messages outside the exception class

(TRY003)


459-459: Do not catch blind exception: Exception

(BLE001)


473-473: Do not catch blind exception: Exception

(BLE001)

core/processors/__init__.py

17-27: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🪛 markdownlint-cli2 (0.18.1)
docs/prd/codex_interface_prd.md

138-138: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (19)
core/processors/__init__.py (2)

3-8: Good public re-exports.

Centralizing types at package level improves DX and helps migration.


12-15: Exposing token/window chunkers is appropriate.

Matches the new strategies surfaced elsewhere; no issues spotted.

docs/prd/codex_interface_prd.md (1)

7-18: Conveyance formula usage is correct.

C = (W·R·H / T) · Ctx^α with α applied only to Ctx; numbers back-calc to ≈0.52.

docs/prd/model_manager_vllm_prd.md (1)

7-18: Conveyance formula usage is correct.

C = (W·R·H / T) · Ctx^α with α only on Ctx; ≈0.61 checks out.

core/workflows/workflow_pdf.py (1)

2-6: Wrapper for backward compatibility looks good.

Keeps existing import paths stable during the migration.

core/processors/document_processor.py (1)

232-237: Add tail-token safety assertion for late and traditional chunking

Add a small test/assert that verifies the final chunk covers the full token span (example: assert chunks[-1]["end_token"] == len(tokens)) for both chunking_strategy == "late" and "traditional".
Location: core/processors/document_processor.py (around lines 232–237 and 262–275).
Verification attempted but the suggested script failed with ModuleNotFoundError: No module named 'numpy' — re-run verification in CI or locally after installing dependencies to confirm.
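One possible shape for that check; the processor construction and tokenizer access are assumptions about the current API, not the module's exact surface:

import pytest

from core.processors.document_processor import DocumentProcessor, ProcessingConfig

@pytest.mark.parametrize("strategy", ["late", "traditional"])
def test_final_chunk_covers_all_tokens(strategy):
    # Hypothetical construction; swap in the real ProcessingConfig/DocumentProcessor signatures.
    processor = DocumentProcessor(ProcessingConfig(chunking_strategy=strategy))
    text = "token " * 5000  # long enough to force multiple chunks
    result = processor.process_document(doc_id="test", text=text)
    tokens = processor.embedder.tokenizer.tokenize(text)  # assumed tokenizer handle
    assert result.chunks[-1]["end_token"] == len(tokens), "tail tokens must not be discarded"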

core/workflows/workflow_arxiv_single_pdf.py (2)

65-67: LGTM: Proper enforcement of late chunking for ArXiv workflows.

The code correctly enforces late chunking strategy to maintain full document context, aligning with the Conveyance Framework requirement for late chunking mentioned in the coding guidelines.


177-181: LGTM: Clean separation of concerns.

Good design to delegate processing to the DocumentProcessor while keeping ArXiv-specific workflow logic separate.

core/tools/arxiv/arxiv_manager.py (6)

257-261: LGTM: Memory client initialization looks good.

The transition to memory-based Arango client is properly implemented with correct configuration resolution.


336-348: Robust author and category normalization.

Good defensive handling of different metadata formats (list vs string) for authors and categories. This prevents runtime errors from inconsistent data sources.


393-400: LGTM: Proper async execution pattern.

Good use of run_in_executor with partial to offload the expensive processing to a thread pool while maintaining async compatibility.


508-536: Transaction uses correct ArangoDB module.

The transaction correctly uses require('@arangodb').db which is the standard for ArangoDB server-side JavaScript transactions, consistent with ArangoDB documentation.


579-595: LGTM: Proper error handling in batch processing.

Good error isolation - each failed paper gets a synthetic ProcessingResult with error details, ensuring the batch continues and all IDs get results.


597-603: Smart concurrent processing with ordering preservation.

Excellent use of semaphore for controlled concurrency and asyncio.gather to maintain result ordering matching the input IDs.
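For readers unfamiliar with the pattern, a generic sketch of bounded concurrency with order-preserving gather (names are illustrative, not the manager's actual API):

import asyncio

async def process_many(paper_ids, process_one, max_concurrency=4):
    # gather() returns results in the same order as the awaitables passed in,
    # while the semaphore caps how many run at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(paper_id):
        async with semaphore:
            return await process_one(paper_id)

    return await asyncio.gather(*(bounded(pid) for pid in paper_ids))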

core/workflows/__init__.py (2)

14-37: LGTM: Clean conditional import pattern.

Good defensive programming with try/except blocks and None defaults to handle optional dependencies gracefully. This allows the module to import successfully even when specific workflow implementations are unavailable.


51-58: LGTM: Dynamic all construction.

Smart pattern to only export workflow classes that successfully imported, preventing AttributeError when clients try to access unavailable workflows.
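The pattern in miniature, as an illustration rather than the module's exact code:

try:
    from .workflow_arxiv_single_pdf import ArxivSinglePDFWorkflow
except ImportError:  # optional dependency missing
    ArxivSinglePDFWorkflow = None

_exports = {"ArxivSinglePDFWorkflow": ArxivSinglePDFWorkflow}
__all__ = [name for name, obj in _exports.items() if obj is not None]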

core/workflows/workflow_arxiv_initial_ingest.py (3)

81-86: LGTM: SentenceTransformers deprecation handling.

Good defensive handling of the deprecated SentenceTransformers with clear warning and automatic fallback to a compatible model.


547-556: LGTM: Proper queue draining logic.

Excellent fix to continue draining the output queue even after stop_event is set, preventing data loss during shutdown.
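The drain-after-stop pattern in isolation, assuming a standard queue.Queue and threading.Event (illustrative, not the workflow's exact code):

import queue
import threading

def storage_worker(output_queue: queue.Queue, stop_event: threading.Event, store) -> None:
    # Exit only once stop has been signalled AND the queue is fully drained.
    while not stop_event.is_set() or not output_queue.empty():
        try:
            item = output_queue.get(timeout=0.5)
        except queue.Empty:
            continue
        store(item)
        output_queue.task_done()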


478-479: Make the collection-count queries explicit.

The braces here are f-string placeholders, so the generated AQL (`RETURN LENGTH(<collection_name>)`) is valid and does return the document count; still, an explicit COLLECT WITH COUNT INTO form is clearer and avoids any confusion about what gets interpolated:

 # Get count using query
-count_result = self.memory_client.execute_query(f"RETURN LENGTH({self.metadata_collection})")
+count_result = self.memory_client.execute_query(f"FOR d IN {self.metadata_collection} COLLECT WITH COUNT INTO length RETURN length")
 actual_count = count_result[0] if count_result else 0
 
 # Check if embeddings exist using query
-count_result = self.memory_client.execute_query(f"RETURN LENGTH({self.embeddings_collection})")
+count_result = self.memory_client.execute_query(f"FOR e IN {self.embeddings_collection} COLLECT WITH COUNT INTO length RETURN length")
 embeddings_count = count_result[0] if count_result else 0

Also applies to: 490-491

Comment on lines +172 to +191
        embedder_type = (self.config.embedder_type or "jina").lower()
        model_name = self.config.embedding_model or "jinaai/jina-embeddings-v4"
        if model_name.lower() in {"jina-v4", "jinaai/jina-v4"}:
            model_name = "jinaai/jina-embeddings-v4"

        if embedder_type in {"sentence", "sentence-transformers"} and "sentence-transformers" not in model_name:
            logger.warning(
                "SentenceTransformers embedder requested without explicit model; "
                "defaulting to sentence-transformers/all-mpnet-base-v2.",
            )
            model_name = "sentence-transformers/all-mpnet-base-v2"

        embed_config = EmbeddingConfig(
            model_name=model_name,
            device=self.config.device or "cuda",
            batch_size=self.config.batch_size,
            use_fp16=self.config.use_fp16,
            chunk_size_tokens=self.config.chunk_size_tokens,
            chunk_overlap_tokens=self.config.chunk_overlap_tokens,
        )

⚠️ Potential issue

Fix model/driver mismatch and device default.

When embedder_type is 'sentence', you later route to JinaV4 but keep an ST model name; also defaulting device to 'cuda' can break CPU-only nodes.

-        if embedder_type in {"sentence", "sentence-transformers"} and "sentence-transformers" not in model_name:
-            logger.warning(
-                "SentenceTransformers embedder requested without explicit model; "
-                "defaulting to sentence-transformers/all-mpnet-base-v2.",
-            )
-            model_name = "sentence-transformers/all-mpnet-base-v2"
+        if embedder_type in {"sentence", "sentence-transformers"}:
+            logger.warning(
+                "SentenceTransformers path is deprecated; routing to JinaV4Embedder with a compatible model.",
+            )
+            model_name = "jinaai/jina-embeddings-v4"
@@
-        embed_config = EmbeddingConfig(
-            model_name=model_name,
-            device=self.config.device or "cuda",
+        # Device auto-detect with safe CPU fallback
+        _device = self.config.device
+        if not _device:
+            try:
+                import torch  # local import
+                _device = "cuda" if torch.cuda.is_available() else "cpu"
+            except Exception:
+                _device = "cpu"
+        embed_config = EmbeddingConfig(
+            model_name=model_name,
+            device=_device,
             batch_size=self.config.batch_size,
             use_fp16=self.config.use_fp16,
             chunk_size_tokens=self.config.chunk_size_tokens,
             chunk_overlap_tokens=self.config.chunk_overlap_tokens,
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        embedder_type = (self.config.embedder_type or "jina").lower()
        model_name = self.config.embedding_model or "jinaai/jina-embeddings-v4"
        if model_name.lower() in {"jina-v4", "jinaai/jina-v4"}:
            model_name = "jinaai/jina-embeddings-v4"
        if embedder_type in {"sentence", "sentence-transformers"} and "sentence-transformers" not in model_name:
            logger.warning(
                "SentenceTransformers embedder requested without explicit model; "
                "defaulting to sentence-transformers/all-mpnet-base-v2.",
            )
            model_name = "sentence-transformers/all-mpnet-base-v2"
        embed_config = EmbeddingConfig(
            model_name=model_name,
            device=self.config.device or "cuda",
            batch_size=self.config.batch_size,
            use_fp16=self.config.use_fp16,
            chunk_size_tokens=self.config.chunk_size_tokens,
            chunk_overlap_tokens=self.config.chunk_overlap_tokens,
        )

        embedder_type = (self.config.embedder_type or "jina").lower()
        model_name = self.config.embedding_model or "jinaai/jina-embeddings-v4"
        if model_name.lower() in {"jina-v4", "jinaai/jina-v4"}:
            model_name = "jinaai/jina-embeddings-v4"
        if embedder_type in {"sentence", "sentence-transformers"}:
            logger.warning(
                "SentenceTransformers path is deprecated; routing to JinaV4Embedder with a compatible model.",
            )
            model_name = "jinaai/jina-embeddings-v4"
        # Device auto-detect with safe CPU fallback
        _device = self.config.device
        if not _device:
            try:
                import torch  # local import
                _device = "cuda" if torch.cuda.is_available() else "cpu"
            except Exception:
                _device = "cpu"
        embed_config = EmbeddingConfig(
            model_name=model_name,
            device=_device,
            batch_size=self.config.batch_size,
            use_fp16=self.config.use_fp16,
            chunk_size_tokens=self.config.chunk_size_tokens,
            chunk_overlap_tokens=self.config.chunk_overlap_tokens,
        )


# Package results
results = []
for record, chunks in zip(records, all_chunks, strict=False):

🛠️ Refactor suggestion

Consider using strict=True for zip safety.

The zip call uses strict=False which could silently ignore length mismatches between records and chunks lists.

-for record, chunks in zip(records, all_chunks, strict=False):
+for record, chunks in zip(records, all_chunks, strict=True):

This ensures that the records and chunks lists have the same length, preventing silent data loss if there's a mismatch.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for record, chunks in zip(records, all_chunks, strict=False):
for record, chunks in zip(records, all_chunks, strict=True):
🤖 Prompt for AI Agents
In core/workflows/workflow_arxiv_initial_ingest.py around line 134, the zip call
uses strict=False which can silently drop items if records and all_chunks have
different lengths; change to zip(records, all_chunks, strict=True) to enforce
equal lengths, or before zipping add an explicit length check (if len(records)
!= len(all_chunks): raise ValueError(...)) if you need to support Python
versions without zip strictness, so mismatches fail fast instead of being
silently ignored.

Comment on lines +291 to +292
"arxiv_metadata": ["arxiv_metadata"],
"arxiv_metadata": ["arxiv_metadata"],

⚠️ Potential issue

Duplicate dictionary key in alias_map.

The dictionary has duplicate "arxiv_metadata" keys which will result in only the second value being retained.

 alias_map = {
-    "arxiv_metadata": ["arxiv_metadata"],
     "arxiv_metadata": ["arxiv_metadata"],
     "arxiv_abstract_chunks": ["arxiv_chunks"],
     "arxiv_chunks": ["arxiv_abstract_chunks"],
     "arxiv_abstract_embeddings": ["arxiv_embeddings"],
     "arxiv_embeddings": ["arxiv_abstract_embeddings"],
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"arxiv_metadata": ["arxiv_metadata"],
"arxiv_metadata": ["arxiv_metadata"],
alias_map = {
"arxiv_metadata": ["arxiv_metadata"],
"arxiv_abstract_chunks": ["arxiv_chunks"],
"arxiv_chunks": ["arxiv_abstract_chunks"],
"arxiv_abstract_embeddings": ["arxiv_embeddings"],
"arxiv_embeddings": ["arxiv_abstract_embeddings"],
}
🧰 Tools
🪛 Ruff (0.13.1)

292-292: Dictionary key literal "arxiv_metadata" repeated

Remove repeated key literal "arxiv_metadata"

(F601)

🤖 Prompt for AI Agents
In core/workflows/workflow_arxiv_initial_ingest.py around lines 291-292 there is
a duplicate "arxiv_metadata" key in the alias_map dict which means the first
entry will be overwritten; remove the duplicate key or merge their values so
each key is unique (e.g., keep a single "arxiv_metadata": ["arxiv_metadata"]
entry or combine any different lists into one unique list) and save the file,
then run the workflow tests/lint to verify no regressions.

@r3d91ll r3d91ll merged commit 7ebd001 into main Sep 21, 2025
1 check passed