Skip to content

Conversation

@r3d91ll
Copy link
Owner

@r3d91ll r3d91ll commented Sep 8, 2025

Summary

Implements GraphSAGE for discovering connections between theoretical papers and practical code implementations, addressing Issue #32.

Key Features

  • RAM-first architecture: Entire graph loaded into memory (~22GB for current dataset)
  • Heterogeneous graph support: Handles papers, code, chunks as different node types
  • Multiple bridge discovery methods:
    • Direct: Explicit edges
    • Semantic: Embedding similarity
    • Structural: Shared neighborhoods
    • Latent: GraphSAGE-discovered patterns
  • Zero-copy shared memory: For <1ms access times
  • Incremental updates: Add nodes without retraining

Architecture

core/framework/
├── memory_store.py         # RAM-based graph storage
├── graph_embedders.py      # GraphSAGE model

tools/graphsage/
├── pipelines/              # Main execution pipeline
├── bridge_discovery/       # Theory-practice finder
├── utils/                  # Neighborhood sampling
└── configs/               # Configuration

Performance

  • Current graph: ~2.8M nodes, only needs ~22GB RAM
  • With 256GB available, can scale to 26M+ nodes
  • Expected <1ms latency with Unix sockets (vs 100ms with TCP)
  • 100-1000x speedup over network+disk approach

Testing

cd tools/graphsage/pipelines/
python graphsage_pipeline.py --config ../configs/graphsage_config.yaml

Related

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added GraphSAGE pipeline with configurable model/sampling and ArangoDB persistence.
    • Introduced in-memory graph store with shared-memory embeddings.
    • Delivered GPU/FAISS/batched graph builders, full orchestration, and GraphManager exports.
    • Added HTTP↔Unix-socket bridge and Unix socket server for graph operations.
    • New monitoring suite (system, power, metrics) and interdisciplinary analysis tools.
    • Added Jina v4 embedders and sentence/vLLM variants; node feature loader and sampling utilities.
  • Documentation

    • Major docs overhaul: architecture, PRDs, quick-starts, edge taxonomy, baselines; ArangoDB setup clarified.
  • Tests

    • Performance benchmarks for DB vs memory and RAMFS; Unix-socket benchmarking; experiment comparisons.
  • Chores

    • Updated configs/host, added dependency, expanded .gitignore, added data exports; deprecated MCP server docs.

r3d91ll and others added 2 commits September 8, 2025 13:52
Major changes:
- Create import_fresh_arxiv_dataset.py to download and import latest ArXiv data from Kaggle
- Direct import to ArangoDB (removed PostgreSQL dependency)
- Downloads dataset through August 2024 with 2.7M+ papers
- Supports batch processing and progress tracking

Configuration improvements:
- Fix ArangoDB connection to use localhost instead of IP (eliminate network overhead)
- Update CLAUDE.md with correct database access instructions
- Add kagglehub dependency for dataset downloads

Performance testing:
- Add test_memory_rag.py to compare database vs pure memory performance
- Demonstrates 245x speedup with pure memory approach
- Add test_ramfs_arangodb.py for RAMfs performance testing

Testing results:
- Successfully imported fresh dataset
- Processed 100 papers with 100% success rate
- 6.2 papers/minute end-to-end throughput
- Ready for searching 2024 unified intelligence papers

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
…32)

Add complete GraphSAGE implementation for discovering connections between
theoretical papers and practical code implementations.

Core Framework:
- Add GraphMemoryStore for RAM-based graph storage with shared memory support
- Implement GraphSAGEEmbedder with mean/pooling/LSTM aggregators
- Support heterogeneous graphs with multiple node/edge types

GraphSAGE Tool:
- Create complete pipeline for loading, embedding, and bridge discovery
- Implement neighborhood sampling with uniform/importance/adaptive strategies
- Add theory-practice bridge finder with 4 discovery methods:
  - Direct: Explicit edges between papers and code
  - Semantic: Embedding similarity in GraphSAGE space
  - Structural: Shared graph neighborhoods
  - Latent: GraphSAGE-discovered patterns

Key Features:
- RAM-first architecture (entire graph in memory)
- Zero-copy access via shared memory
- <1ms latency potential with Unix sockets
- Incremental updates without retraining
- Conveyance scoring for bridge ranking

This addresses the need for cross-domain, heterogeneous graph learning
that hierarchical approaches like HiRAG would struggle with.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

Adds a RAM-first GraphSAGE stack (model, memory store, pipeline, configs), extensive ArangoDB tooling (manager, Unix-socket client, benchmarks), a full graph-building/orchestration suite with connectors, monitoring utilities, embedders, and analyses. Updates docs/configs and removes MCP server docs. Introduces multiple CLIs, data exports, and test/benchmark scripts.

Changes

Cohort / File(s) Summary
GraphSAGE core & RAM store
core/framework/graph_embedders.py, core/framework/memory_store.py
New GraphSAGE embedder with aggregators, neighbor sampling, batch inference/training stubs; in-memory GraphMemoryStore with ArangoDB loader, adjacency, shared-memory embeddings, disk persistence.
GraphSAGE toolchain
tools/graphsage/__init__.py, tools/graphsage/configs/graphsage_config.yaml, tools/graphsage/pipelines/graphsage_pipeline.py, tools/graphsage/bridge_discovery/__init__.py, tools/graphsage/utils/__init__.py
Public package, config, checkpointed pipeline (load graph, init model, generate embeddings, discover bridges, persist to ArangoDB), and exports for utils/bridge discovery.
ArangoDB management & transports
core/database/arango/arango_manager.py, core/database/arango/arango_unix_client.py, core/database/__init__.py, core/database/arango/arango_validator.py
Production Arango manager with CLI (stats, backup, drop/truncate/create/verify); Unix-socket HTTP client with AQL, bulk writes, fallback HTTP; import paths updated.
ArangoDB benchmarks & RAMFS
core/database/benchmark_unix_socket.py, core/database/unix_socket_benchmark_results.json, test_ramfs_arangodb.py, test_memory_rag.py
Benchmarks for Unix vs HTTP, RAMFS data copy and search comparisons, in-memory vs DB similarity tests, and recorded results.
Graph build: builders & orchestration
core/graph/builders/* (multiple), core/graph/orchestrate_graph_build.py, core/graph/orchestrations/arxiv_death_of_author_v1.py, core/graph/test_optimized_build.py
Parallel/optimized builders for category, temporal, keyword similarity (GPU/FAISS/NVLink), batched workflows, complete graph CLI, orchestration (socket/HTTP), and experiments/tests.
Graph connectors (HTTP ↔ Unix socket)
core/graph/connectors/http_socket_bridge.py, core/graph/connectors/socket_server.py, core/graph/connectors/test_client.py
FastAPI HTTP bridge to Unix-socket backend and aiohttp Unix socket server exposing build/export/status ops; async test client.
Graph management & utils
core/graph/graph_manager.py, core/graph/utils/* (analysis, features, sampling, monitor), core/graph/builders/analyze_interdisciplinary.py
Export to PyG, stats, interdisciplinary analysis, node feature loading, subgraph sampling, progress monitors, exploration utilities.
Embedders (Jina v4)
core/framework/embedders.py, core/framework/sentence_embedder.py, core/framework/vllm_embedder.py, core/framework/vllm_retrieval_embedder.py
Device-aware loading tweaks; new SentenceTransformers embedder; vLLM-based general and retrieval embedders for Jina v4.
Documentation (graph architecture & PRDs)
core/graph/*.md, docs/* (PRDs, baseline, notes, edge types)
New architecture, optimization notes, summaries, PRDs (config/monitoring/restructure/assistant), baseline results, and edge taxonomy docs.
Data exports
core/graph/exports/*
Added graph statistics and theory-practice bridge samples (JSON).
Config & ignores
tools/arxiv/configs/acid_pipeline_phased.yaml, pyproject.toml, .gitignore
Arango host to localhost, add kagglehub dependency, ignore checkpoints/cache and workspace dirs.
Processors/utils surface changes
core/processors/__init__.py, core/utils/__init__.py, core/workflows/workflow_pdf_batch.py
Re-scoped processors to text chunking exports; removed State/Checkpoint managers from utils; updated Arango import path.
Removed MCP server docs/config
core/mcp_server/* (README, CLAUDE.md, config, server, launch, requirements)
Deleted MCP server documentation/config/launcher/server code.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Pipeline as GraphSAGEPipeline
  participant Store as GraphMemoryStore
  participant Model as GraphSAGEEmbedder
  participant DB as ArangoDB

  User->>Pipeline: run(config)
  activate Pipeline
  Pipeline->>Store: load_from_arangodb(cfg)
  Store-->>Pipeline: Graph loaded (nodes/edges)
  Pipeline->>Model: initialize(config, store)
  Pipeline->>Model: generate_embeddings(batch)
  Model->>Store: sample_neighbors(k per layer)
  Model-->>Pipeline: embeddings (NxD)
  Pipeline->>DB: upsert graphsage_embeddings (batches)
  Pipeline->>Pipeline: discover_bridges()
  Pipeline->>DB: upsert bridge_predictions
  Pipeline-->>User: bridges + stats
  deactivate Pipeline
Loading
sequenceDiagram
  autonumber
  actor Client as HTTP Client
  participant HTTP as FastAPI Bridge
  participant UDS as Unix Socket Server
  participant Builder as Graph Builder(s)

  Client->>HTTP: POST /graph/build {params}
  alt Unix socket available
    HTTP->>UDS: forward /process {operation, params}
    UDS->>Builder: run build (thread/executor)
    Builder-->>UDS: result {stats}
    UDS-->>HTTP: JSON {success,data}
  else Fallback/Error
    HTTP-->>Client: 503/500 with error
  end
  HTTP-->>Client: 200 {success,data}
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180 minutes

Possibly related issues

Possibly related PRs

Poem

A rabbit taps keys with a hop and a dash,
Pipes to a socket—no networky stash.
SAGE learns the graph, bridges appear,
RAM hums along, queries draw near.
With paws on the logs and whiskers that twitch,
We built it in memory—fast as a glitch. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Linked Issues Check ⚠️ Warning The PR implements many of Issue #32's core coding objectives: it adds GraphSAGE machinery (core/framework/graph_embedders.py with aggregator variants and GraphSAGEEmbedder), a RAM-first GraphMemoryStore with shared‑memory support, a configurable tools/graphsage pipeline and bridge-discovery modules, and Unix‑socket / optimized ArangoDB clients that map directly to the linked requirements [#32]. However, important gaps remain: GraphSAGEPipeline.train_model is a placeholder (no training flow implemented), multi‑layer propagation in graph_embedders is only partially implemented (fallbacks / NotImplementedError noted), and an explicit, persistent graph_neighborhoods caching/write path is not clearly present. Because these are material to an end‑to‑end, production GraphSAGE solution, the PR does not fully satisfy the linked issue's end‑to‑end requirements. Address the outstanding gaps before merging: implement or document the training flow (GraphSAGEPipeline.train_model) and complete true multi‑layer propagation in graph_embedders, add/verify the graph_neighborhoods caching/collection handling and its persistence, and provide focused unit/integration tests and a short README showing the end‑to‑end training→embedding→bridge discovery workflow and performance validation.
Out of Scope Changes Check ⚠️ Warning The changeset includes several large edits that appear out of scope for the GraphSAGE feature and increase merge risk: notably removal of the entire MCP server surface (core/mcp_server/* including server.py, launch.py, server_config.yaml, README.md, CLAUDE.md) and deletion of core/framework/requirements_framework.txt, neither of which are necessary to implement GraphSAGE and could break unrelated workflows. In addition, the PR bundles broad infra and documentation work (many new graph‑builder scripts, monitoring modules, PRDs, and exported artifacts) alongside the core GraphSAGE implementation, making the review surface much larger than the linked issue intended. Split this PR: move unrelated deletions and large infra/docs additions into separate, focused PRs; if MCP server files were intentionally removed include an explicit rationale and migration plan or restore them and open a dedicated follow‑up PR, and keep the GraphSAGE change set narrowly scoped with accompanying tests and docs.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "feat: GraphSAGE for Theory-Practice Bridge Discovery" is concise, single‑sented, and accurately highlights the main change (adding GraphSAGE-based bridge discovery). It clearly communicates the primary intent so a reviewer scanning history will understand the PR's focus.
Docstring Coverage ✅ Passed Docstring coverage is 96.68% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/graphsage-bridge-discovery

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 20

🧹 Nitpick comments (50)
pyproject.toml (1)

44-44: Make KaggleHub optional behind a Poetry extra.

Kaggle-based ingestion is ancillary. Gate kagglehub behind an extra to avoid pulling it into prod images that don’t use Kaggle.

Apply within this line:

-kagglehub = "^0.3.13"
+kagglehub = { version = "^0.3.13", optional = true }

Add this extra (outside the selected range):

[tool.poetry.extras]
kaggle = ["kagglehub"]
tools/arxiv/configs/acid_pipeline_phased.yaml (1)

6-6: Parameterize ArangoDB host via environment variables
In tools/arxiv/configs/acid_pipeline_phased.yaml, replace

host: 'http://localhost:8529'

with

host: "${ARANGO_SCHEME:-http}://${ARANGO_HOST:-localhost}:${ARANGO_PORT:-8529}"
CLAUDE.md (2)

199-212: Avoid bare except in the verification snippet; print actionable errors.

Silent failures hinder debugging.

 for coll in ['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings']:
-    try:
-        count = db.collection(coll).count()
-        print(f'{coll}: {count:,} documents')
-    except:
-        pass
+    try:
+        count = db.collection(coll).count()
+        print(f'{coll}: {count:,} documents')
+    except Exception as e:
+        print(f'{coll}: error ({e})')

318-327: Fix markdownlint MD004 (unordered list style).

Use asterisks to match repo lint settings.

-     - `arxiv_papers`: Paper metadata and processing status (~2.7M papers)
-     - `arxiv_chunks`: Text segments with context preservation
-     - `arxiv_embeddings`: 2048-dimensional Jina v4 vectors
-     - `arxiv_equations`: Mathematical equations from papers
-     - `arxiv_tables`: Extracted tables
-     - `arxiv_images`: Image metadata and captions
+     * `arxiv_papers`: Paper metadata and processing status (~2.7M papers)
+     * `arxiv_chunks`: Text segments with context preservation
+     * `arxiv_embeddings`: 2048-dimensional Jina v4 vectors
+     * `arxiv_equations`: Mathematical equations from papers
+     * `arxiv_tables`: Extracted tables
+     * `arxiv_images`: Image metadata and captions
test_memory_rag.py (8)

41-47: Remove unused variable and ensure full cursor consumption.

results is unused. Simpler to exhaust the cursor without assigning.

-            cursor = db.aql.execute(
+            cursor = db.aql.execute(
                 query,
                 bind_vars={'query_vector': query_vector},
                 batch_size=10
             )
-            results = list(cursor)
+            _ = list(cursor)

52-53: Don’t catch broad Exception; handle Arango-specific errors.

Improves signal and avoids masking issues.

-        except Exception as e:
-            print(f"  Query failed: {e}")
+        except Exception as e:
+            # Optionally narrow: from arango.exceptions import ArangoServerError, AQLQueryExecuteError
+            print(f"  Query failed: {e}")

If you prefer strict narrowing, I can patch imports and exception classes.


8-13: Tidy imports and add sys for exit; drop unused types.

Also optional: pre-import Arango exceptions if narrowing.

-import os
+import os
+import sys
 import time
 import numpy as np
 from arango import ArangoClient
-from typing import List, Dict, Any
+from typing import Dict
+# from arango.exceptions import ArangoServerError, AQLQueryExecuteError  # optional

146-149: Allow ArangoDB host/port override via env for portability.

Respects ARANGO_HOST/ARANGO_PORT while keeping localhost default.

-    # Connect to ArangoDB
-    client = ArangoClient(hosts='http://localhost:8529')
+    # Connect to ArangoDB
+    host = os.environ.get('ARANGO_HOST', 'localhost')
+    port = os.environ.get('ARANGO_PORT', '8529')
+    scheme = os.environ.get('ARANGO_SCHEME', 'http')
+    client = ArangoClient(hosts=f'{scheme}://{host}:{port}')
     db = client.db('academy_store', username='root', password=os.environ.get('ARANGO_PASSWORD'))

1-1: Remove shebang or mark file executable.

Shebang triggers EXE001; tests typically don’t need it.

-#!/usr/bin/env python3

191-191: Drop redundant f-prefix.

Lint: F541.

-            print(f"\nFor perspective:")
+            print("\nFor perspective:")

217-221: Use sys.exit for non-interactive scripts.

exit() is for REPL.

-    if not os.environ.get('ARANGO_PASSWORD'):
-        print("Error: ARANGO_PASSWORD environment variable not set")
-        exit(1)
+    if not os.environ.get('ARANGO_PASSWORD'):
+        print("Error: ARANGO_PASSWORD environment variable not set")
+        sys.exit(1)

23-24: Smaller payload to DB: use float32 for query vector.

Saves bandwidth and (slightly) parsing time.

-        query_vector = np.random.randn(2048).tolist()
+        query_vector = np.random.randn(2048).astype(np.float32).tolist()
core/framework/memory_store.py (4)

283-291: Don’t swallow exceptions on cleanup.

Replace bare except with narrow handling and log once; aligns with Ruff BLE/E722 hints.

     def cleanup(self):
         """Clean up shared memory."""
         if self.shared_memory:
             self.shared_memory.close()
             try:
                 self.shared_memory.unlink()
-            except:
-                pass
+            except FileNotFoundError:
+                # Already unlinked
+                return

1-10: Library module: remove shebang.

Shebang on a non-executable library file triggers Ruff EXE001; drop it.


266-282: Improve memory usage estimate clarity.

The estimator is very rough (e.g., 8 bytes per edge); consider documenting assumptions and exposing the estimate components in GraphStats for transparency.


35-41: Docstring: tie to framework tags explicitly.

Add explicit CONVEYANCE/WHERE/WHAT/TIME notes to comply with repo guidelines.

 class GraphMemoryStore:
-    """
-    RAM-based graph storage with shared memory support.
-    
-    Keeps entire graph in memory for ultra-fast access (<1ms).
-    Supports heterogeneous graphs with multiple node and edge types.
-    """
+    """
+    RAM-based graph storage with shared memory support.
+    - CONVEYANCE: minimizes movement by colocating topology and embeddings in RAM.
+    - WHERE: adjacency and neighborhoods are O(1) accessible for aggregation.
+    - WHAT: optional embedding buffer exposes feature space to embedders.
+    - TIME: sub-ms access via zero-copy shared memory across processes.
+    """
test_ramfs_arangodb.py (6)

171-177: Remove unused variable.

results is unused; assign to _ to appease linters.

-            results = list(cursor)
+            _ = list(cursor)

126-137: Return type should allow error string.

Function returns an 'error' key in some paths; widen type to Dict[str, Any].

-def benchmark_search(db, num_queries: int = 100) -> Dict[str, float]:
+def benchmark_search(db, num_queries: int = 100) -> Dict[str, Any]:

214-218: Fix extraneous f-strings.

Remove f where no placeholders exist (Ruff F541).

-    print(f"\nDisk Performance:")
+    print("\nDisk Performance:")

268-275: Fix extraneous f-strings.

Same here (Ruff F541).

-        print(f"\nMemory Performance:")
+        print("\nMemory Performance:")

176-181: Catch narrower exceptions from python-arango.

Prefer ArangoServerError/ArangoClientError to avoid blind except Exception.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoServerError, ArangoClientError
@@
-        except Exception as e:
+        except (ArangoServerError, ArangoClientError) as e:
             print(f"Query failed: {e}")
             continue

1-3: Shebang on non-executable file.

Either make it executable or drop the shebang (Ruff EXE001). Given this is a module, drop it.

tools/arxiv/utils/import_fresh_arxiv_dataset.py (6)

1-9: Augment module docstring with framework tags.

Tie to CONVEYANCE/WHERE/WHAT/TIME as per repo guidelines.

 """
 Import fresh ArXiv dataset from Kaggle (August 2024 version) directly to ArangoDB.
 
 This script:
 1. Downloads the latest ArXiv dataset using kagglehub
 2. Imports it directly into ArangoDB
 3. Shows statistics about the data
+
+Framework notes:
+- CONVEYANCE: bulk, atomic ingestion minimizes intermediate transformations.
+- WHERE/WHAT: preserves full-doc metadata to enable late chunking downstream.
+- TIME: prepares recency-aware subsets (e.g., 2024) for prioritized processing.
 """

127-136: Make limit Optional and import the type.

Align with PEP 484 (RUF013).

-from typing import Dict, Any, List
+from typing import Dict, Any, List, Optional
@@
-def import_to_arangodb(dataset_path: Path, limit: int = None):
+def import_to_arangodb(dataset_path: Path, limit: Optional[int] = None):

151-152: Avoid list indexing for first match.

Use next(Path.glob(...)) (RUF015).

-    json_file = list(dataset_path.glob("*.json"))[0]
+    json_file = next(dataset_path.glob("*.json"))

67-93: Avoid bare except in analysis loop.

Catch ValueError for date parsing and log a bounded number of failures.

-                            except:
-                                pass
+                            except ValueError:
+                                # ignore unparsable dates
+                                pass

239-274: Collections list: align with latest Arango layout.

You still create arxiv_structures; if that bucket is deprecated per docs, either drop it or add a comment why it remains.


321-332: Exit early messaging: keep non-executable file free of shebang.

No action if you intend to run as a script, else remove shebang at Line 1 (Ruff EXE001).

tools/graphsage/utils/__init__.py (1)

1-1: Enrich docstring to tie utilities to the CONVEYANCE framework.

Align with repository guidance by briefly stating WHERE/WHAT/TIME and how the sampler/utilities support GraphSAGE conveyance.

Apply:

-"""GraphSAGE utilities."""
+"""GraphSAGE utilities.
+
+Supports CONVEYANCE across WHERE (graph neighborhoods), WHAT (node/content features),
+and TIME (inductive sampling dynamics) to enable Theory–Practice bridge discovery.
+"""
tools/graphsage/bridge_discovery/__init__.py (2)

5-5: Sort all to satisfy Ruff (RUF022).

-__all__ = ['TheoryPracticeFinder', 'BridgeDiscoveryConfig', 'Bridge']
+__all__ = ['Bridge', 'BridgeDiscoveryConfig', 'TheoryPracticeFinder']

1-1: Optional: tie docstring to the bridge-discovery rationale.

-"""Bridge discovery modules."""
+"""Bridge discovery modules.
+
+Operationalizes CONVEYANCE by linking theoretical (papers) and practical (code) nodes
+via direct, semantic, structural, and latent mechanisms.
+"""
tools/graphsage/__init__.py (2)

12-19: Sort all for consistency and Ruff compliance (RUF022).

 __all__ = [
-    'GraphSAGEPipeline',
-    'PipelineConfig',
-    'TheoryPracticeFinder',
-    'BridgeDiscoveryConfig',
-    'NeighborhoodSampler',
-    'SamplingConfig'
+    'BridgeDiscoveryConfig',
+    'GraphSAGEPipeline',
+    'NeighborhoodSampler',
+    'PipelineConfig',
+    'SamplingConfig',
+    'TheoryPracticeFinder',
 ]

8-10: Optional: import via subpackage to leverage package-level exports.

-from .bridge_discovery.theory_practice_finder import TheoryPracticeFinder, BridgeDiscoveryConfig
+from .bridge_discovery import TheoryPracticeFinder, BridgeDiscoveryConfig
tools/graphsage/configs/graphsage_config.yaml (1)

57-57: Fix YAML lint: trailing spaces and missing EOF newline.

Explanation:

  • Remove trailing spaces on Lines 57, 63, 69.
  • Ensure a newline at end of file (adds a logical Line 77).

Also applies to: 63-63, 69-69, 77-77

tools/graphsage/bridge_discovery/theory_practice_finder.py (5)

1-1: Remove non-executable shebang (Ruff EXE001) or make the file executable.

-#!/usr/bin/env python3

If you intend to run this as a script, set the executable bit instead.


14-19: Fix typing: import Any, use Dict[str, Any], and add conveyance_score field.

-from typing import List, Dict, Tuple, Optional, Set
+from typing import Any, List, Dict, Tuple, Optional, Set
@@
 class Bridge:
     """Represents a theory-practice bridge."""
     paper_id: str
     code_id: str
     confidence: float
     bridge_type: str  # 'direct', 'semantic', 'structural', 'latent'
-    evidence: Dict[str, any]  # Supporting evidence for the bridge
+    evidence: Dict[str, Any]  # Supporting evidence for the bridge
+    conveyance_score: float = 0.0  # Filled by rank_bridges

This avoids dynamic attribute injection later and improves type correctness.

Also applies to: 25-33


85-93: Silence Ruff B007: unused loop variable.

-        for node_id, node_idx in self.graph_store.node_ids.items():
+        for _node_id, node_idx in self.graph_store.node_ids.items():

373-381: Rename ambiguous single-letter variable 'I' (Ruff E741).

-            I = bridge.evidence.get('instruction_fit', 0.5)
+            instr_fit = bridge.evidence.get('instruction_fit', 0.5)
@@
-            Ctx = 0.25 * (L + I + A + G)
+            Ctx = 0.25 * (L + instr_fit + A + G)

93-93: Prefer logging over print for library code.

Switch to the logging module with levels and optional timestamps; keeps pipeline output configurable.

Example:

-import time
+import time
+import logging
+logger = logging.getLogger(__name__)
@@
-        print(f"Categorized {len(self.paper_nodes)} paper nodes, {len(self.code_nodes)} code nodes")
+        logger.info("Categorized %d paper nodes, %d code nodes", len(self.paper_nodes), len(self.code_nodes))
@@
-        print(f"Computed embeddings: shape {self.embeddings.shape}")
+        logger.info("Computed embeddings: shape %s", self.embeddings.shape)
@@
-                print(f"Processed {idx + 1}/{len(paper_indices)} papers")
+                logger.debug("Processed %d/%d papers", idx + 1, len(paper_indices))
@@
-            print("Finding direct bridges...")
+            logger.info("Finding direct bridges...")

Also applies to: 106-106, 190-192, 323-351

tools/graphsage/utils/neighborhood_sampler.py (3)

132-147: Guard against missing precomputed probs.

If adjacency changes after precompute, indexes may drift; recompute on miss to avoid biased fallback.

-        if self.sampling_probs and node in self.sampling_probs:
+        if self.sampling_probs and node in self.sampling_probs:
             probs = self.sampling_probs[node]
@@
-        else:
-            return self._uniform_sample(neighbors, num_samples)
+        else:
+            # Recompute on-demand for robustness
+            self.sampling_probs = self._compute_sampling_probabilities()
+            if self.sampling_probs and node in self.sampling_probs:
+                probs = self.sampling_probs[node]
+                indices = np.random.choice(len(neighbors), size=min(num_samples, len(neighbors)), replace=False, p=probs)
+                return [neighbors[i] for i in indices]
+            return self._uniform_sample(neighbors, num_samples)

239-242: Rename unused loop variable.

-        for node, hop_neighbors in neighborhoods.items():
+        for _, hop_neighbors in neighborhoods.items():

300-304: Remove unused total_edges and guard division.

-        total_edges = sum(degrees)
         sampled_edges_per_node = sum(self.config.num_neighbors)
-        stats['sampling_ratio'] = sampled_edges_per_node / stats['avg_degree'] if stats['avg_degree'] > 0 else 0
+        stats['sampling_ratio'] = (sampled_edges_per_node / stats['avg_degree']) if stats['avg_degree'] else 0.0
tools/graphsage/pipelines/graphsage_pipeline.py (6)

121-133: Use returned load stats or drop the assignment.

Avoid the unused variable and surface helpful metrics.

-            stats = self.graph_store.load_from_arangodb(db_config)
+            load_stats = self.graph_store.load_from_arangodb(db_config)
+            try:
+                print(f"Nodes: {load_stats.num_nodes:,} | Edges: {load_stats.num_edges:,} | "
+                      f"Types: N{load_stats.num_node_types}/E{load_stats.num_edge_types} | "
+                      f"Mem: {load_stats.memory_usage_gb:.2f} GB | Load: {load_stats.load_time_seconds:.2f}s")
+            except AttributeError:
+                pass

156-156: Remove unnecessary f-string.

-        print(f"Model initialized:")
+        print("Model initialized:")

277-282: Catch specific Arango errors instead of blind Exception.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError
-            except Exception as e:
+            except ArangoError as e:
                 print(f"Error saving batch: {e}")
-            except Exception as e:
+            except ArangoError as e:
                 print(f"Error saving {bridge_type} bridges: {e}")

If your driver version doesn’t expose ArangoError in this path, I can adjust to the correct exception types.

Also applies to: 321-325, 36-37


205-236: Avoid recomputing embeddings in TheoryPracticeFinder.

Finder precomputes embeddings again; pass the already generated array or a view to avoid duplicate work.

I can extend TheoryPracticeFinder to accept precomputed embeddings and skip generation. Want a patch?


405-420: CLI UX: allow password via env or config; avoid hard exit.

Consider: read from config if not provided, and prompt once if missing; return non-zero exit code from main instead of sys.exit in library code.


161-167: Silence unused arg warning.

-    def train_model(self, num_epochs: int = 10):
+    def train_model(self, _num_epochs: int = 10):
core/framework/graph_embedders.py (2)

240-240: Ruff: prefer iterable unpacking.

-        dims = [config.input_dim] + config.hidden_dims + [config.output_dim]
+        dims = [config.input_dim, *config.hidden_dims, config.output_dim]

255-284: Unify sampling with NeighborhoodSampler (optional).

Embedder’s sampling ignores SamplingConfig/adaptive strategies. Consider injecting NeighborhoodSampler to centralize policies.

I can refactor sample_neighbors to delegate to NeighborhoodSampler while preserving current defaults.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a313de and d1e67c6.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • CLAUDE.md (2 hunks)
  • core/framework/graph_embedders.py (1 hunks)
  • core/framework/memory_store.py (1 hunks)
  • pyproject.toml (1 hunks)
  • test_memory_rag.py (1 hunks)
  • test_ramfs_arangodb.py (1 hunks)
  • tools/arxiv/configs/acid_pipeline_phased.yaml (1 hunks)
  • tools/arxiv/utils/import_fresh_arxiv_dataset.py (1 hunks)
  • tools/graphsage/__init__.py (1 hunks)
  • tools/graphsage/bridge_discovery/__init__.py (1 hunks)
  • tools/graphsage/bridge_discovery/theory_practice_finder.py (1 hunks)
  • tools/graphsage/configs/graphsage_config.yaml (1 hunks)
  • tools/graphsage/pipelines/graphsage_pipeline.py (1 hunks)
  • tools/graphsage/utils/__init__.py (1 hunks)
  • tools/graphsage/utils/neighborhood_sampler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (12)
{tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Files:

  • tools/graphsage/__init__.py
  • tools/graphsage/utils/__init__.py
  • tools/graphsage/bridge_discovery/__init__.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
  • tools/graphsage/pipelines/graphsage_pipeline.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use relative imports for intra-module references (from .utils import helper_function)

Files:

  • tools/graphsage/__init__.py
  • tools/graphsage/utils/__init__.py
  • test_memory_rag.py
  • test_ramfs_arangodb.py
  • tools/graphsage/bridge_discovery/__init__.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • core/framework/memory_store.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • core/framework/graph_embedders.py
  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
  • tools/graphsage/pipelines/graphsage_pipeline.py
{core,tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All code must include docstrings tying implementation to the theoretical framework (e.g., explain CONVEYANCE, WHERE/WHAT/TIME)

Files:

  • tools/graphsage/__init__.py
  • tools/graphsage/utils/__init__.py
  • tools/graphsage/bridge_discovery/__init__.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • core/framework/memory_store.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • core/framework/graph_embedders.py
  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
  • tools/graphsage/pipelines/graphsage_pipeline.py
{core,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Infrastructure (core/, tools/) must not depend on experiments/

Files:

  • tools/graphsage/__init__.py
  • tools/graphsage/utils/__init__.py
  • tools/graphsage/bridge_discovery/__init__.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • core/framework/memory_store.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • core/framework/graph_embedders.py
  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
  • tools/graphsage/pipelines/graphsage_pipeline.py
tools/arxiv/configs/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Store ArXiv processing configuration as YAML under tools/arxiv/configs/

Files:

  • tools/arxiv/configs/acid_pipeline_phased.yaml
tools/arxiv/configs/acid_pipeline_phased.yaml

📄 CodeRabbit inference engine (tools/arxiv/CLAUDE.md)

tools/arxiv/configs/acid_pipeline_phased.yaml: Provide configurable worker counts under phases.extraction.workers and phases.embedding.workers
Default batch size should be 24 for both extraction and embedding

Files:

  • tools/arxiv/configs/acid_pipeline_phased.yaml
{tools/arxiv,core/framework}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv and core/framework
Run Ruff linting for tools/arxiv and core/framework

Files:

  • core/framework/memory_store.py
  • core/framework/graph_embedders.py
  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
core/framework/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run mypy type checking on core/framework only

Files:

  • core/framework/memory_store.py
  • core/framework/graph_embedders.py
tools/arxiv/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Late chunking: always process full documents before chunking in ArXiv processing

Files:

  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
{core/database,tools/arxiv}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Database interactions must be atomic (all-or-nothing)

Files:

  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
{core/processors,tools/arxiv}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Preserve document structure and context throughout the processing pipeline

Files:

  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
{tools/**/pipelines/**/*.py,tools/github/github_pipeline_manager.py}

📄 CodeRabbit inference engine (CLAUDE.md)

Pipelines must support resumption from checkpoints (error recovery)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
🧠 Learnings (4)
📚 Learning: 2025-09-04T03:49:10.310Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T03:49:10.310Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/lifecycle.py,utils/rebuild_database.py,utils/check_db_status.py,utils/check_papers.py} : Use ArangoDB collections with these names and fields: arxiv_papers(_key,status,...), arxiv_chunks(paper_id,text,chunk_index,context_window_used), arxiv_embeddings(paper_id,chunk_id,vector(2048),model='jina-v4'), arxiv_structures(aggregated structural elements)

Applied to files:

  • tools/arxiv/configs/acid_pipeline_phased.yaml
  • CLAUDE.md
  • tools/arxiv/utils/import_fresh_arxiv_dataset.py
📚 Learning: 2025-09-04T03:49:10.310Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T03:49:10.310Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/lifecycle.py,utils/rebuild_database.py,utils/check_db_status.py,utils/check_papers.py} : Read ArangoDB connection from environment: ARANGO_HOST and ARANGO_PASSWORD

Applied to files:

  • tools/arxiv/configs/acid_pipeline_phased.yaml
  • CLAUDE.md
📚 Learning: 2025-09-04T03:49:10.310Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T03:49:10.310Z
Learning: Applies to tools/arxiv/utils/lifecycle.py : Lifecycle must: check/update PostgreSQL, download missing PDF/LaTeX from ArXiv when needed, process via ACID, synchronize PostgreSQL and ArangoDB, update HiRAG collections

Applied to files:

  • CLAUDE.md
📚 Learning: 2025-09-04T03:49:10.310Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T03:49:10.310Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Process PDFs directly from the filesystem without database queries

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (9)
tools/graphsage/__init__.py (3)
tools/graphsage/pipelines/graphsage_pipeline.py (2)
  • GraphSAGEPipeline (75-376)
  • PipelineConfig (40-72)
tools/graphsage/bridge_discovery/theory_practice_finder.py (2)
  • TheoryPracticeFinder (50-433)
  • BridgeDiscoveryConfig (36-47)
tools/graphsage/utils/neighborhood_sampler.py (2)
  • NeighborhoodSampler (34-313)
  • SamplingConfig (22-31)
tools/graphsage/utils/__init__.py (1)
tools/graphsage/utils/neighborhood_sampler.py (2)
  • NeighborhoodSampler (34-313)
  • SamplingConfig (22-31)
test_memory_rag.py (1)
tools/arxiv/utils/import_fresh_arxiv_dataset.py (1)
  • main (322-346)
tools/graphsage/bridge_discovery/__init__.py (1)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)
  • TheoryPracticeFinder (50-433)
  • BridgeDiscoveryConfig (36-47)
  • Bridge (26-32)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)
core/framework/graph_embedders.py (9)
  • GraphSAGEEmbedder (223-415)
  • GraphSAGEConfig (33-50)
  • generate_embeddings (392-415)
  • forward (62-64)
  • forward (78-97)
  • forward (114-136)
  • forward (152-177)
  • forward (200-220)
  • forward (285-361)
core/framework/memory_store.py (2)
  • GraphMemoryStore (35-335)
  • get_k_hop_neighbors (233-264)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • generate_embeddings (180-203)
tools/graphsage/utils/neighborhood_sampler.py (1)
core/framework/graph_embedders.py (1)
  • sample_neighbors (255-283)
core/framework/graph_embedders.py (3)
core/framework/memory_store.py (2)
  • GraphMemoryStore (35-335)
  • get_neighbors (213-231)
tools/graphsage/utils/neighborhood_sampler.py (1)
  • sample_neighbors (99-126)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • generate_embeddings (180-203)
tools/arxiv/utils/import_fresh_arxiv_dataset.py (1)
test_memory_rag.py (1)
  • main (141-214)
tools/graphsage/pipelines/graphsage_pipeline.py (4)
core/framework/memory_store.py (6)
  • GraphMemoryStore (35-335)
  • load_from_disk (316-335)
  • load_from_arangodb (71-116)
  • save_to_disk (296-314)
  • create_shared_memory (185-211)
  • _calculate_memory_usage (266-281)
core/framework/graph_embedders.py (4)
  • GraphSAGEEmbedder (223-415)
  • GraphSAGEConfig (33-50)
  • AggregatorType (24-29)
  • generate_embeddings (392-415)
tools/graphsage/utils/neighborhood_sampler.py (2)
  • NeighborhoodSampler (34-313)
  • SamplingConfig (22-31)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)
  • TheoryPracticeFinder (50-433)
  • BridgeDiscoveryConfig (36-47)
  • discover_all_bridges (314-353)
🪛 Ruff (0.12.2)
tools/graphsage/__init__.py

12-19: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

test_memory_rag.py

1-1: Shebang is present but file is not executable

(EXE001)


46-46: Local variable results is assigned to but never used

Remove assignment to unused variable results

(F841)


52-52: Do not catch blind exception: Exception

(BLE001)


191-191: f-string without any placeholders

Remove extraneous f prefix

(F541)

test_ramfs_arangodb.py

1-1: Shebang is present but file is not executable

(EXE001)


26-26: Unused function argument: size_gb

(ARG001)


36-36: Probable insecure usage of temporary file or directory: "/tmp/arangodb_ramfs"

(S108)


43-43: Starting a process with a partial executable path

(S607)


51-51: subprocess call: check for execution of untrusted input

(S603)


52-52: Starting a process with a partial executable path

(S607)


57-57: subprocess call: check for execution of untrusted input

(S603)


58-58: Starting a process with a partial executable path

(S607)


88-88: Avoid specifying long messages outside the exception class

(TRY003)


171-171: Local variable results is assigned to but never used

Remove assignment to unused variable results

(F841)


178-178: Do not catch blind exception: Exception

(BLE001)


214-214: f-string without any placeholders

Remove extraneous f prefix

(F541)


268-268: f-string without any placeholders

Remove extraneous f prefix

(F541)

tools/graphsage/bridge_discovery/__init__.py

5-5: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tools/graphsage/bridge_discovery/theory_practice_finder.py

1-1: Shebang is present but file is not executable

(EXE001)


85-85: Loop control variable node_id not used within loop body

Rename unused node_id to _node_id

(B007)


375-375: Ambiguous variable name: I

(E741)

core/framework/memory_store.py

1-1: Shebang is present but file is not executable

(EXE001)


130-130: Local variable collection is assigned to but never used

Remove assignment to unused variable collection

(F841)


145-145: Do not catch blind exception: Exception

(BLE001)


182-182: Do not catch blind exception: Exception

(BLE001)


289-289: Do not use bare except

(E722)


289-290: try-except-pass detected, consider logging the exception

(S110)

tools/graphsage/utils/neighborhood_sampler.py

1-1: Shebang is present but file is not executable

(EXE001)


126-126: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Loop control variable node not used within loop body

Rename unused node to _node

(B007)


301-301: Local variable total_edges is assigned to but never used

Remove assignment to unused variable total_edges

(F841)

core/framework/graph_embedders.py

1-1: Shebang is present but file is not executable

(EXE001)


198-198: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

tools/arxiv/utils/import_fresh_arxiv_dataset.py

1-1: Shebang is present but file is not executable

(EXE001)


53-53: Avoid specifying long messages outside the exception class

(TRY003)


91-91: Do not use bare except

(E722)


91-92: try-except-pass detected, consider logging the exception

(S110)


127-127: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


151-151: Prefer next(iter(dataset_path.glob("*.json"))) over single element slice

Replace with next(iter(dataset_path.glob("*.json")))

(RUF015)


199-199: Do not catch blind exception: Exception

(BLE001)


206-206: Do not use bare except

(E722)


210-210: Do not catch blind exception: Exception

(BLE001)


220-220: Do not use bare except

(E722)


225-225: Do not use bare except

(E722)


235-235: f-string without any placeholders

Remove extraneous f prefix

(F541)

tools/graphsage/pipelines/graphsage_pipeline.py

1-1: Shebang is present but file is not executable

(EXE001)


128-128: Local variable stats is assigned to but never used

Remove assignment to unused variable stats

(F841)


156-156: f-string without any placeholders

Remove extraneous f prefix

(F541)


161-161: Unused method argument: num_epochs

(ARG002)


280-280: Do not catch blind exception: Exception

(BLE001)


324-324: Do not catch blind exception: Exception

(BLE001)

🪛 markdownlint-cli2 (0.17.2)
CLAUDE.md

321-321: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


322-322: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


323-323: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


324-324: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


325-325: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


326-326: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

🪛 YAMLlint (1.37.1)
tools/graphsage/configs/graphsage_config.yaml

[error] 57-57: trailing spaces

(trailing-spaces)


[error] 63-63: trailing spaces

(trailing-spaces)


[error] 69-69: trailing spaces

(trailing-spaces)


[error] 77-77: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (7)
pyproject.toml (1)

44-44: Confirm Python 3.11 compatibility and wheel availability: Verified kagglehub-0.3.13 requires Python ≥3.9 (covering 3.11) and publishes a universal wheel (py3-none-any.whl), so CI builds are safe.

tools/arxiv/configs/acid_pipeline_phased.yaml (2)

29-29: LGTM: extraction batch size defaults to 24.


46-46: LGTM: embedding batch size defaults to 24.

CLAUDE.md (1)

193-198: LGTM: Clear localhost ArangoDB access steps.

tools/graphsage/utils/__init__.py (1)

3-5: Re-exports look good and follow relative-import guidance.

tools/graphsage/utils/neighborhood_sampler.py (1)

223-253: Confirm minibatch layout matches downstream consumers.

create_minibatch returns per-layer sets merged across seeds; GraphSAGEEmbedder.sample_neighbors expects per-seed neighbor lists. Verify usage paths or adapt the shape to consumer expectations.

Would you like me to generate a drop-in adapter to align this minibatch format with GraphSAGEEmbedder?

core/framework/graph_embedders.py (1)

301-315: Initial feature source is all-zeros by default; embeddings become near-constant.

Without real input features (e.g., text/code embeddings), outputs carry little signal. Load initial features into graph_store.node_embeddings or add a feature provider hook.

I can wire in your existing Jina/other embeddings as inputs; want a patch?

Comment on lines +56 to +65
# Pipeline options
load_from_cache: bool = False
cache_path: str = "./cache/graph_cache.json"
save_embeddings: bool = True
save_bridges: bool = True

# Memory settings
max_memory_gb: float = 100.0
use_shared_memory: bool = True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add checkpoint/resume to satisfy pipeline resilience requirement.

Minimal step-level checkpoints to resume after failures.

 class PipelineConfig:
@@
-    save_bridges: bool = True
+    save_bridges: bool = True
+    # Checkpointing
+    checkpoint_path: str = "./cache/graphsage_checkpoint.json"
+    resume: bool = True
@@
     def run(self):
         """Run the complete pipeline."""
@@
-        # Step 1: Load graph
-        self.load_graph()
+        # Load checkpoint
+        ckpt = {}
+        if self.config.resume and Path(self.config.checkpoint_path).exists():
+            try:
+                ckpt = json.load(open(self.config.checkpoint_path))
+            except Exception:
+                ckpt = {}
+        def mark(step):
+            ckpt[step] = True
+            os.makedirs(os.path.dirname(self.config.checkpoint_path), exist_ok=True)
+            json.dump(ckpt, open(self.config.checkpoint_path, "w"))
+
+        # Step 1: Load graph
+        if not ckpt.get("loaded"):
+            self.load_graph()
+            mark("loaded")
@@
-        # Step 2: Initialize model
-        self.initialize_model()
+        # Step 2: Initialize model
+        if not ckpt.get("initialized"):
+            self.initialize_model()
+            mark("initialized")
@@
-        # Step 4: Generate embeddings
-        self.generate_embeddings()
+        # Step 4: Generate embeddings
+        if not ckpt.get("embeddings"):
+            self.generate_embeddings()
+            mark("embeddings")
@@
-        # Step 5: Discover bridges
-        bridges = self.discover_bridges()
+        # Step 5: Discover bridges
+        if not ckpt.get("bridges"):
+            bridges = self.discover_bridges()
+            mark("bridges")
+        else:
+            bridges = {}

Also applies to: 335-357

Comment on lines 192 to 195
# Store in graph store
if self.graph_store.node_embeddings is not None:
self.graph_store.node_embeddings[:] = embeddings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Do not overwrite input feature buffer with output embeddings (shape mismatch).

node_embeddings is sized to input_dim (e.g., 2048). You copy output_dim embeddings (e.g., 256) into it, causing a runtime error and corrupting inputs for later forward passes.

-        # Store in graph store
-        if self.graph_store.node_embeddings is not None:
-            self.graph_store.node_embeddings[:] = embeddings
+        # Do not overwrite input feature buffer; keep outputs separate (persist via DB/cache).

Follow-up: if you want zero-copy for outputs, add a dedicated shared-memory buffer (e.g., graph_store.output_embeddings) sized to output_dim.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tools/graphsage/pipelines/graphsage_pipeline.py around lines 192–195, the
code is copying output embeddings into graph_store.node_embeddings (sized to
input_dim) which causes shape mismatch and corrupts input features; instead,
allocate or use a dedicated buffer sized to embeddings.shape (e.g.,
graph_store.output_embeddings) and write the embeddings there (or resize/replace
the buffer atomically if the store provides an API), and update graph_store
initialization/schema to expose output_embeddings for zero-copy shared-memory
usage so input node_embeddings are never overwritten.

Comment on lines +256 to +274
batch_size = 1000
num_saved = 0

for start_idx in range(0, len(embeddings), batch_size):
end_idx = min(start_idx + batch_size, len(embeddings))
batch = []

for idx in range(start_idx, end_idx):
node_id = self.graph_store.index_to_node.get(idx, f"node_{idx}")

doc = {
'_key': str(node_id).replace('/', '_'),
'node_id': node_id,
'node_index': idx,
'embedding': embeddings[idx].tolist(),
'embedding_dim': len(embeddings[idx]),
'model': 'graphsage',
'timestamp': time.time()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Arango insert_many overwrite semantics may differ by version.


🌐 Web query:

Does python-arango's `Collection.insert_many` support an `overwrite` keyword? If not, what is the recommended way to upsert/replace documents in bulk?

💡 Result:

Short answer: Yes — python-arango's Collection.insert_many accepts an overwrite parameter (and related overwrite_mode/return_old options). For bulk replace/upsert you can:

  • Use Collection.insert_many(documents, overwrite=True) to replace existing documents by key. [1]
  • For faster imports with more bulk options (replace/update/ignore) use Collection.import_bulk(..., on_duplicate='replace' or overwrite=True). [1]
  • If you need conditional logic or merges, use an AQL UPSERT (execute a FOR ... UPSERT query) — it's the most flexible option. [2][3]

Sources:
[1] python-arango API docs (insert_many / import_bulk).
[2] ArangoDB docs: INSERT/overwriteMode vs UPSERT.
[3] Discussion / examples recommending AQL UPSERT for bulk upsert.


Pass overwrite=True to Collection.insert_many to replace existing documents tools/graphsage/pipelines/graphsage_pipeline.py:256–274

🤖 Prompt for AI Agents
In tools/graphsage/pipelines/graphsage_pipeline.py around lines 256 to 274, the
batching code builds document dicts but when persisting batches you must pass
overwrite=True to Collection.insert_many so existing documents are replaced;
update the code path that calls insert_many for each batch to include
overwrite=True (and ensure batch is non-empty before calling) so upserts
overwrite prior documents rather than erroring or duplicating.

Comment on lines 170 to 184
# Sample proportionally (prefer cross-type connections)
if diff_type:
num_diff = min(len(diff_type), num_samples * 2 // 3)
num_same = min(len(same_type), num_samples - num_diff)

sampled = []
if num_diff > 0:
sampled.extend(random.sample(diff_type, num_diff))
if num_same > 0:
sampled.extend(random.sample(same_type, num_same))

return sampled[:num_samples]
else:
return self._uniform_sample(neighbors, num_samples)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Ensure adaptive sampling fills to num_samples.

Current split can underfill when one bucket is small. Top up from the other bucket to reach num_samples.

-        if diff_type:
-            num_diff = min(len(diff_type), num_samples * 2 // 3)
-            num_same = min(len(same_type), num_samples - num_diff)
+        if diff_type:
+            target_diff = max(1, int(round(num_samples * 2 / 3)))
+            num_diff = min(len(diff_type), target_diff)
+            num_same = min(len(same_type), num_samples - num_diff)
@@
-            return sampled[:num_samples]
+            # Top up if underfilled
+            remaining = num_samples - len(sampled)
+            if remaining > 0:
+                pool = [n for n in (same_type + diff_type) if n not in sampled]
+                if pool:
+                    sampled.extend(random.sample(pool, min(remaining, len(pool))))
+            return sampled[:num_samples]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Sample proportionally (prefer cross-type connections)
if diff_type:
num_diff = min(len(diff_type), num_samples * 2 // 3)
num_same = min(len(same_type), num_samples - num_diff)
sampled = []
if num_diff > 0:
sampled.extend(random.sample(diff_type, num_diff))
if num_same > 0:
sampled.extend(random.sample(same_type, num_same))
return sampled[:num_samples]
else:
return self._uniform_sample(neighbors, num_samples)
# Sample proportionally (prefer cross-type connections)
if diff_type:
# Determine target number of cross-type samples (at least 1)
target_diff = max(1, int(round(num_samples * 2 / 3)))
num_diff = min(len(diff_type), target_diff)
num_same = min(len(same_type), num_samples - num_diff)
sampled = []
if num_diff > 0:
sampled.extend(random.sample(diff_type, num_diff))
if num_same > 0:
sampled.extend(random.sample(same_type, num_same))
# Top up if underfilled
remaining = num_samples - len(sampled)
if remaining > 0:
pool = [n for n in (same_type + diff_type) if n not in sampled]
if pool:
sampled.extend(random.sample(pool, min(remaining, len(pool))))
return sampled[:num_samples]
else:
return self._uniform_sample(neighbors, num_samples)
🤖 Prompt for AI Agents
In tools/graphsage/utils/neighborhood_sampler.py around lines 170 to 184, the
current proportional split can underfill when one bucket (diff_type or
same_type) is smaller than its target; adjust the logic to first compute target
counts (e.g., num_diff = min(len(diff_type), num_samples * 2 // 3), num_same =
min(len(same_type), num_samples - num_diff)), then compute any shortfall
(short_diff = num_samples - (num_diff + num_same)) and top up from the other
bucket by increasing num_diff or num_same by min(short_diff, len(other_bucket) -
other_count); ensure you never sample more than available by using min(...) and
fall back to taking all items from a bucket when its target equals its length;
finally sample from each bucket accordingly, combine and return exactly
num_samples (or fewer if total neighbors < num_samples) by slicing to
num_samples.

- Creates edges from existing node data
- Builds paper-chunk relationships (1,546 edges)
- Creates similarity edges between embeddings (14,544 edges)
- Prepares graph for GraphSAGE processing
…tation

- Add input validation for config files, node indices, and parameters
- Fix memory management with proper shared memory cleanup
- Replace bare except clauses with specific exception handling
- Fix division by zero in MeanAggregator when no neighbors exist
- Add database connection validation (username, database, password)
- Fix sampling variance calculation with proper validation
- Add embeddings existence check in semantic bridge discovery

These fixes address all critical issues identified by CodeRabbit in PR #33
ensuring code security, stability, and proper error handling.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (5)
tools/graphsage/utils/build_graph.py (5)

177-218: Directionality and matching robustness in code↔paper edges

  • The collection is named code_implements_paper but edges are from paper→code. Prefer code→paper for clarity, or rename.
  • Ensure case-insensitive matching by lowercasing the bind var.
-            cursor = self.db.aql.execute("""
+            cursor = self.db.aql.execute("""
                 FOR code IN github_papers
                 FILTER CONTAINS(LOWER(code.path), @pattern)
                 RETURN code._id
-            """, bind_vars={'pattern': code_pattern})
+            """, bind_vars={'pattern': code_pattern.lower()})
@@
-            for code_id in code_files:
-                edge = {
-                    '_from': f'arxiv_papers/{paper_id}',
-                    '_to': code_id,
-                    'type': 'implements',
-                    'confidence': 0.8
-                }
+            for code_id in code_files:
+                # Direction: code -> paper (implements)
+                edge = {
+                    "_from": code_id,
+                    "_to": f"arxiv_papers/{paper_id}",
+                    "type": "implements",
+                    "confidence": 0.8,
+                }
                 edges.append(edge)

269-294: Replace bare excepts; consider cheaper counts for very large collections

Catching everything hides issues, and count() can be heavy. Catch specific exceptions and log; optionally switch to AQL RETURN COUNT() if counts are slow.

-        for coll_name in node_collections:
-            try:
-                count = self.db.collection(coll_name).count()
-                print(f"  {coll_name}: {count:,}")
-                total_nodes += count
-            except:
-                pass
+        for coll_name in node_collections:
+            try:
+                count = self.db.collection(coll_name).count()
+                print(f"  {coll_name}: {count:,}")
+                total_nodes += count
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"  {coll_name}: error fetching count: {e}")
@@
-        for coll_name, collection in self.edge_collections.items():
-            if collection:
-                try:
-                    count = collection.count()
-                    print(f"  {coll_name}: {count:,}")
-                    total_edges += count
-                except:
-                    pass
+        for coll_name, collection in self.edge_collections.items():
+            if collection:
+                try:
+                    count = collection.count()
+                    print(f"  {coll_name}: {count:,}")
+                    total_edges += count
+                except (ArangoError, ServerConnectionError) as e:
+                    print(f"  {coll_name}: error fetching count: {e}")

314-326: Verify collection naming for similarity samples

You sample from chunk_similarity but (pre-patch) edges were between embedding docs. Ensure endpoints match the collection semantics (chunk↔chunk or embedding↔embedding) consistently.


338-341: Parameterize thresholds/top-k from config

Avoid hard-coded 0.75/10; read from tools/graphsage/configs/graphsage_config.yaml to keep pipeline-tunable.

If you share the config schema, I can wire these parameters through build_all() and CLI flags.


1-1: Shebang present but file isn’t marked executable

Either remove the shebang or make the file executable and/or expose a console_script entry point. Lint rule EXE001.

-#!/usr/bin/env python3
+#!/usr/bin/env python3  # Remove this line if not invoking as a script
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1e67c6 and b51e00b.

📒 Files selected for processing (1)
  • tools/graphsage/utils/build_graph.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
{tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Files:

  • tools/graphsage/utils/build_graph.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use relative imports for intra-module references (from .utils import helper_function)

Files:

  • tools/graphsage/utils/build_graph.py
{core,tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All code must include docstrings tying implementation to the theoretical framework (e.g., explain CONVEYANCE, WHERE/WHAT/TIME)

Files:

  • tools/graphsage/utils/build_graph.py
{core,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Infrastructure (core/, tools/) must not depend on experiments/

Files:

  • tools/graphsage/utils/build_graph.py
🪛 Ruff (0.12.2)
tools/graphsage/utils/build_graph.py

1-1: Shebang is present but file is not executable

(EXE001)


92-92: Do not catch blind exception: Exception

(BLE001)


170-170: Do not catch blind exception: Exception

(BLE001)


214-214: Do not catch blind exception: Exception

(BLE001)


257-257: Do not catch blind exception: Exception

(BLE001)


279-279: Do not use bare except

(E722)


279-280: try-except-pass detected, consider logging the exception

(S110)


292-292: Do not use bare except

(E722)


292-293: try-except-pass detected, consider logging the exception

(S110)

🔇 Additional comments (1)
tools/graphsage/utils/build_graph.py (1)

62-81: Guard against double-prefixing collection names
Use a conditional check on doc["paper_id"] to avoid prepending arxiv_papers/ when it already contains a slash, for example:

paper_ref = doc["paper_id"]
_from    = paper_ref if "/" in paper_ref else f"arxiv_papers/{paper_ref}"
edge     = {"_from": _from, "_to": doc["chunk_id"], "type": "has_chunk"}

Comment on lines 2 to 12
"""
Build Graph from Existing Data.
Creates edges between:
1. Papers and their chunks
2. Similar papers (based on embeddings)
3. Papers and code (where we can infer connections)
4. Citation networks (if available)
This prepares the graph for GraphSAGE processing.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add theory-grounded docstrings per repo guidelines

Per coding_guidelines, include docstrings tying implementation to the theoretical framework (CONVEYANCE, WHERE/WHAT/TIME). Expand the module docstring accordingly.

-"""
-Build Graph from Existing Data.
-
-Creates edges between:
-1. Papers and their chunks
-2. Similar papers (based on embeddings)
-3. Papers and code (where we can infer connections)
-4. Citation networks (if available)
-
-This prepares the graph for GraphSAGE processing.
-"""
+"""
+Graph construction utilities to prepare a heterogeneous graph for GraphSAGE.
+
+CONVEYANCE: materialize structural relations (paper↔chunk, semantic similarity, code↔paper, citations)
+that GraphSAGE can aggregate to discover theory-practice bridges.
+
+WHERE: ArangoDB collections (arxiv_papers, arxiv_chunks, arxiv_embeddings, github_papers) → edge
+collections (paper_has_chunk, chunk_similarity, code_implements_paper, paper_cites_paper).
+
+WHAT: deterministic edge builders with batching, optional ANN top‑k similarity, and idempotent inserts.
+
+TIME: designed for incremental runs—edges can be rebuilt per segment without full retraining.
+"""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""
Build Graph from Existing Data.
Creates edges between:
1. Papers and their chunks
2. Similar papers (based on embeddings)
3. Papers and code (where we can infer connections)
4. Citation networks (if available)
This prepares the graph for GraphSAGE processing.
"""
"""
Graph construction utilities to prepare a heterogeneous graph for GraphSAGE.
CONVEYANCE: materialize structural relations (paperchunk, semantic similarity, codepaper, citations)
that GraphSAGE can aggregate to discover theory-practice bridges.
WHERE: ArangoDB collections (arxiv_papers, arxiv_chunks, arxiv_embeddings, github_papers) → edge
collections (paper_has_chunk, chunk_similarity, code_implements_paper, paper_cites_paper).
WHAT: deterministic edge builders with batching, optional ANN top-k similarity, and idempotent inserts.
TIME: designed for incremental runsedges can be rebuilt per segment without full retraining.
"""
🤖 Prompt for AI Agents
In tools/graphsage/utils/build_graph.py around lines 2 to 12, the module
docstring is too brief and must be expanded to follow repo coding_guidelines by
explicitly connecting the implementation to the theoretical framework: update
the module-level docstring to describe how this builder maps to CONVEYANCE and
the WHERE/WHAT/TIME dimensions, state how nodes/edges correspond to those
concepts (papers, chunks, code, citations), and note assumptions and expected
inputs/outputs for GraphSAGE preprocessing; keep it concise, use the repo's
docstring style and include a short usage note indicating what downstream
GraphSAGE component expects.

Comment on lines 22 to 23
from arango import ArangoClient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Import and use specific Arango exceptions (avoid blind/bare excepts)

Static analysis flagged blind/bare excepts. Import specific exceptions and use them in the catch sites below.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from arango import ArangoClient
from arango import ArangoClient
from arango.exceptions import ArangoError, ServerConnectionError
🤖 Prompt for AI Agents
In tools/graphsage/utils/build_graph.py around lines 22-23, replace the current
bare import and later bare except handlers by importing the Arango exception
types and catching them explicitly: add "from arango.exceptions import
ArangoError" (and other specific exceptions you hit, e.g., DocumentInsertError,
DatabaseCreateError) near the top, then change any "except:" to "except
ArangoError:" (or the more specific exception type where applicable), and ensure
you log or re-raise the caught exception instance instead of silently swallowing
it. This makes error handling explicit and preserves the original exception
details for debugging.

Comment on lines 24 to 26
sys.path.append(str(Path(__file__).parent.parent.parent.parent))


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Remove sys.path hacks; rely on package/relative imports

Modifying sys.path breaks packaging and violates the guidelines. This file doesn’t import intra-repo modules; drop the path mutation.

-sys.path.append(str(Path(__file__).parent.parent.parent.parent))
-
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sys.path.append(str(Path(__file__).parent.parent.parent.parent))
# (Removed sys.path.append hack; relying on proper package-relative imports instead)
🤖 Prompt for AI Agents
In tools/graphsage/utils/build_graph.py around lines 24 to 26, remove the
sys.path.append hack that mutates sys.path; this file does not import intra-repo
modules so simply delete the sys.path modification and rely on package-relative
imports or normal Python package resolution, ensuring the project runs under
proper packaging (adjust imports to be absolute/relative if needed and update
setup/pytest config rather than mutating sys.path).

Comment on lines 39 to 57
# Edge collections
self.edge_collections = {
'paper_has_chunk': None,
'chunk_similarity': None,
'paper_similarity': None,
'paper_cites_paper': None,
'code_implements_paper': None
}

self._ensure_collections()

def _ensure_collections(self):
"""Ensure all edge collections exist."""
for collection_name in self.edge_collections.keys():
if not self.db.has_collection(collection_name):
self.db.create_collection(collection_name, edge=True)
print(f"Created edge collection: {collection_name}")
self.edge_collections[collection_name] = self.db.collection(collection_name)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Ensure uniqueness of edges; create a compound index on (_from, _to, type)

Without a uniqueness constraint you can double-insert the same edge (especially in similarity). Add a persistent unique index during collection bootstrap.

         for collection_name in self.edge_collections.keys():
             if not self.db.has_collection(collection_name):
                 self.db.create_collection(collection_name, edge=True)
                 print(f"Created edge collection: {collection_name}")
             self.edge_collections[collection_name] = self.db.collection(collection_name)
+            # Ensure unique compound index to make inserts idempotent
+            try:
+                idx_fields = ["_from", "_to", "type"]
+                existing = self.edge_collections[collection_name].indexes()
+                has_unique = any(
+                    i.get("type") == "persistent"
+                    and i.get("unique")
+                    and i.get("fields") == idx_fields
+                for i in existing
+                )
+                if not has_unique:
+                    self.edge_collections[collection_name].add_persistent_index(
+                        idx_fields, unique=True, sparse=False
+                    )
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"Warning: failed to ensure index on {collection_name}: {e}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Edge collections
self.edge_collections = {
'paper_has_chunk': None,
'chunk_similarity': None,
'paper_similarity': None,
'paper_cites_paper': None,
'code_implements_paper': None
}
self._ensure_collections()
def _ensure_collections(self):
"""Ensure all edge collections exist."""
for collection_name in self.edge_collections.keys():
if not self.db.has_collection(collection_name):
self.db.create_collection(collection_name, edge=True)
print(f"Created edge collection: {collection_name}")
self.edge_collections[collection_name] = self.db.collection(collection_name)
def _ensure_collections(self):
"""Ensure all edge collections exist."""
for collection_name in self.edge_collections.keys():
if not self.db.has_collection(collection_name):
self.db.create_collection(collection_name, edge=True)
print(f"Created edge collection: {collection_name}")
self.edge_collections[collection_name] = self.db.collection(collection_name)
# Ensure unique compound index to make inserts idempotent
try:
idx_fields = ["_from", "_to", "type"]
existing = self.edge_collections[collection_name].indexes()
has_unique = any(
i.get("type") == "persistent"
and i.get("unique")
and i.get("fields") == idx_fields
for i in existing
)
if not has_unique:
self.edge_collections[collection_name].add_persistent_index(
idx_fields, unique=True, sparse=False
)
except (ArangoError, ServerConnectionError) as e:
print(f"Warning: failed to ensure index on {collection_name}: {e}")
🤖 Prompt for AI Agents
In tools/graphsage/utils/build_graph.py around lines 39 to 57, the edge
collections are created without any uniqueness constraint which allows duplicate
edges; after creating or retrieving each edge collection call its index-creation
API to add a compound persistent unique index on ['_from','_to','type'] (e.g.
collection.add_persistent_index(['_from','_to','type'], unique=True)) and
catch/ignore index-already-exists errors so repeated boots are idempotent; do
this for every collection in self.edge_collections immediately after
self.db.collection(collection_name) is obtained.

Comment on lines 90 to 93
self.edge_collections['paper_has_chunk'].insert_many(batch, overwrite=True)
total_inserted += len(batch)
except Exception as e:
print(f"Error inserting batch: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Catch specific Arango exceptions instead of broad Exception

Replace the blanket catch with Arango-specific exceptions to satisfy linting and avoid masking bugs.

-                except Exception as e:
+                except (ArangoError, ServerConnectionError) as e:
                     print(f"Error inserting batch: {e}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self.edge_collections['paper_has_chunk'].insert_many(batch, overwrite=True)
total_inserted += len(batch)
except Exception as e:
print(f"Error inserting batch: {e}")
self.edge_collections['paper_has_chunk'].insert_many(batch, overwrite=True)
total_inserted += len(batch)
except (ArangoError, ServerConnectionError) as e:
print(f"Error inserting batch: {e}")
🧰 Tools
🪛 Ruff (0.12.2)

92-92: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In tools/graphsage/utils/build_graph.py around lines 90 to 93, the try/except is
catching a broad Exception; change it to catch Arango-specific exceptions (e.g.,
arango.exceptions.ArangoError and, if appropriate,
arango.exceptions.DocumentInsertError) so we don't mask unrelated errors. Import
the needed exception classes from arango.exceptions at the top of the file,
replace "except Exception as e" with specific except blocks for the Arango
exceptions (logging or printing the error), and re-raise or let other unexpected
exceptions propagate so they aren't silently swallowed.

Comment on lines 99 to 176
def build_embedding_similarity_edges(self, similarity_threshold: float = 0.8):
"""Create edges between similar embeddings."""
print(f"\nBuilding similarity edges (threshold={similarity_threshold})...")

# Get all embeddings
cursor = self.db.aql.execute("""
FOR embed IN arxiv_embeddings
RETURN {
id: embed._id,
key: embed._key,
paper_id: embed.paper_id,
chunk_id: embed.chunk_id,
vector: embed.vector
}
""")

embeddings = list(cursor)

if not embeddings:
print("No embeddings found")
return

print(f"Processing {len(embeddings)} embeddings...")

# Convert to numpy array
vectors = np.array([e['vector'] for e in embeddings])

# Compute pairwise similarities (in batches for memory efficiency)
batch_size = 100
edges = []

for i in tqdm(range(0, len(embeddings), batch_size)):
batch_end = min(i + batch_size, len(embeddings))
batch_vectors = vectors[i:batch_end]

# Compute similarities with all other vectors
similarities = cosine_similarity(batch_vectors, vectors)

# Find high similarity pairs
for local_idx in range(batch_end - i):
global_idx = i + local_idx

# Get top similar items (excluding self)
sim_scores = similarities[local_idx]
sim_scores[global_idx] = -1 # Exclude self

# Find indices above threshold
similar_indices = np.where(sim_scores >= similarity_threshold)[0]

for sim_idx in similar_indices[:10]: # Limit to top 10
if sim_idx != global_idx:
edge = {
'_from': embeddings[global_idx]['id'],
'_to': embeddings[sim_idx]['id'],
'similarity': float(sim_scores[sim_idx]),
'type': 'semantic_similarity'
}
edges.append(edge)

if edges:
# Insert edges
print(f"\nInserting {len(edges)} similarity edges...")

batch_size = 1000
total_inserted = 0

for i in range(0, len(edges), batch_size):
batch = edges[i:i + batch_size]
try:
self.edge_collections['chunk_similarity'].insert_many(batch, overwrite=True)
total_inserted += len(batch)
except Exception as e:
print(f"Error inserting batch: {e}")

print(f"Created {total_inserted} similarity edges")
else:
print("No similar embeddings found above threshold")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

❓ Verification inconclusive

Fix O(N^2) similarity computation; switch to ANN top-k and de-duplicate

Computing pairwise cosine with sklearn over all vectors is O(N^2) and will not scale to 2.8M nodes. Use FAISS (normalized vectors + inner product), cap to top_k, avoid self-edges, and enforce a canonical direction to prevent duplicates. Also write edges between chunk nodes when chunk_id exists; otherwise fall back to embedding nodes. Finally, name alignment: you insert into chunk_similarity while endpoints are arxiv_embeddings; either switch to chunk endpoints or rename the collection. This patch does both (uses chunk endpoints when available).

-    def build_embedding_similarity_edges(self, similarity_threshold: float = 0.8):
+    def build_embedding_similarity_edges(self, similarity_threshold: float = 0.8, top_k: int = 10):
         """Create edges between similar embeddings."""
         print(f"\nBuilding similarity edges (threshold={similarity_threshold})...")
@@
-        # Convert to numpy array
-        vectors = np.array([e['vector'] for e in embeddings])
+        # Prepare vectors (float32 + L2 normalize for cosine via inner product)
+        vectors = np.array([e["vector"] for e in embeddings], dtype=np.float32)
+        norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
+        vectors = vectors / norms
+
+        # Prefer chunk endpoints when available, else use embedding docs
+        ids_for_edges = [
+            (e["chunk_id"] if e.get("chunk_id") else e["id"])
+            for e in embeddings
+        ]
@@
-        # Compute pairwise similarities (in batches for memory efficiency)
-        batch_size = 100
-        edges = []
-
-        for i in tqdm(range(0, len(embeddings), batch_size)):
-            batch_end = min(i + batch_size, len(embeddings))
-            batch_vectors = vectors[i:batch_end]
-            
-            # Compute similarities with all other vectors
-            similarities = cosine_similarity(batch_vectors, vectors)
-            
-            # Find high similarity pairs
-            for local_idx in range(batch_end - i):
-                global_idx = i + local_idx
-                
-                # Get top similar items (excluding self)
-                sim_scores = similarities[local_idx]
-                sim_scores[global_idx] = -1  # Exclude self
-                
-                # Find indices above threshold
-                similar_indices = np.where(sim_scores >= similarity_threshold)[0]
-                
-                for sim_idx in similar_indices[:10]:  # Limit to top 10
-                    if sim_idx != global_idx:
-                        edge = {
-                            '_from': embeddings[global_idx]['id'],
-                            '_to': embeddings[sim_idx]['id'],
-                            'similarity': float(sim_scores[sim_idx]),
-                            'type': 'semantic_similarity'
-                        }
-                        edges.append(edge)
+        # ANN search with FAISS (IndexFlatIP). If FAISS is unavailable, fail fast with guidance.
+        try:
+            import faiss  # type: ignore
+        except ImportError:
+            raise RuntimeError(
+                "FAISS is required for scalable similarity edge construction. "
+                "Install faiss-cpu or faiss-gpu and re-run."
+            )
+
+        dim = vectors.shape[1]
+        index = faiss.IndexFlatIP(dim)
+        index.add(vectors)  # zero-copy in RAM
+
+        edges = []
+        batch_size = 2048
+        for i in tqdm(range(0, len(vectors), batch_size)):
+            batch = vectors[i : i + batch_size]
+            D, I = index.search(batch, top_k + 1)  # includes self at rank 0
+            for row, (scores, idxs) in enumerate(zip(D, I)):
+                src = ids_for_edges[i + row]
+                for score, j in zip(scores, idxs):
+                    if j < 0:
+                        continue
+                    if j == (i + row):
+                        continue  # skip self
+                    if score < similarity_threshold:
+                        continue
+                    dst = ids_for_edges[j]
+                    # canonicalize direction to avoid duplicates (use lexicographic order)
+                    _from, _to = (src, dst) if src <= dst else (dst, src)
+                    edges.append(
+                        {
+                            "_from": _from,
+                            "_to": _to,
+                            "similarity": float(score),
+                            "type": "semantic_similarity",
+                        }
+                    )
@@
-            print(f"\nInserting {len(edges)} similarity edges...")
+            print(f"\nInserting {len(edges)} similarity edges...")
@@
-                try:
-                    self.edge_collections['chunk_similarity'].insert_many(batch, overwrite=True)
-                    total_inserted += len(batch)
-                except Exception as e:
-                    print(f"Error inserting batch: {e}")
+                try:
+                    # With the unique index, duplicates are ignored when overwrite=True
+                    self.edge_collections["chunk_similarity"].insert_many(
+                        batch, overwrite=True
+                    )
+                    total_inserted += len(batch)
+                except (ArangoError, ServerConnectionError) as e:
+                    print(f"Error inserting batch: {e}")

Follow-ups:

  • If embeddings represent papers (not chunks), either rename the collection to embedding_similarity or keep chunk endpoints consistently. Please confirm which is intended.
  • Add faiss to requirements if adopting this change.

Refactor similarity edge construction for scalability and correctness

  • Replace the current O(N²) cosine_similarity loop with an ANN index (e.g., FAISS IndexFlatIP) over L2-normalized float32 vectors, using a configurable top_k and applying the similarity threshold during search.
  • Canonicalize _from/_to ordering and skip self-matches to eliminate duplicate and reflexive edges.
  • Use chunk_id endpoints when present (fallback to embedding IDs), or else rename the chunk_similarity collection to embedding_similarity for consistency.
  • Add faiss-cpu (or faiss-gpu) to the project’s requirements.
🧰 Tools
🪛 Ruff (0.12.2)

170-170: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In tools/graphsage/utils/build_graph.py around lines 99-176, replace the O(N^2)
pairwise cosine_similarity loop with a FAISS ANN search: L2-normalize all
vectors and build a faiss.IndexFlatIP over float32 vectors, expose configurable
top_k and similarity_threshold, then for each embedding query the index for
top_k neighbors and filter by threshold; ensure you skip self-matches and
canonicalize edge direction (e.g., always store smaller_id/_from ->
larger_id/_to) to avoid reflexive and duplicate edges; when creating edge
documents use chunk_id endpoints when present (fallback to embedding _id) and
write to a consistently named collection (choose embedding_similarity or keep
chunk_similarity but update naming everywhere), insert edges in batches with
overwrite/Upsert semantics and add faiss-cpu (or faiss-gpu) to the project
requirements.

Comment on lines 219 to 261
def build_citation_edges(self):
"""Build citation edges between papers."""
print("\nBuilding citation edges...")

# Check if we have citation data
cursor = self.db.aql.execute("""
FOR paper IN arxiv_papers
FILTER paper.references != null OR paper.citations != null
LIMIT 100
RETURN {
id: paper.arxiv_id,
refs: paper.references,
cites: paper.citations
}
""")

papers_with_citations = list(cursor)

if not papers_with_citations:
print("No citation data found in papers")
return

edges = []
for paper in papers_with_citations:
# Process references
if paper.get('refs'):
for ref_id in paper['refs']:
edge = {
'_from': f"arxiv_papers/{paper['id']}",
'_to': f"arxiv_papers/{ref_id}",
'type': 'cites'
}
edges.append(edge)

if edges:
try:
self.edge_collections['paper_cites_paper'].insert_many(edges, overwrite=True)
print(f"Created {len(edges)} citation edges")
except Exception as e:
print(f"Error creating citation edges: {e}")
else:
print("No citations to process")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Process all citations, not just a sample; also support inbound ‘citations’ field

LIMIT 100 drops most edges; and the code ignores paper.citations. Remove the limit, iterate in batches if needed, and build edges for both references and citations. Also guard that target papers exist.

-        cursor = self.db.aql.execute("""
-            FOR paper IN arxiv_papers
-            FILTER paper.references != null OR paper.citations != null
-            LIMIT 100
-            RETURN {
-                id: paper.arxiv_id,
-                refs: paper.references,
-                cites: paper.citations
-            }
-        """)
+        cursor = self.db.aql.execute("""
+            FOR paper IN arxiv_papers
+            FILTER paper.references != null OR paper.citations != null
+            RETURN {
+                id: paper.arxiv_id,
+                refs: paper.references,
+                cites: paper.citations
+            }
+        """)
@@
-        for paper in papers_with_citations:
+        for paper in papers_with_citations:
             # Process references
             if paper.get('refs'):
                 for ref_id in paper['refs']:
-                    edge = {
-                        '_from': f"arxiv_papers/{paper['id']}",
-                        '_to': f"arxiv_papers/{ref_id}",
-                        'type': 'cites'
-                    }
-                    edges.append(edge)
+                    _from = f"arxiv_papers/{paper['id']}"
+                    _to = f"arxiv_papers/{ref_id}"
+                    # Only add if the referenced paper exists
+                    try:
+                        if self.db.has_document(_to):
+                            edges.append({'_from': _from, '_to': _to, 'type': 'cites'})
+                    except (ArangoError, ServerConnectionError):
+                        pass
+            # Process inbound citations if present (paper is cited by others)
+            if paper.get('cites'):
+                for citing_id in paper['cites']:
+                    _from = f"arxiv_papers/{citing_id}"
+                    _to = f"arxiv_papers/{paper['id']}"
+                    try:
+                        if self.db.has_document(_from):
+                            edges.append({'_from': _from, '_to': _to, 'type': 'cites'})
+                    except (ArangoError, ServerConnectionError):
+                        pass
@@
-            except Exception as e:
+            except (ArangoError, ServerConnectionError) as e:
                 print(f"Error creating citation edges: {e}")

🏁 Script executed:

#!/bin/bash
# Show build_citation_edges method with context and check for LIMIT 100 and citation handling

echo "===== build_citation_edges snippet (lines 200-300) ====="
sed -n '200,300p' tools/graphsage/utils/build_graph.py || echo "File or range not found"

echo
echo "===== LIMIT 100 occurrences ====="
grep -n "LIMIT 100" tools/graphsage/utils/build_graph.py || echo "None found"

echo
echo "===== Citation field processing checks ====="
grep -n "paper.references" -n tools/graphsage/utils/build_graph.py || echo "No references processing found"
grep -n "paper.citations" -n tools/graphsage/utils/build_graph.py || echo "No citations processing found"

Length of output: 3857


Remove LIMIT and process all citation edges including inbound citations
In build_citation_edges (tools/graphsage/utils/build_graph.py:219-261), remove the LIMIT 100 clause to cover all papers, batch‐iterate as needed, handle both references and citations fields, and verify each target document exists before inserting the edge.

🧰 Tools
🪛 Ruff (0.12.2)

257-257: Do not catch blind exception: Exception

(BLE001)

- Fix CUDA device mismatch in LSTMAggregator random permutation
- Clean up imports and use proper typing (DefaultDict) in memory_store
- Fix edge counting for undirected graphs (divide by 2)
- Improve shared memory allocation with guards and memory limits
- Ensure deterministic neighbor ordering with sorted() for reproducibility

These changes improve stability, memory management, and reproducibility
of the GraphSAGE implementation.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (21)
core/framework/memory_store.py (6)

132-161: Fix unused variable and per-collection count logging.

Drop unused collection and print per-collection increment, not cumulative.

-            try:
-                collection = db.collection(collection_name)
+            try:
                 print(f"Loading {collection_name}...")
+                start_index = index
@@
-                print(f"  Loaded {index} nodes from {collection_name}")
+                print(f"  Loaded {index - start_index} nodes from {collection_name}")

14-19: Tighten typing for adjacency and drop unused import.

Use DefaultDict for mypy and remove unused multiprocessing alias.

Apply:

-from typing import Dict, List, Set, Tuple, Optional, Any
+from typing import Dict, List, Set, Tuple, Optional, Any, DefaultDict
@@
-from collections import defaultdict
-import multiprocessing as mp
+from collections import defaultdict
@@
-        self.adjacency: Dict[int, Set[int]] = defaultdict(set)
+        self.adjacency: DefaultDict[int, Set[int]] = defaultdict(set)

117-124: Edge count is double-counted for undirected adjacency.

Report unique edges by halving summed degrees.

-        self.stats = GraphStats(
-            num_nodes=len(self.node_ids),
-            num_edges=sum(len(neighbors) for neighbors in self.adjacency.values()),
+        undirected_edges = sum(len(neighbors) for neighbors in self.adjacency.values()) // 2
+        self.stats = GraphStats(
+            num_nodes=len(self.node_ids),
+            num_edges=undirected_edges,
             num_node_types=len(set(self.node_types.values())),
             num_edge_types=len(set(self.edge_types.values())),
             memory_usage_gb=memory_usage,
             load_time_seconds=load_time
         )

170-193: Canonicalize edge_types keys for undirected storage.

Avoid storing both (u, v) and (v, u). Use (min, max).

-                        self.adjacency[from_idx].add(to_idx)
-                        self.adjacency[to_idx].add(from_idx)  # Undirected
-                        self.edge_types[(from_idx, to_idx)] = edge_type
+                        self.adjacency[from_idx].add(to_idx)
+                        self.adjacency[to_idx].add(from_idx)  # Undirected
+                        key = (min(from_idx, to_idx), max(from_idx, to_idx))
+                        self.edge_types[key] = edge_type

199-226: Guard shared-memory creation (zero nodes, memory limit, stale segments).

Prevent zero-size shm, enforce max_memory_gb, cleanup pre-existing segment.

     def create_shared_memory(self, embedding_dim: int = 2048):
@@
-        num_nodes = len(self.node_ids)
+        num_nodes = len(self.node_ids)
+        if num_nodes == 0:
+            raise ValueError("Cannot allocate embeddings: graph has zero nodes.")
@@
-        size = num_nodes * embedding_dim * np.float32().itemsize
+        size = num_nodes * embedding_dim * np.float32().itemsize
+        expected_gb = size / (1024 ** 3)
+        if expected_gb > self.max_memory_gb:
+            raise MemoryError(
+                f"Requested {expected_gb:.2f} GB exceeds max_memory_gb={self.max_memory_gb:.2f} GB."
+            )
@@
-        self.shared_memory = shared_memory.SharedMemory(create=True, size=size)
+        if self.shared_memory is not None:
+            self.cleanup()
+        self.shared_memory = shared_memory.SharedMemory(create=True, size=size)

336-355: Re-wrap adjacency as defaultdict on load to preserve write-time behavior.

Plain dict loses auto-set creation.

-        self.adjacency = {int(k): set(v) for k, v in data['adjacency'].items()}
+        self.adjacency = defaultdict(
+            set, {int(k): set(v) for k, v in data['adjacency'].items()}
+        )
tools/graphsage/utils/build_graph.py (7)

2-12: Expand module docstring to tie into CONVEYANCE and WHERE/WHAT/TIME (repo guideline).

Bring this in line with coding_guidelines.

-"""
-Build Graph from Existing Data.
-...
-This prepares the graph for GraphSAGE processing.
-"""
+"""
+Graph construction utilities for a heterogeneous GraphSAGE graph.
+
+CONVEYANCE: materialize relations (paper↔chunk, semantic similarity, code↔paper, citations)
+that GraphSAGE aggregates to surface theory–practice bridges.
+
+WHERE: ArangoDB node collections (arxiv_papers, arxiv_chunks, arxiv_embeddings, github_papers)
+→ edge collections (paper_has_chunk, chunk_similarity, code_implements_paper, paper_cites_paper).
+
+WHAT: deterministic, idempotent edge builders with batching and optional ANN similarity.
+
+TIME: supports incremental runs; builders can be re-executed per segment without retraining.
+"""

219-261: Process all citations (remove LIMIT) and handle inbound citations; optionally verify endpoints exist.

Current LIMIT 100 truncates edges and ‘citations’ field is ignored.

-        cursor = self.db.aql.execute("""
+        cursor = self.db.aql.execute("""
             FOR paper IN arxiv_papers
             FILTER paper.references != null OR paper.citations != null
-            LIMIT 100
             RETURN {
                 id: paper.arxiv_id,
                 refs: paper.references,
                 cites: paper.citations
             }
         """)
@@
-        for paper in papers_with_citations:
+        for paper in papers_with_citations:
             # Process references
             if paper.get('refs'):
                 for ref_id in paper['refs']:
-                    edge = {
-                        '_from': f"arxiv_papers/{paper['id']}",
-                        '_to': f"arxiv_papers/{ref_id}",
-                        'type': 'cites'
-                    }
-                    edges.append(edge)
+                    edges.append({'_from': f"arxiv_papers/{paper['id']}", '_to': f"arxiv_papers/{ref_id}", 'type': 'cites'})
+            # Process inbound citations if present
+            if paper.get('cites'):
+                for citing_id in paper['cites']:
+                    edges.append({'_from': f"arxiv_papers/{citing_id}", '_to': f"arxiv_papers/{paper['id']}", 'type': 'cites'})

24-25: Remove sys.path mutation; rely on package/relative imports.

Violates packaging guidelines and is unnecessary here.

-sys.path.append(str(Path(__file__).parent.parent.parent.parent))
-

22-23: Import specific Arango exceptions (avoid blind excepts).

Prepare for narrowed exception handling below.

 from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError

50-57: Ensure unique compound index on (_from, _to, type) for edge collections.

Prevents duplicate edges; makes inserts idempotent.

     def _ensure_collections(self):
         """Ensure all edge collections exist."""
         for collection_name in self.edge_collections.keys():
             if not self.db.has_collection(collection_name):
                 self.db.create_collection(collection_name, edge=True)
                 print(f"Created edge collection: {collection_name}")
             self.edge_collections[collection_name] = self.db.collection(collection_name)
+            # Ensure unique index on (_from, _to, type)
+            try:
+                idx_fields = ["_from", "_to", "type"]
+                existing = self.edge_collections[collection_name].indexes()
+                has_unique = any(
+                    i.get("type") == "persistent"
+                    and i.get("unique")
+                    and i.get("fields") == idx_fields
+                for i in existing
+                )
+                if not has_unique:
+                    self.edge_collections[collection_name].add_persistent_index(
+                        idx_fields, unique=True, sparse=False
+                    )
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"Warning: failed to ensure index on {collection_name}: {e}")

99-176: Replace O(N²) cosine loop with FAISS ANN top-k over normalized float32 vectors.

The current approach won’t scale beyond small subsets; switch to ANN, skip self, canonicalize direction, and use chunk endpoints when available.

-    def build_embedding_similarity_edges(self, similarity_threshold: float = 0.8):
+    def build_embedding_similarity_edges(self, similarity_threshold: float = 0.8, top_k: int = 10):
@@
-        # Convert to numpy array
-        vectors = np.array([e['vector'] for e in embeddings])
+        # Prepare vectors (float32 + L2 normalize for cosine via inner product)
+        vectors = np.array([e['vector'] for e in embeddings], dtype=np.float32)
+        norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
+        vectors = vectors / norms
+
+        # Prefer chunk endpoints when available, else use embedding docs
+        ids_for_edges = [
+            (e["chunk_id"] if e.get("chunk_id") else e["id"])
+            for e in embeddings
+        ]
@@
-        # Compute pairwise similarities (in batches for memory efficiency)
-        batch_size = 100
-        edges = []
-
-        for i in tqdm(range(0, len(embeddings), batch_size)):
-            batch_end = min(i + batch_size, len(embeddings))
-            batch_vectors = vectors[i:batch_end]
-            
-            # Compute similarities with all other vectors
-            similarities = cosine_similarity(batch_vectors, vectors)
-            
-            # Find high similarity pairs
-            for local_idx in range(batch_end - i):
-                global_idx = i + local_idx
-                
-                # Get top similar items (excluding self)
-                sim_scores = similarities[local_idx]
-                sim_scores[global_idx] = -1  # Exclude self
-                
-                # Find indices above threshold
-                similar_indices = np.where(sim_scores >= similarity_threshold)[0]
-                
-                for sim_idx in similar_indices[:10]:  # Limit to top 10
-                    if sim_idx != global_idx:
-                        edge = {
-                            '_from': embeddings[global_idx]['id'],
-                            '_to': embeddings[sim_idx]['id'],
-                            'similarity': float(sim_scores[sim_idx]),
-                            'type': 'semantic_similarity'
-                        }
-                        edges.append(edge)
+        # Build FAISS index (IP on normalized vectors == cosine)
+        try:
+            import faiss  # type: ignore
+        except ImportError:
+            raise RuntimeError("FAISS is required for scalable similarity edge construction.")
+        index = faiss.IndexFlatIP(vectors.shape[1])
+        index.add(vectors)
+
+        edges = []
+        batch_size = 2048
+        for i in tqdm(range(0, len(vectors), batch_size)):
+            batch = vectors[i:i + batch_size]
+            D, I = index.search(batch, top_k + 1)  # includes self
+            for row, (scores, idxs) in enumerate(zip(D, I)):
+                src = ids_for_edges[i + row]
+                for score, j in zip(scores, idxs):
+                    if j < 0 or j == (i + row) or score < similarity_threshold:
+                        continue
+                    dst = ids_for_edges[j]
+                    # Canonicalize direction to avoid duplicates
+                    _from, _to = (src, dst) if src <= dst else (dst, src)
+                    edges.append(
+                        {"_from": _from, "_to": _to, "similarity": float(score), "type": "semantic_similarity"}
+                    )
@@
-                try:
-                    self.edge_collections['chunk_similarity'].insert_many(batch, overwrite=True)
+                try:
+                    self.edge_collections['chunk_similarity'].insert_many(batch, overwrite=True)
                 except Exception as e:
                     print(f"Error inserting batch: {e}")

Follow-up: add faiss-cpu/faiss-gpu to requirements.


92-94: Narrow exception handling to Arango exceptions.

Avoid masking unrelated errors and satisfy Ruff BLE001.

-                except Exception as e:
+                except (ArangoError, ServerConnectionError) as e:
                     print(f"Error inserting batch: {e}")

Apply similarly at Lines 168-171, 214-216, 257-258.

Also applies to: 168-171, 214-216, 257-258

tools/graphsage/bridge_discovery/theory_practice_finder.py (1)

259-318: Speed up latent-bridge scoring (eval mode, no_grad, batch code embeddings).

Removes redundant forward calls and dropout randomness.

-        bridges = []
-        
-        # Use GraphSAGE to predict links
-        # Sample some paper-code pairs
+        bridges = []
+        # Use GraphSAGE to predict links
         num_samples = min(100, len(self.paper_nodes))
         sampled_papers = np.random.choice(self.paper_nodes, num_samples, replace=False)
         sampled_codes = np.random.choice(self.code_nodes, 
                                         min(50, len(self.code_nodes)), 
                                         replace=False)
-        
-        for paper_idx in sampled_papers:
-            paper_id = self.graph_store.index_to_node[paper_idx]
-            
-            # Get GraphSAGE embedding with neighborhood context
-            paper_emb = self.graphsage_model.forward([paper_idx]).detach().cpu().numpy()
-            
-            # Score against code nodes
-            code_scores = []
-            for code_idx in sampled_codes:
-                code_emb = self.graphsage_model.forward([code_idx]).detach().cpu().numpy()
-                
-                # Compute similarity in GraphSAGE space
-                similarity = cosine_similarity(paper_emb, code_emb)[0, 0]
-                code_scores.append((code_idx, similarity))
-            
-            # Sort by score
-            code_scores.sort(key=lambda x: x[1], reverse=True)
-            
-            # Take top matches
-            for code_idx, score in code_scores[:self.config.max_bridges_per_paper]:
-                if score >= self.config.similarity_threshold * 0.8:  # Slightly lower threshold
-                    code_id = self.graph_store.index_to_node[code_idx]
-                    
-                    bridge = Bridge(
-                        paper_id=paper_id,
-                        code_id=code_id,
-                        confidence=float(score),
-                        bridge_type='latent',
-                        evidence={
-                            'graphsage_similarity': float(score),
-                            'discovered_through': 'neighborhood_aggregation'
-                        }
-                    )
-                    bridges.append(bridge)
+        self.graphsage_model.eval()
+        with torch.no_grad():
+            code_embs = self.graphsage_model.forward(list(sampled_codes)).detach().cpu().numpy()
+            for paper_idx in sampled_papers:
+                paper_id = self.graph_store.index_to_node[paper_idx]
+                paper_emb = self.graphsage_model.forward([paper_idx]).detach().cpu().numpy()
+                sims = cosine_similarity(paper_emb, code_embs)[0]
+                top_indices = np.argsort(sims)[-self.config.max_bridges_per_paper:][::-1]
+                for j in top_indices:
+                    score = float(sims[j])
+                    if score >= self.config.similarity_threshold * 0.8:
+                        code_idx = int(sampled_codes[j])
+                        code_id = self.graph_store.index_to_node[code_idx]
+                        bridge = Bridge(
+                            paper_id=paper_id,
+                            code_id=code_id,
+                            confidence=score,
+                            bridge_type='latent',
+                            evidence={
+                                'graphsage_similarity': score,
+                                'discovered_through': 'neighborhood_aggregation'
+                            }
+                        )
+                        bridges.append(bridge)
tools/graphsage/utils/neighborhood_sampler.py (2)

170-184: Ensure adaptive sampling fills to num_samples.

Top up from the other bucket if one is undersized.

-        if diff_type:
-            num_diff = min(len(diff_type), num_samples * 2 // 3)
-            num_same = min(len(same_type), num_samples - num_diff)
+        if diff_type:
+            target_diff = max(1, int(round(num_samples * 2 / 3)))
+            num_diff = min(len(diff_type), target_diff)
+            num_same = min(len(same_type), num_samples - num_diff)
@@
-            return sampled[:num_samples]
+            remaining = num_samples - len(sampled)
+            if remaining > 0:
+                pool = [n for n in (same_type + diff_type) if n not in sampled]
+                if pool:
+                    sampled.extend(random.sample(pool, min(remaining, len(pool))))
+            return sampled[:num_samples]

72-76: Deterministic neighbor ordering to align with importance sampling.

Sets are unordered; use sorted(...) to keep probs aligned with neighbor IDs.

-        for node_idx in range(len(self.graph_store.node_ids)):
-            neighbors = list(self.graph_store.adjacency.get(node_idx, set()))
+        for node_idx in range(len(self.graph_store.node_ids)):
+            neighbors = sorted(self.graph_store.adjacency.get(node_idx, set()))
@@
-        neighbors = list(self.graph_store.adjacency.get(node, set()))
+        neighbors = sorted(self.graph_store.adjacency.get(node, set()))

Also applies to: 110-116

tools/graphsage/pipelines/graphsage_pipeline.py (4)

27-35: Remove sys.path hacks; use package-relative imports within tools.graphsage.

Keep core.framework imports absolute; make intra-package imports relative. Run via: python -m tools.graphsage.pipelines.graphsage_pipeline.

-# Add parent directories to path
-sys.path.append(str(Path(__file__).parent.parent.parent.parent))
-
 from core.framework.memory_store import GraphMemoryStore
 from core.framework.graph_embedders import GraphSAGEEmbedder, GraphSAGEConfig, AggregatorType
-from tools.graphsage.utils.neighborhood_sampler import NeighborhoodSampler, SamplingConfig
-from tools.graphsage.bridge_discovery.theory_practice_finder import (
+from ..utils.neighborhood_sampler import NeighborhoodSampler, SamplingConfig
+from ..bridge_discovery.theory_practice_finder import (
     TheoryPracticeFinder, BridgeDiscoveryConfig
 )

56-65: Add checkpoint/resume fields to PipelineConfig to satisfy pipeline resilience requirement.

     save_bridges: bool = True
     
     # Memory settings
     max_memory_gb: float = 100.0
     use_shared_memory: bool = True
+    
+    # Checkpointing
+    checkpoint_path: str = "./cache/graphsage_checkpoint.json"
+    resume: bool = True

192-195: Do not overwrite input feature buffer with output embeddings (shape mismatch).

graph_store.node_embeddings is sized to input_dim; embeddings is output_dim. This will crash and corrupt inputs.

-        # Store in graph store
-        if self.graph_store.node_embeddings is not None:
-            self.graph_store.node_embeddings[:] = embeddings
+        # NOTE: Keep inputs and outputs separate; skip in-place copy here.
+        # If zero-copy outputs are needed, add a dedicated shared-memory buffer
+        # (e.g., graph_store.output_embeddings) sized to embeddings.shape.

Outside this hunk, add an output buffer to GraphMemoryStore (example):

# core/framework/memory_store.py
def create_output_shared_memory(self, output_dim: int):
    num_nodes = len(self.node_ids)
    size = num_nodes * output_dim * np.float32().itemsize
    self.output_shared_memory = shared_memory.SharedMemory(create=True, size=size)
    self.output_embeddings = np.ndarray((num_nodes, output_dim), dtype=np.float32, buffer=self.output_shared_memory.buf)

329-357: Add minimal checkpoint/resume to meet tools//pipelines requirement.**

 def run(self):
     """Run the complete pipeline."""
@@
-        # Step 1: Load graph
-        self.load_graph()
+        # Load checkpoint
+        ckpt = {}
+        if self.config.resume and Path(self.config.checkpoint_path).exists():
+            try:
+                ckpt = json.load(open(self.config.checkpoint_path))
+            except Exception:
+                ckpt = {}
+        def mark(step: str):
+            ckpt[step] = True
+            os.makedirs(os.path.dirname(self.config.checkpoint_path), exist_ok=True)
+            json.dump(ckpt, open(self.config.checkpoint_path, "w"))
+
+        # Step 1: Load graph
+        if not ckpt.get("loaded"):
+            self.load_graph()
+            mark("loaded")
@@
-        # Step 2: Initialize model
-        self.initialize_model()
+        # Step 2: Initialize model
+        if not ckpt.get("initialized"):
+            self.initialize_model()
+            mark("initialized")
@@
-        # Step 4: Generate embeddings
-        self.generate_embeddings()
+        # Step 4: Generate embeddings
+        if not ckpt.get("embeddings"):
+            self.generate_embeddings()
+            mark("embeddings")
@@
-        # Step 5: Discover bridges
-        bridges = self.discover_bridges()
+        # Step 5: Discover bridges
+        if not ckpt.get("bridges"):
+            bridges = self.discover_bridges()
+            mark("bridges")
+        else:
+            bridges = {}
core/framework/graph_embedders.py (1)

173-177: Fix device mismatch when permuting neighbors on CUDA.

perm must be on the same device as neighbor_feats.

-        perm = torch.randperm(num_neighbors)
+        perm = torch.randperm(num_neighbors, device=neighbor_feats.device)
🧹 Nitpick comments (13)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)

25-33: Fix typing and extend Bridge with conveyance_score.

Use typing.Any, and add conveyance_score with default.

-from dataclasses import dataclass
+from dataclasses import dataclass, field
+from typing import Any
@@
 class Bridge:
@@
-    evidence: Dict[str, any]  # Supporting evidence for the bridge
+    evidence: Dict[str, Any]  # Supporting evidence for the bridge
+    conveyance_score: float = field(default=0.0, compare=False)

80-94: Rename unused loop var; minor cleanup.

node_id is unused; rename to _node_id to satisfy linters.

-        for node_id, node_idx in self.graph_store.node_ids.items():
+        for _node_id, node_idx in self.graph_store.node_ids.items():

361-396: Rename ambiguous variable and avoid adding undeclared attrs dynamically.

Rename I → instr and rely on Bridge.conveyance_score field.

-            I = bridge.evidence.get('instruction_fit', 0.5)
+            instr = bridge.evidence.get('instruction_fit', 0.5)
@@
-            Ctx = 0.25 * (L + I + A + G)
+            Ctx = 0.25 * (L + instr + A + G)
tools/graphsage/utils/neighborhood_sampler.py (2)

247-252: Optional: sort per-layer nodes to stabilize minibatches.

Stabilizes training/debugging and hashing of unique node sets.

-        for layer in range(k + 1):
-            nodes = list(layer_nodes[layer])
+        for layer in range(k + 1):
+            nodes = sorted(layer_nodes[layer])
             layer_lists.append(nodes)
             unique_nodes.update(nodes)

309-312: Remove unused variable.

total_edges is unused.

-        total_edges = sum(degrees)
         sampled_edges_per_node = sum(self.config.num_neighbors)
tools/graphsage/pipelines/graphsage_pipeline.py (7)

128-129: Remove unused variable assignment.

stats is unused.

-            stats = self.graph_store.load_from_arangodb(db_config)
+            self.graph_store.load_from_arangodb(db_config)

156-160: Fix lint: f-string without placeholders.

-        print(f"Model initialized:")
+        print("Model initialized:")

161-167: Silence unused argument warning for num_epochs.

-    def train_model(self, num_epochs: int = 10):
+    def train_model(self, _num_epochs: int = 10):

36-37: Narrow exception handling when persisting embeddings.

Catching Exception hides real DB issues; narrow to python-arango server errors.

 from arango import ArangoClient
+from arango.exceptions import ArangoServerError
@@
-            except Exception as e:
+            except ArangoServerError as e:
                 print(f"Error saving batch: {e}")

Also applies to: 277-283


321-325: Narrow exception handling when persisting bridges.

-            except Exception as e:
+            except ArangoServerError as e:
                 print(f"Error saving {bridge_type} bridges: {e}")

413-416: Make default config path robust when running as a module.

-    parser.add_argument('--config', type=str, default='../configs/graphsage_config.yaml',
+    default_cfg = str((Path(__file__).resolve().parent.parent / 'configs' / 'graphsage_config.yaml'))
+    parser.add_argument('--config', type=str, default=default_cfg,

1-1: Shebang present but file is not executable. Remove or chmod +x.

-#!/usr/bin/env python3
core/framework/graph_embedders.py (1)

1-1: Shebang present but file is not executable. Remove or chmod +x.

-#!/usr/bin/env python3
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b51e00b and 79c70be.

📒 Files selected for processing (6)
  • core/framework/graph_embedders.py (1 hunks)
  • core/framework/memory_store.py (1 hunks)
  • tools/graphsage/bridge_discovery/theory_practice_finder.py (1 hunks)
  • tools/graphsage/pipelines/graphsage_pipeline.py (1 hunks)
  • tools/graphsage/utils/build_graph.py (1 hunks)
  • tools/graphsage/utils/neighborhood_sampler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
{tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • tools/graphsage/utils/build_graph.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use relative imports for intra-module references (from .utils import helper_function)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • tools/graphsage/utils/build_graph.py
{core,tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All code must include docstrings tying implementation to the theoretical framework (e.g., explain CONVEYANCE, WHERE/WHAT/TIME)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • tools/graphsage/utils/build_graph.py
{core,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Infrastructure (core/, tools/) must not depend on experiments/

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • tools/graphsage/utils/build_graph.py
{tools/**/pipelines/**/*.py,tools/github/github_pipeline_manager.py}

📄 CodeRabbit inference engine (CLAUDE.md)

Pipelines must support resumption from checkpoints (error recovery)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
{tools/arxiv,core/framework}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv and core/framework
Run Ruff linting for tools/arxiv and core/framework

Files:

  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
core/framework/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run mypy type checking on core/framework only

Files:

  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
🧠 Learnings (3)
📚 Learning: 2025-09-04T03:47:49.352Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T03:47:49.352Z
Learning: Applies to {tools,experiments}/**/*.py : When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
📚 Learning: 2025-09-04T03:47:49.352Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T03:47:49.352Z
Learning: Applies to {tools/**/pipelines/**/*.py,tools/github/github_pipeline_manager.py} : Pipelines must support resumption from checkpoints (error recovery)

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
📚 Learning: 2025-09-04T03:49:10.310Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T03:49:10.310Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Implement checkpointing with atomic saves and automatic resume on failure; persist to acid_phased_checkpoint.json

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
🧬 Code graph analysis (5)
tools/graphsage/pipelines/graphsage_pipeline.py (3)
core/framework/memory_store.py (6)
  • GraphMemoryStore (35-355)
  • load_from_disk (336-355)
  • load_from_arangodb (71-130)
  • save_to_disk (316-334)
  • create_shared_memory (199-225)
  • _calculate_memory_usage (280-295)
core/framework/graph_embedders.py (4)
  • GraphSAGEEmbedder (233-425)
  • GraphSAGEConfig (33-50)
  • AggregatorType (24-29)
  • generate_embeddings (402-425)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)
  • TheoryPracticeFinder (50-439)
  • BridgeDiscoveryConfig (36-47)
  • discover_all_bridges (320-359)
tools/graphsage/utils/neighborhood_sampler.py (1)
core/framework/graph_embedders.py (1)
  • sample_neighbors (265-293)
core/framework/graph_embedders.py (3)
core/framework/memory_store.py (2)
  • GraphMemoryStore (35-355)
  • get_neighbors (227-245)
tools/graphsage/utils/neighborhood_sampler.py (1)
  • sample_neighbors (99-126)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • generate_embeddings (180-203)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)
core/framework/graph_embedders.py (9)
  • GraphSAGEEmbedder (233-425)
  • GraphSAGEConfig (33-50)
  • generate_embeddings (402-425)
  • forward (62-64)
  • forward (78-107)
  • forward (124-146)
  • forward (162-187)
  • forward (210-230)
  • forward (295-371)
core/framework/memory_store.py (2)
  • GraphMemoryStore (35-355)
  • get_k_hop_neighbors (247-278)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • generate_embeddings (180-203)
tools/graphsage/utils/build_graph.py (1)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • main (410-437)
🪛 Ruff (0.12.2)
tools/graphsage/pipelines/graphsage_pipeline.py

1-1: Shebang is present but file is not executable

(EXE001)


128-128: Local variable stats is assigned to but never used

Remove assignment to unused variable stats

(F841)


156-156: f-string without any placeholders

Remove extraneous f prefix

(F541)


161-161: Unused method argument: num_epochs

(ARG002)


280-280: Do not catch blind exception: Exception

(BLE001)


324-324: Do not catch blind exception: Exception

(BLE001)


384-384: Avoid specifying long messages outside the exception class

(TRY003)


386-386: Avoid specifying long messages outside the exception class

(TRY003)

tools/graphsage/utils/neighborhood_sampler.py

1-1: Shebang is present but file is not executable

(EXE001)


126-126: Avoid specifying long messages outside the exception class

(TRY003)


239-239: Loop control variable node not used within loop body

Rename unused node to _node

(B007)


266-266: Avoid specifying long messages outside the exception class

(TRY003)


309-309: Local variable total_edges is assigned to but never used

Remove assignment to unused variable total_edges

(F841)

core/framework/graph_embedders.py

1-1: Shebang is present but file is not executable

(EXE001)


208-208: Avoid specifying long messages outside the exception class

(TRY003)


250-250: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

core/framework/memory_store.py

1-1: Shebang is present but file is not executable

(EXE001)


85-85: Avoid specifying long messages outside the exception class

(TRY003)


87-87: Avoid specifying long messages outside the exception class

(TRY003)


92-92: Avoid specifying long messages outside the exception class

(TRY003)


102-102: Do not catch blind exception: Exception

(BLE001)


103-103: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


103-103: Avoid specifying long messages outside the exception class

(TRY003)


144-144: Local variable collection is assigned to but never used

Remove assignment to unused variable collection

(F841)


159-159: Do not catch blind exception: Exception

(BLE001)


196-196: Do not catch blind exception: Exception

(BLE001)


306-306: Do not catch blind exception: Exception

(BLE001)

tools/graphsage/bridge_discovery/theory_practice_finder.py

1-1: Shebang is present but file is not executable

(EXE001)


85-85: Loop control variable node_id not used within loop body

Rename unused node_id to _node_id

(B007)


163-163: Avoid specifying long messages outside the exception class

(TRY003)


381-381: Ambiguous variable name: I

(E741)

tools/graphsage/utils/build_graph.py

1-1: Shebang is present but file is not executable

(EXE001)


92-92: Do not catch blind exception: Exception

(BLE001)


170-170: Do not catch blind exception: Exception

(BLE001)


214-214: Do not catch blind exception: Exception

(BLE001)


257-257: Do not catch blind exception: Exception

(BLE001)


279-280: try-except-pass detected, consider logging the exception

(S110)


279-279: Do not catch blind exception: Exception

(BLE001)


292-293: try-except-pass detected, consider logging the exception

(S110)


292-292: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (1)
tools/graphsage/pipelines/graphsage_pipeline.py (1)

278-279: LGTM: Upsert semantics via overwrite=True on insert_many.

Comment on lines 46 to 51
def __post_init__(self):
if self.hidden_dims is None:
self.hidden_dims = [512, 256]
if self.num_neighbors is None:
self.num_neighbors = [25, 10] # Sample 25 at layer 1, 10 at layer 2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix layer/sampling mismatch defaults and enforce invariant.

As written, hidden_dims (2) + output makes 3 layers but num_neighbors has length 2, causing runtime errors. Make defaults consistent and validate.

     def __post_init__(self):
-        if self.hidden_dims is None:
-            self.hidden_dims = [512, 256]
-        if self.num_neighbors is None:
-            self.num_neighbors = [25, 10]  # Sample 25 at layer 1, 10 at layer 2
+        if self.hidden_dims is None:
+            self.hidden_dims = [512]  # single hidden layer by default
+        if self.num_neighbors is None:
+            self.num_neighbors = [25]  # one-hop sampling by default
+        # Invariant: number of layers == len(num_neighbors)
+        if len(self.hidden_dims) + 1 != len(self.num_neighbors):
+            raise ValueError("len(hidden_dims)+1 must equal len(num_neighbors)")

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +250 to +261
dims = [config.input_dim] + config.hidden_dims + [config.output_dim]

for i in range(len(dims) - 1):
layer = GraphSAGELayer(
input_dim=dims[i],
output_dim=dims[i + 1],
aggregator_type=config.aggregator_type,
dropout=config.dropout,
concat=config.concat
)
self.layers.append(layer)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Validate layer count vs sampling depth; minor list concat nit.

-        dims = [config.input_dim] + config.hidden_dims + [config.output_dim]
+        dims = [config.input_dim, *config.hidden_dims, config.output_dim]
+        if len(dims) - 1 != len(config.num_neighbors):
+            raise ValueError(
+                f"Layer count mismatch: {len(dims) - 1} layers vs {len(config.num_neighbors)} sampling depths."
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
dims = [config.input_dim] + config.hidden_dims + [config.output_dim]
for i in range(len(dims) - 1):
layer = GraphSAGELayer(
input_dim=dims[i],
output_dim=dims[i + 1],
aggregator_type=config.aggregator_type,
dropout=config.dropout,
concat=config.concat
)
self.layers.append(layer)
dims = [config.input_dim, *config.hidden_dims, config.output_dim]
if len(dims) - 1 != len(config.num_neighbors):
raise ValueError(
f"Layer count mismatch: {len(dims) - 1} layers vs {len(config.num_neighbors)} sampling depths."
)
for i in range(len(dims) - 1):
layer = GraphSAGELayer(
input_dim=dims[i],
output_dim=dims[i + 1],
aggregator_type=config.aggregator_type,
dropout=config.dropout,
concat=config.concat
)
self.layers.append(layer)
🧰 Tools
🪛 Ruff (0.12.2)

250-250: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
In core/framework/graph_embedders.py around lines 250 to 261, ensure the number
of created GraphSAGELayer instances matches the model's sampling_depth and fix a
minor list concatenation nit: validate that (len(dims) - 1) ==
config.sampling_depth and if not either raise a clear ValueError describing the
mismatch or adjust hidden_dims (e.g., truncate or pad with repeated/sensible
values) before building dims; also ensure hidden_dims is safely concatenated by
converting it to a list first (dims = [config.input_dim] +
list(config.hidden_dims) + [config.output_dim]) to avoid issues when hidden_dims
is not a list.

Comment on lines +95 to +104
try:
client = ArangoClient(hosts=db_config.get('host', 'http://localhost:8529'))
db = client.db(
db_config['database'],
username=db_config['username'],
password=password
)
except Exception as e:
raise ConnectionError(f"Failed to connect to ArangoDB: {e}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Catch specific Arango exceptions and preserve traceback context.

Replace broad Exception with ArangoError/ServerConnectionError and use raise ... from e.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError
@@
-        except Exception as e:
-            raise ConnectionError(f"Failed to connect to ArangoDB: {e}")
+        except (ArangoError, ServerConnectionError) as e:
+            raise ConnectionError("Failed to connect to ArangoDB") from e

Replicate similar narrowing in other catch sites in this file (Lines 159-160, 196-197, 306-307). I can patch them all if preferred.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
client = ArangoClient(hosts=db_config.get('host', 'http://localhost:8529'))
db = client.db(
db_config['database'],
username=db_config['username'],
password=password
)
except Exception as e:
raise ConnectionError(f"Failed to connect to ArangoDB: {e}")
++ b/core/framework/memory_store.py
@@
from arango import ArangoClient
from arango.exceptions import ArangoError, ServerConnectionError
@@
- except Exception as e:
except (ArangoError, ServerConnectionError) as e:
raise ConnectionError("Failed to connect to ArangoDB") from e
🧰 Tools
🪛 Ruff (0.12.2)

102-102: Do not catch blind exception: Exception

(BLE001)


103-103: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


103-103: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In core/framework/memory_store.py around lines 95-104, replace the broad "except
Exception as e" with specific arango.exceptions (import ArangoError and
ServerConnectionError) and re-raise using "raise ConnectionError(...) from e" to
preserve traceback; apply the same change to the other catch sites at lines
159-160, 196-197, and 306-307 so each except clause catches ArangoError and
ServerConnectionError (or the most specific arango exception available) and uses
"raise ... from e" when raising the ConnectionError.

Comment on lines 164 to 171
edge_collections = [
('semantic_similarity', 'similar'),
('theory_practice_bridges', 'implements'),
('paper_implements_theory', 'implements'),
]

for collection_name, edge_type in edge_collections:
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Mismatch with builder: memory store does not load the edge collections created by tools/graphsage/utils/build_graph.py.

GraphBuilder writes paper_has_chunk, chunk_similarity, paper_cites_paper, code_implements_paper, but here only semantic_similarity, theory_practice_bridges, paper_implements_theory are loaded. Direct/semantic bridges will be missing.

Apply:

-        edge_collections = [
-            ('semantic_similarity', 'similar'),
-            ('theory_practice_bridges', 'implements'),
-            ('paper_implements_theory', 'implements'),
-        ]
+        edge_collections = [
+            # legacy/alt names
+            ('semantic_similarity', 'similar'),
+            ('theory_practice_bridges', 'implements'),
+            ('paper_implements_theory', 'implements'),
+            # builder-produced collections
+            ('chunk_similarity', 'similar'),
+            ('paper_has_chunk', 'has_chunk'),
+            ('paper_cites_paper', 'cites'),
+            ('code_implements_paper', 'implements'),
+        ]

Follow-up: if “cites” must remain directed, consider not mirroring adjacency for that edge_type. I can draft that change if you want.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
edge_collections = [
('semantic_similarity', 'similar'),
('theory_practice_bridges', 'implements'),
('paper_implements_theory', 'implements'),
]
for collection_name, edge_type in edge_collections:
try:
edge_collections = [
# legacy/alt names
('semantic_similarity', 'similar'),
('theory_practice_bridges', 'implements'),
('paper_implements_theory', 'implements'),
# builder-produced collections
('chunk_similarity', 'similar'),
('paper_has_chunk', 'has_chunk'),
('paper_cites_paper', 'cites'),
('code_implements_paper', 'implements'),
]
for collection_name, edge_type in edge_collections:
try:
🤖 Prompt for AI Agents
In core/framework/memory_store.py around lines 164-171, the edge_collections
list is missing the collections created by tools/graphsage/utils/build_graph.py;
update edge_collections to include ('paper_has_chunk',
'<appropriate_edge_type>'), ('chunk_similarity','<appropriate_edge_type>'),
('paper_cites_paper','cites') and ('code_implements_paper','implements') in
addition to the existing entries, and ensure handling for directed edges: do not
mirror adjacency for 'paper_cites_paper' (keep it directed) while mirroring
undirected types like chunk_similarity; adjust any downstream logic that
currently assumes only the three old names so all builder-produced edges are
loaded and directedness preserved.

Comment on lines 108 to 140
def find_direct_bridges(self) -> List[Bridge]:
"""
Find direct bridges (explicit citations/references).
Returns:
List of direct bridges
"""
bridges = []

# Look for explicit edges between papers and code
for paper_idx in self.paper_nodes:
paper_id = self.graph_store.index_to_node[paper_idx]

# Check neighbors
neighbors = self.graph_store.adjacency.get(paper_idx, set())

for neighbor_idx in neighbors:
neighbor_type = self.graph_store.node_types.get(neighbor_idx, "")

if neighbor_type in ['code', 'repo']:
code_id = self.graph_store.index_to_node[neighbor_idx]

bridge = Bridge(
paper_id=paper_id,
code_id=code_id,
confidence=1.0, # Direct connection
bridge_type='direct',
evidence={'edge_exists': True}
)
bridges.append(bridge)

return bridges

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

“Direct” bridges currently include any neighbor edge; restrict to explicit code links.

Using adjacency will include similarity/citation edges. Filter by edge_types and allowed types (e.g., implements/code_implements_paper).

-        for paper_idx in self.paper_nodes:
+        allowed_types = {"implements", "code_implements_paper"}
+        # edge_types keys are canonicalized (min, max)
+        for paper_idx in self.paper_nodes:
             paper_id = self.graph_store.index_to_node[paper_idx]
-            neighbors = self.graph_store.adjacency.get(paper_idx, set())
+            neighbors = self.graph_store.adjacency.get(paper_idx, set())
             for neighbor_idx in neighbors:
                 neighbor_type = self.graph_store.node_types.get(neighbor_idx, "")
-                
-                if neighbor_type in ['code', 'repo']:
+                if neighbor_type in ['code', 'repo']:
+                    key = (min(paper_idx, neighbor_idx), max(paper_idx, neighbor_idx))
+                    etype = self.graph_store.edge_types.get(key)
+                    if etype not in allowed_types:
+                        continue
                     code_id = self.graph_store.index_to_node[neighbor_idx]
                     bridge = Bridge(
                         paper_id=paper_id,
                         code_id=code_id,
                         confidence=1.0,
                         bridge_type='direct',
                         evidence={'edge_exists': True}
                     )
                     bridges.append(bridge)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def find_direct_bridges(self) -> List[Bridge]:
"""
Find direct bridges (explicit citations/references).
Returns:
List of direct bridges
"""
bridges = []
# Look for explicit edges between papers and code
for paper_idx in self.paper_nodes:
paper_id = self.graph_store.index_to_node[paper_idx]
# Check neighbors
neighbors = self.graph_store.adjacency.get(paper_idx, set())
for neighbor_idx in neighbors:
neighbor_type = self.graph_store.node_types.get(neighbor_idx, "")
if neighbor_type in ['code', 'repo']:
code_id = self.graph_store.index_to_node[neighbor_idx]
bridge = Bridge(
paper_id=paper_id,
code_id=code_id,
confidence=1.0, # Direct connection
bridge_type='direct',
evidence={'edge_exists': True}
)
bridges.append(bridge)
return bridges
def find_direct_bridges(self) -> List[Bridge]:
"""
Find direct bridges (explicit citations/references).
Returns:
List of direct bridges
"""
bridges = []
# Look for explicit edges between papers and code
allowed_types = {"implements", "code_implements_paper"}
# edge_types keys are canonicalized (min, max)
for paper_idx in self.paper_nodes:
paper_id = self.graph_store.index_to_node[paper_idx]
neighbors = self.graph_store.adjacency.get(paper_idx, set())
for neighbor_idx in neighbors:
neighbor_type = self.graph_store.node_types.get(neighbor_idx, "")
if neighbor_type in ['code', 'repo']:
key = (min(paper_idx, neighbor_idx), max(paper_idx, neighbor_idx))
etype = self.graph_store.edge_types.get(key)
if etype not in allowed_types:
continue
code_id = self.graph_store.index_to_node[neighbor_idx]
bridge = Bridge(
paper_id=paper_id,
code_id=code_id,
confidence=1.0, # Direct connection
bridge_type='direct',
evidence={'edge_exists': True}
)
bridges.append(bridge)
return bridges
🤖 Prompt for AI Agents
In tools/graphsage/bridge_discovery/theory_practice_finder.py around lines
108-140, the current find_direct_bridges() treats any neighbor edge as a
"direct" bridge; change it to only consider explicit code-link edge types.
Update the neighbor iteration to retrieve edge metadata (use whatever structure
holds edge types in graph_store, e.g., graph_store.edge_types[(paper_idx,
neighbor_idx)] or adjacency entries that include edge_type), add an
allowed_edge_types set (e.g.,
{'implements','code_implements_paper','references_code','links_to_repo'}), and
only create Bridge objects when neighbor_type is in ['code','repo'] AND the
edge_type is in that allowed set; include the actual edge_type in the evidence
field. Ensure you handle missing edge metadata defensively (skip if not present)
and keep confidence=1.0 for these filtered explicit links.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (10)
core/framework/memory_store.py (5)

144-162: Remove unused collection; print per-collection delta; narrow except.

Eliminates Ruff F841 and makes counts accurate per collection.

         for collection_name, node_type in node_collections:
             try:
-                collection = db.collection(collection_name)
-                print(f"Loading {collection_name}...")
+                start_index = index
+                print(f"Loading {collection_name}...")
@@
-                print(f"  Loaded {index} nodes from {collection_name}")
+                print(f"  Loaded {index - start_index} nodes from {collection_name}")
-            except Exception as e:
-                print(f"  Warning: Could not load {collection_name}: {e}")
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"  Warning: Could not load {collection_name}: {e}")

20-21: Narrow ArangoDB exceptions and import specific types.

Import concrete exceptions to avoid blind catches and enable raise ... from e.

 from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError

101-103: Avoid blind catch; preserve traceback with raise ... from e.

Catching Exception hides real Arango errors and violates Ruff BLE001/B904.

-        except Exception as e:
-            raise ConnectionError(f"Failed to connect to ArangoDB: {e}")
+        except (ArangoError, ServerConnectionError) as e:
+            raise ConnectionError("Failed to connect to ArangoDB") from e

165-201: Load all builder-produced edge collections; preserve directedness; canonicalize undirected keys.

Brings parity with graph builder and prevents wrong mirroring for directed edges (e.g., cites).

     def _load_edges(self, db):
         """Load edges from edge collections."""
-        edge_collections = [
-            ('semantic_similarity', 'similar'),
-            ('theory_practice_bridges', 'implements'),
-            ('paper_implements_theory', 'implements'),
-        ]
+        # (collection_name, edge_type, directed)
+        edge_collections = [
+            # legacy/alt names
+            ('semantic_similarity', 'similar', False),
+            ('theory_practice_bridges', 'implements', False),
+            ('paper_implements_theory', 'implements', False),
+            # builder-produced collections
+            ('chunk_similarity', 'similar', False),
+            ('paper_has_chunk', 'has_chunk', False),
+            ('paper_cites_paper', 'cites', True),          # keep directed
+            ('code_implements_paper', 'implements', True), # code -> paper
+        ]
@@
-        for collection_name, edge_type in edge_collections:
+        for collection_name, edge_type, directed in edge_collections:
             try:
                 print(f"Loading edges from {collection_name}...")
@@
-                    if from_id in self.node_ids and to_id in self.node_ids:
+                    if from_id in self.node_ids and to_id in self.node_ids:
                         from_idx = self.node_ids[from_id]
                         to_idx = self.node_ids[to_id]
-                        
-                        self.adjacency[from_idx].add(to_idx)
-                        self.adjacency[to_idx].add(from_idx)  # Undirected
-                        self.edge_types[(from_idx, to_idx)] = edge_type
+                        # Add adjacency and type
+                        if directed:
+                            self.adjacency[from_idx].add(to_idx)
+                            key = (from_idx, to_idx)
+                        else:
+                            self.adjacency[from_idx].add(to_idx)
+                            self.adjacency[to_idx].add(from_idx)  # undirected
+                            key = (min(from_idx, to_idx), max(from_idx, to_idx))
+                        self.edge_types[key] = edge_type
                         edge_count += 1
@@
-            except Exception as e:
-                print(f"  Warning: Could not load {collection_name}: {e}")
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"  Warning: Could not load {collection_name}: {e}")

372-381: Re-wrap adjacency as defaultdict(set) on load.

Preserves write-time behavior of missing keys.

-        self.adjacency = {int(k): set(v) for k, v in data['adjacency'].items()}
+        self.adjacency = defaultdict(
+            set, {int(k): set(v) for k, v in data['adjacency'].items()}
+        )
tools/graphsage/utils/neighborhood_sampler.py (3)

72-76: Deterministic neighbor ordering for importance sampling — nice.

Sorting neighbors fixes the precompute/sampling alignment issue.


110-112: Deterministic neighbor ordering in sampling — good.

Avoids nondeterministic batches.


171-185: Top up adaptive sampling to always reach num_samples.

Current split can underfill when one bucket is small.

-        if diff_type:
-            num_diff = min(len(diff_type), num_samples * 2 // 3)
-            num_same = min(len(same_type), num_samples - num_diff)
-            
-            sampled = []
-            if num_diff > 0:
-                sampled.extend(random.sample(diff_type, num_diff))
-            if num_same > 0:
-                sampled.extend(random.sample(same_type, num_same))
-            
-            return sampled[:num_samples]
+        if diff_type:
+            target_diff = max(1, int(round(num_samples * 2 / 3)))
+            num_diff = min(len(diff_type), target_diff)
+            num_same = min(len(same_type), num_samples - num_diff)
+
+            sampled = []
+            if num_diff > 0:
+                sampled.extend(random.sample(diff_type, num_diff))
+            if num_same > 0:
+                sampled.extend(random.sample(same_type, num_same))
+
+            # Top up if underfilled
+            remaining = num_samples - len(sampled)
+            if remaining > 0:
+                pool = [n for n in (same_type + diff_type) if n not in sampled]
+                if pool:
+                    sampled.extend(random.sample(pool, min(remaining, len(pool))))
+            return sampled[:num_samples]
core/framework/graph_embedders.py (2)

173-177: Good fix: create permutation tensor on the correct device.

Prevents CUDA device mismatch.


250-261: Use iterable unpacking; validate layer count vs sampling depth.

Prevents runtime KeyError in forward when layers ≠ sampling depths.

-        dims = [config.input_dim] + config.hidden_dims + [config.output_dim]
+        dims = [config.input_dim, *config.hidden_dims, config.output_dim]
+        if len(dims) - 1 != len(config.num_neighbors):
+            raise ValueError(
+                f"Layer count mismatch: {len(dims) - 1} layers vs {len(config.num_neighbors)} sampling depths."
+            )
🧹 Nitpick comments (8)
core/framework/memory_store.py (2)

202-215: Zero-node guard should raise, not silently return.

Fail fast helps upstream callers handle an empty graph deterministically.

-        # Guard against zero-node graphs
-        if num_nodes == 0:
-            print("Warning: No nodes in graph, skipping shared memory creation")
-            return
+        # Guard against zero-node graphs
+        if num_nodes == 0:
+            raise ValueError("Cannot create shared memory: graph has zero nodes.")

248-251: Preserve traceback on shared memory failure.

Complies with Ruff B904; message can be shorter (TRY003).

-        except Exception as e:
-            self.cleanup()
-            raise RuntimeError(f"Failed to create shared memory: {e}")
+        except Exception as e:
+            self.cleanup()
+            raise RuntimeError("Failed to create shared memory") from e
tools/graphsage/utils/neighborhood_sampler.py (2)

240-252: Silence Ruff B007: rename unused loop variable.

Minor readability/lint fix.

-        for node, hop_neighbors in neighborhoods.items():
+        for _node, hop_neighbors in neighborhoods.items():

310-313: Remove unused total_edges.

Avoids Ruff F841.

-        # Sampling coverage
-        total_edges = sum(degrees)
+        # Sampling coverage
         sampled_edges_per_node = sum(self.config.num_neighbors)
core/framework/graph_embedders.py (4)

24-30: Enum exposes GCN but not implemented.

Either implement a GCN aggregator or remove GCN from the enum to avoid configuration footguns.

 class AggregatorType(Enum):
     """Types of aggregation functions."""
     MEAN = "mean"
     POOL = "pool"
     LSTM = "lstm"
-    GCN = "gcn"

295-307: Optional: guard against multi-layer usage until fixed.

Avoids silent misuse in production.

     def forward(self, node_indices: List[int], 
                 initial_features: Optional[torch.Tensor] = None) -> torch.Tensor:
@@
-        # Sample neighbors
+        # Sample neighbors
         samples = self.sample_neighbors(node_indices, self.config.num_neighbors)
+        if len(self.layers) > 1:
+            raise NotImplementedError(
+                "Multi-layer GraphSAGE requires bottom-up neighbor feature propagation; not implemented yet."
+            )

1-1: Style: shebang + non-executable file (Ruff EXE001).

Either drop the shebang or mark file executable. Low priority.


208-208: TRY003: prefer shorter messages in exceptions.

Not blocking; adjust over time to satisfy Ruff.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79c70be and 38fe25c.

📒 Files selected for processing (3)
  • core/framework/graph_embedders.py (1 hunks)
  • core/framework/memory_store.py (1 hunks)
  • tools/graphsage/utils/neighborhood_sampler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
{tools/arxiv,core/framework}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv and core/framework
Run Ruff linting for tools/arxiv and core/framework

Files:

  • core/framework/memory_store.py
  • core/framework/graph_embedders.py
core/framework/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run mypy type checking on core/framework only

Files:

  • core/framework/memory_store.py
  • core/framework/graph_embedders.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use relative imports for intra-module references (from .utils import helper_function)

Files:

  • core/framework/memory_store.py
  • core/framework/graph_embedders.py
  • tools/graphsage/utils/neighborhood_sampler.py
{core,tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All code must include docstrings tying implementation to the theoretical framework (e.g., explain CONVEYANCE, WHERE/WHAT/TIME)

Files:

  • core/framework/memory_store.py
  • core/framework/graph_embedders.py
  • tools/graphsage/utils/neighborhood_sampler.py
{core,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Infrastructure (core/, tools/) must not depend on experiments/

Files:

  • core/framework/memory_store.py
  • core/framework/graph_embedders.py
  • tools/graphsage/utils/neighborhood_sampler.py
{tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Files:

  • tools/graphsage/utils/neighborhood_sampler.py
🧬 Code graph analysis (1)
core/framework/graph_embedders.py (2)
core/framework/memory_store.py (2)
  • GraphMemoryStore (34-381)
  • get_neighbors (252-271)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • generate_embeddings (180-203)
🪛 Ruff (0.12.2)
core/framework/memory_store.py

1-1: Shebang is present but file is not executable

(EXE001)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)


91-91: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Do not catch blind exception: Exception

(BLE001)


102-102: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


102-102: Avoid specifying long messages outside the exception class

(TRY003)


147-147: Local variable collection is assigned to but never used

Remove assignment to unused variable collection

(F841)


162-162: Do not catch blind exception: Exception

(BLE001)


199-199: Do not catch blind exception: Exception

(BLE001)


226-229: Avoid specifying long messages outside the exception class

(TRY003)


248-248: Do not catch blind exception: Exception

(BLE001)


250-250: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


250-250: Avoid specifying long messages outside the exception class

(TRY003)


332-332: Do not catch blind exception: Exception

(BLE001)

core/framework/graph_embedders.py

1-1: Shebang is present but file is not executable

(EXE001)


208-208: Avoid specifying long messages outside the exception class

(TRY003)


250-250: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

tools/graphsage/utils/neighborhood_sampler.py

1-1: Shebang is present but file is not executable

(EXE001)


127-127: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Loop control variable node not used within loop body

Rename unused node to _node

(B007)


267-267: Avoid specifying long messages outside the exception class

(TRY003)


310-310: Local variable total_edges is assigned to but never used

Remove assignment to unused variable total_edges

(F841)

🔇 Additional comments (1)
core/framework/memory_store.py (1)

116-125: Edge count fix (undirected) looks correct.

Dividing summed degrees by 2 avoids double-counting. Good.

Comment on lines 36 to 51
hidden_dims: List[int] = None # Hidden layer dimensions
output_dim: int = 256 # Final embedding dimension
num_layers: int = 2 # Number of GraphSAGE layers (K)
aggregator_type: AggregatorType = AggregatorType.MEAN
dropout: float = 0.5
concat: bool = True # Concatenate self and neighbor features
batch_size: int = 512
num_neighbors: List[int] = None # Neighbors to sample per layer
device: str = "cuda" if torch.cuda.is_available() else "cpu"

def __post_init__(self):
if self.hidden_dims is None:
self.hidden_dims = [512, 256]
if self.num_neighbors is None:
self.num_neighbors = [25, 10] # Sample 25 at layer 1, 10 at layer 2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Fix mypy types and default mismatch; enforce layer/sampling invariant.

List[int] = None fails mypy; also defaults produce 3 layers but only 2 sampling depths. Make it 1-layer by default and validate.

-    hidden_dims: List[int] = None  # Hidden layer dimensions
+    hidden_dims: Optional[List[int]] = None  # Hidden layer dimensions
@@
-    num_neighbors: List[int] = None  # Neighbors to sample per layer
+    num_neighbors: Optional[List[int]] = None  # Neighbors to sample per layer
@@
-        if self.hidden_dims is None:
-            self.hidden_dims = [512, 256]
-        if self.num_neighbors is None:
-            self.num_neighbors = [25, 10]  # Sample 25 at layer 1, 10 at layer 2
+        if self.hidden_dims is None:
+            self.hidden_dims = [512]  # single hidden layer by default
+        if self.num_neighbors is None:
+            self.num_neighbors = [25]  # one-hop sampling by default
+        # Invariant: number of layers == len(num_neighbors)
+        if len(self.hidden_dims) + 1 != len(self.num_neighbors):
+            raise ValueError("len(hidden_dims)+1 must equal len(num_neighbors)")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
hidden_dims: List[int] = None # Hidden layer dimensions
output_dim: int = 256 # Final embedding dimension
num_layers: int = 2 # Number of GraphSAGE layers (K)
aggregator_type: AggregatorType = AggregatorType.MEAN
dropout: float = 0.5
concat: bool = True # Concatenate self and neighbor features
batch_size: int = 512
num_neighbors: List[int] = None # Neighbors to sample per layer
device: str = "cuda" if torch.cuda.is_available() else "cpu"
def __post_init__(self):
if self.hidden_dims is None:
self.hidden_dims = [512, 256]
if self.num_neighbors is None:
self.num_neighbors = [25, 10] # Sample 25 at layer 1, 10 at layer 2
from typing import List, Optional
import torch
from dataclasses import dataclass
@dataclass
class GraphSAGEConfig:
hidden_dims: Optional[List[int]] = None # Hidden layer dimensions
output_dim: int = 256 # Final embedding dimension
num_layers: int = 2 # Number of GraphSAGE layers (K)
aggregator_type: AggregatorType = AggregatorType.MEAN
dropout: float = 0.5
concat: bool = True # Concatenate self and neighbor features
batch_size: int = 512
num_neighbors: Optional[List[int]] = None # Neighbors to sample per layer
device: str = "cuda" if torch.cuda.is_available() else "cpu"
def __post_init__(self):
if self.hidden_dims is None:
self.hidden_dims = [512] # single hidden layer by default
if self.num_neighbors is None:
self.num_neighbors = [25] # one-hop sampling by default
# Invariant: number of GraphSAGE layers == len(num_neighbors)
if len(self.hidden_dims) + 1 != len(self.num_neighbors):
raise ValueError("len(hidden_dims)+1 must equal len(num_neighbors)")

CRITICAL BUG FIX: The forward pass was incorrectly using original node
embeddings for all layers instead of transformed features from previous layers.

Changes:
- Add proper layer-wise feature propagation logic
- For layer 0: use original embeddings from graph_store
- For layer > 0: use transformed features from previous layer
- Add temporary safety fallback: limit to single layer when using stored embeddings
- Default config to single layer (num_layers=1) for safety
- Add warning when multi-layer is attempted with current limitations

This fix ensures correct GraphSAGE behavior. Multi-layer support requires
full graph forward pass implementation which will be added in future update.

Addresses critical CodeRabbit review comment about incorrect neighbor feature
propagation across layers.
- Remove shebangs from library modules (not executable)
- Add checkpoint/resume functionality to GraphSAGE pipeline
- Support incremental processing with checkpoint persistence
- Skip completed phases on pipeline restart

These changes improve code organization and add robustness for
long-running pipeline operations with the ability to resume from
failures.
@r3d91ll
Copy link
Owner Author

r3d91ll commented Sep 8, 2025

@coderabbitai, can you review the last three commits. these should have addressed the issues you noted. please confirm

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 8, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

I'll examine the last three commits to review the changes and confirm if they address the previously noted issues.


🏁 Script executed:

#!/bin/bash
# Get the last 3 commits with detailed information
echo "=== Last 3 commits ===" 
git log -3 --oneline --decorate

echo -e "\n=== Detailed commit information ==="
git log -3 --stat --pretty=format:"%h - %an, %ar : %s"

echo -e "\n=== Changes in the last 3 commits ==="
git show --name-only HEAD~2..HEAD

Length of output: 14678


Let me check the git history more specifically to understand what commits are available:


🏁 Script executed:

#!/bin/bash
# Check git log more carefully
echo "=== Complete git log ===" 
git log --oneline --decorate --all

echo -e "\n=== Checking if there are actually 3 commits ==="
git rev-list --count HEAD

echo -e "\n=== Show the differences in the single commit ==="
git show --stat HEAD

echo -e "\n=== Check what files were changed in the last commit ==="
git diff --name-only HEAD~1 2>/dev/null || echo "No previous commit to compare with"

Length of output: 15218


Now I understand the situation better. Let me check if there are earlier commits or branches to understand what changes were made in this PR:


🏁 Script executed:

#!/bin/bash
# Check if there's a base branch to compare against
echo "=== Available branches ==="
git branch -a

echo -e "\n=== Check for origin/main as baseline ==="
git log --oneline origin/main..HEAD 2>/dev/null || echo "Cannot compare with origin/main"

echo -e "\n=== Since this is a grafted commit, let's examine the key GraphSAGE files ==="
echo "Checking GraphSAGE-related files..."

# Check the main GraphSAGE files specifically mentioned in the commit
echo -e "\n=== GraphSAGE Pipeline file ==="
head -20 tools/graphsage/pipelines/graphsage_pipeline.py

echo -e "\n=== Graph Embedders file ==="
head -20 core/framework/graph_embedders.py

echo -e "\n=== Memory Store file ==="
head -20 core/framework/memory_store.py

Length of output: 2178


Now let me examine the specific GraphSAGE files mentioned in the commit to understand the changes made:


🏁 Script executed:

#!/bin/bash
# Let's check for specific issues that were mentioned in the commit message
echo "=== Checking for shebangs in library modules ==="
grep -r "^#!" tools/graphsage/ || echo "No shebangs found in GraphSAGE tools"
grep -r "^#!" core/framework/ || echo "No shebangs found in core framework"

echo -e "\n=== Checking for checkpoint functionality in GraphSAGE pipeline ==="
grep -n -A5 -B5 "checkpoint\|resume" tools/graphsage/pipelines/graphsage_pipeline.py || echo "No checkpoint functionality found"

echo -e "\n=== Checking the GraphSAGE config for checkpoint settings ==="
grep -n -A3 -B3 "checkpoint\|resume" tools/graphsage/configs/graphsage_config.yaml || echo "No checkpoint config found"

echo -e "\n=== Let's see the structure of key GraphSAGE classes ==="
echo "Class definitions in graph_embedders.py:"
grep -n "^class\|^def " core/framework/graph_embedders.py | head -15

echo -e "\n=== Class definitions in memory_store.py:"
grep -n "^class\|^def " core/framework/memory_store.py | head -15

Length of output: 5062


Now let me check for specific issues that might have been addressed, especially around executable shebangs and code organization:


🏁 Script executed:

#!/bin/bash
# Let's look at the first few lines of the files that still have shebangs
echo "=== Files with shebangs that should be library modules ==="

echo "theory_practice_finder.py (first 10 lines):"
head -10 tools/graphsage/bridge_discovery/theory_practice_finder.py

echo -e "\nneighborhood_sampler.py (first 10 lines):"
head -10 tools/graphsage/utils/neighborhood_sampler.py

echo -e "\nembedders.py (first 10 lines):"
head -10 core/framework/embedders.py

echo -e "\n=== Check if these files have main execution blocks ==="
echo "Checking for if __name__ == '__main__' blocks:"

grep -n "if __name__" tools/graphsage/bridge_discovery/theory_practice_finder.py || echo "No main block in theory_practice_finder.py"
grep -n "if __name__" tools/graphsage/utils/neighborhood_sampler.py || echo "No main block in neighborhood_sampler.py" 
grep -n "if __name__" core/framework/embedders.py || echo "No main block in embedders.py"

echo -e "\n=== Check the pipeline which should have a shebang ==="
grep -n "if __name__" tools/graphsage/pipelines/graphsage_pipeline.py

Length of output: 1681


Based on my analysis of the commit and the current state of the code, I can now provide a comprehensive review. Let me examine one more aspect to complete the picture:


🏁 Script executed:

#!/bin/bash
# Let's check the specific improvements mentioned in the commit
echo "=== Checking incremental processing capabilities ==="
grep -n -A3 -B3 "incremental\|skip.*completed\|already.*completed" tools/graphsage/pipelines/graphsage_pipeline.py

echo -e "\n=== Checking error handling and robustness improvements ==="
grep -n -A2 -B2 "try:\|except\|Warning\|Error" tools/graphsage/pipelines/graphsage_pipeline.py | head -20

echo -e "\n=== Checking the commit's stated improvements ==="
echo "The commit mentions:"
echo "1. Remove shebangs from library modules"
echo "2. Add checkpoint/resume functionality" 
echo "3. Support incremental processing"
echo "4. Skip completed phases on restart"
echo ""
echo "Let's verify these claims..."

Length of output: 2617


@r3d91ll, I've completed a thorough review of the most recent commit. Here's my assessment of how well it addresses the previously noted issues:

Review Summary

Partially Addressed⚠️ - The commit makes good progress but there are still some outstanding issues:

Successfully Addressed:

  1. Checkpoint/Resume Functionality: Excellent implementation with:

    • JSON-based checkpoint persistence (graphsage_checkpoint.json)
    • Phase-wise completion tracking
    • Proper error handling for checkpoint I/O
    • Incremental processing that skips completed phases
  2. Robust Pipeline Design: Good error handling and recovery mechanisms

⚠️ Partially Addressed:

Shebang Issue: While the commit message claims shebangs were removed from library modules, several files still have them:

  • tools/graphsage/bridge_discovery/theory_practice_finder.py
  • tools/graphsage/utils/neighborhood_sampler.py
  • core/framework/embedders.py

These are library modules (no if __name__ == "__main__" blocks) and shouldn't have executable shebangs.

Well Implemented Features:

  1. GraphSAGE Architecture: Solid implementation with proper aggregators (Mean, Pooling, LSTM)
  2. Memory Management: RAM-first design with shared memory support
  3. Bridge Discovery: Multiple discovery methods (direct, semantic, structural, latent)
  4. Configuration Management: Comprehensive YAML-based configuration

🔧 Remaining Issue:

The shebangs should be removed from the three library files mentioned above. Only executable entry points like graphsage_pipeline.py should have shebangs.

Would you like me to create an issue to track the remaining shebang cleanup, or would you prefer to address this in a quick follow-up commit?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (13)
core/framework/memory_store.py (4)

146-161: Remove unused variable and fix per-collection count.

Drop unused collection and print only newly loaded count.

-            try:
-                collection = db.collection(collection_name)
-                print(f"Loading {collection_name}...")
+            try:
+                print(f"Loading {collection_name}...")
+                start_index = index
@@
-                print(f"  Loaded {index} nodes from {collection_name}")
+                print(f"  Loaded {index - start_index} nodes from {collection_name}")

93-101: Catch specific Arango exceptions and chain; avoid blind excepts.

Replace broad Exception with Arango exceptions and use raise ... from e.

@@
-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError
@@
-        except Exception as e:
-            raise ConnectionError(f"Failed to connect to ArangoDB: {e}")
+        except (ArangoError, ServerConnectionError) as e:
+            raise ConnectionError("Failed to connect to ArangoDB") from e
@@
-            except Exception as e:
-                print(f"  Warning: Could not load {collection_name}: {e}")
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"  Warning: Could not load {collection_name}: {e}")
@@
-            except Exception as e:
-                print(f"  Warning: Could not load {collection_name}: {e}")
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"  Warning: Could not load {collection_name}: {e}")

Also applies to: 161-163, 198-200, 19-20


166-170: Load all builder-produced edge collections; canonicalize keys.

Include collections from the builder and store edge keys as (min,max) to avoid duplicates.

-        edge_collections = [
-            ('semantic_similarity', 'similar'),
-            ('theory_practice_bridges', 'implements'),
-            ('paper_implements_theory', 'implements'),
-        ]
+        edge_collections = [
+            ("semantic_similarity", "similar"),
+            ("theory_practice_bridges", "implements"),
+            ("paper_implements_theory", "implements"),
+            # builder-created
+            ("chunk_similarity", "similar"),
+            ("paper_has_chunk", "has_chunk"),
+            ("paper_cites_paper", "cites"),
+            ("code_implements_paper", "implements"),
+        ]
@@
-                        self.adjacency[from_idx].add(to_idx)
-                        self.adjacency[to_idx].add(from_idx)  # Undirected
-                        self.edge_types[(from_idx, to_idx)] = edge_type
+                        # Treat as undirected in adjacency for now (including cites)
+                        self.adjacency[from_idx].add(to_idx)
+                        self.adjacency[to_idx].add(from_idx)
+                        key = (min(from_idx, to_idx), max(from_idx, to_idx))
+                        self.edge_types[key] = edge_type
                         edge_count += 1

Also applies to: 172-195


371-379: Preserve defaultdict semantics when loading from disk.

Re-wrap adjacency with defaultdict(set) to keep write-time behavior.

-        self.adjacency = {int(k): set(v) for k, v in data['adjacency'].items()}
+        self.adjacency = defaultdict(
+            set, {int(k): set(v) for k, v in data['adjacency'].items()}
+        )
tools/graphsage/utils/build_graph.py (6)

1-11: Expand module docstring to tie to CONVEYANCE and WHERE/WHAT/TIME.

Aligns with repo guidelines.

-"""
-Build Graph from Existing Data.
-
-Creates edges between:
+"""
+Graph construction utilities for GraphSAGE.
+
+CONVEYANCE: materialize structural relations that GraphSAGE aggregates to
+discover theory-practice bridges.
+
+WHERE: ArangoDB collections → edge collections:
+paper_has_chunk, chunk_similarity, code_implements_paper, paper_cites_paper.
+
+WHAT: deterministic, idempotent builders with batching/top‑k ANN similarity.
+
+TIME: supports incremental runs; can rebuild subsets without retraining.
+
+Creates edges between:
 1. Papers and their chunks
 2. Similar papers (based on embeddings)
 3. Papers and code (where we can infer connections)
 4. Citation networks (if available)
 
 This prepares the graph for GraphSAGE processing.
 """

51-56: Ensure unique compound index on edge collections (_from,_to,type).

Prevents duplicate edges; keeps inserts idempotent.

     def _ensure_collections(self):
         """Ensure all edge collections exist."""
         for collection_name in self.edge_collections.keys():
             if not self.db.has_collection(collection_name):
                 self.db.create_collection(collection_name, edge=True)
                 print(f"Created edge collection: {collection_name}")
             self.edge_collections[collection_name] = self.db.collection(collection_name)
+            try:
+                idx_fields = ["_from", "_to", "type"]
+                existing = self.edge_collections[collection_name].indexes()
+                has_unique = any(
+                    i.get("type") == "persistent"
+                    and i.get("unique")
+                    and i.get("fields") == idx_fields
+                for i in existing
+                )
+                if not has_unique:
+                    self.edge_collections[collection_name].add_persistent_index(
+                        idx_fields, unique=True, sparse=False
+                    )
+            except (ArangoError, ServerConnectionError) as e:
+                print(f"Warning: failed to ensure index on {collection_name}: {e}")

23-24: Remove sys.path hacks; use package/relative imports.

This violates packaging; not needed here.

-sys.path.append(str(Path(__file__).parent.parent.parent.parent))
-

21-22: Import and use Arango-specific exceptions; avoid blind excepts.

Make error handling explicit and lint-clean.

 from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError
@@
-                except Exception as e:
+                except (ArangoError, ServerConnectionError) as e:
                     print(f"Error inserting batch: {e}")
@@
-                except Exception as e:
+                except (ArangoError, ServerConnectionError) as e:
                     print(f"Error inserting batch: {e}")
@@
-            except Exception as e:
+            except (ArangoError, ServerConnectionError) as e:
                 print(f"Error creating code-paper edges: {e}")
@@
-            except Exception as e:
+            except (ArangoError, ServerConnectionError) as e:
                 print(f"Error creating citation edges: {e}")
@@
-            except Exception:
-                pass
+            except (ArangoError, ServerConnectionError):
+                pass
@@
-                except Exception:
-                    pass
+                except (ArangoError, ServerConnectionError):
+                    pass

Also applies to: 88-92, 166-170, 213-214, 256-257, 278-279, 291-292


218-260: Process all citations; add inbound ‘citations’; drop LIMIT; validate endpoints.

The LIMIT 100 drops most edges; also handle inbound citations.

-        cursor = self.db.aql.execute("""
-            FOR paper IN arxiv_papers
-            FILTER paper.references != null OR paper.citations != null
-            LIMIT 100
-            RETURN {
-                id: paper.arxiv_id,
-                refs: paper.references,
-                cites: paper.citations
-            }
-        """)
+        cursor = self.db.aql.execute("""
+            FOR paper IN arxiv_papers
+            FILTER paper.references != null OR paper.citations != null
+            RETURN { id: paper.arxiv_id, refs: paper.references, cites: paper.citations }
+        """)
@@
-        for paper in papers_with_citations:
+        for paper in papers_with_citations:
             # Process references
             if paper.get('refs'):
                 for ref_id in paper['refs']:
-                    edge = {
-                        '_from': f"arxiv_papers/{paper['id']}",
-                        '_to': f"arxiv_papers/{ref_id}",
-                        'type': 'cites'
-                    }
-                    edges.append(edge)
+                    _from = f"arxiv_papers/{paper['id']}"
+                    _to = f"arxiv_papers/{ref_id}"
+                    try:
+                        if self.db.has_document(_to):
+                            edges.append({'_from': _from, '_to': _to, 'type': 'cites'})
+                    except (ArangoError, ServerConnectionError):
+                        pass
+            # Process inbound citations
+            if paper.get('cites'):
+                for citing_id in paper['cites']:
+                    _from = f"arxiv_papers/{citing_id}"
+                    _to = f"arxiv_papers/{paper['id']}"
+                    try:
+                        if self.db.has_document(_from):
+                            edges.append({'_from': _from, '_to': _to, 'type': 'cites'})
+                    except (ArangoError, ServerConnectionError):
+                        pass

98-175: Replace O(N²) cosine with FAISS top‑k ANN; canonicalize; use chunk endpoints.

Current approach won’t scale to millions of nodes.

-    def build_embedding_similarity_edges(self, similarity_threshold: float = 0.8):
+    def build_embedding_similarity_edges(self, similarity_threshold: float = 0.8, top_k: int = 10):
         """Create edges between similar embeddings."""
         print(f"\nBuilding similarity edges (threshold={similarity_threshold})...")
@@
-        # Convert to numpy array
-        vectors = np.array([e['vector'] for e in embeddings])
+        # Prepare vectors (float32 + L2 normalize for cosine via inner product)
+        vectors = np.array([e["vector"] for e in embeddings], dtype=np.float32)
+        norms = np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
+        vectors = vectors / norms
+
+        # Prefer chunk endpoints when available, else use embedding doc id
+        ids_for_edges = [
+            (e["chunk_id"] if e.get("chunk_id") else e["id"]) for e in embeddings
+        ]
@@
-        # Compute pairwise similarities (in batches for memory efficiency)
-        batch_size = 100
-        edges = []
-
-        for i in tqdm(range(0, len(embeddings), batch_size)):
-            batch_end = min(i + batch_size, len(embeddings))
-            batch_vectors = vectors[i:batch_end]
-            
-            # Compute similarities with all other vectors
-            similarities = cosine_similarity(batch_vectors, vectors)
-            
-            # Find high similarity pairs
-            for local_idx in range(batch_end - i):
-                global_idx = i + local_idx
-                
-                # Get top similar items (excluding self)
-                sim_scores = similarities[local_idx]
-                sim_scores[global_idx] = -1  # Exclude self
-                
-                # Find indices above threshold
-                similar_indices = np.where(sim_scores >= similarity_threshold)[0]
-                
-                for sim_idx in similar_indices[:10]:  # Limit to top 10
-                    if sim_idx != global_idx:
-                        edge = {
-                            '_from': embeddings[global_idx]['id'],
-                            '_to': embeddings[sim_idx]['id'],
-                            'similarity': float(sim_scores[sim_idx]),
-                            'type': 'semantic_similarity'
-                        }
-                        edges.append(edge)
+        # ANN search with FAISS
+        try:
+            import faiss  # type: ignore
+        except ImportError:
+            raise RuntimeError("Install faiss-cpu or faiss-gpu to build similarity edges at scale.")
+
+        dim = vectors.shape[1]
+        index = faiss.IndexFlatIP(dim)
+        index.add(vectors)
+
+        edges = []
+        batch_size = 2048
+        for i in tqdm(range(0, len(vectors), batch_size)):
+            batch = vectors[i : i + batch_size]
+            D, I = index.search(batch, top_k + 1)  # includes self
+            for row, (scores, idxs) in enumerate(zip(D, I)):
+                src = ids_for_edges[i + row]
+                for score, j in zip(scores, idxs):
+                    if j < 0 or j == (i + row):
+                        continue
+                    if score < similarity_threshold:
+                        continue
+                    dst = ids_for_edges[j]
+                    _from, _to = (src, dst) if src <= dst else (dst, src)
+                    edges.append(
+                        {
+                            "_from": _from,
+                            "_to": _to,
+                            "similarity": float(score),
+                            "type": "semantic_similarity",
+                        }
+                    )
@@
-                try:
-                    self.edge_collections['chunk_similarity'].insert_many(batch, overwrite=True)
+                try:
+                    self.edge_collections["chunk_similarity"].insert_many(batch, overwrite=True)
                     total_inserted += len(batch)
-                except Exception as e:
+                except (ArangoError, ServerConnectionError) as e:
                     print(f"Error inserting batch: {e}")
core/framework/graph_embedders.py (1)

35-36: Fix mypy types and enforce layer/sampling invariants.

Use Optional types and validate counts.

-    hidden_dims: List[int] = None  # Hidden layer dimensions
+    hidden_dims: Optional[List[int]] = None  # Hidden layer dimensions
@@
-    num_neighbors: List[int] = None  # Neighbors to sample per layer
+    num_neighbors: Optional[List[int]] = None  # Neighbors to sample per layer
@@
     def __post_init__(self):
         if self.hidden_dims is None:
             # For single layer, just one hidden dim
             self.hidden_dims = [256] if self.num_layers == 1 else [512, 256]
         if self.num_neighbors is None:
             # Adjust based on number of layers
             self.num_neighbors = [25] if self.num_layers == 1 else [25, 10]
+        # Invariants
+        if len(self.hidden_dims) + 1 != len(self.num_neighbors):
+            raise ValueError("len(hidden_dims)+1 must equal len(num_neighbors)")
+        if self.num_layers != len(self.hidden_dims) + 1:
+            raise ValueError("num_layers must equal len(hidden_dims)+1")

Also applies to: 42-52

tools/graphsage/pipelines/graphsage_pipeline.py (2)

27-35: Remove sys.path hack; use relative imports within tools.graphsage.

Keeps packaging clean; runnable via python -m tools.graphsage.pipelines.graphsage_pipeline.

-# Add parent directories to path
-sys.path.append(str(Path(__file__).parent.parent.parent.parent))
-
-from core.framework.memory_store import GraphMemoryStore
-from core.framework.graph_embedders import GraphSAGEEmbedder, GraphSAGEConfig, AggregatorType
-from tools.graphsage.utils.neighborhood_sampler import NeighborhoodSampler, SamplingConfig
-from tools.graphsage.bridge_discovery.theory_practice_finder import (
+from core.framework.memory_store import GraphMemoryStore
+from core.framework.graph_embedders import GraphSAGEEmbedder, GraphSAGEConfig, AggregatorType
+from ..utils.neighborhood_sampler import NeighborhoodSampler, SamplingConfig
+from ..bridge_discovery.theory_practice_finder import (
     TheoryPracticeFinder, BridgeDiscoveryConfig
 )

221-224: Do not overwrite input feature buffer with output embeddings (shape mismatch).

This corrupts inputs and can throw at runtime (e.g., 2048 vs 256).

-        # Store in graph store
-        if self.graph_store.node_embeddings is not None:
-            self.graph_store.node_embeddings[:] = embeddings
+        # Do not overwrite input feature buffer; persist via DB/cache instead.
🧹 Nitpick comments (8)
core/framework/memory_store.py (1)

210-214: Fail fast on zero-node graphs when creating shared memory.

Return+print hides misconfigurations; raise to surface errors early.

-        if num_nodes == 0:
-            print("Warning: No nodes in graph, skipping shared memory creation")
-            return
+        if num_nodes == 0:
+            raise ValueError("Cannot create shared memory: graph has zero nodes")
core/framework/graph_embedders.py (2)

251-261: Prefer iterable unpacking and validate layer count vs sampling depths.

-        dims = [config.input_dim] + config.hidden_dims + [config.output_dim]
+        dims = [config.input_dim, *config.hidden_dims, config.output_dim]
+        if len(dims) - 1 != len(config.num_neighbors):
+            raise ValueError(
+                f"Layer count mismatch: {len(dims) - 1} vs sampling depths {len(config.num_neighbors)}"
+            )

313-318: Add stacklevel to warning to point at caller.

-            warnings.warn(
+            warnings.warn(
                 "Multi-layer GraphSAGE currently requires full graph forward pass. "
                 "Using only first layer as temporary fallback. "
                 "Full implementation requires maintaining node->feature mapping across layers.",
-                RuntimeWarning
+                RuntimeWarning,
+                stacklevel=2
             )
tools/graphsage/pipelines/graphsage_pipeline.py (5)

1-1: Remove shebang in non-executable module (or chmod +x).

-#!/usr/bin/env python3

157-161: Drop unused variable assignment.

-            stats = self.graph_store.load_from_arangodb(db_config)
+            self.graph_store.load_from_arangodb(db_config)

309-309: Narrow exception handling when saving to Arango.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError
@@
-            except Exception as e:
+            except (ArangoError, ServerConnectionError) as e:
                 print(f"Error saving batch: {e}")
@@
-            except Exception as e:
+            except (ArangoError, ServerConnectionError) as e:
                 print(f"Error saving {bridge_type} bridges: {e}")

Also applies to: 353-353, 36-36


371-375: Ensure shared memory is available when resuming from cache.

Create the input-feature shared memory after loading from disk if enabled.

         else:
             print("Skipping graph loading (already completed)")
             # Still need to initialize graph_store from cache
             self.graph_store = GraphMemoryStore(max_memory_gb=self.config.max_memory_gb)
             self.graph_store.load_from_disk(self.config.cache_path)
+            if self.config.use_shared_memory:
+                self.graph_store.create_shared_memory(
+                    embedding_dim=self.config.model_config.input_dim
+                )

185-189: Minor nits: remove redundant f-prefix; mark unused arg.

-        print(f"Model initialized:")
+        print("Model initialized:")
@@
-    def train_model(self, num_epochs: int = 10):
+    def train_model(self, _num_epochs: int = 10):

Also applies to: 190-200

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38fe25c and 4f3d8bd.

📒 Files selected for processing (4)
  • core/framework/graph_embedders.py (1 hunks)
  • core/framework/memory_store.py (1 hunks)
  • tools/graphsage/pipelines/graphsage_pipeline.py (1 hunks)
  • tools/graphsage/utils/build_graph.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
{tools/arxiv,core/framework}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv and core/framework
Run Ruff linting for tools/arxiv and core/framework

Files:

  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
core/framework/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run mypy type checking on core/framework only

Files:

  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use relative imports for intra-module references (from .utils import helper_function)

Files:

  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_graph.py
{core,tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All code must include docstrings tying implementation to the theoretical framework (e.g., explain CONVEYANCE, WHERE/WHAT/TIME)

Files:

  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_graph.py
{core,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Infrastructure (core/, tools/) must not depend on experiments/

Files:

  • core/framework/graph_embedders.py
  • core/framework/memory_store.py
  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_graph.py
{tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_graph.py
{tools/**/pipelines/**/*.py,tools/github/github_pipeline_manager.py}

📄 CodeRabbit inference engine (CLAUDE.md)

Pipelines must support resumption from checkpoints (error recovery)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
🧠 Learnings (3)
📚 Learning: 2025-09-04T03:47:49.352Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T03:47:49.352Z
Learning: Applies to {tools,experiments}/**/*.py : When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
📚 Learning: 2025-09-04T03:47:49.352Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T03:47:49.352Z
Learning: Applies to {tools/**/pipelines/**/*.py,tools/github/github_pipeline_manager.py} : Pipelines must support resumption from checkpoints (error recovery)

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
📚 Learning: 2025-09-04T03:49:10.310Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T03:49:10.310Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Implement checkpointing with atomic saves and automatic resume on failure; persist to acid_phased_checkpoint.json

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
🧬 Code graph analysis (3)
core/framework/graph_embedders.py (3)
core/framework/memory_store.py (2)
  • GraphMemoryStore (33-380)
  • get_neighbors (251-270)
tools/graphsage/utils/neighborhood_sampler.py (1)
  • sample_neighbors (99-127)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • generate_embeddings (209-232)
tools/graphsage/pipelines/graphsage_pipeline.py (4)
core/framework/memory_store.py (6)
  • GraphMemoryStore (33-380)
  • load_from_disk (361-380)
  • load_from_arangodb (69-132)
  • save_to_disk (341-359)
  • create_shared_memory (201-249)
  • _calculate_memory_usage (305-320)
core/framework/graph_embedders.py (4)
  • GraphSAGEEmbedder (234-468)
  • GraphSAGEConfig (32-51)
  • AggregatorType (23-28)
  • generate_embeddings (445-468)
tools/graphsage/utils/neighborhood_sampler.py (2)
  • NeighborhoodSampler (34-322)
  • SamplingConfig (22-31)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)
  • TheoryPracticeFinder (50-439)
  • BridgeDiscoveryConfig (36-47)
  • discover_all_bridges (320-359)
tools/graphsage/utils/build_graph.py (1)
tools/graphsage/pipelines/graphsage_pipeline.py (1)
  • main (463-490)
🪛 Ruff (0.12.2)
core/framework/graph_embedders.py

209-209: Avoid specifying long messages outside the exception class

(TRY003)


251-251: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


313-313: No explicit stacklevel keyword argument found

Set stacklevel=2

(B028)

core/framework/memory_store.py

83-83: Avoid specifying long messages outside the exception class

(TRY003)


85-85: Avoid specifying long messages outside the exception class

(TRY003)


90-90: Avoid specifying long messages outside the exception class

(TRY003)


100-100: Do not catch blind exception: Exception

(BLE001)


101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


101-101: Avoid specifying long messages outside the exception class

(TRY003)


146-146: Local variable collection is assigned to but never used

Remove assignment to unused variable collection

(F841)


161-161: Do not catch blind exception: Exception

(BLE001)


198-198: Do not catch blind exception: Exception

(BLE001)


225-228: Avoid specifying long messages outside the exception class

(TRY003)


247-247: Do not catch blind exception: Exception

(BLE001)


249-249: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


249-249: Avoid specifying long messages outside the exception class

(TRY003)


331-331: Do not catch blind exception: Exception

(BLE001)

tools/graphsage/pipelines/graphsage_pipeline.py

1-1: Shebang is present but file is not executable

(EXE001)


114-114: Consider moving this statement to an else block

(TRY300)


115-115: Do not catch blind exception: Exception

(BLE001)


129-129: Do not catch blind exception: Exception

(BLE001)


157-157: Local variable stats is assigned to but never used

Remove assignment to unused variable stats

(F841)


185-185: f-string without any placeholders

Remove extraneous f prefix

(F541)


190-190: Unused method argument: num_epochs

(ARG002)


309-309: Do not catch blind exception: Exception

(BLE001)


353-353: Do not catch blind exception: Exception

(BLE001)


437-437: Avoid specifying long messages outside the exception class

(TRY003)


439-439: Avoid specifying long messages outside the exception class

(TRY003)

tools/graphsage/utils/build_graph.py

91-91: Do not catch blind exception: Exception

(BLE001)


169-169: Do not catch blind exception: Exception

(BLE001)


213-213: Do not catch blind exception: Exception

(BLE001)


256-256: Do not catch blind exception: Exception

(BLE001)


278-279: try-except-pass detected, consider logging the exception

(S110)


278-278: Do not catch blind exception: Exception

(BLE001)


291-292: try-except-pass detected, consider logging the exception

(S110)


291-291: Do not catch blind exception: Exception

(BLE001)

- Expand edge collection loading to include all graph edges
- Fix enum conversion for aggregator type from YAML config
- Now loads 5.3M edges from coauthorship and other collections
- build_metadata_graph.py: Build metadata-based edges (coauthorship, categories, temporal)
- graph_visualizer.py: Multi-backend visualization (PyVis, Plotly, Gephi export)
- Support for large-scale graph exploration and analysis
- Ignore cache directories
- Ignore checkpoint files
- Ignore IDE settings (.zed)
- Ignore scratch and staging directories
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

♻️ Duplicate comments (3)
core/framework/memory_store.py (1)

377-386: Re-wrap adjacency as defaultdict on load.

Preserve write-time semantics.

-        self.adjacency = {int(k): set(v) for k, v in data['adjacency'].items()}
+        self.adjacency = defaultdict(
+            set, {int(k): set(v) for k, v in data['adjacency'].items()}
+        )
tools/graphsage/pipelines/graphsage_pipeline.py (2)

27-35: Remove sys.path hack; use relative imports within tools.graphsage.

Matches repo guideline and prior review.

-# Add parent directories to path
-sys.path.append(str(Path(__file__).parent.parent.parent.parent))
-
-from core.framework.memory_store import GraphMemoryStore
-from core.framework.graph_embedders import GraphSAGEEmbedder, GraphSAGEConfig, AggregatorType
-from tools.graphsage.utils.neighborhood_sampler import NeighborhoodSampler, SamplingConfig
-from tools.graphsage.bridge_discovery.theory_practice_finder import (
+from core.framework.memory_store import GraphMemoryStore
+from core.framework.graph_embedders import GraphSAGEEmbedder, GraphSAGEConfig, AggregatorType
+from ..utils.neighborhood_sampler import NeighborhoodSampler, SamplingConfig
+from ..bridge_discovery.theory_practice_finder import (
     TheoryPracticeFinder, BridgeDiscoveryConfig
 )

221-224: Do not overwrite input feature buffer with output embeddings.

Shape mismatch risk (input_dim vs output_dim) and data corruption.

-        # Store in graph store
-        if self.graph_store.node_embeddings is not None:
-            self.graph_store.node_embeddings[:] = embeddings
+        # Do not overwrite input features buffer; persist outputs separately (DB/cache).

Follow-up: if zero-copy outputs are needed, add a dedicated output buffer in GraphMemoryStore sized to output_dim.

🧹 Nitpick comments (22)
tools/graphsage/utils/build_metadata_graph.py (5)

23-23: Drop sys.path hack.

Not needed here; avoid modifying import paths in tools/.

- sys.path.append(str(Path(__file__).parent.parent.parent.parent))

57-57: Annotate Optional for sample_size.

PEP 484: use Optional[int].

-def build_coauthorship_edges(self, sample_size: int = None):
+def build_coauthorship_edges(self, sample_size: Optional[int] = None):
@@
-def build_category_edges(self, sample_size: int = None):
+def build_category_edges(self, sample_size: Optional[int] = None):
@@
-def build_temporal_edges(self, sample_size: int = None):
+def build_temporal_edges(self, sample_size: Optional[int] = None):

Also add: from typing import Optional (Line 17).

Also applies to: 135-135, 224-224


92-96: Silence Ruff B007: unused loop var.

Variable author is unused.

-for author, papers in tqdm(author_papers.items(), desc="Creating co-authorship edges"):
+for _, papers in tqdm(author_papers.items(), desc="Creating co-authorship edges"):

327-333: Use bind vars in AQL to avoid f-string in query.

Safer and cleaner.

-                cursor = self.db.aql.execute(f"""
-                    FOR e IN {edge_type}
-                    FILTER e._from == 'arxiv_papers/{paper_id}' OR e._to == 'arxiv_papers/{paper_id}'
-                    RETURN e
-                """, bind_vars={'paper_id': paper_id})
+                cursor = self.db.aql.execute(
+                    f"""
+                    FOR e IN {edge_type}
+                    FILTER e._from == CONCAT('arxiv_papers/', @pid) OR e._to == CONCAT('arxiv_papers/', @pid)
+                    RETURN e
+                    """,
+                    bind_vars={'pid': paper_id},
+                )

1-12: Tie docstring to the framework.

Add one-liner linking to CONVEYANCE/WHERE/WHAT/TIME as per guidelines.

-"""
+"""
 Build Metadata-Based Graph for All ArXiv Papers.
@@
-This provides uniform graph structure across all 2.8M papers.
+This provides uniform graph structure across all 2.8M papers.
+
+Framework note: Encodes WHERE (topology via metadata), WHAT (domains/categories), and TIME (temporal proximity) to support CONVEYANCE discovery.
"""
core/framework/memory_store.py (3)

143-161: Per-collection node count; drop unused var.

Print delta per collection; remove unused collection.

-            try:
-                collection = db.collection(collection_name)
+            try:
+                db.collection(collection_name)
                 print(f"Loading {collection_name}...")
-                
+                start_index = index
                 # Get all documents
                 cursor = db.aql.execute(f"FOR doc IN {collection_name} RETURN doc")
@@
-                print(f"  Loaded {index} nodes from {collection_name}")
+                print(f"  Loaded {index - start_index} nodes from {collection_name}")

253-255: Chain RuntimeError on shm failure.

Preserve original exception.

-        except Exception as e:
+        except Exception as e:
             self.cleanup()
-            raise RuntimeError(f"Failed to create shared memory: {e}")
+            raise RuntimeError("Failed to create shared memory") from e

204-206: Narrow exception handling in loaders/cleanup.

Replace bare Exception with ArangoError where applicable; print warnings with context.

-            except Exception as e:
+            except ArangoError as e:
                 print(f"  Warning: Could not load {collection_name}: {e}")
@@
-            except Exception as e:
+            except ArangoError as e:
                 print(f"  Warning: Could not load {collection_name}: {e}")
@@
-            except Exception as e:
+            except Exception as e:
                 print(f"Warning: Error cleaning up shared memory: {e}")

Also applies to: 161-163, 337-339

tools/graphsage/pipelines/graphsage_pipeline.py (6)

157-157: Drop unused variable.

stats is unused.

-            stats = self.graph_store.load_from_arangodb(db_config)
+            self.graph_store.load_from_arangodb(db_config)

185-189: Remove redundant f-string marker.

Minor lint fix.

-        print(f"Model initialized:")
+        print("Model initialized:")

103-106: Make checkpoint path configurable; use it.

Avoid hard-coding in repo root.

 class PipelineConfig:
@@
     save_bridges: bool = True
+    # Checkpointing
+    checkpoint_path: str = "./cache/graphsage_checkpoint.json"
@@
-        self.checkpoint_file = Path("graphsage_checkpoint.json")
+        self.checkpoint_file = Path(self.config.checkpoint_path)

Also applies to: 119-131, 56-65


306-311: Narrow Arango exceptions on batch saves.

Replace bare Exception; keep traceback.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError
@@
-            except Exception as e:
+            except ArangoError as e:
                 print(f"Error saving batch: {e}")
@@
-            except Exception as e:
+            except ArangoError as e:
                 print(f"Error saving {bridge_type} bridges: {e}")

Also applies to: 350-355, 36-37


471-475: Robust default config path.

Resolve relative to file for python -m execution.

-    parser.add_argument('--config', type=str, default='../configs/graphsage_config.yaml',
+    default_config = (Path(__file__).parent.parent / 'configs' / 'graphsage_config.yaml').resolve()
+    parser.add_argument('--config', type=str, default=str(default_config),

197-206: Use num_epochs to avoid unused-arg lint.

Minimal fix.

-        print("Using random initialization (training not implemented yet)")
+        print(f"Using random initialization (training not implemented yet); num_epochs={num_epochs}")
tools/graphsage/visualization/graph_visualizer.py (8)

1-9: Docstring claims Datashader but no implementation; add theory tie-in per guidelines.

Datashader is listed but not implemented/available via CLI. Also, tools/*.py must include docstrings tying to the theoretical framework (CONVEYANCE, WHERE/WHAT/TIME). Update the module docstring accordingly.

Apply:

-"""
-Advanced Graph Visualization for ArangoDB.
-
-Provides multiple visualization backends for exploring the graph structure:
-1. PyVis - Interactive HTML networks
-2. Plotly - 3D interactive graphs
-3. Datashader - Large-scale visualization
-4. Export to Gephi format
-"""
+"""
+Advanced Graph Visualization for ArangoDB.
+
+Provides multiple visualization backends for exploring the graph structure:
+1. PyVis - Interactive HTML networks
+2. Plotly - 3D interactive graphs
+3. Datashader - Large-scale visualization (TODO: not yet implemented)
+4. Export to Gephi format
+
+CONVEYANCE: Visual channels map topology to interpretable layouts (PyVis/Plotly/Gephi).
+WHERE/WHAT/TIME: WHERE=ArangoDB academy_store/academy_graph; WHAT=heterogeneous nodes/edges (paper/chunk/author/category/code); TIME=snapshot at query time; interactive exploration only.
+"""

11-24: Remove unused imports; add logging for diagnostics.

json, pandas, tqdm, List, Tuple are unused. After removing the sys.path hack (next comment), sys and Path become unused too. Add logging for error reporting.

 import os
-import sys
-import json
+import logging
 import numpy as np
-from pathlib import Path
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, Optional
 from dataclasses import dataclass
 import networkx as nx
 from arango import ArangoClient
 import plotly.graph_objects as go
 from pyvis.network import Network
-import pandas as pd
-from tqdm import tqdm

45-53: Validate DB password early for clearer failures.

Raise a descriptive error if ARANGO_PASSWORD is absent.

-        self.db = self.client.db(
-            db_config['database'],
-            username=db_config['username'],
-            password=db_config.get('password', os.environ.get('ARANGO_PASSWORD'))
-        )
+        pwd = db_config.get('password', os.environ.get('ARANGO_PASSWORD'))
+        if not pwd:
+            raise ValueError("ARANGO_PASSWORD not provided in db_config or environment")
+        self.db = self.client.db(
+            db_config['database'],
+            username=db_config['username'],
+            password=pwd
+        )

218-313: Expose color/size choices in 3D as well.

Optional: mirror node_color_attr/node_size_attr in Plotly for consistency (currently hard-coded to degree).

If desired, compute a node_color array based on config (e.g., community or degree) and use it in marker.color; similarly adjust marker.size.


314-334: Gephi export: OK, but consider logging and filename parameterization.

Looks good. Optionally log node/edge counts and allow output path to be passed through CLI for reproducibility.


377-406: Metrics: sample-based approach is reasonable; guard large-graph costs.

LGTM. Optional: compute metrics on the largest connected component for more representative diameter/paths.


412-418: CLI options omit Datashader.

Docstring lists Datashader, but CLI does not expose it. Either add a mode or remove the claim (see earlier docstring diff).


449-449: Remove f-prefix from literal string.

Fixes F541.

-    print(f"\n✅ Visualization complete! Open the generated HTML file to explore.")
+    print("\n✅ Visualization complete! Open the generated HTML file to explore.")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f3d8bd and e374542.

📒 Files selected for processing (5)
  • .gitignore (1 hunks)
  • core/framework/memory_store.py (1 hunks)
  • tools/graphsage/pipelines/graphsage_pipeline.py (1 hunks)
  • tools/graphsage/utils/build_metadata_graph.py (1 hunks)
  • tools/graphsage/visualization/graph_visualizer.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (7)
{tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_metadata_graph.py
  • tools/graphsage/visualization/graph_visualizer.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use relative imports for intra-module references (from .utils import helper_function)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_metadata_graph.py
  • tools/graphsage/visualization/graph_visualizer.py
  • core/framework/memory_store.py
{core,tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

All code must include docstrings tying implementation to the theoretical framework (e.g., explain CONVEYANCE, WHERE/WHAT/TIME)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_metadata_graph.py
  • tools/graphsage/visualization/graph_visualizer.py
  • core/framework/memory_store.py
{core,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Infrastructure (core/, tools/) must not depend on experiments/

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
  • tools/graphsage/utils/build_metadata_graph.py
  • tools/graphsage/visualization/graph_visualizer.py
  • core/framework/memory_store.py
{tools/**/pipelines/**/*.py,tools/github/github_pipeline_manager.py}

📄 CodeRabbit inference engine (CLAUDE.md)

Pipelines must support resumption from checkpoints (error recovery)

Files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
{tools/arxiv,core/framework}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv and core/framework
Run Ruff linting for tools/arxiv and core/framework

Files:

  • core/framework/memory_store.py
core/framework/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Run mypy type checking on core/framework only

Files:

  • core/framework/memory_store.py
🧠 Learnings (3)
📚 Learning: 2025-09-04T03:47:49.352Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T03:47:49.352Z
Learning: Applies to {tools,experiments}/**/*.py : When in tools/ or experiments/, import from core.framework.* for shared infrastructure (e.g., core.framework.embedders)

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
📚 Learning: 2025-09-04T03:47:49.352Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T03:47:49.352Z
Learning: Applies to {tools/**/pipelines/**/*.py,tools/github/github_pipeline_manager.py} : Pipelines must support resumption from checkpoints (error recovery)

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
📚 Learning: 2025-09-04T03:49:10.310Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T03:49:10.310Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Implement checkpointing with atomic saves and automatic resume on failure; persist to acid_phased_checkpoint.json

Applied to files:

  • tools/graphsage/pipelines/graphsage_pipeline.py
🧬 Code graph analysis (3)
tools/graphsage/pipelines/graphsage_pipeline.py (4)
core/framework/memory_store.py (6)
  • GraphMemoryStore (33-386)
  • load_from_disk (367-386)
  • load_from_arangodb (69-132)
  • save_to_disk (347-365)
  • create_shared_memory (207-255)
  • _calculate_memory_usage (311-326)
core/framework/graph_embedders.py (4)
  • GraphSAGEEmbedder (234-468)
  • GraphSAGEConfig (32-51)
  • AggregatorType (23-28)
  • generate_embeddings (445-468)
tools/graphsage/utils/neighborhood_sampler.py (2)
  • NeighborhoodSampler (34-322)
  • SamplingConfig (22-31)
tools/graphsage/bridge_discovery/theory_practice_finder.py (3)
  • TheoryPracticeFinder (50-439)
  • BridgeDiscoveryConfig (36-47)
  • discover_all_bridges (320-359)
tools/graphsage/utils/build_metadata_graph.py (1)
tools/graphsage/visualization/graph_visualizer.py (1)
  • main (408-449)
tools/graphsage/visualization/graph_visualizer.py (1)
tools/graphsage/utils/build_metadata_graph.py (1)
  • main (360-383)
🪛 Ruff (0.12.2)
tools/graphsage/pipelines/graphsage_pipeline.py

1-1: Shebang is present but file is not executable

(EXE001)


114-114: Consider moving this statement to an else block

(TRY300)


115-115: Do not catch blind exception: Exception

(BLE001)


129-129: Do not catch blind exception: Exception

(BLE001)


157-157: Local variable stats is assigned to but never used

Remove assignment to unused variable stats

(F841)


185-185: f-string without any placeholders

Remove extraneous f prefix

(F541)


190-190: Unused method argument: num_epochs

(ARG002)


309-309: Do not catch blind exception: Exception

(BLE001)


353-353: Do not catch blind exception: Exception

(BLE001)


437-437: Avoid specifying long messages outside the exception class

(TRY003)


439-439: Avoid specifying long messages outside the exception class

(TRY003)

tools/graphsage/utils/build_metadata_graph.py

57-57: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


92-92: Loop control variable author not used within loop body

(B007)


128-128: Do not catch blind exception: Exception

(BLE001)


135-135: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


217-217: Do not catch blind exception: Exception

(BLE001)


224-224: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


248-248: Local variable window_days is assigned to but never used

Remove assignment to unused variable window_days

(F841)


283-283: Do not catch blind exception: Exception

(BLE001)


307-308: try-except-pass detected, consider logging the exception

(S110)


307-307: Do not catch blind exception: Exception

(BLE001)

tools/graphsage/visualization/graph_visualizer.py

55-55: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


111-111: Do not use bare except

(E722)


449-449: f-string without any placeholders

Remove extraneous f prefix

(F541)

core/framework/memory_store.py

83-83: Avoid specifying long messages outside the exception class

(TRY003)


85-85: Avoid specifying long messages outside the exception class

(TRY003)


90-90: Avoid specifying long messages outside the exception class

(TRY003)


100-100: Do not catch blind exception: Exception

(BLE001)


101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


101-101: Avoid specifying long messages outside the exception class

(TRY003)


146-146: Local variable collection is assigned to but never used

Remove assignment to unused variable collection

(F841)


161-161: Do not catch blind exception: Exception

(BLE001)


204-204: Do not catch blind exception: Exception

(BLE001)


231-234: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Do not catch blind exception: Exception

(BLE001)


255-255: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


255-255: Avoid specifying long messages outside the exception class

(TRY003)


337-337: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
tools/graphsage/pipelines/graphsage_pipeline.py (1)

1-1: Shebang OK if you plan to run as script; else remove.

If using python -m tools.graphsage.pipelines.graphsage_pipeline, consider dropping the shebang.

Would you like me to remove it and add an entry point in setup.cfg/pyproject?

tools/graphsage/visualization/graph_visualizer.py (1)

335-364: Community detection flow is fine; relies on PyVis fix above.

Logic is sound with Louvain→components fallback. Once PyVis honors node_color_attr, communities mode will render correctly.

After applying the PyVis fix, please run mode=communities and confirm distinct colors per community.

Comment on lines +93 to +101
try:
client = ArangoClient(hosts=db_config.get('host', 'http://localhost:8529'))
db = client.db(
db_config['database'],
username=db_config['username'],
password=password
)
except Exception as e:
raise ConnectionError(f"Failed to connect to ArangoDB: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Catch specific Arango exceptions and chain.

Avoid bare Exception; preserve traceback.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError, ServerConnectionError
@@
-        except Exception as e:
-            raise ConnectionError(f"Failed to connect to ArangoDB: {e}")
+        except (ArangoError, ServerConnectionError) as e:
+            raise ConnectionError("Failed to connect to ArangoDB") from e

Also applies to: 19-20

🧰 Tools
🪛 Ruff (0.12.2)

100-100: Do not catch blind exception: Exception

(BLE001)


101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


101-101: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +164 to +201
def _load_edges(self, db):
"""Load edges from edge collections."""
edge_collections = [
('semantic_similarity', 'similar'),
('theory_practice_bridges', 'implements'),
('paper_implements_theory', 'implements'),
('paper_has_chunk', 'has_chunk'),
('chunk_similarity', 'similar'),
('coauthorship', 'coauthor'),
('shared_category', 'same_field'),
('temporal_proximity', 'temporal'),
('paper_cites_paper', 'cites'),
]

for collection_name, edge_type in edge_collections:
try:
print(f"Loading edges from {collection_name}...")

cursor = db.aql.execute(f"""
FOR edge IN {collection_name}
RETURN {{from: edge._from, to: edge._to}}
""")

edge_count = 0
for edge in cursor:
# Extract node IDs from _from and _to
from_id = edge['from'].split('/')[-1] if '/' in edge['from'] else edge['from']
to_id = edge['to'].split('/')[-1] if '/' in edge['to'] else edge['to']

if from_id in self.node_ids and to_id in self.node_ids:
from_idx = self.node_ids[from_id]
to_idx = self.node_ids[to_id]

self.adjacency[from_idx].add(to_idx)
self.adjacency[to_idx].add(from_idx) # Undirected
self.edge_types[(from_idx, to_idx)] = edge_type
edge_count += 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Preserve directed 'cites' edges; canonicalize undirected keys; fix edge count.

Do not mirror adjacency for cites; store undirected keys as (min,max); count edges via edge_types length.

-        # For undirected graphs, edges are counted twice in adjacency lists
-        # Divide by 2 to get the actual edge count
-        edge_count = sum(len(neighbors) for neighbors in self.adjacency.values()) // 2
+        # Edge count = unique entries in edge_types (undirected canonicalized)
+        edge_count = len(self.edge_types)
@@
-        edge_collections = [
+        edge_collections = [
             ('semantic_similarity', 'similar'),
@@
             ('paper_cites_paper', 'cites'),
         ]
@@
-                for edge in cursor:
+                directed_types = {"cites"}
+                for edge in cursor:
@@
-                    if from_id in self.node_ids and to_id in self.node_ids:
+                    if from_id in self.node_ids and to_id in self.node_ids:
                         from_idx = self.node_ids[from_id]
                         to_idx = self.node_ids[to_id]
-                        
-                        self.adjacency[from_idx].add(to_idx)
-                        self.adjacency[to_idx].add(from_idx)  # Undirected
-                        self.edge_types[(from_idx, to_idx)] = edge_type
+                        # Directed vs undirected handling
+                        if edge_type in directed_types:
+                            self.adjacency[from_idx].add(to_idx)
+                            self.edge_types[(from_idx, to_idx)] = edge_type
+                        else:
+                            self.adjacency[from_idx].add(to_idx)
+                            self.adjacency[to_idx].add(from_idx)
+                            key = (min(from_idx, to_idx), max(from_idx, to_idx))
+                            self.edge_types[key] = edge_type
                         edge_count += 1

Also applies to: 115-125

🤖 Prompt for AI Agents
In core/framework/memory_store.py around lines 115-125 and 164-201, the loader
currently mirrors all edges (adds reverse adjacency) and uses asymmetric keys
and an incremental edge_count; update the logic so directed "cites" edges are
NOT mirrored (only add adjacency/to_idx for from_idx->to_idx), store undirected
edge_types keys canonically as (min_idx, max_idx) when the relation is
undirected (e.g., similar, has_chunk, coauthor, same_field, temporal,
implements, etc.), and stop using a manual counter—compute edge_count as
len(self.edge_types) after loading; ensure you still populate adjacency
appropriately for undirected relations by adding both directions, and only set
edge_types[(min_idx,max_idx)] once per edge to avoid duplicates.

Comment on lines +366 to +375
# Step 1: Load graph (skip if already done)
if not self.checkpoint.get('graph_loaded', {}).get('completed'):
self.load_graph()
self._save_checkpoint('graph_loaded')
else:
print("Skipping graph loading (already completed)")
# Still need to initialize graph_store from cache
self.graph_store = GraphMemoryStore(max_memory_gb=self.config.max_memory_gb)
self.graph_store.load_from_disk(self.config.cache_path)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Recreate shared memory on resume-from-cache.

Needed for downstream components expecting shm-backed inputs.

         else:
             print("Skipping graph loading (already completed)")
             # Still need to initialize graph_store from cache
             self.graph_store = GraphMemoryStore(max_memory_gb=self.config.max_memory_gb)
             self.graph_store.load_from_disk(self.config.cache_path)
+            if self.config.use_shared_memory:
+                self.graph_store.create_shared_memory(embedding_dim=self.config.model_config.input_dim)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Step 1: Load graph (skip if already done)
if not self.checkpoint.get('graph_loaded', {}).get('completed'):
self.load_graph()
self._save_checkpoint('graph_loaded')
else:
print("Skipping graph loading (already completed)")
# Still need to initialize graph_store from cache
self.graph_store = GraphMemoryStore(max_memory_gb=self.config.max_memory_gb)
self.graph_store.load_from_disk(self.config.cache_path)
else:
print("Skipping graph loading (already completed)")
# Still need to initialize graph_store from cache
self.graph_store = GraphMemoryStore(max_memory_gb=self.config.max_memory_gb)
self.graph_store.load_from_disk(self.config.cache_path)
if self.config.use_shared_memory:
self.graph_store.create_shared_memory(
embedding_dim=self.config.model_config.input_dim
)
🤖 Prompt for AI Agents
In tools/graphsage/pipelines/graphsage_pipeline.py around lines 366 to 375, when
skipping graph loading because graph_loaded checkpoint is completed, the code
restores graph_store from disk but does not recreate the shared-memory-backed
structures required by downstream components; after
self.graph_store.load_from_disk(self.config.cache_path) call, invoke the
graph_store routine that recreates or reinitializes its shared memory buffers
(e.g., create_shared_memory()/init_shm()/rebuild_shm_views) so the in-memory
cache is backed by shm again, and add error handling/logging if shm recreation
fails.

Comment on lines 118 to 133
# Insert edges in batches
if edges:
batch_size = 1000
total_inserted = 0

for i in range(0, len(edges), batch_size):
batch = edges[i:i + batch_size]
try:
self.edge_collections['coauthorship'].insert_many(batch, overwrite=True)
total_inserted += len(batch)
except Exception as e:
print(f"Error inserting batch: {e}")

print(f"Created {total_inserted} co-authorship edges")
else:
print("No co-authorship edges found")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Stream inserts to cap RAM.

Avoid materializing all edges; flush by batch.

-# Insert edges in batches
-if edges:
-    batch_size = 1000
-    total_inserted = 0
-    for i in range(0, len(edges), batch_size):
-        batch = edges[i:i + batch_size]
-        try:
-            self.edge_collections['coauthorship'].insert_many(batch, overwrite=True)
-            total_inserted += len(batch)
-        except Exception as e:
-            print(f"Error inserting batch: {e}")
-    print(f"Created {total_inserted} co-authorship edges")
+batch_size = 1000
+total_inserted = 0
+buffer = []
+for edge in edges:
+    buffer.append(edge)
+    if len(buffer) >= batch_size:
+        try:
+            self.edge_collections['coauthorship'].insert_many(buffer, overwrite=True)
+            total_inserted += len(buffer)
+        finally:
+            buffer = []
+# flush tail
+if buffer:
+    self.edge_collections['coauthorship'].insert_many(buffer, overwrite=True)
+    total_inserted += len(buffer)
+print(f"Created {total_inserted} co-authorship edges")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Insert edges in batches
if edges:
batch_size = 1000
total_inserted = 0
for i in range(0, len(edges), batch_size):
batch = edges[i:i + batch_size]
try:
self.edge_collections['coauthorship'].insert_many(batch, overwrite=True)
total_inserted += len(batch)
except Exception as e:
print(f"Error inserting batch: {e}")
print(f"Created {total_inserted} co-authorship edges")
else:
print("No co-authorship edges found")
batch_size = 1000
total_inserted = 0
buffer = []
for edge in edges:
buffer.append(edge)
if len(buffer) >= batch_size:
try:
self.edge_collections['coauthorship'].insert_many(buffer, overwrite=True)
total_inserted += len(buffer)
finally:
buffer = []
# flush any remaining edges
if buffer:
self.edge_collections['coauthorship'].insert_many(buffer, overwrite=True)
total_inserted += len(buffer)
print(f"Created {total_inserted} co-authorship edges")
🧰 Tools
🪛 Ruff (0.12.2)

128-128: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In tools/graphsage/utils/build_metadata_graph.py around lines 118 to 133, the
current code materializes the entire edges list then inserts in batches, which
can spike RAM; change the loop that collects edges to instead stream and flush
by batch: accumulate edges into a small in-memory batch list while iterating the
source (yield/generator or row iterator), call insert_many(batch,
overwrite=True) whenever batch_size is reached, clear the batch and continue,
and after the loop insert any remaining items and update total_inserted
accordingly; keep the same error handling for each insert and print the final
total_inserted or "No co-authorship edges found" when appropriate.

Comment on lines 126 to 131
self.edge_collections['coauthorship'].insert_many(batch, overwrite=True)
total_inserted += len(batch)
except Exception as e:
print(f"Error inserting batch: {e}")

print(f"Created {total_inserted} co-authorship edges")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Narrow exceptions to Arango errors; keep traceback.

Replace bare except Exception with except ArangoError as e and re-raise when needed.

-from arango import ArangoClient
+from arango import ArangoClient
+from arango.exceptions import ArangoError
@@
-                except Exception as e:
-                    print(f"Error inserting batch: {e}")
+                except ArangoError as e:
+                    print(f"Error inserting batch: {e}")
@@
-                except Exception as e:
-                    print(f"Error inserting batch: {e}")
+                except ArangoError as e:
+                    print(f"Error inserting batch: {e}")
@@
-                except Exception:
-                    pass
+                except ArangoError as e:
+                    print(f"  {coll_name} count failed: {e}")

Also applies to: 215-219, 281-285, 303-308, 20-21

🧰 Tools
🪛 Ruff (0.12.2)

128-128: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
tools/graphsage/utils/build_metadata_graph.py lines 126-131 (and likewise apply
to 20-21, 215-219, 281-285, 303-308): the code currently uses a bare "except
Exception as e" that swallows unrelated exceptions and prints a simple message;
replace these with "except ArangoError as e" (import ArangoError from the arango
driver) so only Arango-specific errors are caught, log or print the full
traceback (use traceback.print_exc() or processLogger.exception()) to preserve
debugging info, and re-raise the exception if it is not an ArangoError (or after
logging when appropriate) so other failures propagate.

import pandas as pd
from tqdm import tqdm

sys.path.append(str(Path(__file__).parent.parent.parent.parent))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Drop sys.path manipulation; use package/relative imports.

This violates the repo’s import guidelines and risks import-order bugs.

-sys.path.append(str(Path(__file__).parent.parent.parent.parent))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sys.path.append(str(Path(__file__).parent.parent.parent.parent))
🤖 Prompt for AI Agents
In tools/graphsage/visualization/graph_visualizer.py at line 25, remove the
sys.path.append(Path(...)) manipulation and replace it with proper
package/relative imports: identify the module(s) you need from the repo root and
import them using either absolute package imports (e.g., myrepo.subpkg.module)
or relative imports (e.g., from ..subpkg import module) depending on this file's
package location; ensure the package has __init__.py files and update test/run
configs to run the code as a package (or install the package in editable mode)
so imports resolve without modifying sys.path.

Comment on lines 28 to 40
@dataclass
class VisualizationConfig:
"""Configuration for graph visualization."""
max_nodes: int = 1000
max_edges: int = 5000
layout: str = 'spring' # spring, kamada_kawai, circular, random
node_size_attr: Optional[str] = 'degree' # degree, pagerank, betweenness
node_color_attr: Optional[str] = 'type' # type, community, category
edge_weight_attr: Optional[str] = 'weight'
physics: bool = True
show_labels: bool = True
output_format: str = 'html' # html, json, gexf, graphml

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fix Optional typing mismatches in VisualizationConfig.

Fields typed Optional[str] have non-None defaults. Use plain str (RUF013).

     layout: str = 'spring'  # spring, kamada_kawai, circular, random
-    node_size_attr: Optional[str] = 'degree'  # degree, pagerank, betweenness
-    node_color_attr: Optional[str] = 'type'  # type, community, category
-    edge_weight_attr: Optional[str] = 'weight'
+    node_size_attr: str = 'degree'  # degree, pagerank, betweenness
+    node_color_attr: str = 'type'  # type, community, category
+    edge_weight_attr: str = 'weight'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@dataclass
class VisualizationConfig:
"""Configuration for graph visualization."""
max_nodes: int = 1000
max_edges: int = 5000
layout: str = 'spring' # spring, kamada_kawai, circular, random
node_size_attr: Optional[str] = 'degree' # degree, pagerank, betweenness
node_color_attr: Optional[str] = 'type' # type, community, category
edge_weight_attr: Optional[str] = 'weight'
physics: bool = True
show_labels: bool = True
output_format: str = 'html' # html, json, gexf, graphml
@dataclass
class VisualizationConfig:
"""Configuration for graph visualization."""
max_nodes: int = 1000
max_edges: int = 5000
layout: str = 'spring' # spring, kamada_kawai, circular, random
node_size_attr: str = 'degree' # degree, pagerank, betweenness
node_color_attr: str = 'type' # type, community, category
edge_weight_attr: str = 'weight'
physics: bool = True
show_labels: bool = True
output_format: str = 'html' # html, json, gexf, graphml
🤖 Prompt for AI Agents
In tools/graphsage/visualization/graph_visualizer.py around lines 28 to 40, the
dataclass fields node_size_attr, node_color_attr, and edge_weight_attr are
annotated as Optional[str] but have non-None default values; change their type
annotations from Optional[str] to plain str to match the defaults, update any
type hints/comments accordingly, and remove the Optional import if it becomes
unused; ensure any downstream code that relied on None values is reviewed and
adjusted to assume a string value.

Comment on lines 55 to 67
def query_subgraph(self, start_node: str = None, depth: int = 2,
limit: int = 1000) -> nx.Graph:
"""
Query a subgraph from ArangoDB.
Args:
start_node: Starting node ID (if None, random sample)
depth: Depth of traversal
limit: Maximum nodes to retrieve
Returns:
NetworkX graph
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Type fix and clarify required identifier format.

Use Optional[str] (RUF013) and document that Arango expects a full _id (collection/key).

-    def query_subgraph(self, start_node: str = None, depth: int = 2, 
+    def query_subgraph(self, start_node: Optional[str] = None, depth: int = 2, 
                       limit: int = 1000) -> nx.Graph:
@@
-            start_node: Starting node ID (if None, random sample)
+            start_node: Full Arango _id (e.g., 'arxiv_papers/12345'); if None, sample papers
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def query_subgraph(self, start_node: str = None, depth: int = 2,
limit: int = 1000) -> nx.Graph:
"""
Query a subgraph from ArangoDB.
Args:
start_node: Starting node ID (if None, random sample)
depth: Depth of traversal
limit: Maximum nodes to retrieve
Returns:
NetworkX graph
"""
def query_subgraph(self, start_node: Optional[str] = None, depth: int = 2,
limit: int = 1000) -> nx.Graph:
"""
Query a subgraph from ArangoDB.
Args:
start_node: Full Arango _id (e.g., 'arxiv_papers/12345'); if None, sample papers
depth: Depth of traversal
limit: Maximum nodes to retrieve
Returns:
NetworkX graph
"""
🧰 Tools
🪛 Ruff (0.12.2)

55-55: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)

🤖 Prompt for AI Agents
In tools/graphsage/visualization/graph_visualizer.py around lines 55 to 67, the
signature and docstring for query_subgraph should use Optional[str] for
start_node (fixing RUF013) and clarify that Arango expects a full _id in the
format "collection/key"; update the function annotation to import and use
typing.Optional, change start_node: str = None to start_node: Optional[str] =
None, and expand the docstring to state that when provided the start_node must
be the Arango full _id (collection/key); ensure imports are added/adjusted and
update any related type checks or comments accordingly.

Comment on lines 108 to 120
try:
cursor = self.db.aql.execute(query, bind_vars=bind_vars)
results = list(cursor)
except:
# Fallback to simpler query
query = """
FOR edge IN coauthorship
LIMIT @limit
RETURN edge
"""
cursor = self.db.aql.execute(query, bind_vars={'limit': limit})
results = list(cursor)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Replace bare except with explicit Exception and log the cause.

Bare except hides real failures and trips linters (E722). Keep the fallback but log the error.

-        except:
+        except Exception as exc:
+            logging.warning("Primary AQL failed (%s). Falling back to coauthorship edges.", exc)
             # Fallback to simpler query
             query = """
                 FOR edge IN coauthorship
                     LIMIT @limit
                     RETURN edge
             """
             cursor = self.db.aql.execute(query, bind_vars={'limit': limit})
             results = list(cursor)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
cursor = self.db.aql.execute(query, bind_vars=bind_vars)
results = list(cursor)
except:
# Fallback to simpler query
query = """
FOR edge IN coauthorship
LIMIT @limit
RETURN edge
"""
cursor = self.db.aql.execute(query, bind_vars={'limit': limit})
results = list(cursor)
try:
cursor = self.db.aql.execute(query, bind_vars=bind_vars)
results = list(cursor)
except Exception as exc:
logging.warning("Primary AQL failed (%s). Falling back to coauthorship edges.", exc)
# Fallback to simpler query
query = """
FOR edge IN coauthorship
LIMIT @limit
RETURN edge
"""
cursor = self.db.aql.execute(query, bind_vars={'limit': limit})
results = list(cursor)
🧰 Tools
🪛 Ruff (0.12.2)

111-111: Do not use bare except

(E722)

🤖 Prompt for AI Agents
In tools/graphsage/visualization/graph_visualizer.py around lines 108-120,
replace the bare except with an explicit "except Exception as e" and log the
caught exception before falling back; keep the existing fallback query but call
the module/class logger (e.g., logging.error(...) or self.logger.error(...))
with a descriptive message plus the exception details (e.g., include repr(e) or
traceback) so the real failure is recorded while preserving the current fallback
behavior.

Comment on lines 144 to 217
def visualize_pyvis(self, config: VisualizationConfig = None) -> str:
"""
Create interactive visualization using PyVis.
Returns:
Path to HTML file
"""
if config is None:
config = VisualizationConfig()

if self.graph is None:
self.query_subgraph(limit=config.max_nodes)

# Create PyVis network
net = Network(height='800px', width='100%',
bgcolor='#222222', font_color='white')

if config.physics:
net.barnes_hut(gravity=-80000, central_gravity=0.3,
spring_length=250, spring_strength=0.001)

# Add nodes with attributes
for node in self.graph.nodes():
node_data = self.graph.nodes[node]

# Size based on degree
size = min(50, 10 + self.graph.degree[node] * 2)

# Color based on type/community
color = self._get_node_color(node_data.get('type', 'unknown'))

# Label
label = node if config.show_labels else ''
title = node_data.get('title', node)

net.add_node(node, label=label, title=title,
size=size, color=color)

# Add edges
for edge in self.graph.edges(data=True):
weight = edge[2].get('weight', 1.0)
net.add_edge(edge[0], edge[1], value=weight)

# Set options
net.set_options("""
var options = {
"nodes": {
"borderWidth": 2,
"shadow": true
},
"edges": {
"color": {
"inherit": true
},
"smooth": {
"type": "continuous"
},
"shadow": true
},
"physics": {
"enabled": true,
"stabilization": {
"iterations": 100
}
}
}
""")

# Save to HTML
output_path = 'graph_visualization.html'
net.show(output_path)
print(f"Visualization saved to {output_path}")
return output_path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Communities mode currently has no effect; honor node_color_attr and avoid opening a browser.

visualize_communities sets node_color_attr='community', but visualize_pyvis ignores it and always colors by type. Also net.show() tries to open a browser; prefer write_html in headless runs.

         if self.graph is None:
             self.query_subgraph(limit=config.max_nodes)
 
         # Create PyVis network
         net = Network(height='800px', width='100%', 
                      bgcolor='#222222', font_color='white')
@@
-        # Add nodes with attributes
-        for node in self.graph.nodes():
-            node_data = self.graph.nodes[node]
-            
-            # Size based on degree
-            size = min(50, 10 + self.graph.degree[node] * 2)
-            
-            # Color based on type/community
-            color = self._get_node_color(node_data.get('type', 'unknown'))
+        # Add nodes with attributes
+        palette = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00',
+                   '#ffff33', '#a65628', '#f781bf', '#999999']
+        for node in self.graph.nodes():
+            node_data = self.graph.nodes[node]
+            # Size
+            metric = self.graph.degree[node] if config.node_size_attr == 'degree' else self.graph.degree[node]
+            size = min(50, 10 + metric * 2)
+            # Color
+            if config.node_color_attr == 'community' and 'community' in node_data:
+                comm = int(node_data['community'])
+                color = palette[comm % len(palette)]
+            else:
+                color = self._get_node_color(node_data.get('type', 'unknown'))
@@
-        # Save to HTML
-        output_path = 'graph_visualization.html'
-        net.show(output_path)
+        # Save to HTML (no browser pop-up)
+        output_path = 'graph_visualization.html'
+        net.write_html(output_path, open_browser=False)

Committable suggestion skipped: line range outside the PR's diff.

r3d91ll and others added 2 commits September 13, 2025 13:54
Successfully processed 2.8M papers with abstract embeddings:
- Total papers: 2,823,744
- All papers now have embeddings (100% coverage)
- Performance: 48.3 papers/sec
- Total time: 399.1 minutes for final 1.16M papers
- Checkpoint/resume validated and working

This commit represents the known-good baseline before core restructure.
All functionality has been validated through complete database processing.

Key components:
- Sentence-Transformers implementation (12% faster than raw)
- Unix socket optimization for ArangoDB
- Checkpoint/resume capability verified
- Multi-GPU support (2x RTX A6000)
- Jina-v4 embeddings (2048 dimensions)

Baseline snapshot captured for validation:
- docs/baseline/baseline-2025-01-13.md - Full documentation
- tools/arxiv/utils/baseline-snapshot.json - Database state
- tools/arxiv/utils/arxiv_baseline_snapshot.py - Snapshot tool

🏷️ baseline-pre-restructure
## Reorganization Changes
- Moved database modules to core/database/arango/ for clear separation
- Created core/workflows/ for orchestration and state management
- Established core/processors/text/ for text processing modules
- Added core/monitoring/ for centralized metrics collection
- Created core/graph/ placeholder for future graph refactor

## File Movements
- core/database/arango_db_manager.py → core/database/arango/arango_client.py
- core/processors/chunking_strategies.py → core/processors/text/chunking_strategies.py
- core/processors/document_processor.py → core/workflows/workflow_pdf.py
- core/utils/state_manager.py → core/workflows/workflow_state_manager.py
- tools/arxiv/utils/* → core/workflows/ (various workflow files)

## New Additions
- core/framework/sentence_embedder.py: Sentence-transformers implementation (12% faster)
- core/framework/vllm_embedder.py: VLLM embedding attempts (for reference)
- core/database/arango/arango_unix_client.py: Unix socket optimization
- tools/arxiv/utils/arxiv_baseline_snapshot.py: Database state capture tool
- docs/baseline/baseline-2025-01-13.md: Performance baseline documentation

## PRDs and Documentation
- Master Core Restructure PRD with 21-day sprint plan
- Phase 1-3 PRDs for systematic restructuring
- Centralized configuration system PRD
- Monitoring framework organization PRD

## Baseline Metrics Captured
- 2,823,744 papers processed with 100% embedding coverage
- 48.3 papers/second throughput achieved
- 399.1 minutes for 1.16M papers in final run
- Sentence-transformers optimization validated

## Removed/Deprecated
- Removed MCP server (to be reinstated later)
- Deleted duplicate files moved to new locations
- Cleaned up redundant requirements files

This commit establishes the baseline before the core restructure sprint.
All changes preserve functionality while improving organization.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 71

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
core/processors/__init__.py (1)

16-22: Public API removal is breaking — restore deprecated exports and make all immutable

rg found live imports of removed symbols in: tools/arxiv/arxiv_manager.py:500-502 and docs/prd/core_restructure/master-core-restructure-prd.md:322,331. Restore deprecated exports (with DeprecationWarning) before removing and change all to a tuple.

Apply this diff to make all a tuple and add trailing comma:

-__all__ = [
-    "ChunkingStrategy",
-    "TokenBasedChunking",
-    "SemanticChunking",
-    "SlidingWindowChunking",
-    "ChunkingStrategyFactory"
-]
+__all__ = (
+    "ChunkingStrategy",
+    "TokenBasedChunking",
+    "SemanticChunking",
+    "SlidingWindowChunking",
+    "ChunkingStrategyFactory",
+)

Optionally add a deprecation bridge (place this block after the current all):

# Back-compat aliases (to remove in next minor/next LTS)
try:
    from ..workflows import DocumentProcessor, ProcessingConfig, ProcessingResult  # type: ignore
    import warnings as _warnings
    _warnings.warn(
        "Importing DocumentProcessor/ProcessingConfig/ProcessingResult from core.processors is deprecated; "
        "use core.workflows instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    __all__ += ("DocumentProcessor", "ProcessingConfig", "ProcessingResult")
except Exception:
    # If workflows aren’t present or symbols moved further, avoid hard failure at import time.
    pass
core/workflows/workflow_pdf_batch.py (1)

2-13: Augment module docstring to tie to the theory framework.

Per guidelines for {core,tools}/**/*.py, add a short paragraph mapping pipeline phases to WHERE/WHAT/TIME and Conveyance.

Example to append:

  • WHERE: ArangoDB collections namespace for documents/chunks/embeddings
  • WHAT: Extraction (Docling), Embedding (Jina v4), Storage transactions
  • TIME: Batch/process scheduling and locking semantics
  • Conveyance: How chunking + transactional writes preserve semantic fidelity

Comment on lines +8 to 10
from .arango.arango_client import ArangoDBManager, retry_with_backoff

__all__ = ['ArangoDBManager', 'retry_with_backoff']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Add a compatibility shim for old import path.
Downstream code may still import core.database.arango_db_manager. Provide a thin re-export to avoid breakage.

Create core/database/arango_db_manager.py with:

from .arango.arango_client import ArangoDBManager, retry_with_backoff
__all__ = ["ArangoDBManager", "retry_with_backoff"]

Verify no remaining legacy imports:


🏁 Script executed:

#!/bin/bash
# Find legacy imports to the old path
rg -nP "core\.database\.arango_db_manager" -C2

Length of output: 602


Add compatibility shim for old import path.

Create core/database/arango_db_manager.py with:

from .arango.arango_client import ArangoDBManager, retry_with_backoff
__all__ = ["ArangoDBManager", "retry_with_backoff"]

Found one remaining legacy import in docs/prd/core_restructure/master-core-restructure-prd.md (line ~320): from core.database.arango_db_manager import ArangoDBManager — update that doc to the new import or keep it (the shim preserves runtime compatibility). Re-run rg -nP "core\.database\.arango_db_manager" -C2 to confirm no other occurrences.

🤖 Prompt for AI Agents
In core/database/__init__.py around lines 8 to 10, add a compatibility shim file
at core/database/arango_db_manager.py that re-exports ArangoDBManager and
retry_with_backoff from .arango.arango_client by importing them and setting
__all__ = ["ArangoDBManager", "retry_with_backoff"]; then either update the docs
reference in docs/prd/core_restructure/master-core-restructure-prd.md (around
line ~320) to the new import path or keep it (the shim preserves runtime
compatibility); finally run rg -nP "core\.database\.arango_db_manager" -C2 to
verify no other occurrences remain.

Comment on lines +106 to +139
def backup_collection(self, collection_name: str, output_dir: str = "./backups"):
"""
Backup a collection to JSON file.
Args:
collection_name: Name of collection to backup
output_dir: Directory to save backup files
"""
if not self.db.has_collection(collection_name):
logger.error(f"Collection '{collection_name}' does not exist")
return False

# Create backup directory
backup_path = Path(output_dir)
backup_path.mkdir(parents=True, exist_ok=True)

# Generate backup filename
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
filename = backup_path / f"{collection_name}_{timestamp}.json"

# Export collection
logger.info(f"Backing up {collection_name} to {filename}")
collection = self.db.collection(collection_name)

documents = []
for doc in collection:
documents.append(doc)

with open(filename, 'w') as f:
json.dump(documents, f, indent=2)

logger.info(f"✓ Backed up {len(documents):,} documents to {filename}")
return True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Backup loads entire collection into memory; stream instead.

Large collections will OOM. Stream documents (NDJSON) or chunked JSON.

-        documents = []
-        for doc in collection:
-            documents.append(doc)
-
-        with open(filename, 'w') as f:
-            json.dump(documents, f, indent=2)
+        with open(filename, 'w') as f:
+            first = True
+            f.write('[')
+            cursor = collection.export()  # server-side cursor; or use AQL with batch_size
+            for batch in cursor:
+                for doc in batch:
+                    f.write((',' if not first else '') + json.dumps(doc))
+                    first = False
+            f.write(']')

Alternative: produce NDJSON for easier restores.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In core/database/arango/arango_manager.py around lines 106 to 139, the
backup_collection implementation currently loads the entire collection into
memory which will OOM on large collections; change it to stream documents to
disk using NDJSON (one JSON document per line) or write chunked JSON instead of
accumulating a list. Iterate the collection via a server-side cursor/iterator,
open the output file and write each document as a single JSON line (flush
periodically or after each write), avoid building a Python list, and update the
final log to count written documents; ensure file is opened with proper encoding
and handle/propagate IO or iterator exceptions appropriately.

Comment on lines +94 to +116
# Get JWT token for better performance
self._authenticate()

def _authenticate(self):
"""Get JWT token for authenticated requests."""
auth_data = {
'username': self.username,
'password': self.password
}

response = self.session.post(
f'{self.base_url}/_open/auth',
json=auth_data
)

if response.status_code == 200:
token = response.json().get('jwt')
if token:
self.session.headers['Authorization'] = f'bearer {token}'
logger.info("Authentication successful")
else:
logger.warning("JWT authentication failed, using basic auth")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Add timeouts, check errors, and raise domain-specific exceptions.

HTTP calls lack timeouts and rely on status codes only. Introduce a default timeout and raise a typed error for clarity.

+        self.timeout = float(os.getenv("ARANGO_HTTP_TIMEOUT", "30"))
...
-        response = self.session.post(f'{self.base_url}/_open/auth', json=auth_data)
+        response = self.session.post(f'{self.base_url}/_open/auth', json=auth_data, timeout=self.timeout)
...
-        response = self.session.post(f'{self.base_url}/_db/{self.database}/_api/cursor', json=payload)
+        response = self.session.post(f'{self.base_url}/_db/{self.database}/_api/cursor', json=payload, timeout=self.timeout)
         if response.status_code != 201:
-            raise Exception(f"Query failed: {response.text}")
+            raise RuntimeError(f"AQL cursor create failed: {response.status_code} {response.text}")
...
-            response = self.session.put(f'{self.base_url}/_db/{self.database}/_api/cursor/{cursor_id}')
+            response = self.session.put(f'{self.base_url}/_db/{self.database}/_api/cursor/{cursor_id}', timeout=self.timeout)
+            if response.status_code not in (200, 201):
+                raise RuntimeError(f"AQL cursor fetch failed: {response.status_code} {response.text}")
...
-            response = self.session.post(
+            response = self.session.post(
                 f'{self.base_url}/_db/{self.database}/_api/document/{collection}',
-                json=batch,
-                params={'overwrite': 'true'}
+                json=batch,
+                params={'overwrite': 'true'},
+                timeout=self.timeout
             )
...
-        response = self.session.get(
+        response = self.session.get(
             f'{self.base_url}/_db/{self.database}/_api/collection/{collection}/count'
-        )
+        , timeout=self.timeout)
...
-        response = self.session.get(
+        response = self.session.get(
             f'{self.base_url}/_db/{self.database}/_api/collection/{name}'
-        )
+        , timeout=self.timeout)
...
-        response = self.session.post(
+        response = self.session.post(
             f'{self.base_url}/_db/{self.database}/_api/collection',
-            json={'name': name}
+            json={'name': name},
+            timeout=self.timeout
         )

Also applies to: 136-156, 174-194, 212-235

Comment on lines +159 to +171
def insert_documents(self, collection: str, documents: List[Dict],
batch_size: int = 10000) -> Dict[str, int]:
"""
Bulk insert documents with batching.
Args:
collection: Collection name
documents: List of documents to insert
batch_size: Documents per batch
Returns:
Statistics dict with created/errors counts
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Database writes must be atomic per guidelines—use stream transactions.

Batch inserts can partially succeed. Wrap batches in a stream transaction and commit/abort accordingly. Gate behind atomic=True.

-    def insert_documents(self, collection: str, documents: List[Dict],
-                        batch_size: int = 10000) -> Dict[str, int]:
+    def insert_documents(self, collection: str, documents: List[Dict],
+                        batch_size: int = 10000, *, atomic: bool = True) -> Dict[str, int]:
         """
         Bulk insert documents with batching.
@@
-        stats = {'created': 0, 'errors': 0}
+        stats = {'created': 0, 'errors': 0}
+        tx_id = None
+        headers = None
+        if atomic:
+            tx_id = self._begin_stream_tx(write=[collection])
+            headers = {'x-arango-trx-id': tx_id} if tx_id else None
@@
-            response = self.session.post(
+            response = self.session.post(
                 f'{self.base_url}/_db/{self.database}/_api/document/{collection}',
                 json=batch,
-                params={'overwrite': 'true'}
+                params={'overwrite': 'true'},
+                headers=headers,
+                timeout=self.timeout
             )
@@
-        return stats
+        if atomic and tx_id:
+            self._commit_stream_tx(tx_id)
+        return stats
+
+    # --- Stream transaction helpers ---
+    def _begin_stream_tx(self, *, read: List[str] = None, write: List[str] = None) -> Optional[str]:
+        payload = {'collections': {}}
+        if read: payload['collections']['read'] = read
+        if write: payload['collections']['write'] = write
+        resp = self.session.post(f'{self.base_url}/_api/transaction/begin', json=payload, timeout=self.timeout)
+        if resp.status_code in (200, 201):
+            return resp.json().get('id')
+        raise RuntimeError(f"Begin transaction failed: {resp.text}")
+
+    def _commit_stream_tx(self, tx_id: str) -> None:
+        resp = self.session.put(f'{self.base_url}/_api/transaction/{tx_id}', timeout=self.timeout)
+        if resp.status_code not in (200, 201):
+            raise RuntimeError(f"Commit transaction failed: {resp.text}")
+
+    def _abort_stream_tx(self, tx_id: str) -> None:
+        self.session.delete(f'{self.base_url}/_api/transaction/{tx_id}', timeout=self.timeout)

Also applies to: 196-211

🤖 Prompt for AI Agents
In core/database/arango/arango_unix_client.py around lines 159-171 (and
similarly 196-211), the current batch insert implementation can partially
succeed; modify insert_documents to accept an atomic: bool = False parameter
and, when atomic is True, wrap each batch insert in an Arango stream transaction
(begin a transaction before sending a batch, perform the batch insert, commit on
success, abort on any error), propagating/aggregating errors so the caller sees
total created/error counts and ensuring no partial commits across batches; on
atomic=False keep current behavior. Ensure proper try/except/finally to abort on
exceptions and close the transaction, and update docstring and return/exception
behavior accordingly.

Comment on lines +16 to +18
from arango import ArangoClient
from arango_unix_client import ArangoUnixClient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix intra-repo import to be relative per guidelines.

This import will fail from repository root. Use a relative import and run via python -m core.database.benchmark_unix_socket.

-from arango_unix_client import ArangoUnixClient
+from .arango.arango_unix_client import ArangoUnixClient

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In core/database/benchmark_unix_socket.py around lines 16 to 18, the top-level
import "from arango_unix_client import ArangoUnixClient" must be converted to a
relative import so it works when run from the repository root; change it to
"from .arango_unix_client import ArangoUnixClient" (ensure the core/database
package is importable, i.e. contains __init__.py) and run the script with
"python -m core.database.benchmark_unix_socket".

Comment on lines +50 to +71
def _detect_ups(self) -> bool:
"""Detect if UPS monitoring is available."""
ups_tools = [
("apcaccess", "apcupsd"), # APC UPS
("upsc", "nut"), # Network UPS Tools
("pwrstat", "cyberpower") # CyberPower
]

for cmd, ups_type in ups_tools:
try:
result = subprocess.run(
f"which {cmd}",
shell=True,
capture_output=True
)
if result.returncode == 0:
self.ups_command = (cmd, ups_type)
return True
except:
pass

return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Avoid shell=True; use shutil.which and arg lists for UPS detection.

Current approach spawns a shell and swallows errors. Safer and more robust to use shutil.which.

-    def _detect_ups(self) -> bool:
+    def _detect_ups(self) -> bool:
         """Detect if UPS monitoring is available."""
-        ups_tools = [
+        ups_tools = [
             ("apcaccess", "apcupsd"),  # APC UPS
             ("upsc", "nut"),            # Network UPS Tools
             ("pwrstat", "cyberpower")   # CyberPower
         ]
-
-        for cmd, ups_type in ups_tools:
-            try:
-                result = subprocess.run(
-                    f"which {cmd}",
-                    shell=True,
-                    capture_output=True
-                )
-                if result.returncode == 0:
-                    self.ups_command = (cmd, ups_type)
-                    return True
-            except:
-                pass
+        import shutil
+        for cmd, ups_type in ups_tools:
+            if shutil.which(cmd):
+                self.ups_command = (cmd, ups_type)
+                return True
         return False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _detect_ups(self) -> bool:
"""Detect if UPS monitoring is available."""
ups_tools = [
("apcaccess", "apcupsd"), # APC UPS
("upsc", "nut"), # Network UPS Tools
("pwrstat", "cyberpower") # CyberPower
]
for cmd, ups_type in ups_tools:
try:
result = subprocess.run(
f"which {cmd}",
shell=True,
capture_output=True
)
if result.returncode == 0:
self.ups_command = (cmd, ups_type)
return True
except:
pass
return False
def _detect_ups(self) -> bool:
"""Detect if UPS monitoring is available."""
ups_tools = [
("apcaccess", "apcupsd"), # APC UPS
("upsc", "nut"), # Network UPS Tools
("pwrstat", "cyberpower") # CyberPower
]
import shutil
for cmd, ups_type in ups_tools:
if shutil.which(cmd):
self.ups_command = (cmd, ups_type)
return True
return False
🧰 Tools
🪛 Ruff (0.12.2)

60-60: subprocess call with shell=True identified, security issue

(S602)


68-68: Do not use bare except

(E722)


68-69: try-except-pass detected, consider logging the exception

(S110)

🤖 Prompt for AI Agents
In core/monitoring/power_monitor.py around lines 50 to 71, the UPS detection
uses subprocess.run with shell=True and a bare except which is unsafe and
unnecessary; replace that logic with shutil.which to check for the presence of
executables (or use subprocess.run with an argument list if you must execute),
remove shell=True, and eliminate the broad except—use no try/except or catch
specific exceptions only; when shutil.which returns a non-None path set
self.ups_command = (cmd, ups_type) and return True, otherwise continue and
finally return False.

Comment on lines +95 to +101
result = subprocess.run(
"apcaccess status",
shell=True,
capture_output=True,
text=True
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Subprocess calls: drop shell=True, add timeouts, and check=True.

Using the shell and no timeouts is a security and reliability risk. Prefer argv lists, check return codes, and bound execution time.

- result = subprocess.run("apcaccess status", shell=True, capture_output=True, text=True)
+ result = subprocess.run(["apcaccess", "status"], check=True, capture_output=True, text=True, timeout=5)
- result = subprocess.run("upsc -l", shell=True, capture_output=True, text=True)
+ result = subprocess.run(["upsc", "-l"], check=True, capture_output=True, text=True, timeout=5)
...
- result = subprocess.run(f"upsc {ups_name}", shell=True, capture_output=True, text=True)
+ result = subprocess.run(["upsc", ups_name], check=True, capture_output=True, text=True, timeout=5)
- result = subprocess.run("pwrstat -status", shell=True, capture_output=True, text=True)
+ result = subprocess.run(["pwrstat", "-status"], check=True, capture_output=True, text=True, timeout=5)
- result = subprocess.run(
-     "nvidia-smi --query-gpu=index,name,power.draw,power.limit "
-     "--format=csv,noheader",
-     shell=True, capture_output=True, text=True
- )
+ result = subprocess.run(
+     ["nvidia-smi", "--query-gpu=index,name,power.draw,power.limit", "--format=csv,noheader"],
+     check=True, capture_output=True, text=True, timeout=5
+ )

Also applies to: 142-161, 192-198, 240-246

🧰 Tools
🪛 Ruff (0.12.2)

95-95: subprocess call with shell=True seems safe, but may be changed in the future; consider rewriting without shell

(S602)


96-96: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
In core/monitoring/power_monitor.py around lines 95-101 (and similarly at
142-161, 192-198, 240-246), the subprocess.run calls use shell=True, lack
timeouts, and don't check return codes; replace string commands with argument
lists (e.g. ["apcaccess","status"]), remove shell=True, add a timeout parameter
(reasonable value like 5-10s) and use check=True or explicitly inspect
result.returncode and raise/log on non-zero; also handle TimeoutExpired
exceptions and subprocess.CalledProcessError to avoid hangs and propagate/log
useful error details.

Comment on lines +248 to +254
for line in result.stdout.strip().split('\n'):
parts = [p.strip() for p in line.split(',')]
if len(parts) >= 3:
gpu_idx = int(parts[0])
power_draw = float(parts[2].replace(' W', '')) if ' W' in parts[2] else 0
power_limit = float(parts[3].replace(' W', '')) if ' W' in parts[3] else 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Index error risk when parsing nvidia-smi CSV.

You guard with len(parts) >= 3 but read parts[3]. Bump the length check to 4 to avoid IndexError.

-                if len(parts) >= 3:
+                if len(parts) >= 4:
                     gpu_idx = int(parts[0])
                     power_draw = float(parts[2].replace(' W', '')) if ' W' in parts[2] else 0
                     power_limit = float(parts[3].replace(' W', '')) if ' W' in parts[3] else 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for line in result.stdout.strip().split('\n'):
parts = [p.strip() for p in line.split(',')]
if len(parts) >= 3:
gpu_idx = int(parts[0])
power_draw = float(parts[2].replace(' W', '')) if ' W' in parts[2] else 0
power_limit = float(parts[3].replace(' W', '')) if ' W' in parts[3] else 0
for line in result.stdout.strip().split('\n'):
parts = [p.strip() for p in line.split(',')]
if len(parts) >= 4:
gpu_idx = int(parts[0])
power_draw = float(parts[2].replace(' W', '')) if ' W' in parts[2] else 0
power_limit = float(parts[3].replace(' W', '')) if ' W' in parts[3] else 0
🤖 Prompt for AI Agents
In core/monitoring/power_monitor.py around lines 248 to 254, the CSV parsing
checks len(parts) >= 3 but later accesses parts[3], risking an IndexError;
change the guard to require len(parts) >= 4 and keep the existing parsing logic
(convert parts[0] to int, parse parts[2] and parts[3] for ' W' suffix into
floats with fallbacks) so parts[3] is always present before it's accessed.

Comment on lines 1 to 6
"""
Data Infrastructure Module
Text Processing Module
Provides data loading, PDF processing, and chunk management utilities
for processing academic papers from arXiv and other sources.
Provides text processing utilities like chunking strategies.
Document processing workflows have been moved to core/workflows/.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Docstring must align with framework (CONVEYANCE, WHERE/WHAT/TIME).

Current text is descriptive but doesn’t meet the repo’s docstring standard for core/tools/experiments modules.

Apply this diff to replace the header docstring:

-"""
-Text Processing Module
-
-Provides text processing utilities like chunking strategies.
-Document processing workflows have been moved to core/workflows/.
-"""
+"""
+core.processors — Text Processing Utilities
+
+Conveyance: prepares coherent text units (chunks) for downstream graph embedding and GraphSAGE neighborhood aggregation.
+What: defines chunk boundary policies and factories (token/semantic/sliding) that control unit semantics and size.
+Where: preserves document structure via stable identifiers so chunks map 1:1 to graph nodes/edges without losing context.
+Time: supports sliding-window parameters to maintain sequential/temporal coherence across adjacent chunks.
+
+Note: Orchestration/workflows moved to core/workflows/. Import algorithms here; run pipelines there.
+"""
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""
Data Infrastructure Module
Text Processing Module
Provides data loading, PDF processing, and chunk management utilities
for processing academic papers from arXiv and other sources.
Provides text processing utilities like chunking strategies.
Document processing workflows have been moved to core/workflows/.
"""
"""
core.processorsText Processing Utilities
Conveyance: prepares coherent text units (chunks) for downstream graph embedding and GraphSAGE neighborhood aggregation.
What: defines chunk boundary policies and factories (token/semantic/sliding) that control unit semantics and size.
Where: preserves document structure via stable identifiers so chunks map 1:1 to graph nodes/edges without losing context.
Time: supports sliding-window parameters to maintain sequential/temporal coherence across adjacent chunks.
Note: Orchestration/workflows moved to core/workflows/. Import algorithms here; run pipelines there.
"""
🤖 Prompt for AI Agents
In core/processors/__init__.py around lines 1 to 6, the module docstring does
not follow the repository's required CONVEYANCE/WHERE/WHAT/TIME docstring
format; replace the current header with a single docstring that states the
conveyance (purpose/context), the module location (core/processors), the
specific functionality provided (text processing utilities like chunking
strategies; note that document processing workflows moved to core/workflows),
and the time or versioning note if applicable, formatted succinctly to match
other core/tools/experiments modules.

Comment on lines +412 to 413
from core.database.arango.arango_client import ArangoDBManager

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Prefer package‑relative imports; drop sys.path hack.

Follow our guidelines to use relative imports within core/. Also remove the ad‑hoc sys.path mutation to keep workers import‑safe.

Apply:

-    from core.framework.embedders import JinaV4Embedder
-    import sys
-    from pathlib import Path
-    sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-    from core.database.arango.arango_client import ArangoDBManager
+    from ..framework.embedders import JinaV4Embedder
+    from ..database.arango.arango_client import ArangoDBManager

Also applies to: 408-413, 411-411

🤖 Prompt for AI Agents
In core/workflows/workflow_pdf_batch.py around lines 408-413 (including 411),
replace the sys.path mutation and absolute import with a package-relative
import: remove any ad-hoc sys.path.append/insert calls and change "from
core.database.arango.arango_client import ArangoDBManager" to a relative import
(e.g., "from ..database.arango.arango_client import ArangoDBManager" adjusted to
the correct relative depth). Ensure all other imports in this block follow the
same package-relative pattern and run tests to confirm workers remain
import-safe.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review continued from previous batch...

Comment on lines +189 to +230
def embed_multimodal(self,
inputs: List[Union[str, Dict]],
task: str = "retrieval",
show_progress: bool = False) -> np.ndarray:
"""
Embed multimodal inputs (text and/or images).
Sentence-transformers can handle both text and image URLs/paths.
Args:
inputs: List of texts or dicts with 'text' and 'image' keys
task: Task type
show_progress: Show progress bar
Returns:
L2-normalized embeddings (N x 2048)
"""
# Process inputs for sentence-transformers
processed_inputs = []
for inp in inputs:
if isinstance(inp, str):
processed_inputs.append(inp)
elif isinstance(inp, dict):
# Handle multimodal input
if 'text' in inp and 'image' in inp:
# Sentence-transformers can handle this natively
processed_inputs.append(inp)
elif 'text' in inp:
processed_inputs.append(inp['text'])
elif 'image' in inp:
processed_inputs.append(inp['image'])

embeddings = self.model.encode(
sentences=processed_inputs,
task=task,
batch_size=self.batch_size,
show_progress_bar=show_progress,
normalize_embeddings=True,
convert_to_numpy=True
)

return embeddings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Gate multimodal usage or remove until a multimodal ST model is wired.

Default jinaai/jina-embeddings-v4 is text-only; passing dicts with images will likely fail. Guard by feature-detecting or raise NotImplementedError unless a multimodal model is configured.

-    def embed_multimodal(self, 
+    def embed_multimodal(self, 
                         inputs: List[Union[str, Dict]],
                         task: str = "retrieval",
                         show_progress: bool = False) -> np.ndarray:
@@
-        embeddings = self.model.encode(
-            sentences=processed_inputs,
-            task=task,
-            batch_size=self.batch_size,
-            show_progress_bar=show_progress,
-            normalize_embeddings=True,
-            convert_to_numpy=True
-        )
+        if not hasattr(self.model, "encode"):
+            raise NotImplementedError("SentenceTransformer missing encode()")
+        if not any(isinstance(x, dict) for x in processed_inputs):
+            return self.model.encode(
+                sentences=processed_inputs,
+                batch_size=self.batch_size,
+                show_progress_bar=show_progress,
+                normalize_embeddings=True,
+                convert_to_numpy=True,
+            )
+        raise NotImplementedError("Multimodal inputs require a multimodal model/config")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def embed_multimodal(self,
inputs: List[Union[str, Dict]],
task: str = "retrieval",
show_progress: bool = False) -> np.ndarray:
"""
Embed multimodal inputs (text and/or images).
Sentence-transformers can handle both text and image URLs/paths.
Args:
inputs: List of texts or dicts with 'text' and 'image' keys
task: Task type
show_progress: Show progress bar
Returns:
L2-normalized embeddings (N x 2048)
"""
# Process inputs for sentence-transformers
processed_inputs = []
for inp in inputs:
if isinstance(inp, str):
processed_inputs.append(inp)
elif isinstance(inp, dict):
# Handle multimodal input
if 'text' in inp and 'image' in inp:
# Sentence-transformers can handle this natively
processed_inputs.append(inp)
elif 'text' in inp:
processed_inputs.append(inp['text'])
elif 'image' in inp:
processed_inputs.append(inp['image'])
embeddings = self.model.encode(
sentences=processed_inputs,
task=task,
batch_size=self.batch_size,
show_progress_bar=show_progress,
normalize_embeddings=True,
convert_to_numpy=True
)
return embeddings
def embed_multimodal(self,
inputs: List[Union[str, Dict]],
task: str = "retrieval",
show_progress: bool = False) -> np.ndarray:
"""
Embed multimodal inputs (text and/or images).
Sentence-transformers can handle both text and image URLs/paths.
Args:
inputs: List of texts or dicts with 'text' and 'image' keys
task: Task type
show_progress: Show progress bar
Returns:
L2-normalized embeddings (N x 2048)
"""
# Process inputs for sentence-transformers
processed_inputs = []
for inp in inputs:
if isinstance(inp, str):
processed_inputs.append(inp)
elif isinstance(inp, dict):
# Handle multimodal input
if 'text' in inp and 'image' in inp:
# Sentence-transformers can handle this natively
processed_inputs.append(inp)
elif 'text' in inp:
processed_inputs.append(inp['text'])
elif 'image' in inp:
processed_inputs.append(inp['image'])
if not hasattr(self.model, "encode"):
raise NotImplementedError("SentenceTransformer missing encode()")
if not any(isinstance(x, dict) for x in processed_inputs):
return self.model.encode(
sentences=processed_inputs,
batch_size=self.batch_size,
show_progress_bar=show_progress,
normalize_embeddings=True,
convert_to_numpy=True,
)
raise NotImplementedError("Multimodal inputs require a multimodal model/config")

from multiprocessing import Pool
import click
import pickle
from unix_socket_client import get_unix_socket_client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use relative import for intra-package module.

Satisfies project guideline and avoids path hacks.

-from unix_socket_client import get_unix_socket_client
+from .unix_socket_client import get_unix_socket_client
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from unix_socket_client import get_unix_socket_client
from .unix_socket_client import get_unix_socket_client
🤖 Prompt for AI Agents
In core/graph/builders/build_complete_graph.py around line 20, replace the
absolute import "from unix_socket_client import get_unix_socket_client" with a
package-relative import (e.g. "from ..unix_socket_client import
get_unix_socket_client" or "from .unix_socket_client import
get_unix_socket_client" depending on unix_socket_client's location) to follow
project guidelines; ensure the containing packages have __init__.py files so the
relative import resolves.

Comment on lines +79 to +86
for paper in tqdm(papers, desc="Loading embeddings"):
embed_doc = self.db.collection('arxiv_embeddings').get(paper['_key'])
if embed_doc and 'keyword_embedding' in embed_doc:
embeddings.append(embed_doc['keyword_embedding'])
paper_ids.append(paper['_key'])
cat = paper['categories'][0] if 'categories' in paper and paper['categories'] else 'unknown'
paper_categories.append(cat)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Avoid per-document lookups; fetch embeddings via AQL join.

Current loop does one get() per paper. Use a single AQL to join papers with arxiv_embeddings and stream results.

-        for paper in tqdm(papers, desc="Loading embeddings"):
-            embed_doc = self.db.collection('arxiv_embeddings').get(paper['_key'])
-            if embed_doc and 'keyword_embedding' in embed_doc:
-                embeddings.append(embed_doc['keyword_embedding'])
-                paper_ids.append(paper['_key'])
-                cat = paper['categories'][0] if 'categories' in paper and paper['categories'] else 'unknown'
-                paper_categories.append(cat)
+        join_query = """
+        FOR p IN arxiv_papers
+          FILTER p.keywords != null
+          LET cat = (LENGTH(p.categories) > 0 ? p.categories[0] : "unknown")
+          LET e = DOCUMENT("arxiv_embeddings", p._key)
+          FILTER e != null AND e.keyword_embedding != null
+          RETURN {id: p._key, cat: cat, emb: e.keyword_embedding}
+        """
+        for row in tqdm(self.db.aql.execute(join_query, batch_size=10000), desc="Loading embeddings"):
+            embeddings.append(row['emb'])
+            paper_ids.append(row['id'])
+            paper_categories.append(row['cat'])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for paper in tqdm(papers, desc="Loading embeddings"):
embed_doc = self.db.collection('arxiv_embeddings').get(paper['_key'])
if embed_doc and 'keyword_embedding' in embed_doc:
embeddings.append(embed_doc['keyword_embedding'])
paper_ids.append(paper['_key'])
cat = paper['categories'][0] if 'categories' in paper and paper['categories'] else 'unknown'
paper_categories.append(cat)
join_query = """
FOR p IN arxiv_papers
FILTER p.keywords != null
LET cat = (LENGTH(p.categories) > 0 ? p.categories[0] : "unknown")
LET e = DOCUMENT("arxiv_embeddings", p._key)
FILTER e != null AND e.keyword_embedding != null
RETURN {id: p._key, cat: cat, emb: e.keyword_embedding}
"""
for row in tqdm(self.db.aql.execute(join_query, batch_size=10000), desc="Loading embeddings"):
embeddings.append(row['emb'])
paper_ids.append(row['id'])
paper_categories.append(row['cat'])
🧰 Tools
🪛 Ruff (0.12.2)

84-84: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

🤖 Prompt for AI Agents
In core/graph/builders/build_graph_gpu_optimized.py around lines 79 to 86, the
current code does a per-document embed_doc = collection.get(...) inside a loop
which causes many individual DB calls; replace this with a single AQL join that
selects papers and their matching arxiv_embeddings in one query (using a FOR p
IN <papers_collection_or_view> FILTER ... FOR e IN arxiv_embeddings FILTER
e._key == p._key RETURN {paper_key: p._key, categories: p.categories,
keyword_embedding: e.keyword_embedding}), execute it with bind parameters and
iterate the returned cursor to stream results, appending keyword_embedding,
paper_key, and the first category (or 'unknown') to embeddings, paper_ids, and
paper_categories respectively; ensure you handle missing embeddings by filtering
in AQL or checking keyword_embedding presence before appending, and keep tqdm
around cursor iteration if progress display is needed.

Comment on lines +114 to +123
# This is the key operation that should use GPU heavily
with torch.cuda.amp.autocast(): # Use mixed precision for speed
similarities = torch.mm(batch, embeddings_gpu.t())

# Find top-k similar papers for each paper in batch
# This also runs on GPU
top_values, top_indices = torch.topk(similarities,
min(top_k + 1, similarities.shape[1]),
dim=1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Blocker: OOM risk — full NxN similarity matrix won’t scale to millions.

similarities = torch.mm(batch, embeddings_gpu.t()) materializes a (batch_size × N) matrix; with N≈2.8M this is infeasible (tens of GB+ per batch). Stream library embeddings in stripes and maintain rolling top‑k per row, or switch to Faiss GPU. Below shows a streaming top‑k pattern:

-            with torch.cuda.amp.autocast():  # Use mixed precision for speed
-                similarities = torch.mm(batch, embeddings_gpu.t())
-
-            # Find top-k similar papers for each paper in batch
-            top_values, top_indices = torch.topk(similarities, 
-                                                 min(top_k + 1, similarities.shape[1]), 
-                                                 dim=1)
+            # Streaming top-k across library stripes to avoid (batch x N) allocation
+            library_chunk = 200_000
+            best_vals = None
+            best_idx = None
+            for lib_start in range(0, embeddings_gpu.shape[0], library_chunk):
+                lib_end = min(lib_start + library_chunk, embeddings_gpu.shape[0])
+                with torch.cuda.amp.autocast():
+                    sim_chunk = torch.mm(batch, embeddings_gpu[lib_start:lib_end].T)
+                v, i = torch.topk(sim_chunk, k=min(top_k + 1, sim_chunk.shape[1]), dim=1)
+                i = i.add(lib_start)
+                if best_vals is None:
+                    best_vals, best_idx = v, i
+                else:
+                    # merge current best with new chunk best
+                    merged_vals = torch.cat([best_vals, v], dim=1)
+                    merged_idx = torch.cat([best_idx, i], dim=1)
+                    best_vals, sel = torch.topk(merged_vals, k=min(top_k + 1, merged_vals.shape[1]), dim=1)
+                    best_idx = torch.gather(merged_idx, 1, sel)
+                del sim_chunk, v, i
+            top_values, top_indices = best_vals, best_idx
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# This is the key operation that should use GPU heavily
with torch.cuda.amp.autocast(): # Use mixed precision for speed
similarities = torch.mm(batch, embeddings_gpu.t())
# Find top-k similar papers for each paper in batch
# This also runs on GPU
top_values, top_indices = torch.topk(similarities,
min(top_k + 1, similarities.shape[1]),
dim=1)
# Streaming top-k across library stripes to avoid (batch x N) allocation
library_chunk = 200_000
best_vals = None
best_idx = None
for lib_start in range(0, embeddings_gpu.shape[0], library_chunk):
lib_end = min(lib_start + library_chunk, embeddings_gpu.shape[0])
with torch.cuda.amp.autocast():
sim_chunk = torch.mm(batch, embeddings_gpu[lib_start:lib_end].T)
v, i = torch.topk(sim_chunk, k=min(top_k + 1, sim_chunk.shape[1]), dim=1)
i = i.add(lib_start)
if best_vals is None:
best_vals, best_idx = v, i
else:
# merge current best with new chunk best
merged_vals = torch.cat([best_vals, v], dim=1)
merged_idx = torch.cat([best_idx, i], dim=1)
best_vals, sel = torch.topk(merged_vals, k=min(top_k + 1, merged_vals.shape[1]), dim=1)
best_idx = torch.gather(merged_idx, 1, sel)
del sim_chunk, v, i
top_values, top_indices = best_vals, best_idx
🤖 Prompt for AI Agents
In core/graph/builders/build_graph_gpu_optimized.py around lines 114-123, the
code computes a full (batch_size × N) similarity matrix via similarities =
torch.mm(batch, embeddings_gpu.t()), which will OOM for N≈2.8M; replace this
with a streaming/striped approach or Faiss on GPU: iterate over embeddings in
manageable chunks/stripes, move each stripe to GPU, compute partial similarities
for the batch vs stripe (using torch.cuda.amp.autocast if desired), for each
stripe compute top-k per row, and merge those stripe-level top-k results into a
rolling top-k per input row (e.g., keep current top-k values/indices and update
by comparing with new stripe top-k using torch.topk or a max-heap strategy),
ensure to free GPU stripe tensors between iterations and use min(top_k+1,
total_seen) for correct sizes; alternatively replace the brute-force mm with a
Faiss GPU index to perform batched nearest-neighbor search.

Comment on lines +56 to +67
'weight': 0.3
})
else:
# Small categories: connect more densely
for i, paper1 in enumerate(papers):
for paper2 in papers[i+1:min(i+max_edges_per_node, len(papers))]:
edges.append({
'_from': f'arxiv_papers/{paper1}',
'_to': f'arxiv_papers/{paper2}',
'category': category,
'weight': 0.3
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Edge explosion and duplicates: add dedup and cap per category.

Even with sampling, duplicate pairs can appear; enforce canonical ordering and a local set.

-    edges = []
+    edges = []
+    seen = set()
@@
-                    edges.append({
+                    key = (min(paper, neighbor), max(paper, neighbor))
+                    if key in seen:
+                        continue
+                    seen.add(key)
+                    edges.append({
@@
-                for paper2 in papers[i+1:min(i+max_edges_per_node, len(papers))]:
-                    edges.append({
+                for paper2 in papers[i+1:min(i+max_edges_per_node, len(papers))]:
+                    key = (paper1, paper2)
+                    if key in seen:
+                        continue
+                    seen.add(key)
+                    edges.append({
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
'weight': 0.3
})
else:
# Small categories: connect more densely
for i, paper1 in enumerate(papers):
for paper2 in papers[i+1:min(i+max_edges_per_node, len(papers))]:
edges.append({
'_from': f'arxiv_papers/{paper1}',
'_to': f'arxiv_papers/{paper2}',
'category': category,
'weight': 0.3
})
edges = []
seen = set()
key = (min(paper, neighbor), max(paper, neighbor))
if key in seen:
continue
seen.add(key)
edges.append({
'_from': f'arxiv_papers/{paper}',
'_to': f'arxiv_papers/{neighbor}',
'category': category,
'weight': 0.3
})
else:
# Small categories: connect more densely
for i, paper1 in enumerate(papers):
for paper2 in papers[i+1:min(i+max_edges_per_node, len(papers))]:
key = (paper1, paper2)
if key in seen:
continue
seen.add(key)
edges.append({
'_from': f'arxiv_papers/{paper1}',
'_to': f'arxiv_papers/{paper2}',
'category': category,
'weight': 0.3
})
🤖 Prompt for AI Agents
In core/graph/builders/build_graph_parallel.py around lines 56 to 67, the loop
that builds edges for small categories can produce duplicate or mirrored pairs
and can grow unbounded; change it to enforce a canonical ordering (e.g.,
tuple(sorted([paper1, paper2]))) and maintain a local set of seen pairs for that
category, skip if already seen (and skip self-pairs), and stop adding when a
per-category cap (max_edges_per_category) is reached; ensure edges use the
canonical pair to avoid duplicates and that the cap is applied before appending
to edges.

Comment on lines +96 to +99
papers_with_keyword = list(self.db.collection('arxiv_papers').find(
{'keywords': {'like': f'%{keyword}%'}}
))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Arango query via Collection.find with 'like' is likely invalid; switch to AQL and robust year parsing.

keywords is probably an array; LIKE on array won’t work. Use AQL ANY+LIKE and parse year from versions.created when available.

-        papers_with_keyword = list(self.db.collection('arxiv_papers').find(
-            {'keywords': {'like': f'%{keyword}%'}}
-        ))
+        cursor = self.db.aql.execute(
+            """
+            FOR p IN arxiv_papers
+              FILTER p.keywords != null
+                AND ANY k IN p.keywords SATISFIES LIKE(LOWER(k), @pattern, true) END
+              RETURN { _key: p._key, primary_category: p.primary_category, arxiv_id: p.arxiv_id, versions: p.versions }
+            """,
+            bind_vars={'pattern': f"%{keyword.lower()}%"}
+        )
+        papers_with_keyword = list(cursor)
@@
-            if 'arxiv_id' in paper and 'primary_category' in paper:
-                year = 2000 + int(paper['arxiv_id'][:2])  # Convert YY to YYYY
-                category = paper['primary_category'].split('.')[0]  # Main category
+            if 'primary_category' in paper:
+                # Prefer versions.created year; fallback to arxiv_id prefix
+                year = None
+                if paper.get('versions'):
+                    created = paper['versions'][0].get('created')
+                    if created and len(created) >= 4:
+                        try:
+                            year = int(created[-4:])
+                        except ValueError:
+                            year = None
+                if year is None and 'arxiv_id' in paper and isinstance(paper['arxiv_id'], str) and len(paper['arxiv_id']) >= 4:
+                    yy = paper['arxiv_id'][:2]
+                    year = 1900 + int(yy) if int(yy) >= 90 else 2000 + int(yy)
+                category = paper['primary_category'].split('.')[0]

Also applies to: 103-108

🤖 Prompt for AI Agents
In core/graph/utils/interdisciplinary_analysis.py around lines 96-99 (and
similarly 103-108), the current Collection.find call using {'keywords': {'like':
f'%{keyword}%'}} is invalid for Arango and will fail for keywords stored as an
array; replace this with an AQL query that uses ANY keyword IN doc.keywords
FILTER LIKE(keyword, @pattern) RETURN doc (bind @pattern = '%' + keyword + '%')
and execute via db.aql.execute or the collection's AQL API. Also update the year
parsing logic to be robust: when extracting year prefer versions.created (parse
ISO date safely) and fall back to a published/created field or the paper's
uploaded date, validating numeric year extraction and handling missing/invalid
values by returning None or a default; ensure parameterized queries and proper
error handling/logging for parsing failures.

Comment on lines +213 to +216
for coll_name in edge_collections:
if coll_name not in [c['name'] for c in self.db.collections()]:
continue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Collection existence check is incorrect; will raise TypeError.

python-arango Database.collections() returns objects, not dicts. Use has_collection().

-            if coll_name not in [c['name'] for c in self.db.collections()]:
-                continue
+            if not self.db.has_collection(coll_name):
+                continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for coll_name in edge_collections:
if coll_name not in [c['name'] for c in self.db.collections()]:
continue
for coll_name in edge_collections:
if not self.db.has_collection(coll_name):
continue
🤖 Prompt for AI Agents
In core/graph/utils/sample_graph_for_training.py around lines 213 to 216, the
current check uses [c['name'] for c in self.db.collections()] which will raise a
TypeError because python-arango Database.collections() returns objects, not
dicts; replace this logic with a call to self.db.has_collection(coll_name)
(i.e., if not self.db.has_collection(coll_name): continue) to correctly test for
collection existence.

Comment on lines +223 to +239
for i in range(0, len(node_list), batch_size):
batch = node_list[i:i+batch_size]

query = f"""
FOR e IN {coll_name}
LET from_key = PARSE_IDENTIFIER(e._from).key
LET to_key = PARSE_IDENTIFIER(e._to).key
FILTER from_key IN @nodes AND to_key IN @nodes
RETURN {{
from: from_key,
to: to_key,
weight: e.weight
}}
"""

edges = list(self.db.aql.execute(query, bind_vars={'nodes': batch}))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Edge extraction misses cross-batch edges.

Filtering both endpoints IN @nodes where @nodes==batch means you only get intra-batch edges. Must join batch against full node set.

-                query = f"""
-                FOR e IN {coll_name}
-                    LET from_key = PARSE_IDENTIFIER(e._from).key
-                    LET to_key = PARSE_IDENTIFIER(e._to).key
-                    FILTER from_key IN @nodes AND to_key IN @nodes
+                query = f"""
+                FOR e IN {coll_name}
+                    LET from_key = PARSE_IDENTIFIER(e._from).key
+                    LET to_key = PARSE_IDENTIFIER(e._to).key
+                    FILTER (from_key IN @batch AND to_key IN @all_nodes)
+                       OR (to_key IN @batch AND from_key IN @all_nodes)
                     RETURN {{
                         from: from_key,
                         to: to_key,
                         weight: e.weight
                     }}
                 """
-                edges = list(self.db.aql.execute(query, bind_vars={'nodes': batch}))
+                edges = list(self.db.aql.execute(
+                    query,
+                    bind_vars={'batch': batch, 'all_nodes': node_list}
+                ))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for i in range(0, len(node_list), batch_size):
batch = node_list[i:i+batch_size]
query = f"""
FOR e IN {coll_name}
LET from_key = PARSE_IDENTIFIER(e._from).key
LET to_key = PARSE_IDENTIFIER(e._to).key
FILTER from_key IN @nodes AND to_key IN @nodes
RETURN {{
from: from_key,
to: to_key,
weight: e.weight
}}
"""
edges = list(self.db.aql.execute(query, bind_vars={'nodes': batch}))
for i in range(0, len(node_list), batch_size):
batch = node_list[i:i+batch_size]
query = f"""
FOR e IN {coll_name}
LET from_key = PARSE_IDENTIFIER(e._from).key
LET to_key = PARSE_IDENTIFIER(e._to).key
FILTER (from_key IN @batch AND to_key IN @all_nodes)
OR (to_key IN @batch AND from_key IN @all_nodes)
RETURN {{
from: from_key,
to: to_key,
weight: e.weight
}}
"""
edges = list(self.db.aql.execute(
query,
bind_vars={'batch': batch, 'all_nodes': node_list}
))
🤖 Prompt for AI Agents
In core/graph/utils/sample_graph_for_training.py around lines 223-239, the AQL
filters use "from_key IN @nodes AND to_key IN @nodes" where @nodes is the
current batch, so cross-batch edges are missed; change the query to bind both
the current batch and the full node list and filter like "(from_key IN @batch
AND to_key IN @all_nodes) OR (to_key IN @batch AND from_key IN @all_nodes)" (or
equivalent JOIN logic) so edges between the batch and nodes outside the batch
are returned; ensure you pass bind_vars={'batch': batch, 'all_nodes':
full_node_list}, deduplicate edges if needed, and keep batching for
memory/performance.

Comment on lines +104 to +116
def _get_cpu_model(self) -> str:
"""Get CPU model name."""
try:
if sys.platform == "linux":
result = subprocess.run(
"lscpu | grep 'Model name' | cut -d: -f2",
shell=True, capture_output=True, text=True
)
return result.stdout.strip()
return "Unknown"
except:
return "Unknown"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid shell=True and brittle greps; parse without shell.

Use subprocess without a shell and parse JSON/structured output where possible. Also avoid bare except.

-            if sys.platform == "linux":
-                result = subprocess.run(
-                    "lscpu | grep 'Model name' | cut -d: -f2",
-                    shell=True, capture_output=True, text=True
-                )
-                return result.stdout.strip()
+            if sys.platform == "linux":
+                result = subprocess.run(["lscpu"], check=False, capture_output=True, text=True)
+                for line in result.stdout.splitlines():
+                    if "Model name:" in line:
+                        return line.split(":", 1)[1].strip()
             return "Unknown"
-        except:
-            return "Unknown"
+        except Exception:
+            return "Unknown"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _get_cpu_model(self) -> str:
"""Get CPU model name."""
try:
if sys.platform == "linux":
result = subprocess.run(
"lscpu | grep 'Model name' | cut -d: -f2",
shell=True, capture_output=True, text=True
)
return result.stdout.strip()
return "Unknown"
except:
return "Unknown"
def _get_cpu_model(self) -> str:
"""Get CPU model name."""
try:
if sys.platform == "linux":
result = subprocess.run(["lscpu"], check=False, capture_output=True, text=True)
for line in result.stdout.splitlines():
if "Model name:" in line:
return line.split(":", 1)[1].strip()
return "Unknown"
except Exception:
return "Unknown"
🧰 Tools
🪛 Ruff (0.12.2)

108-108: subprocess call with shell=True seems safe, but may be changed in the future; consider rewriting without shell

(S602)


109-109: Starting a process with a partial executable path

(S607)


113-113: Consider moving this statement to an else block

(TRY300)


114-114: Do not use bare except

(E722)

🤖 Prompt for AI Agents
In core/monitoring/system_monitor.py around lines 104 to 116, replace the
current shell-based lscpu+grep pipeline and bare except with a safer non-shell
invocation and structured parsing: call lscpu with JSON output via
subprocess.run(["lscpu","-J"], capture_output=True, text=True, check=False) and
parse the resulting JSON to extract the "Model name" value (falling back to
parsing /proc/cpuinfo if lscpu or JSON parsing fails), avoid shell=True, and
replace the bare except with a specific except Exception as e branch that
returns "Unknown" on error (optionally logging the exception).

Comment on lines +121 to +135
result = subprocess.run(
"nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader",
shell=True, capture_output=True, text=True
)
for line in result.stdout.strip().split('\n'):
parts = line.split(', ')
if len(parts) >= 3:
gpus.append({
"index": int(parts[0]),
"name": parts[1],
"memory_gb": float(parts[2].replace(' MiB', '')) / 1024
})
except:
pass
return gpus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Same: shell=True and bare except in GPU info.

Use CSV mode without a shell; log exceptions at debug level.

-            result = subprocess.run(
-                "nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader",
-                shell=True, capture_output=True, text=True
-            )
+            result = subprocess.run(
+                ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
+                check=False, capture_output=True, text=True
+            )
@@
-        except:
-            pass
+        except Exception:
+            # swallow in non-GPU environments
+            return []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
result = subprocess.run(
"nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader",
shell=True, capture_output=True, text=True
)
for line in result.stdout.strip().split('\n'):
parts = line.split(', ')
if len(parts) >= 3:
gpus.append({
"index": int(parts[0]),
"name": parts[1],
"memory_gb": float(parts[2].replace(' MiB', '')) / 1024
})
except:
pass
return gpus
result = subprocess.run(
["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
check=False, capture_output=True, text=True
)
for line in result.stdout.strip().split('\n'):
parts = line.split(', ')
if len(parts) >= 3:
gpus.append({
"index": int(parts[0]),
"name": parts[1],
"memory_gb": float(parts[2].replace(' MiB', '')) / 1024
})
except Exception:
# swallow in non-GPU environments
return []
return gpus
🧰 Tools
🪛 Ruff (0.12.2)

121-121: subprocess call with shell=True seems safe, but may be changed in the future; consider rewriting without shell

(S602)


122-122: Starting a process with a partial executable path

(S607)


133-133: Do not use bare except

(E722)


133-134: try-except-pass detected, consider logging the exception

(S110)

🤖 Prompt for AI Agents
In core/monitoring/system_monitor.py around lines 121 to 135, the GPU-detection
block uses shell=True and a bare except which is unsafe and hides errors; change
subprocess.run to call ["nvidia-smi", "--query-gpu=index,name,memory.total",
"--format=csv,noheader"] (no shell) and set check=True or check_returncode to
surface non-zero exits, parse the output with the csv module or robust
splitting, and replace the bare except with a specific exception handler
(subprocess.CalledProcessError, FileNotFoundError, ValueError) that logs the
exception at debug level using the module logger before returning the gpus list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Implement GraphSAGE for Theory-Practice Bridge Discovery

2 participants