Conversation

@r3d91ll (Owner) commented Sep 3, 2025

Summary by CodeRabbit

  • New Features

    • GitHub repository processing with Tree-sitter symbol extraction, graph-based repo→file→chunk→embedding storage, ArXiv metadata service, and an orchestration CLI.
  • Improvements

    • More robust PDF extraction (subprocess timeouts + fallback), file-based checkpoints/resume, richer paper/code/embedding metadata, transactional DB pooling and atomic storage, enhanced list-generation and collection tooling.
  • Bug Fixes

    • Hardened GPU/status monitoring, log timestamp parsing, and resilient extraction/embedding error handling.
  • Documentation

    • New ADRs, PRDs, READMEs, and usage/config guides.
  • Tests

    • Added integration, Tree-sitter, reliability, and large-scale test suites.
  • Chores

    • Expanded .gitignore and dependency/config updates.

coderabbitai bot and others added 20 commits August 28, 2025 03:46
Docstrings generation was requested by @r3d91ll.

* #6 (comment)

The following files were modified:

* `core/framework/extractors/code_extractor.py`
* `core/framework/extractors/docling_extractor.py`
* `core/processors/chunking_strategies.py`
* `core/processors/document_processor.py`
* `core/processors/generic_document_processor.py`
* `tests/monitor_overnight.py`
* `tests/monitor_phased.py`
* `tests/test_batch_processing.py`
* `tests/test_batch_processing_simple.py`
* `tests/test_document_processor.py`
* `tests/test_overnight_2000.py`
* `tests/test_overnight_arxiv_2000.py`
* `tests/test_overnight_arxiv_phased.py`
* `tools/arxiv/arxiv_document_manager.py`
* `tools/arxiv/arxiv_manager.py`
* `tools/arxiv/pipelines/arxiv_pipeline_v2.py`
* `tools/github/github_document_manager.py`
* `tools/github/github_pipeline.py`
📝 Add docstrings to `feature/github-integration`
- Add Tree-sitter symbol extraction for code analysis
- Implement graph-based storage architecture in ArangoDB
- Add robust PDF and code extractors with proper error handling
- Fix all CodeRabbit review issues (PyMuPDF, ProcessPoolExecutor, etc.)
- Add comprehensive documentation and ADRs
- Successfully process repositories at 100+ files/minute

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update `.gitignore` to include new data files for extended collections.
- Modify `collect_ai_papers_extended.py` to allow dynamic target count for paper collection.
- Implement `run_pipeline_from_list.py` to process papers from a specified list.
- Add `run_test_pipeline.py` for configurable testing of the ArXiv pipeline.
…idge analysis

This commit adds a comprehensive experiment analyzing the evolution of word embeddings
from word2vec (2013) → doc2vec (2014) → code2vec (2018) to identify theory-practice
bridges and measure "conveyance" - the ability to transfer knowledge from papers to
implementations.

Key Changes:
- Add word2vec_evolution experiment with complete pipeline
- Fix GitHub pipeline to support config-driven repository processing
- Create graph structure for connecting papers with code implementations
- Implement conveyance measurement framework (C = W×R×H/T × Ctx^α; a minimal illustrative sketch follows this list)
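A minimal, illustrative sketch of the conveyance formula as stated above. The function name and the reading of W, R, H, T, and Ctx are assumptions for illustration, not the framework's actual API:

```python
def conveyance(W: float, R: float, H: float, T: float, Ctx: float, alpha: float) -> float:
    """Illustrative only: C = (W × R × H / T) × Ctx^α.

    W, R, H are read here as quality/relevance/capability factors, T as time
    cost, Ctx as contextual fit, and alpha as the context amplification
    exponent (the commit reports α ≈ 2.0 for Gensim's doc2vec). These
    readings are assumptions; the real framework may define them differently.
    """
    if T <= 0:
        raise ValueError("T must be positive")
    return (W * R * H / T) * Ctx ** alpha


# Example usage with made-up factor values and alpha = 2.0
score = conveyance(W=0.8, R=0.9, H=0.7, T=1.5, Ctx=0.9, alpha=2.0)
```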

Experiment Highlights:
- Identified Gensim's doc2vec as "pure conveyance" (α ≈ 2.0)
  * Implementation created solely from paper without original code
  * Became de facto standard, empirically proving high conveyance
- Processed 3 papers and 3 repositories with embeddings
- Created edge collections for theory-practice relationships
- Established temporal evolution chain in ArangoDB graph

Technical Improvements:
- GitHub pipeline now reads repositories from config (consistency with ArXiv)
- Added graph collections: paper_implements_theory, semantic_similarity,
  temporal_evolution, conveyance_bridge
- Cleaned up deprecated configs and removed large data files from version control

Performance:
- Papers: 27 chunks, 2048-dim Jina v4 embeddings
- Repositories: 275 files, 1,453 embeddings
- Processing: 24.8s papers, 135.9s repositories

This provides the foundation for entropy mapping and conveyance analysis
to quantify how effectively knowledge transfers from academic theory to
practical implementations.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
- Implemented `update_pdf_sizes.py` to scan PDF files and update their sizes in the database.
- Created `utils.py` for utility functions, including normalization of arXiv IDs (a general sketch of such normalization follows this list).
- Developed orchestration module with `orchestrator.py` to manage the search, preview, refine, and process workflow for arXiv metadata.
- Added `list_generator.py` to generate paper lists from PostgreSQL based on search configurations.
- Introduced `check_existing_papers.py` script to verify which papers are already processed in ArangoDB.
- Enhanced logging and error handling across modules for better traceability and debugging.
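A minimal sketch of the usual arXiv ID normalization pattern (strip an optional version suffix, accept both new-style and old-style IDs). The helper name and exact validation rules here are assumptions for illustration, not necessarily what `utils.py` implements:

```python
import re

NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}$")              # e.g. 2301.12345
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}$")  # e.g. hep-th/9901001

def normalize_arxiv_id(arxiv_id: str) -> str:
    """Strip a trailing version suffix (v1, v2, ...) and validate the base ID."""
    base = re.sub(r"v\d+$", "", arxiv_id.strip())
    if not (NEW_STYLE.match(base) or OLD_STYLE.match(base)):
        raise ValueError(f"Unrecognized arXiv ID: {arxiv_id}")
    return base
```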
- Remove previously committed *_embeddings.json files from git tracking
- Add pattern to .gitignore to prevent future embedding files from being committed
- These are large generated files that should not be in version control
Security fixes:
- Remove hardcoded DSN credentials from db.yaml, use environment variables instead (see the sketch after this commit message)
- Fix .gitignore to only exclude test artifacts, not test source files
- Fix keywords field in arxiv_search_minimal.yaml (empty list instead of empty string)

Import path fixes:
- Update all imports from tools.arxiv.pipelines.arango_db_manager to core.database.arango_db_manager
- Fix incorrect import in demo_github_pipeline.py

These changes address critical security vulnerabilities and broken imports identified by CodeRabbit.
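As context for the DSN change above, a minimal sketch of assembling the connection string from environment variables; the variable names here are assumptions, not the project's actual configuration keys:

```python
import os

# Hypothetical environment variable names; the keys actually read by db.yaml
# and the config loader may differ.
dsn = (
    f"postgresql://{os.environ['PG_USER']}:{os.environ['PG_PASSWORD']}"
    f"@{os.environ.get('PG_HOST', 'localhost')}:{os.environ.get('PG_PORT', '5432')}"
    f"/{os.environ.get('PG_DATABASE', 'arxiv')}"
)
```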
ArangoDB Manager fixes:
- Fix with_connection decorator to inject via kwargs (works with methods)
- Improve TTL index checking to verify specific field presence
- Add overwrite parameter to insert_document with proper warnings
- Document data loss risks when overwrite=True

Extractor fixes:
- Fix tree-sitter generator detection to use node attribute
- Fix robust_extractor to use public shutdown API first
- Add proper exception handling for executor cleanup

These changes improve robustness and address potential issues identified by CodeRabbit.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
- Remove all JSON result files that were accidentally tracked
- Remove extracted papers and analysis files
- Update .gitignore to prevent these files from being added
- These are generated output files that should not be in version control
- Merge paper IDs from loaded JSON collection into collected_ids set
- Prevents re-collecting papers that are already in existing collections
- Handles both 'id' and 'arxiv_id' field names for compatibility
- Addresses CodeRabbit review comment in PR #16
@coderabbitai (Contributor) commented Sep 3, 2025

Walkthrough

Adds a large multi-source processing surface: ArangoDB manager, Postgres-backed ArXiv metadata service and migrations, robust PDF and Tree-sitter code extractors, GenericDocumentProcessor with phased extraction/embedding and checkpointing, GitHub repo processing and graph storage, many CLI/scripts/tests/configs/experiments, docs (ADRs/PRDs), packaging and .gitignore updates, and removal of autogenerated/legacy artifacts.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Repository housekeeping**<br>`.gitignore`, `pyproject.toml`, `README.md` | Add ignore patterns for generated artifacts; adjust Poetry packaging/deps (add psycopg/sqlalchemy/alembic); expand README to document GitHub processing, Tree-sitter, and the graph model. |
| **Package inits & bootstrapping**<br>`core/__init__.py`, `tools/__init__.py`, `core/database/__init__.py`, `tools/arxiv/db/__init__.py`, `tools/arxiv/orchestration/__init__.py` | New package initializers, module docstrings, and re-exports (ArangoDBManager, retry helper, orchestration facade, DB package exports). |
| **ArangoDB manager**<br>`core/database/arango_db_manager.py` | New ArangoDBManager with pooled connections, get/return connection API, with_connection decorator, insert/transaction/lock helpers, and a retry_with_backoff decorator (a minimal illustrative sketch follows the table). |
| **Extractors: Docling / Robust / Code / Tree-sitter**<br>`core/framework/extractors/docling_extractor.py`, `robust_extractor.py`, `code_extractor.py`, `tree_sitter_extractor.py` | Add RobustExtractor (subprocess isolation, timeout, PyMuPDF fallback); extend DoclingExtractor to handle text and fallback semantics; add TreeSitterExtractor (multi-language symbols/metrics/hash); CodeExtractor gains optional Tree-sitter integration with graceful fallback and enriched metadata. |
| **Chunking & processors**<br>`core/processors/*`, `core/processors/document_processor.py`, `core/processors/generic_document_processor.py` | Tokenizer interface expanded (callable or object); sliding/semantic chunking updates and docstring expansions; GenericDocumentProcessor adds staging-directory handling, phased extraction/embedding orchestration, worker init, Jina embedder init, ArangoDBManager wiring, enriched storage (symbols, repo, code metrics), and transactional storage/cleanup. |
| **Phase management & pipelines**<br>`tools/arxiv/pipelines/arxiv_pipeline.py`, `tools/arxiv/pipelines/arxiv_pipeline_v2.py` | PhaseManager metadata snapshot loading and staged-file enrichment; pipeline v2 adds file-based checkpointing (load/save/_should_use_checkpoint), source handling, and enriched result outputs. |
| **Postgres ArXiv metadata service & tooling**<br>`tools/arxiv/db/*` (configs, config dataclasses, pg util, alembic env/migrations, snapshot loader, OAI harvester, export/scan/update, compute assessment, helpers) | New typed DB config loader, psycopg connection util, Alembic environment plus initial migration, snapshot loader, OAI harvester to Postgres, export_ids/list generation, artifact scanners, pdf-size updaters, compute-assessment modules, and many CLI utilities. |
| **Orchestration & list generation**<br>`tools/arxiv/orchestration/orchestrator.py`, `tools/arxiv/utils/list_generator.py` | New ArxivPipelineOrchestrator (search→preview→refine→process state machine) and ArxivListGenerator for Postgres-driven ID list generation and outputs. |
| **ArXiv scripts, monitoring & tests**<br>`tools/arxiv/scripts/*`, `tools/arxiv/monitoring/*`, `tools/arxiv/tests/*`, `tests/*` | Many CLI utilities (collectors, runners, checkers), monitoring tooling (weekend dashboard), and new/extended integration and diagnostic tests (RobustExtractor, Tree-sitter, GitHub flows, large-scale runners, various test harnesses). |
| **GitHub processing workspace**<br>`tools/github/*`, `tools/github/configs/*`, `tools/github/README.md` | Add GitHubDocumentManager, demo pipeline, setup scripts, expanded config (more extensions/repos), and a README documenting graph-based storage and the Tree-sitter pipeline. |
| **Experiments: Word2Vec evolution**<br>`experiments/word2vec_evolution/*` | New experiment suite: configs, manifests, embedding generation, analysis, graph setup, import/store/verify scripts, experiment outputs and summaries. |
| **Docs, ADRs & PRDs**<br>`docs/**/*` | Add ADRs (0003–0005), PRDs (ArXiv metadata, GitHub integration), implementation notes, and issue write-ups. |
| **Removals / legacy cleanup**<br>`arxiv_pipeline_v2_results_*.json`, `old.claude/*` | Delete autogenerated pipeline result JSONs and legacy old.claude ADR/agent templates. |
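To make the ArangoDB manager row above concrete, here is a minimal sketch of what a retry_with_backoff decorator and a kwargs-injecting with_connection decorator typically look like. This is an illustration under assumptions; the actual signatures and pool API in `core/database/arango_db_manager.py` may differ.

```python
import functools
import random
import time


def retry_with_backoff(max_retries: int = 3, base_delay: float = 0.5, max_delay: float = 10.0):
    """Retry the wrapped call with exponential backoff and jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    time.sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator


def with_connection(func):
    """Inject a pooled connection via kwargs so the decorator also works on methods."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        conn = self.get_connection()          # assumed pool API
        try:
            kwargs.setdefault("connection", conn)
            return func(self, *args, **kwargs)
        finally:
            self.return_connection(conn)      # assumed pool API
    return wrapper
```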

Sequence Diagram(s)

```mermaid
%%{init: {"themeVariables": {"primaryColor":"#f5f7fa","actorBorder":"#244","noteBg":"#f0f4ff"}}}%%
sequenceDiagram
  autonumber
  actor User
  participant GDM as GitHubDocumentManager
  participant GDP as GenericDocumentProcessor
  participant EX as Code/Tree-sitter Extractor
  participant EMB as JinaV4Embedder
  participant DB as ArangoDBManager

  User->>GDM: prepare_repository(repo_url)
  GDM-->>User: tasks (file list)
  User->>GDP: process_documents(tasks)
  GDP->>EX: extract(file)
  EX-->>GDP: symbols/metrics + text
  GDP->>EMB: embed (late chunking)
  EMB-->>GDP: embeddings
  GDP->>DB: begin transaction → insert repo/file/chunk/embedding
  DB-->>GDP: commit
  GDP-->>User: summary (processed counts)
```
```mermaid
sequenceDiagram
  autonumber
  participant DP as DocumentProcessor
  participant RE as RobustExtractor
  participant DL as Docling Worker (subprocess)
  participant PM as PyMuPDF Fallback

  DP->>RE: extract(pdf_path)
  RE->>DL: run with timeout
  alt Docling success
    DL-->>RE: content
    RE-->>DP: result (docling)
  else timeout/error
    RE->>PM: fallback extract
    PM-->>RE: text-only content
    RE-->>DP: result (pymupdf_fallback)
  end
```
```mermaid
sequenceDiagram
  autonumber
  actor Operator
  participant ORCH as ArxivPipelineOrchestrator
  participant SQL as Postgres
  participant LG as ListGenerator
  participant PIPE as ArXiv Pipeline v2

  Operator->>ORCH: load search config
  ORCH->>SQL: assess_compute_cost(build_where)
  SQL-->>ORCH: assessment
  ORCH->>Operator: preview/refine
  Operator->>ORCH: approve
  ORCH->>LG: generate_list()
  LG-->>ORCH: paper_ids + files
  ORCH->>PIPE: run with list/config
  PIPE-->>ORCH: results file
  ORCH-->>Operator: completed + next-step cmd
```
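The second diagram shows the RobustExtractor fallback path (Docling in an isolated subprocess with a timeout, then a PyMuPDF text-only fallback). A minimal sketch of that pattern follows; the class name matches the PR, but the worker invocation and result shape are assumptions rather than the actual implementation.

```python
import subprocess
import sys
from pathlib import Path


class RobustExtractor:
    """Sketch: run Docling in a subprocess with a timeout; fall back to PyMuPDF text extraction."""

    def __init__(self, timeout_seconds: int = 300):
        self.timeout_seconds = timeout_seconds

    def extract(self, pdf_path: Path) -> dict:
        try:
            # Hypothetical worker module that prints extracted text to stdout.
            proc = subprocess.run(
                [sys.executable, "-m", "docling_worker", str(pdf_path)],
                capture_output=True, text=True, timeout=self.timeout_seconds, check=True,
            )
            return {"full_text": proc.stdout, "extractor": "docling"}
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
            return self._pymupdf_fallback(pdf_path)

    def _pymupdf_fallback(self, pdf_path: Path) -> dict:
        import fitz  # PyMuPDF

        with fitz.open(pdf_path) as doc:
            text = "\n".join(page.get_text() for page in doc)
        return {"full_text": text, "extractor": "pymupdf_fallback"}
```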

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Possibly related issues

  • ArXiv Metadata Service (Postgres) #10 — ArXiv Metadata Service: this PR adds tools/arxiv/db modules, Alembic migrations, and ingestion/harvest/export scripts matching that issue.
  • Enhancement: Tree-sitter Symbol Extraction and Graph-based Repository Storage #8 — Tree-sitter integration: TreeSitterExtractor and CodeExtractor changes implement Tree-sitter symbol extraction and metrics described by the issue.
  • r3d91ll/HADES#9 — ArangoDB transaction/lock API: ArangoDBManager implements pooled connections, transactions, retries, and TTL locks aligned with this issue.
  • r3d91ll/HADES#13 — GitHub processing & graph storage: tools/github workspace, Graph model, and pipeline changes correspond to this objective.

Possibly related PRs

Poem

A rabbit taps keys in the moonlit lab,
Tree-sitter hums and symbols gab,
Docling guards while PyMuPDF mends,
Graphs link papers to code as friends.
Hop—commit—pipelines sing; embeddings rise, applause that never ends 🐇✨


@coderabbitai (Contributor) left a comment

Actionable comments posted: 106

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (37)
tests/test_overnight_arxiv_2000.py (5)

54-109: Fix YYMM filtering (lexicographic bug across centuries) and make sampling reproducible.

  • Comparing "YYMM" as strings admits 1990s (e.g., "9912") when starting at "1501". Normalize to (YYYY, MM).
  • Add optional seed for deterministic runs.

Apply this diff:

-def collect_arxiv_papers(target_count: int = 2000, start_year: str = "1501") -> List[Tuple[Path, str]]:
+def collect_arxiv_papers(target_count: int = 2000, start_year: str = "1501", seed: int | None = None) -> List[Tuple[Path, str]]:
@@
-    Parameters:
+    Parameters:
         target_count (int): Maximum number of papers to collect.
         start_year (str): Inclusive YYMM directory name to start searching from (e.g., "1501" = January 2015).
+        seed (int | None): Optional RNG seed for reproducible sampling.
@@
-    papers = []
+    papers = []
+    rng = random.Random(seed) if seed is not None else random
@@
-    yymm_dirs = sorted([d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()])
-
-    # Filter to start from specified year
-    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
+    yymm_dirs = sorted(
+        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
+        key=lambda d: _normalize_yymm(d.name)
+    )
+    # Filter to start from specified year (normalize century)
+    start_cutoff = _normalize_yymm(start_year)
+    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
@@
-            sampled_pdfs = random.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files
+            sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files

Add this helper near the constants:

def _normalize_yymm(yymm: str) -> tuple[int, int]:
    if len(yymm) != 4 or not yymm.isdigit():
        raise ValueError(f"Invalid YYMM: {yymm}")
    yy, mm = int(yymm[:2]), int(yymm[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month in YYMM: {yymm}")
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return (year, mm)

223-244: Import heavy deps inside the function and make GPU use dynamic.

  • Localize imports to avoid import-time failures in non-GPU environments.
  • Detect CUDA and only enable FP16 when GPU is available.
@@
-        # Configure processor
-        config = ProcessingConfig(
+        # Local imports to avoid import-time side effects
+        from core.processors.document_processor import DocumentProcessor, ProcessingConfig
+        try:
+            import torch  # optional
+            use_gpu = torch.cuda.is_available()
+        except Exception:
+            use_gpu = False
+
+        # Configure processor
+        config = ProcessingConfig(
             chunking_strategy=test_config['strategy'],
             chunk_size_tokens=test_config['chunk_size'],
             chunk_overlap_tokens=test_config['overlap'],
             batch_size=test_config['batch_size'],
-            use_gpu=True,
-            use_fp16=True,  # Use FP16 for memory efficiency
+            use_gpu=use_gpu,
+            use_fp16=use_gpu,  # Only enable FP16 on GPU
             use_ocr=False,  # Disable OCR for speed
             extract_tables=True,  # Keep table extraction
             extract_equations=True  # Keep equation extraction
         )

185-189: Wire CLI start-year through to collection.

The --start-year flag currently has no effect.

-        papers = collect_arxiv_papers(num_docs)
+        papers = collect_arxiv_papers(num_docs, start_year=start_year)

400-407: Report the actual number of papers processed, not the requested count.

This corrects summary inaccuracies when fewer files are found or runs abort early.

-        'total_papers_processed': num_docs,
+        'total_papers_processed': len(papers),

459-459: Pass start_year from CLI to the runner.

Without this, --start-year is ignored.

-        results = run_arxiv_overnight_test(args.docs)
+        results = run_arxiv_overnight_test(args.docs, start_year=args.start_year)
tests/test_overnight_arxiv_phased.py (7)

49-52: Make STAGING_DIR portable and robust (fallback if /dev/shm is unavailable).

Hard-coding /dev/shm breaks on macOS/CI and raises FileNotFoundError. Also prefer parents=True.

Apply:

-# Staging directory for inter-phase communication
-STAGING_DIR = Path("/dev/shm/overnight_staging")
-STAGING_DIR.mkdir(exist_ok=True)
+# Staging directory for inter-phase communication
+# Prefer env override; fall back to /dev/shm if present, else system temp
+default_staging_base = Path("/dev/shm") if Path("/dev/shm").exists() else Path(os.getenv("TMPDIR", "/tmp"))
+STAGING_DIR = Path(os.getenv("STAGING_DIR", default_staging_base / "overnight_staging"))
+STAGING_DIR.mkdir(parents=True, exist_ok=True)

53-55: Parameterize ARXIV_PDF_BASE via environment for portability.

Hard-coded path will fail outside your lab host. Allow override.

-# ArXiv PDF repository
-ARXIV_PDF_BASE = Path("/bulk-store/arxiv-data/pdf")
+# ArXiv PDF repository
+ARXIV_PDF_BASE = Path(os.getenv("ARXIV_PDF_BASE", "/bulk-store/arxiv-data/pdf"))

78-83: Fix incorrect lexicographic comparison of year-month directory names.

String comparison can misorder mixed 4- vs 6-digit names (e.g., "9999" > "202001"). Compare numerically and sort by int.

-    yymm_dirs = sorted([d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()])
-    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
+    # Normalize to numeric comparison; accept 4- or 6-digit directory names
+    int_start = int(start_year)
+    yymm_dirs = sorted(
+        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
+        key=lambda p: int(p.name)
+    )
+    yymm_dirs = [d for d in yymm_dirs if int(d.name) >= int_start]

133-141: Improve I/O robustness and error logging.

Use encoding and log tracebacks for faster diagnosis.

-        with open(staging_file, 'w') as f:
+        with open(staging_file, 'w', encoding='utf-8') as f:
             json.dump(result, f)
-        return {'arxiv_id': arxiv_id, 'success': True, 'staged_path': str(staging_file)}
+        return {'arxiv_id': arxiv_id, 'success': True, 'staged_path': str(staging_file)}
@@
-    except Exception as e:
-        logger.error(f"Failed to extract {arxiv_id}: {e}")
+    except Exception as e:
+        logger.error(f"Failed to extract {arxiv_id}: {e}", exc_info=True)
         return {'arxiv_id': arxiv_id, 'success': False, 'error': str(e)}

268-281: Avoid setting CUDA_VISIBLE_DEVICES at runtime inside workers.

Setting the env var post-import doesn’t constrain already-initialized CUDA contexts and can confuse device indexing. Rely on torch.cuda.set_device only.

         if torch.cuda.is_available():
             torch.cuda.set_device(gpu_id)
-            os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)

303-312: Prevent OOM by not returning full chunks payload.

Aggregate counts and persist details elsewhere if required.

-        return {
+        return {
             'arxiv_id': arxiv_id,
             'success': True,
             'num_chunks': len(chunks_with_embeddings),
-            'chunks': chunks_with_embeddings,  # Store if needed
             'gpu_used': gpu_id if torch.cuda.is_available() else 'cpu'
         }

502-516: Compute overall metrics from actual phase outputs, not requested num_docs.

Avoids inflated rates when fewer papers are processed successfully.

-    total_time = extraction_results['phase_time'] + embedding_results['phase_time']
-    overall_rate = num_docs / total_time * 60
+    total_time = extraction_results['phase_time'] + embedding_results['phase_time']
+    processed_docs = embedding_results.get('total_files', 0) or extraction_results.get('total_papers', 0)
+    overall_rate = (processed_docs / total_time * 60) if total_time > 0 else 0
@@
-        'total_papers_processed': num_docs,
+        'total_papers_processed': processed_docs,
tools/arxiv/monitoring/monitor_overnight.py (1)

46-49: Guard against partial JSON reads while file is being written.

json.load() can fail if the writer hasn’t finished flushing. Retry instead of crashing.

Apply:

-        if result_file.exists():
-            with open(result_file) as f:
-                results = json.load(f)
+        if result_file.exists():
+            try:
+                with open(result_file) as f:
+                    results = json.load(f)
+            except json.JSONDecodeError:
+                # File may be mid-write; try again next cycle
+                time.sleep(10)
+                continue
tests/test_overnight_2000.py (1)

43-113: --mock flag currently does nothing; monkey patch happens before args are parsed.

The embedder is chosen at import time; flipping MOCK_EMBEDDER later won’t reapply the patch. Move the mock class and monkey patch behind a function and call it when args.mock is true.

-# Use real embedder by default for accurate testing
-MOCK_EMBEDDER = False  # Set to True only for quick debugging
-
-if MOCK_EMBEDDER:
-    logger.info("Using MOCK embedder for faster processing")
-    
-    class MockEmbedder:
-        """Mock embedder for testing without loading heavy models."""
-        ...
-    # Monkey patch the embedder
-    import core.framework.embedders
-    core.framework.embedders.JinaV4Embedder = MockEmbedder
-else:
-    logger.info("Using REAL Jina embedder - this will be slower")
+MOCK_EMBEDDER = False  # toggled via CLI
+
+def apply_mock_embedder():
+    logger.info("Using MOCK embedder for faster processing")
+    class MockEmbedder:
+        """Mock embedder for testing without loading heavy models."""
+        def __init__(self, *args, **kwargs):
+            self.embedding_dim = 2048
+        def embed_texts(self, texts: List[str], batch_size: int = 1):
+            import numpy as np
+            return np.random.randn(len(texts), self.embedding_dim).astype(np.float32)
+        def embed_with_late_chunking(self, text: str):
+            import numpy as np
+            from core.framework.embedders import ChunkWithEmbedding
+            words = text.split()
+            chunk_size = 256
+            chunks = []
+            for i in range(0, len(words), chunk_size - 50):
+                chunk_text = ' '.join(words[i:i+chunk_size])
+                chunk = ChunkWithEmbedding(
+                    text=chunk_text,
+                    embedding=np.random.randn(self.embedding_dim).astype(np.float32),
+                    start_char=i * 5,
+                    end_char=(i + len(chunk_text.split())) * 5,
+                    start_token=i,
+                    end_token=min(i + chunk_size, len(words)),
+                    chunk_index=len(chunks),
+                    total_chunks=0,
+                    context_window_used=chunk_size
+                )
+                chunks.append(chunk)
+            for c in chunks: c.total_chunks = len(chunks)
+            return chunks
+    import core.framework.embedders as _emb
+    _emb.JinaV4Embedder = MockEmbedder

And in main, after parsing args:

-    if args.mock:
-        MOCK_EMBEDDER = True
-        logger.info("WARNING: Using MOCK embedder - results won't reflect real performance!")
+    if args.mock:
+        MOCK_EMBEDDER = True
+        apply_mock_embedder()
+        logger.info("WARNING: Using MOCK embedder - results won't reflect real performance!")
tools/arxiv/monitoring/monitor_phased.py (1)

104-110: Fix summary slicing: lines.index(line) can return the first occurrence, not the last.

If “FINAL SUMMARY” appears multiple times, you may print the wrong block. Use the last occurrence.

-                # Print summary lines
-                for summary_line in lines[lines.index(line):]:
+                # Print summary lines from the last occurrence
+                try:
+                    start = len(lines) - 1 - list(reversed(lines)).index(line)
+                except ValueError:
+                    start = 0
+                for summary_line in lines[start:]:
                     if "Overall Rate:" in summary_line or "Extraction:" in summary_line or "Embedding:" in summary_line:
                         print(summary_line.strip())
                 return
tools/arxiv/arxiv_manager.py (4)

101-112: Docstring says “ignore version suffix,” but code validates original ID; use base_id.

Small logic/consistency fix; also removes an unused variable warning.

-        base_id = re.sub(r'v\d+$', '', arxiv_id)
+        base_id = re.sub(r'v\d+$', '', arxiv_id)
@@
-        if cls.ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.ARXIV_ID_PATTERN.match(base_id):
             return True, None
@@
-        if cls.OLD_ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.OLD_ARXIV_ID_PATTERN.match(base_id):
             return True, None

413-468: Missing transactional writes and distributed lock; risk of partial writes and duplicate processing.

Per learnings, writes should be atomic and guarded by a lock. Use Arango locks (arxiv_locks) and wrap all inserts in a single transaction.

Proposed flow:

  • acquire_lock(paper_info.sanitized_id) → skip or retry if locked
  • begin_transaction(write_collections=['arxiv_papers','arxiv_chunks','arxiv_embeddings','arxiv_structures']) // or tables/equations/images if you split
  • insert arxiv_papers, then chunks and structures within the same tx
  • commit; finally release_lock

Do you want me to draft a concrete implementation aligned with core.database.ArangoDBManager’s transaction API?
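As a rough illustration of that proposed flow, a minimal sketch follows. It assumes ArangoDBManager exposes acquire_lock/release_lock and begin_transaction helpers as named above; the exact signatures are assumptions, not the module's confirmed API.

```python
def store_paper_atomically(db_manager, paper_key, paper_doc, chunk_docs, embedding_docs, structures_doc):
    """Sketch of the lock-then-transaction flow; helper names and signatures are assumed."""
    if not db_manager.acquire_lock(paper_key):
        return False  # another worker holds the lock; skip or retry later
    try:
        txn_db = db_manager.begin_transaction(
            write_collections=['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings', 'arxiv_structures']
        )
        try:
            txn_db.collection('arxiv_papers').insert(paper_doc, overwrite=True)
            for doc in chunk_docs:
                txn_db.collection('arxiv_chunks').insert(doc, overwrite=True)
            for doc in embedding_docs:
                txn_db.collection('arxiv_embeddings').insert(doc, overwrite=True)
            txn_db.collection('arxiv_structures').insert(structures_doc, overwrite=True)
            txn_db.commit_transaction()
        except Exception:
            txn_db.abort_transaction()
            raise
    finally:
        db_manager.release_lock(paper_key)
    return True
```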


31-31: Update ArangoDBManager import to new core.database module
In tools/arxiv/arxiv_manager.py (line 31), change the import to:

-from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager
+from core.database.arango_db_manager import ArangoDBManager

389-391: Remove await from synchronous insert_document calls

In _store_arxiv_result (tools/arxiv/arxiv_manager.py), db_manager.insert_document is defined as a regular def, not async def, so awaiting it will raise. Drop the await before each call to:

  • self.db_manager.insert_document('arxiv_papers', arxiv_doc)
  • self.db_manager.insert_document('arxiv_embeddings', chunk_doc)
  • self.db_manager.insert_document('arxiv_structures', structures_doc)

Occurrences at lines 438–440, 454–455 and 466–467.

tools/github/github_document_manager.py (2)

268-270: Avoid sys.path manipulation; use absolute imports.

Modify to import DocumentTask via the package, per guidelines. The sys.path hack is fragile.

-        # Import here to avoid circular dependency
-        import sys
-        sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-        from core.processors.generic_document_processor import DocumentTask
+        from core.processors.generic_document_processor import DocumentTask

168-188: URL parsing is brittle; handle extra path segments and both HTTPS/SSH forms robustly.

Use urllib.parse and split on path safely; current split() will crash on owner/repo/subpath.

-        # Handle different URL formats
-        url = url.strip()
-        
-        # Extract owner/repo from various formats
-        if "github.com" in url:
-            # https://github.com/owner/repo or git@github.com:owner/repo
-            parts = url.split("github.com")[-1]
-            parts = parts.strip(":/").replace(".git", "")
-            owner, name = parts.split("/")
-        else:
-            # Assume owner/repo format
-            owner, name = url.split("/")
+        from urllib.parse import urlparse
+        url = url.strip()
+        owner = name = None
+        if "github.com" in url:
+            if url.startswith("git@"):
+                path = url.split("github.com:", 1)[-1]
+            else:
+                path = urlparse(url).path
+            parts = [p for p in path.replace(".git", "").strip("/").split("/") if p]
+            if len(parts) >= 2:
+                owner, name = parts[0], parts[1]
+        else:
+            parts = [p for p in url.split("/") if p]
+            if len(parts) == 2:
+                owner, name = parts
+        if not owner or not name:
+            raise ValueError(f"Unrecognized GitHub repo identifier: {url}")
core/processors/chunking_strategies.py (1)

477-497: SlidingWindow: emit a remainder/single chunk for short texts

Current guard skips any chunk when len(tokens) <= window_size. Emit a remainder/single chunk so short docs aren’t dropped.

-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
-                    
-                    chunk = TextChunk(
-                        text=chunk_text,
-                        start_char=start_char,
-                        end_char=start_char + len(chunk_text),
-                        chunk_index=len(chunks),
-                        metadata={
-                            'strategy': 'sliding_window_remainder',
-                            'token_count': len(remaining_tokens)
-                        }
-                    )
-                    chunks.append(chunk)
+        # Handle remainder or produce a single chunk for short inputs
+        last_index = len(chunks) * self.step_size
+        if last_index < len(tokens):
+            remaining_tokens = tokens[last_index:]
+            if remaining_tokens:  # emit any meaningful remainder
+                chunk_text = ' '.join(remaining_tokens)
+                start_char = len(' '.join(tokens[:last_index]))
+                chunk = TextChunk(
+                    text=chunk_text,
+                    start_char=start_char,
+                    end_char=start_char + len(chunk_text),
+                    chunk_index=len(chunks),
+                    metadata={
+                        'strategy': 'sliding_window_remainder' if chunks else 'sliding_window_single',
+                        'window_size': self.window_size,
+                        'step_size': self.step_size,
+                        'token_count': len(remaining_tokens)
+                    }
+                )
+                chunks.append(chunk)
tools/arxiv/arxiv_document_manager.py (2)

95-97: limit=0 bug: processes all files instead of none

if limit: treats 0 as falsy. Use is not None.

-        if limit:
-            pdf_files = pdf_files[:limit]
+        if limit is not None:
+            pdf_files = pdf_files[:limit]

246-252: Critical: validate_arxiv_id strips all digits and rejects valid IDs

rstrip('v0123456789') removes all trailing digits; for "2305.12345" it becomes "", failing validation. Parse the optional version suffix instead.

-        # Check paper number is 4-6 digits (or with version like 12345v2)
-        paper_num_base = paper_num.rstrip('v0123456789')
-        if not paper_num_base.isdigit() or not (4 <= len(paper_num_base) <= 6):
+        # Check paper number is 4–6 digits, optionally followed by version like v2
+        base, _, _ = paper_num.partition('v')
+        if not base.isdigit() or not (4 <= len(base) <= 6):
             return False
tools/arxiv/pipelines/arxiv_pipeline_v2.py (3)

167-186: Checkpoint ‘enabled’ should accept booleans and strings. Current logic misinterprets YAML true/false.

If config uses YAML booleans, _should_use_checkpoint falls into the auto branch erroneously.

Apply:

 def _should_use_checkpoint(self, count: int) -> bool:
@@
-        if self.checkpoint_enabled == 'true':
-            return True
-        elif self.checkpoint_enabled == 'false':
-            return False
-        else:  # auto mode
-            return count >= self.checkpoint_auto_threshold
+        v = self.checkpoint_enabled
+        # Accept booleans and common string variants
+        if isinstance(v, bool):
+            return v
+        if isinstance(v, str):
+            s = v.strip().lower()
+            if s in ('true', 'yes', 'on', '1'):
+                return True
+            if s in ('false', 'no', 'off', '0'):
+                return False
+            # auto or unknown → treat as auto
+        return count >= self.checkpoint_auto_threshold

254-267: Checkpoint never updates: gating on results['success'] is wrong.

GenericDocumentProcessor returns per-phase results; there’s no top-level ‘success’. Update based on embedding successes.

Apply:

-        # Step 3: Update checkpoint if enabled
-        if use_checkpoint and results.get('success'):
-            # Add successfully processed documents to checkpoint
-            if 'embedding' in results and 'success' in results['embedding']:
-                for doc_id in results['embedding']['success']:
-                    self.processed_ids.add(doc_id)
-                
-                # Save checkpoint periodically or at the end
-                self._save_checkpoint(
-                    extraction_results=results.get('extraction'),
-                    embedding_results=results.get('embedding')
-                )
-                logger.info(f"Checkpoint updated: {len(self.processed_ids)} total documents processed")
+        # Step 3: Update checkpoint if enabled
+        if use_checkpoint:
+            embed_success = results.get('embedding', {}).get('success', [])
+            if embed_success:
+                self.processed_ids.update(embed_success)
+                # Save checkpoint (periodic save_interval can be added later)
+                self._save_checkpoint(
+                    extraction_results=results.get('extraction'),
+                    embedding_results=results.get('embedding')
+                )
+                logger.info(f"Checkpoint updated: {len(self.processed_ids)} total documents processed")

277-294: Fix reporting: handle missing top-level ‘success’ and zero-division on avg chunks.

Avoid KeyError on results['success'] and only compute averages when total_processed > 0.

Apply:

-        if results['success']:
-            extraction = results['extraction']
-            embedding = results['embedding']
+        extraction = results.get('extraction', {})
+        embedding = results.get('embedding', {})
+        total_processed = results.get('total_processed', 0)
+        if extraction or embedding:
             logger.info(f"Extraction: {len(extraction['success'])} success, {len(extraction['failed'])} failed")
             logger.info(f"Embedding: {len(embedding['success'])} success, {len(embedding['failed'])} failed")
-            logger.info(f"Total processed: {results['total_processed']} documents")
+            logger.info(f"Total processed: {total_processed} documents")
 
-            if results['total_processed'] > 0:
-                rate = results['total_processed'] / elapsed * 60
+            if total_processed > 0:
+                rate = total_processed / elapsed * 60
                 logger.info(f"Processing rate: {rate:.1f} documents/minute")
-                
-                if 'chunks_created' in embedding:
-                    avg_chunks = embedding['chunks_created'] / results['total_processed']
-                    logger.info(f"Average chunks per document: {avg_chunks:.1f}")
+                if 'chunks_created' in embedding:
+                    avg_chunks = embedding['chunks_created'] / total_processed
+                    logger.info(f"Average chunks per document: {avg_chunks:.1f}")
core/framework/extractors/docling_extractor.py (3)

306-315: Flatten structured outputs and include a stable version to match downstream expectations

DocumentProcessor._extract_content reads top-level tables/equations/images/figures and version. Here, Docling results are nested under structures and lack version, so structured content is silently dropped and extractor_version becomes unknown.

Apply this diff to surface the structures at top level while keeping structures for backward compatibility and add version='docling_v2':

 return {
-    'full_text': full_text,
-    'markdown': full_text,  # Also provide as 'markdown' key
-    'structures': structures,
-    'metadata': {
-        'extractor': 'docling_v2',
-        'num_pages': num_pages,
-        'processing_time': None  # Docling doesn't provide this
-    }
+    'full_text': full_text,
+    'markdown': full_text,  # Also provide as 'markdown' key
+    # Flattened keys for downstream compatibility
+    'tables': structures.get('tables', []),
+    'equations': structures.get('equations', []),
+    'images': structures.get('images', []),
+    'figures': structures.get('figures', []),
+    # Retain original nested payload for backward compatibility
+    'structures': structures,
+    'metadata': {
+        'extractor': 'docling_v2',
+        'num_pages': num_pages,
+        'processing_time': None  # Populated by caller
+    },
+    'version': 'docling_v2'
 }

176-193: Fix processing_time accounting when falling back after a processing error

processing_time is computed before running the fallback path, under-reporting total time. Recompute after the fallback completes:

-        except Exception as e:
-            # Compute duration before handling exception
-            duration = (datetime.now() - start_time).total_seconds()
+        except Exception as e:
             # Don't use fallback for pre-validation errors (they should fail fast)
             error_msg = str(e)
             if any(x in error_msg for x in ["empty", "Invalid PDF header", "Cannot read PDF file"]):
                 logger.error(f"Pre-validation failed for {pdf_path}: {e}")
                 raise RuntimeError(f"Pre-validation failed for {pdf_path}: {e}") from e
             
             # Use fallback for processing errors if enabled
             if self.use_fallback:
                 logger.warning(f"Extraction failed, attempting fallback: {e}")
                 result = self._extract_fallback(pdf_path)
-                # Add processing time and pdf_path to metadata
-                result['metadata']['processing_time'] = duration
+                # Add processing time and pdf_path to metadata (include fallback runtime)
+                duration = (datetime.now() - start_time).total_seconds()
+                # Ensure metadata dict exists
+                result.setdefault('metadata', {})
+                result['metadata']['processing_time'] = duration
                 result['metadata']['pdf_path'] = str(pdf_path)
                 return result

388-401: Retain top‐level pdf_path or update all downstream usages
Multiple consumers index extracted['pdf_path'] (e.g. tools/arxiv/pipelines/arxiv_pipeline.py:95, experiments/word2vec_evolution/store_in_arango.py:65, tests/test_overnight_arxiv_phased.py:128). Removing that key in extract_batch will break these; either keep the top‐level duplication for backward compatibility or refactor every reference to use metadata['pdf_path'].

core/processors/document_processor.py (2)

392-403: Honor nested structures from extractors for backward compatibility

If upstream extractors return structures (as in current Docling path), tables/equations/images are otherwise lost. Merge from structures when top-level keys are missing:

-        return ExtractionResult(
-            full_text=full_text,
-            tables=docling_result.get('tables', []),
-            equations=docling_result.get('equations', []),
-            images=docling_result.get('images', []),
-            figures=docling_result.get('figures', []),
-            metadata=docling_result.get('metadata', {}),
-            latex_source=latex_source,
-            has_latex=has_latex,
-            extraction_time=time.time() - start_time,
-            extractor_version=docling_result.get('version', 'unknown')
-        )
+        structures = docling_result.get('structures', {}) or {}
+        return ExtractionResult(
+            full_text=full_text,
+            tables=docling_result.get('tables') or structures.get('tables', []),
+            equations=docling_result.get('equations') or structures.get('equations', []),
+            images=docling_result.get('images') or structures.get('images', []),
+            figures=docling_result.get('figures') or structures.get('figures', []),
+            metadata=docling_result.get('metadata', {}),
+            latex_source=latex_source,
+            has_latex=has_latex,
+            extraction_time=time.time() - start_time,
+            extractor_version=docling_result.get('version', 'unknown')
+        )

491-505: Make embedding generation robust to runtime errors

If embed_texts raises, process_document fails. Catch and degrade to zero-vector as documented.

-            # Generate embedding for chunk
-            # JinaV4Embedder expects a list and returns array of embeddings
-            embeddings = self.embedder.embed_texts([chunk['text']], batch_size=1)
-            embedding = embeddings[0] if len(embeddings) > 0 else None
+            # Generate embedding for chunk (robust to failures)
+            try:
+                embeddings = self.embedder.embed_texts([chunk['text']], batch_size=1)
+                embedding = embeddings[0] if len(embeddings) > 0 else None
+            except Exception as e:
+                logger.warning(f"Embedding failed for chunk {i}: {type(e).__name__}: {e}")
+                embedding = None
tools/arxiv/scripts/run_pipeline_from_list.py (1)

1-300: Enable Black and fix all lint violations in tools/arxiv
Environment lacks a black command and Ruff still reports 229 violations in tools/arxiv—including blank-line whitespace (W293), trailing whitespace (W291), unused variables (F841), bare except clauses (E722), and module‐level import ordering (E402). Install Black (pip install black), ensure it’s on PATH, then re-run

ruff check tools/arxiv --fix  
black tools/arxiv

to auto‐fix and manually address any remaining errors.

tools/arxiv/pipelines/arxiv_pipeline.py (1)

666-675: Align storage with required Arango collections and validate embedding dimensionality.

Guidelines require: arxiv_papers, arxiv_chunks, arxiv_embeddings, arxiv_structures; also vectors must be 2048-d with model='jina-v4'.

  • Include arxiv_structures in the transaction’s write set and store structures in one document.
  • Assert/guard that each embedding vector has 2048 dims; fail-fast otherwise.
-                write_collections = ['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings']
-                structures = doc.get('structures', {})
-                if structures.get('equations'):
-                    write_collections.append('arxiv_equations')
-                if structures.get('tables'):
-                    write_collections.append('arxiv_tables')
-                if structures.get('images'):
-                    write_collections.append('arxiv_images')
+                write_collections = ['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings', 'arxiv_structures']
+                structures = doc.get('structures', {})
-                        txn_db.collection('arxiv_embeddings').insert({
+                        emb_vec = chunk.embedding.tolist()
+                        if len(emb_vec) != 2048:
+                            raise ValueError(f"Unexpected embedding size {len(emb_vec)} for {arxiv_id}; expected 2048")
+                        txn_db.collection('arxiv_embeddings').insert({
                             '_key': f"{chunk_key}_emb",
                             'paper_id': sanitized_id,
                             'chunk_id': chunk_key,
-                            'vector': chunk.embedding.tolist(),
+                            'vector': emb_vec,
                             'model': 'jina-v4',
                             'embedding_method': 'late_chunking',
                             'embedding_date': datetime.now().isoformat()
                         }, overwrite=True)
-                    # Store equations if present
-                    for i, eq in enumerate(structures.get('equations', [])):
-                        txn_db.collection('arxiv_equations').insert({
-                            '_key': f"{sanitized_id}_eq_{i}",
-                            'paper_id': sanitized_id,
-                            'equation_index': i,
-                            'latex': eq.get('latex', ''),
-                            'label': eq.get('label', ''),
-                            'type': eq.get('type', 'display')
-                        }, overwrite=True)
-                    
-                    # Store tables if present
-                    for i, tbl in enumerate(structures.get('tables', [])):
-                        txn_db.collection('arxiv_tables').insert({
-                            '_key': f"{sanitized_id}_table_{i}",
-                            'paper_id': sanitized_id,
-                            'table_index': i,
-                            'caption': tbl.get('caption', ''),
-                            'headers': tbl.get('headers', []),
-                            'data_rows': tbl.get('rows', [])
-                        }, overwrite=True)
-                    
-                    # Store images if present
-                    for i, img in enumerate(structures.get('images', [])):
-                        txn_db.collection('arxiv_images').insert({
-                            '_key': f"{sanitized_id}_img_{i}",
-                            'paper_id': sanitized_id,
-                            'image_index': i,
-                            'caption': img.get('caption', ''),
-                            'type': img.get('type', 'figure')
-                        }, overwrite=True)
+                    # Store all detected structures in a single document
+                    txn_db.collection('arxiv_structures').insert({
+                        '_key': sanitized_id,
+                        'paper_id': sanitized_id,
+                        'equations': structures.get('equations', []),
+                        'tables': structures.get('tables', []),
+                        'images': structures.get('images', []),
+                        'counts': {
+                            'equations': len(structures.get('equations', [])),
+                            'tables': len(structures.get('tables', [])),
+                            'images': len(structures.get('images', [])),
+                        }
+                    }, overwrite=True)

Also applies to: 740-748, 750-781

core/processors/generic_document_processor.py (2)

154-160: Make extraction task timeout configurable.

Hard-coded 120s may be too short. Read from config with a sensible default.

-        num_workers = extraction_config.get('workers', 8)
+        num_workers = extraction_config.get('workers', 8)
         gpu_devices = extraction_config.get('gpu_devices', [0, 1])
+        task_timeout = extraction_config.get('task_timeout_seconds', 600)
@@
-                        result = future.result(timeout=120)
+                        result = future.result(timeout=task_timeout)

Also applies to: 179-184


409-414: Fix ArangoDBManager import to honor package layout and avoid sys.path hacks.

Prefer absolute imports per guidelines; keep a guarded fallback if needed.

-    import sys
-    from pathlib import Path
-    sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-    from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager
+    try:
+        from core.database.arango_db_manager import ArangoDBManager  # preferred
+    except Exception:
+        from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager  # fallback
tools/arxiv/db/config.py (1)

1-152: Install Black, update pyproject.toml, and run Ruff + Black on tools/arxiv/
Move ignore, select, and per-file-ignores into the lint section of pyproject.toml, then install Black (pip install black), and run:

ruff check tools/arxiv --fix
black tools/arxiv

Comment on lines 111 to 124

```python
    # First, ensure the GitHub graph is set up
    setup_script = Path(__file__).parent.parent.parent / 'tools' / 'github' / 'setup_github_graph.py'
    if setup_script.exists():
        logger.info("\nEnsuring GitHub graph collections exist...")
        result = subprocess.run(
            [sys.executable, str(setup_script)],
            capture_output=True,
            text=True,
            cwd=str(setup_script.parent)
        )
        if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
            logger.info("  ✓ GitHub graph collections ready")

    # Process each repository
```
@coderabbitai (Contributor) commented Sep 3, 2025

🛠️ Refactor suggestion

Check setup_github_graph return code and surface failures

Currently only scanning stdout text; a failing script can be silently ignored.

-        result = subprocess.run(
+        result = subprocess.run(
             [sys.executable, str(setup_script)],
             capture_output=True,
             text=True,
             cwd=str(setup_script.parent)
         )
-        if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
-            logger.info("  ✓ GitHub graph collections ready")
+        if result.returncode == 0 and (
+            "created" in result.stdout.lower() or "already exists" in result.stdout.lower()
+        ):
+            logger.info("  ✓ GitHub graph collections ready")
+        else:
+            logger.warning(f"  ⚠ GitHub graph setup may have failed (rc={result.returncode}). Inspect output:")
+            logger.warning(result.stdout.strip()[:1000])
+            logger.warning(result.stderr.strip()[:1000])
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/process_github_repos.py around lines 111 to
124, the wrapper currently only inspects stdout text from running
setup_github_graph.py and can silently ignore failures; update it to check
subprocess.CompletedProcess.returncode and treat non-zero return codes as
errors: after running subprocess.run, if result.returncode != 0 log an error
including result.stderr (and result.stdout for context), and raise SystemExit or
an exception (or sys.exit(1)) to stop processing; keep the existing success
check for the positive messages but only after confirming returncode == 0.

Owner Author

@coderabbitai fix submitted. please confirm

Contributor

@r3d91ll I can see that the fix has not yet been applied to the code. Looking at the current diff for experiments/word2vec_evolution/process_github_repos.py, the code still shows the original implementation on lines 111-124:

result = subprocess.run(
    [sys.executable, str(setup_script)],
    capture_output=True,
    text=True,
    cwd=str(setup_script.parent)
)
if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
    logger.info("  ✓ GitHub graph collections ready")

The suggested improvements are missing:

  • No result.returncode check
  • No error handling for non-zero return codes
  • No stderr logging on failures
  • No warning when setup fails

To apply the fix, you need to replace the current setup_github_graph section with the improved version that checks the return code and handles failures properly.
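For reference, a minimal sketch of what the corrected setup step could look like once the suggestion is applied. The logging wording and the choice to abort via SystemExit are assumptions; the variables and success check follow the snippet above:

# Sketch only: assumes logger, sys, subprocess and setup_script are defined as above.
if setup_script.exists():
    logger.info("\nEnsuring GitHub graph collections exist...")
    result = subprocess.run(
        [sys.executable, str(setup_script)],
        capture_output=True,
        text=True,
        cwd=str(setup_script.parent)
    )
    if result.returncode != 0:
        # Surface the failure instead of silently continuing.
        logger.error(f"GitHub graph setup failed (rc={result.returncode})")
        logger.error(result.stdout.strip()[:1000])
        logger.error(result.stderr.strip()[:1000])
        raise SystemExit(1)
    if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
        logger.info("  ✓ GitHub graph collections ready")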

Comment on lines +16 to +23
# Fix Python path
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(script_dir.parent))

from pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database import ArangoDBManager

🛠️ Refactor suggestion

Fix imports and satisfy Ruff E402; prefer absolute package paths

Move sys.path adjustments before any imports and use absolute imports.

-# Fix Python path
-script_dir = Path(__file__).resolve().parent
-project_root = script_dir.parent.parent.parent
-sys.path.insert(0, str(project_root))
-sys.path.insert(0, str(script_dir.parent))
-
-from pipelines.arxiv_pipeline import ACIDPhasedPipeline
-from core.database import ArangoDBManager
+# Fix Python path (before imports)
+script_dir = Path(__file__).resolve().parent
+project_root = script_dir.parent.parent.parent
+sys.path.insert(0, str(project_root))
+
+from tools.arxiv.pipelines.arxiv_pipeline import ACIDPhasedPipeline
+from core.database.arango_db_manager import ArangoDBManager

Also remove unused imports (json, datetime) to keep lint clean.

📝 Committable suggestion


Suggested change
# Fix Python path
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(script_dir.parent))
from pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database import ArangoDBManager
# Fix Python path (before imports)
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))
from tools.arxiv.pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database.arango_db_manager import ArangoDBManager
🧰 Tools
🪛 Ruff (0.12.2)

22-22: Module level import not at top of file

(E402)


23-23: Module level import not at top of file

(E402)

🤖 Prompt for AI Agents
In tools/arxiv/tests/quick_pipeline_test.py around lines 16 to 23, the test
currently performs sys.path modifications after module imports which triggers
Ruff E402 and uses relative or non-absolute imports plus includes unused
imports; move the sys.path.insert(...) calls to the very top of the file before
any import statements, change imports to absolute package paths (e.g., from
pipelines.arxiv_pipeline import ACIDPhasedPipeline -> use the full package
absolute import as appropriate in your project), and remove any unused imports
like json and datetime so the file imports are ordered and lint-clean.

Comment on lines +96 to +101
print(f"\nProcessing {len(available_papers)} available papers...")

# The pipeline processes from filesystem, not from a list
# We'll use the 'local' source and let it find papers
pipeline.run(source='local', count=len(available_papers))


🛠️ Refactor suggestion

Ensure the pipeline processes the exact test papers

source='local' + count=N may pick arbitrary PDFs. Populate specific_list with available_papers and use that source.

-        # The pipeline processes from filesystem, not from a list
-        # We'll use the 'local' source and let it find papers
-        pipeline.run(source='local', count=len(available_papers))
+        # Force the pipeline to process exactly our IDs
+        pipeline.config.setdefault('processing', {}).setdefault('specific_list', {})['arxiv_ids'] = available_papers
+        pipeline.run(source='specific_list', count=len(available_papers))
📝 Committable suggestion


Suggested change
print(f"\nProcessing {len(available_papers)} available papers...")
# The pipeline processes from filesystem, not from a list
# We'll use the 'local' source and let it find papers
pipeline.run(source='local', count=len(available_papers))
print(f"\nProcessing {len(available_papers)} available papers...")
# Force the pipeline to process exactly our IDs
pipeline.config.setdefault('processing', {}).setdefault('specific_list', {})['arxiv_ids'] = available_papers
pipeline.run(source='specific_list', count=len(available_papers))
🤖 Prompt for AI Agents
In tools/arxiv/tests/quick_pipeline_test.py around lines 96 to 101, the test
currently uses source='local' with count which can pick arbitrary PDFs; instead
populate the pipeline's specific_list with available_papers and call
pipeline.run with source='specific_list' (pass the specific_list argument set to
available_papers) so the pipeline processes exactly those test papers rather
than any local PDFs; remove or ignore the count parameter when using
specific_list.

# =========================================
# Runs the complete test suite for 5000+ papers

set -e # Exit on error

🛠️ Refactor suggestion

Propagate failures in pipelines (tee): enable pipefail

Without set -o pipefail, failures in python ... | tee won’t be detected.

-set -e  # Exit on error
+set -e  # Exit on error
+set -o pipefail  # Propagate failures through pipelines
📝 Committable suggestion


Suggested change
set -e # Exit on error
set -e # Exit on error
set -o pipefail # Propagate failures through pipelines
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around line 6, the script uses "set
-e" but doesn't enable pipefail so failures in pipelines like "python ... | tee"
can be ignored; update the shell options to include "set -o pipefail" (or change
to "set -euo pipefail" if you also want nounset and errexit behavior) so
pipeline failures are propagated and the script exits on any command in a
pipeline failing.

Comment on lines +89 to +91
echo "Clearing previous staging data..."
rm -rf "$STAGING_DIR"/*


🛠️ Refactor suggestion

Safe rm -rf: guard against accidental “/*”

Harden deletes with parameter expansion.

-rm -rf "$STAGING_DIR"/*
+rm -rf "${STAGING_DIR:?}/"*

Repeat the same change at Line 175.

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 90-90: Use "${var:?}" to ensure this never expands to /* .

(SC2115)

🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 89-91 (and again at
line 175), the unguarded rm -rf "$STAGING_DIR"/* can evaluate to rm -rf /* if
STAGING_DIR is empty; replace it with a safe parameter expansion guard so the
command aborts when STAGING_DIR is unset/empty (e.g., use the ${VAR:?} form) and
apply the same change at line 175 to harden both deletes.

Comment on lines +137 to +144
# Run the full test suite
python test_large_scale_processing.py \
--config ../configs/large_scale_test.yaml \
--papers "$PAPER_LIST" \
--arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"

TEST_EXIT_CODE=$?


⚠️ Potential issue

Capture the Python exit code when piping through tee

$? captures tee’s status. Use ${PIPESTATUS[0]}.

 python test_large_scale_processing.py \
     --config ../configs/large_scale_test.yaml \
     --papers "$PAPER_LIST" \
     --arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"
 
-TEST_EXIT_CODE=$?
+TEST_EXIT_CODE=${PIPESTATUS[0]}
📝 Committable suggestion


Suggested change
# Run the full test suite
python test_large_scale_processing.py \
--config ../configs/large_scale_test.yaml \
--papers "$PAPER_LIST" \
--arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"
TEST_EXIT_CODE=$?
# Run the full test suite
python test_large_scale_processing.py \
--config ../configs/large_scale_test.yaml \
--papers "$PAPER_LIST" \
--arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"
TEST_EXIT_CODE=${PIPESTATUS[0]}
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 137 to 144, the script
currently captures the exit code with TEST_EXIT_CODE=$? after piping the python
command through tee, which returns tee's status instead of the Python process;
replace that line with TEST_EXIT_CODE=${PIPESTATUS[0]} (or enable pipefail and
use $? accordingly) to capture the Python process exit code, and ensure the
script is executed with bash (shebang or invoked with bash) so ${PIPESTATUS[@]}
is supported.

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +42 to +66
def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
"""Fallback extraction using PyMuPDF."""
if not PYMUPDF_AVAILABLE:
return None

try:
with fitz.open(pdf_path) as doc:
text_parts = []
num_pages = len(doc)

for page_num, page in enumerate(doc):
page_text = page.get_text()
if page_text.strip():
text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")

full_text = "\n\n".join(text_parts)

return {
'full_text': full_text,
'text': full_text,
'markdown': full_text,
'num_pages': num_pages,
'extractor': 'pymupdf_fallback'
}
except Exception as e:

🛠️ Refactor suggestion

Return a schema-compatible fallback result (include metadata/structures)

Current fallback payload lacks the 'metadata' block used elsewhere and deviates from DoclingExtractor’s shape.

Apply:

-def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
+def _extract_with_pymupdf(pdf_path: str) -> dict[str, Any]:
@@
-            return {
-                'full_text': full_text,
-                'text': full_text,
-                'markdown': full_text,
-                'num_pages': num_pages,
-                'extractor': 'pymupdf_fallback'
-            }
+            return {
+                'full_text': full_text,
+                'markdown': full_text,
+                'structures': {},
+                'metadata': {
+                    'extractor': 'pymupdf_fallback',
+                    'num_pages': num_pages,
+                    'pdf_path': str(pdf_path),
+                    'processing_time': None,
+                    'warning': 'Using fallback extractor - structures not extracted'
+                }
+            }
📝 Committable suggestion


Suggested change
def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
"""Fallback extraction using PyMuPDF."""
if not PYMUPDF_AVAILABLE:
return None
try:
with fitz.open(pdf_path) as doc:
text_parts = []
num_pages = len(doc)
for page_num, page in enumerate(doc):
page_text = page.get_text()
if page_text.strip():
text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")
full_text = "\n\n".join(text_parts)
return {
'full_text': full_text,
'text': full_text,
'markdown': full_text,
'num_pages': num_pages,
'extractor': 'pymupdf_fallback'
}
except Exception as e:
def _extract_with_pymupdf(pdf_path: str) -> dict[str, Any]:
"""Fallback extraction using PyMuPDF."""
if not PYMUPDF_AVAILABLE:
return None
try:
with fitz.open(pdf_path) as doc:
text_parts = []
num_pages = len(doc)
for page_num, page in enumerate(doc):
page_text = page.get_text()
if page_text.strip():
text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")
full_text = "\n\n".join(text_parts)
return {
'full_text': full_text,
'markdown': full_text,
'structures': {},
'metadata': {
'extractor': 'pymupdf_fallback',
'num_pages': num_pages,
'pdf_path': str(pdf_path),
'processing_time': None,
'warning': 'Using fallback extractor - structures not extracted'
}
}
except Exception as e:
🤖 Prompt for AI Agents
In core/framework/extractors/robust_extractor.py around lines 42 to 66, the
pymupdf fallback return lacks the expected DoclingExtractor-shaped keys; update
the returned dict to include a top-level 'metadata' dict (e.g., {'source':
pdf_path, 'num_pages': num_pages, 'extractor': 'pymupdf_fallback'}) and a
'structures' list (e.g., per-page objects with page_number and text) in addition
to the existing text/full_text/markdown fields so the payload matches the rest
of the codebase's schema.

Comment on lines +11 to +19
- **Schema Migration**: Added columns for compute cost assessment
- `pdf_size_bytes`: Track actual PDF file sizes
- `processing_complexity`: Estimates processing difficulty
- `estimated_tokens`: Estimated token count for LLM processing
- `created_at`, `modified_at`: Timestamps managed by triggers
- **Migration Script**: `migrations/add_compute_assessment_columns.sql`
- Successfully applied to existing database with 2.7M papers
- Includes indexes for performance


💡 Verification agent

🧩 Analysis chain

Docs reference a SQL migration not shown in this PR; verify file exists and is chained.


🏁 Script executed:

#!/bin/bash
# Verify referenced migration exists and links to this initial revision
fd -t f -a 'add_compute_assessment_columns.sql' .
fd -t f -a 'tools/arxiv/db/migrations/versions' | sort
rg -n "down_revision\s*=\s*[\"']20250901_000001_initial_schema[\"']" tools/arxiv/db/migrations/versions -S

Length of output: 659


🏁 Script executed:

#!/bin/bash
set -e

echo "Finding add_compute_assessment_columns.sql"
find . -type f -name 'add_compute_assessment_columns.sql' || true

echo "Searching codebase for migration name references"
rg -l 'add_compute_assessment_columns' -F . || true

echo "Listing migrations directory under tools/arxiv/db/migrations"
find tools/arxiv/db/migrations -type f -print || true

Length of output: 656


Add missing migration file and versioned script
The docs reference migrations/add_compute_assessment_columns.sql, but that SQL file isn’t in the repo and there’s no Alembic version under tools/arxiv/db/migrations/versions chaining from 20250901_000001_initial_schema.py. Add the SQL migration and a corresponding version script with down_revision='20250901_000001_initial_schema'.

🧰 Tools
🪛 LanguageTool

[grammar] ~12-~12: There might be a mistake here.
Context: ...size_bytes: Track actual PDF file sizes - processing_complexity: Estimates processing difficulty - e...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...lexity: Estimates processing difficulty - estimated_tokens`: Estimated token count for LLM processi...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...Estimated token count for LLM processing - created_at, modified_at: Timestamps managed by t...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...fied_at: Timestamps managed by triggers - **Migration Script**: migrations/add_comp...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...ged by triggers - Migration Script: migrations/add_compute_assessment_columns.sql - Successfully applied to existing databas...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...ed to existing database with 2.7M papers - Includes indexes for performance ### 2....

(QB_NEW_EN)

🤖 Prompt for AI Agents
In docs/implementation/arxiv_metadata_service_implementation.md around lines 11
to 19, the docs reference a missing migration file and Alembic version script;
add a SQL migration at migrations/add_compute_assessment_columns.sql containing
the DDL to add pdf_size_bytes, processing_complexity, estimated_tokens,
created_at, modified_at plus appropriate indexes and triggers, and create a
corresponding Alembic version script under tools/arxiv/db/migrations/versions
named with a proper timestamp that sets
down_revision='20250901_000001_initial_schema', implements upgrade() to run the
SQL (or emit equivalent SQLAlchemy migration steps) and downgrade() to revert
the changes, register the new revision in Alembic env, commit both files, and
verify by running the alembic upgrade to ensure the migration applies cleanly
against the existing DB.
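For illustration, a minimal Alembic revision along those lines might look like the sketch below. Only the column names and the down_revision come from the notes above; the revision id, table name, column types, and index name are assumptions, and the trigger that maintains modified_at is omitted:

"""Add compute assessment columns (sketch)."""
from alembic import op
import sqlalchemy as sa

revision = "20250903_000002_compute_assessment"  # hypothetical revision id
down_revision = "20250901_000001_initial_schema"
branch_labels = None
depends_on = None

TABLE = "papers"  # assumption: replace with the actual metadata table name


def upgrade() -> None:
    op.add_column(TABLE, sa.Column("pdf_size_bytes", sa.BigInteger(), nullable=True))
    op.add_column(TABLE, sa.Column("processing_complexity", sa.Float(), nullable=True))
    op.add_column(TABLE, sa.Column("estimated_tokens", sa.Integer(), nullable=True))
    op.add_column(TABLE, sa.Column("created_at", sa.DateTime(timezone=True), server_default=sa.text("now()")))
    op.add_column(TABLE, sa.Column("modified_at", sa.DateTime(timezone=True), server_default=sa.text("now()")))
    op.create_index("idx_papers_estimated_tokens", TABLE, ["estimated_tokens"])


def downgrade() -> None:
    op.drop_index("idx_papers_estimated_tokens", table_name=TABLE)
    for column in ("modified_at", "created_at", "estimated_tokens", "processing_complexity", "pdf_size_bytes"):
        op.drop_column(TABLE, column)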

Comment on lines +162 to +167
- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
- Tree-sitter for code parsing
- Jina v4 code adapter for embeddings
- ArangoDB for graph storage
- PostgreSQL for metadata


🛠️ Refactor suggestion

Replace local absolute path with repo-relative reference.

Hard-coding /home/todd/... breaks for all other environments.

-- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
+- Existing GitHub tools in `tools/github/`
📝 Committable suggestion


Suggested change
- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
- Tree-sitter for code parsing
- Jina v4 code adapter for embeddings
- ArangoDB for graph storage
- PostgreSQL for metadata
- Existing GitHub tools in `tools/github/`
- Tree-sitter for code parsing
- Jina v4 code adapter for embeddings
- ArangoDB for graph storage
- PostgreSQL for metadata
🤖 Prompt for AI Agents
In docs/issues/github_theory_practice_bridge_detection.md around lines 162 to
167, replace the hard-coded absolute path
"/home/todd/olympus/HADES/tools/github/" with a repository-relative reference
(for example "tools/github/" or "./tools/github/") or a placeholder like
"$REPO_ROOT/tools/github/" so the docs work across environments; update the list
entry accordingly and ensure any adjacent instructions reflect the new
repo-relative path convention.

Comment on lines +82 to +88
## Database Status
- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
- 2,491 total papers (including our 3)
- 42,511 total chunks
- 43,511 total embeddings
- 1 GitHub repository (word2vec only)


⚠️ Potential issue

Redact internal host details and sensitive counts.

Publishing a private IP (192.168.1.69) and internal corpus sizes invites information leakage. Reference env-configured endpoints instead and omit exact totals (or move to an internal report).

Apply this diff:

-## Database Status
-- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
-  - 2,491 total papers (including our 3)
-  - 42,511 total chunks
-  - 43,511 total embeddings
-  - 1 GitHub repository (word2vec only)
+## Database Status (sanitized)
+- ArangoDB database and collections reachable via ARANGO_HOST/ARANGO_PASSWORD env vars.
+- Document, chunk, and embedding totals available internally (omit from public docs).
+- GitHub repository ingestion validated for at least one repository.
📝 Committable suggestion


Suggested change
## Database Status
- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
- 2,491 total papers (including our 3)
- 42,511 total chunks
- 43,511 total embeddings
- 1 GitHub repository (word2vec only)
## Database Status (sanitized)
- ArangoDB database and collections reachable via ARANGO_HOST/ARANGO_PASSWORD env vars.
- Document, chunk, and embedding totals available internally (omit from public docs).
- GitHub repository ingestion validated for at least one repository.
🧰 Tools
🪛 LanguageTool

[grammar] ~82-~82: There might be a mistake here.
Context: ...e analysis (partial) ## Database Status - ArangoDB (academy_store @ 192.168.1....

(QB_NEW_EN)


[grammar] ~83-~83: There might be a mistake here.
Context: ...** (academy_store @ 192.168.1.69:8529) - 2,491 total papers (including our 3) -...

(QB_NEW_EN)


[grammar] ~84-~84: There might be a mistake here.
Context: ... - 2,491 total papers (including our 3) - 42,511 total chunks - 43,511 total emb...

(QB_NEW_EN)


[grammar] ~85-~85: There might be a mistake here.
Context: ...including our 3) - 42,511 total chunks - 43,511 total embeddings - 1 GitHub rep...

(QB_NEW_EN)


[grammar] ~86-~86: There might be a mistake here.
Context: ...total chunks - 43,511 total embeddings - 1 GitHub repository (word2vec only) ## ...

(QB_NEW_EN)

🤖 Prompt for AI Agents
In experiments/word2vec_evolution/experiment_summary.md around lines 82 to 88,
the document exposes an internal IP address and exact dataset counts; replace
the hard-coded host/IP with a reference to an environment-configured endpoint
(e.g., ARO_DB_HOST or configured connection string) and remove or redact precise
totals (use vague descriptors like "≈2.5k papers", "tens of thousands of
chunks/embeddings" or move exact numbers to an internal-only report). Also
remove any repository-specific sensitive identifiers or mark them as internal.
Ensure the public summary contains no raw internal hostnames/IPs or exact
counts.

Comment on lines +55 to +61
response = requests.get(pdf_url, stream=True)
response.raise_for_status()

with open(output_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)


🛠️ Refactor suggestion

Add request timeout, UA, and skip keep-alive chunks when downloading.

Prevents hangs and follows good HTTP hygiene for arXiv.

-    response = requests.get(pdf_url, stream=True)
-    response.raise_for_status()
-
-    with open(output_path, 'wb') as f:
-        for chunk in response.iter_content(chunk_size=8192):
-            f.write(chunk)
+    # Use a UA, streaming, and sensible connect/read timeouts
+    headers = {"User-Agent": "HADES-Lab/1.0 (+contact@example.com)"}  # set to your contact
+    with requests.get(pdf_url, stream=True, timeout=(5, 30), headers=headers) as response:
+        response.raise_for_status()
+        with open(output_path, 'wb') as f:
+            for chunk in response.iter_content(chunk_size=8192):
+                if chunk:  # skip keep-alive chunks
+                    f.write(chunk)
📝 Committable suggestion


Suggested change
response = requests.get(pdf_url, stream=True)
response.raise_for_status()
with open(output_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
# Use a UA, streaming, and sensible connect/read timeouts
headers = {"User-Agent": "HADES-Lab/1.0 (+contact@example.com)"} # set to your contact
with requests.get(pdf_url, stream=True, timeout=(5, 30), headers=headers) as response:
response.raise_for_status()
with open(output_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk: # skip keep-alive chunks
f.write(chunk)
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/import_papers.py around lines 55 to 61, the
download request lacks a timeout, a User-Agent header, and doesn't skip
keep-alive (empty) chunks; update the requests.get call to include a timeout
(e.g. timeout=10) and headers with a descriptive User-Agent (and optionally
Connection: close), and when writing response.iter_content(chunk_size=8192) skip
empty chunks by doing if not chunk: continue before f.write(chunk) so keep-alive
fragments are ignored.

Comment on lines +64 to +77
cmd = [
sys.executable,
str(pipeline_script),
"--config", str(config_path),
"--count", str(count),
"--arango-password", arango_password
]

print(f"Running command:")
print(f" {' '.join(cmd[:3])}")
print(f" {' '.join(cmd[3:5])}")
print(f" {' '.join(cmd[5:7])}")
print(f" --arango-password [HIDDEN]")
print()

🛠️ Refactor suggestion

⚠️ Potential issue

Don’t pass DB password via CLI; use env var. Also fix Ruff’s F541 f-strings.

Process args are visible via ps; switch to environment passing and simplify logging prints (no placeholders → no f-strings).

Apply:

@@
-    cmd = [
+    cmd = [
         sys.executable,
         str(pipeline_script),
         "--config", str(config_path),
-        "--count", str(count),
-        "--arango-password", arango_password
+        "--count", str(count),
     ]
 
-    print(f"Running command:")
-    print(f"  {' '.join(cmd[:3])}")
-    print(f"  {' '.join(cmd[3:5])}")
-    print(f"  {' '.join(cmd[5:7])}")
-    print(f"  --arango-password [HIDDEN]")
+    print("Running command:")
+    print("  " + " ".join(cmd))
+    print("  ARANGO_PASSWORD passed via environment")
@@
-        result = subprocess.run(
+        env = os.environ.copy()
+        env["ARANGO_PASSWORD"] = arango_password
+        result = subprocess.run(
             cmd,
             cwd=str(pipeline_dir),
-            capture_output=False,  # Show output in real-time
-            text=True
+            capture_output=False,  # Show output in real-time
+            text=True,
+            env=env
         )

Also applies to: 84-89

🧰 Tools
🪛 Ruff (0.12.2)

72-72: f-string without any placeholders

Remove extraneous f prefix

(F541)


76-76: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +102 to +114
def check_database() -> Tuple[bool, List[str]]:
"""Check database connectivity."""
issues = []

try:
from core.database import ArangoDBManager

config = {
'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
'database': 'academy_store',
'username': 'root',
'password': os.getenv('ARANGO_PASSWORD')
}

🛠️ Refactor suggestion

⚠️ Potential issue

Fix import to use absolute module path and avoid fragile package re-exports

from core.database import ArangoDBManager likely fails unless re-exported in core/database/__init__.py. Import directly:

-        from core.database import ArangoDBManager
+        from core.database.arango_db_manager import ArangoDBManager
📝 Committable suggestion


Suggested change
def check_database() -> Tuple[bool, List[str]]:
"""Check database connectivity."""
issues = []
try:
from core.database import ArangoDBManager
config = {
'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
'database': 'academy_store',
'username': 'root',
'password': os.getenv('ARANGO_PASSWORD')
}
def check_database() -> Tuple[bool, List[str]]:
"""Check database connectivity."""
issues = []
try:
from core.database.arango_db_manager import ArangoDBManager
config = {
'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
'database': 'academy_store',
'username': 'root',
'password': os.getenv('ARANGO_PASSWORD')
}
🤖 Prompt for AI Agents
In tools/arxiv/tests/validate_pipeline.py around lines 102 to 114, the code uses
a re-exported import "from core.database import ArangoDBManager" which is
fragile; replace it with a direct import from the module that actually defines
ArangoDBManager (for example "from core.database.manager import ArangoDBManager"
or the correct defining module), so the test does not rely on package re-exports
and will import reliably.

Comment on lines +124 to +129
required_collections = [
'arxiv_papers',
'arxiv_chunks',
'arxiv_embeddings',
'arxiv_structures'
]

🛠️ Refactor suggestion

⚠️ Potential issue

Align required collections with the manager’s schema

arxiv_structures isn’t created by _ensure_collections; the actual collections are arxiv_equations, arxiv_tables, and arxiv_images. Update the check:

-        required_collections = [
-            'arxiv_papers',
-            'arxiv_chunks', 
-            'arxiv_embeddings',
-            'arxiv_structures'
-        ]
+        required_collections = [
+            'arxiv_papers',
+            'arxiv_chunks',
+            'arxiv_embeddings',
+            'arxiv_equations',
+            'arxiv_tables',
+            'arxiv_images',
+        ]
📝 Committable suggestion


Suggested change
required_collections = [
'arxiv_papers',
'arxiv_chunks',
'arxiv_embeddings',
'arxiv_structures'
]
required_collections = [
'arxiv_papers',
'arxiv_chunks',
'arxiv_embeddings',
'arxiv_equations',
'arxiv_tables',
'arxiv_images',
]
🤖 Prompt for AI Agents
In tools/arxiv/tests/validate_pipeline.py around lines 124 to 129, the
required_collections list incorrectly expects 'arxiv_structures' which is not
created by _ensure_collections; replace it with the actual collections created
by the manager: 'arxiv_equations', 'arxiv_tables', and 'arxiv_images'. Update
the required_collections list to include these three names (and remove
'arxiv_structures'), so the test aligns with the manager schema and validates
the real collections created by _ensure_collections.

Comment on lines +84 to +95
# Process first 5 files as demo
demo_tasks = tasks[:5]
logger.info(f"\n Processing {len(demo_tasks)} files as demo...")

results = processor.process_documents(demo_tasks)

# Step 3: Show results
logger.info("\n3. Processing Results:")
logger.info(f" - Total processed: {results.get('total_processed', 0)}")
logger.info(f" - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
logger.info(f" - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")


⚠️ Potential issue

Fix result keys: use nested extraction/embedding summaries from GenericDocumentProcessor.

Current keys (extraction_success, etc.) don’t exist and always print zeros.

-    logger.info("\n3. Processing Results:")
-    logger.info(f"   - Total processed: {results.get('total_processed', 0)}")
-    logger.info(f"   - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
-    logger.info(f"   - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")
+    logger.info("\n3. Processing Results:")
+    extraction = results.get('extraction', {}) or {}
+    embedding = results.get('embedding', {}) or {}
+    logger.info(f"   - Total processed: {results.get('total_processed', 0)}")
+    logger.info(f"   - Extraction phase: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
+    logger.info(f"   - Embedding phase: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")
📝 Committable suggestion


Suggested change
# Process first 5 files as demo
demo_tasks = tasks[:5]
logger.info(f"\n Processing {len(demo_tasks)} files as demo...")
results = processor.process_documents(demo_tasks)
# Step 3: Show results
logger.info("\n3. Processing Results:")
logger.info(f" - Total processed: {results.get('total_processed', 0)}")
logger.info(f" - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
logger.info(f" - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")
# Step 3: Show results
logger.info("\n3. Processing Results:")
extraction = results.get('extraction', {}) or {}
embedding = results.get('embedding', {}) or {}
logger.info(f" - Total processed: {results.get('total_processed', 0)}")
logger.info(f" - Extraction phase: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
logger.info(f" - Embedding phase: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")
🤖 Prompt for AI Agents
In tools/github/demo_github_pipeline.py around lines 84 to 95, the log uses
non-existent flat keys like extraction_success and embedding_success which
always print zeros; update the log to read the nested summaries produced by
GenericDocumentProcessor (e.g. results.get("extraction", {}).get("success", 0)
and results.get("extraction", {}).get("failed", 0) and similarly for
results.get("embedding", {}).get("success", 0) / .get("failed", 0)), keeping
fallbacks to 0 so missing keys don’t crash, and adjust the logger.info calls to
use those nested values.

Comment on lines +139 to +147
FOR paper IN arxiv_embeddings
FOR code IN github_embeddings
LET similarity = DOT(paper.embedding, code.embedding)
FILTER similarity > 0.8
RETURN {
paper: paper.title,
code: code.source_file,
similarity: similarity
}

💡 Verification agent

🧩 Analysis chain

AQL uses an undefined DOT() function; the example won’t run on vanilla ArangoDB.

Replace with an available similarity function (e.g., COSINE_SIMILARITY or COSINE_DISTANCE depending on ArangoDB version), or reference a precomputed field.

-    LET similarity = DOT(paper.embedding, code.embedding)
+    // Requires ArangoDB vector functions (3.11/3.12+) or your own UDF.
+    LET similarity = 1 - COSINE_DISTANCE(paper.embedding, code.embedding)

🌐 Web query:

Does ArangoDB AQL provide built-in cosine similarity or distance functions for vector arrays (version 3.11+)? If so, what are the function names and usage?

💡 Result:

Yes.

Built-in AQL vector functions available in ArangoDB 3.11:

  • COSINE_SIMILARITY(x, y) — returns cosine similarity between two numeric arrays (or a 2D array and a 1D array). Example: COSINE_SIMILARITY([0,1], [1,0]) → 0. [1][2]
  • L1_DISTANCE(x, y) — Manhattan (L1) distance between two numeric arrays. Example: L1_DISTANCE([-1,-1], [2,2]) → 6. [2]
  • L2_DISTANCE(x, y) — Euclidean (L2) distance between two numeric arrays. Example: L2_DISTANCE([1,1], [5,2]) → 4.1231056. [2]

Notes:

  • Vector index–accelerated functions for approximate nearest-neighbor searches (e.g., APPROX_NEAR_COSINE and APPROX_NEAR_L2) were introduced later (3.12+) and require a vector index and the experimental vector-index option; those are used in queries like: SORT APPROX_NEAR_COSINE(doc.vector, @q) DESC. [3]

Sources: ArangoDB AQL function docs and release notes. [1][2][3]


Replace undefined DOT() with built-in AQL functions
File: tools/github/README.md lines 139–147

-    LET similarity = DOT(paper.embedding, code.embedding)
+    LET similarity = COSINE_SIMILARITY(paper.embedding, code.embedding)

Use COSINE_SIMILARITY(x, y) for cosine similarity (ArangoDB 3.11+); for distances, use L1_DISTANCE or L2_DISTANCE; in 3.12+ you can also leverage APPROX_NEAR_COSINE/APPROX_NEAR_L2 for index-accelerated searches.

📝 Committable suggestion


Suggested change
FOR paper IN arxiv_embeddings
FOR code IN github_embeddings
LET similarity = DOT(paper.embedding, code.embedding)
FILTER similarity > 0.8
RETURN {
paper: paper.title,
code: code.source_file,
similarity: similarity
}
FOR paper IN arxiv_embeddings
FOR code IN github_embeddings
LET similarity = COSINE_SIMILARITY(paper.embedding, code.embedding)
FILTER similarity > 0.8
RETURN {
paper: paper.title,
code: code.source_file,
similarity: similarity
}
🤖 Prompt for AI Agents
In tools/github/README.md around lines 139 to 147, the query uses an undefined
DOT() function; replace it with the built-in AQL similarity functions: compute
similarity = COSINE_SIMILARITY(paper.embedding, code.embedding) and use FILTER
similarity > 0.8; alternatively for distances use L1_DISTANCE or L2_DISTANCE,
and on ArangoDB 3.12+ consider switching to APPROX_NEAR_COSINE or APPROX_NEAR_L2
for index-accelerated searches and better performance.
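If the query is driven from Python (as elsewhere in this repo via python-arango), a hedged sketch of running the corrected AQL could look like the following; the environment-variable names and the 0.8 threshold mirror the surrounding examples, everything else is illustrative:

import os
from arango import ArangoClient

client = ArangoClient(hosts=os.getenv("ARANGO_HOST", "http://localhost:8529"))
db = client.db(
    os.getenv("ARANGO_DB", "academy_store"),
    username=os.getenv("ARANGO_USER", "root"),
    password=os.environ["ARANGO_PASSWORD"],  # fail fast if unset
)

query = """
FOR paper IN arxiv_embeddings
    FOR code IN github_embeddings
        LET similarity = COSINE_SIMILARITY(paper.embedding, code.embedding)
        FILTER similarity > @threshold
        RETURN { paper: paper.title, code: code.source_file, similarity: similarity }
"""

# Iterate candidate theory-practice bridges above the similarity threshold.
for bridge in db.aql.execute(query, bind_vars={"threshold": 0.8}):
    print(bridge)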

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +155 to +166
# Calculate metrics
metrics = self._calculate_metrics(tree, content)

# Extract structure
structure = self._extract_structure(tree, content, language)

return {
'symbols': symbols,
'metrics': metrics,
'structure': structure,
'language': language,
'tree_sitter_version': tree_sitter.__version__ if hasattr(tree_sitter, '__version__') else 'unknown'

🛠️ Refactor suggestion

Metrics use the wrong language; pass the detected language into _calculate_metrics.

Currently, _calculate_metrics infers language from tree.language (which isn’t available on Tree objects), silently falling back to "python". Complexity is miscounted for non-Python files.

Apply this diff:

-            metrics = self._calculate_metrics(tree, content)
+            metrics = self._calculate_metrics(tree, content, language)
-def _calculate_metrics(self, tree, content: str) -> Dict[str, Any]:
+def _calculate_metrics(self, tree, content: str, language: str) -> Dict[str, Any]:
@@
-        # Get language-specific control flow nodes or use a default set
-        language = getattr(tree, 'language', 'python')
-        cf_nodes = control_flow_nodes.get(language, control_flow_nodes['python'])
+        # Use the detected language passed from caller
+        cf_nodes = control_flow_nodes.get(language, control_flow_nodes['python'])

Also applies to: 687-731

🤖 Prompt for AI Agents
In core/framework/extractors/tree_sitter_extractor.py around lines 155-166 (and
the similar block around lines 687-731), _calculate_metrics is being called
without the detected language, causing it to infer language incorrectly from
tree and default to "python"; update calls to pass the detected language
variable (e.g., metrics = self._calculate_metrics(tree, content, language)) and
modify _calculate_metrics to accept a language parameter and use that value
instead of trying to read tree.language or defaulting to "python".

Comment on lines +321 to +333
name_node = node.child_by_field_name('name')
if name_node:
# Check for generator by looking for '*' token
# Rely on the node's generator attribute for JS/TS
is_generator = bool(getattr(node, 'generator', False))

symbols['functions'].append({
'name': content[name_node.start_byte:name_node.end_byte],
'line': name_node.start_point[0] + 1,
'async': self._has_child_type(node, 'async'),
'generator': is_generator
})


🛠️ Refactor suggestion

Fix JS/TS generator/async detection.

Tree-sitter nodes don’t expose a “generator” boolean. The current getattr(...) will always be False.

Apply this diff:

-                if name_node:
-                    # Check for generator by looking for '*' token
-                    # Rely on the node's generator attribute for JS/TS
-                    is_generator = bool(getattr(node, 'generator', False))
-                    
-                    symbols['functions'].append({
-                        'name': content[name_node.start_byte:name_node.end_byte],
-                        'line': name_node.start_point[0] + 1,
-                        'async': self._has_child_type(node, 'async'),
-                        'generator': is_generator
-                    })
+                if name_node:
+                    prefix = content[node.start_byte:name_node.start_byte]
+                    is_async = self._has_child_type(node, 'async') or prefix.lstrip().startswith('async')
+                    is_generator = '*' in prefix
+                    symbols['functions'].append({
+                        'name': content[name_node.start_byte:name_node.end_byte],
+                        'line': name_node.start_point[0] + 1,
+                        'async': is_async,
+                        'generator': is_generator,
+                    })
📝 Committable suggestion


Suggested change
name_node = node.child_by_field_name('name')
if name_node:
# Check for generator by looking for '*' token
# Rely on the node's generator attribute for JS/TS
is_generator = bool(getattr(node, 'generator', False))
symbols['functions'].append({
'name': content[name_node.start_byte:name_node.end_byte],
'line': name_node.start_point[0] + 1,
'async': self._has_child_type(node, 'async'),
'generator': is_generator
})
name_node = node.child_by_field_name('name')
if name_node:
prefix = content[node.start_byte:name_node.start_byte]
is_async = self._has_child_type(node, 'async') or prefix.lstrip().startswith('async')
is_generator = '*' in prefix
symbols['functions'].append({
'name': content[name_node.start_byte:name_node.end_byte],
'line': name_node.start_point[0] + 1,
'async': is_async,
'generator': is_generator,
})
🤖 Prompt for AI Agents
In core/framework/extractors/tree_sitter_extractor.py around lines 321 to 333,
the code uses getattr(node, 'generator', False) which is invalid for tree-sitter
nodes and will always be False; replace that with a real check for the '*'
generator token by scanning the node for a child token or character '*' before
the function name (e.g., inspect node.children or the source slice between
node.start_byte and name_node.start_byte for a '*' character) and set
is_generator based on that check (keep the async detection using
self._has_child_type(node, 'async')).

Comment on lines +51 to +56
def __init__(self, config_path: str):
# Load configuration
import json
with open(config_path, 'r') as f:
self.config = json.load(f)


⚠️ Potential issue

Fix example: loads YAML with json.load (incorrect).

Use yaml.safe_load to match the YAML config referenced.

-        import json
-        with open(config_path, 'r') as f:
-            self.config = json.load(f)
+        import yaml
+        with open(config_path) as f:
+            self.config = yaml.safe_load(f)
📝 Committable suggestion


Suggested change
def __init__(self, config_path: str):
# Load configuration
import json
with open(config_path, 'r') as f:
self.config = json.load(f)
def __init__(self, config_path: str):
# Load configuration
import yaml
with open(config_path) as f:
self.config = yaml.safe_load(f)
🤖 Prompt for AI Agents
In docs/adr/0005-arxiv-pipeline-v2-architecture.md around lines 51 to 56, the
constructor currently uses json.load to parse a YAML configuration file; change
this to import and use yaml.safe_load so the YAML config is parsed correctly
(add "import yaml" and replace json.load with yaml.safe_load when reading the
file), ensuring the file is still opened in text mode and assigned to
self.config.

Comment on lines +28 to +31
# Connect to ArangoDB
client = ArangoClient(hosts='http://192.168.1.69:8529')
self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')


⚠️ Potential issue

Secret leak: hardcoded Arango root password and host

Credentials and host must not be in code. Rotate the password immediately and switch to env/config-based loading.

+import os
@@
-        client = ArangoClient(hosts='http://192.168.1.69:8529')
-        self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')
+        host = os.getenv("ARANGO_HOST", "http://localhost:8529")
+        db_name = os.getenv("ARANGO_DB", "academy_store")
+        user = os.getenv("ARANGO_USER", "root")
+        password = os.getenv("ARANGO_PASSWORD")
+        if not password:
+            raise ValueError("ARANGO_PASSWORD not set")
+        client = ArangoClient(hosts=host)
+        self.db = client.db(db_name, username=user, password=password)

If this secret was committed, consider revoking it and auditing access.

📝 Committable suggestion


Suggested change
# Connect to ArangoDB
client = ArangoClient(hosts='http://192.168.1.69:8529')
self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')
import os
# Connect to ArangoDB
host = os.getenv("ARANGO_HOST", "http://localhost:8529")
db_name = os.getenv("ARANGO_DB", "academy_store")
user = os.getenv("ARANGO_USER", "root")
password = os.getenv("ARANGO_PASSWORD")
if not password:
raise ValueError("ARANGO_PASSWORD not set")
client = ArangoClient(hosts=host)
self.db = client.db(db_name, username=user, password=password)
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/analyze_theory_practice_bridges.py around
lines 28-31, the ArangoDB host and root password are hardcoded; replace these
with configuration-driven loading (e.g., read ARANGO_HOST, ARANGO_USER,
ARANGO_PASSWORD from environment variables or a secured config store), validate
presence and fail fast with a clear error if missing, and remove the plaintext
credentials from the file; after updating, rotate/revoke the exposed credentials
and ensure the repo history and any CI secrets do not contain the old values
(use git-secret / git-filter-repo or similar if necessary).

Comment on lines +160 to +165
for i, p_emb in enumerate(paper_embeddings):
for j, c_emb in enumerate(code_embeddings):
# Cosine similarity (1 - cosine distance)
similarity = 1 - cosine(p_emb, c_emb)
similarity_matrix[i, j] = similarity


🛠️ Refactor suggestion

NaN-safe cosine similarity

Guard against zero vectors producing NaNs.

-                similarity = 1 - cosine(p_emb, c_emb)
+                similarity = 1 - cosine(p_emb, c_emb)
+                if not np.isfinite(similarity):
+                    similarity = 0.0
                 similarity_matrix[i, j] = similarity
📝 Committable suggestion


Suggested change
for i, p_emb in enumerate(paper_embeddings):
for j, c_emb in enumerate(code_embeddings):
# Cosine similarity (1 - cosine distance)
similarity = 1 - cosine(p_emb, c_emb)
similarity_matrix[i, j] = similarity
for i, p_emb in enumerate(paper_embeddings):
for j, c_emb in enumerate(code_embeddings):
# Cosine similarity (1 - cosine distance)
similarity = 1 - cosine(p_emb, c_emb)
if not np.isfinite(similarity):
similarity = 0.0
similarity_matrix[i, j] = similarity
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/analyze_theory_practice_bridges.py around
lines 160 to 165, the cosine calculation can produce NaN when one or both
vectors are zero; modify the loop to compute cosine similarity with an explicit
norm check: compute the L2 norms of p_emb and c_emb, if either norm is zero set
similarity to 0.0 (or another defined fallback) else compute similarity =
dot(p_emb, c_emb) / (norm_p * norm_c); assign that value to similarity_matrix[i,
j]. Ensure no NaNs are written by handling the zero-norm case before division.
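A sketch of that norm-check variant in plain NumPy (the helper name is illustrative; it assumes p_emb and c_emb are 1-D float arrays):

import numpy as np

def safe_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return cosine similarity, falling back to 0.0 for zero-norm vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # defined fallback instead of NaN
    return float(np.dot(a, b) / (norm_a * norm_b))

# similarity_matrix[i, j] = safe_cosine_similarity(p_emb, c_emb)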

Comment on lines +49 to +60
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
echo "Found $PAPER_COUNT papers in existing collections"

if [ "$PAPER_COUNT" -lt 20000 ]; then
echo "Collection has less than 20,000 papers. Running extended collector..."
cd "$ARXIV_DIR/scripts"
python collect_ai_papers_extended.py
else
echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}"
fi

🛠️ Refactor suggestion

Add validation for paper count parsing

The paper count extraction using awk could fail silently if the file format is unexpected. Consider adding error handling for the arithmetic operation.

Apply this improved version with validation:

 COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
 if [ -d "$COLLECTION_DIR" ]; then
     PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
+    if ! [[ "$PAPER_COUNT" =~ ^[0-9]+$ ]]; then
+        echo -e "${RED}ERROR: Invalid paper count detected${NC}"
+        exit 1
+    fi
     echo "Found $PAPER_COUNT papers in existing collections"
📝 Committable suggestion


Suggested change
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
echo "Found $PAPER_COUNT papers in existing collections"
if [ "$PAPER_COUNT" -lt 20000 ]; then
echo "Collection has less than 20,000 papers. Running extended collector..."
cd "$ARXIV_DIR/scripts"
python collect_ai_papers_extended.py
else
echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}"
fi
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
if ! [[ "$PAPER_COUNT" =~ ^[0-9]+$ ]]; then
echo -e "${RED}ERROR: Invalid paper count detected${NC}"
exit 1
fi
echo "Found $PAPER_COUNT papers in existing collections"
if [ "$PAPER_COUNT" -lt 20000 ]; then
echo "Collection has less than 20,000 papers. Running extended collector..."
cd "$ARXIV_DIR/scripts"
python collect_ai_papers_extended.py
else
echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}"
fi

Comment on lines +86 to +92
if [ -d "$STAGING_DIR" ]; then
rm -rf "$STAGING_DIR"/*
echo "✅ Staging directory cleaned"
else
mkdir -p "$STAGING_DIR"
echo "✅ Staging directory created"
fi

⚠️ Potential issue

Fix potential dangerous expansion in staging directory cleanup

The rm -rf command on line 87 could expand to /* if $STAGING_DIR is unexpectedly empty, potentially deleting all files on the system.

Apply this fix to ensure safe directory cleanup:

 # Step 2: Clean up staging area
 echo -e "${YELLOW}Step 2: Cleanup${NC}"
 echo "Cleaning staging directory..."
 STAGING_DIR="/dev/shm/weekend_staging"
 if [ -d "$STAGING_DIR" ]; then
-    rm -rf "$STAGING_DIR"/*
+    rm -rf "${STAGING_DIR:?}"/*
     echo "✅ Staging directory cleaned"
 else
     mkdir -p "$STAGING_DIR"
     echo "✅ Staging directory created"
 fi
📝 Committable suggestion


Suggested change (before → after)

Before:

if [ -d "$STAGING_DIR" ]; then
    rm -rf "$STAGING_DIR"/*
    echo "✅ Staging directory cleaned"
else
    mkdir -p "$STAGING_DIR"
    echo "✅ Staging directory created"
fi

After:

if [ -d "$STAGING_DIR" ]; then
    rm -rf "${STAGING_DIR:?}"/*
    echo "✅ Staging directory cleaned"
else
    mkdir -p "$STAGING_DIR"
    echo "✅ Staging directory created"
fi
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 87-87: Use "${var:?}" to ensure this never expands to /* .

(SC2115)

🤖 Prompt for AI Agents
In tools/arxiv/scripts/run_weekend_test.sh around lines 86 to 92, the rm -rf
"$STAGING_DIR"/* can expand to /* if $STAGING_DIR is empty; ensure $STAGING_DIR
is non-empty and existing, then replace the unsafe rm usage with a safe deletion
of only the directory contents (for example check [ -n "$STAGING_DIR" ] and [ -d
"$STAGING_DIR" ] first, then use a command that deletes only its contents such
as find "$STAGING_DIR" -mindepth 1 -delete or find "$STAGING_DIR" -mindepth 1
-maxdepth 1 -exec rm -rf -- {} +) so you never run rm on / or the root glob when
the variable is empty.

Comment on lines +108 to +111
pipeline = ACIDPhasedPipeline(
    config_path=str(self.config_path),
    arango_password=self.arango_password
)

💡 Verification agent

🧩 Analysis chain

Verify that ACIDPhasedPipeline constructor accepts arango_password parameter

The code passes arango_password to ACIDPhasedPipeline constructor, but this parameter doesn't appear in the constructor signature shown in the relevant code snippets.


🏁 Script executed:

#!/bin/bash
# Check the ACIDPhasedPipeline constructor signature
ast-grep --pattern 'class ACIDPhasedPipeline:
  $$$
  def __init__(self, $_):
    $$$'

Length of output: 17365


ACIDPhasedPipeline constructor missing arango_password parameter
The test at tools/arxiv/tests/test_large_scale_processing.py:108–111 calls

ACIDPhasedPipeline(config_path=…, arango_password=…)

but the class’s __init__ in tools/arxiv/pipelines/arxiv_pipeline.py only accepts config_path (line 836). Add an arango_password parameter to its signature or remove it from the test.

🤖 Prompt for AI Agents
In tools/arxiv/pipelines/arxiv_pipeline.py around line 836, the
ACIDPhasedPipeline.__init__ only accepts config_path but the test constructs it
with arango_password; update the constructor signature to accept an optional
arango_password parameter (e.g., arango_password=None), assign it to
self.arango_password, and propagate it to any internal Arango/DB client creation
code that requires a password; if no client currently uses it, still store the
attribute to preserve test compatibility.
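
A minimal sketch of that constructor change, assuming the password only needs to be stored and handed to whatever Arango client the pipeline builds; the rest of the class body is elided and the environment-variable fallback is an assumption, not confirmed behavior:

# Sketch for ACIDPhasedPipeline in tools/arxiv/pipelines/arxiv_pipeline.py.
# Only the signature change is shown; the existing setup code is elided.
import os
from typing import Optional


class ACIDPhasedPipeline:
    def __init__(self, config_path: str, arango_password: Optional[str] = None):
        self.config_path = config_path
        # Store the password (falling back to the environment) so tests can pass
        # it explicitly while existing callers keep working unchanged.
        self.arango_password = arango_password or os.environ.get("ARANGO_PASSWORD")
        # ... existing config loading / DB wiring would follow, passing
        # self.arango_password to any Arango connection it creates.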

Comment on lines +122 to +126
results = pipeline.process_papers(batch, count=len(batch))

total_processed += results.get('successful', 0)
total_failed += results.get('failed', 0)


💡 Verification agent

🧩 Analysis chain

Pipeline's process_papers method signature inconsistency

The code calls pipeline.process_papers() with various arguments including detailed_timing=True, but the pipeline class shown in relevant snippets doesn't have this method. Only the run() method is visible in the ACIDPhasedPipeline class.

Also applies to: 196-198, 365-369


🏁 Script executed:

#!/bin/bash
# Search for process_papers method in ACIDPhasedPipeline class
ast-grep --pattern 'class ACIDPhasedPipeline:
  $$$
  def process_papers($$$):
    $$$'

# Also check if there's a different pipeline class being used
rg -n "def process_papers" --type=py

Length of output: 197


🏁 Script executed:

#!/bin/bash
# Display the definition of process_papers in OnDemandProcessor
sed -n '115,135p' tools/arxiv/utils/on_demand_processor.py

Length of output: 655


🏁 Script executed:

#!/bin/bash
# Show imports and pipeline instantiation in test file
sed -n '1,80p' tools/arxiv/tests/test_large_scale_processing.py

Length of output: 2484


🏁 Script executed:

#!/bin/bash
# Locate and display the ACIDPhasedPipeline class definition and its methods
file=$(rg -l "class ACIDPhasedPipeline" --type=py)
echo "Found pipeline file: $file"
sed -n '1,200p' "$file"

Length of output: 6245


ACIDPhasedPipeline lacks process_papers—update tests or implement method
Tests in tools/arxiv/tests/test_large_scale_processing.py (lines 122–126, 196–198, 365–369) invoke pipeline.process_papers, but ACIDPhasedPipeline only defines run(). Either add a process_papers alias/wrapper in the pipeline class or change the tests to call pipeline.run() with the appropriate arguments.
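
If the wrapper route is chosen, a minimal sketch could look like the following; the real run() signature and result keys are not visible in this review, so the keyword arguments and the normalized keys below are assumptions:

# Sketch of a compatibility wrapper on ACIDPhasedPipeline; run()'s keyword
# arguments and result shape are assumed, not the confirmed API.
from typing import Any, Dict, List, Optional


class ACIDPhasedPipeline:  # excerpt; the rest of the class is unchanged
    def run(self, **kwargs) -> Dict[str, Any]:
        # Existing phased processing lives here (not shown in this review);
        # stubbed to return an empty result so the sketch is self-contained.
        return {}

    def process_papers(self, papers: List[str], count: Optional[int] = None,
                       detailed_timing: bool = False) -> Dict[str, Any]:
        """Compatibility shim for tests that call process_papers()."""
        results = self.run(papers=papers, count=count or len(papers),
                           detailed_timing=detailed_timing)
        # Make sure the keys the tests read are always present.
        results.setdefault('successful', 0)
        results.setdefault('failed', 0)
        return results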

Comment on lines +197 to +207
            output_files = write_outputs(
                out_dir=output_dir,
                prefix=output.get('prefix', 'arxiv_ids'),
                ids=ids,
                with_pdf=with_pdf,
                missing_pdf=missing_pdf,
                stats=stats,
                monthly=monthly,
                suffix=stamp,
                update_symlinks=output.get('update_symlinks', True)
            )

⚠️ Potential issue

Bug: write_outputs returns None; generate_list expects a dict and loses file paths.

This silently makes output_files None. Either change write_outputs to return paths or compute them here after calling it.

-            output_files = write_outputs(
-                out_dir=output_dir,
-                prefix=output.get('prefix', 'arxiv_ids'),
+            prefix = output.get('prefix', 'arxiv_ids')
+            update_symlinks = output.get('update_symlinks', True)
+            write_outputs(
+                out_dir=output_dir,
+                prefix=prefix,
                 ids=ids,
                 with_pdf=with_pdf,
                 missing_pdf=missing_pdf,
                 stats=stats,
                 monthly=monthly,
                 suffix=stamp,
-                update_symlinks=output.get('update_symlinks', True)
+                update_symlinks=update_symlinks,
             )
-            
-            logger.info("Output files written", files=output_files)
+            output_files = {
+                "master_ts": output_dir / f"{prefix}_sweep_{stamp}.txt",
+                "stats_ts": output_dir / f"{prefix}_sweep_stats_{stamp}.json",
+            }
+            if update_symlinks:
+                output_files["master"] = output_dir / f"{prefix}_sweep.txt"
+                output_files["stats"] = output_dir / f"{prefix}_stats.json"
+            if with_pdf:
+                output_files["with_pdfs_ts"] = output_dir / f"{prefix}_with_pdfs_{stamp}.txt"
+                if update_symlinks:
+                    output_files["with_pdfs"] = output_dir / f"{prefix}_with_pdfs.txt"
+            if missing_pdf:
+                output_files["missing_pdfs_ts"] = output_dir / f"{prefix}_missing_pdfs_{stamp}.txt"
+                if update_symlinks:
+                    output_files["missing_pdfs"] = output_dir / f"{prefix}_missing_pdfs.txt"
+
+            logger.info("Output files written", files={k: str(v) for k, v in output_files.items()})
📝 Committable suggestion


Suggested change (before → after)

Before:

            output_files = write_outputs(
                out_dir=output_dir,
                prefix=output.get('prefix', 'arxiv_ids'),
                ids=ids,
                with_pdf=with_pdf,
                missing_pdf=missing_pdf,
                stats=stats,
                monthly=monthly,
                suffix=stamp,
                update_symlinks=output.get('update_symlinks', True)
            )

After:

            # Determine prefix and symlink behavior up front
            prefix = output.get("prefix", "arxiv_ids")
            update_symlinks = output.get("update_symlinks", True)
            # Actually write the outputs (returns None)
            write_outputs(
                out_dir=output_dir,
                prefix=prefix,
                ids=ids,
                with_pdf=with_pdf,
                missing_pdf=missing_pdf,
                stats=stats,
                monthly=monthly,
                suffix=stamp,
                update_symlinks=update_symlinks,
            )
            # Manually reconstruct the paths that were written
            output_files = {
                "master_ts": output_dir / f"{prefix}_sweep_{stamp}.txt",
                "stats_ts": output_dir / f"{prefix}_sweep_stats_{stamp}.json",
            }
            if update_symlinks:
                output_files["master"] = output_dir / f"{prefix}_sweep.txt"
                output_files["stats"] = output_dir / f"{prefix}_stats.json"
            if with_pdf:
                output_files["with_pdfs_ts"] = output_dir / f"{prefix}_with_pdfs_{stamp}.txt"
                if update_symlinks:
                    output_files["with_pdfs"] = output_dir / f"{prefix}_with_pdfs.txt"
            if missing_pdf:
                output_files["missing_pdfs_ts"] = output_dir / f"{prefix}_missing_pdfs_{stamp}.txt"
                if update_symlinks:
                    output_files["missing_pdfs"] = output_dir / f"{prefix}_missing_pdfs.txt"
            logger.info("Output files written", files={k: str(v) for k, v in output_files.items()})

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
core/processors/chunking_strategies.py (2)

86-96: Preserve newlines in _clean_text; current normalization breaks paragraph-aware strategies

Collapsing all whitespace to single spaces removes newlines, which prevents SemanticChunking from detecting paragraphs/sentences and materially changes downstream behavior. Preserve newline characters while still collapsing spaces/tabs.

Apply this diff:

 def _clean_text(self, text: str) -> str:
-        """
-        Normalize input text for chunking by collapsing all whitespace to single spaces, removing null characters, and trimming leading/trailing spaces.
-        
-        This produces a compact, single-line-safe string suitable for downstream tokenization and chunk boundary calculations.
-        """
-        # Remove excessive whitespace
-        text = re.sub(r'\s+', ' ', text)
-        # Remove null characters
-        text = text.replace('\x00', '')
-        return text.strip()
+        """
+        Normalize input text for chunking by collapsing runs of spaces/tabs, preserving newlines,
+        removing null characters, and trimming edges. This keeps paragraph/sentence boundaries intact.
+        """
+        # Normalize line endings
+        text = text.replace('\r\n', '\n').replace('\r', '\n')
+        # Collapse spaces/tabs but preserve newlines
+        text = re.sub(r'[^\S\n]+', ' ', text)
+        # Remove null characters
+        text = text.replace('\x00', '')
+        # Collapse 3+ newlines to exactly two (paragraph boundary), and strip
+        text = re.sub(r'\n{3,}', '\n\n', text).strip()
+        return text

476-484: SlidingWindow remainder drops tail < step_size and duplicates tokens; compute tail from last window end

  • tail < step_size is currently discarded, losing content.
  • Remainder starts at len(chunks) * step_size, which duplicates tokens already covered by the last window; the uncovered tail actually begins at last_start + window_size.
  • Also avoid O(n^2) joins for start_char by precomputing prefix chars.

Apply this diff:

 def create_chunks(self, text: str, **kwargs) -> List[TextChunk]:
@@
         text = self._clean_text(text)
         tokens = text.split()  # Simple tokenization
+        # Precompute cumulative char positions (O(n)) for O(1) offset lookup
+        prefix_chars = [0]
+        total = 0
+        for idx, tok in enumerate(tokens):
+            if idx > 0:
+                total += 1  # space between tokens
+            total += len(tok)
+            prefix_chars.append(total)
@@
-            start_char = len(' '.join(tokens[:i]))
+            start_char = prefix_chars[i]
@@
-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
-                    
-                    chunk = TextChunk(
-                        text=chunk_text,
-                        start_char=start_char,
-                        end_char=start_char + len(chunk_text),
-                        chunk_index=len(chunks),
-                        metadata={
-                            'strategy': 'sliding_window_remainder',
-                            'token_count': len(remaining_tokens)
-                        }
-                    )
-                    chunks.append(chunk)
+        # Handle uncovered tail after the last full window (if any)
+        if chunks:
+            last_start = (len(chunks) - 1) * self.step_size
+            tail_start = last_start + self.window_size
+            if tail_start < len(tokens):
+                remaining_tokens = tokens[tail_start:]
+                chunk_text = ' '.join(remaining_tokens)
+                start_char = prefix_chars[tail_start]
+                chunk = TextChunk(
+                    text=chunk_text,
+                    start_char=start_char,
+                    end_char=start_char + len(chunk_text),
+                    chunk_index=len(chunks),
+                    metadata={
+                        'strategy': 'sliding_window_remainder',
+                        'token_count': len(remaining_tokens)
+                    }
+                )
+                chunks.append(chunk)

Also applies to: 501-521

♻️ Duplicate comments (2)
experiments/word2vec_evolution/experiment_config.yaml (1)

112-112: Fix: add a trailing newline at EOF to satisfy YAMLlint/CI.

Current error: “no new line character at the end of file.”

Apply:

-  - validation_report/  # Metrics and hypothesis testing results
+  - validation_report/  # Metrics and hypothesis testing results
+
core/processors/chunking_strategies.py (1)

119-123: Tokenizer object: make convert_tokens_to_string optional; fall back to ' '.join()

Requiring convert_tokens_to_string excludes many tokenizer implementations that only expose .tokenize and .decode. Accept .tokenize and use a safe fallback for reconstruction. This mirrors the earlier guidance on the tokenizer interface.

         Parameters:
@@
-            tokenizer: Optional tokenizer that can be either:
-                - A callable function that takes a string and returns a list of tokens: Callable[[str], List[str]]
-                - A tokenizer-like object with .tokenize(text) and .convert_tokens_to_string(tokens) methods
-                - None (uses simple whitespace split)
+            tokenizer: Optional tokenizer that can be either:
+                - A callable: (str) -> list[str]
+                - A tokenizer-like object with .tokenize(text)->list[str] and optionally
+                  .convert_tokens_to_string(tokens)->str (fallback is ' '.join(tokens))
+                - None (uses simple whitespace split)
@@
-        # Validate tokenizer interface if provided
+        # Validate tokenizer interface if provided
         if tokenizer is not None:
-            if not (callable(tokenizer) or 
-                    (hasattr(tokenizer, 'tokenize') and hasattr(tokenizer, 'convert_tokens_to_string'))):
+            if not (callable(tokenizer) or hasattr(tokenizer, 'tokenize')):
                 raise TypeError(
-                    "Tokenizer must be either a callable that returns List[str], "
-                    "or an object with .tokenize() and .convert_tokens_to_string() methods"
+                    "Tokenizer must be either a callable that returns list[str], "
+                    "or an object with .tokenize(str)->list[str]. "
+                    "convert_tokens_to_string(tokens)->str is optional."
                 )
@@
-            if self.tokenizer:
-                if callable(self.tokenizer):
-                    # For callable tokenizers, join tokens with space
-                    chunk_text = ' '.join(chunk_tokens)
-                else:
-                    # For tokenizer objects, use convert_tokens_to_string method
-                    chunk_text = self.tokenizer.convert_tokens_to_string(chunk_tokens)
+            if self.tokenizer:
+                if callable(self.tokenizer):
+                    chunk_text = ' '.join(chunk_tokens)
+                else:
+                    if hasattr(self.tokenizer, 'convert_tokens_to_string'):
+                        chunk_text = self.tokenizer.convert_tokens_to_string(chunk_tokens)
+                    else:
+                        chunk_text = ' '.join(chunk_tokens)
             else:
                 chunk_text = ' '.join(chunk_tokens)

Also applies to: 131-139, 186-193

🧹 Nitpick comments (6)
experiments/word2vec_evolution/experiment_config.yaml (3)

81-101: Stage ordering may be non-deterministic if a map is parsed unordered.

If the runner relies on order, prefer a list of stages with explicit names.

Minimal illustration:

-pipeline:
-  stages:
-    1_import:
-      - Download papers from ArXiv
-      - Clone GitHub repositories
-      - Extract text content from both sources
+pipeline:
+  stages:
+    - name: import
+      tasks:
+        - Download papers from ArXiv
+        - Clone GitHub repositories
+        - Extract text content from both sources

Confirm whether your pipeline preserves YAML map insertion order; if not, adopt the list form.


42-51: Optional: centralize thresholds with YAML anchors to reduce drift.

Keeps validation text and criteria consistent.

 analysis:
   # Entropy analysis
-  entropy_window_size: 100
-  entropy_threshold_low: 0.3   # Low entropy = high clarity/bridge point
-  entropy_threshold_high: 0.7  # High entropy = conceptual complexity
+  entropy_window_size: 100
+  entropy_thresholds: &entropy_thresholds
+    low: 0.3    # Low entropy = high clarity/bridge point
+    high: 0.7   # High entropy = conceptual complexity

   # Bridge detection
   bridge_criteria:
     semantic_similarity_threshold: 0.85
     conveyance_threshold: 0.9
     where_threshold: 0.8  # Proximity in graph
+
+hypotheses:
   h1:
     name: "Theory-Practice Bridge Detection"
     description: |
       Papers and their implementation code will show low entropy (high clarity)
       at specific bridge points where theoretical concepts map directly to
       implementation patterns.
     validation: |
-      Measure entropy at points of high semantic similarity between paper
-      and code. Expect entropy < 0.3 at true bridge points.
+      Measure entropy at points of high semantic similarity between paper
+      and code. Expect entropy < *entropy_thresholds.low at true bridge points.

Also applies to: 56-62, 69-71, 75-80


32-46: Add a seed for reproducibility

  • Under analysis, add random_seed: 42 to stabilize chunking and ordering.
  • Embedding dimensions (2048) already match the jina-v4 default.
core/processors/chunking_strategies.py (3)

174-176: TokenBased start_char is O(n^2); precompute prefix chars to O(1) per chunk

Repeated joins inside the loop are quadratic. Precompute a prefix array once.

         chunks = []
         stride = self.chunk_size - self.chunk_overlap
+        # Precompute cumulative char positions for O(1) start offsets
+        prefix_chars = [0]
+        total = 0
+        for idx, tok in enumerate(tokens):
+            if idx > 0:
+                total += 1  # space between tokens
+            total += len(tok)
+            prefix_chars.append(total)
@@
-            start_char = len(' '.join(tokens[:i]))
+            start_char = prefix_chars[i]

Also applies to: 196-197


17-17: Modernize typing per Ruff UP035/UP007: use built-in generics and | unions

Adopt built-in generics and import Callable from collections.abc. Improves readability and aligns with current typing best practices.

-from typing import List, Dict, Any, Optional, Tuple, Callable, Union
+from collections.abc import Callable
+from typing import Any
@@
-        tokenizer: Optional[Union[Callable[[str], List[str]], Any]] = None
+        tokenizer: Callable[[str], list[str]] | Any | None = None

Follow-up (optional): Convert other annotations to built-ins, e.g., List[TextChunk] -> list[TextChunk], Dict[str, Any] -> dict[str, Any].

Also applies to: 111-111


85-96: Offer: add targeted tests to pin behavior

I can add tests to prevent regressions:

  • _clean_text preserves newlines and paragraph boundaries (round-trips).
  • SlidingWindow covers all tokens including tails < step_size; no gaps.
  • Semantic forced split yields strictly increasing, unique chunk_index values.

Do you want me to open a follow-up PR with these tests? A minimal sketch of the first check appears below.

Also applies to: 464-524, 368-399
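
A minimal pytest sketch of the first check, assuming the newline-preserving _clean_text proposed above is adopted and that SemanticChunking can be constructed with default arguments (both are assumptions):

# Hypothetical test; the import path and default constructor are assumptions
# based on the file under review (core/processors/chunking_strategies.py).
from core.processors.chunking_strategies import SemanticChunking


def test_clean_text_preserves_paragraph_boundaries():
    strategy = SemanticChunking()
    raw = "First  paragraph\twith   extra spaces.\r\n\r\n\r\nSecond paragraph."
    cleaned = strategy._clean_text(raw)
    # Inline whitespace is collapsed, but the paragraph break survives.
    assert "\n\n" in cleaned
    assert "\x00" not in cleaned
    assert cleaned.startswith("First paragraph with extra spaces.")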

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 8736363 and 80596db.

📒 Files selected for processing (2)
  • core/processors/chunking_strategies.py (12 hunks)
  • experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer absolute imports following the shown module structure (e.g., core.framework., tools.arxiv.pipelines.)

Files:

  • core/processors/chunking_strategies.py
🧬 Code graph analysis (1)
core/processors/chunking_strategies.py (1)
experiments/word2vec_evolution/generate_embeddings.py (1)
  • chunk_text (30-57)
🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml

[error] 112-112: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.12.2)
core/processors/chunking_strategies.py

17-17: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


17-17: typing.List is deprecated, use list instead

(UP035)


17-17: typing.Dict is deprecated, use dict instead

(UP035)


17-17: typing.Tuple is deprecated, use tuple instead

(UP035)


111-111: Use X | Y for type annotations

Convert to X | Y

(UP007)

🔇 Additional comments (3)
experiments/word2vec_evolution/experiment_config.yaml (2)

1-112: Well-structured experiment config.

Clear sections, sensible defaults, and readable comments. Nice.


13-31: Sanity check passed: arXiv IDs and GitHub repo URLs resolve successfully.

core/processors/chunking_strategies.py (1)

41-46: Docstring and API clarifications: LGTM

Clearer parameter semantics, return types, and behavior descriptions improve the API surface without changing runtime behavior.

Also applies to: 51-57, 72-82, 114-127, 147-159, 241-248, 256-269, 344-348, 355-362, 402-417, 449-457, 466-475, 540-554

Comment on lines +368 to +379
"""
Force-split an oversized text into consecutive TextChunk objects of at most `self.max_chunk_size` tokens.
The text is split on whitespace into words and rejoined into chunks of up to `max_chunk_size` words. `start_char` is used as the initial character offset for the first chunk; subsequent chunks' `start_char`/`end_char` are computed by advancing the offset by the length of the produced chunk text plus a single separating space. Character positions are therefore approximate and do not preserve original spacing or punctuation.
Parameters:
text (str): The input text to split.
start_char (int): Character index in the original document corresponding to the start of `text`.
Returns:
List[TextChunk]: A list of TextChunk objects with metadata containing `strategy='semantic_forced_split'` and `token_count`. Chunk indices are zero-based and reflect the order of the produced chunks.
"""

⚠️ Potential issue

chunk_index resets inside _force_split_text; leads to duplicate/incorrect indices

Chunks produced here start from 0 regardless of prior chunks, causing duplicate indices in the final list. Provide a base_index to offset numbering.

-def _force_split_text(self, text: str, start_char: int) -> List[TextChunk]:
+def _force_split_text(self, text: str, start_char: int, base_index: int = 0) -> List[TextChunk]:
@@
-        for i in range(0, len(words), self.max_chunk_size):
+        for i in range(0, len(words), self.max_chunk_size):
             chunk_words = words[i:i + self.max_chunk_size]
             chunk_text = ' '.join(chunk_words)
@@
-            chunks.append(TextChunk(
+            chunks.append(TextChunk(
                 text=chunk_text,
                 start_char=start_char,
                 end_char=start_char + len(chunk_text),
-                chunk_index=len(chunks),
+                chunk_index=base_index + len(chunks),
                 metadata={
                     'strategy': 'semantic_forced_split',
                     'token_count': len(chunk_words)
                 }
             ))

Also update the call site to pass the current offset (in create_chunks):

-                    chunks.extend(self._force_split_text(para, current_chunk_start))
+                    chunks.extend(self._force_split_text(para, current_chunk_start, base_index=len(chunks)))

Also applies to: 383-399

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
experiments/word2vec_evolution/process_github_repos.py (1)

100-127: Good: now checking returncode; add a timeout to avoid hangs

You implemented the earlier suggestion to check return codes. Add a reasonable timeout so setup can’t hang the run.

-        result = subprocess.run(
+        result = subprocess.run(
             [sys.executable, str(setup_script)],
             capture_output=True,
             text=True,
-            cwd=str(setup_script.parent)
+            cwd=str(setup_script.parent),
+            timeout=120
         )
🧹 Nitpick comments (4)
experiments/word2vec_evolution/process_github_repos.py (4)

22-27: Fix E402 import placement and avoid sys.path hacks

Move the GitHubPipelineManager import into main() (after sys.path manipulation) and loosen the type in process_repository to avoid top-level import ordering issues. This removes the Ruff E402 and aligns with absolute-import guidance without polluting module scope.

Apply:

@@
-# Add parent to path for imports
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-
-# Import GitHub pipeline manager for direct use
-from tools.github.github_pipeline_manager import GitHubPipelineManager
+# Add project root to path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent.parent))
@@
-def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
+from typing import Any
+
+def process_repository(repo_url: str, name: str, manager: Any):

And inside main(), just before constructing manager:

@@
-    # Create manager instance
+    # Create manager instance
+    from tools.github.github_pipeline_manager import GitHubPipelineManager  # local to satisfy E402

128-145: Guard against missing keys in config to avoid KeyError

If github_simple.yaml lacks expected sections, access like config['arango'] will KeyError. Add get() checks with clear errors.

-    with open(config_path) as f:
-        config = yaml.safe_load(f)
+    with open(config_path) as f:
+        config = yaml.safe_load(f) or {}
@@
-    config['arango']['password'] = arango_password
+    if 'arango' not in config:
+        logger.error("Config missing 'arango' section")
+        sys.exit(1)
+    config['arango']['password'] = arango_password
+    if 'processing' not in config or 'github' not in config['processing']:
+        logger.error("Config missing 'processing.github' section")
+        sys.exit(1)

159-172: Persist richer per-repo metrics in results.json

Capture totals when available. Helps downstream analysis.

-        success = process_repository(repo_info['repo'], repo_info['name'], manager)
+        success = process_repository(repo_info['repo'], repo_info['name'], manager)
+        # Optionally, pull lightweight stats from manager cache/DB in future
         results.append({
             'repo': repo_info['repo'],
             'name': repo_info['name'],
             'paper': repo_info['paper'],
             'success': success,
             'timestamp': datetime.now().isoformat()
         })

201-209: Use sys.exit and add trailing newline

Small polish: prefer sys.exit for scripts; also fix missing newline at EOF (Ruff W292).

-if __name__ == "__main__":
-    exit(main())
+if __name__ == "__main__":
+    sys.exit(main())
+
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 80596db and 2af1332.

📒 Files selected for processing (1)
  • experiments/word2vec_evolution/process_github_repos.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer absolute imports following the shown module structure (e.g., core.framework., tools.arxiv.pipelines.)

Files:

  • experiments/word2vec_evolution/process_github_repos.py
🧬 Code graph analysis (1)
experiments/word2vec_evolution/process_github_repos.py (1)
tools/github/github_pipeline_manager.py (3)
  • GitHubPipelineManager (23-316)
  • process_repository (53-87)
  • main (319-366)
🪛 Ruff (0.12.2)
experiments/word2vec_evolution/process_github_repos.py

26-26: Module level import not at top of file

(E402)


209-209: No newline at end of file

Add trailing newline

(W292)

🔇 Additional comments (1)
experiments/word2vec_evolution/process_github_repos.py (1)

68-89: Repo URL handling already supported: GitHubDocumentManager._parse_repo_url explicitly accepts the shorthand "owner/repo", and process_repository passes that format end-to-end, so no changes are needed.

Comment on lines +28 to +36
def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.
    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """

⚠️ Potential issue

Normalize repo URL format to prevent processing failures

GitHubPipelineManager examples expect a full https URL, but the docstring and callers pass owner/repo. Normalize to support both to avoid parse failures at runtime.

 def process_repository(repo_url: str, name: str, manager: Any):
@@
-    logger.info(f"Processing {name} ({repo_url})...")
+    # Accept 'owner/repo' or full URL and normalize
+    repo_url_norm = repo_url if repo_url.startswith("http") else f"https://github.com/{repo_url}"
+    logger.info(f"Processing {name} ({repo_url_norm})...")
📝 Committable suggestion


Suggested change (before → after)

Before:

def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.
    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """

After:

def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.
    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """
    # Accept 'owner/repo' or full URL and normalize
    repo_url_norm = repo_url if repo_url.startswith("http") else f"https://github.com/{repo_url}"
    logger.info(f"Processing {name} ({repo_url_norm})...")
    # …rest of function unchanged…

Comment on lines +39 to +61
    try:
        # Process the repository directly using the manager
        results = manager.process_repository(repo_url)

        # Check results
        if results and 'stored' in results:
            stored_count = results.get('stored', 0)
            logger.info(f" ✓ {name} processed successfully")
            logger.info(f" Stored {stored_count} embeddings")

            # Log additional stats if available
            if 'repository' in results:
                repo_info = results['repository']
                logger.info(f" Repository: {repo_info.get('full_name', repo_url)}")
                if 'stats' in repo_info:
                    stats = repo_info['stats']
                    logger.info(f" Files: {stats.get('file_count', 'N/A')}")
                    logger.info(f" Chunks: {stats.get('chunk_count', 'N/A')}")

            return True
        else:
            logger.error(f" ✗ {name} processing returned no results")
            return False

⚠️ Potential issue

Don’t gate success on a non-guaranteed 'stored' key; log robust metrics instead

GitHubPipelineManager.main() prints total_processed/total_functions; 'stored' may not exist, leading to false negatives. Consider success when process_repository returns without exception; log known fields when present.

-        # Process the repository directly using the manager
-        results = manager.process_repository(repo_url)
+        # Process the repository directly using the manager
+        results = manager.process_repository(repo_url_norm)
@@
-        # Check results
-        if results and 'stored' in results:
-            stored_count = results.get('stored', 0)
-            logger.info(f"  ✓ {name} processed successfully")
-            logger.info(f"    Stored {stored_count} embeddings")
+        # Consider success if a result dict is returned; log available metrics
+        if isinstance(results, dict):
+            file_count = results.get('total_processed')
+            func_count = results.get('total_functions')
+            emb_count = results.get('embeddings_stored', results.get('stored'))
+            logger.info(f"  ✓ {name} processed successfully")
+            if file_count is not None or func_count is not None or emb_count is not None:
+                logger.info(
+                    "    " +
+                    ", ".join(
+                        s for s in [
+                            f"Files={file_count}" if file_count is not None else None,
+                            f"Functions={func_count}" if func_count is not None else None,
+                            f"Embeddings={emb_count}" if emb_count is not None else None,
+                        ] if s
+                    )
+                )
@@
-            logger.error(f"  ✗ {name} processing returned no results")
+            logger.error(f"  ✗ {name} processing returned no results")
             return False

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In experiments/word2vec_evolution/process_github_repos.py around lines 39 to 61,
the current logic treats the absence of a non-guaranteed 'stored' key as a
failure; change it to consider the repository processed successfully if
process_repository returns without raising an exception (i.e., results is not
None/False), and only use 'stored' (and other optional fields like 'repository',
'stats', 'file_count', 'chunk_count') for logging when present; specifically,
replace the strict "if results and 'stored' in results" gate with a check for a
truthy results object, log stored_count if available (fallback to 'N/A' or 0),
log repository and stats when present, and return True for successful
non-exceptional returns while still returning False when results is None/False.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
experiments/word2vec_evolution/experiment_config.yaml (1)

113-113: Fix trailing spaces and add newline at EOF to satisfy yamllint.

Line 113 has stray spaces and the file lacks a final newline, which will keep yamllint failing.

Apply:

-  
+
🧹 Nitpick comments (1)
experiments/word2vec_evolution/experiment_config.yaml (1)

47-51: Clarify naming: consider proximity_threshold over where_threshold.

Improves readability; the intent is proximity in the graph.

-    where_threshold: 0.8  # Proximity in graph
+    proximity_threshold: 0.8  # Proximity in graph
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 2af1332 and 4d99984.

📒 Files selected for processing (1)
  • experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml

[error] 113-113: no new line character at the end of file

(new-line-at-end-of-file)


[error] 113-113: trailing spaces

(trailing-spaces)

🔇 Additional comments (2)
experiments/word2vec_evolution/experiment_config.yaml (2)

81-107: Verify loader supports list-of-strings under pipeline.stages
The experiment runner uses a plain yaml.safe_load (no Pydantic or schema validation) to read pipeline.stages; confirm the code that iterates over config['pipeline']['stages'] treats each stage value as a List[str] of step names rather than expecting objects with id/params.
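
An illustrative spot check of that assumption, using only yaml.safe_load against the config as currently written (stage names mapping to plain lists of step strings):

# Quick check that each stage value is a list of step-name strings,
# matching the map-of-lists layout currently in the config file.
import yaml

with open("experiments/word2vec_evolution/experiment_config.yaml") as f:
    config = yaml.safe_load(f)

for stage_name, steps in config["pipeline"]["stages"].items():
    assert isinstance(steps, list), f"{stage_name}: expected a list of steps"
    assert all(isinstance(step, str) for step in steps), stage_name
    print(f"{stage_name}: {len(steps)} steps")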


34-36: Embedding model/dimension consistency confirmed. The “jina-v4” model outputs 2048-dimensional embeddings per the Jina Embeddings API, and all code paths (defaults in DocumentProcessor, tests, and pipelines) consistently use embedding_model “jina-v4” with embedding_dimensions 2048.

Comment on lines +14 to +31
  word2vec:
    arxiv_id: "1301.3781"
    title: "Efficient Estimation of Word Representations in Vector Space"
    year: 2013
    github_repo: "https://github.com/dav/word2vec"

  doc2vec:
    arxiv_id: "1405.4053"
    title: "Distributed Representations of Sentences and Documents"
    year: 2014
    github_repo: "https://github.com/bnosac/doc2vec"

  code2vec:
    arxiv_id: "1803.09473"
    title: "code2vec: Learning Distributed Representations of Code"
    year: 2018
    github_repo: "https://github.com/tech-srl/code2vec"


💡 Verification agent

🧩 Analysis chain

Sanity-check paper → repo mappings.

Make sure these GitHub repos are the intended canonical implementations for the cited papers to avoid skewed bridge analyses.


🌐 Web query:

Are these canonical GitHub repos for the respective papers?
- 1301.3781 word2vec → github.com/dav/word2vec
- 1405.4053 doc2vec → github.com/bnosac/doc2vec
- 1803.09473 code2vec → github.com/tech-srl/code2vec

💡 Result:

Short answers:

  • 1301.3781 (word2vec): The canonical/original code is the Google/ Tomas Mikolov export (tmikolov/word2vec / code.google.com archive). dav/word2vec is a community mirror/clone, not the original canonical repo. [1][2]

  • 1405.4053 (doc2vec / Paragraph Vector): bnosac/doc2vec is a third‑party R package/wrapper (not the original paper authors' canonical code). The paper (Le & Mikolov) does not have a single official GitHub repo from the authors; common implementations live in Gensim and other third‑party repos (and some C/C++/wrappers such as hiyijian/doc2vec). [3][4]

  • 1803.09473 (code2vec): github.com/tech-srl/code2vec is the authors’ official repository referenced by the paper (official). [5]

Sources:
[1] tmikolov/word2vec (export of original Google code).
[2] dav/word2vec (mirror; README notes “Original from https://code.google.com/p/word2vec/”).
[3] ArXiv: “Distributed Representations of Sentences and Documents” (Le & Mikolov).
[4] bnosac/doc2vec (R wrapper / third‑party implementation; points to other impls).
[5] ArXiv + tech-srl/code2vec (paper links repo).


Use canonical implementations for word2vec and doc2vec; the code2vec mapping already points to the authors' official repository.

🤖 Prompt for AI Agents
experiments/word2vec_evolution/experiment_config.yaml lines 14-31: the YAML
points to non-canonical repos for word2vec and doc2vec; update word2vec
github_repo to the official upstream https://github.com/tmikolov/word2vec and
replace doc2vec github_repo with the canonical Gensim implementation
https://github.com/RaRe-Technologies/gensim (leave code2vec mapping as-is).

r3d91ll and others added 2 commits September 3, 2025 02:27
- Fix lexicographic YYMM filtering bug with proper century handling in overnight test scripts
- Update ArangoDBManager import paths to use core.database module
- Remove incorrect await from synchronous insert_document calls
- Fix tokenizer interface to support both callable and object-based tokenizers
- Fix validate_arxiv_id bug that was stripping all digits
- Clean up YAML formatting in experiment config

These changes resolve critical bugs identified in the code review.
fix: Address CodeRabbit review comments from PR #17
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
tests/test_overnight_arxiv_2000.py (1)

418-425: Report the actual number of papers processed.

set this to len(papers), not the requested num_docs.

-        'total_papers_processed': num_docs,
+        'total_papers_processed': len(papers),
core/processors/chunking_strategies.py (2)

85-95: Fix: _clean_text collapses newlines, breaking paragraph/sentence-aware strategies

Collapsing all whitespace to a single space erases newlines, so SemanticChunking can no longer detect paragraphs/sentences after calling _clean_text. Preserve newlines while normalizing other whitespace.

 def _clean_text(self, text: str) -> str:
-        # Remove excessive whitespace
-        text = re.sub(r'\s+', ' ', text)
-        # Remove null characters
-        text = text.replace('\x00', '')
-        return text.strip()
+        # Normalize line endings and whitespace while preserving paragraph breaks
+        text = text.replace('\r\n', '\n').replace('\r', '\n')
+        # Collapse inline whitespace but keep newlines
+        text = re.sub(r'[ \t\f\v]+', ' ', text)
+        # Collapse 3+ consecutive newlines to 2 to represent paragraph breaks
+        text = re.sub(r'\n{3,}', '\n\n', text)
+        # Remove null characters
+        text = text.replace('\x00', '')
+        return text.strip()

503-523: SlidingWindow remainder overlaps last window; compute correct remainder start

Remainder should begin at last_start + window_size, not len(chunks) * step_size.

-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
+        # Handle remainder: tokens after the last full window
+        if len(tokens) > self.window_size and chunks:
+            last_start = (len(chunks) - 1) * self.step_size
+            remainder_start = last_start + self.window_size
+            if remainder_start < len(tokens):
+                remaining_tokens = tokens[remainder_start:]
+                if len(remaining_tokens) >= self.step_size:  # Only if substantial
+                    chunk_text = ' '.join(remaining_tokens)
+                    start_char = len(' '.join(tokens[:remainder_start]))
core/processors/generic_document_processor.py (2)

341-357: Handle extractor returning no content (None/empty) to avoid downstream NoneType errors

WORKER_DOCLING.extract may return None; _embed_and_store expects a dict. Short-circuit with a failure result instead of staging an invalid document.

         # Extract PDF
         extracted = WORKER_DOCLING.extract(task.pdf_path)
+        if not extracted:
+            return {
+                'success': False,
+                'document_id': task.document_id,
+                'error': 'Extractor returned no content'
+            }

429-433: Ensure overridden collections exist (e.g., github_*), or inserts will fail

ArangoDBManager.ensure_collections creates only arxiv* by default. When overriding collections, create them up front.

     # Initialize DB manager with collections
     WORKER_DB_MANAGER = ArangoDBManager(arango_config)
     WORKER_DB_MANAGER.collections = collections  # Override with source-specific collections
+    # Ensure overridden collections exist
+    for coll in collections.values():
+        if not WORKER_DB_MANAGER.db.has_collection(coll):
+            WORKER_DB_MANAGER.db.create_collection(coll)
tools/arxiv/arxiv_manager.py (2)

442-455: Split chunk storage: separate arxiv_chunks from arxiv_embeddings and handle missing embeddings.

Aligns with the intended schema; also avoids failures when embeddings are absent.

-            # Store chunks with embeddings
-            for chunk in result.chunks:
-                chunk_doc = {
-                    '_key': f"{paper_info.sanitized_id}_chunk_{chunk.chunk_index}",
-                    'paper_id': paper_info.sanitized_id,
-                    'arxiv_id': paper_info.arxiv_id,
-                    'chunk_index': chunk.chunk_index,
-                    'text': chunk.text,
-                    'embedding': chunk.embedding.tolist(),
-                    'start_char': chunk.start_char,
-                    'end_char': chunk.end_char,
-                    'context_window_used': chunk.context_window_used
-                }
-                self.db_manager.insert_document('arxiv_embeddings', chunk_doc)
+            # Store chunks and embeddings (separate collections)
+            for chunk in result.chunks:
+                chunk_key = f"{paper_info.sanitized_id}_chunk_{chunk.chunk_index}"
+                chunk_only_doc = {
+                    '_key': chunk_key,
+                    'paper_id': paper_info.sanitized_id,
+                    'arxiv_id': paper_info.arxiv_id,
+                    'chunk_index': chunk.chunk_index,
+                    'text': chunk.text,
+                    'start_char': chunk.start_char,
+                    'end_char': chunk.end_char,
+                    'context_window_used': getattr(chunk, 'context_window_used', None),
+                }
+                self.db_manager.insert_document('arxiv_chunks', chunk_only_doc)
+
+                embedding_doc = {
+                    '_key': f"{chunk_key}_emb",
+                    'paper_id': paper_info.sanitized_id,
+                    'arxiv_id': paper_info.arxiv_id,
+                    'chunk_id': chunk_key,
+                    'chunk_index': chunk.chunk_index,
+                    'embedding': (chunk.embedding.tolist() if getattr(chunk, 'embedding', None) is not None else None),
+                }
+                self.db_manager.insert_document('arxiv_embeddings', embedding_doc)

1-490: Fix linting and formatting issues in tools/arxiv/arxiv_manager.py
Ruff reports 107 issues (72 auto-fixable) across these categories:

  • Import organization: unsorted imports and unused imports (os, asyncio, typing.Union, dataclasses.asdict) (I001, F401)
  • Type annotations: switch from deprecated typing.Dict/List/Tuple to built-ins (dict, list, tuple) and use X | None instead of Optional[X] (UP035, UP006, UP045)
  • Whitespace cleanup: remove blank lines containing spaces (W293) and add trailing newline at EOF (W292)
  • Dead code: remove unused locals (base_id, category) (F841)
  • Miscellaneous: remove the unnecessary open() mode argument (UP015) and use the datetime.UTC alias instead of timezone.utc when calling datetime.now() (UP017)

Apply ruff check --fix (add --unsafe-fixes for type-annotation changes) and Black formatting to resolve these issues.

♻️ Duplicate comments (10)
experiments/word2vec_evolution/experiment_config.yaml (2)

14-31: Use canonical repos for word2vec/doc2vec to avoid skewed mappings.

Swap to the upstream export for word2vec and a canonical implementation for doc2vec (Gensim). Code2vec entry is fine as-is.

   word2vec:
     arxiv_id: "1301.3781"
     title: "Efficient Estimation of Word Representations in Vector Space"
     year: 2013
-    github_repo: "https://github.com/dav/word2vec"
+    github_repo: "https://github.com/tmikolov/word2vec"

   doc2vec:
     arxiv_id: "1405.4053"
     title: "Distributed Representations of Sentences and Documents"
     year: 2014
-    github_repo: "https://github.com/bnosac/doc2vec"
+    github_repo: "https://github.com/RaRe-Technologies/gensim"

112-112: Fix YAML lint error: add a trailing newline at EOF.

CI is failing with "no new line character at the end of file". Add a single newline after Line 112.

   - validation_report/  # Metrics and hypothesis testing results
+
core/processors/chunking_strategies.py (1)

316-316: Fix duplicate chunk_index when force-splitting large paragraphs

_force_split_text starts indices at 0, causing duplicates when its chunks are extended into the main list. Offset indices by the current length of the outer list.

-                    chunks.extend(self._force_split_text(para, current_chunk_start))
+                    chunks.extend(self._force_split_text(para, current_chunk_start, base_index=len(chunks)))
-    def _force_split_text(self, text: str, start_char: int) -> List[TextChunk]:
+    def _force_split_text(self, text: str, start_char: int, base_index: int = 0) -> List[TextChunk]:
@@
-            chunks.append(TextChunk(
+            chunks.append(TextChunk(
                 text=chunk_text,
                 start_char=start_char,
                 end_char=start_char + len(chunk_text),
-                chunk_index=len(chunks),
+                chunk_index=base_index + len(chunks),
                 metadata={
                     'strategy': 'semantic_forced_split',
                     'token_count': len(chunk_words)
                 }
             ))

Also applies to: 369-401

tests/test_c_files.py (2)

133-145: Harden AQL against missing fields

Avoid errors when symbols/metrics are absent.

     query = """
     FOR doc IN github_papers
         FILTER doc.language == "c"
         LIMIT 5
         RETURN {
             id: doc.document_id,
             language: doc.language,
-            functions: LENGTH(doc.symbols.functions || []),
-            structs: LENGTH(doc.symbols.structs || []),
-            includes: LENGTH(doc.symbols.includes || []),
-            complexity: doc.code_metrics.complexity
+            functions: LENGTH(COALESCE(doc.symbols.functions, [])),
+            structs: LENGTH(COALESCE(doc.symbols.structs, [])),
+            includes: LENGTH(COALESCE(doc.symbols.includes, [])),
+            complexity: COALESCE(doc.code_metrics.complexity, 0)
         }
     """

124-126: Fix: incorrect result keys when logging processor summary

Use nested extraction/embedding success lists from GenericDocumentProcessor.

-    logger.info(f"Processed: {results.get('total_processed', 0)}")
-    logger.info(f"Extraction success: {results.get('extraction_success', 0)}")
-    logger.info(f"Embedding success: {results.get('embedding_success', 0)}")
+    logger.info(f"Processed: {results.get('total_processed', 0)}")
+    extraction_ok = len(results.get('extraction', {}).get('success', []))
+    embedding_ok = len(results.get('embedding', {}).get('success', []))
+    logger.info(f"Extraction success: {extraction_ok}")
+    logger.info(f"Embedding success: {embedding_ok}")
tests/test_github_treesitter.py (3)

108-115: Use localhost for ARANGO_HOST default, not a private IP.

Avoid environment-specific private IPs in test defaults.

-            'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
+            'host': os.getenv('ARANGO_HOST', 'http://localhost:8529'),

160-194: Guard ArangoDB query and handle connection errors.

Skip DB verification when host/password are not set and catch connectivity errors to keep tests resilient.

-    # Check if Tree-sitter metadata was stored
-    if results.get('embedding_success', 0) > 0:
-        logger.info("\nChecking stored metadata in ArangoDB...")
-        from core.database.arango_db_manager import ArangoDBManager
-        
-        db_manager = ArangoDBManager(config['arango'])
-        
-        # Query for documents with Tree-sitter data
-        query = """
+    # Check if Tree-sitter metadata was stored (only when DB access is configured)
+    extraction = results.get('extraction', {})
+    embedding = results.get('embedding', {})
+    can_query_db = bool(config['arango'].get('host') and config['arango'].get('password'))
+    if can_query_db and len(embedding.get('success', [])) > 0:
+        logger.info("\nChecking stored metadata in ArangoDB...")
+        from core.database.arango_db_manager import ArangoDBManager
+        try:
+            db_manager = ArangoDBManager(config['arango'])
+            # Query for documents with Tree-sitter data
+            query = """
         FOR doc IN github_papers
             FILTER doc.has_tree_sitter == true
             LIMIT 5
             RETURN {
                 id: doc.document_id,
                 language: doc.language,
                 has_symbols: doc.has_tree_sitter,
                 function_count: LENGTH(doc.symbols.functions),
                 class_count: LENGTH(doc.symbols.classes),
                 complexity: doc.code_metrics.complexity
             }
         """
-        
-        cursor = db_manager.db.aql.execute(query)
-        docs_with_symbols = list(cursor)
-        
-        if docs_with_symbols:
-            logger.info(f"Found {len(docs_with_symbols)} documents with Tree-sitter symbols:")
-            for doc in docs_with_symbols:
-                logger.info(f"  - {doc['id']}: {doc['language']} "
-                          f"({doc['function_count']} functions, "
-                          f"{doc['class_count']} classes, "
-                          f"complexity: {doc['complexity']})")
-        else:
-            logger.info("No documents with Tree-sitter symbols found in database")
+            cursor = db_manager.db.aql.execute(query)
+            docs_with_symbols = list(cursor)
+            if docs_with_symbols:
+                logger.info(f"Found {len(docs_with_symbols)} documents with Tree-sitter symbols:")
+                for doc in docs_with_symbols:
+                    logger.info(f"  - {doc['id']}: {doc['language']} "
+                                f"({doc['function_count']} functions, "
+                                f"{doc['class_count']} classes, "
+                                f"complexity: {doc['complexity']})")
+            else:
+                logger.info("No documents with Tree-sitter symbols found in database")
+        except Exception as e:
+            logger.warning(f"Skipping DB check due to connection error: {e}")
+    else:
+        logger.info("Skipping ArangoDB verification (ARANGO_HOST/ARANGO_PASSWORD not set or no embeddings).")

56-66: Fix: metadata['type'] check skips all files; process tasks directly

GitHubDocumentManager tasks don’t set metadata['type']; this guard prevents any processing.

-    # Process first 3 code files
+    # Process up to 3 files
     code_count = 0
-    for task in tasks[:10]:
-        if task.metadata.get('type') == 'code':
+    for task in tasks[:10]:
         logger.info(f"\n{'='*60}")
         logger.info(f"Processing: {task.document_id}")
         logger.info(f"File: {task.pdf_path}")
         
         # Extract with Tree-sitter
         result = code_extractor.extract(task.pdf_path)
         
-        if result and result.get('symbols'):
+        if result and result.get('symbols'):
             logger.info(f"Language: {result.get('language')}")
             
             # Display symbol counts
             symbols = result['symbols']
             logger.info("Symbol Table:")
             for category, items in symbols.items():
                 if items:
                     logger.info(f"  {category}: {len(items)} items")
                     # Show first 3 items
                     for item in items[:3]:
                         if isinstance(item, dict):
                             name = item.get('name', item.get('statement', str(item)))
                             line = item.get('line', '?')
                             logger.info(f"    - {name} (line {line})")
             
             # Display metrics
             if result.get('code_metrics'):
                 metrics = result['code_metrics']
                 logger.info("Code Metrics:")
                 logger.info(f"  Lines of code: {metrics.get('lines_of_code', 0)}")
                 logger.info(f"  Complexity: {metrics.get('complexity', 0)}")
                 logger.info(f"  Max depth: {metrics.get('max_depth', 0)}")
                 logger.info(f"  Node count: {metrics.get('node_count', 0)}")
             
             # Display symbol hash
             if result.get('symbol_hash'):
                 logger.info(f"Symbol hash: {result['symbol_hash'][:16]}...")
             
             code_count += 1
             if code_count >= 3:
                 break
-        else:
-            logger.info("No Tree-sitter symbols extracted (might not be a supported language)")
+        else:
+            logger.info("No Tree-sitter symbols extracted (might not be a supported language)")
tools/arxiv/arxiv_manager.py (2)

101-106: Remove unused base_id in validate_arxiv_id (Ruff F841) and align comment.

The regex already accepts optional version suffix; no need to pre-strip. Also avoids the unused variable.

-        # Remove any version suffix for validation
-        base_id = re.sub(r'v\d+$', '', arxiv_id)
+        # Version suffix allowed by regex; no pre-strip needed

143-149: Remove unused local category in get_pdf_path (Ruff F841).

Variable is assigned but not used.

-            category = match.group(1)
-            number = match.group(2)
+            number = match.group(2)
             yymm = number[:4]
             return cls.PDF_BASE_PATH / yymm / f"{base_id.replace('/', '_')}.pdf"
🧹 Nitpick comments (44)
experiments/word2vec_evolution/experiment_config.yaml (4)

38-40: Clarify chunk_overlap units (tokens vs characters vs bytes).

Label the unit to prevent misconfiguration across components.

-  chunk_size: 512
-  chunk_overlap: 50
+  chunk_size: 512        # tokens
+  chunk_overlap: 50      # tokens

47-51: Make bridge thresholds tunable to reduce “no results” risk.

0.85/0.9 thresholds are stringent; expose overrides via CLI/env to ease calibration.

Example (for your runner to read as optional overrides):

 analysis:
   # Bridge detection
   bridge_criteria:
     semantic_similarity_threshold: 0.85
     conveyance_threshold: 0.9
     where_threshold: 0.8  # Proximity in graph
+  params_overrides_env:
+    semantic_similarity_threshold: BRIDGE_SIM_THR
+    conveyance_threshold: BRIDGE_CONVEY_THR
+    where_threshold: BRIDGE_WHERE_THR

81-92: Surface storage connection settings referenced in pipeline.

You store embeddings in ArangoDB here but the config lacks connection hints. Add env‑backed settings (no secrets).

 analysis:
   embedding_model: "jina-v4"
   embedding_dimensions: 2048
+connections:
+  arangodb:
+    url_env: ARANGODB_URL
+    db_env: ARANGODB_DB
+    user_env: ARANGODB_USER
+    password_env: ARANGODB_PASSWORD
+  postgres:
+    dsn_env: POSTGRES_DSN
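If the runner adopts these keys, a small resolver could map them to live credentials at run time. This is a sketch only; `resolve_connections` and the key names are assumptions, not existing project API:

```python
# Sketch: resolve *_env keys from the config into a connection dict at runtime,
# so the YAML itself never holds secrets. Names here are illustrative assumptions.
import os


def resolve_connections(cfg: dict) -> dict:
    arango = cfg.get("connections", {}).get("arangodb", {})
    return {
        "url": os.getenv(arango.get("url_env", "ARANGODB_URL"), ""),
        "db": os.getenv(arango.get("db_env", "ARANGODB_DB"), ""),
        "user": os.getenv(arango.get("user_env", "ARANGODB_USER"), ""),
        "password": os.getenv(arango.get("password_env", "ARANGODB_PASSWORD"), ""),
    }
```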

15-30: Add explicit arXiv URLs for convenience.

Keep ids, but URLs help tooling and reviewers.

   word2vec:
     arxiv_id: "1301.3781"
+    arxiv_url: "https://arxiv.org/abs/1301.3781"
   ...
   doc2vec:
     arxiv_id: "1405.4053"
+    arxiv_url: "https://arxiv.org/abs/1405.4053"
   ...
   code2vec:
     arxiv_id: "1803.09473"
+    arxiv_url: "https://arxiv.org/abs/1803.09473"
tools/arxiv/arxiv_document_manager.py (7)

19-22: Avoid sys.path hacks; rely on package-absolute imports per guidelines.

This is brittle and bypasses proper packaging. The coding guidelines request absolute imports without path mutation.

-import sys
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
 from core.processors.generic_document_processor import DocumentTask

34-49: Harden base-dir validation (check directory, not just existence).

If a file exists at the path, current logic passes. Validate it’s an actual directory.

-        if not self.pdf_base_dir.exists():
+        if not self.pdf_base_dir.is_dir():
             raise ValueError(f"PDF base directory does not exist: {pdf_base_dir}")

64-71: Reduce duplicate logging on failed preparations.

Both this method and _prepare_single_document log failures; this can spam logs. Downgrade here to debug.

         for arxiv_id in arxiv_ids:
             task = self._prepare_single_document(arxiv_id)
             if task:
                 tasks.append(task)
             else:
-                logger.warning(f"Could not prepare document for {arxiv_id}")
+                logger.debug(f"Could not prepare document for {arxiv_id}")

110-119: Align “newest-first” behavior with implementation.

Docstring promises newest-first; current sort is lexicographical by filename. Either update doc or sort by mtime. Also ignore non-YYMM dirs.

-        yymm_dirs = sorted([d for d in self.pdf_base_dir.iterdir() if d.is_dir()], 
-                          reverse=True)
+        yymm_dirs = sorted(
+            [d for d in self.pdf_base_dir.iterdir() if d.is_dir() and d.name.isdigit() and len(d.name) == 4],
+            reverse=True,
+        )
@@
-            pdf_files = sorted(yymm_dir.glob("*.pdf"), reverse=True)
+            pdf_files = sorted(yymm_dir.glob("*.pdf"), key=lambda p: p.stat().st_mtime, reverse=True)

Also applies to: 122-125, 131-131


182-197: Broaden LaTeX source detection to common tarball variants.

arXiv source tarballs often include additional suffixes. Fall back to any “stem*.tar.gz”.

         # Check for LaTeX source
         latex_path = pdf_path.with_suffix('.tex')
         if not latex_path.exists():
             # Try .tar.gz (compressed LaTeX)
             latex_path = pdf_path.with_suffix('.tar.gz')
             if not latex_path.exists():
-                latex_path = None
+                # Fallback: any tarball starting with the stem (e.g., *source*.tar.gz)
+                matches = list(pdf_path.parent.glob(f"{pdf_path.stem}*.tar.gz"))
+                latex_path = matches[0] if matches else None

Also applies to: 198-208


226-237: Validate month in YYMM and keep version check.

Accept only 01–12 months; everything else should be invalid.

         # Check year_month is 4 digits
         if not year_month.isdigit() or len(year_month) != 4:
             return False
+        # Validate month range (01–12)
+        mm = year_month[2:]
+        if not (1 <= int(mm) <= 12):
+            return False
         
         # Check paper number is 4-6 digits, optionally followed by version like v2
         base, _, version = paper_num.partition('v')
         if not base.isdigit() or not (4 <= len(base) <= 6):
             return False
         # If there's a version, validate it's numeric
         if version and not version.isdigit():
             return False

Also applies to: 245-247, 249-255


16-18: Remove unused imports to satisfy Ruff.

Dict, Any, and dataclass aren’t used.

-from typing import List, Optional, Dict, Any
-from dataclasses import dataclass
+from typing import List, Optional
tests/test_overnight_arxiv_2000.py (9)

66-76: Clarify parameter format in docstring.

Call this YYMM or YYYYMM to match normalization and the CLI.

-        start_year (str): Inclusive YYMM directory name to start searching from (e.g., "1501" = January 2015).
+        start_year (str): Inclusive year-month to start from, in YYMM or YYYYMM
+            (e.g., "1501" or "201501" = January 2015).

86-87: Stabilize sampling for reproducibility.

Always use a local RNG and sort files so order is deterministic when taking all PDFs from a dir.

-    rng = random.Random(seed) if seed is not None else random
+    rng = random.Random(seed) if seed is not None else random.Random()
@@
-            sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files
+            sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files

Additionally (outside selected lines), sort the directory listing once:

# at line 108
pdf_files = sorted(yymm_dir.glob("*.pdf"), key=lambda p: p.name)

Also applies to: 113-113


129-159: Detect LaTeX recursively to handle nested project structures.

Many arXiv sources keep .tex files in subfolders; use rglob for accuracy.

-        if latex_dir.exists():
-            # Check for .tex files
-            tex_files = list(latex_dir.glob("*.tex"))
-            if tex_files:
-                latex_count += 1
+        if latex_dir.exists():
+            # Check for .tex files anywhere under the directory
+            if any(latex_dir.rglob("*.tex")):
+                latex_count += 1

162-189: Docstring: add start_year parameter and format details.

Keep docs aligned with the function signature and CLI.

     Parameters:
-        num_docs (int): Maximum number of ArXiv papers to include in the test (default 2000).
+        num_docs (int): Maximum number of ArXiv papers to include in the test (default 2000).
+        start_year (str): Inclusive year-month to start from, in YYMM or YYYYMM.

203-203: Expose sampling seed end-to-end for reproducible runs.

Forward a seed from the CLI into collect_arxiv_papers so the same corpus is selected across runs.

If acceptable, I can add:

  • run_arxiv_overnight_test(..., seed: int | None = None) and pass it to collect_arxiv_papers
  • CLI flag --seed to populate it.
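A minimal sketch of that wiring, assuming the sampler can accept a seed argument; `collect_arxiv_papers` below is a simplified stand-in, not the project's implementation:

```python
# Sketch: thread a --seed flag through to the sampler for reproducible runs.
# collect_arxiv_papers here is a placeholder stand-in for the real function.
import argparse
import random
from typing import List, Optional


def collect_arxiv_papers(num_docs: int, seed: Optional[int] = None) -> List[str]:
    rng = random.Random(seed) if seed is not None else random.Random()
    candidates = [f"2301.{i:05d}" for i in range(10_000)]  # placeholder corpus
    return rng.sample(candidates, min(num_docs, len(candidates)))


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-docs", type=int, default=2000)
    parser.add_argument("--seed", type=int, default=None,
                        help="RNG seed for reproducible paper sampling")
    args = parser.parse_args()
    papers = collect_arxiv_papers(args.num_docs, seed=args.seed)
    print(f"Selected {len(papers)} papers (seed={args.seed})")


if __name__ == "__main__":
    main()
```

Running the script twice with the same --seed then yields the same selection.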

371-385: Guard division by zero in short runs.

Rare but safer; keeps metrics robust if a tiny corpus finishes instantly.

-            'papers_per_minute': (len(papers) / test_time) * 60,
+            'papers_per_minute': ((len(papers) / test_time) * 60) if test_time > 0 else 0.0,

399-403: Fix failure logging to show readable, bounded error text.

The current code slices a list; join the messages and truncate by characters instead.

-        if failed_docs:
-            logger.warning(f"  Failed Papers: {len(failed_docs)}")
-            for fail in failed_docs[:3]:
-                logger.warning(f"    - {fail['arxiv_id']}: {fail['errors'][:100]}")
+        if failed_docs:
+            logger.warning(f"  Failed Papers: {len(failed_docs)}")
+            for fail in failed_docs[:3]:
+                err = fail.get('errors')
+                err_str = "; ".join(map(str, err)) if isinstance(err, list) else str(err)
+                logger.warning(f"    - {fail['arxiv_id']}: {err_str[:200]}")

431-436: Make the performance target configurable.

Read from an env var with a sane default; an optional CLI flag can be added later.

-    target_rate = 11.3  # papers/minute from PRD
+    target_rate = float(os.getenv("ARXIV_TARGET_RATE", "11.3"))  # papers/minute

462-466: Broaden CLI help to reflect accepted formats.

Matches the YYMM/YYYYMM normalization.

-        help="Starting YYMM for paper selection (default: 1501 = Jan 2015)"
+        help="Start year-month (YYMM or YYYYMM). Default: 1501 (Jan 2015)"
core/processors/chunking_strategies.py (5)

141-143: Validate chunk_size/overlap ranges (negative/zero cases)

Guard against chunk_size <= 0 or negative overlap to avoid undefined behavior.

-        if chunk_overlap >= chunk_size:
-            raise ValueError("Overlap must be less than chunk size")
+        if chunk_size <= 0:
+            raise ValueError("chunk_size must be > 0")
+        if chunk_overlap < 0:
+            raise ValueError("chunk_overlap must be >= 0")
+        if chunk_overlap >= chunk_size:
+            raise ValueError("Overlap must be less than chunk size")

20-20: Remove unused import

numpy is imported but never used in this module.

-import numpy as np

173-200: Avoid O(n^2) start_char computation in token-based chunking

len(' '.join(tokens[:i])) inside the loop is quadratic. Precompute prefix char offsets once.

-        for i in range(0, len(tokens), stride):
+        # Precompute prefix char offsets to avoid quadratic joins
+        if tokens:
+            spaces = [0] + [1] * (len(tokens) - 1)
+            token_lens = [len(t) for t in tokens]
+            prefix_chars = [0]
+            running = 0
+            for tlen, space in zip(token_lens, spaces):
+                running += tlen + space
+                prefix_chars.append(running)
+        for i in range(0, len(tokens), stride):
@@
-            # Calculate character positions (approximate)
-            start_char = len(' '.join(tokens[:i]))
+            # Calculate character positions (approximate)
+            start_char = prefix_chars[i] if tokens else 0

463-465: Validate window_size and step_size > 0

Prevent runtime ValueError from range step=0 and clarify constraints early.

         if step_size > window_size:
             raise ValueError("Step size cannot be larger than window size")
+        if window_size <= 0:
+            raise ValueError("window_size must be > 0")
+        if step_size <= 0:
+            raise ValueError("step_size must be > 0")

17-17: Modernize typing per Ruff hints (Callable from collections.abc, use | unions)

Keep compatibility while addressing UP035/UP007.

-from typing import List, Dict, Any, Optional, Tuple, Callable, Union
+from typing import List, Dict, Any, Optional, Tuple
+from collections.abc import Callable
@@
-        tokenizer: Optional[Union[Callable[[str], List[str]], Any]] = None
+        tokenizer: Optional[Callable[[str], List[str]] | Any] = None

Also applies to: 111-111

core/processors/generic_document_processor.py (3)

183-194: Make extraction future timeout configurable

Long documents/code may exceed 120s. Allow config override to avoid spurious failures.

-                        result = future.result(timeout=120)
+                        result = future.result(timeout=extraction_config.get('future_timeout_seconds', 120))

218-224: Make embedding future timeout configurable

Same rationale as extraction; 300s may be low under heavy GPU load.

-                        result = future.result(timeout=300)
+                        result = future.result(timeout=embedding_config.get('future_timeout_seconds', 300))

Also applies to: 247-258


409-413: Avoid mutating sys.path in worker init; prefer package install/imports

sys.path hacks are brittle in multiprocess contexts. Prefer installing the package or setting PYTHONPATH for the test environment.

tests/test_c_files.py (3)

13-13: Remove unused import

GitHubDocumentManager isn’t used in this test.

-from tools.github.github_document_manager import GitHubDocumentManager

163-163: Add trailing newline

Conform to POSIX/linters.

 if __name__ == "__main__":
     test_c_files()
+

86-107: Optional: make Arango connection and query conditional

Consider skipping DB steps if ARANGO_HOST/ARANGO_PASSWORD are unset or connection fails, to keep the test runnable in CI without DB.

Also applies to: 129-158
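One way that guard could look, as a sketch under the assumption that ArangoDBManager accepts the same config dict used elsewhere in this PR:

```python
# Sketch: skip DB verification when credentials are absent or the connection
# fails, so the test still runs in CI without a live ArangoDB.
import logging
import os

logger = logging.getLogger(__name__)


def maybe_verify_in_arango(config: dict) -> None:
    if not (os.getenv("ARANGO_HOST") and os.getenv("ARANGO_PASSWORD")):
        logger.info("Skipping ArangoDB verification (ARANGO_HOST/ARANGO_PASSWORD not set)")
        return
    try:
        from core.database.arango_db_manager import ArangoDBManager
        db_manager = ArangoDBManager(config["arango"])
        cursor = db_manager.db.aql.execute(
            "FOR doc IN github_papers LIMIT 1 RETURN doc._key"
        )
        logger.info(f"ArangoDB reachable, sample key: {list(cursor)}")
    except Exception as exc:  # broad by design: DB issues should not fail the test
        logger.warning(f"Skipping DB check due to connection error: {exc}")
```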

tests/test_github_treesitter.py (5)

127-129: Use a temp staging directory for portability and isolation.

Hardcoding /dev/shm is Linux-specific; prefer a temp dir and still clean it up.

-        'staging': {
-            'directory': '/dev/shm/github_test_staging'
-        }
+        'staging': {
+            'directory': os.getenv('STAGING_DIR', tempfile.mkdtemp(prefix='github_test_staging_'))
+        }
@@
-    import shutil
-    shutil.rmtree(config['staging']['directory'], ignore_errors=True)
+    import shutil
+    shutil.rmtree(config['staging']['directory'], ignore_errors=True)

Add the required import:

@@
-import os
-import json
+import os
+import json
+import tempfile

Also applies to: 195-197


35-39: Avoid hardcoding clone_dir; make it configurable with an env default.

Improves portability and avoids requiring a specific mount path.

-    github_manager = GitHubDocumentManager(
-        clone_dir="/bulk-store/git",
-        cleanup=False  # Keep repos for analysis
-    )
+    github_manager = GitHubDocumentManager(
+        clone_dir=os.getenv('GITHUB_CLONE_DIR', '/tmp/github_repos'),
+        cleanup=False  # Keep repos for analysis
+    )

14-14: Avoid fragile sys.path hacks; compute project root robustly.

Keeps absolute imports while preventing accidental wrong paths.

-# Add parent directories to path
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
+# Ensure project root on sys.path (tests expect package-style imports)
+project_root = Path(__file__).resolve().parents[1]
+if str(project_root) not in sys.path:
+    sys.path.insert(0, str(project_root))

123-125: Make GPU devices configurable via env; default to CPU.

Prevents failures on machines without GPUs.

-            'embedding': {
-                'workers': 1,
-                'gpu_devices': [0]
-            }
+            'embedding': {
+                'workers': 1,
+                'gpu_devices': json.loads(os.getenv('GPU_DEVICES', '[]'))
+            }

202-203: Add trailing newline (ruff W292).

Satisfies linters and POSIX text file conventions.

tools/arxiv/arxiv_manager.py (8)

16-25: Prune unused imports and types to satisfy Ruff and reduce noise.

os, asyncio, asdict, and Union aren't used; each unused import triggers Ruff F401.

-import os
 import re
 import json
 import logging
-import asyncio
 from pathlib import Path
 from datetime import datetime, timezone
-from typing import Dict, Any, List, Optional, Tuple, Union
-from dataclasses import dataclass, asdict
+from typing import Dict, Any, List, Optional, Tuple
+from dataclasses import dataclass

276-289: Open metadata with UTF-8 and log with traceback for resilience.

Minor robustness: specify encoding and use exception logging with stack trace.

-                with open(metadata_path, 'r') as f:
+                with open(metadata_path, 'r', encoding='utf-8') as f:
@@
-            except Exception as e:
-                logger.warning(f"Failed to load metadata cache: {e}")
+            except Exception:
+                logger.exception("Failed to load metadata cache")

327-343: Harden metadata coercion for authors/categories.

Metadata often uses authors_parsed and space-delimited categories. Consider normalizing gracefully to avoid type surprises.

  • Prefer authors = [" ".join(a[:2]).strip() for a in metadata.get("authors_parsed", [])] as first choice, else parse string, else list fallback.
  • Normalize categories via split() if str; ensure list of str.

If you want, I can draft a small _normalize_metadata() helper and wire it here.
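A possible shape for that helper, as a sketch only; it assumes the snapshot-style fields authors_parsed, authors, and categories mentioned above:

```python
# Sketch of _normalize_metadata(): coerce authors/categories into predictable
# list-of-str shapes. Field names follow the arXiv metadata snapshot format.
from typing import Any, Dict


def _normalize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    # Authors: prefer authors_parsed ([["Last", "First", ...], ...]),
    # fall back to a comma-separated string, then to a plain list.
    parsed = metadata.get("authors_parsed")
    if parsed:
        authors = [" ".join(parts[:2]).strip() for parts in parsed]
    else:
        raw = metadata.get("authors", [])
        authors = [a.strip() for a in raw.split(",")] if isinstance(raw, str) else list(raw or [])

    # Categories: arXiv stores these space-delimited, e.g. "cs.CL cs.LG".
    cats = metadata.get("categories", [])
    categories = cats.split() if isinstance(cats, str) else [str(c) for c in (cats or [])]

    return {**metadata, "authors": authors, "categories": categories}
```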


372-377: Potential event-loop blocking.

process_document appears CPU/IO heavy; calling it directly in async context can block. If this runs under asyncio with other tasks, consider offloading to a thread/process pool.

Example pattern:

  • await asyncio.get_running_loop().run_in_executor(None, functools.partial(self.processor.process_document, ...))
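A runnable illustration of that pattern, with a stub standing in for the real processor (the actual DocumentProcessor call is the blocking part):

```python
# Sketch: offload a blocking process_document call so the event loop stays
# responsive. StubProcessor is a stand-in for the real DocumentProcessor.
import asyncio
import functools
import time


class StubProcessor:
    def process_document(self, pdf_path: str) -> str:
        time.sleep(1)  # simulate heavy, blocking extraction work
        return f"processed {pdf_path}"


async def main() -> None:
    processor = StubProcessor()
    loop = asyncio.get_running_loop()
    # None -> default ThreadPoolExecutor; use a ProcessPoolExecutor instead if
    # the work is CPU-bound and the arguments are picklable.
    result = await loop.run_in_executor(
        None, functools.partial(processor.process_document, "paper.pdf")
    )
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```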

431-435: Guard against None for extraction counts.

If any of these are None, len(None) will raise.

-                'num_chunks': len(result.chunks),
-                'num_tables': len(result.extraction.tables),
-                'num_equations': len(result.extraction.equations),
-                'num_images': len(result.extraction.images),
+                'num_chunks': len(result.chunks),
+                'num_tables': len(getattr(result.extraction, 'tables', []) or []),
+                'num_equations': len(getattr(result.extraction, 'equations', []) or []),
+                'num_images': len(getattr(result.extraction, 'images', []) or []),

457-467: Default structure lists to empty and keep schema stable.

Safer if any extractor component returns None; also future-proof if figures/images differ.

-                structures_doc = {
+                structures_doc = {
                     '_key': paper_info.sanitized_id,
                     'arxiv_id': paper_info.arxiv_id,
-                    'tables': result.extraction.tables,
-                    'equations': result.extraction.equations,
-                    'images': result.extraction.images,
-                    'figures': result.extraction.figures
+                    'tables': getattr(result.extraction, 'tables', []) or [],
+                    'equations': getattr(result.extraction, 'equations', []) or [],
+                    'images': getattr(result.extraction, 'images', []) or [],
+                    'figures': getattr(result.extraction, 'figures', []) or [],
                 }

471-472: Log full traceback on storage failures.

Use logger.exception for better diagnostics.

-        except Exception as e:
-            logger.error(f"Failed to store ArXiv paper {paper_info.arxiv_id}: {e}")
+        except Exception:
+            logger.exception(f"Failed to store ArXiv paper {paper_info.arxiv_id}")
             raise

26-30: Avoid function-local import; import ExtractionResult at module scope.

Saves repeated import and matches style.

 from core.processors.document_processor import (
     DocumentProcessor,
     ProcessingConfig,
-    ProcessingResult
+    ProcessingResult,
+    ExtractionResult,
 )
-                from core.processors.document_processor import ProcessingResult, ExtractionResult
+                # imports available at module scope

Also applies to: 500-501

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4d99984 and 638dabd.

📒 Files selected for processing (8)
  • core/processors/chunking_strategies.py (12 hunks)
  • core/processors/generic_document_processor.py (11 hunks)
  • experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
  • tests/test_c_files.py (1 hunks)
  • tests/test_github_treesitter.py (1 hunks)
  • tests/test_overnight_arxiv_2000.py (6 hunks)
  • tools/arxiv/arxiv_document_manager.py (8 hunks)
  • tools/arxiv/arxiv_manager.py (15 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer absolute imports following the shown module structure (e.g., core.framework.*, tools.arxiv.pipelines.*)

Files:

  • tests/test_github_treesitter.py
  • tests/test_c_files.py
  • core/processors/chunking_strategies.py
  • core/processors/generic_document_processor.py
  • tests/test_overnight_arxiv_2000.py
  • tools/arxiv/arxiv_manager.py
  • tools/arxiv/arxiv_document_manager.py
{tools/arxiv,core/framework}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv/ and core/framework/
Run Ruff linting on tools/arxiv/ and core/framework/

Files:

  • tools/arxiv/arxiv_manager.py
  • tools/arxiv/arxiv_document_manager.py
🧠 Learnings (2)
📚 Learning: 2025-09-02T23:17:37.712Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-02T23:17:37.712Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/check_db_status.py} : Use ArangoDB collections: arxiv_papers, arxiv_chunks, arxiv_embeddings, arxiv_structures (with specified fields)

Applied to files:

  • tests/test_github_treesitter.py
  • tools/arxiv/arxiv_manager.py
📚 Learning: 2025-09-02T23:17:37.712Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-02T23:17:37.712Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/check_db_status.py} : Store all outputs in ArangoDB with atomic transactions

Applied to files:

  • tools/arxiv/arxiv_manager.py
🧬 Code graph analysis (5)
tests/test_github_treesitter.py (3)
tools/github/github_document_manager.py (2)
  • GitHubDocumentManager (52-312)
  • prepare_repository (91-127)
core/processors/generic_document_processor.py (2)
  • GenericDocumentProcessor (46-261)
  • process_documents (91-128)
core/framework/extractors/code_extractor.py (1)
  • CodeExtractor (18-192)
tests/test_c_files.py (3)
tools/github/github_document_manager.py (1)
  • GitHubDocumentManager (52-312)
core/processors/generic_document_processor.py (3)
  • GenericDocumentProcessor (46-261)
  • DocumentTask (29-43)
  • process_documents (91-128)
core/framework/extractors/code_extractor.py (1)
  • CodeExtractor (18-192)
core/processors/chunking_strategies.py (1)
experiments/word2vec_evolution/generate_embeddings.py (1)
  • chunk_text (30-57)
core/processors/generic_document_processor.py (3)
core/framework/embedders.py (1)
  • JinaV4Embedder (52-1003)
core/database/arango_db_manager.py (1)
  • ArangoDBManager (69-252)
tools/arxiv/arxiv_manager.py (1)
  • sanitized_id (60-67)
tests/test_overnight_arxiv_2000.py (1)
tests/test_overnight_arxiv_phased.py (1)
  • collect_arxiv_papers (57-101)
🪛 Ruff (0.12.2)
tests/test_github_treesitter.py

203-203: No newline at end of file

Add trailing newline

(W292)

tests/test_c_files.py

163-163: No newline at end of file

Add trailing newline

(W292)

core/processors/chunking_strategies.py

17-17: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


17-17: typing.List is deprecated, use list instead

(UP035)


17-17: typing.Dict is deprecated, use dict instead

(UP035)


17-17: typing.Tuple is deprecated, use tuple instead

(UP035)


111-111: Use X | Y for type annotations

Convert to X | Y

(UP007)

🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml

[error] 112-112: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (11)
experiments/word2vec_evolution/experiment_config.yaml (2)

1-12: Overall: concise, readable config.

Good use of document marker, structured sections, and wrapped descriptions.


34-36: No action needed—Jina v4 outputs 2048-dim embeddings as configured. Tests, pipeline code, and docs consistently assert EMBEDDING_DIM = 2048 for model “jina-v4,” so the settings in experiment_config.yaml are correct.

tools/arxiv/arxiv_document_manager.py (1)

52-61: Docstring clarity—LGTM.

The behavior description (skip invalid/missing, preserve order) is clear and helpful.

core/processors/generic_document_processor.py (1)

271-287: Cannot verify GPU sharding via mp.current_process()._identity
The test script failed with a SemLock error in this environment, so please manually confirm that mp.current_process()._identity is reliably non-empty under your executor. If it isn't, switch to a simpler round-robin allocator (e.g. an atomic counter) for assigning CUDA_VISIBLE_DEVICES.
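For reference, a round-robin allocator along those lines might look like the following sketch; the initializer-based wiring and names are assumptions, not the processor's current code:

```python
# Sketch: assign GPUs to pool workers with a shared round-robin counter,
# instead of inspecting mp.current_process()._identity. Names are illustrative.
import multiprocessing as mp
import os


def _init_worker(shared_counter, gpu_devices):
    """Pool initializer: pin this worker to the next GPU in round-robin order."""
    with shared_counter.get_lock():
        idx = shared_counter.value % len(gpu_devices)
        shared_counter.value += 1
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_devices[idx])


def _which_gpu(_):
    return os.environ.get("CUDA_VISIBLE_DEVICES")


if __name__ == "__main__":
    gpu_devices = [0, 1]
    counter = mp.Value("i", 0)  # shared atomic counter, created in the parent
    with mp.Pool(processes=4, initializer=_init_worker,
                 initargs=(counter, gpu_devices)) as pool:
        print(pool.map(_which_gpu, range(4)))
```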

tools/arxiv/arxiv_manager.py (7)

26-31: Absolute imports: good move.

Using absolute imports (core.*) aligns with repo guidelines and eases refactors.


61-66: Docstring clarity for sanitized_id looks good.

Explains transformation and usage clearly.


117-132: PDF path docstring is precise and helpful.

No issues.


155-158: LaTeX path docstring reads well.

Clear precedence for main file detection.


183-197: parse_arxiv_id docs LGTM.

Covers new/old formats and returned fields succinctly.


242-249: init docstring is clear about wiring and behavior.

Nice.


480-490: Batch processing behavior documented well.

Ordering guarantee and failure shaping are clear.

Comment on lines +13 to +31
papers:
  word2vec:
    arxiv_id: "1301.3781"
    title: "Efficient Estimation of Word Representations in Vector Space"
    year: 2013
    github_repo: "https://github.com/dav/word2vec"

  doc2vec:
    arxiv_id: "1405.4053"
    title: "Distributed Representations of Sentences and Documents"
    year: 2014
    github_repo: "https://github.com/bnosac/doc2vec"

  code2vec:
    arxiv_id: "1803.09473"
    title: "code2vec: Learning Distributed Representations of Code"
    year: 2018
    github_repo: "https://github.com/tech-srl/code2vec"


🛠️ Refactor suggestion

Pin GitHub repos by commit for reproducibility.

Unpinned default branches are moving targets; runs won’t be reproducible.

 papers:
   word2vec:
     arxiv_id: "1301.3781"
     title: "Efficient Estimation of Word Representations in Vector Space"
     year: 2013
     github_repo: "https://github.com/tmikolov/word2vec"
+    commit: "<PIN_SHA_7+>"
 
   doc2vec:
     arxiv_id: "1405.4053"
     title: "Distributed Representations of Sentences and Documents"
     year: 2014
     github_repo: "https://github.com/RaRe-Technologies/gensim"
+    commit: "<PIN_SHA_7+>"
 
   code2vec:
     arxiv_id: "1803.09473"
     title: "code2vec: Learning Distributed Representations of Code"
     year: 2018
     github_repo: "https://github.com/tech-srl/code2vec"
+    commit: "<PIN_SHA_7+>"
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/experiment_config.yaml around lines 13 to 31,
the github_repo fields point to unpinned default branches which makes runs
non-reproducible; update each paper entry to pin the source by adding a commit
field (e.g., github_commit: "<full-commit-sha>") or replace github_repo with a
github_repo_commit field that includes the repo URL plus the exact commit or tag
(full SHA preferable), and ensure any tooling that fetches the repo uses that
commit to checkout; repeat for word2vec, doc2vec and code2vec entries.
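For illustration, fetch tooling could honor such a pin with plain git; `clone_at_commit` is a hypothetical helper, not part of GitHubDocumentManager:

```python
# Sketch: clone a repository and check out the pinned commit from the config.
# Uses plain git via subprocess; the helper name is hypothetical.
import subprocess
from pathlib import Path


def clone_at_commit(repo_url: str, commit: str, dest: Path) -> Path:
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "clone", "--no-checkout", repo_url, str(dest)], check=True)
    subprocess.run(["git", "-C", str(dest), "checkout", commit], check=True)
    return dest
```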

Comment on lines +151 to +159
    # Display results
    logger.info("\nPipeline Results:")
    logger.info(f"Total processed: {results.get('total_processed', 0)}")
    logger.info(f"Successful: {results.get('extraction_success', 0) + results.get('embedding_success', 0)}")
    logger.info(f"Failed: {results.get('extraction_failed', 0) + results.get('embedding_failed', 0)}")

    if results.get('processing_rate'):
        logger.info(f"Processing rate: {results['processing_rate']:.2f} docs/min")


⚠️ Potential issue

Correct results parsing (GenericDocumentProcessor returns nested dicts).

Access extraction/embedding summaries via results['extraction'] and results['embedding'] instead of flattened keys; remove non-existent processing_rate.

-    logger.info("\nPipeline Results:")
-    logger.info(f"Total processed: {results.get('total_processed', 0)}")
-    logger.info(f"Successful: {results.get('extraction_success', 0) + results.get('embedding_success', 0)}")
-    logger.info(f"Failed: {results.get('extraction_failed', 0) + results.get('embedding_failed', 0)}")
-    
-    if results.get('processing_rate'):
-        logger.info(f"Processing rate: {results['processing_rate']:.2f} docs/min")
+    logger.info("\nPipeline Results:")
+    logger.info(f"Total processed: {results.get('total_processed', 0)}")
+    extraction = results.get('extraction', {})
+    embedding = results.get('embedding', {})
+    logger.info(f"Extraction: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
+    logger.info(f"Embedding: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")
🤖 Prompt for AI Agents
In tests/test_github_treesitter.py around lines 151 to 159, the test is reading
flattened keys that don't exist because GenericDocumentProcessor returns nested
summaries; update the logging to read results.get('extraction', {}) and
results.get('embedding', {}) and pull totals from those dicts (e.g.
extraction.get('total_processed', 0), extraction.get('success', 0),
extraction.get('failed', 0) and similarly for embedding), compute aggregated
counts from those nested values for the "Total processed", "Successful" and
"Failed" lines, and remove the non-existent processing_rate check/logging.

Comment on lines +54 to +64
def _normalize_yymm(yymm: str) -> tuple[int, int]:
    """Normalize YYMM to (year, month) tuple for correct comparison."""
    if len(yymm) != 4 or not yymm.isdigit():
        raise ValueError(f"Invalid YYMM: {yymm}")
    yy, mm = int(yymm[:2]), int(yymm[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month in YYMM: {yymm}")
    # Handle century: 90-99 = 1990s, 00-89 = 2000s
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return (year, mm)


⚠️ Potential issue

Handle YYYYMM and avoid crashes on 6-digit directory names.

The current code breaks when encountering numeric dirs of length 6 (e.g., 201501). Make the normalizer accept YYMM or YYYYMM.

-def _normalize_yymm(yymm: str) -> tuple[int, int]:
-    """Normalize YYMM to (year, month) tuple for correct comparison."""
-    if len(yymm) != 4 or not yymm.isdigit():
-        raise ValueError(f"Invalid YYMM: {yymm}")
-    yy, mm = int(yymm[:2]), int(yymm[2:])
-    if not 1 <= mm <= 12:
-        raise ValueError(f"Invalid month in YYMM: {yymm}")
-    # Handle century: 90-99 = 1990s, 00-89 = 2000s
-    year = 1900 + yy if yy >= 90 else 2000 + yy
-    return (year, mm)
+def _normalize_yymm(yymm: str) -> tuple[int, int]:
+    """Normalize YYMM or YYYYMM to (year, month) for correct comparison."""
+    if not (yymm.isdigit() and len(yymm) in (4, 6)):
+        raise ValueError(f"Invalid YYMM/ YYYYMM: {yymm}")
+    if len(yymm) == 4:
+        yy, mm = int(yymm[:2]), int(yymm[2:])
+        # 90-99 = 1990s, 00-89 = 2000s
+        year = 1900 + yy if yy >= 90 else 2000 + yy
+    else:
+        year, mm = int(yymm[:4]), int(yymm[4:])
+    if not 1 <= mm <= 12:
+        raise ValueError(f"Invalid month in YYMM/ YYYYMM: {yymm}")
+    return (year, mm)
🤖 Prompt for AI Agents
In tests/test_overnight_arxiv_2000.py around lines 54 to 64, the _normalize_yymm
function currently only accepts 4-digit YYMM and fails on 6-digit YYYYMM
directory names (e.g., 201501); update the function to accept either 4-digit
(YYMM) or 6-digit (YYYYMM) numeric strings, parse accordingly (if length==4
treat as YYMM and apply the existing century rule, if length==6 take the first
four digits as the full year and last two as month), validate that month is
between 1 and 12, and raise ValueError for any other lengths or non-numeric
input; ensure the function still returns (year, month) as ints.

Comment on lines +89 to +96
    yymm_dirs = sorted(
        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
        key=lambda d: _normalize_yymm(d.name)
    )

    # Filter to start from specified year
    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
    # Filter to start from specified year (normalize century)
    start_cutoff = _normalize_yymm(start_year)
    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]

🛠️ Refactor suggestion

Filter dir names to 4 or 6 digits to prevent ValueError and sort once.

The current list may include 6+ digit numeric names and then crash during normalization.

-    yymm_dirs = sorted(
-        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
-        key=lambda d: _normalize_yymm(d.name)
-    )
+    yymm_dirs = [d for d in ARXIV_PDF_BASE.iterdir()
+                 if d.is_dir() and d.name.isdigit() and len(d.name) in (4, 6)]
+    yymm_dirs.sort(key=lambda d: _normalize_yymm(d.name))
@@
-    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
+    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
🤖 Prompt for AI Agents
In tests/test_overnight_arxiv_2000.py around lines 89 to 96, the directory
filtering and sorting can raise ValueError when numeric directory names have
more than 4 or 6 digits and also invokes normalization multiple times; restrict
the initial list comprehension to directories whose names are all digits and
length 4 or 6 (e.g., len(name) in (4,6)), then perform a single sorted(...) call
using key=lambda d: _normalize_yymm(d.name) and finally apply the start_year
cutoff by comparing _normalize_yymm(d.name) to the normalized start_cutoff so
normalization is only called where appropriate.

Comment on lines +78 to +88
Prepare DocumentTask objects for PDFs located in a specific ArXiv YYMM directory.

Searches the manager's PDF base directory under the given year_month (e.g., "2301"), finds up to `limit` PDF files, and builds DocumentTask objects for each PDF. The ArXiv ID passed to the task is constructed as "{year_month}.{paper_id}" where `paper_id` is derived from the PDF filename stem.

Parameters:
    year_month (str): Two-digit year and month string in YYMM format (e.g., "2301").
    limit (Optional[int]): Maximum number of PDFs to process from the directory. If None, all PDFs are processed.

Returns:
    List[DocumentTask]: A list of prepared DocumentTask objects. Returns an empty list if the directory does not exist or no tasks could be prepared.
"""

🛠️ Refactor suggestion

Fix limit=0 bug and unify arXiv ID derivation.

  • limit=0 currently processes all PDFs (truthiness check). Should slice when limit is 0.
  • Derive ID consistently with prepare_recent_documents: use full stem if it already contains a dot, else prefix with YYMM. Avoid split('.')[-1] which can break on extra dots in filenames.
-        pdf_files = sorted(dir_path.glob("*.pdf"))
-        if limit:
-            pdf_files = pdf_files[:limit]
+        pdf_files = sorted(dir_path.glob("*.pdf"))
+        if limit is not None:
+            pdf_files = pdf_files[:limit]
@@
-        for pdf_path in pdf_files:
-            arxiv_id = f"{year_month}.{pdf_path.stem.split('.')[-1]}"
+        for pdf_path in pdf_files:
+            stem = pdf_path.stem
+            arxiv_id = stem if "." in stem else f"{year_month}.{stem}"
             task = self._prepare_from_path(pdf_path, arxiv_id)
             if task:
                 tasks.append(task)

Also applies to: 94-103

🤖 Prompt for AI Agents
tools/arxiv/arxiv_document_manager.py lines 78-88 and 94-103: fix two
issues—limit handling and arXiv ID derivation. For limit handling, do not rely
on truthiness; when limit is not None slice the list of PDF files with files =
files[:limit] so limit=0 yields zero files; when limit is None process all
files. For ID derivation, stop using split('.')[-1]; obtain the filename stem
and if it already contains a dot use the full stem as the arXiv id, otherwise
prefix it with the year_month and a dot (f"{year_month}.{stem}"), and apply the
same logic in both code regions (lines 78-88 and 94-103) to keep behavior
consistent.

Comment on lines +153 to +163
"""
Prepare a DocumentTask for a single arXiv identifier.
Given an arXiv ID of the form "YYMM.NNNNN" (or with a version suffix, e.g. "YYMM.NNNNNv1"), this locates the corresponding PDF under the manager's base directory (pdf_base_dir/YYMM/<arxiv_id>.pdf) and, if found, delegates construction of the DocumentTask to _prepare_from_path.
Parameters:
arxiv_id (str): ArXiv identifier expected to contain a single dot separating the YYMM prefix from the paper number.
Returns:
Optional[DocumentTask]: A DocumentTask for the PDF when the ID is valid and the PDF exists; otherwise None.
"""

🛠️ Refactor suggestion

Validate IDs and fallback to versionless file when vN is missing on disk.

Docstrings allow version suffix. If only versionless PDF exists, we should still prepare the task.

-        # Parse ArXiv ID to get YYMM directory
-        parts = arxiv_id.split('.')
-        if len(parts) != 2:
-            logger.error(f"Invalid ArXiv ID format: {arxiv_id}")
-            return None
-        
-        year_month = parts[0]
-        
-        # Construct PDF path
-        pdf_path = self.pdf_base_dir / year_month / f"{arxiv_id}.pdf"
-        
-        if not pdf_path.exists():
-            logger.warning(f"PDF not found: {pdf_path}")
-            return None
-        
-        return self._prepare_from_path(pdf_path, arxiv_id)
+        # Validate and parse ArXiv ID
+        if not self.validate_arxiv_id(arxiv_id):
+            logger.error(f"Invalid ArXiv ID format: {arxiv_id}")
+            return None
+
+        year_month, paper_num = arxiv_id.split(".")
+
+        # Prefer exact match (including version, if any)
+        pdf_path = self.pdf_base_dir / year_month / f"{arxiv_id}.pdf"
+        if pdf_path.exists():
+            return self._prepare_from_path(pdf_path, arxiv_id)
+
+        # Fallback: try versionless if a version suffix was provided
+        base_num, _, _ = paper_num.partition("v")
+        if base_num != paper_num:
+            alt_id = f"{year_month}.{base_num}"
+            alt_pdf_path = self.pdf_base_dir / year_month / f"{alt_id}.pdf"
+            if alt_pdf_path.exists():
+                return self._prepare_from_path(alt_pdf_path, alt_id)
+
+        logger.warning(f"PDF not found: {pdf_path}")
+        return None

Also applies to: 164-179

Comment on lines +400 to 408
Persist ArXiv processing results and extracted artifacts to the configured ArangoDB collections.
If no database manager is configured, the call is a no-op (it logs a warning and returns). When a DB manager is present, this inserts:
- a main document into the `arxiv_papers` collection (keyed by paper_info.sanitized_id) containing ArXiv metadata and processing metrics;
- one document per text chunk with embeddings into `arxiv_embeddings`;
- an optional `arxiv_structures` document containing tables, equations, images, and figures when any are present.
Maintains separate collections for ArXiv papers to preserve
source-specific metadata and enable specialized queries.
Errors raised by the DB manager or during document insertion are logged and re-raised.
"""

🛠️ Refactor suggestion

Make DB writes atomic and align collections with design.

Per retrieved learnings, writes to arxiv_papers, arxiv_chunks, arxiv_embeddings, and arxiv_structures should be done in a single Arango transaction to avoid partial state on failures.

  • Wrap all inserts in one transaction (begin/commit/abort) via ArangoDBManager.
  • Store chunks in arxiv_chunks (text/positions), and embeddings in arxiv_embeddings referencing the chunk doc by key.
  • Ensure idempotency (upsert or conflict policy) for re-processing the same paper/version.
    Do you want me to sketch a transaction wrapper API for ArangoDBManager and update this method accordingly?
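One possible shape for that wrapper, using python-arango stream transactions; this is a sketch under the assumption that ArangoDBManager exposes a python-arango StandardDatabase as db:

```python
# Sketch: wrap the four arXiv collections in one stream transaction so a
# failure leaves no partial state; overwrite=True keeps re-processing idempotent.
from contextlib import contextmanager


@contextmanager
def arxiv_transaction(db):
    txn_db = db.begin_transaction(
        write=["arxiv_papers", "arxiv_chunks", "arxiv_embeddings", "arxiv_structures"]
    )
    try:
        yield txn_db
        txn_db.commit_transaction()
    except Exception:
        txn_db.abort_transaction()
        raise


def store_paper_atomically(db, paper_doc, chunk_docs, embedding_docs, structures_doc=None):
    with arxiv_transaction(db) as txn:
        txn.collection("arxiv_papers").insert(paper_doc, overwrite=True)
        for chunk in chunk_docs:
            txn.collection("arxiv_chunks").insert(chunk, overwrite=True)
        for emb in embedding_docs:
            txn.collection("arxiv_embeddings").insert(emb, overwrite=True)
        if structures_doc:
            txn.collection("arxiv_structures").insert(structures_doc, overwrite=True)
```

Documents would need stable _key values (e.g., sanitized arXiv ID plus chunk index) for the overwrite path to be meaningful.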

@r3d91ll r3d91ll merged commit 20674b2 into main Sep 3, 2025
1 check passed