Conversation

@r3d91ll (Owner) commented Sep 3, 2025

Summary by CodeRabbit

  • New Features

    • GitHub repository processing with Tree-sitter symbol extraction, graph-based repo→file→chunk→embedding storage, ArXiv metadata service, and an orchestration CLI.
  • Improvements

    • More robust PDF extraction (subprocess timeouts + fallback), file-based checkpoints/resume, richer paper/code/embedding metadata, transactional DB pooling and atomic storage, enhanced list-generation and collection tooling.
  • Bug Fixes

    • Hardened GPU/status monitoring, log timestamp parsing, and resilient extraction/embedding error handling.
  • Documentation

    • New ADRs, PRDs, READMEs, and usage/config guides.
  • Tests

    • Added integration, Tree-sitter, reliability, and large-scale test suites.
  • Chores

    • Expanded .gitignore and dependency/config updates.

coderabbitai bot and others added 20 commits August 28, 2025 03:46
Docstrings generation was requested by @r3d91ll.

* #6 (comment)

The following files were modified:

* `core/framework/extractors/code_extractor.py`
* `core/framework/extractors/docling_extractor.py`
* `core/processors/chunking_strategies.py`
* `core/processors/document_processor.py`
* `core/processors/generic_document_processor.py`
* `tests/monitor_overnight.py`
* `tests/monitor_phased.py`
* `tests/test_batch_processing.py`
* `tests/test_batch_processing_simple.py`
* `tests/test_document_processor.py`
* `tests/test_overnight_2000.py`
* `tests/test_overnight_arxiv_2000.py`
* `tests/test_overnight_arxiv_phased.py`
* `tools/arxiv/arxiv_document_manager.py`
* `tools/arxiv/arxiv_manager.py`
* `tools/arxiv/pipelines/arxiv_pipeline_v2.py`
* `tools/github/github_document_manager.py`
* `tools/github/github_pipeline.py`
📝 Add docstrings to `feature/github-integration`
- Add Tree-sitter symbol extraction for code analysis
- Implement graph-based storage architecture in ArangoDB
- Add robust PDF and code extractors with proper error handling
- Fix all CodeRabbit review issues (PyMuPDF, ProcessPoolExecutor, etc.)
- Add comprehensive documentation and ADRs
- Successfully process repositories at 100+ files/minute

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update `.gitignore` to include new data files for extended collections.
- Modify `collect_ai_papers_extended.py` to allow dynamic target count for paper collection.
- Implement `run_pipeline_from_list.py` to process papers from a specified list.
- Add `run_test_pipeline.py` for configurable testing of the ArXiv pipeline.
…idge analysis

This commit adds a comprehensive experiment analyzing the evolution of word embeddings
from word2vec (2013) → doc2vec (2014) → code2vec (2018) to identify theory-practice
bridges and measure "conveyance" - the ability to transfer knowledge from papers to
implementations.

Key Changes:
- Add word2vec_evolution experiment with complete pipeline
- Fix GitHub pipeline to support config-driven repository processing
- Create graph structure for connecting papers with code implementations
- Implement conveyance measurement framework (C = W×R×H/T × Ctx^α; a minimal illustrative sketch follows this list)
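A minimal, illustrative sketch of the conveyance formula as stated above. The function name and the reading of W, R, H, T, and Ctx are assumptions for illustration, not the framework's actual API:

```python
def conveyance(W: float, R: float, H: float, T: float, Ctx: float, alpha: float) -> float:
    """Illustrative only: C = (W × R × H / T) × Ctx^α.

    W, R, H are read here as quality/relevance/capability factors, T as time
    cost, Ctx as contextual fit, and alpha as the context amplification
    exponent (the commit reports α ≈ 2.0 for Gensim's doc2vec). These
    readings are assumptions; the real framework may define them differently.
    """
    if T <= 0:
        raise ValueError("T must be positive")
    return (W * R * H / T) * Ctx ** alpha


# Example usage with made-up factor values and alpha = 2.0
score = conveyance(W=0.8, R=0.9, H=0.7, T=1.5, Ctx=0.9, alpha=2.0)
```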

Experiment Highlights:
- Identified Gensim's doc2vec as "pure conveyance" (α ≈ 2.0)
  * Implementation created solely from paper without original code
  * Became de facto standard, empirically proving high conveyance
- Processed 3 papers and 3 repositories with embeddings
- Created edge collections for theory-practice relationships
- Established temporal evolution chain in ArangoDB graph

Technical Improvements:
- GitHub pipeline now reads repositories from config (consistency with ArXiv)
- Added graph collections: paper_implements_theory, semantic_similarity,
  temporal_evolution, conveyance_bridge
- Cleaned up deprecated configs and removed large data files from version control

Performance:
- Papers: 27 chunks, 2048-dim Jina v4 embeddings
- Repositories: 275 files, 1,453 embeddings
- Processing: 24.8s papers, 135.9s repositories

This provides the foundation for entropy mapping and conveyance analysis
to quantify how effectively knowledge transfers from academic theory to
practical implementations.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
- Implemented `update_pdf_sizes.py` to scan PDF files and update their sizes in the database.
- Created `utils.py` for utility functions, including normalization of arXiv IDs (a general sketch of such normalization follows this list).
- Developed orchestration module with `orchestrator.py` to manage the search, preview, refine, and process workflow for arXiv metadata.
- Added `list_generator.py` to generate paper lists from PostgreSQL based on search configurations.
- Introduced `check_existing_papers.py` script to verify which papers are already processed in ArangoDB.
- Enhanced logging and error handling across modules for better traceability and debugging.
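A minimal sketch of the usual arXiv ID normalization pattern (strip an optional version suffix, accept both new-style and old-style IDs). The helper name and exact validation rules here are assumptions for illustration, not necessarily what `utils.py` implements:

```python
import re

NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}$")              # e.g. 2301.12345
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}$")  # e.g. hep-th/9901001

def normalize_arxiv_id(arxiv_id: str) -> str:
    """Strip a trailing version suffix (v1, v2, ...) and validate the base ID."""
    base = re.sub(r"v\d+$", "", arxiv_id.strip())
    if not (NEW_STYLE.match(base) or OLD_STYLE.match(base)):
        raise ValueError(f"Unrecognized arXiv ID: {arxiv_id}")
    return base
```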
- Remove previously committed *_embeddings.json files from git tracking
- Add pattern to .gitignore to prevent future embedding files from being committed
- These are large generated files that should not be in version control
Security fixes:
- Remove hardcoded DSN credentials from db.yaml, use environment variables instead (see the sketch after this commit message)
- Fix .gitignore to only exclude test artifacts, not test source files
- Fix keywords field in arxiv_search_minimal.yaml (empty list instead of empty string)

Import path fixes:
- Update all imports from tools.arxiv.pipelines.arango_db_manager to core.database.arango_db_manager
- Fix incorrect import in demo_github_pipeline.py

These changes address critical security vulnerabilities and broken imports identified by CodeRabbit.
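As context for the DSN change above, a minimal sketch of assembling the connection string from environment variables; the variable names here are assumptions, not the project's actual configuration keys:

```python
import os

# Hypothetical environment variable names; the keys actually read by db.yaml
# and the config loader may differ.
dsn = (
    f"postgresql://{os.environ['PG_USER']}:{os.environ['PG_PASSWORD']}"
    f"@{os.environ.get('PG_HOST', 'localhost')}:{os.environ.get('PG_PORT', '5432')}"
    f"/{os.environ.get('PG_DATABASE', 'arxiv')}"
)
```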
ArangoDB Manager fixes:
- Fix with_connection decorator to inject via kwargs (works with methods)
- Improve TTL index checking to verify specific field presence
- Add overwrite parameter to insert_document with proper warnings
- Document data loss risks when overwrite=True

Extractor fixes:
- Fix tree-sitter generator detection to use node attribute
- Fix robust_extractor to use public shutdown API first
- Add proper exception handling for executor cleanup

These changes improve robustness and address potential issues identified by CodeRabbit.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
- Remove all JSON result files that were accidentally tracked
- Remove extracted papers and analysis files
- Update .gitignore to prevent these files from being added
- These are generated output files that should not be in version control
- Merge paper IDs from loaded JSON collection into collected_ids set
- Prevents re-collecting papers that are already in existing collections
- Handles both 'id' and 'arxiv_id' field names for compatibility
- Addresses CodeRabbit review comment in PR #16
@coderabbitai (Contributor) commented Sep 3, 2025

Walkthrough

Adds a large multi-source processing surface: ArangoDB manager, Postgres-backed ArXiv metadata service and migrations, robust PDF and Tree-sitter code extractors, GenericDocumentProcessor with phased extraction/embedding and checkpointing, GitHub repo processing and graph storage, many CLI/scripts/tests/configs/experiments, docs (ADRs/PRDs), packaging and .gitignore updates, and removal of autogenerated/legacy artifacts.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Repository housekeeping**<br>`.gitignore`, `pyproject.toml`, `README.md` | Add ignore patterns for generated artifacts; adjust Poetry packaging/deps (add psycopg/sqlalchemy/alembic); expand README to document GitHub processing, Tree-sitter, and the graph model. |
| **Package inits & bootstrapping**<br>`core/__init__.py`, `tools/__init__.py`, `core/database/__init__.py`, `tools/arxiv/db/__init__.py`, `tools/arxiv/orchestration/__init__.py` | New package initializers, module docstrings, and re-exports (ArangoDBManager, retry helper, orchestration facade, DB package exports). |
| **ArangoDB manager**<br>`core/database/arango_db_manager.py` | New ArangoDBManager with pooled connections, get/return connection API, with_connection decorator, insert/transaction/lock helpers, and a retry_with_backoff decorator (a minimal illustrative sketch follows the table). |
| **Extractors: Docling / Robust / Code / Tree-sitter**<br>`core/framework/extractors/docling_extractor.py`, `robust_extractor.py`, `code_extractor.py`, `tree_sitter_extractor.py` | Add RobustExtractor (subprocess isolation, timeout, PyMuPDF fallback); extend DoclingExtractor to handle text and fallback semantics; add TreeSitterExtractor (multi-language symbols/metrics/hash); CodeExtractor gains optional Tree-sitter integration with graceful fallback and enriched metadata. |
| **Chunking & processors**<br>`core/processors/*`, `core/processors/document_processor.py`, `core/processors/generic_document_processor.py` | Tokenizer interface expanded (callable or object); sliding/semantic chunking updates and docstring expansions; GenericDocumentProcessor adds staging-directory handling, phased extraction/embedding orchestration, worker init, Jina embedder init, ArangoDBManager wiring, enriched storage (symbols, repo, code metrics), and transactional storage/cleanup. |
| **Phase management & pipelines**<br>`tools/arxiv/pipelines/arxiv_pipeline.py`, `tools/arxiv/pipelines/arxiv_pipeline_v2.py` | PhaseManager metadata snapshot loading and staged-file enrichment; pipeline v2 adds file-based checkpointing (load/save/_should_use_checkpoint), source handling, and enriched result outputs. |
| **Postgres ArXiv metadata service & tooling**<br>`tools/arxiv/db/*` (configs, config dataclasses, pg util, alembic env/migrations, snapshot loader, OAI harvester, export/scan/update, compute assessment, helpers) | New typed DB config loader, psycopg connection util, Alembic environment plus initial migration, snapshot loader, OAI harvester to Postgres, export_ids/list generation, artifact scanners, pdf-size updaters, compute-assessment modules, and many CLI utilities. |
| **Orchestration & list generation**<br>`tools/arxiv/orchestration/orchestrator.py`, `tools/arxiv/utils/list_generator.py` | New ArxivPipelineOrchestrator (search→preview→refine→process state machine) and ArxivListGenerator for Postgres-driven ID list generation and outputs. |
| **ArXiv scripts, monitoring & tests**<br>`tools/arxiv/scripts/*`, `tools/arxiv/monitoring/*`, `tools/arxiv/tests/*`, `tests/*` | Many CLI utilities (collectors, runners, checkers), monitoring tooling (weekend dashboard), and new/extended integration and diagnostic tests (RobustExtractor, Tree-sitter, GitHub flows, large-scale runners, various test harnesses). |
| **GitHub processing workspace**<br>`tools/github/*`, `tools/github/configs/*`, `tools/github/README.md` | Add GitHubDocumentManager, demo pipeline, setup scripts, expanded config (more extensions/repos), and a README documenting graph-based storage and the Tree-sitter pipeline. |
| **Experiments: Word2Vec evolution**<br>`experiments/word2vec_evolution/*` | New experiment suite: configs, manifests, embedding generation, analysis, graph setup, import/store/verify scripts, experiment outputs and summaries. |
| **Docs, ADRs & PRDs**<br>`docs/**/*` | Add ADRs (0003–0005), PRDs (ArXiv metadata, GitHub integration), implementation notes, and issue write-ups. |
| **Removals / legacy cleanup**<br>`arxiv_pipeline_v2_results_*.json`, `old.claude/*` | Delete autogenerated pipeline result JSONs and legacy old.claude ADR/agent templates. |
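To make the ArangoDB manager row above concrete, here is a minimal sketch of what a retry_with_backoff decorator and a kwargs-injecting with_connection decorator typically look like. This is an illustration under assumptions; the actual signatures and pool API in `core/database/arango_db_manager.py` may differ.

```python
import functools
import random
import time


def retry_with_backoff(max_retries: int = 3, base_delay: float = 0.5, max_delay: float = 10.0):
    """Retry the wrapped call with exponential backoff and jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    time.sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator


def with_connection(func):
    """Inject a pooled connection via kwargs so the decorator also works on methods."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        conn = self.get_connection()          # assumed pool API
        try:
            kwargs.setdefault("connection", conn)
            return func(self, *args, **kwargs)
        finally:
            self.return_connection(conn)      # assumed pool API
    return wrapper
```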

Sequence Diagram(s)

```mermaid
%%{init: {"themeVariables": {"primaryColor":"#f5f7fa","actorBorder":"#244","noteBg":"#f0f4ff"}}}%%
sequenceDiagram
  autonumber
  actor User
  participant GDM as GitHubDocumentManager
  participant GDP as GenericDocumentProcessor
  participant EX as Code/Tree-sitter Extractor
  participant EMB as JinaV4Embedder
  participant DB as ArangoDBManager

  User->>GDM: prepare_repository(repo_url)
  GDM-->>User: tasks (file list)
  User->>GDP: process_documents(tasks)
  GDP->>EX: extract(file)
  EX-->>GDP: symbols/metrics + text
  GDP->>EMB: embed (late chunking)
  EMB-->>GDP: embeddings
  GDP->>DB: begin transaction → insert repo/file/chunk/embedding
  DB-->>GDP: commit
  GDP-->>User: summary (processed counts)
```
```mermaid
sequenceDiagram
  autonumber
  participant DP as DocumentProcessor
  participant RE as RobustExtractor
  participant DL as Docling Worker (subprocess)
  participant PM as PyMuPDF Fallback

  DP->>RE: extract(pdf_path)
  RE->>DL: run with timeout
  alt Docling success
    DL-->>RE: content
    RE-->>DP: result (docling)
  else timeout/error
    RE->>PM: fallback extract
    PM-->>RE: text-only content
    RE-->>DP: result (pymupdf_fallback)
  end
```
```mermaid
sequenceDiagram
  autonumber
  actor Operator
  participant ORCH as ArxivPipelineOrchestrator
  participant SQL as Postgres
  participant LG as ListGenerator
  participant PIPE as ArXiv Pipeline v2

  Operator->>ORCH: load search config
  ORCH->>SQL: assess_compute_cost(build_where)
  SQL-->>ORCH: assessment
  ORCH->>Operator: preview/refine
  Operator->>ORCH: approve
  ORCH->>LG: generate_list()
  LG-->>ORCH: paper_ids + files
  ORCH->>PIPE: run with list/config
  PIPE-->>ORCH: results file
  ORCH-->>Operator: completed + next-step cmd
```
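The second diagram shows the RobustExtractor fallback path (Docling in an isolated subprocess with a timeout, then a PyMuPDF text-only fallback). A minimal sketch of that pattern follows; the class name matches the PR, but the worker invocation and result shape are assumptions rather than the actual implementation.

```python
import subprocess
import sys
from pathlib import Path


class RobustExtractor:
    """Sketch: run Docling in a subprocess with a timeout; fall back to PyMuPDF text extraction."""

    def __init__(self, timeout_seconds: int = 300):
        self.timeout_seconds = timeout_seconds

    def extract(self, pdf_path: Path) -> dict:
        try:
            # Hypothetical worker module that prints extracted text to stdout.
            proc = subprocess.run(
                [sys.executable, "-m", "docling_worker", str(pdf_path)],
                capture_output=True, text=True, timeout=self.timeout_seconds, check=True,
            )
            return {"full_text": proc.stdout, "extractor": "docling"}
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
            return self._pymupdf_fallback(pdf_path)

    def _pymupdf_fallback(self, pdf_path: Path) -> dict:
        import fitz  # PyMuPDF

        with fitz.open(pdf_path) as doc:
            text = "\n".join(page.get_text() for page in doc)
        return {"full_text": text, "extractor": "pymupdf_fallback"}
```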

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Possibly related issues

  • ArXiv Metadata Service (Postgres) #10 — ArXiv Metadata Service: this PR adds tools/arxiv/db modules, Alembic migrations, and ingestion/harvest/export scripts matching that issue.
  • Enhancement: Tree-sitter Symbol Extraction and Graph-based Repository Storage #8 — Tree-sitter integration: TreeSitterExtractor and CodeExtractor changes implement Tree-sitter symbol extraction and metrics described by the issue.
  • r3d91ll/HADES#9 — ArangoDB transaction/lock API: ArangoDBManager implements pooled connections, transactions, retries, and TTL locks aligned with this issue.
  • r3d91ll/HADES#13 — GitHub processing & graph storage: tools/github workspace, Graph model, and pipeline changes correspond to this objective.

Possibly related PRs

Poem

A rabbit taps keys in the moonlit lab,
Tree-sitter hums and symbols gab,
Docling guards while PyMuPDF mends,
Graphs link papers to code as friends.
Hop—commit—pipelines sing; embeddings rise, applause that never ends 🐇✨


@coderabbitai (Contributor) left a comment

Actionable comments posted: 106

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (37)
tests/test_overnight_arxiv_2000.py (5)

54-109: Fix YYMM filtering (lexicographic bug across centuries) and make sampling reproducible.

  • Comparing "YYMM" as strings admits 1990s (e.g., "9912") when starting at "1501". Normalize to (YYYY, MM).
  • Add optional seed for deterministic runs.

Apply this diff:

-def collect_arxiv_papers(target_count: int = 2000, start_year: str = "1501") -> List[Tuple[Path, str]]:
+def collect_arxiv_papers(target_count: int = 2000, start_year: str = "1501", seed: int | None = None) -> List[Tuple[Path, str]]:
@@
-    Parameters:
+    Parameters:
         target_count (int): Maximum number of papers to collect.
         start_year (str): Inclusive YYMM directory name to start searching from (e.g., "1501" = January 2015).
+        seed (int | None): Optional RNG seed for reproducible sampling.
@@
-    papers = []
+    papers = []
+    rng = random.Random(seed) if seed is not None else random
@@
-    yymm_dirs = sorted([d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()])
-
-    # Filter to start from specified year
-    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
+    yymm_dirs = sorted(
+        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
+        key=lambda d: _normalize_yymm(d.name)
+    )
+    # Filter to start from specified year (normalize century)
+    start_cutoff = _normalize_yymm(start_year)
+    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
@@
-            sampled_pdfs = random.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files
+            sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files

Add this helper near the constants:

def _normalize_yymm(yymm: str) -> tuple[int, int]:
    if len(yymm) != 4 or not yymm.isdigit():
        raise ValueError(f"Invalid YYMM: {yymm}")
    yy, mm = int(yymm[:2]), int(yymm[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month in YYMM: {yymm}")
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return (year, mm)

223-244: Import heavy deps inside the function and make GPU use dynamic.

  • Localize imports to avoid import-time failures in non-GPU environments.
  • Detect CUDA and only enable FP16 when GPU is available.
@@
-        # Configure processor
-        config = ProcessingConfig(
+        # Local imports to avoid import-time side effects
+        from core.processors.document_processor import DocumentProcessor, ProcessingConfig
+        try:
+            import torch  # optional
+            use_gpu = torch.cuda.is_available()
+        except Exception:
+            use_gpu = False
+
+        # Configure processor
+        config = ProcessingConfig(
             chunking_strategy=test_config['strategy'],
             chunk_size_tokens=test_config['chunk_size'],
             chunk_overlap_tokens=test_config['overlap'],
             batch_size=test_config['batch_size'],
-            use_gpu=True,
-            use_fp16=True,  # Use FP16 for memory efficiency
+            use_gpu=use_gpu,
+            use_fp16=use_gpu,  # Only enable FP16 on GPU
             use_ocr=False,  # Disable OCR for speed
             extract_tables=True,  # Keep table extraction
             extract_equations=True  # Keep equation extraction
         )

185-189: Wire CLI start-year through to collection.

The --start-year flag currently has no effect.

-        papers = collect_arxiv_papers(num_docs)
+        papers = collect_arxiv_papers(num_docs, start_year=start_year)

400-407: Report the actual number of papers processed, not the requested count.

This corrects summary inaccuracies when fewer files are found or runs abort early.

-        'total_papers_processed': num_docs,
+        'total_papers_processed': len(papers),

459-459: Pass start_year from CLI to the runner.

Without this, --start-year is ignored.

-        results = run_arxiv_overnight_test(args.docs)
+        results = run_arxiv_overnight_test(args.docs, start_year=args.start_year)
tests/test_overnight_arxiv_phased.py (7)

49-52: Make STAGING_DIR portable and robust (fallback if /dev/shm is unavailable).

Hard-coding /dev/shm breaks on macOS/CI and raises FileNotFoundError. Also prefer parents=True.

Apply:

-# Staging directory for inter-phase communication
-STAGING_DIR = Path("/dev/shm/overnight_staging")
-STAGING_DIR.mkdir(exist_ok=True)
+# Staging directory for inter-phase communication
+# Prefer env override; fall back to /dev/shm if present, else system temp
+default_staging_base = Path("/dev/shm") if Path("/dev/shm").exists() else Path(os.getenv("TMPDIR", "/tmp"))
+STAGING_DIR = Path(os.getenv("STAGING_DIR", default_staging_base / "overnight_staging"))
+STAGING_DIR.mkdir(parents=True, exist_ok=True)

53-55: Parameterize ARXIV_PDF_BASE via environment for portability.

Hard-coded path will fail outside your lab host. Allow override.

-# ArXiv PDF repository
-ARXIV_PDF_BASE = Path("/bulk-store/arxiv-data/pdf")
+# ArXiv PDF repository
+ARXIV_PDF_BASE = Path(os.getenv("ARXIV_PDF_BASE", "/bulk-store/arxiv-data/pdf"))

78-83: Fix incorrect lexicographic comparison of year-month directory names.

String comparison can misorder mixed 4- vs 6-digit names (e.g., "9999" > "202001"). Compare numerically and sort by int.

-    yymm_dirs = sorted([d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()])
-    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
+    # Normalize to numeric comparison; accept 4- or 6-digit directory names
+    int_start = int(start_year)
+    yymm_dirs = sorted(
+        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
+        key=lambda p: int(p.name)
+    )
+    yymm_dirs = [d for d in yymm_dirs if int(d.name) >= int_start]

133-141: Improve I/O robustness and error logging.

Use encoding and log tracebacks for faster diagnosis.

-        with open(staging_file, 'w') as f:
+        with open(staging_file, 'w', encoding='utf-8') as f:
             json.dump(result, f)
-        return {'arxiv_id': arxiv_id, 'success': True, 'staged_path': str(staging_file)}
+        return {'arxiv_id': arxiv_id, 'success': True, 'staged_path': str(staging_file)}
@@
-    except Exception as e:
-        logger.error(f"Failed to extract {arxiv_id}: {e}")
+    except Exception as e:
+        logger.error(f"Failed to extract {arxiv_id}: {e}", exc_info=True)
         return {'arxiv_id': arxiv_id, 'success': False, 'error': str(e)}

268-281: Avoid setting CUDA_VISIBLE_DEVICES at runtime inside workers.

Setting the env var post-import doesn’t constrain already-initialized CUDA contexts and can confuse device indexing. Rely on torch.cuda.set_device only.

         if torch.cuda.is_available():
             torch.cuda.set_device(gpu_id)
-            os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)

303-312: Prevent OOM by not returning full chunks payload.

Aggregate counts and persist details elsewhere if required.

-        return {
+        return {
             'arxiv_id': arxiv_id,
             'success': True,
             'num_chunks': len(chunks_with_embeddings),
-            'chunks': chunks_with_embeddings,  # Store if needed
             'gpu_used': gpu_id if torch.cuda.is_available() else 'cpu'
         }

502-516: Compute overall metrics from actual phase outputs, not requested num_docs.

Avoids inflated rates when fewer papers are processed successfully.

-    total_time = extraction_results['phase_time'] + embedding_results['phase_time']
-    overall_rate = num_docs / total_time * 60
+    total_time = extraction_results['phase_time'] + embedding_results['phase_time']
+    processed_docs = embedding_results.get('total_files', 0) or extraction_results.get('total_papers', 0)
+    overall_rate = (processed_docs / total_time * 60) if total_time > 0 else 0
@@
-        'total_papers_processed': num_docs,
+        'total_papers_processed': processed_docs,
tools/arxiv/monitoring/monitor_overnight.py (1)

46-49: Guard against partial JSON reads while file is being written.

json.load() can fail if the writer hasn’t finished flushing. Retry instead of crashing.

Apply:

-        if result_file.exists():
-            with open(result_file) as f:
-                results = json.load(f)
+        if result_file.exists():
+            try:
+                with open(result_file) as f:
+                    results = json.load(f)
+            except json.JSONDecodeError:
+                # File may be mid-write; try again next cycle
+                time.sleep(10)
+                continue
tests/test_overnight_2000.py (1)

43-113: --mock flag currently does nothing; monkey patch happens before args are parsed.

The embedder is chosen at import time; flipping MOCK_EMBEDDER later won’t reapply the patch. Move the mock class and monkey patch behind a function and call it when args.mock is true.

-# Use real embedder by default for accurate testing
-MOCK_EMBEDDER = False  # Set to True only for quick debugging
-
-if MOCK_EMBEDDER:
-    logger.info("Using MOCK embedder for faster processing")
-    
-    class MockEmbedder:
-        """Mock embedder for testing without loading heavy models."""
-        ...
-    # Monkey patch the embedder
-    import core.framework.embedders
-    core.framework.embedders.JinaV4Embedder = MockEmbedder
-else:
-    logger.info("Using REAL Jina embedder - this will be slower")
+MOCK_EMBEDDER = False  # toggled via CLI
+
+def apply_mock_embedder():
+    logger.info("Using MOCK embedder for faster processing")
+    class MockEmbedder:
+        """Mock embedder for testing without loading heavy models."""
+        def __init__(self, *args, **kwargs):
+            self.embedding_dim = 2048
+        def embed_texts(self, texts: List[str], batch_size: int = 1):
+            import numpy as np
+            return np.random.randn(len(texts), self.embedding_dim).astype(np.float32)
+        def embed_with_late_chunking(self, text: str):
+            import numpy as np
+            from core.framework.embedders import ChunkWithEmbedding
+            words = text.split()
+            chunk_size = 256
+            chunks = []
+            for i in range(0, len(words), chunk_size - 50):
+                chunk_text = ' '.join(words[i:i+chunk_size])
+                chunk = ChunkWithEmbedding(
+                    text=chunk_text,
+                    embedding=np.random.randn(self.embedding_dim).astype(np.float32),
+                    start_char=i * 5,
+                    end_char=(i + len(chunk_text.split())) * 5,
+                    start_token=i,
+                    end_token=min(i + chunk_size, len(words)),
+                    chunk_index=len(chunks),
+                    total_chunks=0,
+                    context_window_used=chunk_size
+                )
+                chunks.append(chunk)
+            for c in chunks: c.total_chunks = len(chunks)
+            return chunks
+    import core.framework.embedders as _emb
+    _emb.JinaV4Embedder = MockEmbedder

And in main, after parsing args:

-    if args.mock:
-        MOCK_EMBEDDER = True
-        logger.info("WARNING: Using MOCK embedder - results won't reflect real performance!")
+    if args.mock:
+        MOCK_EMBEDDER = True
+        apply_mock_embedder()
+        logger.info("WARNING: Using MOCK embedder - results won't reflect real performance!")
tools/arxiv/monitoring/monitor_phased.py (1)

104-110: Fix summary slicing: lines.index(line) can return the first occurrence, not the last.

If “FINAL SUMMARY” appears multiple times, you may print the wrong block. Use the last occurrence.

-                # Print summary lines
-                for summary_line in lines[lines.index(line):]:
+                # Print summary lines from the last occurrence
+                try:
+                    start = len(lines) - 1 - list(reversed(lines)).index(line)
+                except ValueError:
+                    start = 0
+                for summary_line in lines[start:]:
                     if "Overall Rate:" in summary_line or "Extraction:" in summary_line or "Embedding:" in summary_line:
                         print(summary_line.strip())
                 return
tools/arxiv/arxiv_manager.py (4)

101-112: Docstring says “ignore version suffix,” but code validates original ID; use base_id.

Small logic/consistency fix; also removes an unused variable warning.

-        base_id = re.sub(r'v\d+$', '', arxiv_id)
+        base_id = re.sub(r'v\d+$', '', arxiv_id)
@@
-        if cls.ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.ARXIV_ID_PATTERN.match(base_id):
             return True, None
@@
-        if cls.OLD_ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.OLD_ARXIV_ID_PATTERN.match(base_id):
             return True, None

413-468: Missing transactional writes and distributed lock; risk of partial writes and duplicate processing.

Per learnings, writes should be atomic and guarded by a lock. Use Arango locks (arxiv_locks) and wrap all inserts in a single transaction.

Proposed flow:

  • acquire_lock(paper_info.sanitized_id) → skip or retry if locked
  • begin_transaction(write_collections=['arxiv_papers','arxiv_chunks','arxiv_embeddings','arxiv_structures']) // or tables/equations/images if you split
  • insert arxiv_papers, then chunks and structures within the same tx
  • commit; finally release_lock

Do you want me to draft a concrete implementation aligned with core.database.ArangoDBManager’s transaction API?
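As a rough illustration of that proposed flow, a minimal sketch follows. It assumes ArangoDBManager exposes acquire_lock/release_lock and begin_transaction helpers as named above; the exact signatures are assumptions, not the module's confirmed API.

```python
def store_paper_atomically(db_manager, paper_key, paper_doc, chunk_docs, embedding_docs, structures_doc):
    """Sketch of the lock-then-transaction flow; helper names and signatures are assumed."""
    if not db_manager.acquire_lock(paper_key):
        return False  # another worker holds the lock; skip or retry later
    try:
        txn_db = db_manager.begin_transaction(
            write_collections=['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings', 'arxiv_structures']
        )
        try:
            txn_db.collection('arxiv_papers').insert(paper_doc, overwrite=True)
            for doc in chunk_docs:
                txn_db.collection('arxiv_chunks').insert(doc, overwrite=True)
            for doc in embedding_docs:
                txn_db.collection('arxiv_embeddings').insert(doc, overwrite=True)
            txn_db.collection('arxiv_structures').insert(structures_doc, overwrite=True)
            txn_db.commit_transaction()
        except Exception:
            txn_db.abort_transaction()
            raise
    finally:
        db_manager.release_lock(paper_key)
    return True
```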


31-31: Update ArangoDBManager import to new core.database module
In tools/arxiv/arxiv_manager.py (line 31), change the import to:

-from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager
+from core.database.arango_db_manager import ArangoDBManager

389-391: Remove await from synchronous insert_document calls

In _store_arxiv_result (tools/arxiv/arxiv_manager.py), db_manager.insert_document is defined as a regular def, not async def, so awaiting it will raise. Drop the await before each call to:

  • self.db_manager.insert_document('arxiv_papers', arxiv_doc)
  • self.db_manager.insert_document('arxiv_embeddings', chunk_doc)
  • self.db_manager.insert_document('arxiv_structures', structures_doc)

Occurrences at lines 438–440, 454–455 and 466–467.

tools/github/github_document_manager.py (2)

268-270: Avoid sys.path manipulation; use absolute imports.

Modify to import DocumentTask via the package, per guidelines. The sys.path hack is fragile.

-        # Import here to avoid circular dependency
-        import sys
-        sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-        from core.processors.generic_document_processor import DocumentTask
+        from core.processors.generic_document_processor import DocumentTask

168-188: URL parsing is brittle; handle extra path segments and both HTTPS/SSH forms robustly.

Use urllib.parse and split on path safely; current split() will crash on owner/repo/subpath.

-        # Handle different URL formats
-        url = url.strip()
-        
-        # Extract owner/repo from various formats
-        if "github.com" in url:
-            # https://github.com/owner/repo or git@github.com:owner/repo
-            parts = url.split("github.com")[-1]
-            parts = parts.strip(":/").replace(".git", "")
-            owner, name = parts.split("/")
-        else:
-            # Assume owner/repo format
-            owner, name = url.split("/")
+        from urllib.parse import urlparse
+        url = url.strip()
+        owner = name = None
+        if "github.com" in url:
+            if url.startswith("git@"):
+                path = url.split("github.com:", 1)[-1]
+            else:
+                path = urlparse(url).path
+            parts = [p for p in path.replace(".git", "").strip("/").split("/") if p]
+            if len(parts) >= 2:
+                owner, name = parts[0], parts[1]
+        else:
+            parts = [p for p in url.split("/") if p]
+            if len(parts) == 2:
+                owner, name = parts
+        if not owner or not name:
+            raise ValueError(f"Unrecognized GitHub repo identifier: {url}")
core/processors/chunking_strategies.py (1)

477-497: SlidingWindow: emit a remainder/single chunk for short texts

Current guard skips any chunk when len(tokens) <= window_size. Emit a remainder/single chunk so short docs aren’t dropped.

-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
-                    
-                    chunk = TextChunk(
-                        text=chunk_text,
-                        start_char=start_char,
-                        end_char=start_char + len(chunk_text),
-                        chunk_index=len(chunks),
-                        metadata={
-                            'strategy': 'sliding_window_remainder',
-                            'token_count': len(remaining_tokens)
-                        }
-                    )
-                    chunks.append(chunk)
+        # Handle remainder or produce a single chunk for short inputs
+        last_index = len(chunks) * self.step_size
+        if last_index < len(tokens):
+            remaining_tokens = tokens[last_index:]
+            if remaining_tokens:  # emit any meaningful remainder
+                chunk_text = ' '.join(remaining_tokens)
+                start_char = len(' '.join(tokens[:last_index]))
+                chunk = TextChunk(
+                    text=chunk_text,
+                    start_char=start_char,
+                    end_char=start_char + len(chunk_text),
+                    chunk_index=len(chunks),
+                    metadata={
+                        'strategy': 'sliding_window_remainder' if chunks else 'sliding_window_single',
+                        'window_size': self.window_size,
+                        'step_size': self.step_size,
+                        'token_count': len(remaining_tokens)
+                    }
+                )
+                chunks.append(chunk)
tools/arxiv/arxiv_document_manager.py (2)

95-97: limit=0 bug: processes all files instead of none

if limit: treats 0 as falsy. Use is not None.

-        if limit:
-            pdf_files = pdf_files[:limit]
+        if limit is not None:
+            pdf_files = pdf_files[:limit]

246-252: Critical: validate_arxiv_id strips all digits and rejects valid IDs

rstrip('v0123456789') removes all trailing digits; for "2305.12345" it becomes "", failing validation. Parse the optional version suffix instead.

-        # Check paper number is 4-6 digits (or with version like 12345v2)
-        paper_num_base = paper_num.rstrip('v0123456789')
-        if not paper_num_base.isdigit() or not (4 <= len(paper_num_base) <= 6):
+        # Check paper number is 4–6 digits, optionally followed by version like v2
+        base, _, _ = paper_num.partition('v')
+        if not base.isdigit() or not (4 <= len(base) <= 6):
             return False
tools/arxiv/pipelines/arxiv_pipeline_v2.py (3)

167-186: Checkpoint ‘enabled’ should accept booleans and strings. Current logic misinterprets YAML true/false.

If config uses YAML booleans, _should_use_checkpoint falls into the auto branch erroneously.

Apply:

 def _should_use_checkpoint(self, count: int) -> bool:
@@
-        if self.checkpoint_enabled == 'true':
-            return True
-        elif self.checkpoint_enabled == 'false':
-            return False
-        else:  # auto mode
-            return count >= self.checkpoint_auto_threshold
+        v = self.checkpoint_enabled
+        # Accept booleans and common string variants
+        if isinstance(v, bool):
+            return v
+        if isinstance(v, str):
+            s = v.strip().lower()
+            if s in ('true', 'yes', 'on', '1'):
+                return True
+            if s in ('false', 'no', 'off', '0'):
+                return False
+            # auto or unknown → treat as auto
+        return count >= self.checkpoint_auto_threshold

254-267: Checkpoint never updates: gating on results['success'] is wrong.

GenericDocumentProcessor returns per-phase results; there’s no top-level ‘success’. Update based on embedding successes.

Apply:

-        # Step 3: Update checkpoint if enabled
-        if use_checkpoint and results.get('success'):
-            # Add successfully processed documents to checkpoint
-            if 'embedding' in results and 'success' in results['embedding']:
-                for doc_id in results['embedding']['success']:
-                    self.processed_ids.add(doc_id)
-                
-                # Save checkpoint periodically or at the end
-                self._save_checkpoint(
-                    extraction_results=results.get('extraction'),
-                    embedding_results=results.get('embedding')
-                )
-                logger.info(f"Checkpoint updated: {len(self.processed_ids)} total documents processed")
+        # Step 3: Update checkpoint if enabled
+        if use_checkpoint:
+            embed_success = results.get('embedding', {}).get('success', [])
+            if embed_success:
+                self.processed_ids.update(embed_success)
+                # Save checkpoint (periodic save_interval can be added later)
+                self._save_checkpoint(
+                    extraction_results=results.get('extraction'),
+                    embedding_results=results.get('embedding')
+                )
+                logger.info(f"Checkpoint updated: {len(self.processed_ids)} total documents processed")

277-294: Fix reporting: handle missing top-level ‘success’ and zero-division on avg chunks.

Avoid KeyError on results['success'] and only compute averages when total_processed > 0.

Apply:

-        if results['success']:
-            extraction = results['extraction']
-            embedding = results['embedding']
+        extraction = results.get('extraction', {})
+        embedding = results.get('embedding', {})
+        total_processed = results.get('total_processed', 0)
+        if extraction or embedding:
             logger.info(f"Extraction: {len(extraction['success'])} success, {len(extraction['failed'])} failed")
             logger.info(f"Embedding: {len(embedding['success'])} success, {len(embedding['failed'])} failed")
-            logger.info(f"Total processed: {results['total_processed']} documents")
+            logger.info(f"Total processed: {total_processed} documents")
 
-            if results['total_processed'] > 0:
-                rate = results['total_processed'] / elapsed * 60
+            if total_processed > 0:
+                rate = total_processed / elapsed * 60
                 logger.info(f"Processing rate: {rate:.1f} documents/minute")
-                
-                if 'chunks_created' in embedding:
-                    avg_chunks = embedding['chunks_created'] / results['total_processed']
-                    logger.info(f"Average chunks per document: {avg_chunks:.1f}")
+                if 'chunks_created' in embedding:
+                    avg_chunks = embedding['chunks_created'] / total_processed
+                    logger.info(f"Average chunks per document: {avg_chunks:.1f}")
core/framework/extractors/docling_extractor.py (3)

306-315: Flatten structured outputs and include a stable version to match downstream expectations

DocumentProcessor._extract_content reads top-level tables/equations/images/figures and version. Here, Docling results are nested under structures and lack version, so structured content is silently dropped and extractor_version becomes unknown.

Apply this diff to surface the structures at top level while keeping structures for backward compatibility and add version='docling_v2':

 return {
-    'full_text': full_text,
-    'markdown': full_text,  # Also provide as 'markdown' key
-    'structures': structures,
-    'metadata': {
-        'extractor': 'docling_v2',
-        'num_pages': num_pages,
-        'processing_time': None  # Docling doesn't provide this
-    }
+    'full_text': full_text,
+    'markdown': full_text,  # Also provide as 'markdown' key
+    # Flattened keys for downstream compatibility
+    'tables': structures.get('tables', []),
+    'equations': structures.get('equations', []),
+    'images': structures.get('images', []),
+    'figures': structures.get('figures', []),
+    # Retain original nested payload for backward compatibility
+    'structures': structures,
+    'metadata': {
+        'extractor': 'docling_v2',
+        'num_pages': num_pages,
+        'processing_time': None  # Populated by caller
+    },
+    'version': 'docling_v2'
 }

176-193: Fix processing_time accounting when falling back after a processing error

processing_time is computed before running the fallback path, under-reporting total time. Recompute after the fallback completes:

-        except Exception as e:
-            # Compute duration before handling exception
-            duration = (datetime.now() - start_time).total_seconds()
+        except Exception as e:
             # Don't use fallback for pre-validation errors (they should fail fast)
             error_msg = str(e)
             if any(x in error_msg for x in ["empty", "Invalid PDF header", "Cannot read PDF file"]):
                 logger.error(f"Pre-validation failed for {pdf_path}: {e}")
                 raise RuntimeError(f"Pre-validation failed for {pdf_path}: {e}") from e
             
             # Use fallback for processing errors if enabled
             if self.use_fallback:
                 logger.warning(f"Extraction failed, attempting fallback: {e}")
                 result = self._extract_fallback(pdf_path)
-                # Add processing time and pdf_path to metadata
-                result['metadata']['processing_time'] = duration
+                # Add processing time and pdf_path to metadata (include fallback runtime)
+                duration = (datetime.now() - start_time).total_seconds()
+                # Ensure metadata dict exists
+                result.setdefault('metadata', {})
+                result['metadata']['processing_time'] = duration
                 result['metadata']['pdf_path'] = str(pdf_path)
                 return result

388-401: Retain top‐level pdf_path or update all downstream usages
Multiple consumers index extracted['pdf_path'] (e.g. tools/arxiv/pipelines/arxiv_pipeline.py:95, experiments/word2vec_evolution/store_in_arango.py:65, tests/test_overnight_arxiv_phased.py:128). Removing that key in extract_batch will break these; either keep the top‐level duplication for backward compatibility or refactor every reference to use metadata['pdf_path'].

core/processors/document_processor.py (2)

392-403: Honor nested structures from extractors for backward compatibility

If upstream extractors return structures (as in current Docling path), tables/equations/images are otherwise lost. Merge from structures when top-level keys are missing:

-        return ExtractionResult(
-            full_text=full_text,
-            tables=docling_result.get('tables', []),
-            equations=docling_result.get('equations', []),
-            images=docling_result.get('images', []),
-            figures=docling_result.get('figures', []),
-            metadata=docling_result.get('metadata', {}),
-            latex_source=latex_source,
-            has_latex=has_latex,
-            extraction_time=time.time() - start_time,
-            extractor_version=docling_result.get('version', 'unknown')
-        )
+        structures = docling_result.get('structures', {}) or {}
+        return ExtractionResult(
+            full_text=full_text,
+            tables=docling_result.get('tables') or structures.get('tables', []),
+            equations=docling_result.get('equations') or structures.get('equations', []),
+            images=docling_result.get('images') or structures.get('images', []),
+            figures=docling_result.get('figures') or structures.get('figures', []),
+            metadata=docling_result.get('metadata', {}),
+            latex_source=latex_source,
+            has_latex=has_latex,
+            extraction_time=time.time() - start_time,
+            extractor_version=docling_result.get('version', 'unknown')
+        )

491-505: Make embedding generation robust to runtime errors

If embed_texts raises, process_document fails. Catch and degrade to zero-vector as documented.

-            # Generate embedding for chunk
-            # JinaV4Embedder expects a list and returns array of embeddings
-            embeddings = self.embedder.embed_texts([chunk['text']], batch_size=1)
-            embedding = embeddings[0] if len(embeddings) > 0 else None
+            # Generate embedding for chunk (robust to failures)
+            try:
+                embeddings = self.embedder.embed_texts([chunk['text']], batch_size=1)
+                embedding = embeddings[0] if len(embeddings) > 0 else None
+            except Exception as e:
+                logger.warning(f"Embedding failed for chunk {i}: {type(e).__name__}: {e}")
+                embedding = None
tools/arxiv/scripts/run_pipeline_from_list.py (1)

1-300: Enable Black and fix all lint violations in tools/arxiv
Environment lacks a black command and Ruff still reports 229 violations in tools/arxiv—including blank-line whitespace (W293), trailing whitespace (W291), unused variables (F841), bare except clauses (E722), and module‐level import ordering (E402). Install Black (pip install black), ensure it’s on PATH, then re-run

ruff check tools/arxiv --fix  
black tools/arxiv

to auto‐fix and manually address any remaining errors.

tools/arxiv/pipelines/arxiv_pipeline.py (1)

666-675: Align storage with required Arango collections and validate embedding dimensionality.

Guidelines require: arxiv_papers, arxiv_chunks, arxiv_embeddings, arxiv_structures; also vectors must be 2048-d with model='jina-v4'.

  • Include arxiv_structures in the transaction’s write set and store structures in one document.
  • Assert/guard that each embedding vector has 2048 dims; fail-fast otherwise.
-                write_collections = ['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings']
-                structures = doc.get('structures', {})
-                if structures.get('equations'):
-                    write_collections.append('arxiv_equations')
-                if structures.get('tables'):
-                    write_collections.append('arxiv_tables')
-                if structures.get('images'):
-                    write_collections.append('arxiv_images')
+                write_collections = ['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings', 'arxiv_structures']
+                structures = doc.get('structures', {})
-                        txn_db.collection('arxiv_embeddings').insert({
+                        emb_vec = chunk.embedding.tolist()
+                        if len(emb_vec) != 2048:
+                            raise ValueError(f"Unexpected embedding size {len(emb_vec)} for {arxiv_id}; expected 2048")
+                        txn_db.collection('arxiv_embeddings').insert({
                             '_key': f"{chunk_key}_emb",
                             'paper_id': sanitized_id,
                             'chunk_id': chunk_key,
-                            'vector': chunk.embedding.tolist(),
+                            'vector': emb_vec,
                             'model': 'jina-v4',
                             'embedding_method': 'late_chunking',
                             'embedding_date': datetime.now().isoformat()
                         }, overwrite=True)
-                    # Store equations if present
-                    for i, eq in enumerate(structures.get('equations', [])):
-                        txn_db.collection('arxiv_equations').insert({
-                            '_key': f"{sanitized_id}_eq_{i}",
-                            'paper_id': sanitized_id,
-                            'equation_index': i,
-                            'latex': eq.get('latex', ''),
-                            'label': eq.get('label', ''),
-                            'type': eq.get('type', 'display')
-                        }, overwrite=True)
-                    
-                    # Store tables if present
-                    for i, tbl in enumerate(structures.get('tables', [])):
-                        txn_db.collection('arxiv_tables').insert({
-                            '_key': f"{sanitized_id}_table_{i}",
-                            'paper_id': sanitized_id,
-                            'table_index': i,
-                            'caption': tbl.get('caption', ''),
-                            'headers': tbl.get('headers', []),
-                            'data_rows': tbl.get('rows', [])
-                        }, overwrite=True)
-                    
-                    # Store images if present
-                    for i, img in enumerate(structures.get('images', [])):
-                        txn_db.collection('arxiv_images').insert({
-                            '_key': f"{sanitized_id}_img_{i}",
-                            'paper_id': sanitized_id,
-                            'image_index': i,
-                            'caption': img.get('caption', ''),
-                            'type': img.get('type', 'figure')
-                        }, overwrite=True)
+                    # Store all detected structures in a single document
+                    txn_db.collection('arxiv_structures').insert({
+                        '_key': sanitized_id,
+                        'paper_id': sanitized_id,
+                        'equations': structures.get('equations', []),
+                        'tables': structures.get('tables', []),
+                        'images': structures.get('images', []),
+                        'counts': {
+                            'equations': len(structures.get('equations', [])),
+                            'tables': len(structures.get('tables', [])),
+                            'images': len(structures.get('images', [])),
+                        }
+                    }, overwrite=True)

Also applies to: 740-748, 750-781

core/processors/generic_document_processor.py (2)

154-160: Make extraction task timeout configurable.

Hard-coded 120s may be too short. Read from config with a sensible default.

-        num_workers = extraction_config.get('workers', 8)
+        num_workers = extraction_config.get('workers', 8)
         gpu_devices = extraction_config.get('gpu_devices', [0, 1])
+        task_timeout = extraction_config.get('task_timeout_seconds', 600)
@@
-                        result = future.result(timeout=120)
+                        result = future.result(timeout=task_timeout)

Also applies to: 179-184


409-414: Fix ArangoDBManager import to honor package layout and avoid sys.path hacks.

Prefer absolute imports per guidelines; keep a guarded fallback if needed.

-    import sys
-    from pathlib import Path
-    sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-    from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager
+    try:
+        from core.database.arango_db_manager import ArangoDBManager  # preferred
+    except Exception:
+        from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager  # fallback
tools/arxiv/db/config.py (1)

1-152: Install Black, update pyproject.toml, and run Ruff + Black on tools/arxiv/
Move ignore, select, and per-file-ignores into the lint section of pyproject.toml, then install Black (pip install black), and run:

ruff check tools/arxiv --fix
black tools/arxiv

Comment on lines 111 to 124

```python
    # First, ensure the GitHub graph is set up
    setup_script = Path(__file__).parent.parent.parent / 'tools' / 'github' / 'setup_github_graph.py'
    if setup_script.exists():
        logger.info("\nEnsuring GitHub graph collections exist...")
        result = subprocess.run(
            [sys.executable, str(setup_script)],
            capture_output=True,
            text=True,
            cwd=str(setup_script.parent)
        )
        if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
            logger.info("  ✓ GitHub graph collections ready")

    # Process each repository
```
@coderabbitai (Contributor) commented Sep 3, 2025

🛠️ Refactor suggestion

Check setup_github_graph return code and surface failures

Currently only scanning stdout text; a failing script can be silently ignored.

-        result = subprocess.run(
+        result = subprocess.run(
             [sys.executable, str(setup_script)],
             capture_output=True,
             text=True,
             cwd=str(setup_script.parent)
         )
-        if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
-            logger.info("  ✓ GitHub graph collections ready")
+        if result.returncode == 0 and (
+            "created" in result.stdout.lower() or "already exists" in result.stdout.lower()
+        ):
+            logger.info("  ✓ GitHub graph collections ready")
+        else:
+            logger.warning(f"  ⚠ GitHub graph setup may have failed (rc={result.returncode}). Inspect output:")
+            logger.warning(result.stdout.strip()[:1000])
+            logger.warning(result.stderr.strip()[:1000])
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/process_github_repos.py around lines 111 to
124, the wrapper currently only inspects stdout text from running
setup_github_graph.py and can silently ignore failures; update it to check
subprocess.CompletedProcess.returncode and treat non-zero return codes as
errors: after running subprocess.run, if result.returncode != 0 log an error
including result.stderr (and result.stdout for context), and raise SystemExit or
an exception (or sys.exit(1)) to stop processing; keep the existing success
check for the positive messages but only after confirming returncode == 0.

Owner Author

@coderabbitai fix submitted. please confirm

Contributor

@r3d91ll I can see that the fix has not yet been applied to the code. Looking at the current diff for experiments/word2vec_evolution/process_github_repos.py, the code still shows the original implementation on lines 111-124:

result = subprocess.run(
    [sys.executable, str(setup_script)],
    capture_output=True,
    text=True,
    cwd=str(setup_script.parent)
)
if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
    logger.info("  ✓ GitHub graph collections ready")

The suggested improvements are missing:

  • No result.returncode check
  • No error handling for non-zero return codes
  • No stderr logging on failures
  • No warning when setup fails

To apply the fix, you need to replace the current setup_github_graph section with the improved version that checks the return code and handles failures properly.
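For reference, a minimal sketch of what the corrected setup step could look like once the suggestion is applied. The logging wording and the choice to abort via SystemExit are assumptions; the variables and success check follow the snippet above:

# Sketch only: assumes logger, sys, subprocess and setup_script are defined as above.
if setup_script.exists():
    logger.info("\nEnsuring GitHub graph collections exist...")
    result = subprocess.run(
        [sys.executable, str(setup_script)],
        capture_output=True,
        text=True,
        cwd=str(setup_script.parent)
    )
    if result.returncode != 0:
        # Surface the failure instead of silently continuing.
        logger.error(f"GitHub graph setup failed (rc={result.returncode})")
        logger.error(result.stdout.strip()[:1000])
        logger.error(result.stderr.strip()[:1000])
        raise SystemExit(1)
    if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
        logger.info("  ✓ GitHub graph collections ready")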

Comment on lines +16 to +23
# Fix Python path
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(script_dir.parent))

from pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database import ArangoDBManager

🛠️ Refactor suggestion

Fix imports and satisfy Ruff E402; prefer absolute package paths

Move sys.path adjustments before any imports and use absolute imports.

-# Fix Python path
-script_dir = Path(__file__).resolve().parent
-project_root = script_dir.parent.parent.parent
-sys.path.insert(0, str(project_root))
-sys.path.insert(0, str(script_dir.parent))
-
-from pipelines.arxiv_pipeline import ACIDPhasedPipeline
-from core.database import ArangoDBManager
+# Fix Python path (before imports)
+script_dir = Path(__file__).resolve().parent
+project_root = script_dir.parent.parent.parent
+sys.path.insert(0, str(project_root))
+
+from tools.arxiv.pipelines.arxiv_pipeline import ACIDPhasedPipeline
+from core.database.arango_db_manager import ArangoDBManager

Also remove unused imports (json, datetime) to keep lint clean.

📝 Committable suggestion


Suggested change
# Fix Python path
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(script_dir.parent))
from pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database import ArangoDBManager
# Fix Python path (before imports)
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))
from tools.arxiv.pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database.arango_db_manager import ArangoDBManager
🧰 Tools
🪛 Ruff (0.12.2)

22-22: Module level import not at top of file

(E402)


23-23: Module level import not at top of file

(E402)

🤖 Prompt for AI Agents
In tools/arxiv/tests/quick_pipeline_test.py around lines 16 to 23, the test
currently performs sys.path modifications after module imports which triggers
Ruff E402 and uses relative or non-absolute imports plus includes unused
imports; move the sys.path.insert(...) calls to the very top of the file before
any import statements, change imports to absolute package paths (e.g., from
pipelines.arxiv_pipeline import ACIDPhasedPipeline -> use the full package
absolute import as appropriate in your project), and remove any unused imports
like json and datetime so the file imports are ordered and lint-clean.

Comment on lines +96 to +101
print(f"\nProcessing {len(available_papers)} available papers...")

# The pipeline processes from filesystem, not from a list
# We'll use the 'local' source and let it find papers
pipeline.run(source='local', count=len(available_papers))


🛠️ Refactor suggestion

Ensure the pipeline processes the exact test papers

source='local' + count=N may pick arbitrary PDFs. Populate specific_list with available_papers and use that source.

-        # The pipeline processes from filesystem, not from a list
-        # We'll use the 'local' source and let it find papers
-        pipeline.run(source='local', count=len(available_papers))
+        # Force the pipeline to process exactly our IDs
+        pipeline.config.setdefault('processing', {}).setdefault('specific_list', {})['arxiv_ids'] = available_papers
+        pipeline.run(source='specific_list', count=len(available_papers))
📝 Committable suggestion


Suggested change
print(f"\nProcessing {len(available_papers)} available papers...")
# The pipeline processes from filesystem, not from a list
# We'll use the 'local' source and let it find papers
pipeline.run(source='local', count=len(available_papers))
print(f"\nProcessing {len(available_papers)} available papers...")
# Force the pipeline to process exactly our IDs
pipeline.config.setdefault('processing', {}).setdefault('specific_list', {})['arxiv_ids'] = available_papers
pipeline.run(source='specific_list', count=len(available_papers))
🤖 Prompt for AI Agents
In tools/arxiv/tests/quick_pipeline_test.py around lines 96 to 101, the test
currently uses source='local' with count which can pick arbitrary PDFs; instead
populate the pipeline's specific_list with available_papers and call
pipeline.run with source='specific_list' (pass the specific_list argument set to
available_papers) so the pipeline processes exactly those test papers rather
than any local PDFs; remove or ignore the count parameter when using
specific_list.

# =========================================
# Runs the complete test suite for 5000+ papers

set -e # Exit on error

🛠️ Refactor suggestion

Propagate failures in pipelines (tee): enable pipefail

Without set -o pipefail, failures in python ... | tee won’t be detected.

-set -e  # Exit on error
+set -e  # Exit on error
+set -o pipefail  # Propagate failures through pipelines
📝 Committable suggestion


Suggested change
set -e # Exit on error
set -e # Exit on error
set -o pipefail # Propagate failures through pipelines
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around line 6, the script uses "set
-e" but doesn't enable pipefail so failures in pipelines like "python ... | tee"
can be ignored; update the shell options to include "set -o pipefail" (or change
to "set -euo pipefail" if you also want nounset and errexit behavior) so
pipeline failures are propagated and the script exits on any command in a
pipeline failing.

Comment on lines +89 to +91
echo "Clearing previous staging data..."
rm -rf "$STAGING_DIR"/*


🛠️ Refactor suggestion

Safe rm -rf: guard against accidental “/*”

Harden deletes with parameter expansion.

-rm -rf "$STAGING_DIR"/*
+rm -rf "${STAGING_DIR:?}/"*

Repeat the same change at Line 175.

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 90-90: Use "${var:?}" to ensure this never expands to /* .

(SC2115)

🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 89-91 (and again at
line 175), the unguarded rm -rf "$STAGING_DIR"/* can evaluate to rm -rf /* if
STAGING_DIR is empty; replace it with a safe parameter expansion guard so the
command aborts when STAGING_DIR is unset/empty (e.g., use the ${VAR:?} form) and
apply the same change at line 175 to harden both deletes.

Comment on lines +137 to +144
# Run the full test suite
python test_large_scale_processing.py \
--config ../configs/large_scale_test.yaml \
--papers "$PAPER_LIST" \
--arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"

TEST_EXIT_CODE=$?


⚠️ Potential issue

Capture the Python exit code when piping through tee

$? captures tee’s status. Use ${PIPESTATUS[0]}.

 python test_large_scale_processing.py \
     --config ../configs/large_scale_test.yaml \
     --papers "$PAPER_LIST" \
     --arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"
 
-TEST_EXIT_CODE=$?
+TEST_EXIT_CODE=${PIPESTATUS[0]}
📝 Committable suggestion


Suggested change
# Run the full test suite
python test_large_scale_processing.py \
--config ../configs/large_scale_test.yaml \
--papers "$PAPER_LIST" \
--arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"
TEST_EXIT_CODE=$?
# Run the full test suite
python test_large_scale_processing.py \
--config ../configs/large_scale_test.yaml \
--papers "$PAPER_LIST" \
--arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"
TEST_EXIT_CODE=${PIPESTATUS[0]}
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 137 to 144, the script
currently captures the exit code with TEST_EXIT_CODE=$? after piping the python
command through tee, which returns tee's status instead of the Python process;
replace that line with TEST_EXIT_CODE=${PIPESTATUS[0]} (or enable pipefail and
use $? accordingly) to capture the Python process exit code, and ensure the
script is executed with bash (shebang or invoked with bash) so ${PIPESTATUS[@]}
is supported.

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +42 to +66
def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
"""Fallback extraction using PyMuPDF."""
if not PYMUPDF_AVAILABLE:
return None

try:
with fitz.open(pdf_path) as doc:
text_parts = []
num_pages = len(doc)

for page_num, page in enumerate(doc):
page_text = page.get_text()
if page_text.strip():
text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")

full_text = "\n\n".join(text_parts)

return {
'full_text': full_text,
'text': full_text,
'markdown': full_text,
'num_pages': num_pages,
'extractor': 'pymupdf_fallback'
}
except Exception as e:

🛠️ Refactor suggestion

Return a schema-compatible fallback result (include metadata/structures)

Current fallback payload lacks the 'metadata' block used elsewhere and deviates from DoclingExtractor’s shape.

Apply:

-def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
+def _extract_with_pymupdf(pdf_path: str) -> dict[str, Any]:
@@
-            return {
-                'full_text': full_text,
-                'text': full_text,
-                'markdown': full_text,
-                'num_pages': num_pages,
-                'extractor': 'pymupdf_fallback'
-            }
+            return {
+                'full_text': full_text,
+                'markdown': full_text,
+                'structures': {},
+                'metadata': {
+                    'extractor': 'pymupdf_fallback',
+                    'num_pages': num_pages,
+                    'pdf_path': str(pdf_path),
+                    'processing_time': None,
+                    'warning': 'Using fallback extractor - structures not extracted'
+                }
+            }
📝 Committable suggestion


Suggested change
def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
"""Fallback extraction using PyMuPDF."""
if not PYMUPDF_AVAILABLE:
return None
try:
with fitz.open(pdf_path) as doc:
text_parts = []
num_pages = len(doc)
for page_num, page in enumerate(doc):
page_text = page.get_text()
if page_text.strip():
text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")
full_text = "\n\n".join(text_parts)
return {
'full_text': full_text,
'text': full_text,
'markdown': full_text,
'num_pages': num_pages,
'extractor': 'pymupdf_fallback'
}
except Exception as e:
def _extract_with_pymupdf(pdf_path: str) -> dict[str, Any]:
"""Fallback extraction using PyMuPDF."""
if not PYMUPDF_AVAILABLE:
return None
try:
with fitz.open(pdf_path) as doc:
text_parts = []
num_pages = len(doc)
for page_num, page in enumerate(doc):
page_text = page.get_text()
if page_text.strip():
text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")
full_text = "\n\n".join(text_parts)
return {
'full_text': full_text,
'markdown': full_text,
'structures': {},
'metadata': {
'extractor': 'pymupdf_fallback',
'num_pages': num_pages,
'pdf_path': str(pdf_path),
'processing_time': None,
'warning': 'Using fallback extractor - structures not extracted'
}
}
except Exception as e:
🤖 Prompt for AI Agents
In core/framework/extractors/robust_extractor.py around lines 42 to 66, the
pymupdf fallback return lacks the expected DoclingExtractor-shaped keys; update
the returned dict to include a top-level 'metadata' dict (e.g., {'source':
pdf_path, 'num_pages': num_pages, 'extractor': 'pymupdf_fallback'}) and a
'structures' list (e.g., per-page objects with page_number and text) in addition
to the existing text/full_text/markdown fields so the payload matches the rest
of the codebase's schema.

Comment on lines +11 to +19
- **Schema Migration**: Added columns for compute cost assessment
- `pdf_size_bytes`: Track actual PDF file sizes
- `processing_complexity`: Estimates processing difficulty
- `estimated_tokens`: Estimated token count for LLM processing
- `created_at`, `modified_at`: Timestamps managed by triggers
- **Migration Script**: `migrations/add_compute_assessment_columns.sql`
- Successfully applied to existing database with 2.7M papers
- Includes indexes for performance


💡 Verification agent

🧩 Analysis chain

Docs reference a SQL migration not shown in this PR; verify file exists and is chained.


🏁 Script executed:

#!/bin/bash
# Verify referenced migration exists and links to this initial revision
fd -t f -a 'add_compute_assessment_columns.sql' .
fd -t f -a 'tools/arxiv/db/migrations/versions' | sort
rg -n "down_revision\s*=\s*[\"']20250901_000001_initial_schema[\"']" tools/arxiv/db/migrations/versions -S

Length of output: 659


🏁 Script executed:

#!/bin/bash
set -e

echo "Finding add_compute_assessment_columns.sql"
find . -type f -name 'add_compute_assessment_columns.sql' || true

echo "Searching codebase for migration name references"
rg -l 'add_compute_assessment_columns' -F . || true

echo "Listing migrations directory under tools/arxiv/db/migrations"
find tools/arxiv/db/migrations -type f -print || true

Length of output: 656


Add missing migration file and versioned script
The docs reference migrations/add_compute_assessment_columns.sql, but that SQL file isn’t in the repo and there’s no Alembic version under tools/arxiv/db/migrations/versions chaining from 20250901_000001_initial_schema.py. Add the SQL migration and a corresponding version script with down_revision='20250901_000001_initial_schema'.

🧰 Tools
🪛 LanguageTool

[grammar] ~12-~12: There might be a mistake here.
Context: ...size_bytes: Track actual PDF file sizes - processing_complexity: Estimates processing difficulty - e...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...lexity: Estimates processing difficulty - estimated_tokens`: Estimated token count for LLM processi...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ...Estimated token count for LLM processing - created_at, modified_at: Timestamps managed by t...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...fied_at: Timestamps managed by triggers - **Migration Script**: migrations/add_comp...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...ged by triggers - Migration Script: migrations/add_compute_assessment_columns.sql - Successfully applied to existing databas...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...ed to existing database with 2.7M papers - Includes indexes for performance ### 2....

(QB_NEW_EN)

🤖 Prompt for AI Agents
In docs/implementation/arxiv_metadata_service_implementation.md around lines 11
to 19, the docs reference a missing migration file and Alembic version script;
add a SQL migration at migrations/add_compute_assessment_columns.sql containing
the DDL to add pdf_size_bytes, processing_complexity, estimated_tokens,
created_at, modified_at plus appropriate indexes and triggers, and create a
corresponding Alembic version script under tools/arxiv/db/migrations/versions
named with a proper timestamp that sets
down_revision='20250901_000001_initial_schema', implements upgrade() to run the
SQL (or emit equivalent SQLAlchemy migration steps) and downgrade() to revert
the changes, register the new revision in Alembic env, commit both files, and
verify by running the alembic upgrade to ensure the migration applies cleanly
against the existing DB.
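For illustration, a minimal Alembic revision along those lines might look like the sketch below. Only the column names and the down_revision come from the notes above; the revision id, table name, column types, and index name are assumptions, and the trigger that maintains modified_at is omitted:

"""Add compute assessment columns (sketch)."""
from alembic import op
import sqlalchemy as sa

revision = "20250903_000002_compute_assessment"  # hypothetical revision id
down_revision = "20250901_000001_initial_schema"
branch_labels = None
depends_on = None

TABLE = "papers"  # assumption: replace with the actual metadata table name


def upgrade() -> None:
    op.add_column(TABLE, sa.Column("pdf_size_bytes", sa.BigInteger(), nullable=True))
    op.add_column(TABLE, sa.Column("processing_complexity", sa.Float(), nullable=True))
    op.add_column(TABLE, sa.Column("estimated_tokens", sa.Integer(), nullable=True))
    op.add_column(TABLE, sa.Column("created_at", sa.DateTime(timezone=True), server_default=sa.text("now()")))
    op.add_column(TABLE, sa.Column("modified_at", sa.DateTime(timezone=True), server_default=sa.text("now()")))
    op.create_index("idx_papers_estimated_tokens", TABLE, ["estimated_tokens"])


def downgrade() -> None:
    op.drop_index("idx_papers_estimated_tokens", table_name=TABLE)
    for column in ("modified_at", "created_at", "estimated_tokens", "processing_complexity", "pdf_size_bytes"):
        op.drop_column(TABLE, column)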

Comment on lines +162 to +167
- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
- Tree-sitter for code parsing
- Jina v4 code adapter for embeddings
- ArangoDB for graph storage
- PostgreSQL for metadata


🛠️ Refactor suggestion

Replace local absolute path with repo-relative reference.

Hard-coding /home/todd/... breaks for all other environments.

-- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
+- Existing GitHub tools in `tools/github/`
📝 Committable suggestion


Suggested change
- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
- Tree-sitter for code parsing
- Jina v4 code adapter for embeddings
- ArangoDB for graph storage
- PostgreSQL for metadata
- Existing GitHub tools in `tools/github/`
- Tree-sitter for code parsing
- Jina v4 code adapter for embeddings
- ArangoDB for graph storage
- PostgreSQL for metadata
🤖 Prompt for AI Agents
In docs/issues/github_theory_practice_bridge_detection.md around lines 162 to
167, replace the hard-coded absolute path
"/home/todd/olympus/HADES/tools/github/" with a repository-relative reference
(for example "tools/github/" or "./tools/github/") or a placeholder like
"$REPO_ROOT/tools/github/" so the docs work across environments; update the list
entry accordingly and ensure any adjacent instructions reflect the new
repo-relative path convention.

Comment on lines +82 to +88
## Database Status
- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
- 2,491 total papers (including our 3)
- 42,511 total chunks
- 43,511 total embeddings
- 1 GitHub repository (word2vec only)


⚠️ Potential issue

Redact internal host details and sensitive counts.

Publishing a private IP (192.168.1.69) and internal corpus sizes invites information leakage. Reference env-configured endpoints instead and omit exact totals (or move to an internal report).

Apply this diff:

-## Database Status
-- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
-  - 2,491 total papers (including our 3)
-  - 42,511 total chunks
-  - 43,511 total embeddings
-  - 1 GitHub repository (word2vec only)
+## Database Status (sanitized)
+- ArangoDB database and collections reachable via ARANGO_HOST/ARANGO_PASSWORD env vars.
+- Document, chunk, and embedding totals available internally (omit from public docs).
+- GitHub repository ingestion validated for at least one repository.
📝 Committable suggestion


Suggested change
## Database Status
- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
- 2,491 total papers (including our 3)
- 42,511 total chunks
- 43,511 total embeddings
- 1 GitHub repository (word2vec only)
## Database Status (sanitized)
- ArangoDB database and collections reachable via ARANGO_HOST/ARANGO_PASSWORD env vars.
- Document, chunk, and embedding totals available internally (omit from public docs).
- GitHub repository ingestion validated for at least one repository.
🧰 Tools
🪛 LanguageTool

[grammar] ~82-~82: There might be a mistake here.
Context: ...e analysis (partial) ## Database Status - ArangoDB (academy_store @ 192.168.1....

(QB_NEW_EN)


[grammar] ~83-~83: There might be a mistake here.
Context: ...** (academy_store @ 192.168.1.69:8529) - 2,491 total papers (including our 3) -...

(QB_NEW_EN)


[grammar] ~84-~84: There might be a mistake here.
Context: ... - 2,491 total papers (including our 3) - 42,511 total chunks - 43,511 total emb...

(QB_NEW_EN)


[grammar] ~85-~85: There might be a mistake here.
Context: ...including our 3) - 42,511 total chunks - 43,511 total embeddings - 1 GitHub rep...

(QB_NEW_EN)


[grammar] ~86-~86: There might be a mistake here.
Context: ...total chunks - 43,511 total embeddings - 1 GitHub repository (word2vec only) ## ...

(QB_NEW_EN)

🤖 Prompt for AI Agents
In experiments/word2vec_evolution/experiment_summary.md around lines 82 to 88,
the document exposes an internal IP address and exact dataset counts; replace
the hard-coded host/IP with a reference to an environment-configured endpoint
(e.g., ARO_DB_HOST or configured connection string) and remove or redact precise
totals (use vague descriptors like "≈2.5k papers", "tens of thousands of
chunks/embeddings" or move exact numbers to an internal-only report). Also
remove any repository-specific sensitive identifiers or mark them as internal.
Ensure the public summary contains no raw internal hostnames/IPs or exact
counts.

Comment on lines +55 to +61
response = requests.get(pdf_url, stream=True)
response.raise_for_status()

with open(output_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)


🛠️ Refactor suggestion

Add request timeout, UA, and skip keep-alive chunks when downloading.

Prevents hangs and follows good HTTP hygiene for arXiv.

-    response = requests.get(pdf_url, stream=True)
-    response.raise_for_status()
-
-    with open(output_path, 'wb') as f:
-        for chunk in response.iter_content(chunk_size=8192):
-            f.write(chunk)
+    # Use a UA, streaming, and sensible connect/read timeouts
+    headers = {"User-Agent": "HADES-Lab/1.0 (+contact@example.com)"}  # set to your contact
+    with requests.get(pdf_url, stream=True, timeout=(5, 30), headers=headers) as response:
+        response.raise_for_status()
+        with open(output_path, 'wb') as f:
+            for chunk in response.iter_content(chunk_size=8192):
+                if chunk:  # skip keep-alive chunks
+                    f.write(chunk)
📝 Committable suggestion


Suggested change
response = requests.get(pdf_url, stream=True)
response.raise_for_status()
with open(output_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
# Use a UA, streaming, and sensible connect/read timeouts
headers = {"User-Agent": "HADES-Lab/1.0 (+contact@example.com)"} # set to your contact
with requests.get(pdf_url, stream=True, timeout=(5, 30), headers=headers) as response:
response.raise_for_status()
with open(output_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk: # skip keep-alive chunks
f.write(chunk)
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/import_papers.py around lines 55 to 61, the
download request lacks a timeout, a User-Agent header, and doesn't skip
keep-alive (empty) chunks; update the requests.get call to include a timeout
(e.g. timeout=10) and headers with a descriptive User-Agent (and optionally
Connection: close), and when writing response.iter_content(chunk_size=8192) skip
empty chunks by doing if not chunk: continue before f.write(chunk) so keep-alive
fragments are ignored.

Comment on lines +64 to +77
cmd = [
sys.executable,
str(pipeline_script),
"--config", str(config_path),
"--count", str(count),
"--arango-password", arango_password
]

print(f"Running command:")
print(f" {' '.join(cmd[:3])}")
print(f" {' '.join(cmd[3:5])}")
print(f" {' '.join(cmd[5:7])}")
print(f" --arango-password [HIDDEN]")
print()

🛠️ Refactor suggestion

⚠️ Potential issue

Don’t pass DB password via CLI; use env var. Also fix Ruff’s F541 f-strings.

Process args are visible via ps; switch to environment passing and simplify logging prints (no placeholders → no f-strings).

Apply:

@@
-    cmd = [
+    cmd = [
         sys.executable,
         str(pipeline_script),
         "--config", str(config_path),
-        "--count", str(count),
-        "--arango-password", arango_password
+        "--count", str(count),
     ]
 
-    print(f"Running command:")
-    print(f"  {' '.join(cmd[:3])}")
-    print(f"  {' '.join(cmd[3:5])}")
-    print(f"  {' '.join(cmd[5:7])}")
-    print(f"  --arango-password [HIDDEN]")
+    print("Running command:")
+    print("  " + " ".join(cmd))
+    print("  ARANGO_PASSWORD passed via environment")
@@
-        result = subprocess.run(
+        env = os.environ.copy()
+        env["ARANGO_PASSWORD"] = arango_password
+        result = subprocess.run(
             cmd,
             cwd=str(pipeline_dir),
-            capture_output=False,  # Show output in real-time
-            text=True
+            capture_output=False,  # Show output in real-time
+            text=True,
+            env=env
         )

Also applies to: 84-89

🧰 Tools
🪛 Ruff (0.12.2)

72-72: f-string without any placeholders

Remove extraneous f prefix

(F541)


76-76: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +102 to +114
def check_database() -> Tuple[bool, List[str]]:
"""Check database connectivity."""
issues = []

try:
from core.database import ArangoDBManager

config = {
'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
'database': 'academy_store',
'username': 'root',
'password': os.getenv('ARANGO_PASSWORD')
}

🛠️ Refactor suggestion

⚠️ Potential issue

Fix import to use absolute module path and avoid fragile package re-exports

from core.database import ArangoDBManager likely fails unless re-exported in core/database/__init__.py. Import directly:

-        from core.database import ArangoDBManager
+        from core.database.arango_db_manager import ArangoDBManager
📝 Committable suggestion


Suggested change
def check_database() -> Tuple[bool, List[str]]:
"""Check database connectivity."""
issues = []
try:
from core.database import ArangoDBManager
config = {
'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
'database': 'academy_store',
'username': 'root',
'password': os.getenv('ARANGO_PASSWORD')
}
def check_database() -> Tuple[bool, List[str]]:
"""Check database connectivity."""
issues = []
try:
from core.database.arango_db_manager import ArangoDBManager
config = {
'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
'database': 'academy_store',
'username': 'root',
'password': os.getenv('ARANGO_PASSWORD')
}
🤖 Prompt for AI Agents
In tools/arxiv/tests/validate_pipeline.py around lines 102 to 114, the code uses
a re-exported import "from core.database import ArangoDBManager" which is
fragile; replace it with a direct import from the module that actually defines
ArangoDBManager (for example "from core.database.manager import ArangoDBManager"
or the correct defining module), so the test does not rely on package re-exports
and will import reliably.

Comment on lines +124 to +129
required_collections = [
'arxiv_papers',
'arxiv_chunks',
'arxiv_embeddings',
'arxiv_structures'
]

🛠️ Refactor suggestion

⚠️ Potential issue

Align required collections with the manager’s schema

arxiv_structures isn’t created by _ensure_collections; the actual collections are arxiv_equations, arxiv_tables, and arxiv_images. Update the check:

-        required_collections = [
-            'arxiv_papers',
-            'arxiv_chunks', 
-            'arxiv_embeddings',
-            'arxiv_structures'
-        ]
+        required_collections = [
+            'arxiv_papers',
+            'arxiv_chunks',
+            'arxiv_embeddings',
+            'arxiv_equations',
+            'arxiv_tables',
+            'arxiv_images',
+        ]
📝 Committable suggestion


Suggested change
required_collections = [
'arxiv_papers',
'arxiv_chunks',
'arxiv_embeddings',
'arxiv_structures'
]
required_collections = [
'arxiv_papers',
'arxiv_chunks',
'arxiv_embeddings',
'arxiv_equations',
'arxiv_tables',
'arxiv_images',
]
🤖 Prompt for AI Agents
In tools/arxiv/tests/validate_pipeline.py around lines 124 to 129, the
required_collections list incorrectly expects 'arxiv_structures' which is not
created by _ensure_collections; replace it with the actual collections created
by the manager: 'arxiv_equations', 'arxiv_tables', and 'arxiv_images'. Update
the required_collections list to include these three names (and remove
'arxiv_structures'), so the test aligns with the manager schema and validates
the real collections created by _ensure_collections.

Comment on lines +84 to +95
# Process first 5 files as demo
demo_tasks = tasks[:5]
logger.info(f"\n Processing {len(demo_tasks)} files as demo...")

results = processor.process_documents(demo_tasks)

# Step 3: Show results
logger.info("\n3. Processing Results:")
logger.info(f" - Total processed: {results.get('total_processed', 0)}")
logger.info(f" - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
logger.info(f" - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")


⚠️ Potential issue

Fix result keys: use nested extraction/embedding summaries from GenericDocumentProcessor.

Current keys (extraction_success, etc.) don’t exist and always print zeros.

-    logger.info("\n3. Processing Results:")
-    logger.info(f"   - Total processed: {results.get('total_processed', 0)}")
-    logger.info(f"   - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
-    logger.info(f"   - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")
+    logger.info("\n3. Processing Results:")
+    extraction = results.get('extraction', {}) or {}
+    embedding = results.get('embedding', {}) or {}
+    logger.info(f"   - Total processed: {results.get('total_processed', 0)}")
+    logger.info(f"   - Extraction phase: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
+    logger.info(f"   - Embedding phase: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")
📝 Committable suggestion


Suggested change
# Process first 5 files as demo
demo_tasks = tasks[:5]
logger.info(f"\n Processing {len(demo_tasks)} files as demo...")
results = processor.process_documents(demo_tasks)
# Step 3: Show results
logger.info("\n3. Processing Results:")
logger.info(f" - Total processed: {results.get('total_processed', 0)}")
logger.info(f" - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
logger.info(f" - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")
# Step 3: Show results
logger.info("\n3. Processing Results:")
extraction = results.get('extraction', {}) or {}
embedding = results.get('embedding', {}) or {}
logger.info(f" - Total processed: {results.get('total_processed', 0)}")
logger.info(f" - Extraction phase: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
logger.info(f" - Embedding phase: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")
🤖 Prompt for AI Agents
In tools/github/demo_github_pipeline.py around lines 84 to 95, the log uses
non-existent flat keys like extraction_success and embedding_success which
always print zeros; update the log to read the nested summaries produced by
GenericDocumentProcessor (e.g. results.get("extraction", {}).get("success", 0)
and results.get("extraction", {}).get("failed", 0) and similarly for
results.get("embedding", {}).get("success", 0) / .get("failed", 0)), keeping
fallbacks to 0 so missing keys don’t crash, and adjust the logger.info calls to
use those nested values.

Comment on lines +139 to +147
FOR paper IN arxiv_embeddings
FOR code IN github_embeddings
LET similarity = DOT(paper.embedding, code.embedding)
FILTER similarity > 0.8
RETURN {
paper: paper.title,
code: code.source_file,
similarity: similarity
}

💡 Verification agent

🧩 Analysis chain

AQL uses an undefined DOT() function; the example won’t run on vanilla ArangoDB.

Replace with an available similarity function (e.g., COSINE_SIMILARITY or COSINE_DISTANCE depending on ArangoDB version), or reference a precomputed field.

-    LET similarity = DOT(paper.embedding, code.embedding)
+    // Requires ArangoDB vector functions (3.11/3.12+) or your own UDF.
+    LET similarity = 1 - COSINE_DISTANCE(paper.embedding, code.embedding)

🌐 Web query:

Does ArangoDB AQL provide built-in cosine similarity or distance functions for vector arrays (version 3.11+)? If so, what are the function names and usage?

💡 Result:

Yes.

Built-in AQL vector functions available in ArangoDB 3.11:

  • COSINE_SIMILARITY(x, y) — returns cosine similarity between two numeric arrays (or a 2D array and a 1D array). Example: COSINE_SIMILARITY([0,1], [1,0]) → 0. [1][2]
  • L1_DISTANCE(x, y) — Manhattan (L1) distance between two numeric arrays. Example: L1_DISTANCE([-1,-1], [2,2]) → 6. [2]
  • L2_DISTANCE(x, y) — Euclidean (L2) distance between two numeric arrays. Example: L2_DISTANCE([1,1], [5,2]) → 4.1231056. [2]

Notes:

  • Vector index–accelerated functions for approximate nearest-neighbor searches (e.g., APPROX_NEAR_COSINE and APPROX_NEAR_L2) were introduced later (3.12+) and require a vector index and the experimental vector-index option; those are used in queries like: SORT APPROX_NEAR_COSINE(doc.vector, @q) DESC. [3]

Sources: ArangoDB AQL function docs and release notes. [1][2][3]


Replace undefined DOT() with built-in AQL functions
File: tools/github/README.md lines 139–147

-    LET similarity = DOT(paper.embedding, code.embedding)
+    LET similarity = COSINE_SIMILARITY(paper.embedding, code.embedding)

Use COSINE_SIMILARITY(x, y) for cosine similarity (ArangoDB 3.11+); for distances, use L1_DISTANCE or L2_DISTANCE; in 3.12+ you can also leverage APPROX_NEAR_COSINE/APPROX_NEAR_L2 for index-accelerated searches.

📝 Committable suggestion


Suggested change
FOR paper IN arxiv_embeddings
FOR code IN github_embeddings
LET similarity = DOT(paper.embedding, code.embedding)
FILTER similarity > 0.8
RETURN {
paper: paper.title,
code: code.source_file,
similarity: similarity
}
FOR paper IN arxiv_embeddings
FOR code IN github_embeddings
LET similarity = COSINE_SIMILARITY(paper.embedding, code.embedding)
FILTER similarity > 0.8
RETURN {
paper: paper.title,
code: code.source_file,
similarity: similarity
}
🤖 Prompt for AI Agents
In tools/github/README.md around lines 139 to 147, the query uses an undefined
DOT() function; replace it with the built-in AQL similarity functions: compute
similarity = COSINE_SIMILARITY(paper.embedding, code.embedding) and use FILTER
similarity > 0.8; alternatively for distances use L1_DISTANCE or L2_DISTANCE,
and on ArangoDB 3.12+ consider switching to APPROX_NEAR_COSINE or APPROX_NEAR_L2
for index-accelerated searches and better performance.
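If the query is driven from Python (as elsewhere in this repo via python-arango), a hedged sketch of running the corrected AQL could look like the following; the environment-variable names and the 0.8 threshold mirror the surrounding examples, everything else is illustrative:

import os
from arango import ArangoClient

client = ArangoClient(hosts=os.getenv("ARANGO_HOST", "http://localhost:8529"))
db = client.db(
    os.getenv("ARANGO_DB", "academy_store"),
    username=os.getenv("ARANGO_USER", "root"),
    password=os.environ["ARANGO_PASSWORD"],  # fail fast if unset
)

query = """
FOR paper IN arxiv_embeddings
    FOR code IN github_embeddings
        LET similarity = COSINE_SIMILARITY(paper.embedding, code.embedding)
        FILTER similarity > @threshold
        RETURN { paper: paper.title, code: code.source_file, similarity: similarity }
"""

# Iterate candidate theory-practice bridges above the similarity threshold.
for bridge in db.aql.execute(query, bind_vars={"threshold": 0.8}):
    print(bridge)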

@coderabbitai coderabbitai bot left a comment

Review continued from previous batch...

Comment on lines +155 to +166
# Calculate metrics
metrics = self._calculate_metrics(tree, content)

# Extract structure
structure = self._extract_structure(tree, content, language)

return {
'symbols': symbols,
'metrics': metrics,
'structure': structure,
'language': language,
'tree_sitter_version': tree_sitter.__version__ if hasattr(tree_sitter, '__version__') else 'unknown'

🛠️ Refactor suggestion

Metrics use the wrong language; pass the detected language into _calculate_metrics.

Currently, _calculate_metrics infers language from tree.language (which isn’t available on Tree objects), silently falling back to "python". Complexity is miscounted for non-Python files.

Apply this diff:

-            metrics = self._calculate_metrics(tree, content)
+            metrics = self._calculate_metrics(tree, content, language)
-def _calculate_metrics(self, tree, content: str) -> Dict[str, Any]:
+def _calculate_metrics(self, tree, content: str, language: str) -> Dict[str, Any]:
@@
-        # Get language-specific control flow nodes or use a default set
-        language = getattr(tree, 'language', 'python')
-        cf_nodes = control_flow_nodes.get(language, control_flow_nodes['python'])
+        # Use the detected language passed from caller
+        cf_nodes = control_flow_nodes.get(language, control_flow_nodes['python'])

Also applies to: 687-731

🤖 Prompt for AI Agents
In core/framework/extractors/tree_sitter_extractor.py around lines 155-166 (and
the similar block around lines 687-731), _calculate_metrics is being called
without the detected language, causing it to infer language incorrectly from
tree and default to "python"; update calls to pass the detected language
variable (e.g., metrics = self._calculate_metrics(tree, content, language)) and
modify _calculate_metrics to accept a language parameter and use that value
instead of trying to read tree.language or defaulting to "python".

Comment on lines +321 to +333
name_node = node.child_by_field_name('name')
if name_node:
# Check for generator by looking for '*' token
# Rely on the node's generator attribute for JS/TS
is_generator = bool(getattr(node, 'generator', False))

symbols['functions'].append({
'name': content[name_node.start_byte:name_node.end_byte],
'line': name_node.start_point[0] + 1,
'async': self._has_child_type(node, 'async'),
'generator': is_generator
})


🛠️ Refactor suggestion

Fix JS/TS generator/async detection.

Tree-sitter nodes don’t expose a “generator” boolean. The current getattr(...) will always be False.

Apply this diff:

-                if name_node:
-                    # Check for generator by looking for '*' token
-                    # Rely on the node's generator attribute for JS/TS
-                    is_generator = bool(getattr(node, 'generator', False))
-                    
-                    symbols['functions'].append({
-                        'name': content[name_node.start_byte:name_node.end_byte],
-                        'line': name_node.start_point[0] + 1,
-                        'async': self._has_child_type(node, 'async'),
-                        'generator': is_generator
-                    })
+                if name_node:
+                    prefix = content[node.start_byte:name_node.start_byte]
+                    is_async = self._has_child_type(node, 'async') or prefix.lstrip().startswith('async')
+                    is_generator = '*' in prefix
+                    symbols['functions'].append({
+                        'name': content[name_node.start_byte:name_node.end_byte],
+                        'line': name_node.start_point[0] + 1,
+                        'async': is_async,
+                        'generator': is_generator,
+                    })
📝 Committable suggestion


Suggested change
name_node = node.child_by_field_name('name')
if name_node:
# Check for generator by looking for '*' token
# Rely on the node's generator attribute for JS/TS
is_generator = bool(getattr(node, 'generator', False))
symbols['functions'].append({
'name': content[name_node.start_byte:name_node.end_byte],
'line': name_node.start_point[0] + 1,
'async': self._has_child_type(node, 'async'),
'generator': is_generator
})
name_node = node.child_by_field_name('name')
if name_node:
prefix = content[node.start_byte:name_node.start_byte]
is_async = self._has_child_type(node, 'async') or prefix.lstrip().startswith('async')
is_generator = '*' in prefix
symbols['functions'].append({
'name': content[name_node.start_byte:name_node.end_byte],
'line': name_node.start_point[0] + 1,
'async': is_async,
'generator': is_generator,
})
🤖 Prompt for AI Agents
In core/framework/extractors/tree_sitter_extractor.py around lines 321 to 333,
the code uses getattr(node, 'generator', False) which is invalid for tree-sitter
nodes and will always be False; replace that with a real check for the '*'
generator token by scanning the node for a child token or character '*' before
the function name (e.g., inspect node.children or the source slice between
node.start_byte and name_node.start_byte for a '*' character) and set
is_generator based on that check (keep the async detection using
self._has_child_type(node, 'async')).

Comment on lines +51 to +56
def __init__(self, config_path: str):
# Load configuration
import json
with open(config_path, 'r') as f:
self.config = json.load(f)


⚠️ Potential issue

Fix example: loads YAML with json.load (incorrect).

Use yaml.safe_load to match the YAML config referenced.

-        import json
-        with open(config_path, 'r') as f:
-            self.config = json.load(f)
+        import yaml
+        with open(config_path) as f:
+            self.config = yaml.safe_load(f)
📝 Committable suggestion


Suggested change
def __init__(self, config_path: str):
# Load configuration
import json
with open(config_path, 'r') as f:
self.config = json.load(f)
def __init__(self, config_path: str):
# Load configuration
import yaml
with open(config_path) as f:
self.config = yaml.safe_load(f)
🤖 Prompt for AI Agents
In docs/adr/0005-arxiv-pipeline-v2-architecture.md around lines 51 to 56, the
constructor currently uses json.load to parse a YAML configuration file; change
this to import and use yaml.safe_load so the YAML config is parsed correctly
(add "import yaml" and replace json.load with yaml.safe_load when reading the
file), ensuring the file is still opened in text mode and assigned to
self.config.

Comment on lines +28 to +31
# Connect to ArangoDB
client = ArangoClient(hosts='http://192.168.1.69:8529')
self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')


⚠️ Potential issue

Secret leak: hardcoded Arango root password and host

Credentials and host must not be in code. Rotate the password immediately and switch to env/config-based loading.

+import os
@@
-        client = ArangoClient(hosts='http://192.168.1.69:8529')
-        self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')
+        host = os.getenv("ARANGO_HOST", "http://localhost:8529")
+        db_name = os.getenv("ARANGO_DB", "academy_store")
+        user = os.getenv("ARANGO_USER", "root")
+        password = os.getenv("ARANGO_PASSWORD")
+        if not password:
+            raise ValueError("ARANGO_PASSWORD not set")
+        client = ArangoClient(hosts=host)
+        self.db = client.db(db_name, username=user, password=password)

If this secret was committed, consider revoking it and auditing access.

📝 Committable suggestion


Suggested change
# Connect to ArangoDB
client = ArangoClient(hosts='http://192.168.1.69:8529')
self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')
import os
# Connect to ArangoDB
host = os.getenv("ARANGO_HOST", "http://localhost:8529")
db_name = os.getenv("ARANGO_DB", "academy_store")
user = os.getenv("ARANGO_USER", "root")
password = os.getenv("ARANGO_PASSWORD")
if not password:
raise ValueError("ARANGO_PASSWORD not set")
client = ArangoClient(hosts=host)
self.db = client.db(db_name, username=user, password=password)
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/analyze_theory_practice_bridges.py around
lines 28-31, the ArangoDB host and root password are hardcoded; replace these
with configuration-driven loading (e.g., read ARANGO_HOST, ARANGO_USER,
ARANGO_PASSWORD from environment variables or a secured config store), validate
presence and fail fast with a clear error if missing, and remove the plaintext
credentials from the file; after updating, rotate/revoke the exposed credentials
and ensure the repo history and any CI secrets do not contain the old values
(use git-secret / git-filter-repo or similar if necessary).

Comment on lines +160 to +165
for i, p_emb in enumerate(paper_embeddings):
for j, c_emb in enumerate(code_embeddings):
# Cosine similarity (1 - cosine distance)
similarity = 1 - cosine(p_emb, c_emb)
similarity_matrix[i, j] = similarity


🛠️ Refactor suggestion

NaN-safe cosine similarity

Guard against zero vectors producing NaNs.

-                similarity = 1 - cosine(p_emb, c_emb)
+                similarity = 1 - cosine(p_emb, c_emb)
+                if not np.isfinite(similarity):
+                    similarity = 0.0
                 similarity_matrix[i, j] = similarity
📝 Committable suggestion


Suggested change
for i, p_emb in enumerate(paper_embeddings):
for j, c_emb in enumerate(code_embeddings):
# Cosine similarity (1 - cosine distance)
similarity = 1 - cosine(p_emb, c_emb)
similarity_matrix[i, j] = similarity
for i, p_emb in enumerate(paper_embeddings):
for j, c_emb in enumerate(code_embeddings):
# Cosine similarity (1 - cosine distance)
similarity = 1 - cosine(p_emb, c_emb)
if not np.isfinite(similarity):
similarity = 0.0
similarity_matrix[i, j] = similarity
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/analyze_theory_practice_bridges.py around
lines 160 to 165, the cosine calculation can produce NaN when one or both
vectors are zero; modify the loop to compute cosine similarity with an explicit
norm check: compute the L2 norms of p_emb and c_emb, if either norm is zero set
similarity to 0.0 (or another defined fallback) else compute similarity =
dot(p_emb, c_emb) / (norm_p * norm_c); assign that value to similarity_matrix[i,
j]. Ensure no NaNs are written by handling the zero-norm case before division.
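A sketch of that norm-check variant in plain NumPy (the helper name is illustrative; it assumes p_emb and c_emb are 1-D float arrays):

import numpy as np

def safe_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return cosine similarity, falling back to 0.0 for zero-norm vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # defined fallback instead of NaN
    return float(np.dot(a, b) / (norm_a * norm_b))

# similarity_matrix[i, j] = safe_cosine_similarity(p_emb, c_emb)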

Comment on lines +49 to +60
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
echo "Found $PAPER_COUNT papers in existing collections"

if [ "$PAPER_COUNT" -lt 20000 ]; then
echo "Collection has less than 20,000 papers. Running extended collector..."
cd "$ARXIV_DIR/scripts"
python collect_ai_papers_extended.py
else
echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}"
fi

🛠️ Refactor suggestion

Add validation for paper count parsing

The paper count extraction using awk could fail silently if the file format is unexpected. Consider adding error handling for the arithmetic operation.

Apply this improved version with validation:

 COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
 if [ -d "$COLLECTION_DIR" ]; then
     PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
+    if ! [[ "$PAPER_COUNT" =~ ^[0-9]+$ ]]; then
+        echo -e "${RED}ERROR: Invalid paper count detected${NC}"
+        exit 1
+    fi
     echo "Found $PAPER_COUNT papers in existing collections"
📝 Committable suggestion


Suggested change
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
echo "Found $PAPER_COUNT papers in existing collections"
if [ "$PAPER_COUNT" -lt 20000 ]; then
echo "Collection has less than 20,000 papers. Running extended collector..."
cd "$ARXIV_DIR/scripts"
python collect_ai_papers_extended.py
else
echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}"
fi
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
if ! [[ "$PAPER_COUNT" =~ ^[0-9]+$ ]]; then
echo -e "${RED}ERROR: Invalid paper count detected${NC}"
exit 1
fi
echo "Found $PAPER_COUNT papers in existing collections"
if [ "$PAPER_COUNT" -lt 20000 ]; then
echo "Collection has less than 20,000 papers. Running extended collector..."
cd "$ARXIV_DIR/scripts"
python collect_ai_papers_extended.py
else
echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}"
fi

Comment on lines +86 to +92
if [ -d "$STAGING_DIR" ]; then
rm -rf "$STAGING_DIR"/*
echo "✅ Staging directory cleaned"
else
mkdir -p "$STAGING_DIR"
echo "✅ Staging directory created"
fi

⚠️ Potential issue

Fix potential dangerous expansion in staging directory cleanup

The rm -rf command on line 87 could expand to /* if $STAGING_DIR is unexpectedly empty, potentially deleting all files on the system.

Apply this fix to ensure safe directory cleanup:

 # Step 2: Clean up staging area
 echo -e "${YELLOW}Step 2: Cleanup${NC}"
 echo "Cleaning staging directory..."
 STAGING_DIR="/dev/shm/weekend_staging"
 if [ -d "$STAGING_DIR" ]; then
-    rm -rf "$STAGING_DIR"/*
+    rm -rf "${STAGING_DIR:?}"/*
     echo "✅ Staging directory cleaned"
 else
     mkdir -p "$STAGING_DIR"
     echo "✅ Staging directory created"
 fi
📝 Committable suggestion


Suggested change (before → after)

Before:

if [ -d "$STAGING_DIR" ]; then
    rm -rf "$STAGING_DIR"/*
    echo "✅ Staging directory cleaned"
else
    mkdir -p "$STAGING_DIR"
    echo "✅ Staging directory created"
fi

After:

if [ -d "$STAGING_DIR" ]; then
    rm -rf "${STAGING_DIR:?}"/*
    echo "✅ Staging directory cleaned"
else
    mkdir -p "$STAGING_DIR"
    echo "✅ Staging directory created"
fi
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 87-87: Use "${var:?}" to ensure this never expands to /* .

(SC2115)

🤖 Prompt for AI Agents
In tools/arxiv/scripts/run_weekend_test.sh around lines 86 to 92, the rm -rf
"$STAGING_DIR"/* can expand to /* if $STAGING_DIR is empty; ensure $STAGING_DIR
is non-empty and existing, then replace the unsafe rm usage with a safe deletion
of only the directory contents (for example check [ -n "$STAGING_DIR" ] and [ -d
"$STAGING_DIR" ] first, then use a command that deletes only its contents such
as find "$STAGING_DIR" -mindepth 1 -delete or find "$STAGING_DIR" -mindepth 1
-maxdepth 1 -exec rm -rf -- {} +) so you never run rm on / or the root glob when
the variable is empty.

Comment on lines +108 to +111
pipeline = ACIDPhasedPipeline(
    config_path=str(self.config_path),
    arango_password=self.arango_password
)

💡 Verification agent

🧩 Analysis chain

Verify that ACIDPhasedPipeline constructor accepts arango_password parameter

The code passes arango_password to ACIDPhasedPipeline constructor, but this parameter doesn't appear in the constructor signature shown in the relevant code snippets.


🏁 Script executed:

#!/bin/bash
# Check the ACIDPhasedPipeline constructor signature
ast-grep --pattern 'class ACIDPhasedPipeline:
  $$$
  def __init__(self, $_):
    $$$'

Length of output: 17365


ACIDPhasedPipeline constructor missing arango_password parameter
The test at tools/arxiv/tests/test_large_scale_processing.py:108–111 calls

ACIDPhasedPipeline(config_path=…, arango_password=…)

but the class’s __init__ in tools/arxiv/pipelines/arxiv_pipeline.py only accepts config_path (line 836). Add an arango_password parameter to its signature or remove it from the test.

🤖 Prompt for AI Agents
In tools/arxiv/pipelines/arxiv_pipeline.py around line 836, the
ACIDPhasedPipeline.__init__ only accepts config_path but the test constructs it
with arango_password; update the constructor signature to accept an optional
arango_password parameter (e.g., arango_password=None), assign it to
self.arango_password, and propagate it to any internal Arango/DB client creation
code that requires a password; if no client currently uses it, still store the
attribute to preserve test compatibility.
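
A minimal sketch of that constructor change, assuming the password only needs to be stored and handed to whatever Arango client the pipeline builds; the rest of the class body is elided and the environment-variable fallback is an assumption, not confirmed behavior:

# Sketch for ACIDPhasedPipeline in tools/arxiv/pipelines/arxiv_pipeline.py.
# Only the signature change is shown; the existing setup code is elided.
import os
from typing import Optional


class ACIDPhasedPipeline:
    def __init__(self, config_path: str, arango_password: Optional[str] = None):
        self.config_path = config_path
        # Store the password (falling back to the environment) so tests can pass
        # it explicitly while existing callers keep working unchanged.
        self.arango_password = arango_password or os.environ.get("ARANGO_PASSWORD")
        # ... existing config loading / DB wiring would follow, passing
        # self.arango_password to any Arango connection it creates.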

Comment on lines +122 to +126
results = pipeline.process_papers(batch, count=len(batch))

total_processed += results.get('successful', 0)
total_failed += results.get('failed', 0)


💡 Verification agent

🧩 Analysis chain

Pipeline's process_papers method signature inconsistency

The code calls pipeline.process_papers() with various arguments including detailed_timing=True, but the pipeline class shown in relevant snippets doesn't have this method. Only the run() method is visible in the ACIDPhasedPipeline class.

Also applies to: 196-198, 365-369


🏁 Script executed:

#!/bin/bash
# Search for process_papers method in ACIDPhasedPipeline class
ast-grep --pattern 'class ACIDPhasedPipeline:
  $$$
  def process_papers($$$):
    $$$'

# Also check if there's a different pipeline class being used
rg -n "def process_papers" --type=py

Length of output: 197


🏁 Script executed:

#!/bin/bash
# Display the definition of process_papers in OnDemandProcessor
sed -n '115,135p' tools/arxiv/utils/on_demand_processor.py

Length of output: 655


🏁 Script executed:

#!/bin/bash
# Show imports and pipeline instantiation in test file
sed -n '1,80p' tools/arxiv/tests/test_large_scale_processing.py

Length of output: 2484


🏁 Script executed:

#!/bin/bash
# Locate and display the ACIDPhasedPipeline class definition and its methods
file=$(rg -l "class ACIDPhasedPipeline" --type=py)
echo "Found pipeline file: $file"
sed -n '1,200p' "$file"

Length of output: 6245


ACIDPhasedPipeline lacks process_papers—update tests or implement method
Tests in tools/arxiv/tests/test_large_scale_processing.py (lines 122–126, 196–198, 365–369) invoke pipeline.process_papers, but ACIDPhasedPipeline only defines run(). Either add a process_papers alias/wrapper in the pipeline class or change the tests to call pipeline.run() with the appropriate arguments.
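
If the wrapper route is chosen, a minimal sketch could look like the following; the real run() signature and result keys are not visible in this review, so the keyword arguments and the normalized keys below are assumptions:

# Sketch of a compatibility wrapper on ACIDPhasedPipeline; run()'s keyword
# arguments and result shape are assumed, not the confirmed API.
from typing import Any, Dict, List, Optional


class ACIDPhasedPipeline:  # excerpt; the rest of the class is unchanged
    def run(self, **kwargs) -> Dict[str, Any]:
        # Existing phased processing lives here (not shown in this review);
        # stubbed to return an empty result so the sketch is self-contained.
        return {}

    def process_papers(self, papers: List[str], count: Optional[int] = None,
                       detailed_timing: bool = False) -> Dict[str, Any]:
        """Compatibility shim for tests that call process_papers()."""
        results = self.run(papers=papers, count=count or len(papers),
                           detailed_timing=detailed_timing)
        # Make sure the keys the tests read are always present.
        results.setdefault('successful', 0)
        results.setdefault('failed', 0)
        return results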

Comment on lines +197 to +207
            output_files = write_outputs(
                out_dir=output_dir,
                prefix=output.get('prefix', 'arxiv_ids'),
                ids=ids,
                with_pdf=with_pdf,
                missing_pdf=missing_pdf,
                stats=stats,
                monthly=monthly,
                suffix=stamp,
                update_symlinks=output.get('update_symlinks', True)
            )

⚠️ Potential issue

Bug: write_outputs returns None; generate_list expects a dict and loses file paths.

This silently makes output_files None. Either change write_outputs to return paths or compute them here after calling it.

-            output_files = write_outputs(
-                out_dir=output_dir,
-                prefix=output.get('prefix', 'arxiv_ids'),
+            prefix = output.get('prefix', 'arxiv_ids')
+            update_symlinks = output.get('update_symlinks', True)
+            write_outputs(
+                out_dir=output_dir,
+                prefix=prefix,
                 ids=ids,
                 with_pdf=with_pdf,
                 missing_pdf=missing_pdf,
                 stats=stats,
                 monthly=monthly,
                 suffix=stamp,
-                update_symlinks=output.get('update_symlinks', True)
+                update_symlinks=update_symlinks,
             )
-            
-            logger.info("Output files written", files=output_files)
+            output_files = {
+                "master_ts": output_dir / f"{prefix}_sweep_{stamp}.txt",
+                "stats_ts": output_dir / f"{prefix}_sweep_stats_{stamp}.json",
+            }
+            if update_symlinks:
+                output_files["master"] = output_dir / f"{prefix}_sweep.txt"
+                output_files["stats"] = output_dir / f"{prefix}_stats.json"
+            if with_pdf:
+                output_files["with_pdfs_ts"] = output_dir / f"{prefix}_with_pdfs_{stamp}.txt"
+                if update_symlinks:
+                    output_files["with_pdfs"] = output_dir / f"{prefix}_with_pdfs.txt"
+            if missing_pdf:
+                output_files["missing_pdfs_ts"] = output_dir / f"{prefix}_missing_pdfs_{stamp}.txt"
+                if update_symlinks:
+                    output_files["missing_pdfs"] = output_dir / f"{prefix}_missing_pdfs.txt"
+
+            logger.info("Output files written", files={k: str(v) for k, v in output_files.items()})
📝 Committable suggestion


Suggested change (before → after)

Before:

            output_files = write_outputs(
                out_dir=output_dir,
                prefix=output.get('prefix', 'arxiv_ids'),
                ids=ids,
                with_pdf=with_pdf,
                missing_pdf=missing_pdf,
                stats=stats,
                monthly=monthly,
                suffix=stamp,
                update_symlinks=output.get('update_symlinks', True)
            )

After:

            # Determine prefix and symlink behavior up front
            prefix = output.get("prefix", "arxiv_ids")
            update_symlinks = output.get("update_symlinks", True)
            # Actually write the outputs (returns None)
            write_outputs(
                out_dir=output_dir,
                prefix=prefix,
                ids=ids,
                with_pdf=with_pdf,
                missing_pdf=missing_pdf,
                stats=stats,
                monthly=monthly,
                suffix=stamp,
                update_symlinks=update_symlinks,
            )
            # Manually reconstruct the paths that were written
            output_files = {
                "master_ts": output_dir / f"{prefix}_sweep_{stamp}.txt",
                "stats_ts": output_dir / f"{prefix}_sweep_stats_{stamp}.json",
            }
            if update_symlinks:
                output_files["master"] = output_dir / f"{prefix}_sweep.txt"
                output_files["stats"] = output_dir / f"{prefix}_stats.json"
            if with_pdf:
                output_files["with_pdfs_ts"] = output_dir / f"{prefix}_with_pdfs_{stamp}.txt"
                if update_symlinks:
                    output_files["with_pdfs"] = output_dir / f"{prefix}_with_pdfs.txt"
            if missing_pdf:
                output_files["missing_pdfs_ts"] = output_dir / f"{prefix}_missing_pdfs_{stamp}.txt"
                if update_symlinks:
                    output_files["missing_pdfs"] = output_dir / f"{prefix}_missing_pdfs.txt"
            logger.info("Output files written", files={k: str(v) for k, v in output_files.items()})

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
core/processors/chunking_strategies.py (2)

86-96: Preserve newlines in _clean_text; current normalization breaks paragraph-aware strategies

Collapsing all whitespace to single spaces removes newlines, which prevents SemanticChunking from detecting paragraphs/sentences and materially changes downstream behavior. Preserve newline characters while still collapsing spaces/tabs.

Apply this diff:

 def _clean_text(self, text: str) -> str:
-        """
-        Normalize input text for chunking by collapsing all whitespace to single spaces, removing null characters, and trimming leading/trailing spaces.
-        
-        This produces a compact, single-line-safe string suitable for downstream tokenization and chunk boundary calculations.
-        """
-        # Remove excessive whitespace
-        text = re.sub(r'\s+', ' ', text)
-        # Remove null characters
-        text = text.replace('\x00', '')
-        return text.strip()
+        """
+        Normalize input text for chunking by collapsing runs of spaces/tabs, preserving newlines,
+        removing null characters, and trimming edges. This keeps paragraph/sentence boundaries intact.
+        """
+        # Normalize line endings
+        text = text.replace('\r\n', '\n').replace('\r', '\n')
+        # Collapse spaces/tabs but preserve newlines
+        text = re.sub(r'[^\S\n]+', ' ', text)
+        # Remove null characters
+        text = text.replace('\x00', '')
+        # Collapse 3+ newlines to exactly two (paragraph boundary), and strip
+        text = re.sub(r'\n{3,}', '\n\n', text).strip()
+        return text

476-484: SlidingWindow remainder drops tail < step_size and duplicates tokens; compute tail from last window end

  • tail < step_size is currently discarded, losing content.
  • Remainder starts at len(chunks) * step_size, which duplicates tokens already covered by the last window; the uncovered tail actually begins at last_start + window_size.
  • Also avoid O(n^2) joins for start_char by precomputing prefix chars.

Apply this diff:

 def create_chunks(self, text: str, **kwargs) -> List[TextChunk]:
@@
         text = self._clean_text(text)
         tokens = text.split()  # Simple tokenization
+        # Precompute cumulative char positions (O(n)) for O(1) offset lookup
+        prefix_chars = [0]
+        total = 0
+        for idx, tok in enumerate(tokens):
+            if idx > 0:
+                total += 1  # space between tokens
+            total += len(tok)
+            prefix_chars.append(total)
@@
-            start_char = len(' '.join(tokens[:i]))
+            start_char = prefix_chars[i]
@@
-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
-                    
-                    chunk = TextChunk(
-                        text=chunk_text,
-                        start_char=start_char,
-                        end_char=start_char + len(chunk_text),
-                        chunk_index=len(chunks),
-                        metadata={
-                            'strategy': 'sliding_window_remainder',
-                            'token_count': len(remaining_tokens)
-                        }
-                    )
-                    chunks.append(chunk)
+        # Handle uncovered tail after the last full window (if any)
+        if chunks:
+            last_start = (len(chunks) - 1) * self.step_size
+            tail_start = last_start + self.window_size
+            if tail_start < len(tokens):
+                remaining_tokens = tokens[tail_start:]
+                chunk_text = ' '.join(remaining_tokens)
+                start_char = prefix_chars[tail_start]
+                chunk = TextChunk(
+                    text=chunk_text,
+                    start_char=start_char,
+                    end_char=start_char + len(chunk_text),
+                    chunk_index=len(chunks),
+                    metadata={
+                        'strategy': 'sliding_window_remainder',
+                        'token_count': len(remaining_tokens)
+                    }
+                )
+                chunks.append(chunk)

Also applies to: 501-521

♻️ Duplicate comments (2)
experiments/word2vec_evolution/experiment_config.yaml (1)

112-112: Fix: add a trailing newline at EOF to satisfy YAMLlint/CI.

Current error: “no new line character at the end of file.”

Apply:

-  - validation_report/  # Metrics and hypothesis testing results
+  - validation_report/  # Metrics and hypothesis testing results
+
core/processors/chunking_strategies.py (1)

119-123: Tokenizer object: make convert_tokens_to_string optional; fall back to ' '.join()

Requiring convert_tokens_to_string excludes many tokenizer implementations that only expose .tokenize and .decode. Accept .tokenize and use a safe fallback for reconstruction. This mirrors the earlier guidance on the tokenizer interface.

         Parameters:
@@
-            tokenizer: Optional tokenizer that can be either:
-                - A callable function that takes a string and returns a list of tokens: Callable[[str], List[str]]
-                - A tokenizer-like object with .tokenize(text) and .convert_tokens_to_string(tokens) methods
-                - None (uses simple whitespace split)
+            tokenizer: Optional tokenizer that can be either:
+                - A callable: (str) -> list[str]
+                - A tokenizer-like object with .tokenize(text)->list[str] and optionally
+                  .convert_tokens_to_string(tokens)->str (fallback is ' '.join(tokens))
+                - None (uses simple whitespace split)
@@
-        # Validate tokenizer interface if provided
+        # Validate tokenizer interface if provided
         if tokenizer is not None:
-            if not (callable(tokenizer) or 
-                    (hasattr(tokenizer, 'tokenize') and hasattr(tokenizer, 'convert_tokens_to_string'))):
+            if not (callable(tokenizer) or hasattr(tokenizer, 'tokenize')):
                 raise TypeError(
-                    "Tokenizer must be either a callable that returns List[str], "
-                    "or an object with .tokenize() and .convert_tokens_to_string() methods"
+                    "Tokenizer must be either a callable that returns list[str], "
+                    "or an object with .tokenize(str)->list[str]. "
+                    "convert_tokens_to_string(tokens)->str is optional."
                 )
@@
-            if self.tokenizer:
-                if callable(self.tokenizer):
-                    # For callable tokenizers, join tokens with space
-                    chunk_text = ' '.join(chunk_tokens)
-                else:
-                    # For tokenizer objects, use convert_tokens_to_string method
-                    chunk_text = self.tokenizer.convert_tokens_to_string(chunk_tokens)
+            if self.tokenizer:
+                if callable(self.tokenizer):
+                    chunk_text = ' '.join(chunk_tokens)
+                else:
+                    if hasattr(self.tokenizer, 'convert_tokens_to_string'):
+                        chunk_text = self.tokenizer.convert_tokens_to_string(chunk_tokens)
+                    else:
+                        chunk_text = ' '.join(chunk_tokens)
             else:
                 chunk_text = ' '.join(chunk_tokens)

Also applies to: 131-139, 186-193

🧹 Nitpick comments (6)
experiments/word2vec_evolution/experiment_config.yaml (3)

81-101: Stage ordering may be non-deterministic if a map is parsed unordered.

If the runner relies on order, prefer a list of stages with explicit names.

Minimal illustration:

-pipeline:
-  stages:
-    1_import:
-      - Download papers from ArXiv
-      - Clone GitHub repositories
-      - Extract text content from both sources
+pipeline:
+  stages:
+    - name: import
+      tasks:
+        - Download papers from ArXiv
+        - Clone GitHub repositories
+        - Extract text content from both sources

Confirm whether your pipeline preserves YAML map insertion order; if not, adopt the list form.


42-51: Optional: centralize thresholds with YAML anchors to reduce drift.

Keeps validation text and criteria consistent.

 analysis:
   # Entropy analysis
-  entropy_window_size: 100
-  entropy_threshold_low: 0.3   # Low entropy = high clarity/bridge point
-  entropy_threshold_high: 0.7  # High entropy = conceptual complexity
+  entropy_window_size: 100
+  entropy_thresholds: &entropy_thresholds
+    low: 0.3    # Low entropy = high clarity/bridge point
+    high: 0.7   # High entropy = conceptual complexity

   # Bridge detection
   bridge_criteria:
     semantic_similarity_threshold: 0.85
     conveyance_threshold: 0.9
     where_threshold: 0.8  # Proximity in graph
+
+hypotheses:
   h1:
     name: "Theory-Practice Bridge Detection"
     description: |
       Papers and their implementation code will show low entropy (high clarity)
       at specific bridge points where theoretical concepts map directly to
       implementation patterns.
     validation: |
-      Measure entropy at points of high semantic similarity between paper
-      and code. Expect entropy < 0.3 at true bridge points.
+      Measure entropy at points of high semantic similarity between paper
+      and code. Expect entropy < *entropy_thresholds.low at true bridge points.

Also applies to: 56-62, 69-71, 75-80


32-46: Add a seed for reproducibility

  • Under analysis, add random_seed: 42 to stabilize chunking and ordering.
  • Embedding dimensions (2048) already match the jina-v4 default.
core/processors/chunking_strategies.py (3)

174-176: TokenBased start_char is O(n^2); precompute prefix chars to O(1) per chunk

Repeated joins inside the loop are quadratic. Precompute a prefix array once.

         chunks = []
         stride = self.chunk_size - self.chunk_overlap
+        # Precompute cumulative char positions for O(1) start offsets
+        prefix_chars = [0]
+        total = 0
+        for idx, tok in enumerate(tokens):
+            if idx > 0:
+                total += 1  # space between tokens
+            total += len(tok)
+            prefix_chars.append(total)
@@
-            start_char = len(' '.join(tokens[:i]))
+            start_char = prefix_chars[i]

Also applies to: 196-197


17-17: Modernize typing per Ruff UP035/UP007: use built-in generics and | unions

Adopt built-in generics and import Callable from collections.abc. Improves readability and aligns with current typing best practices.

-from typing import List, Dict, Any, Optional, Tuple, Callable, Union
+from collections.abc import Callable
+from typing import Any
@@
-        tokenizer: Optional[Union[Callable[[str], List[str]], Any]] = None
+        tokenizer: Callable[[str], list[str]] | Any | None = None

Follow-up (optional): Convert other annotations to built-ins, e.g., List[TextChunk] -> list[TextChunk], Dict[str, Any] -> dict[str, Any].

Also applies to: 111-111


85-96: Offer: add targeted tests to pin behavior

I can add tests to prevent regressions:

  • _clean_text preserves newlines and paragraph boundaries (round-trips).
  • SlidingWindow covers all tokens including tails < step_size; no gaps.
  • Semantic forced split yields strictly increasing, unique chunk_index values.

Do you want me to open a follow-up PR with these tests? A minimal sketch of the first check appears below.

Also applies to: 464-524, 368-399
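
A minimal pytest sketch of the first check, assuming the newline-preserving _clean_text proposed above is adopted and that SemanticChunking can be constructed with default arguments (both are assumptions):

# Hypothetical test; the import path and default constructor are assumptions
# based on the file under review (core/processors/chunking_strategies.py).
from core.processors.chunking_strategies import SemanticChunking


def test_clean_text_preserves_paragraph_boundaries():
    strategy = SemanticChunking()
    raw = "First  paragraph\twith   extra spaces.\r\n\r\n\r\nSecond paragraph."
    cleaned = strategy._clean_text(raw)
    # Inline whitespace is collapsed, but the paragraph break survives.
    assert "\n\n" in cleaned
    assert "\x00" not in cleaned
    assert cleaned.startswith("First paragraph with extra spaces.")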

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 8736363 and 80596db.

📒 Files selected for processing (2)
  • core/processors/chunking_strategies.py (12 hunks)
  • experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer absolute imports following the shown module structure (e.g., core.framework., tools.arxiv.pipelines.)

Files:

  • core/processors/chunking_strategies.py
🧬 Code graph analysis (1)
core/processors/chunking_strategies.py (1)
experiments/word2vec_evolution/generate_embeddings.py (1)
  • chunk_text (30-57)
🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml

[error] 112-112: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.12.2)
core/processors/chunking_strategies.py

17-17: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


17-17: typing.List is deprecated, use list instead

(UP035)


17-17: typing.Dict is deprecated, use dict instead

(UP035)


17-17: typing.Tuple is deprecated, use tuple instead

(UP035)


111-111: Use X | Y for type annotations

Convert to X | Y

(UP007)

🔇 Additional comments (3)
experiments/word2vec_evolution/experiment_config.yaml (2)

1-112: Well-structured experiment config.

Clear sections, sensible defaults, and readable comments. Nice.


13-31: Sanity check passed: arXiv IDs and GitHub repo URLs resolve successfully.

core/processors/chunking_strategies.py (1)

41-46: Docstring and API clarifications: LGTM

Clearer parameter semantics, return types, and behavior descriptions improve the API surface without changing runtime behavior.

Also applies to: 51-57, 72-82, 114-127, 147-159, 241-248, 256-269, 344-348, 355-362, 402-417, 449-457, 466-475, 540-554

Comment on lines +368 to +379
"""
Force-split an oversized text into consecutive TextChunk objects of at most `self.max_chunk_size` tokens.
The text is split on whitespace into words and rejoined into chunks of up to `max_chunk_size` words. `start_char` is used as the initial character offset for the first chunk; subsequent chunks' `start_char`/`end_char` are computed by advancing the offset by the length of the produced chunk text plus a single separating space. Character positions are therefore approximate and do not preserve original spacing or punctuation.
Parameters:
text (str): The input text to split.
start_char (int): Character index in the original document corresponding to the start of `text`.
Returns:
List[TextChunk]: A list of TextChunk objects with metadata containing `strategy='semantic_forced_split'` and `token_count`. Chunk indices are zero-based and reflect the order of the produced chunks.
"""

⚠️ Potential issue

chunk_index resets inside _force_split_text; leads to duplicate/incorrect indices

Chunks produced here start from 0 regardless of prior chunks, causing duplicate indices in the final list. Provide a base_index to offset numbering.

-def _force_split_text(self, text: str, start_char: int) -> List[TextChunk]:
+def _force_split_text(self, text: str, start_char: int, base_index: int = 0) -> List[TextChunk]:
@@
-        for i in range(0, len(words), self.max_chunk_size):
+        for i in range(0, len(words), self.max_chunk_size):
             chunk_words = words[i:i + self.max_chunk_size]
             chunk_text = ' '.join(chunk_words)
@@
-            chunks.append(TextChunk(
+            chunks.append(TextChunk(
                 text=chunk_text,
                 start_char=start_char,
                 end_char=start_char + len(chunk_text),
-                chunk_index=len(chunks),
+                chunk_index=base_index + len(chunks),
                 metadata={
                     'strategy': 'semantic_forced_split',
                     'token_count': len(chunk_words)
                 }
             ))

Also update the call site to pass the current offset (in create_chunks):

-                    chunks.extend(self._force_split_text(para, current_chunk_start))
+                    chunks.extend(self._force_split_text(para, current_chunk_start, base_index=len(chunks)))

Also applies to: 383-399

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
experiments/word2vec_evolution/process_github_repos.py (1)

100-127: Good: now checking returncode; add a timeout to avoid hangs

You implemented the earlier suggestion to check return codes. Add a reasonable timeout so setup can’t hang the run.

-        result = subprocess.run(
+        result = subprocess.run(
             [sys.executable, str(setup_script)],
             capture_output=True,
             text=True,
-            cwd=str(setup_script.parent)
+            cwd=str(setup_script.parent),
+            timeout=120
         )
🧹 Nitpick comments (4)
experiments/word2vec_evolution/process_github_repos.py (4)

22-27: Fix E402 import placement and avoid sys.path hacks

Move the GitHubPipelineManager import into main() (after sys.path manipulation) and loosen the type in process_repository to avoid top-level import ordering issues. This removes the Ruff E402 and aligns with absolute-import guidance without polluting module scope.

Apply:

@@
-# Add parent to path for imports
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-
-# Import GitHub pipeline manager for direct use
-from tools.github.github_pipeline_manager import GitHubPipelineManager
+# Add project root to path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent.parent))
@@
-def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
+from typing import Any
+
+def process_repository(repo_url: str, name: str, manager: Any):

And inside main(), just before constructing manager:

@@
-    # Create manager instance
+    # Create manager instance
+    from tools.github.github_pipeline_manager import GitHubPipelineManager  # local to satisfy E402

128-145: Guard against missing keys in config to avoid KeyError

If github_simple.yaml lacks expected sections, access like config['arango'] will KeyError. Add get() checks with clear errors.

-    with open(config_path) as f:
-        config = yaml.safe_load(f)
+    with open(config_path) as f:
+        config = yaml.safe_load(f) or {}
@@
-    config['arango']['password'] = arango_password
+    if 'arango' not in config:
+        logger.error("Config missing 'arango' section")
+        sys.exit(1)
+    config['arango']['password'] = arango_password
+    if 'processing' not in config or 'github' not in config['processing']:
+        logger.error("Config missing 'processing.github' section")
+        sys.exit(1)

159-172: Persist richer per-repo metrics in results.json

Capture totals when available. Helps downstream analysis.

-        success = process_repository(repo_info['repo'], repo_info['name'], manager)
+        success = process_repository(repo_info['repo'], repo_info['name'], manager)
+        # Optionally, pull lightweight stats from manager cache/DB in future
         results.append({
             'repo': repo_info['repo'],
             'name': repo_info['name'],
             'paper': repo_info['paper'],
             'success': success,
             'timestamp': datetime.now().isoformat()
         })

201-209: Use sys.exit and add trailing newline

Small polish: prefer sys.exit for scripts; also fix missing newline at EOF (Ruff W292).

-if __name__ == "__main__":
-    exit(main())
+if __name__ == "__main__":
+    sys.exit(main())
+
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 80596db and 2af1332.

📒 Files selected for processing (1)
  • experiments/word2vec_evolution/process_github_repos.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer absolute imports following the shown module structure (e.g., core.framework., tools.arxiv.pipelines.)

Files:

  • experiments/word2vec_evolution/process_github_repos.py
🧬 Code graph analysis (1)
experiments/word2vec_evolution/process_github_repos.py (1)
tools/github/github_pipeline_manager.py (3)
  • GitHubPipelineManager (23-316)
  • process_repository (53-87)
  • main (319-366)
🪛 Ruff (0.12.2)
experiments/word2vec_evolution/process_github_repos.py

26-26: Module level import not at top of file

(E402)


209-209: No newline at end of file

Add trailing newline

(W292)

🔇 Additional comments (1)
experiments/word2vec_evolution/process_github_repos.py (1)

68-89: Repo URL handling already supported: GitHubDocumentManager._parse_repo_url explicitly accepts the shorthand "owner/repo", and process_repository passes that format end-to-end, so no changes are needed.

Comment on lines +28 to +36
def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.
    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """

⚠️ Potential issue

Normalize repo URL format to prevent processing failures

GitHubPipelineManager examples expect a full https URL, but the docstring and callers pass owner/repo. Normalize to support both to avoid parse failures at runtime.

 def process_repository(repo_url: str, name: str, manager: Any):
@@
-    logger.info(f"Processing {name} ({repo_url})...")
+    # Accept 'owner/repo' or full URL and normalize
+    repo_url_norm = repo_url if repo_url.startswith("http") else f"https://github.com/{repo_url}"
+    logger.info(f"Processing {name} ({repo_url_norm})...")
📝 Committable suggestion


Suggested change (before → after)

Before:

def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.
    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """

After:

def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.
    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """
    # Accept 'owner/repo' or full URL and normalize
    repo_url_norm = repo_url if repo_url.startswith("http") else f"https://github.com/{repo_url}"
    logger.info(f"Processing {name} ({repo_url_norm})...")
    # …rest of function unchanged…

Comment on lines +39 to +61
    try:
        # Process the repository directly using the manager
        results = manager.process_repository(repo_url)

        # Check results
        if results and 'stored' in results:
            stored_count = results.get('stored', 0)
            logger.info(f" ✓ {name} processed successfully")
            logger.info(f" Stored {stored_count} embeddings")

            # Log additional stats if available
            if 'repository' in results:
                repo_info = results['repository']
                logger.info(f" Repository: {repo_info.get('full_name', repo_url)}")
                if 'stats' in repo_info:
                    stats = repo_info['stats']
                    logger.info(f" Files: {stats.get('file_count', 'N/A')}")
                    logger.info(f" Chunks: {stats.get('chunk_count', 'N/A')}")

            return True
        else:
            logger.error(f" ✗ {name} processing returned no results")
            return False

⚠️ Potential issue

Don’t gate success on a non-guaranteed 'stored' key; log robust metrics instead

GitHubPipelineManager.main() prints total_processed/total_functions; 'stored' may not exist, leading to false negatives. Consider success when process_repository returns without exception; log known fields when present.

-        # Process the repository directly using the manager
-        results = manager.process_repository(repo_url)
+        # Process the repository directly using the manager
+        results = manager.process_repository(repo_url_norm)
@@
-        # Check results
-        if results and 'stored' in results:
-            stored_count = results.get('stored', 0)
-            logger.info(f"  ✓ {name} processed successfully")
-            logger.info(f"    Stored {stored_count} embeddings")
+        # Consider success if a result dict is returned; log available metrics
+        if isinstance(results, dict):
+            file_count = results.get('total_processed')
+            func_count = results.get('total_functions')
+            emb_count = results.get('embeddings_stored', results.get('stored'))
+            logger.info(f"  ✓ {name} processed successfully")
+            if file_count is not None or func_count is not None or emb_count is not None:
+                logger.info(
+                    "    " +
+                    ", ".join(
+                        s for s in [
+                            f"Files={file_count}" if file_count is not None else None,
+                            f"Functions={func_count}" if func_count is not None else None,
+                            f"Embeddings={emb_count}" if emb_count is not None else None,
+                        ] if s
+                    )
+                )
@@
-            logger.error(f"  ✗ {name} processing returned no results")
+            logger.error(f"  ✗ {name} processing returned no results")
             return False

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In experiments/word2vec_evolution/process_github_repos.py around lines 39 to 61,
the current logic treats the absence of a non-guaranteed 'stored' key as a
failure; change it to consider the repository processed successfully if
process_repository returns without raising an exception (i.e., results is not
None/False), and only use 'stored' (and other optional fields like 'repository',
'stats', 'file_count', 'chunk_count') for logging when present; specifically,
replace the strict "if results and 'stored' in results" gate with a check for a
truthy results object, log stored_count if available (fallback to 'N/A' or 0),
log repository and stats when present, and return True for successful
non-exceptional returns while still returning False when results is None/False.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
experiments/word2vec_evolution/experiment_config.yaml (1)

113-113: Fix trailing spaces and add newline at EOF to satisfy yamllint.

Line 113 has stray spaces and the file lacks a final newline, which will keep yamllint failing.

Apply:

-  
+
🧹 Nitpick comments (1)
experiments/word2vec_evolution/experiment_config.yaml (1)

47-51: Clarify naming: consider proximity_threshold over where_threshold.

Improves readability; the intent is proximity in the graph.

-    where_threshold: 0.8  # Proximity in graph
+    proximity_threshold: 0.8  # Proximity in graph
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 2af1332 and 4d99984.

📒 Files selected for processing (1)
  • experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml

[error] 113-113: no new line character at the end of file

(new-line-at-end-of-file)


[error] 113-113: trailing spaces

(trailing-spaces)

🔇 Additional comments (2)
experiments/word2vec_evolution/experiment_config.yaml (2)

81-107: Verify loader supports list-of-strings under pipeline.stages
The experiment runner uses a plain yaml.safe_load (no Pydantic or schema validation) to read pipeline.stages; confirm the code that iterates over config['pipeline']['stages'] treats each stage value as a List[str] of step names rather than expecting objects with id/params.
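
An illustrative spot check of that assumption, using only yaml.safe_load against the config as currently written (stage names mapping to plain lists of step strings):

# Quick check that each stage value is a list of step-name strings,
# matching the map-of-lists layout currently in the config file.
import yaml

with open("experiments/word2vec_evolution/experiment_config.yaml") as f:
    config = yaml.safe_load(f)

for stage_name, steps in config["pipeline"]["stages"].items():
    assert isinstance(steps, list), f"{stage_name}: expected a list of steps"
    assert all(isinstance(step, str) for step in steps), stage_name
    print(f"{stage_name}: {len(steps)} steps")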


34-36: Embedding model/dimension consistency confirmed. The “jina-v4” model outputs 2048-dimensional embeddings per the Jina Embeddings API, and all code paths (defaults in DocumentProcessor, tests, and pipelines) consistently use embedding_model “jina-v4” with embedding_dimensions 2048.

Comment on lines +14 to +31
  word2vec:
    arxiv_id: "1301.3781"
    title: "Efficient Estimation of Word Representations in Vector Space"
    year: 2013
    github_repo: "https://github.com/dav/word2vec"

  doc2vec:
    arxiv_id: "1405.4053"
    title: "Distributed Representations of Sentences and Documents"
    year: 2014
    github_repo: "https://github.com/bnosac/doc2vec"

  code2vec:
    arxiv_id: "1803.09473"
    title: "code2vec: Learning Distributed Representations of Code"
    year: 2018
    github_repo: "https://github.com/tech-srl/code2vec"


💡 Verification agent

🧩 Analysis chain

Sanity-check paper → repo mappings.

Make sure these GitHub repos are the intended canonical implementations for the cited papers to avoid skewed bridge analyses.


🌐 Web query:

Are these canonical GitHub repos for the respective papers?
- 1301.3781 word2vec → github.com/dav/word2vec
- 1405.4053 doc2vec → github.com/bnosac/doc2vec
- 1803.09473 code2vec → github.com/tech-srl/code2vec

💡 Result:

Short answers:

  • 1301.3781 (word2vec): The canonical/original code is the Google/ Tomas Mikolov export (tmikolov/word2vec / code.google.com archive). dav/word2vec is a community mirror/clone, not the original canonical repo. [1][2]

  • 1405.4053 (doc2vec / Paragraph Vector): bnosac/doc2vec is a third‑party R package/wrapper (not the original paper authors' canonical code). The paper (Le & Mikolov) does not have a single official GitHub repo from the authors; common implementations live in Gensim and other third‑party repos (and some C/C++/wrappers such as hiyijian/doc2vec). [3][4]

  • 1803.09473 (code2vec): github.com/tech-srl/code2vec is the authors’ official repository referenced by the paper (official). [5]

Sources:
[1] tmikolov/word2vec (export of original Google code).
[2] dav/word2vec (mirror; README notes “Original from https://code.google.com/p/word2vec/”).
[3] ArXiv: “Distributed Representations of Sentences and Documents” (Le & Mikolov).
[4] bnosac/doc2vec (R wrapper / third‑party implementation; points to other impls).
[5] ArXiv + tech-srl/code2vec (paper links repo).


Use canonical implementations for word2vec and doc2vec; the code2vec mapping already points to the authors' official repository.

🤖 Prompt for AI Agents
experiments/word2vec_evolution/experiment_config.yaml lines 14-31: the YAML
points to non-canonical repos for word2vec and doc2vec; update word2vec
github_repo to the official upstream https://github.com/tmikolov/word2vec and
replace doc2vec github_repo with the canonical Gensim implementation
https://github.com/RaRe-Technologies/gensim (leave code2vec mapping as-is).

r3d91ll and others added 2 commits September 3, 2025 02:27
- Fix lexicographic YYMM filtering bug with proper century handling in overnight test scripts
- Update ArangoDBManager import paths to use core.database module
- Remove incorrect await from synchronous insert_document calls
- Fix tokenizer interface to support both callable and object-based tokenizers
- Fix validate_arxiv_id bug that was stripping all digits
- Clean up YAML formatting in experiment config

These changes resolve critical bugs identified in the code review.
fix: Address CodeRabbit review comments from PR #17
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
tests/test_overnight_arxiv_2000.py (1)

418-425: Report the actual number of papers processed.

set this to len(papers), not the requested num_docs.

-        'total_papers_processed': num_docs,
+        'total_papers_processed': len(papers),
core/processors/chunking_strategies.py (2)

85-95: Fix: _clean_text collapses newlines, breaking paragraph/sentence-aware strategies

Collapsing all whitespace to a single space erases newlines, so SemanticChunking can no longer detect paragraphs/sentences after calling _clean_text. Preserve newlines while normalizing other whitespace.

 def _clean_text(self, text: str) -> str:
-        # Remove excessive whitespace
-        text = re.sub(r'\s+', ' ', text)
-        # Remove null characters
-        text = text.replace('\x00', '')
-        return text.strip()
+        # Normalize line endings and whitespace while preserving paragraph breaks
+        text = text.replace('\r\n', '\n').replace('\r', '\n')
+        # Collapse inline whitespace but keep newlines
+        text = re.sub(r'[ \t\f\v]+', ' ', text)
+        # Collapse 3+ consecutive newlines to 2 to represent paragraph breaks
+        text = re.sub(r'\n{3,}', '\n\n', text)
+        # Remove null characters
+        text = text.replace('\x00', '')
+        return text.strip()

503-523: SlidingWindow remainder overlaps last window; compute correct remainder start

Remainder should begin at last_start + window_size, not len(chunks) * step_size.

-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
+        # Handle remainder: tokens after the last full window
+        if len(tokens) > self.window_size and chunks:
+            last_start = (len(chunks) - 1) * self.step_size
+            remainder_start = last_start + self.window_size
+            if remainder_start < len(tokens):
+                remaining_tokens = tokens[remainder_start:]
+                if len(remaining_tokens) >= self.step_size:  # Only if substantial
+                    chunk_text = ' '.join(remaining_tokens)
+                    start_char = len(' '.join(tokens[:remainder_start]))
core/processors/generic_document_processor.py (2)

341-357: Handle extractor returning no content (None/empty) to avoid downstream NoneType errors

WORKER_DOCLING.extract may return None; _embed_and_store expects a dict. Short-circuit with a failure result instead of staging an invalid document.

         # Extract PDF
         extracted = WORKER_DOCLING.extract(task.pdf_path)
+        if not extracted:
+            return {
+                'success': False,
+                'document_id': task.document_id,
+                'error': 'Extractor returned no content'
+            }

429-433: Ensure overridden collections exist (e.g., github_*), or inserts will fail

ArangoDBManager.ensure_collections creates only arxiv* by default. When overriding collections, create them up front.

     # Initialize DB manager with collections
     WORKER_DB_MANAGER = ArangoDBManager(arango_config)
     WORKER_DB_MANAGER.collections = collections  # Override with source-specific collections
+    # Ensure overridden collections exist
+    for coll in collections.values():
+        if not WORKER_DB_MANAGER.db.has_collection(coll):
+            WORKER_DB_MANAGER.db.create_collection(coll)
tools/arxiv/arxiv_manager.py (2)

442-455: Split chunk storage: separate arxiv_chunks from arxiv_embeddings and handle missing embeddings.

Aligns with the intended schema; also avoids failures when embeddings are absent.

-            # Store chunks with embeddings
-            for chunk in result.chunks:
-                chunk_doc = {
-                    '_key': f"{paper_info.sanitized_id}_chunk_{chunk.chunk_index}",
-                    'paper_id': paper_info.sanitized_id,
-                    'arxiv_id': paper_info.arxiv_id,
-                    'chunk_index': chunk.chunk_index,
-                    'text': chunk.text,
-                    'embedding': chunk.embedding.tolist(),
-                    'start_char': chunk.start_char,
-                    'end_char': chunk.end_char,
-                    'context_window_used': chunk.context_window_used
-                }
-                self.db_manager.insert_document('arxiv_embeddings', chunk_doc)
+            # Store chunks and embeddings (separate collections)
+            for chunk in result.chunks:
+                chunk_key = f"{paper_info.sanitized_id}_chunk_{chunk.chunk_index}"
+                chunk_only_doc = {
+                    '_key': chunk_key,
+                    'paper_id': paper_info.sanitized_id,
+                    'arxiv_id': paper_info.arxiv_id,
+                    'chunk_index': chunk.chunk_index,
+                    'text': chunk.text,
+                    'start_char': chunk.start_char,
+                    'end_char': chunk.end_char,
+                    'context_window_used': getattr(chunk, 'context_window_used', None),
+                }
+                self.db_manager.insert_document('arxiv_chunks', chunk_only_doc)
+
+                embedding_doc = {
+                    '_key': f"{chunk_key}_emb",
+                    'paper_id': paper_info.sanitized_id,
+                    'arxiv_id': paper_info.arxiv_id,
+                    'chunk_id': chunk_key,
+                    'chunk_index': chunk.chunk_index,
+                    'embedding': (chunk.embedding.tolist() if getattr(chunk, 'embedding', None) is not None else None),
+                }
+                self.db_manager.insert_document('arxiv_embeddings', embedding_doc)

1-490: Fix linting and formatting issues in tools/arxiv/arxiv_manager.py
Ruff reports 107 issues (72 auto-fixable) across these categories:

  • Import organization: unsorted imports and unused imports (os, asyncio, typing.Union, dataclasses.asdict) (I001, F401)
  • Type annotations: switch from deprecated typing.Dict/List/Tuple to built-ins (dict, list, tuple) and use X | None instead of Optional[X] (UP035, UP006, UP045)
  • Whitespace cleanup: remove blank lines containing spaces (W293) and add trailing newline at EOF (W292)
  • Dead code: remove unused locals (base_id, category) (F841)
  • Miscellaneous: remove the unnecessary open() mode argument (UP015) and use the datetime.UTC alias instead of timezone.utc when calling datetime.now() (UP017)

Apply ruff check --fix (add --unsafe-fixes for type-annotation changes) and Black formatting to resolve these issues.

♻️ Duplicate comments (10)
experiments/word2vec_evolution/experiment_config.yaml (2)

14-31: Use canonical repos for word2vec/doc2vec to avoid skewed mappings.

Swap to the upstream export for word2vec and a canonical implementation for doc2vec (Gensim). Code2vec entry is fine as-is.

   word2vec:
     arxiv_id: "1301.3781"
     title: "Efficient Estimation of Word Representations in Vector Space"
     year: 2013
-    github_repo: "https://github.com/dav/word2vec"
+    github_repo: "https://github.com/tmikolov/word2vec"

   doc2vec:
     arxiv_id: "1405.4053"
     title: "Distributed Representations of Sentences and Documents"
     year: 2014
-    github_repo: "https://github.com/bnosac/doc2vec"
+    github_repo: "https://github.com/RaRe-Technologies/gensim"

112-112: Fix YAML lint error: add a trailing newline at EOF.

CI is failing with "no new line character at the end of file". Add a single newline after Line 112.

   - validation_report/  # Metrics and hypothesis testing results
+
core/processors/chunking_strategies.py (1)

316-316: Fix duplicate chunk_index when force-splitting large paragraphs

_force_split_text starts indices at 0, causing duplicates when its chunks are extended into the main list. Offset indices by the current length of the outer list.

-                    chunks.extend(self._force_split_text(para, current_chunk_start))
+                    chunks.extend(self._force_split_text(para, current_chunk_start, base_index=len(chunks)))
-    def _force_split_text(self, text: str, start_char: int) -> List[TextChunk]:
+    def _force_split_text(self, text: str, start_char: int, base_index: int = 0) -> List[TextChunk]:
@@
-            chunks.append(TextChunk(
+            chunks.append(TextChunk(
                 text=chunk_text,
                 start_char=start_char,
                 end_char=start_char + len(chunk_text),
-                chunk_index=len(chunks),
+                chunk_index=base_index + len(chunks),
                 metadata={
                     'strategy': 'semantic_forced_split',
                     'token_count': len(chunk_words)
                 }
             ))

Also applies to: 369-401

tests/test_c_files.py (2)

133-145: Harden AQL against missing fields

Avoid errors when symbols/metrics are absent.

     query = """
     FOR doc IN github_papers
         FILTER doc.language == "c"
         LIMIT 5
         RETURN {
             id: doc.document_id,
             language: doc.language,
-            functions: LENGTH(doc.symbols.functions || []),
-            structs: LENGTH(doc.symbols.structs || []),
-            includes: LENGTH(doc.symbols.includes || []),
-            complexity: doc.code_metrics.complexity
+            functions: LENGTH(COALESCE(doc.symbols.functions, [])),
+            structs: LENGTH(COALESCE(doc.symbols.structs, [])),
+            includes: LENGTH(COALESCE(doc.symbols.includes, [])),
+            complexity: COALESCE(doc.code_metrics.complexity, 0)
         }
     """

124-126: Fix: incorrect result keys when logging processor summary

Use nested extraction/embedding success lists from GenericDocumentProcessor.

-    logger.info(f"Processed: {results.get('total_processed', 0)}")
-    logger.info(f"Extraction success: {results.get('extraction_success', 0)}")
-    logger.info(f"Embedding success: {results.get('embedding_success', 0)}")
+    logger.info(f"Processed: {results.get('total_processed', 0)}")
+    extraction_ok = len(results.get('extraction', {}).get('success', []))
+    embedding_ok = len(results.get('embedding', {}).get('success', []))
+    logger.info(f"Extraction success: {extraction_ok}")
+    logger.info(f"Embedding success: {embedding_ok}")
tests/test_github_treesitter.py (3)

108-115: Use localhost for ARANGO_HOST default, not a private IP.

Avoid environment-specific private IPs in test defaults.

-            'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
+            'host': os.getenv('ARANGO_HOST', 'http://localhost:8529'),

160-194: Guard ArangoDB query and handle connection errors.

Skip DB verification when host/password are not set and catch connectivity errors to keep tests resilient.

-    # Check if Tree-sitter metadata was stored
-    if results.get('embedding_success', 0) > 0:
-        logger.info("\nChecking stored metadata in ArangoDB...")
-        from core.database.arango_db_manager import ArangoDBManager
-        
-        db_manager = ArangoDBManager(config['arango'])
-        
-        # Query for documents with Tree-sitter data
-        query = """
+    # Check if Tree-sitter metadata was stored (only when DB access is configured)
+    extraction = results.get('extraction', {})
+    embedding = results.get('embedding', {})
+    can_query_db = bool(config['arango'].get('host') and config['arango'].get('password'))
+    if can_query_db and len(embedding.get('success', [])) > 0:
+        logger.info("\nChecking stored metadata in ArangoDB...")
+        from core.database.arango_db_manager import ArangoDBManager
+        try:
+            db_manager = ArangoDBManager(config['arango'])
+            # Query for documents with Tree-sitter data
+            query = """
         FOR doc IN github_papers
             FILTER doc.has_tree_sitter == true
             LIMIT 5
             RETURN {
                 id: doc.document_id,
                 language: doc.language,
                 has_symbols: doc.has_tree_sitter,
                 function_count: LENGTH(doc.symbols.functions),
                 class_count: LENGTH(doc.symbols.classes),
                 complexity: doc.code_metrics.complexity
             }
         """
-        
-        cursor = db_manager.db.aql.execute(query)
-        docs_with_symbols = list(cursor)
-        
-        if docs_with_symbols:
-            logger.info(f"Found {len(docs_with_symbols)} documents with Tree-sitter symbols:")
-            for doc in docs_with_symbols:
-                logger.info(f"  - {doc['id']}: {doc['language']} "
-                          f"({doc['function_count']} functions, "
-                          f"{doc['class_count']} classes, "
-                          f"complexity: {doc['complexity']})")
-        else:
-            logger.info("No documents with Tree-sitter symbols found in database")
+            cursor = db_manager.db.aql.execute(query)
+            docs_with_symbols = list(cursor)
+            if docs_with_symbols:
+                logger.info(f"Found {len(docs_with_symbols)} documents with Tree-sitter symbols:")
+                for doc in docs_with_symbols:
+                    logger.info(f"  - {doc['id']}: {doc['language']} "
+                                f"({doc['function_count']} functions, "
+                                f"{doc['class_count']} classes, "
+                                f"complexity: {doc['complexity']})")
+            else:
+                logger.info("No documents with Tree-sitter symbols found in database")
+        except Exception as e:
+            logger.warning(f"Skipping DB check due to connection error: {e}")
+    else:
+        logger.info("Skipping ArangoDB verification (ARANGO_HOST/ARANGO_PASSWORD not set or no embeddings).")

56-66: Fix: metadata['type'] check skips all files; process tasks directly

GitHubDocumentManager tasks don’t set metadata['type']; this guard prevents any processing.

-    # Process first 3 code files
+    # Process up to 3 files
     code_count = 0
-    for task in tasks[:10]:
-        if task.metadata.get('type') == 'code':
+    for task in tasks[:10]:
         logger.info(f"\n{'='*60}")
         logger.info(f"Processing: {task.document_id}")
         logger.info(f"File: {task.pdf_path}")
         
         # Extract with Tree-sitter
         result = code_extractor.extract(task.pdf_path)
         
-        if result and result.get('symbols'):
+        if result and result.get('symbols'):
             logger.info(f"Language: {result.get('language')}")
             
             # Display symbol counts
             symbols = result['symbols']
             logger.info("Symbol Table:")
             for category, items in symbols.items():
                 if items:
                     logger.info(f"  {category}: {len(items)} items")
                     # Show first 3 items
                     for item in items[:3]:
                         if isinstance(item, dict):
                             name = item.get('name', item.get('statement', str(item)))
                             line = item.get('line', '?')
                             logger.info(f"    - {name} (line {line})")
             
             # Display metrics
             if result.get('code_metrics'):
                 metrics = result['code_metrics']
                 logger.info("Code Metrics:")
                 logger.info(f"  Lines of code: {metrics.get('lines_of_code', 0)}")
                 logger.info(f"  Complexity: {metrics.get('complexity', 0)}")
                 logger.info(f"  Max depth: {metrics.get('max_depth', 0)}")
                 logger.info(f"  Node count: {metrics.get('node_count', 0)}")
             
             # Display symbol hash
             if result.get('symbol_hash'):
                 logger.info(f"Symbol hash: {result['symbol_hash'][:16]}...")
             
             code_count += 1
             if code_count >= 3:
                 break
-        else:
-            logger.info("No Tree-sitter symbols extracted (might not be a supported language)")
+        else:
+            logger.info("No Tree-sitter symbols extracted (might not be a supported language)")
tools/arxiv/arxiv_manager.py (2)

101-106: Remove unused base_id in validate_arxiv_id (Ruff F841) and align comment.

The regex already accepts optional version suffix; no need to pre-strip. Also avoids the unused variable.

-        # Remove any version suffix for validation
-        base_id = re.sub(r'v\d+$', '', arxiv_id)
+        # Version suffix allowed by regex; no pre-strip needed

143-149: Remove unused local category in get_pdf_path (Ruff F841).

Variable is assigned but not used.

-            category = match.group(1)
-            number = match.group(2)
+            number = match.group(2)
             yymm = number[:4]
             return cls.PDF_BASE_PATH / yymm / f"{base_id.replace('/', '_')}.pdf"
🧹 Nitpick comments (44)
experiments/word2vec_evolution/experiment_config.yaml (4)

38-40: Clarify chunk_overlap units (tokens vs characters vs bytes).

Label the unit to prevent misconfiguration across components.

-  chunk_size: 512
-  chunk_overlap: 50
+  chunk_size: 512        # tokens
+  chunk_overlap: 50      # tokens

47-51: Make bridge thresholds tunable to reduce “no results” risk.

0.85/0.9 thresholds are stringent; expose overrides via CLI/env to ease calibration.

Example (for your runner to read as optional overrides):

 analysis:
   # Bridge detection
   bridge_criteria:
     semantic_similarity_threshold: 0.85
     conveyance_threshold: 0.9
     where_threshold: 0.8  # Proximity in graph
+  params_overrides_env:
+    semantic_similarity_threshold: BRIDGE_SIM_THR
+    conveyance_threshold: BRIDGE_CONVEY_THR
+    where_threshold: BRIDGE_WHERE_THR

81-92: Surface storage connection settings referenced in pipeline.

You store embeddings in ArangoDB here but the config lacks connection hints. Add env‑backed settings (no secrets).

 analysis:
   embedding_model: "jina-v4"
   embedding_dimensions: 2048
+connections:
+  arangodb:
+    url_env: ARANGODB_URL
+    db_env: ARANGODB_DB
+    user_env: ARANGODB_USER
+    password_env: ARANGODB_PASSWORD
+  postgres:
+    dsn_env: POSTGRES_DSN
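If the runner adopts these keys, a small resolver could map them to live credentials at run time. This is a sketch only; `resolve_connections` and the key names are assumptions, not existing project API:

```python
# Sketch: resolve *_env keys from the config into a connection dict at runtime,
# so the YAML itself never holds secrets. Names here are illustrative assumptions.
import os


def resolve_connections(cfg: dict) -> dict:
    arango = cfg.get("connections", {}).get("arangodb", {})
    return {
        "url": os.getenv(arango.get("url_env", "ARANGODB_URL"), ""),
        "db": os.getenv(arango.get("db_env", "ARANGODB_DB"), ""),
        "user": os.getenv(arango.get("user_env", "ARANGODB_USER"), ""),
        "password": os.getenv(arango.get("password_env", "ARANGODB_PASSWORD"), ""),
    }
```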

15-30: Add explicit arXiv URLs for convenience.

Keep ids, but URLs help tooling and reviewers.

   word2vec:
     arxiv_id: "1301.3781"
+    arxiv_url: "https://arxiv.org/abs/1301.3781"
   ...
   doc2vec:
     arxiv_id: "1405.4053"
+    arxiv_url: "https://arxiv.org/abs/1405.4053"
   ...
   code2vec:
     arxiv_id: "1803.09473"
+    arxiv_url: "https://arxiv.org/abs/1803.09473"
tools/arxiv/arxiv_document_manager.py (7)

19-22: Avoid sys.path hacks; rely on package-absolute imports per guidelines.

This is brittle and bypasses proper packaging. The coding guidelines request absolute imports without path mutation.

-import sys
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
 from core.processors.generic_document_processor import DocumentTask

34-49: Harden base-dir validation (check directory, not just existence).

If a file exists at the path, current logic passes. Validate it’s an actual directory.

-        if not self.pdf_base_dir.exists():
+        if not self.pdf_base_dir.is_dir():
             raise ValueError(f"PDF base directory does not exist: {pdf_base_dir}")

64-71: Reduce duplicate logging on failed preparations.

Both this method and _prepare_single_document log failures; this can spam logs. Downgrade here to debug.

         for arxiv_id in arxiv_ids:
             task = self._prepare_single_document(arxiv_id)
             if task:
                 tasks.append(task)
             else:
-                logger.warning(f"Could not prepare document for {arxiv_id}")
+                logger.debug(f"Could not prepare document for {arxiv_id}")

110-119: Align “newest-first” behavior with implementation.

Docstring promises newest-first; current sort is lexicographical by filename. Either update doc or sort by mtime. Also ignore non-YYMM dirs.

-        yymm_dirs = sorted([d for d in self.pdf_base_dir.iterdir() if d.is_dir()], 
-                          reverse=True)
+        yymm_dirs = sorted(
+            [d for d in self.pdf_base_dir.iterdir() if d.is_dir() and d.name.isdigit() and len(d.name) == 4],
+            reverse=True,
+        )
@@
-            pdf_files = sorted(yymm_dir.glob("*.pdf"), reverse=True)
+            pdf_files = sorted(yymm_dir.glob("*.pdf"), key=lambda p: p.stat().st_mtime, reverse=True)

Also applies to: 122-125, 131-131


182-197: Broaden LaTeX source detection to common tarball variants.

arXiv source tarballs often include additional suffixes. Fall back to any “stem*.tar.gz”.

         # Check for LaTeX source
         latex_path = pdf_path.with_suffix('.tex')
         if not latex_path.exists():
             # Try .tar.gz (compressed LaTeX)
             latex_path = pdf_path.with_suffix('.tar.gz')
             if not latex_path.exists():
-                latex_path = None
+                # Fallback: any tarball starting with the stem (e.g., *source*.tar.gz)
+                matches = list(pdf_path.parent.glob(f"{pdf_path.stem}*.tar.gz"))
+                latex_path = matches[0] if matches else None

Also applies to: 198-208


226-237: Validate month in YYMM and keep version check.

Accept only 01–12 months; everything else should be invalid.

         # Check year_month is 4 digits
         if not year_month.isdigit() or len(year_month) != 4:
             return False
+        # Validate month range (01–12)
+        mm = year_month[2:]
+        if not (1 <= int(mm) <= 12):
+            return False
         
         # Check paper number is 4-6 digits, optionally followed by version like v2
         base, _, version = paper_num.partition('v')
         if not base.isdigit() or not (4 <= len(base) <= 6):
             return False
         # If there's a version, validate it's numeric
         if version and not version.isdigit():
             return False

Also applies to: 245-247, 249-255


16-18: Remove unused imports to satisfy Ruff.

Dict, Any, and dataclass aren’t used.

-from typing import List, Optional, Dict, Any
-from dataclasses import dataclass
+from typing import List, Optional
tests/test_overnight_arxiv_2000.py (9)

66-76: Clarify parameter format in docstring.

Call this YYMM or YYYYMM to match normalization and the CLI.

-        start_year (str): Inclusive YYMM directory name to start searching from (e.g., "1501" = January 2015).
+        start_year (str): Inclusive year-month to start from, in YYMM or YYYYMM
+            (e.g., "1501" or "201501" = January 2015).

86-87: Stabilize sampling for reproducibility.

Always use a local RNG and sort files so order is deterministic when taking all PDFs from a dir.

-    rng = random.Random(seed) if seed is not None else random
+    rng = random.Random(seed) if seed is not None else random.Random()
@@
-            sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files
+            sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files

Additionally (outside selected lines), sort the directory listing once:

# at line 108
pdf_files = sorted(yymm_dir.glob("*.pdf"), key=lambda p: p.name)

Also applies to: 113-113


129-159: Detect LaTeX recursively to handle nested project structures.

Many arXiv sources keep .tex files in subfolders; use rglob for accuracy.

-        if latex_dir.exists():
-            # Check for .tex files
-            tex_files = list(latex_dir.glob("*.tex"))
-            if tex_files:
-                latex_count += 1
+        if latex_dir.exists():
+            # Check for .tex files anywhere under the directory
+            if any(latex_dir.rglob("*.tex")):
+                latex_count += 1

162-189: Docstring: add start_year parameter and format details.

Keep docs aligned with the function signature and CLI.

     Parameters:
-        num_docs (int): Maximum number of ArXiv papers to include in the test (default 2000).
+        num_docs (int): Maximum number of ArXiv papers to include in the test (default 2000).
+        start_year (str): Inclusive year-month to start from, in YYMM or YYYYMM.

203-203: Expose sampling seed end-to-end for reproducible runs.

Forward a seed from the CLI into collect_arxiv_papers so the same corpus is selected across runs.

If acceptable, I can add:

  • run_arxiv_overnight_test(..., seed: int | None = None) and pass it to collect_arxiv_papers
  • CLI flag --seed to populate it.
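A minimal sketch of that wiring, assuming the sampler can accept a seed argument; `collect_arxiv_papers` below is a simplified stand-in, not the project's implementation:

```python
# Sketch: thread a --seed flag through to the sampler for reproducible runs.
# collect_arxiv_papers here is a placeholder stand-in for the real function.
import argparse
import random
from typing import List, Optional


def collect_arxiv_papers(num_docs: int, seed: Optional[int] = None) -> List[str]:
    rng = random.Random(seed) if seed is not None else random.Random()
    candidates = [f"2301.{i:05d}" for i in range(10_000)]  # placeholder corpus
    return rng.sample(candidates, min(num_docs, len(candidates)))


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-docs", type=int, default=2000)
    parser.add_argument("--seed", type=int, default=None,
                        help="RNG seed for reproducible paper sampling")
    args = parser.parse_args()
    papers = collect_arxiv_papers(args.num_docs, seed=args.seed)
    print(f"Selected {len(papers)} papers (seed={args.seed})")


if __name__ == "__main__":
    main()
```

Running the script twice with the same --seed then yields the same selection.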

371-385: Guard division by zero in short runs.

Rare but safer; keeps metrics robust if a tiny corpus finishes instantly.

-            'papers_per_minute': (len(papers) / test_time) * 60,
+            'papers_per_minute': ((len(papers) / test_time) * 60) if test_time > 0 else 0.0,

399-403: Fix failure logging to show readable, bounded error text.

The current code slices a list; join the messages and truncate by characters instead.

-        if failed_docs:
-            logger.warning(f"  Failed Papers: {len(failed_docs)}")
-            for fail in failed_docs[:3]:
-                logger.warning(f"    - {fail['arxiv_id']}: {fail['errors'][:100]}")
+        if failed_docs:
+            logger.warning(f"  Failed Papers: {len(failed_docs)}")
+            for fail in failed_docs[:3]:
+                err = fail.get('errors')
+                err_str = "; ".join(map(str, err)) if isinstance(err, list) else str(err)
+                logger.warning(f"    - {fail['arxiv_id']}: {err_str[:200]}")

431-436: Make the performance target configurable.

Read from an env var with a sane default; an optional CLI flag can be added later.

-    target_rate = 11.3  # papers/minute from PRD
+    target_rate = float(os.getenv("ARXIV_TARGET_RATE", "11.3"))  # papers/minute

462-466: Broaden CLI help to reflect accepted formats.

Matches the YYMM/YYYYMM normalization.

-        help="Starting YYMM for paper selection (default: 1501 = Jan 2015)"
+        help="Start year-month (YYMM or YYYYMM). Default: 1501 (Jan 2015)"
core/processors/chunking_strategies.py (5)

141-143: Validate chunk_size/overlap ranges (negative/zero cases)

Guard against chunk_size <= 0 or negative overlap to avoid undefined behavior.

-        if chunk_overlap >= chunk_size:
-            raise ValueError("Overlap must be less than chunk size")
+        if chunk_size <= 0:
+            raise ValueError("chunk_size must be > 0")
+        if chunk_overlap < 0:
+            raise ValueError("chunk_overlap must be >= 0")
+        if chunk_overlap >= chunk_size:
+            raise ValueError("Overlap must be less than chunk size")

20-20: Remove unused import

numpy is imported but never used in this module.

-import numpy as np

173-200: Avoid O(n^2) start_char computation in token-based chunking

len(' '.join(tokens[:i])) inside the loop is quadratic. Precompute prefix char offsets once.

-        for i in range(0, len(tokens), stride):
+        # Precompute prefix char offsets to avoid quadratic joins
+        if tokens:
+            spaces = [0] + [1] * (len(tokens) - 1)
+            token_lens = [len(t) for t in tokens]
+            prefix_chars = [0]
+            running = 0
+            for tlen, space in zip(token_lens, spaces):
+                running += tlen + space
+                prefix_chars.append(running)
+        for i in range(0, len(tokens), stride):
@@
-            # Calculate character positions (approximate)
-            start_char = len(' '.join(tokens[:i]))
+            # Calculate character positions (approximate)
+            start_char = prefix_chars[i] if tokens else 0

463-465: Validate window_size and step_size > 0

Prevent runtime ValueError from range step=0 and clarify constraints early.

         if step_size > window_size:
             raise ValueError("Step size cannot be larger than window size")
+        if window_size <= 0:
+            raise ValueError("window_size must be > 0")
+        if step_size <= 0:
+            raise ValueError("step_size must be > 0")

17-17: Modernize typing per Ruff hints (Callable from collections.abc, use | unions)

Keep compatibility while addressing UP035/UP007.

-from typing import List, Dict, Any, Optional, Tuple, Callable, Union
+from typing import List, Dict, Any, Optional, Tuple
+from collections.abc import Callable
@@
-        tokenizer: Optional[Union[Callable[[str], List[str]], Any]] = None
+        tokenizer: Optional[Callable[[str], List[str]] | Any] = None

Also applies to: 111-111

core/processors/generic_document_processor.py (3)

183-194: Make extraction future timeout configurable

Long documents/code may exceed 120s. Allow config override to avoid spurious failures.

-                        result = future.result(timeout=120)
+                        result = future.result(timeout=extraction_config.get('future_timeout_seconds', 120))

218-224: Make embedding future timeout configurable

Same rationale as extraction; 300s may be low under heavy GPU load.

-                        result = future.result(timeout=300)
+                        result = future.result(timeout=embedding_config.get('future_timeout_seconds', 300))

Also applies to: 247-258


409-413: Avoid mutating sys.path in worker init; prefer package install/imports

sys.path hacks are brittle in multiprocess contexts. Prefer installing the package or setting PYTHONPATH for the test environment.

tests/test_c_files.py (3)

13-13: Remove unused import

GitHubDocumentManager isn’t used in this test.

-from tools.github.github_document_manager import GitHubDocumentManager

163-163: Add trailing newline

Conform to POSIX/linters.

 if __name__ == "__main__":
     test_c_files()
+

86-107: Optional: make Arango connection and query conditional

Consider skipping DB steps if ARANGO_HOST/ARANGO_PASSWORD are unset or connection fails, to keep the test runnable in CI without DB.

Also applies to: 129-158
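One way that guard could look, as a sketch under the assumption that ArangoDBManager accepts the same config dict used elsewhere in this PR:

```python
# Sketch: skip DB verification when credentials are absent or the connection
# fails, so the test still runs in CI without a live ArangoDB.
import logging
import os

logger = logging.getLogger(__name__)


def maybe_verify_in_arango(config: dict) -> None:
    if not (os.getenv("ARANGO_HOST") and os.getenv("ARANGO_PASSWORD")):
        logger.info("Skipping ArangoDB verification (ARANGO_HOST/ARANGO_PASSWORD not set)")
        return
    try:
        from core.database.arango_db_manager import ArangoDBManager
        db_manager = ArangoDBManager(config["arango"])
        cursor = db_manager.db.aql.execute(
            "FOR doc IN github_papers LIMIT 1 RETURN doc._key"
        )
        logger.info(f"ArangoDB reachable, sample key: {list(cursor)}")
    except Exception as exc:  # broad by design: DB issues should not fail the test
        logger.warning(f"Skipping DB check due to connection error: {exc}")
```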

tests/test_github_treesitter.py (5)

127-129: Use a temp staging directory for portability and isolation.

Hardcoding /dev/shm is Linux-specific; prefer a temp dir and still clean it up.

-        'staging': {
-            'directory': '/dev/shm/github_test_staging'
-        }
+        'staging': {
+            'directory': os.getenv('STAGING_DIR', tempfile.mkdtemp(prefix='github_test_staging_'))
+        }
@@
-    import shutil
-    shutil.rmtree(config['staging']['directory'], ignore_errors=True)
+    import shutil
+    shutil.rmtree(config['staging']['directory'], ignore_errors=True)

Add the required import:

@@
-import os
-import json
+import os
+import json
+import tempfile

Also applies to: 195-197


35-39: Avoid hardcoding clone_dir; make it configurable with an env default.

Improves portability and avoids requiring a specific mount path.

-    github_manager = GitHubDocumentManager(
-        clone_dir="/bulk-store/git",
-        cleanup=False  # Keep repos for analysis
-    )
+    github_manager = GitHubDocumentManager(
+        clone_dir=os.getenv('GITHUB_CLONE_DIR', '/tmp/github_repos'),
+        cleanup=False  # Keep repos for analysis
+    )

14-14: Avoid fragile sys.path hacks; compute project root robustly.

Keeps absolute imports while preventing accidental wrong paths.

-# Add parent directories to path
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
+# Ensure project root on sys.path (tests expect package-style imports)
+project_root = Path(__file__).resolve().parents[1]
+if str(project_root) not in sys.path:
+    sys.path.insert(0, str(project_root))

123-125: Make GPU devices configurable via env; default to CPU.

Prevents failures on machines without GPUs.

-            'embedding': {
-                'workers': 1,
-                'gpu_devices': [0]
-            }
+            'embedding': {
+                'workers': 1,
+                'gpu_devices': json.loads(os.getenv('GPU_DEVICES', '[]'))
+            }

202-203: Add trailing newline (ruff W292).

Satisfies linters and POSIX text file conventions.

tools/arxiv/arxiv_manager.py (8)

16-25: Prune unused imports and types to satisfy Ruff and reduce noise.

os, asyncio, asdict, and Union aren't used; each unused import triggers Ruff F401.

-import os
 import re
 import json
 import logging
-import asyncio
 from pathlib import Path
 from datetime import datetime, timezone
-from typing import Dict, Any, List, Optional, Tuple, Union
-from dataclasses import dataclass, asdict
+from typing import Dict, Any, List, Optional, Tuple
+from dataclasses import dataclass

276-289: Open metadata with UTF-8 and log with traceback for resilience.

Minor robustness: specify encoding and use exception logging with stack trace.

-                with open(metadata_path, 'r') as f:
+                with open(metadata_path, 'r', encoding='utf-8') as f:
@@
-            except Exception as e:
-                logger.warning(f"Failed to load metadata cache: {e}")
+            except Exception:
+                logger.exception("Failed to load metadata cache")

327-343: Harden metadata coercion for authors/categories.

Metadata often uses authors_parsed and space-delimited categories. Consider normalizing gracefully to avoid type surprises.

  • Prefer authors = [" ".join(a[:2]).strip() for a in metadata.get("authors_parsed", [])] as first choice, else parse string, else list fallback.
  • Normalize categories via split() if str; ensure list of str.

If you want, I can draft a small _normalize_metadata() helper and wire it here.
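A possible shape for that helper, as a sketch only; it assumes the snapshot-style fields authors_parsed, authors, and categories mentioned above:

```python
# Sketch of _normalize_metadata(): coerce authors/categories into predictable
# list-of-str shapes. Field names follow the arXiv metadata snapshot format.
from typing import Any, Dict


def _normalize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    # Authors: prefer authors_parsed ([["Last", "First", ...], ...]),
    # fall back to a comma-separated string, then to a plain list.
    parsed = metadata.get("authors_parsed")
    if parsed:
        authors = [" ".join(parts[:2]).strip() for parts in parsed]
    else:
        raw = metadata.get("authors", [])
        authors = [a.strip() for a in raw.split(",")] if isinstance(raw, str) else list(raw or [])

    # Categories: arXiv stores these space-delimited, e.g. "cs.CL cs.LG".
    cats = metadata.get("categories", [])
    categories = cats.split() if isinstance(cats, str) else [str(c) for c in (cats or [])]

    return {**metadata, "authors": authors, "categories": categories}
```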


372-377: Potential event-loop blocking.

process_document appears CPU/IO heavy; calling it directly in async context can block. If this runs under asyncio with other tasks, consider offloading to a thread/process pool.

Example pattern:

  • await asyncio.get_running_loop().run_in_executor(None, functools.partial(self.processor.process_document, ...))
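A runnable illustration of that pattern, with a stub standing in for the real processor (the actual DocumentProcessor call is the blocking part):

```python
# Sketch: offload a blocking process_document call so the event loop stays
# responsive. StubProcessor is a stand-in for the real DocumentProcessor.
import asyncio
import functools
import time


class StubProcessor:
    def process_document(self, pdf_path: str) -> str:
        time.sleep(1)  # simulate heavy, blocking extraction work
        return f"processed {pdf_path}"


async def main() -> None:
    processor = StubProcessor()
    loop = asyncio.get_running_loop()
    # None -> default ThreadPoolExecutor; use a ProcessPoolExecutor instead if
    # the work is CPU-bound and the arguments are picklable.
    result = await loop.run_in_executor(
        None, functools.partial(processor.process_document, "paper.pdf")
    )
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```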

431-435: Guard against None for extraction counts.

If any of these are None, len(None) will raise.

-                'num_chunks': len(result.chunks),
-                'num_tables': len(result.extraction.tables),
-                'num_equations': len(result.extraction.equations),
-                'num_images': len(result.extraction.images),
+                'num_chunks': len(result.chunks),
+                'num_tables': len(getattr(result.extraction, 'tables', []) or []),
+                'num_equations': len(getattr(result.extraction, 'equations', []) or []),
+                'num_images': len(getattr(result.extraction, 'images', []) or []),

457-467: Default structure lists to empty and keep schema stable.

Safer if any extractor component returns None; also future-proof if figures/images differ.

-                structures_doc = {
+                structures_doc = {
                     '_key': paper_info.sanitized_id,
                     'arxiv_id': paper_info.arxiv_id,
-                    'tables': result.extraction.tables,
-                    'equations': result.extraction.equations,
-                    'images': result.extraction.images,
-                    'figures': result.extraction.figures
+                    'tables': getattr(result.extraction, 'tables', []) or [],
+                    'equations': getattr(result.extraction, 'equations', []) or [],
+                    'images': getattr(result.extraction, 'images', []) or [],
+                    'figures': getattr(result.extraction, 'figures', []) or [],
                 }

471-472: Log full traceback on storage failures.

Use logger.exception for better diagnostics.

-        except Exception as e:
-            logger.error(f"Failed to store ArXiv paper {paper_info.arxiv_id}: {e}")
+        except Exception:
+            logger.exception(f"Failed to store ArXiv paper {paper_info.arxiv_id}")
             raise

26-30: Avoid function-local import; import ExtractionResult at module scope.

Saves repeated import and matches style.

 from core.processors.document_processor import (
     DocumentProcessor,
     ProcessingConfig,
-    ProcessingResult
+    ProcessingResult,
+    ExtractionResult,
 )
-                from core.processors.document_processor import ProcessingResult, ExtractionResult
+                # imports available at module scope

Also applies to: 500-501

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4d99984 and 638dabd.

📒 Files selected for processing (8)
  • core/processors/chunking_strategies.py (12 hunks)
  • core/processors/generic_document_processor.py (11 hunks)
  • experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
  • tests/test_c_files.py (1 hunks)
  • tests/test_github_treesitter.py (1 hunks)
  • tests/test_overnight_arxiv_2000.py (6 hunks)
  • tools/arxiv/arxiv_document_manager.py (8 hunks)
  • tools/arxiv/arxiv_manager.py (15 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer absolute imports following the shown module structure (e.g., core.framework.*, tools.arxiv.pipelines.*)

Files:

  • tests/test_github_treesitter.py
  • tests/test_c_files.py
  • core/processors/chunking_strategies.py
  • core/processors/generic_document_processor.py
  • tests/test_overnight_arxiv_2000.py
  • tools/arxiv/arxiv_manager.py
  • tools/arxiv/arxiv_document_manager.py
{tools/arxiv,core/framework}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv/ and core/framework/
Run Ruff linting on tools/arxiv/ and core/framework/

Files:

  • tools/arxiv/arxiv_manager.py
  • tools/arxiv/arxiv_document_manager.py
🧠 Learnings (2)
📚 Learning: 2025-09-02T23:17:37.712Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-02T23:17:37.712Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/check_db_status.py} : Use ArangoDB collections: arxiv_papers, arxiv_chunks, arxiv_embeddings, arxiv_structures (with specified fields)

Applied to files:

  • tests/test_github_treesitter.py
  • tools/arxiv/arxiv_manager.py
📚 Learning: 2025-09-02T23:17:37.712Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-02T23:17:37.712Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/check_db_status.py} : Store all outputs in ArangoDB with atomic transactions

Applied to files:

  • tools/arxiv/arxiv_manager.py
🧬 Code graph analysis (5)
tests/test_github_treesitter.py (3)
tools/github/github_document_manager.py (2)
  • GitHubDocumentManager (52-312)
  • prepare_repository (91-127)
core/processors/generic_document_processor.py (2)
  • GenericDocumentProcessor (46-261)
  • process_documents (91-128)
core/framework/extractors/code_extractor.py (1)
  • CodeExtractor (18-192)
tests/test_c_files.py (3)
tools/github/github_document_manager.py (1)
  • GitHubDocumentManager (52-312)
core/processors/generic_document_processor.py (3)
  • GenericDocumentProcessor (46-261)
  • DocumentTask (29-43)
  • process_documents (91-128)
core/framework/extractors/code_extractor.py (1)
  • CodeExtractor (18-192)
core/processors/chunking_strategies.py (1)
experiments/word2vec_evolution/generate_embeddings.py (1)
  • chunk_text (30-57)
core/processors/generic_document_processor.py (3)
core/framework/embedders.py (1)
  • JinaV4Embedder (52-1003)
core/database/arango_db_manager.py (1)
  • ArangoDBManager (69-252)
tools/arxiv/arxiv_manager.py (1)
  • sanitized_id (60-67)
tests/test_overnight_arxiv_2000.py (1)
tests/test_overnight_arxiv_phased.py (1)
  • collect_arxiv_papers (57-101)
🪛 Ruff (0.12.2)
tests/test_github_treesitter.py

203-203: No newline at end of file

Add trailing newline

(W292)

tests/test_c_files.py

163-163: No newline at end of file

Add trailing newline

(W292)

core/processors/chunking_strategies.py

17-17: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


17-17: typing.List is deprecated, use list instead

(UP035)


17-17: typing.Dict is deprecated, use dict instead

(UP035)


17-17: typing.Tuple is deprecated, use tuple instead

(UP035)


111-111: Use X | Y for type annotations

Convert to X | Y

(UP007)

🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml

[error] 112-112: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (11)
experiments/word2vec_evolution/experiment_config.yaml (2)

1-12: Overall: concise, readable config.

Good use of document marker, structured sections, and wrapped descriptions.


34-36: No action needed—Jina v4 outputs 2048-dim embeddings as configured. Tests, pipeline code, and docs consistently assert EMBEDDING_DIM = 2048 for model “jina-v4,” so the settings in experiment_config.yaml are correct.

tools/arxiv/arxiv_document_manager.py (1)

52-61: Docstring clarity—LGTM.

The behavior description (skip invalid/missing, preserve order) is clear and helpful.

core/processors/generic_document_processor.py (1)

271-287: Cannot verify GPU sharding via mp.current_process()._identity
The test script failed with a SemLock error in this environment, so please manually confirm that mp.current_process()._identity is reliably non-empty under your executor. If it isn't, switch to a simpler round-robin allocator (e.g. an atomic counter) for assigning CUDA_VISIBLE_DEVICES.
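For reference, a round-robin allocator along those lines might look like the following sketch; the initializer-based wiring and names are assumptions, not the processor's current code:

```python
# Sketch: assign GPUs to pool workers with a shared round-robin counter,
# instead of inspecting mp.current_process()._identity. Names are illustrative.
import multiprocessing as mp
import os


def _init_worker(shared_counter, gpu_devices):
    """Pool initializer: pin this worker to the next GPU in round-robin order."""
    with shared_counter.get_lock():
        idx = shared_counter.value % len(gpu_devices)
        shared_counter.value += 1
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_devices[idx])


def _which_gpu(_):
    return os.environ.get("CUDA_VISIBLE_DEVICES")


if __name__ == "__main__":
    gpu_devices = [0, 1]
    counter = mp.Value("i", 0)  # shared atomic counter, created in the parent
    with mp.Pool(processes=4, initializer=_init_worker,
                 initargs=(counter, gpu_devices)) as pool:
        print(pool.map(_which_gpu, range(4)))
```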

tools/arxiv/arxiv_manager.py (7)

26-31: Absolute imports: good move.

Using absolute imports (core.*) aligns with repo guidelines and eases refactors.


61-66: Docstring clarity for sanitized_id looks good.

Explains transformation and usage clearly.


117-132: PDF path docstring is precise and helpful.

No issues.


155-158: LaTeX path docstring reads well.

Clear precedence for main file detection.


183-197: parse_arxiv_id docs LGTM.

Covers new/old formats and returned fields succinctly.


242-249: init docstring is clear about wiring and behavior.

Nice.


480-490: Batch processing behavior documented well.

Ordering guarantee and failure shaping are clear.

Comment on lines +13 to +31
papers:
  word2vec:
    arxiv_id: "1301.3781"
    title: "Efficient Estimation of Word Representations in Vector Space"
    year: 2013
    github_repo: "https://github.com/dav/word2vec"

  doc2vec:
    arxiv_id: "1405.4053"
    title: "Distributed Representations of Sentences and Documents"
    year: 2014
    github_repo: "https://github.com/bnosac/doc2vec"

  code2vec:
    arxiv_id: "1803.09473"
    title: "code2vec: Learning Distributed Representations of Code"
    year: 2018
    github_repo: "https://github.com/tech-srl/code2vec"


🛠️ Refactor suggestion

Pin GitHub repos by commit for reproducibility.

Unpinned default branches are moving targets; runs won’t be reproducible.

 papers:
   word2vec:
     arxiv_id: "1301.3781"
     title: "Efficient Estimation of Word Representations in Vector Space"
     year: 2013
     github_repo: "https://github.com/tmikolov/word2vec"
+    commit: "<PIN_SHA_7+>"
 
   doc2vec:
     arxiv_id: "1405.4053"
     title: "Distributed Representations of Sentences and Documents"
     year: 2014
     github_repo: "https://github.com/RaRe-Technologies/gensim"
+    commit: "<PIN_SHA_7+>"
 
   code2vec:
     arxiv_id: "1803.09473"
     title: "code2vec: Learning Distributed Representations of Code"
     year: 2018
     github_repo: "https://github.com/tech-srl/code2vec"
+    commit: "<PIN_SHA_7+>"
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/experiment_config.yaml around lines 13 to 31,
the github_repo fields point to unpinned default branches which makes runs
non-reproducible; update each paper entry to pin the source by adding a commit
field (e.g., github_commit: "<full-commit-sha>") or replace github_repo with a
github_repo_commit field that includes the repo URL plus the exact commit or tag
(full SHA preferable), and ensure any tooling that fetches the repo uses that
commit to checkout; repeat for word2vec, doc2vec and code2vec entries.
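For illustration, fetch tooling could honor such a pin with plain git; `clone_at_commit` is a hypothetical helper, not part of GitHubDocumentManager:

```python
# Sketch: clone a repository and check out the pinned commit from the config.
# Uses plain git via subprocess; the helper name is hypothetical.
import subprocess
from pathlib import Path


def clone_at_commit(repo_url: str, commit: str, dest: Path) -> Path:
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "clone", "--no-checkout", repo_url, str(dest)], check=True)
    subprocess.run(["git", "-C", str(dest), "checkout", commit], check=True)
    return dest
```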

Comment on lines +151 to +159
    # Display results
    logger.info("\nPipeline Results:")
    logger.info(f"Total processed: {results.get('total_processed', 0)}")
    logger.info(f"Successful: {results.get('extraction_success', 0) + results.get('embedding_success', 0)}")
    logger.info(f"Failed: {results.get('extraction_failed', 0) + results.get('embedding_failed', 0)}")

    if results.get('processing_rate'):
        logger.info(f"Processing rate: {results['processing_rate']:.2f} docs/min")


⚠️ Potential issue

Correct results parsing (GenericDocumentProcessor returns nested dicts).

Access extraction/embedding summaries via results['extraction'] and results['embedding'] instead of flattened keys; remove non-existent processing_rate.

-    logger.info("\nPipeline Results:")
-    logger.info(f"Total processed: {results.get('total_processed', 0)}")
-    logger.info(f"Successful: {results.get('extraction_success', 0) + results.get('embedding_success', 0)}")
-    logger.info(f"Failed: {results.get('extraction_failed', 0) + results.get('embedding_failed', 0)}")
-    
-    if results.get('processing_rate'):
-        logger.info(f"Processing rate: {results['processing_rate']:.2f} docs/min")
+    logger.info("\nPipeline Results:")
+    logger.info(f"Total processed: {results.get('total_processed', 0)}")
+    extraction = results.get('extraction', {})
+    embedding = results.get('embedding', {})
+    logger.info(f"Extraction: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
+    logger.info(f"Embedding: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")
🤖 Prompt for AI Agents
In tests/test_github_treesitter.py around lines 151 to 159, the test is reading
flattened keys that don't exist because GenericDocumentProcessor returns nested
summaries; update the logging to read results.get('extraction', {}) and
results.get('embedding', {}) and pull totals from those dicts (e.g.
extraction.get('total_processed', 0), extraction.get('success', 0),
extraction.get('failed', 0) and similarly for embedding), compute aggregated
counts from those nested values for the "Total processed", "Successful" and
"Failed" lines, and remove the non-existent processing_rate check/logging.

Comment on lines +54 to +64
def _normalize_yymm(yymm: str) -> tuple[int, int]:
    """Normalize YYMM to (year, month) tuple for correct comparison."""
    if len(yymm) != 4 or not yymm.isdigit():
        raise ValueError(f"Invalid YYMM: {yymm}")
    yy, mm = int(yymm[:2]), int(yymm[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month in YYMM: {yymm}")
    # Handle century: 90-99 = 1990s, 00-89 = 2000s
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return (year, mm)


⚠️ Potential issue

Handle YYYYMM and avoid crashes on 6-digit directory names.

The current code breaks when encountering numeric dirs of length 6 (e.g., 201501). Make the normalizer accept YYMM or YYYYMM.

-def _normalize_yymm(yymm: str) -> tuple[int, int]:
-    """Normalize YYMM to (year, month) tuple for correct comparison."""
-    if len(yymm) != 4 or not yymm.isdigit():
-        raise ValueError(f"Invalid YYMM: {yymm}")
-    yy, mm = int(yymm[:2]), int(yymm[2:])
-    if not 1 <= mm <= 12:
-        raise ValueError(f"Invalid month in YYMM: {yymm}")
-    # Handle century: 90-99 = 1990s, 00-89 = 2000s
-    year = 1900 + yy if yy >= 90 else 2000 + yy
-    return (year, mm)
+def _normalize_yymm(yymm: str) -> tuple[int, int]:
+    """Normalize YYMM or YYYYMM to (year, month) for correct comparison."""
+    if not (yymm.isdigit() and len(yymm) in (4, 6)):
+        raise ValueError(f"Invalid YYMM/ YYYYMM: {yymm}")
+    if len(yymm) == 4:
+        yy, mm = int(yymm[:2]), int(yymm[2:])
+        # 90-99 = 1990s, 00-89 = 2000s
+        year = 1900 + yy if yy >= 90 else 2000 + yy
+    else:
+        year, mm = int(yymm[:4]), int(yymm[4:])
+    if not 1 <= mm <= 12:
+        raise ValueError(f"Invalid month in YYMM/ YYYYMM: {yymm}")
+    return (year, mm)
🤖 Prompt for AI Agents
In tests/test_overnight_arxiv_2000.py around lines 54 to 64, the _normalize_yymm
function currently only accepts 4-digit YYMM and fails on 6-digit YYYYMM
directory names (e.g., 201501); update the function to accept either 4-digit
(YYMM) or 6-digit (YYYYMM) numeric strings, parse accordingly (if length==4
treat as YYMM and apply the existing century rule, if length==6 take the first
four digits as the full year and last two as month), validate that month is
between 1 and 12, and raise ValueError for any other lengths or non-numeric
input; ensure the function still returns (year, month) as ints.

Comment on lines +89 to +96
    yymm_dirs = sorted(
        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
        key=lambda d: _normalize_yymm(d.name)
    )

    # Filter to start from specified year
    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
    # Filter to start from specified year (normalize century)
    start_cutoff = _normalize_yymm(start_year)
    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]

🛠️ Refactor suggestion

Filter dir names to 4 or 6 digits to prevent ValueError and sort once.

The current list may include 6+ digit numeric names and then crash during normalization.

-    yymm_dirs = sorted(
-        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
-        key=lambda d: _normalize_yymm(d.name)
-    )
+    yymm_dirs = [d for d in ARXIV_PDF_BASE.iterdir()
+                 if d.is_dir() and d.name.isdigit() and len(d.name) in (4, 6)]
+    yymm_dirs.sort(key=lambda d: _normalize_yymm(d.name))
@@
-    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
+    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
🤖 Prompt for AI Agents
In tests/test_overnight_arxiv_2000.py around lines 89 to 96, the directory
filtering and sorting can raise ValueError when numeric directory names have
more than 4 or 6 digits and also invokes normalization multiple times; restrict
the initial list comprehension to directories whose names are all digits and
length 4 or 6 (e.g., len(name) in (4,6)), then perform a single sorted(...) call
using key=lambda d: _normalize_yymm(d.name) and finally apply the start_year
cutoff by comparing _normalize_yymm(d.name) to the normalized start_cutoff so
normalization is only called where appropriate.

Comment on lines +78 to +88
Prepare DocumentTask objects for PDFs located in a specific ArXiv YYMM directory.

Searches the manager's PDF base directory under the given year_month (e.g., "2301"), finds up to `limit` PDF files, and builds DocumentTask objects for each PDF. The ArXiv ID passed to the task is constructed as "{year_month}.{paper_id}" where `paper_id` is derived from the PDF filename stem.

Parameters:
    year_month (str): Two-digit year and month string in YYMM format (e.g., "2301").
    limit (Optional[int]): Maximum number of PDFs to process from the directory. If None, all PDFs are processed.

Returns:
    List[DocumentTask]: A list of prepared DocumentTask objects. Returns an empty list if the directory does not exist or no tasks could be prepared.
"""

🛠️ Refactor suggestion

Fix limit=0 bug and unify arXiv ID derivation.

  • limit=0 currently processes all PDFs (truthiness check). Should slice when limit is 0.
  • Derive ID consistently with prepare_recent_documents: use full stem if it already contains a dot, else prefix with YYMM. Avoid split('.')[-1] which can break on extra dots in filenames.
-        pdf_files = sorted(dir_path.glob("*.pdf"))
-        if limit:
-            pdf_files = pdf_files[:limit]
+        pdf_files = sorted(dir_path.glob("*.pdf"))
+        if limit is not None:
+            pdf_files = pdf_files[:limit]
@@
-        for pdf_path in pdf_files:
-            arxiv_id = f"{year_month}.{pdf_path.stem.split('.')[-1]}"
+        for pdf_path in pdf_files:
+            stem = pdf_path.stem
+            arxiv_id = stem if "." in stem else f"{year_month}.{stem}"
             task = self._prepare_from_path(pdf_path, arxiv_id)
             if task:
                 tasks.append(task)

Also applies to: 94-103

🤖 Prompt for AI Agents
tools/arxiv/arxiv_document_manager.py lines 78-88 and 94-103: fix two
issues—limit handling and arXiv ID derivation. For limit handling, do not rely
on truthiness; when limit is not None slice the list of PDF files with files =
files[:limit] so limit=0 yields zero files; when limit is None process all
files. For ID derivation, stop using split('.')[-1]; obtain the filename stem
and if it already contains a dot use the full stem as the arXiv id, otherwise
prefix it with the year_month and a dot (f"{year_month}.{stem}"), and apply the
same logic in both code regions (lines 78-88 and 94-103) to keep behavior
consistent.

Comment on lines +153 to +163
"""
Prepare a DocumentTask for a single arXiv identifier.
Given an arXiv ID of the form "YYMM.NNNNN" (or with a version suffix, e.g. "YYMM.NNNNNv1"), this locates the corresponding PDF under the manager's base directory (pdf_base_dir/YYMM/<arxiv_id>.pdf) and, if found, delegates construction of the DocumentTask to _prepare_from_path.
Parameters:
arxiv_id (str): ArXiv identifier expected to contain a single dot separating the YYMM prefix from the paper number.
Returns:
Optional[DocumentTask]: A DocumentTask for the PDF when the ID is valid and the PDF exists; otherwise None.
"""

🛠️ Refactor suggestion

Validate IDs and fallback to versionless file when vN is missing on disk.

Docstrings allow version suffix. If only versionless PDF exists, we should still prepare the task.

-        # Parse ArXiv ID to get YYMM directory
-        parts = arxiv_id.split('.')
-        if len(parts) != 2:
-            logger.error(f"Invalid ArXiv ID format: {arxiv_id}")
-            return None
-        
-        year_month = parts[0]
-        
-        # Construct PDF path
-        pdf_path = self.pdf_base_dir / year_month / f"{arxiv_id}.pdf"
-        
-        if not pdf_path.exists():
-            logger.warning(f"PDF not found: {pdf_path}")
-            return None
-        
-        return self._prepare_from_path(pdf_path, arxiv_id)
+        # Validate and parse ArXiv ID
+        if not self.validate_arxiv_id(arxiv_id):
+            logger.error(f"Invalid ArXiv ID format: {arxiv_id}")
+            return None
+
+        year_month, paper_num = arxiv_id.split(".")
+
+        # Prefer exact match (including version, if any)
+        pdf_path = self.pdf_base_dir / year_month / f"{arxiv_id}.pdf"
+        if pdf_path.exists():
+            return self._prepare_from_path(pdf_path, arxiv_id)
+
+        # Fallback: try versionless if a version suffix was provided
+        base_num, _, _ = paper_num.partition("v")
+        if base_num != paper_num:
+            alt_id = f"{year_month}.{base_num}"
+            alt_pdf_path = self.pdf_base_dir / year_month / f"{alt_id}.pdf"
+            if alt_pdf_path.exists():
+                return self._prepare_from_path(alt_pdf_path, alt_id)
+
+        logger.warning(f"PDF not found: {pdf_path}")
+        return None

Also applies to: 164-179

Comment on lines +400 to 408
Persist ArXiv processing results and extracted artifacts to the configured ArangoDB collections.
If no database manager is configured, the call is a no-op (it logs a warning and returns). When a DB manager is present, this inserts:
- a main document into the `arxiv_papers` collection (keyed by paper_info.sanitized_id) containing ArXiv metadata and processing metrics;
- one document per text chunk with embeddings into `arxiv_embeddings`;
- an optional `arxiv_structures` document containing tables, equations, images, and figures when any are present.
Maintains separate collections for ArXiv papers to preserve
source-specific metadata and enable specialized queries.
Errors raised by the DB manager or during document insertion are logged and re-raised.
"""

🛠️ Refactor suggestion

Make DB writes atomic and align collections with design.

Per retrieved learnings, writes to arxiv_papers, arxiv_chunks, arxiv_embeddings, and arxiv_structures should be done in a single Arango transaction to avoid partial state on failures.

  • Wrap all inserts in one transaction (begin/commit/abort) via ArangoDBManager.
  • Store chunks in arxiv_chunks (text/positions), and embeddings in arxiv_embeddings referencing the chunk doc by key.
  • Ensure idempotency (upsert or conflict policy) for re-processing the same paper/version.
    Do you want me to sketch a transaction wrapper API for ArangoDBManager and update this method accordingly?
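One possible shape for that wrapper, using python-arango stream transactions; this is a sketch under the assumption that ArangoDBManager exposes a python-arango StandardDatabase as db:

```python
# Sketch: wrap the four arXiv collections in one stream transaction so a
# failure leaves no partial state; overwrite=True keeps re-processing idempotent.
from contextlib import contextmanager


@contextmanager
def arxiv_transaction(db):
    txn_db = db.begin_transaction(
        write=["arxiv_papers", "arxiv_chunks", "arxiv_embeddings", "arxiv_structures"]
    )
    try:
        yield txn_db
        txn_db.commit_transaction()
    except Exception:
        txn_db.abort_transaction()
        raise


def store_paper_atomically(db, paper_doc, chunk_docs, embedding_docs, structures_doc=None):
    with arxiv_transaction(db) as txn:
        txn.collection("arxiv_papers").insert(paper_doc, overwrite=True)
        for chunk in chunk_docs:
            txn.collection("arxiv_chunks").insert(chunk, overwrite=True)
        for emb in embedding_docs:
            txn.collection("arxiv_embeddings").insert(emb, overwrite=True)
        if structures_doc:
            txn.collection("arxiv_structures").insert(structures_doc, overwrite=True)
```

Documents would need stable _key values (e.g., sanitized arXiv ID plus chunk index) for the overwrite path to be meaningful.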

@r3d91ll r3d91ll merged commit 20674b2 into main Sep 3, 2025
1 check passed