Feature/GitHub arxiv tooling #17
Conversation
Docstrings generation was requested by @r3d91ll.

* #6 (comment)

The following files were modified:

* `core/framework/extractors/code_extractor.py`
* `core/framework/extractors/docling_extractor.py`
* `core/processors/chunking_strategies.py`
* `core/processors/document_processor.py`
* `core/processors/generic_document_processor.py`
* `tests/monitor_overnight.py`
* `tests/monitor_phased.py`
* `tests/test_batch_processing.py`
* `tests/test_batch_processing_simple.py`
* `tests/test_document_processor.py`
* `tests/test_overnight_2000.py`
* `tests/test_overnight_arxiv_2000.py`
* `tests/test_overnight_arxiv_phased.py`
* `tools/arxiv/arxiv_document_manager.py`
* `tools/arxiv/arxiv_manager.py`
* `tools/arxiv/pipelines/arxiv_pipeline_v2.py`
* `tools/github/github_document_manager.py`
* `tools/github/github_pipeline.py`
📝 Add docstrings to `feature/github-integration`
- Add Tree-sitter symbol extraction for code analysis
- Implement graph-based storage architecture in ArangoDB
- Add robust and code extractors with proper error handling
- Fix all CodeRabbit review issues (PyMuPDF, ProcessPoolExecutor, etc.)
- Add comprehensive documentation and ADRs
- Successfully process repositories at a 100+ files/minute rate

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Update `.gitignore` to include new data files for extended collections.
- Modify `collect_ai_papers_extended.py` to allow dynamic target count for paper collection.
- Implement `run_pipeline_from_list.py` to process papers from a specified list.
- Add `run_test_pipeline.py` for configurable testing of the ArXiv pipeline.
…idge analysis

This commit adds a comprehensive experiment analyzing the evolution of word embeddings from word2vec (2013) → doc2vec (2014) → code2vec (2018) to identify theory-practice bridges and measure "conveyance" - the ability to transfer knowledge from papers to implementations.

Key Changes:
- Add word2vec_evolution experiment with complete pipeline
- Fix GitHub pipeline to support config-driven repository processing
- Create graph structure for connecting papers with code implementations
- Implement conveyance measurement framework (C = W×R×H/T × Ctx^α)

Experiment Highlights:
- Identified Gensim's doc2vec as "pure conveyance" (α ≈ 2.0)
  * Implementation created solely from paper without original code
  * Became de facto standard, empirically proving high conveyance
- Processed 3 papers and 3 repositories with embeddings
- Created edge collections for theory-practice relationships
- Established temporal evolution chain in ArangoDB graph

Technical Improvements:
- GitHub pipeline now reads repositories from config (consistency with ArXiv)
- Added graph collections: paper_implements_theory, semantic_similarity, temporal_evolution, conveyance_bridge
- Cleaned up deprecated configs and removed large data files from version control

Performance:
- Papers: 27 chunks, 2048-dim Jina v4 embeddings
- Repositories: 275 files, 1,453 embeddings
- Processing: 24.8s papers, 135.9s repositories

This provides the foundation for entropy mapping and conveyance analysis to quantify how effectively knowledge transfers from academic theory to practical implementations.

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
- Implemented `update_pdf_sizes.py` to scan PDF files and update their sizes in the database.
- Created `utils.py` for utility functions including normalization of arXiv IDs.
- Developed orchestration module with `orchestrator.py` to manage the search, preview, refine, and process workflow for arXiv metadata.
- Added `list_generator.py` to generate paper lists from PostgreSQL based on search configurations.
- Introduced `check_existing_papers.py` script to verify which papers are already processed in ArangoDB.
- Enhanced logging and error handling across modules for better traceability and debugging.
- Remove previously committed *_embeddings.json files from git tracking
- Add pattern to .gitignore to prevent future embedding files from being committed
- These are large generated files that should not be in version control
Security fixes:
- Remove hardcoded DSN credentials from db.yaml, use environment variables
- Fix .gitignore to only exclude test artifacts, not test source files
- Fix keywords field in arxiv_search_minimal.yaml (empty list instead of empty string)

Import path fixes:
- Update all imports from tools.arxiv.pipelines.arango_db_manager to core.database.arango_db_manager
- Fix incorrect import in demo_github_pipeline.py

These changes address critical security vulnerabilities and broken imports identified by CodeRabbit.
ArangoDB Manager fixes:
- Fix with_connection decorator to inject via kwargs (works with methods; see the sketch below)
- Improve TTL index checking to verify specific field presence
- Add overwrite parameter to insert_document with proper warnings
- Document data loss risks when overwrite=True

Extractor fixes:
- Fix tree-sitter generator detection to use node attribute
- Fix robust_extractor to use public shutdown API first
- Add proper exception handling for executor cleanup

These changes improve robustness and address potential issues identified by CodeRabbit.
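A minimal sketch of the kwargs-injection pattern named in the first bullet; the decorator body, the `_get_connection` helper, and the `insert_document` signature here are illustrative assumptions, not the actual ArangoDBManager code:

import functools

def with_connection(func):
    """Inject a live DB connection via kwargs so the decorator also works on bound methods."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if kwargs.get('connection') is None:
            # Hypothetical helper that returns (or lazily creates) the underlying connection.
            kwargs['connection'] = self._get_connection()
        return func(self, *args, **kwargs)
    return wrapper

class ArangoDBManager:
    def _get_connection(self):
        ...  # return or lazily create the connection

    @with_connection
    def insert_document(self, collection, document, *, connection=None, overwrite=False):
        ...  # use the injected connection; warn before a destructive overwrite

Passing the connection through `**kwargs` rather than as a positional argument keeps the wrapper compatible with both plain functions and methods, which is the fix the commit describes.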
Feature/arxiv enhancements
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
- Remove all JSON result files that were accidentally tracked
- Remove extracted papers and analysis files
- Update .gitignore to prevent these files from being added
- These are generated output files that should not be in version control
- Merge paper IDs from loaded JSON collection into collected_ids set (see the sketch below)
- Prevents re-collecting papers that are already in existing collections
- Handles both 'id' and 'arxiv_id' field names for compatibility
- Addresses CodeRabbit review comment in PR #16
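A minimal sketch of the dedup step this commit describes; the function name, file layout, and variable names are assumptions for illustration:

import json
from pathlib import Path

def load_collected_ids(collection_file: Path) -> set:
    """Merge paper IDs from an existing JSON collection so they are not re-collected."""
    collected_ids = set()
    if collection_file.exists():
        papers = json.loads(collection_file.read_text())
        for paper in papers:
            # Older collections used 'id'; newer ones use 'arxiv_id'.
            paper_id = paper.get('arxiv_id') or paper.get('id')
            if paper_id:
                collected_ids.add(paper_id)
    return collected_ids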
Walkthrough

Adds a large multi-source processing surface: ArangoDB manager, Postgres-backed ArXiv metadata service and migrations, robust PDF and Tree-sitter code extractors, GenericDocumentProcessor with phased extraction/embedding and checkpointing, GitHub repo processing and graph storage, many CLI/scripts/tests/configs/experiments, docs (ADRs/PRDs), packaging and .gitignore updates, and removal of autogenerated/legacy artifacts.
Sequence Diagram(s)

%%{init: {"themeVariables": {"primaryColor":"#f5f7fa","actorBorder":"#244","noteBg":"#f0f4ff"}}}%%
sequenceDiagram
autonumber
actor User
participant GDM as GitHubDocumentManager
participant GDP as GenericDocumentProcessor
participant EX as Code/Tree-sitter Extractor
participant EMB as JinaV4Embedder
participant DB as ArangoDBManager
User->>GDM: prepare_repository(repo_url)
GDM-->>User: tasks (file list)
User->>GDP: process_documents(tasks)
GDP->>EX: extract(file)
EX-->>GDP: symbols/metrics + text
GDP->>EMB: embed (late chunking)
EMB-->>GDP: embeddings
GDP->>DB: begin transaction → insert repo/file/chunk/embedding
DB-->>GDP: commit
GDP-->>User: summary (processed counts)
sequenceDiagram
autonumber
participant DP as DocumentProcessor
participant RE as RobustExtractor
participant DL as Docling Worker (subprocess)
participant PM as PyMuPDF Fallback
DP->>RE: extract(pdf_path)
RE->>DL: run with timeout
alt Docling success
DL-->>RE: content
RE-->>DP: result (docling)
else timeout/error
RE->>PM: fallback extract
PM-->>RE: text-only content
RE-->>DP: result (pymupdf_fallback)
end
sequenceDiagram
autonumber
actor Operator
participant ORCH as ArxivPipelineOrchestrator
participant SQL as Postgres
participant LG as ListGenerator
participant PIPE as ArXiv Pipeline v2
Operator->>ORCH: load search config
ORCH->>SQL: assess_compute_cost(build_where)
SQL-->>ORCH: assessment
ORCH->>Operator: preview/refine
Operator->>ORCH: approve
ORCH->>LG: generate_list()
LG-->>ORCH: paper_ids + files
ORCH->>PIPE: run with list/config
PIPE-->>ORCH: results file
ORCH-->>Operator: completed + next-step cmd
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120+ minutes
Actionable comments posted: 106
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (37)
tests/test_overnight_arxiv_2000.py (5)
54-109: Fix YYMM filtering (lexicographic bug across centuries) and make sampling reproducible.

- Comparing "YYMM" as strings admits 1990s (e.g., "9912") when starting at "1501". Normalize to (YYYY, MM).
- Add optional seed for deterministic runs.

Apply this diff:

-def collect_arxiv_papers(target_count: int = 2000, start_year: str = "1501") -> List[Tuple[Path, str]]:
+def collect_arxiv_papers(target_count: int = 2000, start_year: str = "1501", seed: int | None = None) -> List[Tuple[Path, str]]:
@@
-    Parameters:
+    Parameters:
         target_count (int): Maximum number of papers to collect.
         start_year (str): Inclusive YYMM directory name to start searching from (e.g., "1501" = January 2015).
+        seed (int | None): Optional RNG seed for reproducible sampling.
@@
-    papers = []
+    papers = []
+    rng = random.Random(seed) if seed is not None else random
@@
-    yymm_dirs = sorted([d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()])
-
-    # Filter to start from specified year
-    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
+    yymm_dirs = sorted(
+        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
+        key=lambda d: _normalize_yymm(d.name)
+    )
+    # Filter to start from specified year (normalize century)
+    start_cutoff = _normalize_yymm(start_year)
+    yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
@@
-    sampled_pdfs = random.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files
+    sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files

Add this helper near the constants:

def _normalize_yymm(yymm: str) -> tuple[int, int]:
    if len(yymm) != 4 or not yymm.isdigit():
        raise ValueError(f"Invalid YYMM: {yymm}")
    yy, mm = int(yymm[:2]), int(yymm[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month in YYMM: {yymm}")
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return (year, mm)
223-244: Import heavy deps inside the function and make GPU use dynamic.

- Localize imports to avoid import-time failures in non-GPU environments.
- Detect CUDA and only enable FP16 when GPU is available.

@@
-    # Configure processor
-    config = ProcessingConfig(
+    # Local imports to avoid import-time side effects
+    from core.processors.document_processor import DocumentProcessor, ProcessingConfig
+    try:
+        import torch  # optional
+        use_gpu = torch.cuda.is_available()
+    except Exception:
+        use_gpu = False
+
+    # Configure processor
+    config = ProcessingConfig(
         chunking_strategy=test_config['strategy'],
         chunk_size_tokens=test_config['chunk_size'],
         chunk_overlap_tokens=test_config['overlap'],
         batch_size=test_config['batch_size'],
-        use_gpu=True,
-        use_fp16=True,  # Use FP16 for memory efficiency
+        use_gpu=use_gpu,
+        use_fp16=use_gpu,  # Only enable FP16 on GPU
         use_ocr=False,  # Disable OCR for speed
         extract_tables=True,  # Keep table extraction
         extract_equations=True  # Keep equation extraction
     )
185-189: Wire CLI start-year through to collection.

The --start-year flag currently has no effect.

-    papers = collect_arxiv_papers(num_docs)
+    papers = collect_arxiv_papers(num_docs, start_year=start_year)

400-407: Report the actual number of papers processed, not the requested count.

This corrects summary inaccuracies when fewer files are found or runs abort early.

-        'total_papers_processed': num_docs,
+        'total_papers_processed': len(papers),

459-459: Pass start_year from CLI to the runner.

Without this, --start-year is ignored.

-    results = run_arxiv_overnight_test(args.docs)
+    results = run_arxiv_overnight_test(args.docs, start_year=args.start_year)

tests/test_overnight_arxiv_phased.py (7)
49-52: Make STAGING_DIR portable and robust (fallback if /dev/shm is unavailable).

Hard-coding /dev/shm breaks on macOS/CI and raises FileNotFoundError. Also prefer parents=True.

Apply:

-# Staging directory for inter-phase communication
-STAGING_DIR = Path("/dev/shm/overnight_staging")
-STAGING_DIR.mkdir(exist_ok=True)
+# Staging directory for inter-phase communication
+# Prefer env override; fall back to /dev/shm if present, else system temp
+default_staging_base = Path("/dev/shm") if Path("/dev/shm").exists() else Path(os.getenv("TMPDIR", "/tmp"))
+STAGING_DIR = Path(os.getenv("STAGING_DIR", default_staging_base / "overnight_staging"))
+STAGING_DIR.mkdir(parents=True, exist_ok=True)
53-55: Parameterize ARXIV_PDF_BASE via environment for portability.

Hard-coded path will fail outside your lab host. Allow override.

-# ArXiv PDF repository
-ARXIV_PDF_BASE = Path("/bulk-store/arxiv-data/pdf")
+# ArXiv PDF repository
+ARXIV_PDF_BASE = Path(os.getenv("ARXIV_PDF_BASE", "/bulk-store/arxiv-data/pdf"))
78-83: Fix incorrect lexicographic comparison of year-month directory names.

String comparison can misorder mixed 4- vs 6-digit names (e.g., "9999" > "202001"). Compare numerically and sort by int.

-    yymm_dirs = sorted([d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()])
-    yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
+    # Normalize to numeric comparison; accept 4- or 6-digit directory names
+    int_start = int(start_year)
+    yymm_dirs = sorted(
+        [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
+        key=lambda p: int(p.name)
+    )
+    yymm_dirs = [d for d in yymm_dirs if int(d.name) >= int_start]
133-141: Improve I/O robustness and error logging.

Use encoding and log tracebacks for faster diagnosis.

-        with open(staging_file, 'w') as f:
+        with open(staging_file, 'w', encoding='utf-8') as f:
             json.dump(result, f)
-        return {'arxiv_id': arxiv_id, 'success': True, 'staged_path': str(staging_file)}
+        return {'arxiv_id': arxiv_id, 'success': True, 'staged_path': str(staging_file)}
@@
-    except Exception as e:
-        logger.error(f"Failed to extract {arxiv_id}: {e}")
+    except Exception as e:
+        logger.error(f"Failed to extract {arxiv_id}: {e}", exc_info=True)
         return {'arxiv_id': arxiv_id, 'success': False, 'error': str(e)}
268-281: Avoid setting CUDA_VISIBLE_DEVICES at runtime inside workers.

Setting the env var post-import doesn't constrain already-initialized CUDA contexts and can confuse device indexing. Rely on torch.cuda.set_device only.

     if torch.cuda.is_available():
         torch.cuda.set_device(gpu_id)
-        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
303-312: Prevent OOM by not returning full chunks payload.

Aggregate counts and persist details elsewhere if required.

-    return {
+    return {
         'arxiv_id': arxiv_id,
         'success': True,
         'num_chunks': len(chunks_with_embeddings),
-        'chunks': chunks_with_embeddings,  # Store if needed
         'gpu_used': gpu_id if torch.cuda.is_available() else 'cpu'
     }
502-516: Compute overall metrics from actual phase outputs, not requested num_docs.

Avoids inflated rates when fewer papers are processed successfully.

-    total_time = extraction_results['phase_time'] + embedding_results['phase_time']
-    overall_rate = num_docs / total_time * 60
+    total_time = extraction_results['phase_time'] + embedding_results['phase_time']
+    processed_docs = embedding_results.get('total_files', 0) or extraction_results.get('total_papers', 0)
+    overall_rate = (processed_docs / total_time * 60) if total_time > 0 else 0
@@
-    'total_papers_processed': num_docs,
+    'total_papers_processed': processed_docs,

tools/arxiv/monitoring/monitor_overnight.py (1)
46-49: Guard against partial JSON reads while file is being written.

json.load() can fail if the writer hasn't finished flushing. Retry instead of crashing.

Apply:

-        if result_file.exists():
-            with open(result_file) as f:
-                results = json.load(f)
+        if result_file.exists():
+            try:
+                with open(result_file) as f:
+                    results = json.load(f)
+            except json.JSONDecodeError:
+                # File may be mid-write; try again next cycle
+                time.sleep(10)
+                continue

tests/test_overnight_2000.py (1)
43-113: --mock flag currently does nothing; monkey patch happens before args are parsed.

The embedder is chosen at import time; flipping MOCK_EMBEDDER later won't reapply the patch. Move the mock class and monkey patch behind a function and call it when args.mock is true.

-# Use real embedder by default for accurate testing
-MOCK_EMBEDDER = False  # Set to True only for quick debugging
-
-if MOCK_EMBEDDER:
-    logger.info("Using MOCK embedder for faster processing")
-
-    class MockEmbedder:
-        """Mock embedder for testing without loading heavy models."""
-        ...
-    # Monkey patch the embedder
-    import core.framework.embedders
-    core.framework.embedders.JinaV4Embedder = MockEmbedder
-else:
-    logger.info("Using REAL Jina embedder - this will be slower")
+MOCK_EMBEDDER = False  # toggled via CLI
+
+def apply_mock_embedder():
+    logger.info("Using MOCK embedder for faster processing")
+    class MockEmbedder:
+        """Mock embedder for testing without loading heavy models."""
+        def __init__(self, *args, **kwargs):
+            self.embedding_dim = 2048
+        def embed_texts(self, texts: List[str], batch_size: int = 1):
+            import numpy as np
+            return np.random.randn(len(texts), self.embedding_dim).astype(np.float32)
+        def embed_with_late_chunking(self, text: str):
+            import numpy as np
+            from core.framework.embedders import ChunkWithEmbedding
+            words = text.split()
+            chunk_size = 256
+            chunks = []
+            for i in range(0, len(words), chunk_size - 50):
+                chunk_text = ' '.join(words[i:i+chunk_size])
+                chunk = ChunkWithEmbedding(
+                    text=chunk_text,
+                    embedding=np.random.randn(self.embedding_dim).astype(np.float32),
+                    start_char=i * 5,
+                    end_char=(i + len(chunk_text.split())) * 5,
+                    start_token=i,
+                    end_token=min(i + chunk_size, len(words)),
+                    chunk_index=len(chunks),
+                    total_chunks=0,
+                    context_window_used=chunk_size
+                )
+                chunks.append(chunk)
+            for c in chunks: c.total_chunks = len(chunks)
+            return chunks
+    import core.framework.embedders as _emb
+    _emb.JinaV4Embedder = MockEmbedder

And in main, after parsing args:

-    if args.mock:
-        MOCK_EMBEDDER = True
-        logger.info("WARNING: Using MOCK embedder - results won't reflect real performance!")
+    if args.mock:
+        MOCK_EMBEDDER = True
+        apply_mock_embedder()
+        logger.info("WARNING: Using MOCK embedder - results won't reflect real performance!")

tools/arxiv/monitoring/monitor_phased.py (1)
104-110: Fix summary slicing: lines.index(line) can return the first occurrence, not the last.

If "FINAL SUMMARY" appears multiple times, you may print the wrong block. Use the last occurrence.

-            # Print summary lines
-            for summary_line in lines[lines.index(line):]:
+            # Print summary lines from the last occurrence
+            try:
+                start = len(lines) - 1 - list(reversed(lines)).index(line)
+            except ValueError:
+                start = 0
+            for summary_line in lines[start:]:
                 if "Overall Rate:" in summary_line or "Extraction:" in summary_line or "Embedding:" in summary_line:
                     print(summary_line.strip())
             return

tools/arxiv/arxiv_manager.py (4)
101-112: Docstring says "ignore version suffix," but code validates original ID; use base_id.

Small logic/consistency fix; also removes an unused variable warning.

-        base_id = re.sub(r'v\d+$', '', arxiv_id)
+        base_id = re.sub(r'v\d+$', '', arxiv_id)
@@
-        if cls.ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.ARXIV_ID_PATTERN.match(base_id):
             return True, None
@@
-        if cls.OLD_ARXIV_ID_PATTERN.match(arxiv_id):
+        if cls.OLD_ARXIV_ID_PATTERN.match(base_id):
             return True, None
413-468: Missing transactional writes and distributed lock; risk of partial writes and duplicate processing.

Per learnings, writes should be atomic and guarded by a lock. Use Arango locks (arxiv_locks) and wrap all inserts in a single transaction.
Proposed flow (a hedged sketch follows the list below):
- acquire_lock(paper_info.sanitized_id) → skip or retry if locked
- begin_transaction(write_collections=['arxiv_papers','arxiv_chunks','arxiv_embeddings','arxiv_structures']) // or tables/equations/images if you split
- insert arxiv_papers, then chunks and structures within the same tx
- commit; finally release_lock
Do you want me to draft a concrete implementation aligned with core.database.ArangoDBManager’s transaction API?
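A minimal sketch of that flow using python-arango's transaction API; the collection names follow the list above, while the function shape, the lock-document format, and the `overwrite=True` choice are assumptions rather than the project's actual ArangoDBManager implementation:

from arango.exceptions import DocumentInsertError

def store_paper_atomically(db, sanitized_id, paper_doc, chunk_docs, structures_doc):
    """Insert a paper and its derived documents in one transaction, guarded by a lock."""
    locks = db.collection('arxiv_locks')
    try:
        # Simple distributed lock: a duplicate _key means another worker owns this paper.
        locks.insert({'_key': sanitized_id})
    except DocumentInsertError:
        return False  # already locked elsewhere; skip or retry later

    txn_db = db.begin_transaction(
        write=['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings', 'arxiv_structures']
    )
    try:
        txn_db.collection('arxiv_papers').insert(paper_doc, overwrite=True)
        for chunk_doc in chunk_docs:  # chunk docs are assumed to carry their embeddings
            txn_db.collection('arxiv_chunks').insert(chunk_doc, overwrite=True)
        txn_db.collection('arxiv_structures').insert(structures_doc, overwrite=True)
        txn_db.commit_transaction()
        return True
    except Exception:
        txn_db.abort_transaction()
        raise
    finally:
        locks.delete({'_key': sanitized_id}, ignore_missing=True)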
31-31: Update ArangoDBManager import to new core.database module

In tools/arxiv/arxiv_manager.py (line 31), change the import to:

-from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager
+from core.database.arango_db_manager import ArangoDBManager

389-391: Remove `await` from synchronous `insert_document` calls

In `_store_arxiv_result` (tools/arxiv/arxiv_manager.py), `db_manager.insert_document` is defined as a regular `def`, not `async def`, so awaiting it will raise. Drop the `await` before each call to:

- `self.db_manager.insert_document('arxiv_papers', arxiv_doc)`
- `self.db_manager.insert_document('arxiv_embeddings', chunk_doc)`
- `self.db_manager.insert_document('arxiv_structures', structures_doc)`

Occurrences at lines 438–440, 454–455 and 466–467.
tools/github/github_document_manager.py (2)

268-270: Avoid sys.path manipulation; use absolute imports.

Modify to import DocumentTask via the package, per guidelines. The sys.path hack is fragile.

-    # Import here to avoid circular dependency
-    import sys
-    sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-    from core.processors.generic_document_processor import DocumentTask
+    from core.processors.generic_document_processor import DocumentTask

168-188: URL parsing is brittle; handle extra path segments and both HTTPS/SSH forms robustly.

Use urllib.parse and split on path safely; current split() will crash on owner/repo/subpath.

-        # Handle different URL formats
-        url = url.strip()
-
-        # Extract owner/repo from various formats
-        if "github.com" in url:
-            # https://github.com/owner/repo or git@github.com:owner/repo
-            parts = url.split("github.com")[-1]
-            parts = parts.strip(":/").replace(".git", "")
-            owner, name = parts.split("/")
-        else:
-            # Assume owner/repo format
-            owner, name = url.split("/")
+        from urllib.parse import urlparse
+        url = url.strip()
+        owner = name = None
+        if "github.com" in url:
+            if url.startswith("git@"):
+                path = url.split("github.com:", 1)[-1]
+            else:
+                path = urlparse(url).path
+            parts = [p for p in path.replace(".git", "").strip("/").split("/") if p]
+            if len(parts) >= 2:
+                owner, name = parts[0], parts[1]
+        else:
+            parts = [p for p in url.split("/") if p]
+            if len(parts) == 2:
+                owner, name = parts
+        if not owner or not name:
+            raise ValueError(f"Unrecognized GitHub repo identifier: {url}")

core/processors/chunking_strategies.py (1)

477-497: SlidingWindow: emit a remainder/single chunk for short texts

Current guard skips any chunk when len(tokens) <= window_size. Emit a remainder/single chunk so short docs aren't dropped.

-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
-
-                    chunk = TextChunk(
-                        text=chunk_text,
-                        start_char=start_char,
-                        end_char=start_char + len(chunk_text),
-                        chunk_index=len(chunks),
-                        metadata={
-                            'strategy': 'sliding_window_remainder',
-                            'token_count': len(remaining_tokens)
-                        }
-                    )
-                    chunks.append(chunk)
+        # Handle remainder or produce a single chunk for short inputs
+        last_index = len(chunks) * self.step_size
+        if last_index < len(tokens):
+            remaining_tokens = tokens[last_index:]
+            if remaining_tokens:  # emit any meaningful remainder
+                chunk_text = ' '.join(remaining_tokens)
+                start_char = len(' '.join(tokens[:last_index]))
+                chunk = TextChunk(
+                    text=chunk_text,
+                    start_char=start_char,
+                    end_char=start_char + len(chunk_text),
+                    chunk_index=len(chunks),
+                    metadata={
+                        'strategy': 'sliding_window_remainder' if chunks else 'sliding_window_single',
+                        'window_size': self.window_size,
+                        'step_size': self.step_size,
+                        'token_count': len(remaining_tokens)
+                    }
+                )
+                chunks.append(chunk)

tools/arxiv/arxiv_document_manager.py (2)

95-97: limit=0 bug: processes all files instead of none

`if limit:` treats 0 as falsy. Use `is not None`.

-        if limit:
-            pdf_files = pdf_files[:limit]
+        if limit is not None:
+            pdf_files = pdf_files[:limit]

246-252: Critical: validate_arxiv_id strips all digits and rejects valid IDs

`rstrip('v0123456789')` removes all trailing digits; for "2305.12345" it becomes "", failing validation. Parse the optional version suffix instead.

-        # Check paper number is 4-6 digits (or with version like 12345v2)
-        paper_num_base = paper_num.rstrip('v0123456789')
-        if not paper_num_base.isdigit() or not (4 <= len(paper_num_base) <= 6):
+        # Check paper number is 4–6 digits, optionally followed by version like v2
+        base, _, _ = paper_num.partition('v')
+        if not base.isdigit() or not (4 <= len(base) <= 6):
             return False

tools/arxiv/pipelines/arxiv_pipeline_v2.py (3)
167-186: Checkpoint ‘enabled’ should accept booleans and strings. Current logic misinterprets YAML `true`/`false`.

If config uses YAML booleans, `_should_use_checkpoint` falls into the `auto` branch erroneously.

Apply:

 def _should_use_checkpoint(self, count: int) -> bool:
@@
-    if self.checkpoint_enabled == 'true':
-        return True
-    elif self.checkpoint_enabled == 'false':
-        return False
-    else:  # auto mode
-        return count >= self.checkpoint_auto_threshold
+    v = self.checkpoint_enabled
+    # Accept booleans and common string variants
+    if isinstance(v, bool):
+        return v
+    if isinstance(v, str):
+        s = v.strip().lower()
+        if s in ('true', 'yes', 'on', '1'):
+            return True
+        if s in ('false', 'no', 'off', '0'):
+            return False
+    # auto or unknown → treat as auto
+    return count >= self.checkpoint_auto_threshold

254-267: Checkpoint never updates: gating on results['success'] is wrong.

GenericDocumentProcessor returns per-phase results; there's no top-level 'success'. Update based on embedding successes.

Apply:

-        # Step 3: Update checkpoint if enabled
-        if use_checkpoint and results.get('success'):
-            # Add successfully processed documents to checkpoint
-            if 'embedding' in results and 'success' in results['embedding']:
-                for doc_id in results['embedding']['success']:
-                    self.processed_ids.add(doc_id)
-
-            # Save checkpoint periodically or at the end
-            self._save_checkpoint(
-                extraction_results=results.get('extraction'),
-                embedding_results=results.get('embedding')
-            )
-            logger.info(f"Checkpoint updated: {len(self.processed_ids)} total documents processed")
+        # Step 3: Update checkpoint if enabled
+        if use_checkpoint:
+            embed_success = results.get('embedding', {}).get('success', [])
+            if embed_success:
+                self.processed_ids.update(embed_success)
+            # Save checkpoint (periodic save_interval can be added later)
+            self._save_checkpoint(
+                extraction_results=results.get('extraction'),
+                embedding_results=results.get('embedding')
+            )
+            logger.info(f"Checkpoint updated: {len(self.processed_ids)} total documents processed")

277-294: Fix reporting: handle missing top-level ‘success’ and zero-division on avg chunks.

Avoid KeyError on results['success'] and only compute averages when total_processed > 0.

Apply:

-        if results['success']:
-            extraction = results['extraction']
-            embedding = results['embedding']
+        extraction = results.get('extraction', {})
+        embedding = results.get('embedding', {})
+        total_processed = results.get('total_processed', 0)
+        if extraction or embedding:
             logger.info(f"Extraction: {len(extraction['success'])} success, {len(extraction['failed'])} failed")
             logger.info(f"Embedding: {len(embedding['success'])} success, {len(embedding['failed'])} failed")
-            logger.info(f"Total processed: {results['total_processed']} documents")
+            logger.info(f"Total processed: {total_processed} documents")
-            if results['total_processed'] > 0:
-                rate = results['total_processed'] / elapsed * 60
+            if total_processed > 0:
+                rate = total_processed / elapsed * 60
                 logger.info(f"Processing rate: {rate:.1f} documents/minute")
-
-                if 'chunks_created' in embedding:
-                    avg_chunks = embedding['chunks_created'] / results['total_processed']
-                    logger.info(f"Average chunks per document: {avg_chunks:.1f}")
+                if 'chunks_created' in embedding:
+                    avg_chunks = embedding['chunks_created'] / total_processed
+                    logger.info(f"Average chunks per document: {avg_chunks:.1f}")

core/framework/extractors/docling_extractor.py (3)
306-315: Flatten structured outputs and include a stable `version` to match downstream expectations

`DocumentProcessor._extract_content` reads top-level `tables`/`equations`/`images`/`figures` and `version`. Here, Docling results are nested under `structures` and lack `version`, so structured content is silently dropped and `extractor_version` becomes `unknown`.

Apply this diff to surface the structures at top level while keeping `structures` for backward compatibility and add `version='docling_v2'`:

         return {
-            'full_text': full_text,
-            'markdown': full_text,  # Also provide as 'markdown' key
-            'structures': structures,
-            'metadata': {
-                'extractor': 'docling_v2',
-                'num_pages': num_pages,
-                'processing_time': None  # Docling doesn't provide this
-            }
+            'full_text': full_text,
+            'markdown': full_text,  # Also provide as 'markdown' key
+            # Flattened keys for downstream compatibility
+            'tables': structures.get('tables', []),
+            'equations': structures.get('equations', []),
+            'images': structures.get('images', []),
+            'figures': structures.get('figures', []),
+            # Retain original nested payload for backward compatibility
+            'structures': structures,
+            'metadata': {
+                'extractor': 'docling_v2',
+                'num_pages': num_pages,
+                'processing_time': None  # Populated by caller
+            },
+            'version': 'docling_v2'
         }

176-193: Fix processing_time accounting when falling back after a processing error

`processing_time` is computed before running the fallback path, under-reporting total time. Recompute after the fallback completes:

-        except Exception as e:
-            # Compute duration before handling exception
-            duration = (datetime.now() - start_time).total_seconds()
+        except Exception as e:
             # Don't use fallback for pre-validation errors (they should fail fast)
             error_msg = str(e)
             if any(x in error_msg for x in ["empty", "Invalid PDF header", "Cannot read PDF file"]):
                 logger.error(f"Pre-validation failed for {pdf_path}: {e}")
                 raise RuntimeError(f"Pre-validation failed for {pdf_path}: {e}") from e
             # Use fallback for processing errors if enabled
             if self.use_fallback:
                 logger.warning(f"Extraction failed, attempting fallback: {e}")
                 result = self._extract_fallback(pdf_path)
-                # Add processing time and pdf_path to metadata
-                result['metadata']['processing_time'] = duration
+                # Add processing time and pdf_path to metadata (include fallback runtime)
+                duration = (datetime.now() - start_time).total_seconds()
+                # Ensure metadata dict exists
+                result.setdefault('metadata', {})
+                result['metadata']['processing_time'] = duration
                 result['metadata']['pdf_path'] = str(pdf_path)
                 return result
388-401: Retain top-level `pdf_path` or update all downstream usages

Multiple consumers index `extracted['pdf_path']` (e.g. tools/arxiv/pipelines/arxiv_pipeline.py:95, experiments/word2vec_evolution/store_in_arango.py:65, tests/test_overnight_arxiv_phased.py:128). Removing that key in `extract_batch` will break these; either keep the top-level duplication for backward compatibility (a minimal sketch follows below) or refactor every reference to use `metadata['pdf_path']`.
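A minimal sketch of the back-compat option, assuming the rough shape of the per-document dict built in `extract_batch` (key names other than `pdf_path` are illustrative):

def build_result(full_text: str, metadata: dict, pdf_path: str) -> dict:
    """Assemble one extraction result, duplicating pdf_path at top level for back-compat."""
    return {
        'full_text': full_text,
        'metadata': {**metadata, 'pdf_path': str(pdf_path)},
        # Existing consumers index extracted['pdf_path']; keep this key until
        # every call site reads metadata['pdf_path'] instead.
        'pdf_path': str(pdf_path),
    }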
392-403: Honor nestedstructuresfrom extractors for backward compatibilityIf upstream extractors return
structures(as in current Docling path), tables/equations/images are otherwise lost. Merge fromstructureswhen top-level keys are missing:- return ExtractionResult( - full_text=full_text, - tables=docling_result.get('tables', []), - equations=docling_result.get('equations', []), - images=docling_result.get('images', []), - figures=docling_result.get('figures', []), - metadata=docling_result.get('metadata', {}), - latex_source=latex_source, - has_latex=has_latex, - extraction_time=time.time() - start_time, - extractor_version=docling_result.get('version', 'unknown') - ) + structures = docling_result.get('structures', {}) or {} + return ExtractionResult( + full_text=full_text, + tables=docling_result.get('tables') or structures.get('tables', []), + equations=docling_result.get('equations') or structures.get('equations', []), + images=docling_result.get('images') or structures.get('images', []), + figures=docling_result.get('figures') or structures.get('figures', []), + metadata=docling_result.get('metadata', {}), + latex_source=latex_source, + has_latex=has_latex, + extraction_time=time.time() - start_time, + extractor_version=docling_result.get('version', 'unknown') + )
491-505: Make embedding generation robust to runtime errorsIf
embed_textsraises,process_documentfails. Catch and degrade to zero-vector as documented.- # Generate embedding for chunk - # JinaV4Embedder expects a list and returns array of embeddings - embeddings = self.embedder.embed_texts([chunk['text']], batch_size=1) - embedding = embeddings[0] if len(embeddings) > 0 else None + # Generate embedding for chunk (robust to failures) + try: + embeddings = self.embedder.embed_texts([chunk['text']], batch_size=1) + embedding = embeddings[0] if len(embeddings) > 0 else None + except Exception as e: + logger.warning(f"Embedding failed for chunk {i}: {type(e).__name__}: {e}") + embedding = Nonetools/arxiv/scripts/run_pipeline_from_list.py (1)
1-300: Enable Black and fix all lint violations in tools/arxiv
Environment lacks ablackcommand and Ruff still reports 229 violations in tools/arxiv—including blank-line whitespace (W293), trailing whitespace (W291), unused variables (F841), bareexceptclauses (E722), and module‐level import ordering (E402). Install Black (pip install black), ensure it’s on PATH, then re-runruff check tools/arxiv --fix black tools/arxivto auto‐fix and manually address any remaining errors.
tools/arxiv/pipelines/arxiv_pipeline.py (1)
666-675: Align storage with required Arango collections and validate embedding dimensionality.Guidelines require: arxiv_papers, arxiv_chunks, arxiv_embeddings, arxiv_structures; also vectors must be 2048-d with model='jina-v4'.
- Include
arxiv_structuresin the transaction’s write set and store structures in one document.- Assert/guard that each embedding vector has 2048 dims; fail-fast otherwise.
- write_collections = ['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings'] - structures = doc.get('structures', {}) - if structures.get('equations'): - write_collections.append('arxiv_equations') - if structures.get('tables'): - write_collections.append('arxiv_tables') - if structures.get('images'): - write_collections.append('arxiv_images') + write_collections = ['arxiv_papers', 'arxiv_chunks', 'arxiv_embeddings', 'arxiv_structures'] + structures = doc.get('structures', {})- txn_db.collection('arxiv_embeddings').insert({ + emb_vec = chunk.embedding.tolist() + if len(emb_vec) != 2048: + raise ValueError(f"Unexpected embedding size {len(emb_vec)} for {arxiv_id}; expected 2048") + txn_db.collection('arxiv_embeddings').insert({ '_key': f"{chunk_key}_emb", 'paper_id': sanitized_id, 'chunk_id': chunk_key, - 'vector': chunk.embedding.tolist(), + 'vector': emb_vec, 'model': 'jina-v4', 'embedding_method': 'late_chunking', 'embedding_date': datetime.now().isoformat() }, overwrite=True)- # Store equations if present - for i, eq in enumerate(structures.get('equations', [])): - txn_db.collection('arxiv_equations').insert({ - '_key': f"{sanitized_id}_eq_{i}", - 'paper_id': sanitized_id, - 'equation_index': i, - 'latex': eq.get('latex', ''), - 'label': eq.get('label', ''), - 'type': eq.get('type', 'display') - }, overwrite=True) - - # Store tables if present - for i, tbl in enumerate(structures.get('tables', [])): - txn_db.collection('arxiv_tables').insert({ - '_key': f"{sanitized_id}_table_{i}", - 'paper_id': sanitized_id, - 'table_index': i, - 'caption': tbl.get('caption', ''), - 'headers': tbl.get('headers', []), - 'data_rows': tbl.get('rows', []) - }, overwrite=True) - - # Store images if present - for i, img in enumerate(structures.get('images', [])): - txn_db.collection('arxiv_images').insert({ - '_key': f"{sanitized_id}_img_{i}", - 'paper_id': sanitized_id, - 'image_index': i, - 'caption': img.get('caption', ''), - 'type': img.get('type', 'figure') - }, overwrite=True) + # Store all detected structures in a single document + txn_db.collection('arxiv_structures').insert({ + '_key': sanitized_id, + 'paper_id': sanitized_id, + 'equations': structures.get('equations', []), + 'tables': structures.get('tables', []), + 'images': structures.get('images', []), + 'counts': { + 'equations': len(structures.get('equations', [])), + 'tables': len(structures.get('tables', [])), + 'images': len(structures.get('images', [])), + } + }, overwrite=True)Also applies to: 740-748, 750-781
core/processors/generic_document_processor.py (2)
154-160: Make extraction task timeout configurable.Hard-coded 120s may be too short. Read from config with a sensible default.
- num_workers = extraction_config.get('workers', 8) + num_workers = extraction_config.get('workers', 8) gpu_devices = extraction_config.get('gpu_devices', [0, 1]) + task_timeout = extraction_config.get('task_timeout_seconds', 600) @@ - result = future.result(timeout=120) + result = future.result(timeout=task_timeout)Also applies to: 179-184
409-414: Fix ArangoDBManager import to honor package layout and avoid sys.path hacks.Prefer absolute imports per guidelines; keep a guarded fallback if needed.
- import sys - from pathlib import Path - sys.path.insert(0, str(Path(__file__).parent.parent.parent)) - from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager + try: + from core.database.arango_db_manager import ArangoDBManager # preferred + except Exception: + from tools.arxiv.pipelines.arango_db_manager import ArangoDBManager # fallbacktools/arxiv/db/config.py (1)
1-152: Install Black, updatepyproject.toml, and run Ruff + Black on tools/arxiv/
Moveignore,select, andper-file-ignoresinto thelintsection ofpyproject.toml, then install Black (pip install black), and run:ruff check tools/arxiv --fix black tools/arxiv
    # First, ensure the GitHub graph is set up
    setup_script = Path(__file__).parent.parent.parent / 'tools' / 'github' / 'setup_github_graph.py'
    if setup_script.exists():
        logger.info("\nEnsuring GitHub graph collections exist...")
        result = subprocess.run(
            [sys.executable, str(setup_script)],
            capture_output=True,
            text=True,
            cwd=str(setup_script.parent)
        )
        if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
            logger.info(" ✓ GitHub graph collections ready")

    # Process each repository
🛠️ Refactor suggestion
Check setup_github_graph return code and surface failures
Currently only scanning stdout text; a failing script can be silently ignored.
- result = subprocess.run(
+ result = subprocess.run(
[sys.executable, str(setup_script)],
capture_output=True,
text=True,
cwd=str(setup_script.parent)
)
- if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
- logger.info(" ✓ GitHub graph collections ready")
+ if result.returncode == 0 and (
+ "created" in result.stdout.lower() or "already exists" in result.stdout.lower()
+ ):
+ logger.info(" ✓ GitHub graph collections ready")
+ else:
+ logger.warning(f" ⚠ GitHub graph setup may have failed (rc={result.returncode}). Inspect output:")
+ logger.warning(result.stdout.strip()[:1000])
+ logger.warning(result.stderr.strip()[:1000])📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    # First, ensure the GitHub graph is set up
    setup_script = Path(__file__).parent.parent.parent / 'tools' / 'github' / 'setup_github_graph.py'
    if setup_script.exists():
        logger.info("\nEnsuring GitHub graph collections exist...")
        result = subprocess.run(
            [sys.executable, str(setup_script)],
            capture_output=True,
            text=True,
            cwd=str(setup_script.parent)
        )
        if result.returncode == 0 and (
            "created" in result.stdout.lower() or "already exists" in result.stdout.lower()
        ):
            logger.info(" ✓ GitHub graph collections ready")
        else:
            logger.warning(f" ⚠ GitHub graph setup may have failed (rc={result.returncode}). Inspect output:")
            logger.warning(result.stdout.strip()[:1000])
            logger.warning(result.stderr.strip()[:1000])

    # Process each repository
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/process_github_repos.py around lines 111 to
124, the wrapper currently only inspects stdout text from running
setup_github_graph.py and can silently ignore failures; update it to check
subprocess.CompletedProcess.returncode and treat non-zero return codes as
errors: after running subprocess.run, if result.returncode != 0 log an error
including result.stderr (and result.stdout for context), and raise SystemExit or
an exception (or sys.exit(1)) to stop processing; keep the existing success
check for the positive messages but only after confirming returncode == 0.
@coderabbitai fix submitted. please confirm
@r3d91ll I can see that the fix has not yet been applied to the code. Looking at the current diff for experiments/word2vec_evolution/process_github_repos.py, the code still shows the original implementation on lines 111-124:
result = subprocess.run(
    [sys.executable, str(setup_script)],
    capture_output=True,
    text=True,
    cwd=str(setup_script.parent)
)
if "created" in result.stdout.lower() or "already exists" in result.stdout.lower():
    logger.info(" ✓ GitHub graph collections ready")

The suggested improvements are missing:

- No `result.returncode` check
- No error handling for non-zero return codes
- No stderr logging on failures
- No warning when setup fails

To apply the fix, you need to replace the current setup_github_graph section with the improved version that checks the return code and handles failures properly.
# Fix Python path
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(script_dir.parent))

from pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database import ArangoDBManager
🛠️ Refactor suggestion
Fix imports and satisfy Ruff E402; prefer absolute package paths
Move sys.path adjustments before any imports and use absolute imports.
-# Fix Python path
-script_dir = Path(__file__).resolve().parent
-project_root = script_dir.parent.parent.parent
-sys.path.insert(0, str(project_root))
-sys.path.insert(0, str(script_dir.parent))
-
-from pipelines.arxiv_pipeline import ACIDPhasedPipeline
-from core.database import ArangoDBManager
+# Fix Python path (before imports)
+script_dir = Path(__file__).resolve().parent
+project_root = script_dir.parent.parent.parent
+sys.path.insert(0, str(project_root))
+
+from tools.arxiv.pipelines.arxiv_pipeline import ACIDPhasedPipeline
+from core.database.arango_db_manager import ArangoDBManager

Also remove unused imports (json, datetime) to keep lint clean.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Fix Python path (before imports)
script_dir = Path(__file__).resolve().parent
project_root = script_dir.parent.parent.parent
sys.path.insert(0, str(project_root))

from tools.arxiv.pipelines.arxiv_pipeline import ACIDPhasedPipeline
from core.database.arango_db_manager import ArangoDBManager
🧰 Tools
🪛 Ruff (0.12.2)
22-22: Module level import not at top of file
(E402)
23-23: Module level import not at top of file
(E402)
🤖 Prompt for AI Agents
In tools/arxiv/tests/quick_pipeline_test.py around lines 16 to 23, the test
currently performs sys.path modifications after module imports which triggers
Ruff E402 and uses relative or non-absolute imports plus includes unused
imports; move the sys.path.insert(...) calls to the very top of the file before
any import statements, change imports to absolute package paths (e.g., from
pipelines.arxiv_pipeline import ACIDPhasedPipeline -> use the full package
absolute import as appropriate in your project), and remove any unused imports
like json and datetime so the file imports are ordered and lint-clean.
print(f"\nProcessing {len(available_papers)} available papers...")

# The pipeline processes from filesystem, not from a list
# We'll use the 'local' source and let it find papers
pipeline.run(source='local', count=len(available_papers))
🛠️ Refactor suggestion
Ensure the pipeline processes the exact test papers
source='local' + count=N may pick arbitrary PDFs. Populate specific_list with available_papers and use that source.
- # The pipeline processes from filesystem, not from a list
- # We'll use the 'local' source and let it find papers
- pipeline.run(source='local', count=len(available_papers))
+ # Force the pipeline to process exactly our IDs
+ pipeline.config.setdefault('processing', {}).setdefault('specific_list', {})['arxiv_ids'] = available_papers
+ pipeline.run(source='specific_list', count=len(available_papers))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
print(f"\nProcessing {len(available_papers)} available papers...")

# Force the pipeline to process exactly our IDs
pipeline.config.setdefault('processing', {}).setdefault('specific_list', {})['arxiv_ids'] = available_papers
pipeline.run(source='specific_list', count=len(available_papers))
🤖 Prompt for AI Agents
In tools/arxiv/tests/quick_pipeline_test.py around lines 96 to 101, the test
currently uses source='local' with count which can pick arbitrary PDFs; instead
populate the pipeline's specific_list with available_papers and call
pipeline.run with source='specific_list' (pass the specific_list argument set to
available_papers) so the pipeline processes exactly those test papers rather
than any local PDFs; remove or ignore the count parameter when using
specific_list.
# =========================================
# Runs the complete test suite for 5000+ papers

set -e  # Exit on error
🛠️ Refactor suggestion
Propagate failures in pipelines (tee): enable pipefail
Without set -o pipefail, failures in python ... | tee won’t be detected.
-set -e # Exit on error
+set -e # Exit on error
+set -o pipefail # Propagate failures through pipelines📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
set -e           # Exit on error
set -o pipefail  # Propagate failures through pipelines
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around line 6, the script uses "set
-e" but doesn't enable pipefail so failures in pipelines like "python ... | tee"
can be ignored; update the shell options to include "set -o pipefail" (or change
to "set -euo pipefail" if you also want nounset and errexit behavior) so
pipeline failures are propagated and the script exits on any command in a
pipeline failing.
echo "Clearing previous staging data..."
rm -rf "$STAGING_DIR"/*
🛠️ Refactor suggestion
Safe rm -rf: guard against accidental “/*”
Harden deletes with parameter expansion.
-rm -rf "$STAGING_DIR"/*
+rm -rf "${STAGING_DIR:?}/"*

Repeat the same change at Line 175.
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 90-90: Use "${var:?}" to ensure this never expands to /* .
(SC2115)
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 89-91 (and again at
line 175), the unguarded rm -rf "$STAGING_DIR"/* can evaluate to rm -rf /* if
STAGING_DIR is empty; replace it with a safe parameter expansion guard so the
command aborts when STAGING_DIR is unset/empty (e.g., use the ${VAR:?} form) and
apply the same change at line 175 to harden both deletes.
# Run the full test suite
python test_large_scale_processing.py \
    --config ../configs/large_scale_test.yaml \
    --papers "$PAPER_LIST" \
    --arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"

TEST_EXIT_CODE=$?
Capture the Python exit code when piping through tee
$? captures tee’s status. Use ${PIPESTATUS[0]}.
python test_large_scale_processing.py \
--config ../configs/large_scale_test.yaml \
--papers "$PAPER_LIST" \
--arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"
-TEST_EXIT_CODE=$?
+TEST_EXIT_CODE=${PIPESTATUS[0]}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Run the full test suite
python test_large_scale_processing.py \
    --config ../configs/large_scale_test.yaml \
    --papers "$PAPER_LIST" \
    --arango-password "$ARANGO_PASSWORD" 2>&1 | tee "$LOG_DIR/large_scale_test_$(date +%Y%m%d_%H%M%S).log"

TEST_EXIT_CODE=${PIPESTATUS[0]}
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 137 to 144, the script
currently captures the exit code with TEST_EXIT_CODE=$? after piping the python
command through tee, which returns tee's status instead of the Python process;
replace that line with TEST_EXIT_CODE=${PIPESTATUS[0]} (or enable pipefail and
use $? accordingly) to capture the Python process exit code, and ensure the
script is executed with bash (shebang or invoked with bash) so ${PIPESTATUS[@]}
is supported.
Review continued from previous batch...
def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
    """Fallback extraction using PyMuPDF."""
    if not PYMUPDF_AVAILABLE:
        return None

    try:
        with fitz.open(pdf_path) as doc:
            text_parts = []
            num_pages = len(doc)

            for page_num, page in enumerate(doc):
                page_text = page.get_text()
                if page_text.strip():
                    text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")

            full_text = "\n\n".join(text_parts)

            return {
                'full_text': full_text,
                'text': full_text,
                'markdown': full_text,
                'num_pages': num_pages,
                'extractor': 'pymupdf_fallback'
            }
    except Exception as e:
🛠️ Refactor suggestion
Return a schema-compatible fallback result (include metadata/structures)
Current fallback payload lacks the 'metadata' block used elsewhere and deviates from DoclingExtractor’s shape.
Apply:
-def _extract_with_pymupdf(pdf_path: str) -> Dict[str, Any]:
+def _extract_with_pymupdf(pdf_path: str) -> dict[str, Any]:
@@
- return {
- 'full_text': full_text,
- 'text': full_text,
- 'markdown': full_text,
- 'num_pages': num_pages,
- 'extractor': 'pymupdf_fallback'
- }
+ return {
+ 'full_text': full_text,
+ 'markdown': full_text,
+ 'structures': {},
+ 'metadata': {
+ 'extractor': 'pymupdf_fallback',
+ 'num_pages': num_pages,
+ 'pdf_path': str(pdf_path),
+ 'processing_time': None,
+ 'warning': 'Using fallback extractor - structures not extracted'
+ }
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _extract_with_pymupdf(pdf_path: str) -> dict[str, Any]:
    """Fallback extraction using PyMuPDF."""
    if not PYMUPDF_AVAILABLE:
        return None

    try:
        with fitz.open(pdf_path) as doc:
            text_parts = []
            num_pages = len(doc)

            for page_num, page in enumerate(doc):
                page_text = page.get_text()
                if page_text.strip():
                    text_parts.append(f"--- Page {page_num + 1} ---\n{page_text}")

            full_text = "\n\n".join(text_parts)

            return {
                'full_text': full_text,
                'markdown': full_text,
                'structures': {},
                'metadata': {
                    'extractor': 'pymupdf_fallback',
                    'num_pages': num_pages,
                    'pdf_path': str(pdf_path),
                    'processing_time': None,
                    'warning': 'Using fallback extractor - structures not extracted'
                }
            }
    except Exception as e:
        …
🤖 Prompt for AI Agents
In core/framework/extractors/robust_extractor.py around lines 42 to 66, the
pymupdf fallback return lacks the expected DoclingExtractor-shaped keys; update
the returned dict to include a top-level 'metadata' dict (e.g., {'source':
pdf_path, 'num_pages': num_pages, 'extractor': 'pymupdf_fallback'}) and a
'structures' list (e.g., per-page objects with page_number and text) in addition
to the existing text/full_text/markdown fields so the payload matches the rest
of the codebase's schema.
- **Schema Migration**: Added columns for compute cost assessment
  - `pdf_size_bytes`: Track actual PDF file sizes
  - `processing_complexity`: Estimates processing difficulty
  - `estimated_tokens`: Estimated token count for LLM processing
  - `created_at`, `modified_at`: Timestamps managed by triggers
- **Migration Script**: `migrations/add_compute_assessment_columns.sql`
  - Successfully applied to existing database with 2.7M papers
  - Includes indexes for performance
💡 Verification agent
🧩 Analysis chain
Docs reference a SQL migration not shown in this PR; verify file exists and is chained.
🏁 Script executed:
#!/bin/bash
# Verify referenced migration exists and links to this initial revision
fd -t f -a 'add_compute_assessment_columns.sql' .
fd -t f -a 'tools/arxiv/db/migrations/versions' | sort
rg -n "down_revision\s*=\s*[\"']20250901_000001_initial_schema[\"']" tools/arxiv/db/migrations/versions -SLength of output: 659
🏁 Script executed:
#!/bin/bash
set -e
echo "Finding add_compute_assessment_columns.sql"
find . -type f -name 'add_compute_assessment_columns.sql' || true
echo "Searching codebase for migration name references"
rg -l 'add_compute_assessment_columns' -F . || true
echo "Listing migrations directory under tools/arxiv/db/migrations"
find tools/arxiv/db/migrations -type f -print || true

Length of output: 656
Add missing migration file and versioned script
The docs reference migrations/add_compute_assessment_columns.sql, but that SQL file isn’t in the repo and there’s no Alembic version under tools/arxiv/db/migrations/versions chaining from 20250901_000001_initial_schema.py. Add the SQL migration and a corresponding version script with down_revision='20250901_000001_initial_schema'.
🧰 Tools
🪛 LanguageTool
[grammar] ~12-~12: There might be a mistake here.
Context: ...size_bytes: Track actual PDF file sizes - processing_complexity: Estimates processing difficulty - e...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...lexity: Estimates processing difficulty - estimated_tokens`: Estimated token count for LLM processi...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ...Estimated token count for LLM processing - created_at, modified_at: Timestamps managed by t...
(QB_NEW_EN)
[grammar] ~15-~15: There might be a mistake here.
Context: ...fied_at: Timestamps managed by triggers - **Migration Script**: migrations/add_comp...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ...ged by triggers - Migration Script: migrations/add_compute_assessment_columns.sql - Successfully applied to existing databas...
(QB_NEW_EN)
[grammar] ~17-~17: There might be a mistake here.
Context: ...ed to existing database with 2.7M papers - Includes indexes for performance ### 2....
(QB_NEW_EN)
🤖 Prompt for AI Agents
In docs/implementation/arxiv_metadata_service_implementation.md around lines 11
to 19, the docs reference a missing migration file and Alembic version script;
add a SQL migration at migrations/add_compute_assessment_columns.sql containing
the DDL to add pdf_size_bytes, processing_complexity, estimated_tokens,
created_at, modified_at plus appropriate indexes and triggers, and create a
corresponding Alembic version script under tools/arxiv/db/migrations/versions
named with a proper timestamp that sets
down_revision='20250901_000001_initial_schema', implements upgrade() to run the
SQL (or emit equivalent SQLAlchemy migration steps) and downgrade() to revert
the changes, register the new revision in Alembic env, commit both files, and
verify by running the alembic upgrade to ensure the migration applies cleanly
against the existing DB.
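As a rough sketch only, the missing Alembic version script the prompt asks for could look like the following. The revision ID, table name, column types, and index names are assumptions for illustration; they are not taken from this PR, and the `modified_at` trigger would still need raw SQL.

```python
"""Add compute cost assessment columns (illustrative sketch, not from the repo)."""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic
revision = "20250902_000001_add_compute_assessment"  # hypothetical ID
down_revision = "20250901_000001_initial_schema"
branch_labels = None
depends_on = None

TABLE = "papers"  # assumption: the metadata table the docs refer to


def upgrade() -> None:
    op.add_column(TABLE, sa.Column("pdf_size_bytes", sa.BigInteger(), nullable=True))
    op.add_column(TABLE, sa.Column("processing_complexity", sa.Float(), nullable=True))
    op.add_column(TABLE, sa.Column("estimated_tokens", sa.Integer(), nullable=True))
    op.add_column(TABLE, sa.Column("created_at", sa.DateTime(timezone=True),
                                   server_default=sa.text("now()"), nullable=False))
    op.add_column(TABLE, sa.Column("modified_at", sa.DateTime(timezone=True),
                                   server_default=sa.text("now()"), nullable=False))
    # Indexes for the cost-assessment queries mentioned in the docs
    op.create_index("ix_papers_pdf_size_bytes", TABLE, ["pdf_size_bytes"])
    op.create_index("ix_papers_estimated_tokens", TABLE, ["estimated_tokens"])
    # The modified_at trigger described in the docs would be applied via op.execute(...)


def downgrade() -> None:
    op.drop_index("ix_papers_estimated_tokens", table_name=TABLE)
    op.drop_index("ix_papers_pdf_size_bytes", table_name=TABLE)
    for col in ("modified_at", "created_at", "estimated_tokens",
                "processing_complexity", "pdf_size_bytes"):
        op.drop_column(TABLE, col)
```

The referenced `migrations/add_compute_assessment_columns.sql` would then carry the equivalent DDL for anyone applying the change outside Alembic.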
- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
- Tree-sitter for code parsing
- Jina v4 code adapter for embeddings
- ArangoDB for graph storage
- PostgreSQL for metadata
🛠️ Refactor suggestion
Replace local absolute path with repo-relative reference.
Hard-coding /home/todd/... breaks for all other environments.
-- Existing GitHub tools in `/home/todd/olympus/HADES/tools/github/`
+- Existing GitHub tools in `tools/github/`
🤖 Prompt for AI Agents
In docs/issues/github_theory_practice_bridge_detection.md around lines 162 to
167, replace the hard-coded absolute path
"/home/todd/olympus/HADES/tools/github/" with a repository-relative reference
(for example "tools/github/" or "./tools/github/") or a placeholder like
"$REPO_ROOT/tools/github/" so the docs work across environments; update the list
entry accordingly and ensure any adjacent instructions reflect the new
repo-relative path convention.
## Database Status
- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
  - 2,491 total papers (including our 3)
  - 42,511 total chunks
  - 43,511 total embeddings
  - 1 GitHub repository (word2vec only)
Redact internal host details and sensitive counts.
Publishing a private IP (192.168.1.69) and internal corpus sizes invites information leakage. Reference env-configured endpoints instead and omit exact totals (or move to an internal report).
Apply this diff:
-## Database Status
-- **ArangoDB** (`academy_store` @ 192.168.1.69:8529)
- - 2,491 total papers (including our 3)
- - 42,511 total chunks
- - 43,511 total embeddings
- - 1 GitHub repository (word2vec only)
+## Database Status (sanitized)
+- ArangoDB database and collections reachable via ARANGO_HOST/ARANGO_PASSWORD env vars.
+- Document, chunk, and embedding totals available internally (omit from public docs).
+- GitHub repository ingestion validated for at least one repository.
🧰 Tools
🪛 LanguageTool
[grammar] ~82-~82: There might be a mistake here.
Context: ...e analysis (partial) ## Database Status - ArangoDB (academy_store @ 192.168.1....
(QB_NEW_EN)
[grammar] ~83-~83: There might be a mistake here.
Context: ...** (academy_store @ 192.168.1.69:8529) - 2,491 total papers (including our 3) -...
(QB_NEW_EN)
[grammar] ~84-~84: There might be a mistake here.
Context: ... - 2,491 total papers (including our 3) - 42,511 total chunks - 43,511 total emb...
(QB_NEW_EN)
[grammar] ~85-~85: There might be a mistake here.
Context: ...including our 3) - 42,511 total chunks - 43,511 total embeddings - 1 GitHub rep...
(QB_NEW_EN)
[grammar] ~86-~86: There might be a mistake here.
Context: ...total chunks - 43,511 total embeddings - 1 GitHub repository (word2vec only) ## ...
(QB_NEW_EN)
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/experiment_summary.md around lines 82 to 88,
the document exposes an internal IP address and exact dataset counts; replace
the hard-coded host/IP with a reference to an environment-configured endpoint
(e.g., ARANGO_HOST or a configured connection string) and remove or redact precise
totals (use vague descriptors like "≈2.5k papers", "tens of thousands of
chunks/embeddings" or move exact numbers to an internal-only report). Also
remove any repository-specific sensitive identifiers or mark them as internal.
Ensure the public summary contains no raw internal hostnames/IPs or exact
counts.
response = requests.get(pdf_url, stream=True)
response.raise_for_status()

with open(output_path, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
🛠️ Refactor suggestion
Add request timeout, UA, and skip keep-alive chunks when downloading.
Prevents hangs and follows good HTTP hygiene for arXiv.
- response = requests.get(pdf_url, stream=True)
- response.raise_for_status()
-
- with open(output_path, 'wb') as f:
- for chunk in response.iter_content(chunk_size=8192):
- f.write(chunk)
+ # Use a UA, streaming, and sensible connect/read timeouts
+ headers = {"User-Agent": "HADES-Lab/1.0 (+contact@example.com)"} # set to your contact
+ with requests.get(pdf_url, stream=True, timeout=(5, 30), headers=headers) as response:
+ response.raise_for_status()
+ with open(output_path, 'wb') as f:
+ for chunk in response.iter_content(chunk_size=8192):
+ if chunk: # skip keep-alive chunks
+ f.write(chunk)
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/import_papers.py around lines 55 to 61, the
download request lacks a timeout, a User-Agent header, and doesn't skip
keep-alive (empty) chunks; update the requests.get call to include a timeout
(e.g. timeout=10) and headers with a descriptive User-Agent (and optionally
Connection: close), and when writing response.iter_content(chunk_size=8192) skip
empty chunks by doing if not chunk: continue before f.write(chunk) so keep-alive
fragments are ignored.
cmd = [
    sys.executable,
    str(pipeline_script),
    "--config", str(config_path),
    "--count", str(count),
    "--arango-password", arango_password
]

print(f"Running command:")
print(f"  {' '.join(cmd[:3])}")
print(f"  {' '.join(cmd[3:5])}")
print(f"  {' '.join(cmd[5:7])}")
print(f"  --arango-password [HIDDEN]")
print()
🛠️ Refactor suggestion
Don’t pass DB password via CLI; use env var. Also fix Ruff’s F541 f-strings.
Process args are visible via ps; switch to environment passing and simplify logging prints (no placeholders → no f-strings).
Apply:
@@
- cmd = [
+ cmd = [
sys.executable,
str(pipeline_script),
"--config", str(config_path),
- "--count", str(count),
- "--arango-password", arango_password
+ "--count", str(count),
]
- print(f"Running command:")
- print(f" {' '.join(cmd[:3])}")
- print(f" {' '.join(cmd[3:5])}")
- print(f" {' '.join(cmd[5:7])}")
- print(f" --arango-password [HIDDEN]")
+ print("Running command:")
+ print(" " + " ".join(cmd))
+ print(" ARANGO_PASSWORD passed via environment")
@@
- result = subprocess.run(
+ env = os.environ.copy()
+ env["ARANGO_PASSWORD"] = arango_password
+ result = subprocess.run(
cmd,
cwd=str(pipeline_dir),
- capture_output=False, # Show output in real-time
- text=True
+ capture_output=False, # Show output in real-time
+ text=True,
+ env=env
 )

Also applies to: 84-89
🧰 Tools
🪛 Ruff (0.12.2)
72-72: f-string without any placeholders
Remove extraneous f prefix
(F541)
76-76: f-string without any placeholders
Remove extraneous f prefix
(F541)
def check_database() -> Tuple[bool, List[str]]:
    """Check database connectivity."""
    issues = []

    try:
        from core.database import ArangoDBManager

        config = {
            'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
            'database': 'academy_store',
            'username': 'root',
            'password': os.getenv('ARANGO_PASSWORD')
        }
🛠️ Refactor suggestion
Fix import to use absolute module path and avoid fragile package re-exports
from core.database import ArangoDBManager likely fails unless re-exported in core/database/__init__.py. Import directly:
- from core.database import ArangoDBManager
+ from core.database.arango_db_manager import ArangoDBManager
🤖 Prompt for AI Agents
In tools/arxiv/tests/validate_pipeline.py around lines 102 to 114, the code uses
a re-exported import "from core.database import ArangoDBManager" which is
fragile; replace it with a direct import from the module that actually defines
ArangoDBManager (for example "from core.database.manager import ArangoDBManager"
or the correct defining module), so the test does not rely on package re-exports
and will import reliably.
required_collections = [
    'arxiv_papers',
    'arxiv_chunks',
    'arxiv_embeddings',
    'arxiv_structures'
]
🛠️ Refactor suggestion
Align required collections with the manager’s schema
arxiv_structures isn’t created by _ensure_collections; the actual collections are arxiv_equations, arxiv_tables, and arxiv_images. Update the check:
- required_collections = [
- 'arxiv_papers',
- 'arxiv_chunks',
- 'arxiv_embeddings',
- 'arxiv_structures'
- ]
+ required_collections = [
+ 'arxiv_papers',
+ 'arxiv_chunks',
+ 'arxiv_embeddings',
+ 'arxiv_equations',
+ 'arxiv_tables',
+ 'arxiv_images',
+ ]
🤖 Prompt for AI Agents
In tools/arxiv/tests/validate_pipeline.py around lines 124 to 129, the
required_collections list incorrectly expects 'arxiv_structures' which is not
created by _ensure_collections; replace it with the actual collections created
by the manager: 'arxiv_equations', 'arxiv_tables', and 'arxiv_images'. Update
the required_collections list to include these three names (and remove
'arxiv_structures'), so the test aligns with the manager schema and validates
the real collections created by _ensure_collections.
# Process first 5 files as demo
demo_tasks = tasks[:5]
logger.info(f"\n   Processing {len(demo_tasks)} files as demo...")

results = processor.process_documents(demo_tasks)

# Step 3: Show results
logger.info("\n3. Processing Results:")
logger.info(f"   - Total processed: {results.get('total_processed', 0)}")
logger.info(f"   - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
logger.info(f"   - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")
Fix result keys: use nested extraction/embedding summaries from GenericDocumentProcessor.
Current keys (extraction_success, etc.) don’t exist and always print zeros.
- logger.info("\n3. Processing Results:")
- logger.info(f" - Total processed: {results.get('total_processed', 0)}")
- logger.info(f" - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed")
- logger.info(f" - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed")
+ logger.info("\n3. Processing Results:")
+ extraction = results.get('extraction', {}) or {}
+ embedding = results.get('embedding', {}) or {}
+ logger.info(f" - Total processed: {results.get('total_processed', 0)}")
+ logger.info(f" - Extraction phase: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
+ logger.info(f" - Embedding phase: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Process first 5 files as demo | |
| demo_tasks = tasks[:5] | |
| logger.info(f"\n Processing {len(demo_tasks)} files as demo...") | |
| results = processor.process_documents(demo_tasks) | |
| # Step 3: Show results | |
| logger.info("\n3. Processing Results:") | |
| logger.info(f" - Total processed: {results.get('total_processed', 0)}") | |
| logger.info(f" - Extraction phase: {results.get('extraction_success', 0)} success, {results.get('extraction_failed', 0)} failed") | |
| logger.info(f" - Embedding phase: {results.get('embedding_success', 0)} success, {results.get('embedding_failed', 0)} failed") | |
| # Step 3: Show results | |
| logger.info("\n3. Processing Results:") | |
| extraction = results.get('extraction', {}) or {} | |
| embedding = results.get('embedding', {}) or {} | |
| logger.info(f" - Total processed: {results.get('total_processed', 0)}") | |
| logger.info(f" - Extraction phase: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed") | |
| logger.info(f" - Embedding phase: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed") |
🤖 Prompt for AI Agents
In tools/github/demo_github_pipeline.py around lines 84 to 95, the log uses
non-existent flat keys like extraction_success and embedding_success which
always print zeros; update the log to read the nested summaries produced by
GenericDocumentProcessor (e.g. results.get("extraction", {}).get("success", 0)
and results.get("extraction", {}).get("failed", 0) and similarly for
results.get("embedding", {}).get("success", 0) / .get("failed", 0)), keeping
fallbacks to 0 so missing keys don’t crash, and adjust the logger.info calls to
use those nested values.
FOR paper IN arxiv_embeddings
  FOR code IN github_embeddings
    LET similarity = DOT(paper.embedding, code.embedding)
    FILTER similarity > 0.8
    RETURN {
      paper: paper.title,
      code: code.source_file,
      similarity: similarity
    }
💡 Verification agent
🧩 Analysis chain
AQL uses an undefined DOT() function; the example won’t run on vanilla ArangoDB.
Replace with an available similarity function (e.g., COSINE_SIMILARITY or COSINE_DISTANCE depending on ArangoDB version), or reference a precomputed field.
- LET similarity = DOT(paper.embedding, code.embedding)
+ // Requires ArangoDB vector functions (3.11/3.12+) or your own UDF.
+ LET similarity = 1 - COSINE_DISTANCE(paper.embedding, code.embedding)

🌐 Web query:
Does ArangoDB AQL provide built-in cosine similarity or distance functions for vector arrays (version 3.11+)? If so, what are the function names and usage?
💡 Result:
Yes.
Built-in AQL vector functions available in ArangoDB 3.11:
- COSINE_SIMILARITY(x, y) — returns cosine similarity between two numeric arrays (or a 2D array and a 1D array). Example: COSINE_SIMILARITY([0,1], [1,0]) → 0. [1][2]
- L1_DISTANCE(x, y) — Manhattan (L1) distance between two numeric arrays. Example: L1_DISTANCE([-1,-1], [2,2]) → 6. [2]
- L2_DISTANCE(x, y) — Euclidean (L2) distance between two numeric arrays. Example: L2_DISTANCE([1,1], [5,2]) → 4.1231056. [2]
Notes:
- Vector index–accelerated functions for approximate nearest-neighbor searches (e.g., APPROX_NEAR_COSINE and APPROX_NEAR_L2) were introduced later (3.12+) and require a vector index and the experimental vector-index option; those are used in queries like: SORT APPROX_NEAR_COSINE(doc.vector, @q) DESC. [3]
Sources: ArangoDB AQL function docs and release notes. [1][2][3]
Replace undefined DOT() with built-in AQL functions
File: tools/github/README.md lines 139–147
- LET similarity = DOT(paper.embedding, code.embedding)
+ LET similarity = COSINE_SIMILARITY(paper.embedding, code.embedding)

Use COSINE_SIMILARITY(x, y) for cosine similarity (ArangoDB 3.11+); for distances, use L1_DISTANCE or L2_DISTANCE; in 3.12+ you can also leverage APPROX_NEAR_COSINE/APPROX_NEAR_L2 for index-accelerated searches.
🤖 Prompt for AI Agents
In tools/github/README.md around lines 139 to 147, the query uses an undefined
DOT() function; replace it with the built-in AQL similarity functions: compute
similarity = COSINE_SIMILARITY(paper.embedding, code.embedding) and use FILTER
similarity > 0.8; alternatively for distances use L1_DISTANCE or L2_DISTANCE,
and on ArangoDB 3.12+ consider switching to APPROX_NEAR_COSINE or APPROX_NEAR_L2
for index-accelerated searches and better performance.
Review continued from previous batch...
# Calculate metrics
metrics = self._calculate_metrics(tree, content)

# Extract structure
structure = self._extract_structure(tree, content, language)

return {
    'symbols': symbols,
    'metrics': metrics,
    'structure': structure,
    'language': language,
    'tree_sitter_version': tree_sitter.__version__ if hasattr(tree_sitter, '__version__') else 'unknown'
🛠️ Refactor suggestion
Metrics use the wrong language; pass the detected language into _calculate_metrics.
Currently, _calculate_metrics infers language from tree.language (which isn’t available on Tree objects), silently falling back to "python". Complexity is miscounted for non-Python files.
Apply this diff:
- metrics = self._calculate_metrics(tree, content)
+ metrics = self._calculate_metrics(tree, content, language)

-def _calculate_metrics(self, tree, content: str) -> Dict[str, Any]:
+def _calculate_metrics(self, tree, content: str, language: str) -> Dict[str, Any]:
@@
- # Get language-specific control flow nodes or use a default set
- language = getattr(tree, 'language', 'python')
- cf_nodes = control_flow_nodes.get(language, control_flow_nodes['python'])
+ # Use the detected language passed from caller
+ cf_nodes = control_flow_nodes.get(language, control_flow_nodes['python'])

Also applies to: 687-731
🤖 Prompt for AI Agents
In core/framework/extractors/tree_sitter_extractor.py around lines 155-166 (and
the similar block around lines 687-731), _calculate_metrics is being called
without the detected language, causing it to infer language incorrectly from
tree and default to "python"; update calls to pass the detected language
variable (e.g., metrics = self._calculate_metrics(tree, content, language)) and
modify _calculate_metrics to accept a language parameter and use that value
instead of trying to read tree.language or defaulting to "python".
name_node = node.child_by_field_name('name')
if name_node:
    # Check for generator by looking for '*' token
    # Rely on the node's generator attribute for JS/TS
    is_generator = bool(getattr(node, 'generator', False))

    symbols['functions'].append({
        'name': content[name_node.start_byte:name_node.end_byte],
        'line': name_node.start_point[0] + 1,
        'async': self._has_child_type(node, 'async'),
        'generator': is_generator
    })
🛠️ Refactor suggestion
Fix JS/TS generator/async detection.
Tree-sitter nodes don’t expose a “generator” boolean. The current getattr(...) will always be False.
Apply this diff:
- if name_node:
- # Check for generator by looking for '*' token
- # Rely on the node's generator attribute for JS/TS
- is_generator = bool(getattr(node, 'generator', False))
-
- symbols['functions'].append({
- 'name': content[name_node.start_byte:name_node.end_byte],
- 'line': name_node.start_point[0] + 1,
- 'async': self._has_child_type(node, 'async'),
- 'generator': is_generator
- })
+ if name_node:
+ prefix = content[node.start_byte:name_node.start_byte]
+ is_async = self._has_child_type(node, 'async') or prefix.lstrip().startswith('async')
+ is_generator = '*' in prefix
+ symbols['functions'].append({
+ 'name': content[name_node.start_byte:name_node.end_byte],
+ 'line': name_node.start_point[0] + 1,
+ 'async': is_async,
+ 'generator': is_generator,
+ })
🤖 Prompt for AI Agents
In core/framework/extractors/tree_sitter_extractor.py around lines 321 to 333,
the code uses getattr(node, 'generator', False) which is invalid for tree-sitter
nodes and will always be False; replace that with a real check for the '*'
generator token by scanning the node for a child token or character '*' before
the function name (e.g., inspect node.children or the source slice between
node.start_byte and name_node.start_byte for a '*' character) and set
is_generator based on that check (keep the async detection using
self._has_child_type(node, 'async')).
def __init__(self, config_path: str):
    # Load configuration
    import json
    with open(config_path, 'r') as f:
        self.config = json.load(f)
Fix example: loads YAML with json.load (incorrect).
Use yaml.safe_load to match the YAML config referenced.
- import json
- with open(config_path, 'r') as f:
- self.config = json.load(f)
+ import yaml
+ with open(config_path) as f:
+ self.config = yaml.safe_load(f)
🤖 Prompt for AI Agents
In docs/adr/0005-arxiv-pipeline-v2-architecture.md around lines 51 to 56, the
constructor currently uses json.load to parse a YAML configuration file; change
this to import and use yaml.safe_load so the YAML config is parsed correctly
(add "import yaml" and replace json.load with yaml.safe_load when reading the
file), ensuring the file is still opened in text mode and assigned to
self.config.
# Connect to ArangoDB
client = ArangoClient(hosts='http://192.168.1.69:8529')
self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')
Secret leak: hardcoded Arango root password and host
Credentials and host must not be in code. Rotate the password immediately and switch to env/config-based loading.
+import os
@@
- client = ArangoClient(hosts='http://192.168.1.69:8529')
- self.db = client.db('academy_store', username='root', password='1luv93ngu1n$')
+ host = os.getenv("ARANGO_HOST", "http://localhost:8529")
+ db_name = os.getenv("ARANGO_DB", "academy_store")
+ user = os.getenv("ARANGO_USER", "root")
+ password = os.getenv("ARANGO_PASSWORD")
+ if not password:
+ raise ValueError("ARANGO_PASSWORD not set")
+ client = ArangoClient(hosts=host)
+ self.db = client.db(db_name, username=user, password=password)

If this secret was committed, consider revoking it and auditing access.
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/analyze_theory_practice_bridges.py around
lines 28-31, the ArangoDB host and root password are hardcoded; replace these
with configuration-driven loading (e.g., read ARANGO_HOST, ARANGO_USER,
ARANGO_PASSWORD from environment variables or a secured config store), validate
presence and fail fast with a clear error if missing, and remove the plaintext
credentials from the file; after updating, rotate/revoke the exposed credentials
and ensure the repo history and any CI secrets do not contain the old values
(use git-secret / git-filter-repo or similar if necessary).
for i, p_emb in enumerate(paper_embeddings):
    for j, c_emb in enumerate(code_embeddings):
        # Cosine similarity (1 - cosine distance)
        similarity = 1 - cosine(p_emb, c_emb)
        similarity_matrix[i, j] = similarity
🛠️ Refactor suggestion
NaN-safe cosine similarity
Guard against zero vectors producing NaNs.
- similarity = 1 - cosine(p_emb, c_emb)
+ similarity = 1 - cosine(p_emb, c_emb)
+ if not np.isfinite(similarity):
+ similarity = 0.0
similarity_matrix[i, j] = similarity
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/analyze_theory_practice_bridges.py around
lines 160 to 165, the cosine calculation can produce NaN when one or both
vectors are zero; modify the loop to compute cosine similarity with an explicit
norm check: compute the L2 norms of p_emb and c_emb, if either norm is zero set
similarity to 0.0 (or another defined fallback) else compute similarity =
dot(p_emb, c_emb) / (norm_p * norm_c); assign that value to similarity_matrix[i,
j]. Ensure no NaNs are written by handling the zero-norm case before division.
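A minimal sketch of the norm-guarded variant described in that prompt, assuming NumPy embeddings; the helper name is ours, and the experiment may equally keep the `np.isfinite` guard from the diff above:

```python
import numpy as np


def safe_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity that returns 0.0 instead of NaN for zero vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # defined fallback for empty/zero embeddings
    return float(np.dot(a, b) / (norm_a * norm_b))


# Inside the loop:
# similarity_matrix[i, j] = safe_cosine_similarity(p_emb, c_emb)
```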
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
    PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
    echo "Found $PAPER_COUNT papers in existing collections"

    if [ "$PAPER_COUNT" -lt 20000 ]; then
        echo "Collection has less than 20,000 papers. Running extended collector..."
        cd "$ARXIV_DIR/scripts"
        python collect_ai_papers_extended.py
    else
        echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}"
    fi
🛠️ Refactor suggestion
Add validation for paper count parsing
The paper count extraction using awk could fail silently if the file format is unexpected. Consider adding error handling for the arithmetic operation.
Apply this improved version with validation:
COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections"
if [ -d "$COLLECTION_DIR" ]; then
PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}')
+ if ! [[ "$PAPER_COUNT" =~ ^[0-9]+$ ]]; then
+ echo -e "${RED}ERROR: Invalid paper count detected${NC}"
+ exit 1
+ fi
echo "Found $PAPER_COUNT papers in existing collections"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections" | |
| if [ -d "$COLLECTION_DIR" ]; then | |
| PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}') | |
| echo "Found $PAPER_COUNT papers in existing collections" | |
| if [ "$PAPER_COUNT" -lt 20000 ]; then | |
| echo "Collection has less than 20,000 papers. Running extended collector..." | |
| cd "$ARXIV_DIR/scripts" | |
| python collect_ai_papers_extended.py | |
| else | |
| echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}" | |
| fi | |
| COLLECTION_DIR="$ARXIV_DIR/scripts/data/arxiv_collections" | |
| if [ -d "$COLLECTION_DIR" ]; then | |
| PAPER_COUNT=$(find "$COLLECTION_DIR" -name "arxiv_ids_*.txt" -exec wc -l {} \; | awk '{sum+=$1} END {print sum}') | |
| if ! [[ "$PAPER_COUNT" =~ ^[0-9]+$ ]]; then | |
| echo -e "${RED}ERROR: Invalid paper count detected${NC}" | |
| exit 1 | |
| fi | |
| echo "Found $PAPER_COUNT papers in existing collections" | |
| if [ "$PAPER_COUNT" -lt 20000 ]; then | |
| echo "Collection has less than 20,000 papers. Running extended collector..." | |
| cd "$ARXIV_DIR/scripts" | |
| python collect_ai_papers_extended.py | |
| else | |
| echo -e "${GREEN}✅ Sufficient papers available ($PAPER_COUNT)${NC}" | |
| fi |
| if [ -d "$STAGING_DIR" ]; then | ||
| rm -rf "$STAGING_DIR"/* | ||
| echo "✅ Staging directory cleaned" | ||
| else | ||
| mkdir -p "$STAGING_DIR" | ||
| echo "✅ Staging directory created" | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix potential dangerous expansion in staging directory cleanup
The rm -rf command on line 87 could expand to /* if $STAGING_DIR is unexpectedly empty, potentially deleting all files on the system.
Apply this fix to ensure safe directory cleanup:
# Step 2: Clean up staging area
echo -e "${YELLOW}Step 2: Cleanup${NC}"
echo "Cleaning staging directory..."
STAGING_DIR="/dev/shm/weekend_staging"
if [ -d "$STAGING_DIR" ]; then
- rm -rf "$STAGING_DIR"/*
+ rm -rf "${STAGING_DIR:?}"/*
echo "✅ Staging directory cleaned"
else
mkdir -p "$STAGING_DIR"
echo "✅ Staging directory created"
fi
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 87-87: Use "${var:?}" to ensure this never expands to /* .
(SC2115)
🤖 Prompt for AI Agents
In tools/arxiv/scripts/run_weekend_test.sh around lines 86 to 92, the rm -rf
"$STAGING_DIR"/* can expand to /* if $STAGING_DIR is empty; ensure $STAGING_DIR
is non-empty and existing, then replace the unsafe rm usage with a safe deletion
of only the directory contents (for example check [ -n "$STAGING_DIR" ] and [ -d
"$STAGING_DIR" ] first, then use a command that deletes only its contents such
as find "$STAGING_DIR" -mindepth 1 -delete or find "$STAGING_DIR" -mindepth 1
-maxdepth 1 -exec rm -rf -- {} +) so you never run rm on / or the root glob when
the variable is empty.
pipeline = ACIDPhasedPipeline(
    config_path=str(self.config_path),
    arango_password=self.arango_password
)
💡 Verification agent
🧩 Analysis chain
Verify that ACIDPhasedPipeline constructor accepts arango_password parameter
The code passes arango_password to ACIDPhasedPipeline constructor, but this parameter doesn't appear in the constructor signature shown in the relevant code snippets.
🏁 Script executed:
#!/bin/bash
# Check the ACIDPhasedPipeline constructor signature
ast-grep --pattern 'class ACIDPhasedPipeline:
$$$
def __init__(self, $_):
$$$'

Length of output: 17365
ACIDPhasedPipeline constructor missing arango_password parameter
The test at tools/arxiv/tests/test_large_scale_processing.py:108–111 calls
ACIDPhasedPipeline(config_path=…, arango_password=…), but the class’s __init__ in tools/arxiv/pipelines/arxiv_pipeline.py only accepts config_path (line 836). Add an arango_password parameter to its signature or remove it from the test.
🤖 Prompt for AI Agents
In tools/arxiv/pipelines/arxiv_pipeline.py around line 836, the
ACIDPhasedPipeline.__init__ only accepts config_path but the test constructs it
with arango_password; update the constructor signature to accept an optional
arango_password parameter (e.g., arango_password=None), assign it to
self.arango_password, and propagate it to any internal Arango/DB client creation
code that requires a password; if no client currently uses it, still store the
attribute to preserve test compatibility.
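A minimal sketch of the constructor change that prompt describes, assuming the rest of __init__ stays as it is today; the attribute name and environment fallback are our choices, not code from the PR:

```python
import os


class ACIDPhasedPipeline:
    def __init__(self, config_path: str, arango_password: str | None = None):
        self.config_path = config_path
        # Prefer an explicitly passed password, otherwise fall back to the
        # environment variable the rest of the tooling already reads.
        self.arango_password = arango_password or os.environ.get("ARANGO_PASSWORD")
        # ... the existing config loading and phase setup would continue here ...
```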
results = pipeline.process_papers(batch, count=len(batch))

total_processed += results.get('successful', 0)
total_failed += results.get('failed', 0)
💡 Verification agent
🧩 Analysis chain
Pipeline's process_papers method signature inconsistency
The code calls pipeline.process_papers() with various arguments including detailed_timing=True, but the pipeline class shown in relevant snippets doesn't have this method. Only the run() method is visible in the ACIDPhasedPipeline class.
Also applies to: 196-198, 365-369
🏁 Script executed:
#!/bin/bash
# Search for process_papers method in ACIDPhasedPipeline class
ast-grep --pattern 'class ACIDPhasedPipeline:
$$$
def process_papers($$$):
$$$'
# Also check if there's a different pipeline class being used
rg -n "def process_papers" --type=pyLength of output: 197
🏁 Script executed:
#!/bin/bash
# Display the definition of process_papers in OnDemandProcessor
sed -n '115,135p' tools/arxiv/utils/on_demand_processor.py

Length of output: 655
🏁 Script executed:
#!/bin/bash
# Show imports and pipeline instantiation in test file
sed -n '1,80p' tools/arxiv/tests/test_large_scale_processing.py

Length of output: 2484
🏁 Script executed:
#!/bin/bash
# Locate and display the ACIDPhasedPipeline class definition and its methods
file=$(rg -l "class ACIDPhasedPipeline" --type=py)
echo "Found pipeline file: $file"
sed -n '1,200p' "$file"

Length of output: 6245
ACIDPhasedPipeline lacks process_papers—update tests or implement method
Tests in tools/arxiv/tests/test_large_scale_processing.py (lines 122–126, 196–198, 365–369) invoke pipeline.process_papers, but ACIDPhasedPipeline only defines run(). Either add a process_papers alias/wrapper in the pipeline class or change the tests to call pipeline.run() with the appropriate arguments.
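If the alias route is taken, a thin wrapper would be enough. A sketch only, assuming run() takes no arguments and drives the paper list from the loaded config (the stub run() below stands in for the real method):

```python
class ACIDPhasedPipeline:  # existing class; only the new shim is sketched here
    def run(self):  # placeholder for the real run() implementation
        return {}

    def process_papers(self, papers=None, count=None, detailed_timing=False, **kwargs):
        """Compatibility shim so tests may call process_papers(...)."""
        # Stored for reference only; the real run() is assumed to read the
        # paper list from the pipeline's config rather than from these args.
        self._requested_papers = papers
        self._requested_count = count
        self._detailed_timing = detailed_timing
        return self.run()
```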
output_files = write_outputs(
    out_dir=output_dir,
    prefix=output.get('prefix', 'arxiv_ids'),
    ids=ids,
    with_pdf=with_pdf,
    missing_pdf=missing_pdf,
    stats=stats,
    monthly=monthly,
    suffix=stamp,
    update_symlinks=output.get('update_symlinks', True)
)
Bug: write_outputs returns None; generate_list expects a dict and loses file paths.
This silently makes output_files None. Either change write_outputs to return paths or compute them here after calling it.
- output_files = write_outputs(
- out_dir=output_dir,
- prefix=output.get('prefix', 'arxiv_ids'),
+ prefix = output.get('prefix', 'arxiv_ids')
+ update_symlinks = output.get('update_symlinks', True)
+ write_outputs(
+ out_dir=output_dir,
+ prefix=prefix,
ids=ids,
with_pdf=with_pdf,
missing_pdf=missing_pdf,
stats=stats,
monthly=monthly,
suffix=stamp,
- update_symlinks=output.get('update_symlinks', True)
+ update_symlinks=update_symlinks,
)
-
- logger.info("Output files written", files=output_files)
+ output_files = {
+ "master_ts": output_dir / f"{prefix}_sweep_{stamp}.txt",
+ "stats_ts": output_dir / f"{prefix}_sweep_stats_{stamp}.json",
+ }
+ if update_symlinks:
+ output_files["master"] = output_dir / f"{prefix}_sweep.txt"
+ output_files["stats"] = output_dir / f"{prefix}_stats.json"
+ if with_pdf:
+ output_files["with_pdfs_ts"] = output_dir / f"{prefix}_with_pdfs_{stamp}.txt"
+ if update_symlinks:
+ output_files["with_pdfs"] = output_dir / f"{prefix}_with_pdfs.txt"
+ if missing_pdf:
+ output_files["missing_pdfs_ts"] = output_dir / f"{prefix}_missing_pdfs_{stamp}.txt"
+ if update_symlinks:
+ output_files["missing_pdfs"] = output_dir / f"{prefix}_missing_pdfs.txt"
+
+ logger.info("Output files written", files={k: str(v) for k, v in output_files.items()})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| output_files = write_outputs( | |
| out_dir=output_dir, | |
| prefix=output.get('prefix', 'arxiv_ids'), | |
| ids=ids, | |
| with_pdf=with_pdf, | |
| missing_pdf=missing_pdf, | |
| stats=stats, | |
| monthly=monthly, | |
| suffix=stamp, | |
| update_symlinks=output.get('update_symlinks', True) | |
| ) | |
| # Determine prefix and symlink behavior up front | |
| prefix = output.get("prefix", "arxiv_ids") | |
| update_symlinks = output.get("update_symlinks", True) | |
| # Actually write the outputs (returns None) | |
| write_outputs( | |
| out_dir=output_dir, | |
| prefix=prefix, | |
| ids=ids, | |
| with_pdf=with_pdf, | |
| missing_pdf=missing_pdf, | |
| stats=stats, | |
| monthly=monthly, | |
| suffix=stamp, | |
| update_symlinks=update_symlinks, | |
| ) | |
| # Manually reconstruct the paths that were written | |
| output_files = { | |
| "master_ts": output_dir / f"{prefix}_sweep_{stamp}.txt", | |
| "stats_ts": output_dir / f"{prefix}_sweep_stats_{stamp}.json", | |
| } | |
| if update_symlinks: | |
| output_files["master"] = output_dir / f"{prefix}_sweep.txt" | |
| output_files["stats"] = output_dir / f"{prefix}_stats.json" | |
| if with_pdf: | |
| output_files["with_pdfs_ts"] = output_dir / f"{prefix}_with_pdfs_{stamp}.txt" | |
| if update_symlinks: | |
| output_files["with_pdfs"] = output_dir / f"{prefix}_with_pdfs.txt" | |
| if missing_pdf: | |
| output_files["missing_pdfs_ts"] = output_dir / f"{prefix}_missing_pdfs_{stamp}.txt" | |
| if update_symlinks: | |
| output_files["missing_pdfs"] = output_dir / f"{prefix}_missing_pdfs.txt" | |
| logger.info("Output files written", files={k: str(v) for k, v in output_files.items()}) |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
core/processors/chunking_strategies.py (2)
86-96: Preserve newlines in _clean_text; current normalization breaks paragraph-aware strategies
Collapsing all whitespace to single spaces removes newlines, which prevents SemanticChunking from detecting paragraphs/sentences and materially changes downstream behavior. Preserve newline characters while still collapsing spaces/tabs.
Apply this diff:
 def _clean_text(self, text: str) -> str:
-    """
-    Normalize input text for chunking by collapsing all whitespace to single spaces, removing null characters, and trimming leading/trailing spaces.
-
-    This produces a compact, single-line-safe string suitable for downstream tokenization and chunk boundary calculations.
-    """
-    # Remove excessive whitespace
-    text = re.sub(r'\s+', ' ', text)
-    # Remove null characters
-    text = text.replace('\x00', '')
-    return text.strip()
+    """
+    Normalize input text for chunking by collapsing runs of spaces/tabs, preserving newlines,
+    removing null characters, and trimming edges. This keeps paragraph/sentence boundaries intact.
+    """
+    # Normalize line endings
+    text = text.replace('\r\n', '\n').replace('\r', '\n')
+    # Collapse spaces/tabs but preserve newlines
+    text = re.sub(r'[^\S\n]+', ' ', text)
+    # Remove null characters
+    text = text.replace('\x00', '')
+    # Collapse 3+ newlines to exactly two (paragraph boundary), and strip
+    text = re.sub(r'\n{3,}', '\n\n', text).strip()
+    return text
476-484: SlidingWindow remainder drops tail < step_size and duplicates tokens; compute tail from last window end
- tail < step_size is currently discarded, losing content.
- Remainder starts at len(chunks) * step_size, which duplicates tokens already covered by the last window; the uncovered tail actually begins at last_start + window_size.
- Also avoid O(n^2) joins for start_char by precomputing prefix chars.
Apply this diff:
 def create_chunks(self, text: str, **kwargs) -> List[TextChunk]:
@@
     text = self._clean_text(text)
     tokens = text.split()  # Simple tokenization
+    # Precompute cumulative char positions (O(n)) for O(1) offset lookup
+    prefix_chars = [0]
+    total = 0
+    for idx, tok in enumerate(tokens):
+        if idx > 0:
+            total += 1  # space between tokens
+        total += len(tok)
+        prefix_chars.append(total)
@@
-            start_char = len(' '.join(tokens[:i]))
+            start_char = prefix_chars[i]
@@
-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
-
-                    chunk = TextChunk(
-                        text=chunk_text,
-                        start_char=start_char,
-                        end_char=start_char + len(chunk_text),
-                        chunk_index=len(chunks),
-                        metadata={
-                            'strategy': 'sliding_window_remainder',
-                            'token_count': len(remaining_tokens)
-                        }
-                    )
-                    chunks.append(chunk)
+        # Handle uncovered tail after the last full window (if any)
+        if chunks:
+            last_start = (len(chunks) - 1) * self.step_size
+            tail_start = last_start + self.window_size
+            if tail_start < len(tokens):
+                remaining_tokens = tokens[tail_start:]
+                chunk_text = ' '.join(remaining_tokens)
+                start_char = prefix_chars[tail_start]
+                chunk = TextChunk(
+                    text=chunk_text,
+                    start_char=start_char,
+                    end_char=start_char + len(chunk_text),
+                    chunk_index=len(chunks),
+                    metadata={
+                        'strategy': 'sliding_window_remainder',
+                        'token_count': len(remaining_tokens)
+                    }
+                )
+                chunks.append(chunk)

Also applies to: 501-521
♻️ Duplicate comments (2)
experiments/word2vec_evolution/experiment_config.yaml (1)
112-112: Fix: add a trailing newline at EOF to satisfy YAMLlint/CI.
Current error: “no new line character at the end of file.”
Apply:
-  - validation_report/     # Metrics and hypothesis testing results
+  - validation_report/     # Metrics and hypothesis testing results
+

core/processors/chunking_strategies.py (1)
119-123: Tokenizer object: make convert_tokens_to_string optional; fall back to ' '.join()
Requiring convert_tokens_to_string excludes many tokenizer implementations that only expose .tokenize and .decode. Accept .tokenize and use a safe fallback for reconstruction. This mirrors the earlier guidance on the tokenizer interface.
     Parameters:
@@
-        tokenizer: Optional tokenizer that can be either:
-                  - A callable function that takes a string and returns a list of tokens: Callable[[str], List[str]]
-                  - A tokenizer-like object with .tokenize(text) and .convert_tokens_to_string(tokens) methods
-                  - None (uses simple whitespace split)
+        tokenizer: Optional tokenizer that can be either:
+                  - A callable: (str) -> list[str]
+                  - A tokenizer-like object with .tokenize(text)->list[str] and optionally
+                    .convert_tokens_to_string(tokens)->str (fallback is ' '.join(tokens))
+                  - None (uses simple whitespace split)
@@
-        # Validate tokenizer interface if provided
+        # Validate tokenizer interface if provided
         if tokenizer is not None:
-            if not (callable(tokenizer) or
-                    (hasattr(tokenizer, 'tokenize') and hasattr(tokenizer, 'convert_tokens_to_string'))):
+            if not (callable(tokenizer) or hasattr(tokenizer, 'tokenize')):
                 raise TypeError(
-                    "Tokenizer must be either a callable that returns List[str], "
-                    "or an object with .tokenize() and .convert_tokens_to_string() methods"
+                    "Tokenizer must be either a callable that returns list[str], "
+                    "or an object with .tokenize(str)->list[str]. "
+                    "convert_tokens_to_string(tokens)->str is optional."
                 )
@@
-            if self.tokenizer:
-                if callable(self.tokenizer):
-                    # For callable tokenizers, join tokens with space
-                    chunk_text = ' '.join(chunk_tokens)
-                else:
-                    # For tokenizer objects, use convert_tokens_to_string method
-                    chunk_text = self.tokenizer.convert_tokens_to_string(chunk_tokens)
+            if self.tokenizer:
+                if callable(self.tokenizer):
+                    chunk_text = ' '.join(chunk_tokens)
+                else:
+                    if hasattr(self.tokenizer, 'convert_tokens_to_string'):
+                        chunk_text = self.tokenizer.convert_tokens_to_string(chunk_tokens)
+                    else:
+                        chunk_text = ' '.join(chunk_tokens)
             else:
                 chunk_text = ' '.join(chunk_tokens)

Also applies to: 131-139, 186-193
🧹 Nitpick comments (6)
experiments/word2vec_evolution/experiment_config.yaml (3)
81-101: Stage ordering may be non-deterministic if a map is parsed unordered. If the runner relies on order, prefer a list of stages with explicit names.
Minimal illustration:
```diff
-pipeline:
-  stages:
-    1_import:
-      - Download papers from ArXiv
-      - Clone GitHub repositories
-      - Extract text content from both sources
+pipeline:
+  stages:
+    - name: import
+      tasks:
+        - Download papers from ArXiv
+        - Clone GitHub repositories
+        - Extract text content from both sources
```

Confirm whether your pipeline preserves YAML map insertion order; if not, adopt the list form.
42-51: Optional: centralize thresholds with YAML anchors to reduce drift. Keeps validation text and criteria consistent.
```diff
 analysis:
   # Entropy analysis
-  entropy_window_size: 100
-  entropy_threshold_low: 0.3   # Low entropy = high clarity/bridge point
-  entropy_threshold_high: 0.7  # High entropy = conceptual complexity
+  entropy_window_size: 100
+  entropy_thresholds: &entropy_thresholds
+    low: 0.3   # Low entropy = high clarity/bridge point
+    high: 0.7  # High entropy = conceptual complexity

   # Bridge detection
   bridge_criteria:
     semantic_similarity_threshold: 0.85
     conveyance_threshold: 0.9
     where_threshold: 0.8  # Proximity in graph
+
+hypotheses:
   h1:
     name: "Theory-Practice Bridge Detection"
     description: |
       Papers and their implementation code will show low entropy (high clarity)
       at specific bridge points where theoretical concepts map directly to
       implementation patterns.
     validation: |
-      Measure entropy at points of high semantic similarity between paper
-      and code. Expect entropy < 0.3 at true bridge points.
+      Measure entropy at points of high semantic similarity between paper
+      and code. Expect entropy < *entropy_thresholds.low at true bridge points.
```

Also applies to: 56-62, 69-71, 75-80
32-46: Add a seed for reproducibility
- Under `analysis`, add `random_seed: 42` to stabilize chunking and ordering.
- Embedding dimensions (2048) already match the `jina-v4` default.

core/processors/chunking_strategies.py (3)
174-176: TokenBased start_char is O(n^2); precompute prefix chars to O(1) per chunk. Repeated joins inside the loop are quadratic. Precompute a prefix array once.
```diff
         chunks = []
         stride = self.chunk_size - self.chunk_overlap
+        # Precompute cumulative char positions for O(1) start offsets
+        prefix_chars = [0]
+        total = 0
+        for idx, tok in enumerate(tokens):
+            if idx > 0:
+                total += 1  # space between tokens
+            total += len(tok)
+            prefix_chars.append(total)
@@
-            start_char = len(' '.join(tokens[:i]))
+            start_char = prefix_chars[i]
```

Also applies to: 196-197
17-17: Modernize typing per Ruff UP035/UP007: use built-in generics and `|` unions. Adopt built-in generics and import Callable from collections.abc. Improves readability and aligns with current typing best practices.
```diff
-from typing import List, Dict, Any, Optional, Tuple, Callable, Union
+from collections.abc import Callable
+from typing import Any
@@
-        tokenizer: Optional[Union[Callable[[str], List[str]], Any]] = None
+        tokenizer: Callable[[str], list[str]] | Any | None = None
```

Follow-up (optional): Convert other annotations to built-ins, e.g.,
`List[TextChunk]` -> `list[TextChunk]`, `Dict[str, Any]` -> `dict[str, Any]`.

Also applies to: 111-111
85-96: Offer: add targeted tests to pin behavior. I can add tests to prevent regressions (a sketch follows the list):
- _clean_text preserves newlines and paragraph boundaries (round-trips).
- SlidingWindow covers all tokens including tails < step_size; no gaps.
- Semantic forced split yields strictly increasing, unique chunk_index values.
Do you want me to open a follow-up PR with these tests?
Also applies to: 464-524, 368-399
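A minimal pytest sketch of the offered regression tests. The class names (`SlidingWindowChunking`, `SemanticChunking`) and constructor arguments are assumptions; only `create_chunks`, `TextChunk.text`, and `TextChunk.chunk_index` are taken from the code under review, so adjust the imports to the actual API in `chunking_strategies.py`.

```python
# Hypothetical test module; class names and constructor signatures are assumed.
from core.processors.chunking_strategies import SlidingWindowChunking, SemanticChunking


def test_sliding_window_covers_all_tokens():
    # Window/step chosen so the text leaves a short tail after the last full window.
    strategy = SlidingWindowChunking(window_size=8, step_size=5)  # assumed signature
    text = " ".join(f"tok{i}" for i in range(23))
    chunks = strategy.create_chunks(text)

    covered = set()
    for chunk in chunks:
        covered.update(chunk.text.split())
    # Every token, including the tail shorter than step_size, must appear in some chunk.
    assert covered == set(text.split())


def test_forced_split_indices_are_unique_and_increasing():
    strategy = SemanticChunking(max_chunk_size=10)  # assumed signature
    text = " ".join(f"w{i}" for i in range(100))
    chunks = strategy.create_chunks(text)

    indices = [c.chunk_index for c in chunks]
    assert indices == sorted(indices)          # strictly ordered
    assert len(indices) == len(set(indices))   # no duplicates from forced splits
```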
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- core/processors/chunking_strategies.py (12 hunks)
- experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Prefer absolute imports following the shown module structure (e.g., core.framework., tools.arxiv.pipelines.)
Files:
core/processors/chunking_strategies.py
🧬 Code graph analysis (1)
core/processors/chunking_strategies.py (1)
experiments/word2vec_evolution/generate_embeddings.py (1)
chunk_text(30-57)
🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml
[error] 112-112: no new line character at the end of file
(new-line-at-end-of-file)
🪛 Ruff (0.12.2)
core/processors/chunking_strategies.py
17-17: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
17-17: typing.List is deprecated, use list instead
(UP035)
17-17: typing.Dict is deprecated, use dict instead
(UP035)
17-17: typing.Tuple is deprecated, use tuple instead
(UP035)
111-111: Use X | Y for type annotations
Convert to X | Y
(UP007)
🔇 Additional comments (3)
experiments/word2vec_evolution/experiment_config.yaml (2)
1-112: Well-structured experiment config. Clear sections, sensible defaults, and readable comments. Nice.
13-31: Sanity check passed: arXiv IDs and GitHub repo URLs resolve successfully.

core/processors/chunking_strategies.py (1)
41-46: Docstring and API clarifications: LGTM. Clearer parameter semantics, return types, and behavior descriptions improve the API surface without changing runtime behavior.
Also applies to: 51-57, 72-82, 114-127, 147-159, 241-248, 256-269, 344-348, 355-362, 402-417, 449-457, 466-475, 540-554
| """ | ||
| Force-split an oversized text into consecutive TextChunk objects of at most `self.max_chunk_size` tokens. | ||
| The text is split on whitespace into words and rejoined into chunks of up to `max_chunk_size` words. `start_char` is used as the initial character offset for the first chunk; subsequent chunks' `start_char`/`end_char` are computed by advancing the offset by the length of the produced chunk text plus a single separating space. Character positions are therefore approximate and do not preserve original spacing or punctuation. | ||
| Parameters: | ||
| text (str): The input text to split. | ||
| start_char (int): Character index in the original document corresponding to the start of `text`. | ||
| Returns: | ||
| List[TextChunk]: A list of TextChunk objects with metadata containing `strategy='semantic_forced_split'` and `token_count`. Chunk indices are zero-based and reflect the order of the produced chunks. | ||
| """ |
chunk_index resets inside _force_split_text; leads to duplicate/incorrect indices
Chunks produced here start from 0 regardless of prior chunks, causing duplicate indices in the final list. Provide a base_index to offset numbering.
```diff
-def _force_split_text(self, text: str, start_char: int) -> List[TextChunk]:
+def _force_split_text(self, text: str, start_char: int, base_index: int = 0) -> List[TextChunk]:
@@
-    for i in range(0, len(words), self.max_chunk_size):
+    for i in range(0, len(words), self.max_chunk_size):
         chunk_words = words[i:i + self.max_chunk_size]
         chunk_text = ' '.join(chunk_words)
@@
-        chunks.append(TextChunk(
+        chunks.append(TextChunk(
             text=chunk_text,
             start_char=start_char,
             end_char=start_char + len(chunk_text),
-            chunk_index=len(chunks),
+            chunk_index=base_index + len(chunks),
             metadata={
                 'strategy': 'semantic_forced_split',
                 'token_count': len(chunk_words)
             }
         ))
```

Also update the call site to pass the current offset (in create_chunks):
```diff
-        chunks.extend(self._force_split_text(para, current_chunk_start))
+        chunks.extend(self._force_split_text(para, current_chunk_start, base_index=len(chunks)))
```

Also applies to: 383-399
Actionable comments posted: 2
♻️ Duplicate comments (1)
experiments/word2vec_evolution/process_github_repos.py (1)
100-127: Good: now checking returncode; add a timeout to avoid hangs. You implemented the earlier suggestion to check return codes. Add a reasonable timeout so setup can’t hang the run.
```diff
-        result = subprocess.run(
+        result = subprocess.run(
             [sys.executable, str(setup_script)],
             capture_output=True,
             text=True,
-            cwd=str(setup_script.parent)
+            cwd=str(setup_script.parent),
+            timeout=120
         )
```
🧹 Nitpick comments (4)
experiments/word2vec_evolution/process_github_repos.py (4)
22-27: Fix E402 import placement and avoid sys.path hacks. Move the GitHubPipelineManager import into main() (after sys.path manipulation) and loosen the type in process_repository to avoid top-level import ordering issues. This removes the Ruff E402 and aligns with absolute-import guidance without polluting module scope.
Apply:
```diff
@@
-# Add parent to path for imports
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
-
-# Import GitHub pipeline manager for direct use
-from tools.github.github_pipeline_manager import GitHubPipelineManager
+# Add project root to path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent.parent))
@@
-def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
+from typing import Any
+
+def process_repository(repo_url: str, name: str, manager: Any):
```

And inside main(), just before constructing manager:
```diff
@@
-    # Create manager instance
+    # Create manager instance
+    from tools.github.github_pipeline_manager import GitHubPipelineManager  # local to satisfy E402
```
128-145: Guard against missing keys in config to avoid KeyError. If github_simple.yaml lacks expected sections, access like config['arango'] will KeyError. Add get() checks with clear errors.
```diff
-    with open(config_path) as f:
-        config = yaml.safe_load(f)
+    with open(config_path) as f:
+        config = yaml.safe_load(f) or {}
@@
-    config['arango']['password'] = arango_password
+    if 'arango' not in config:
+        logger.error("Config missing 'arango' section")
+        sys.exit(1)
+    config['arango']['password'] = arango_password
+    if 'processing' not in config or 'github' not in config['processing']:
+        logger.error("Config missing 'processing.github' section")
+        sys.exit(1)
```
159-172: Persist richer per-repo metrics in results.json. Capture totals when available. Helps downstream analysis.
```diff
-        success = process_repository(repo_info['repo'], repo_info['name'], manager)
+        success = process_repository(repo_info['repo'], repo_info['name'], manager)
+        # Optionally, pull lightweight stats from manager cache/DB in future
         results.append({
             'repo': repo_info['repo'],
             'name': repo_info['name'],
             'paper': repo_info['paper'],
             'success': success,
             'timestamp': datetime.now().isoformat()
         })
```
201-209: Use sys.exit and add trailing newline. Small polish: prefer sys.exit for scripts; also fix missing newline at EOF (Ruff W292).
```diff
-if __name__ == "__main__":
-    exit(main())
+if __name__ == "__main__":
+    sys.exit(main())
+
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
experiments/word2vec_evolution/process_github_repos.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Prefer absolute imports following the shown module structure (e.g., core.framework., tools.arxiv.pipelines.)
Files:
experiments/word2vec_evolution/process_github_repos.py
🧬 Code graph analysis (1)
experiments/word2vec_evolution/process_github_repos.py (1)
tools/github/github_pipeline_manager.py (3)
GitHubPipelineManager (23-316), process_repository (53-87), main (319-366)
🪛 Ruff (0.12.2)
experiments/word2vec_evolution/process_github_repos.py
26-26: Module level import not at top of file
(E402)
209-209: No newline at end of file
Add trailing newline
(W292)
🔇 Additional comments (1)
experiments/word2vec_evolution/process_github_repos.py (1)
68-89: Repo URL handling already supported: `GitHubDocumentManager._parse_repo_url` explicitly accepts the shorthand `"owner/repo"`, and `process_repository` passes that format end-to-end, so no changes are needed.
```python
def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.

    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """
```
Normalize repo URL format to prevent processing failures
GitHubPipelineManager examples expect a full https URL, but the docstring and callers pass owner/repo. Normalize to support both to avoid parse failures at runtime.
```diff
 def process_repository(repo_url: str, name: str, manager: Any):
@@
-    logger.info(f"Processing {name} ({repo_url})...")
+    # Accept 'owner/repo' or full URL and normalize
+    repo_url_norm = repo_url if repo_url.startswith("http") else f"https://github.com/{repo_url}"
+    logger.info(f"Processing {name} ({repo_url_norm})...")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def process_repository(repo_url: str, name: str, manager: GitHubPipelineManager):
    """
    Process a single GitHub repository using the GitHub pipeline.

    Args:
        repo_url: GitHub URL (owner/repo format)
        name: Friendly name for logging
        manager: GitHubPipelineManager instance
    """
    # Accept 'owner/repo' or full URL and normalize
    repo_url_norm = repo_url if repo_url.startswith("http") else f"https://github.com/{repo_url}"
    logger.info(f"Processing {name} ({repo_url_norm})...")
    # …rest of function unchanged…
```
```python
    try:
        # Process the repository directly using the manager
        results = manager.process_repository(repo_url)

        # Check results
        if results and 'stored' in results:
            stored_count = results.get('stored', 0)
            logger.info(f"  ✓ {name} processed successfully")
            logger.info(f"    Stored {stored_count} embeddings")

            # Log additional stats if available
            if 'repository' in results:
                repo_info = results['repository']
                logger.info(f"    Repository: {repo_info.get('full_name', repo_url)}")
                if 'stats' in repo_info:
                    stats = repo_info['stats']
                    logger.info(f"    Files: {stats.get('file_count', 'N/A')}")
                    logger.info(f"    Chunks: {stats.get('chunk_count', 'N/A')}")

            return True
        else:
            logger.error(f"  ✗ {name} processing returned no results")
            return False
```
Don’t gate success on a non-guaranteed 'stored' key; log robust metrics instead
GitHubPipelineManager.main() prints total_processed/total_functions; 'stored' may not exist, leading to false negatives. Consider success when process_repository returns without exception; log known fields when present.
```diff
-    # Process the repository directly using the manager
-    results = manager.process_repository(repo_url)
+    # Process the repository directly using the manager
+    results = manager.process_repository(repo_url_norm)
@@
-    # Check results
-    if results and 'stored' in results:
-        stored_count = results.get('stored', 0)
-        logger.info(f"  ✓ {name} processed successfully")
-        logger.info(f"    Stored {stored_count} embeddings")
+    # Consider success if a result dict is returned; log available metrics
+    if isinstance(results, dict):
+        file_count = results.get('total_processed')
+        func_count = results.get('total_functions')
+        emb_count = results.get('embeddings_stored', results.get('stored'))
+        logger.info(f"  ✓ {name} processed successfully")
+        if file_count is not None or func_count is not None or emb_count is not None:
+            logger.info(
+                "    " +
+                ", ".join(
+                    s for s in [
+                        f"Files={file_count}" if file_count is not None else None,
+                        f"Functions={func_count}" if func_count is not None else None,
+                        f"Embeddings={emb_count}" if emb_count is not None else None,
+                    ] if s
+                )
+            )
@@
-        logger.error(f"  ✗ {name} processing returned no results")
+        logger.error(f"  ✗ {name} processing returned no results")
         return False
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/process_github_repos.py around lines 39 to 61,
the current logic treats the absence of a non-guaranteed 'stored' key as a
failure; change it to consider the repository processed successfully if
process_repository returns without raising an exception (i.e., results is not
None/False), and only use 'stored' (and other optional fields like 'repository',
'stats', 'file_count', 'chunk_count') for logging when present; specifically,
replace the strict "if results and 'stored' in results" gate with a check for a
truthy results object, log stored_count if available (fallback to 'N/A' or 0),
log repository and stats when present, and return True for successful
non-exceptional returns while still returning False when results is None/False.
Actionable comments posted: 1
♻️ Duplicate comments (1)
experiments/word2vec_evolution/experiment_config.yaml (1)
113-113: Fix trailing spaces and add newline at EOF to satisfy yamllint. Line 113 has stray spaces and the file lacks a final newline, which will keep yamllint failing.
Apply:
```diff
-
+
```
🧹 Nitpick comments (1)
experiments/word2vec_evolution/experiment_config.yaml (1)
47-51: Clarify naming: consider `proximity_threshold` over `where_threshold`. Improves readability; the intent is proximity in the graph.
```diff
-    where_threshold: 0.8  # Proximity in graph
+    proximity_threshold: 0.8  # Proximity in graph
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
experiments/word2vec_evolution/experiment_config.yaml(1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml
[error] 113-113: no new line character at the end of file
(new-line-at-end-of-file)
[error] 113-113: trailing spaces
(trailing-spaces)
🔇 Additional comments (2)
experiments/word2vec_evolution/experiment_config.yaml (2)
81-107: Verify loader supports list-of-strings under `pipeline.stages`.
The experiment runner uses a plain `yaml.safe_load` (no Pydantic or schema validation) to read `pipeline.stages`; confirm the code that iterates over `config['pipeline']['stages']` treats each stage value as a `List[str]` of step names rather than expecting objects with `id`/`params`.
34-36: Embedding model/dimension consistency confirmed. The “jina-v4” model outputs 2048-dimensional embeddings per the Jina Embeddings API, and all code paths (defaults in DocumentProcessor, tests, and pipelines) consistently use embedding_model “jina-v4” with embedding_dimensions 2048.
```yaml
word2vec:
  arxiv_id: "1301.3781"
  title: "Efficient Estimation of Word Representations in Vector Space"
  year: 2013
  github_repo: "https://github.com/dav/word2vec"

doc2vec:
  arxiv_id: "1405.4053"
  title: "Distributed Representations of Sentences and Documents"
  year: 2014
  github_repo: "https://github.com/bnosac/doc2vec"

code2vec:
  arxiv_id: "1803.09473"
  title: "code2vec: Learning Distributed Representations of Code"
  year: 2018
  github_repo: "https://github.com/tech-srl/code2vec"
```
💡 Verification agent
🧩 Analysis chain
Sanity-check paper → repo mappings.
Make sure these GitHub repos are the intended canonical implementations for the cited papers to avoid skewed bridge analyses.
🌐 Web query:
Are these canonical GitHub repos for the respective papers?
- 1301.3781 word2vec → github.com/dav/word2vec
- 1405.4053 doc2vec → github.com/bnosac/doc2vec
- 1803.09473 code2vec → github.com/tech-srl/code2vec
💡 Result:
Short answers:
-
1301.3781 (word2vec): The canonical/original code is the Google/ Tomas Mikolov export (tmikolov/word2vec / code.google.com archive). dav/word2vec is a community mirror/clone, not the original canonical repo. [1][2]
-
1405.4053 (doc2vec / Paragraph Vector): bnosac/doc2vec is a third‑party R package/wrapper (not the original paper authors' canonical code). The paper (Le & Mikolov) does not have a single official GitHub repo from the authors; common implementations live in Gensim and other third‑party repos (and some C/C++/wrappers such as hiyijian/doc2vec). [3][4]
-
1803.09473 (code2vec): github.com/tech-srl/code2vec is the authors’ official repository referenced by the paper (official). [5]
Sources:
[1] tmikolov/word2vec (export of original Google code).
[2] dav/word2vec (mirror; README notes “Original from https://code.google.com/p/word2vec/”).
[3] ArXiv: “Distributed Representations of Sentences and Documents” (Le & Mikolov).
[4] bnosac/doc2vec (R wrapper / third‑party implementation; points to other impls).
[5] ArXiv + tech-srl/code2vec (paper links repo).
Use canonical implementations for word2vec and doc2vec
- experiments/word2vec_evolution/experiment_config.yaml (14–31): replace word2vec github_repo with https://github.com/tmikolov/word2vec (official export; dav/word2vec is a community mirror) (github.com)
- doc2vec: authors did not publish an official repo; bnosac/doc2vec is a third-party R wrapper (github.com); for a canonical implementation, use Gensim (https://github.com/RaRe-Technologies/gensim) (en.wikipedia.org)
- code2vec mapping is correct (https://github.com/tech-srl/code2vec) (github.com)
🤖 Prompt for AI Agents
experiments/word2vec_evolution/experiment_config.yaml lines 14-31: the YAML
points to non-canonical repos for word2vec and doc2vec; update word2vec
github_repo to the official upstream https://github.com/tmikolov/word2vec and
replace doc2vec github_repo with the canonical Gensim implementation
https://github.com/RaRe-Technologies/gensim (leave code2vec mapping as-is).
- Fix lexicographic YYMM filtering bug with proper century handling in overnight test scripts
- Update ArangoDBManager import paths to use core.database module
- Remove incorrect await from synchronous insert_document calls
- Fix tokenizer interface to support both callable and object-based tokenizers
- Fix validate_arxiv_id bug that was stripping all digits
- Clean up YAML formatting in experiment config

These changes resolve critical bugs identified in the code review.
fix: Address CodeRabbit review comments from PR #17
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
tests/test_overnight_arxiv_2000.py (1)
418-425: Report the actual number of papers processed. Set this to len(papers), not the requested num_docs.
```diff
-        'total_papers_processed': num_docs,
+        'total_papers_processed': len(papers),
```

core/processors/chunking_strategies.py (2)
85-95: Fix: _clean_text collapses newlines, breaking paragraph/sentence-aware strategies. Collapsing all whitespace to a single space erases newlines, so SemanticChunking can no longer detect paragraphs/sentences after calling _clean_text. Preserve newlines while normalizing other whitespace.
```diff
     def _clean_text(self, text: str) -> str:
-        # Remove excessive whitespace
-        text = re.sub(r'\s+', ' ', text)
-        # Remove null characters
-        text = text.replace('\x00', '')
-        return text.strip()
+        # Normalize line endings and whitespace while preserving paragraph breaks
+        text = text.replace('\r\n', '\n').replace('\r', '\n')
+        # Collapse inline whitespace but keep newlines
+        text = re.sub(r'[ \t\f\v]+', ' ', text)
+        # Collapse 3+ consecutive newlines to 2 to represent paragraph breaks
+        text = re.sub(r'\n{3,}', '\n\n', text)
+        # Remove null characters
+        text = text.replace('\x00', '')
+        return text.strip()
```
503-523: SlidingWindow remainder overlaps last window; compute correct remainder start. Remainder should begin at last_start + window_size, not len(chunks) * step_size.
```diff
-        # Handle remainder if any
-        if len(tokens) > self.window_size:
-            last_index = len(chunks) * self.step_size
-            if last_index < len(tokens):
-                remaining_tokens = tokens[last_index:]
-                if len(remaining_tokens) >= self.step_size:  # Only if substantial
-                    chunk_text = ' '.join(remaining_tokens)
-                    start_char = len(' '.join(tokens[:last_index]))
+        # Handle remainder: tokens after the last full window
+        if len(tokens) > self.window_size and chunks:
+            last_start = (len(chunks) - 1) * self.step_size
+            remainder_start = last_start + self.window_size
+            if remainder_start < len(tokens):
+                remaining_tokens = tokens[remainder_start:]
+                if len(remaining_tokens) >= self.step_size:  # Only if substantial
+                    chunk_text = ' '.join(remaining_tokens)
+                    start_char = len(' '.join(tokens[:remainder_start]))
```

core/processors/generic_document_processor.py (2)
341-357: Handle extractor returning no content (None/empty) to avoid downstream NoneType errors. WORKER_DOCLING.extract may return None; _embed_and_store expects a dict. Short-circuit with a failure result instead of staging an invalid document.
```diff
         # Extract PDF
         extracted = WORKER_DOCLING.extract(task.pdf_path)
+        if not extracted:
+            return {
+                'success': False,
+                'document_id': task.document_id,
+                'error': 'Extractor returned no content'
+            }
```
429-433: Ensure overridden collections exist (e.g., github_*), or inserts will fail. ArangoDBManager.ensure_collections creates only arxiv* by default. When overriding collections, create them up front.
```diff
     # Initialize DB manager with collections
     WORKER_DB_MANAGER = ArangoDBManager(arango_config)
     WORKER_DB_MANAGER.collections = collections  # Override with source-specific collections
+    # Ensure overridden collections exist
+    for coll in collections.values():
+        if not WORKER_DB_MANAGER.db.has_collection(coll):
+            WORKER_DB_MANAGER.db.create_collection(coll)
```

tools/arxiv/arxiv_manager.py (2)
442-455: Split chunk storage: separate `arxiv_chunks` from `arxiv_embeddings` and handle missing embeddings. Aligns with the intended schema; also avoids failures when embeddings are absent.
```diff
-        # Store chunks with embeddings
-        for chunk in result.chunks:
-            chunk_doc = {
-                '_key': f"{paper_info.sanitized_id}_chunk_{chunk.chunk_index}",
-                'paper_id': paper_info.sanitized_id,
-                'arxiv_id': paper_info.arxiv_id,
-                'chunk_index': chunk.chunk_index,
-                'text': chunk.text,
-                'embedding': chunk.embedding.tolist(),
-                'start_char': chunk.start_char,
-                'end_char': chunk.end_char,
-                'context_window_used': chunk.context_window_used
-            }
-            self.db_manager.insert_document('arxiv_embeddings', chunk_doc)
+        # Store chunks and embeddings (separate collections)
+        for chunk in result.chunks:
+            chunk_key = f"{paper_info.sanitized_id}_chunk_{chunk.chunk_index}"
+            chunk_only_doc = {
+                '_key': chunk_key,
+                'paper_id': paper_info.sanitized_id,
+                'arxiv_id': paper_info.arxiv_id,
+                'chunk_index': chunk.chunk_index,
+                'text': chunk.text,
+                'start_char': chunk.start_char,
+                'end_char': chunk.end_char,
+                'context_window_used': getattr(chunk, 'context_window_used', None),
+            }
+            self.db_manager.insert_document('arxiv_chunks', chunk_only_doc)
+
+            embedding_doc = {
+                '_key': f"{chunk_key}_emb",
+                'paper_id': paper_info.sanitized_id,
+                'arxiv_id': paper_info.arxiv_id,
+                'chunk_id': chunk_key,
+                'chunk_index': chunk.chunk_index,
+                'embedding': (chunk.embedding.tolist() if getattr(chunk, 'embedding', None) is not None else None),
+            }
+            self.db_manager.insert_document('arxiv_embeddings', embedding_doc)
```
1-490: Fix linting and formatting issues in tools/arxiv/arxiv_manager.py
Ruff reports 107 issues (72 auto-fixable) across these categories:
- Import organization: unsorted imports and unused imports (`os`, `asyncio`, `typing.Union`, `dataclasses.asdict`) (I001, F401)
- Type annotations: switch from deprecated `typing.Dict/List/Tuple` to built-ins (`dict`, `list`, `tuple`) and use `X | None` instead of `Optional[X]` (UP035, UP006, UP045)
- Whitespace cleanup: remove blank lines containing spaces (W293) and add trailing newline at EOF (W292)
- Dead code: remove unused locals (`base_id`, `category`) (F841)
- Miscellaneous: remove unnecessary `open` mode argument (UP015) and replace `datetime.now(timezone.utc)` with the `datetime.now(datetime.UTC)` alias (UP017)

Apply `ruff check --fix` (add `--unsafe-fixes` for type-annotation changes) and Black formatting to resolve these issues.
♻️ Duplicate comments (10)
experiments/word2vec_evolution/experiment_config.yaml (2)
14-31: Use canonical repos for word2vec/doc2vec to avoid skewed mappings. Swap to the upstream export for word2vec and a canonical implementation for doc2vec (Gensim). Code2vec entry is fine as-is.
```diff
 word2vec:
   arxiv_id: "1301.3781"
   title: "Efficient Estimation of Word Representations in Vector Space"
   year: 2013
-  github_repo: "https://github.com/dav/word2vec"
+  github_repo: "https://github.com/tmikolov/word2vec"

 doc2vec:
   arxiv_id: "1405.4053"
   title: "Distributed Representations of Sentences and Documents"
   year: 2014
-  github_repo: "https://github.com/bnosac/doc2vec"
+  github_repo: "https://github.com/RaRe-Technologies/gensim"
```
112-112: Fix YAML lint error: add a trailing newline at EOF. CI is failing with "no new line character at the end of file". Add a single newline after Line 112.
```diff
   - validation_report/  # Metrics and hypothesis testing results
+
```

core/processors/chunking_strategies.py (1)
316-316: Fix duplicate chunk_index when force-splitting large paragraphs. _force_split_text starts indices at 0, causing duplicates when its chunks are extended into the main list. Offset indices by the current length of the outer list.
```diff
-            chunks.extend(self._force_split_text(para, current_chunk_start))
+            chunks.extend(self._force_split_text(para, current_chunk_start, base_index=len(chunks)))
```

```diff
-    def _force_split_text(self, text: str, start_char: int) -> List[TextChunk]:
+    def _force_split_text(self, text: str, start_char: int, base_index: int = 0) -> List[TextChunk]:
@@
-            chunks.append(TextChunk(
+            chunks.append(TextChunk(
                 text=chunk_text,
                 start_char=start_char,
                 end_char=start_char + len(chunk_text),
-                chunk_index=len(chunks),
+                chunk_index=base_index + len(chunks),
                 metadata={
                     'strategy': 'semantic_forced_split',
                     'token_count': len(chunk_words)
                 }
             ))
```

Also applies to: 369-401
tests/test_c_files.py (2)
133-145: Harden AQL against missing fields. Avoid errors when symbols/metrics are absent.
query = """ FOR doc IN github_papers FILTER doc.language == "c" LIMIT 5 RETURN { id: doc.document_id, language: doc.language, - functions: LENGTH(doc.symbols.functions || []), - structs: LENGTH(doc.symbols.structs || []), - includes: LENGTH(doc.symbols.includes || []), - complexity: doc.code_metrics.complexity + functions: LENGTH(COALESCE(doc.symbols.functions, [])), + structs: LENGTH(COALESCE(doc.symbols.structs, [])), + includes: LENGTH(COALESCE(doc.symbols.includes, [])), + complexity: COALESCE(doc.code_metrics.complexity, 0) } """
124-126: Fix: incorrect result keys when logging processor summary. Use nested extraction/embedding success lists from GenericDocumentProcessor.
- logger.info(f"Processed: {results.get('total_processed', 0)}") - logger.info(f"Extraction success: {results.get('extraction_success', 0)}") - logger.info(f"Embedding success: {results.get('embedding_success', 0)}") + logger.info(f"Processed: {results.get('total_processed', 0)}") + extraction_ok = len(results.get('extraction', {}).get('success', [])) + embedding_ok = len(results.get('embedding', {}).get('success', [])) + logger.info(f"Extraction success: {extraction_ok}") + logger.info(f"Embedding success: {embedding_ok}")tests/test_github_treesitter.py (3)
108-115: Use localhost for ARANGO_HOST default, not a private IP. Avoid environment-specific private IPs in test defaults.
```diff
-        'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
+        'host': os.getenv('ARANGO_HOST', 'http://localhost:8529'),
```
160-194: Guard ArangoDB query and handle connection errors. Skip DB verification when host/password are not set and catch connectivity errors to keep tests resilient.
```diff
-    # Check if Tree-sitter metadata was stored
-    if results.get('embedding_success', 0) > 0:
-        logger.info("\nChecking stored metadata in ArangoDB...")
-        from core.database.arango_db_manager import ArangoDBManager
-
-        db_manager = ArangoDBManager(config['arango'])
-
-        # Query for documents with Tree-sitter data
-        query = """
+    # Check if Tree-sitter metadata was stored (only when DB access is configured)
+    extraction = results.get('extraction', {})
+    embedding = results.get('embedding', {})
+    can_query_db = bool(config['arango'].get('host') and config['arango'].get('password'))
+    if can_query_db and len(embedding.get('success', [])) > 0:
+        logger.info("\nChecking stored metadata in ArangoDB...")
+        from core.database.arango_db_manager import ArangoDBManager
+        try:
+            db_manager = ArangoDBManager(config['arango'])
+            # Query for documents with Tree-sitter data
+            query = """
             FOR doc IN github_papers
                 FILTER doc.has_tree_sitter == true
                 LIMIT 5
                 RETURN {
                     id: doc.document_id,
                     language: doc.language,
                     has_symbols: doc.has_tree_sitter,
                     function_count: LENGTH(doc.symbols.functions),
                     class_count: LENGTH(doc.symbols.classes),
                     complexity: doc.code_metrics.complexity
                 }
             """
-
-        cursor = db_manager.db.aql.execute(query)
-        docs_with_symbols = list(cursor)
-
-        if docs_with_symbols:
-            logger.info(f"Found {len(docs_with_symbols)} documents with Tree-sitter symbols:")
-            for doc in docs_with_symbols:
-                logger.info(f"  - {doc['id']}: {doc['language']} "
-                            f"({doc['function_count']} functions, "
-                            f"{doc['class_count']} classes, "
-                            f"complexity: {doc['complexity']})")
-        else:
-            logger.info("No documents with Tree-sitter symbols found in database")
+            cursor = db_manager.db.aql.execute(query)
+            docs_with_symbols = list(cursor)
+            if docs_with_symbols:
+                logger.info(f"Found {len(docs_with_symbols)} documents with Tree-sitter symbols:")
+                for doc in docs_with_symbols:
+                    logger.info(f"  - {doc['id']}: {doc['language']} "
+                                f"({doc['function_count']} functions, "
+                                f"{doc['class_count']} classes, "
+                                f"complexity: {doc['complexity']})")
+            else:
+                logger.info("No documents with Tree-sitter symbols found in database")
+        except Exception as e:
+            logger.warning(f"Skipping DB check due to connection error: {e}")
+    else:
+        logger.info("Skipping ArangoDB verification (ARANGO_HOST/ARANGO_PASSWORD not set or no embeddings).")
```
56-66: Fix: metadata['type'] check skips all files; process tasks directly. GitHubDocumentManager tasks don’t set metadata['type']; this guard prevents any processing.
```diff
-    # Process first 3 code files
+    # Process up to 3 files
     code_count = 0
-    for task in tasks[:10]:
-        if task.metadata.get('type') == 'code':
+    for task in tasks[:10]:
         logger.info(f"\n{'='*60}")
         logger.info(f"Processing: {task.document_id}")
         logger.info(f"File: {task.pdf_path}")

         # Extract with Tree-sitter
         result = code_extractor.extract(task.pdf_path)

-        if result and result.get('symbols'):
+        if result and result.get('symbols'):
             logger.info(f"Language: {result.get('language')}")

             # Display symbol counts
             symbols = result['symbols']
             logger.info("Symbol Table:")
             for category, items in symbols.items():
                 if items:
                     logger.info(f"  {category}: {len(items)} items")
                     # Show first 3 items
                     for item in items[:3]:
                         if isinstance(item, dict):
                             name = item.get('name', item.get('statement', str(item)))
                             line = item.get('line', '?')
                             logger.info(f"    - {name} (line {line})")

             # Display metrics
             if result.get('code_metrics'):
                 metrics = result['code_metrics']
                 logger.info("Code Metrics:")
                 logger.info(f"  Lines of code: {metrics.get('lines_of_code', 0)}")
                 logger.info(f"  Complexity: {metrics.get('complexity', 0)}")
                 logger.info(f"  Max depth: {metrics.get('max_depth', 0)}")
                 logger.info(f"  Node count: {metrics.get('node_count', 0)}")

             # Display symbol hash
             if result.get('symbol_hash'):
                 logger.info(f"Symbol hash: {result['symbol_hash'][:16]}...")

             code_count += 1
             if code_count >= 3:
                 break
-        else:
-            logger.info("No Tree-sitter symbols extracted (might not be a supported language)")
+        else:
+            logger.info("No Tree-sitter symbols extracted (might not be a supported language)")
```

tools/arxiv/arxiv_manager.py (2)
101-106: Remove unused base_id in validate_arxiv_id (Ruff F841) and align comment. The regex already accepts an optional version suffix; no need to pre-strip. Also avoids the unused variable.
```diff
-        # Remove any version suffix for validation
-        base_id = re.sub(r'v\d+$', '', arxiv_id)
+        # Version suffix allowed by regex; no pre-strip needed
```
143-149: Remove unused local `category` in get_pdf_path (Ruff F841). Variable is assigned but not used.
```diff
-        category = match.group(1)
-        number = match.group(2)
+        number = match.group(2)
         yymm = number[:4]
         return cls.PDF_BASE_PATH / yymm / f"{base_id.replace('/', '_')}.pdf"
```
🧹 Nitpick comments (44)
experiments/word2vec_evolution/experiment_config.yaml (4)
38-40: Clarify chunk_overlap units (tokens vs characters vs bytes). Label the unit to prevent misconfiguration across components.
- chunk_size: 512 - chunk_overlap: 50 + chunk_size: 512 # tokens + chunk_overlap: 50 # tokens
47-51: Make bridge thresholds tunable to reduce “no results” risk. 0.85/0.9 thresholds are stringent; expose overrides via CLI/env to ease calibration.
Example (for your runner to read as optional overrides):
analysis: # Bridge detection bridge_criteria: semantic_similarity_threshold: 0.85 conveyance_threshold: 0.9 where_threshold: 0.8 # Proximity in graph + params_overrides_env: + semantic_similarity_threshold: BRIDGE_SIM_THR + conveyance_threshold: BRIDGE_CONVEY_THR + where_threshold: BRIDGE_WHERE_THR
81-92: Surface storage connection settings referenced in pipeline. You store embeddings in ArangoDB here but the config lacks connection hints. Add env-backed settings (no secrets).
analysis: embedding_model: "jina-v4" embedding_dimensions: 2048 +connections: + arangodb: + url_env: ARANGODB_URL + db_env: ARANGODB_DB + user_env: ARANGODB_USER + password_env: ARANGODB_PASSWORD + postgres: + dsn_env: POSTGRES_DSN
15-30: Add explicit arXiv URLs for convenience. Keep ids, but URLs help tooling and reviewers.
```diff
 word2vec:
   arxiv_id: "1301.3781"
+  arxiv_url: "https://arxiv.org/abs/1301.3781"
 ...
 doc2vec:
   arxiv_id: "1405.4053"
+  arxiv_url: "https://arxiv.org/abs/1405.4053"
 ...
 code2vec:
   arxiv_id: "1803.09473"
+  arxiv_url: "https://arxiv.org/abs/1803.09473"
```

tools/arxiv/arxiv_document_manager.py (7)
19-22: Avoid sys.path hacks; rely on package-absolute imports per guidelines. This is brittle and bypasses proper packaging. The coding guidelines request absolute imports without path mutation.
```diff
-import sys
-sys.path.insert(0, str(Path(__file__).parent.parent.parent))
 from core.processors.generic_document_processor import DocumentTask
```
34-49: Harden base-dir validation (check directory, not just existence). If a file exists at the path, current logic passes. Validate it’s an actual directory.
```diff
-        if not self.pdf_base_dir.exists():
+        if not self.pdf_base_dir.is_dir():
             raise ValueError(f"PDF base directory does not exist: {pdf_base_dir}")
```
64-71: Reduce duplicate logging on failed preparations. Both this method and
`_prepare_single_document` log failures; this can spam logs. Downgrade here to debug.

```diff
         for arxiv_id in arxiv_ids:
             task = self._prepare_single_document(arxiv_id)
             if task:
                 tasks.append(task)
             else:
-                logger.warning(f"Could not prepare document for {arxiv_id}")
+                logger.debug(f"Could not prepare document for {arxiv_id}")
```
110-119: Align “newest-first” behavior with implementation. Docstring promises newest-first; current sort is lexicographical by filename. Either update doc or sort by mtime. Also ignore non-YYMM dirs.
- yymm_dirs = sorted([d for d in self.pdf_base_dir.iterdir() if d.is_dir()], - reverse=True) + yymm_dirs = sorted( + [d for d in self.pdf_base_dir.iterdir() if d.is_dir() and d.name.isdigit() and len(d.name) == 4], + reverse=True, + ) @@ - pdf_files = sorted(yymm_dir.glob("*.pdf"), reverse=True) + pdf_files = sorted(yymm_dir.glob("*.pdf"), key=lambda p: p.stat().st_mtime, reverse=True)Also applies to: 122-125, 131-131
182-197: Broaden LaTeX source detection to common tarball variants. arXiv source tarballs often include additional suffixes. Fall back to any “stem*.tar.gz”.
# Check for LaTeX source latex_path = pdf_path.with_suffix('.tex') if not latex_path.exists(): # Try .tar.gz (compressed LaTeX) latex_path = pdf_path.with_suffix('.tar.gz') if not latex_path.exists(): - latex_path = None + # Fallback: any tarball starting with the stem (e.g., *source*.tar.gz) + matches = list(pdf_path.parent.glob(f"{pdf_path.stem}*.tar.gz")) + latex_path = matches[0] if matches else NoneAlso applies to: 198-208
226-237: Validate month in YYMM and keep version check. Accept only 01–12 months; everything else should be invalid.
# Check year_month is 4 digits if not year_month.isdigit() or len(year_month) != 4: return False + # Validate month range (01–12) + mm = year_month[2:] + if not (1 <= int(mm) <= 12): + return False # Check paper number is 4-6 digits, optionally followed by version like v2 base, _, version = paper_num.partition('v') if not base.isdigit() or not (4 <= len(base) <= 6): return False # If there's a version, validate it's numeric if version and not version.isdigit(): return FalseAlso applies to: 245-247, 249-255
16-18: Remove unused imports to satisfy Ruff.
`Dict`, `Any`, and `dataclass` aren’t used.

```diff
-from typing import List, Optional, Dict, Any
-from dataclasses import dataclass
+from typing import List, Optional
```

tests/test_overnight_arxiv_2000.py (9)
66-76: Clarify parameter format in docstring. Call this YYMM or YYYYMM to match normalization and the CLI.
```diff
-        start_year (str): Inclusive YYMM directory name to start searching from (e.g., "1501" = January 2015).
+        start_year (str): Inclusive year-month to start from, in YYMM or YYYYMM
+            (e.g., "1501" or "201501" = January 2015).
```
86-87: Stabilize sampling for reproducibility. Always use a local RNG and sort files so order is deterministic when taking all PDFs from a directory.
- rng = random.Random(seed) if seed is not None else random + rng = random.Random(seed) if seed is not None else random.Random() @@ - sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_files + sampled_pdfs = rng.sample(pdf_files, sample_size) if sample_size < len(pdf_files) else pdf_filesAdditionally (outside selected lines), sort the directory listing once:
```python
# at line 108
pdf_files = sorted(yymm_dir.glob("*.pdf"), key=lambda p: p.name)
```

Also applies to: 113-113
129-159: Detect LaTeX recursively to handle nested project structures. Many arXiv sources keep .tex files in subfolders; use rglob for accuracy.
- if latex_dir.exists(): - # Check for .tex files - tex_files = list(latex_dir.glob("*.tex")) - if tex_files: - latex_count += 1 + if latex_dir.exists(): + # Check for .tex files anywhere under the directory + if any(latex_dir.rglob("*.tex")): + latex_count += 1
162-189: Docstring: add the start_year parameter and format details. Keep docs aligned with the function signature and CLI.
```diff
     Parameters:
-        num_docs (int): Maximum number of ArXiv papers to include in the test (default 2000).
+        num_docs (int): Maximum number of ArXiv papers to include in the test (default 2000).
+        start_year (str): Inclusive year-month to start from, in YYMM or YYYYMM.
```
203-203: Expose the sampling seed end-to-end for reproducible runs. Forward a seed from the CLI into collect_arxiv_papers so the same corpus is selected across runs (a sketch follows the list).
If acceptable, I can add:
- run_arxiv_overnight_test(..., seed: int | None = None) and pass it to collect_arxiv_papers
- CLI flag --seed to populate it.
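A minimal sketch of the proposed plumbing, assuming run_arxiv_overnight_test and collect_arxiv_papers gain an optional `seed` parameter (which they do not have yet); the flag names beyond `--seed` are illustrative only.

```python
# Hypothetical CLI wiring for the --seed proposal above.
import argparse

parser = argparse.ArgumentParser(description="Overnight ArXiv test (sketch)")
parser.add_argument("--num-docs", type=int, default=2000)
parser.add_argument("--start-year", default="1501",
                    help="Start year-month (YYMM or YYYYMM)")
parser.add_argument("--seed", type=int, default=None,
                    help="Seed for reproducible paper sampling")
args = parser.parse_args()

# Forward the seed so the same corpus is selected across runs, e.g.:
# run_arxiv_overnight_test(num_docs=args.num_docs,
#                          start_year=args.start_year,
#                          seed=args.seed)
```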
371-385: Guard against division by zero in short runs. Rare but safer; keeps metrics robust if a tiny corpus finishes instantly.
```diff
-        'papers_per_minute': (len(papers) / test_time) * 60,
+        'papers_per_minute': ((len(papers) / test_time) * 60) if test_time > 0 else 0.0,
```
399-403: Fix failure logging to show readable, bounded error text. Current code slices a list; join messages and truncate characters instead.
- if failed_docs: - logger.warning(f" Failed Papers: {len(failed_docs)}") - for fail in failed_docs[:3]: - logger.warning(f" - {fail['arxiv_id']}: {fail['errors'][:100]}") + if failed_docs: + logger.warning(f" Failed Papers: {len(failed_docs)}") + for fail in failed_docs[:3]: + err = fail.get('errors') + err_str = "; ".join(map(str, err)) if isinstance(err, list) else str(err) + logger.warning(f" - {fail['arxiv_id']}: {err_str[:200]}")
431-436: Make the performance target configurable. Read it from the environment with a sane default; an optional CLI flag can be added later.
```diff
-    target_rate = 11.3  # papers/minute from PRD
+    target_rate = float(os.getenv("ARXIV_TARGET_RATE", "11.3"))  # papers/minute
```
462-466: Broaden CLI help to reflect accepted formats. Matches the YYMM/YYYYMM normalization.
- help="Starting YYMM for paper selection (default: 1501 = Jan 2015)" + help="Start year-month (YYMM or YYYYMM). Default: 1501 (Jan 2015)"core/processors/chunking_strategies.py (5)
141-143: Validate chunk_size/overlap ranges (negative/zero cases). Guard against chunk_size <= 0 or negative overlap to avoid undefined behavior.
- if chunk_overlap >= chunk_size: - raise ValueError("Overlap must be less than chunk size") + if chunk_size <= 0: + raise ValueError("chunk_size must be > 0") + if chunk_overlap < 0: + raise ValueError("chunk_overlap must be >= 0") + if chunk_overlap >= chunk_size: + raise ValueError("Overlap must be less than chunk size")
20-20: Remove unused import. numpy is imported but never used in this module.
```diff
-import numpy as np
```
173-200: Avoid O(n^2) start_char computation in token-based chunking. len(' '.join(tokens[:i])) inside the loop is quadratic. Precompute prefix char offsets once.
- for i in range(0, len(tokens), stride): + # Precompute prefix char offsets to avoid quadratic joins + if tokens: + spaces = [0] + [1] * (len(tokens) - 1) + token_lens = [len(t) for t in tokens] + prefix_chars = [0] + running = 0 + for tlen, space in zip(token_lens, spaces): + running += tlen + space + prefix_chars.append(running) + for i in range(0, len(tokens), stride): @@ - # Calculate character positions (approximate) - start_char = len(' '.join(tokens[:i])) + # Calculate character positions (approximate) + start_char = prefix_chars[i] if tokens else 0
463-465: Validate window_size and step_size > 0. Prevent runtime ValueError from range step=0 and clarify constraints early.
if step_size > window_size: raise ValueError("Step size cannot be larger than window size") + if window_size <= 0: + raise ValueError("window_size must be > 0") + if step_size <= 0: + raise ValueError("step_size must be > 0")
17-17: Modernize typing per Ruff hints (Callable from collections.abc, use `|` unions). Keep compatibility while addressing UP035/UP007.
-from typing import List, Dict, Any, Optional, Tuple, Callable, Union +from typing import List, Dict, Any, Optional, Tuple +from collections.abc import Callable @@ - tokenizer: Optional[Union[Callable[[str], List[str]], Any]] = None + tokenizer: Optional[Callable[[str], List[str]] | Any] = NoneAlso applies to: 111-111
core/processors/generic_document_processor.py (3)
183-194: Make the extraction future timeout configurable. Long documents/code may exceed 120s. Allow a config override to avoid spurious failures.
```diff
-                    result = future.result(timeout=120)
+                    result = future.result(timeout=extraction_config.get('future_timeout_seconds', 120))
```
218-224: Make the embedding future timeout configurable. Same rationale as extraction; 300s may be low under heavy GPU load.
```diff
-                    result = future.result(timeout=300)
+                    result = future.result(timeout=embedding_config.get('future_timeout_seconds', 300))
```

Also applies to: 247-258
409-413: Avoid mutating sys.path in worker init; prefer package install/imports. sys.path hacks are brittle in multiprocess contexts. Prefer installing the package or setting PYTHONPATH for the test environment; a minimal sketch follows.
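One way to express this, sketched under the assumption that the repository is installed as a package (e.g. `pip install -e .`) so `core.*` imports resolve without path mutation; the function names here are illustrative, not the existing worker API.

```python
# Hypothetical worker setup that avoids sys.path.insert in the initializer.
from concurrent.futures import ProcessPoolExecutor

WORKER_DB_MANAGER = None  # populated per worker by the initializer


def _init_worker(arango_config: dict) -> None:
    # Import inside the worker process; relies on the installed package,
    # not on mutating sys.path at import time.
    global WORKER_DB_MANAGER
    from core.database.arango_db_manager import ArangoDBManager
    WORKER_DB_MANAGER = ArangoDBManager(arango_config)


def make_executor(arango_config: dict, workers: int = 4) -> ProcessPoolExecutor:
    return ProcessPoolExecutor(max_workers=workers,
                               initializer=_init_worker,
                               initargs=(arango_config,))
```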
tests/test_c_files.py (3)
13-13: Remove unused import. GitHubDocumentManager isn’t used in this test.
```diff
-from tools.github.github_document_manager import GitHubDocumentManager
```
163-163: Add trailing newline. Conform to POSIX/linters.
```diff
 if __name__ == "__main__":
     test_c_files()
+
```
86-107: Optional: make the Arango connection and query conditional. Consider skipping DB steps if ARANGO_HOST/ARANGO_PASSWORD are unset or the connection fails, to keep the test runnable in CI without a DB (see the sketch below).
Also applies to: 129-158
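A minimal sketch of the skip, assuming the test is (or can be) collected by pytest; the test name is hypothetical and only the ARANGO_HOST/ARANGO_PASSWORD variables come from the review above.

```python
# Hypothetical pytest-style guard for the DB-dependent portion of test_c_files.py.
import os
import pytest

requires_arango = pytest.mark.skipif(
    not (os.getenv("ARANGO_HOST") and os.getenv("ARANGO_PASSWORD")),
    reason="ArangoDB not configured; set ARANGO_HOST and ARANGO_PASSWORD to run DB checks",
)


@requires_arango
def test_c_files_db_verification():
    # Connect and run the AQL verification here; without credentials the test
    # is skipped instead of failing in CI.
    pass
```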
tests/test_github_treesitter.py (5)
127-129: Use a temp staging directory for portability and isolation. Hardcoding /dev/shm is Linux-specific; prefer a temp dir and still clean it up.
- 'staging': { - 'directory': '/dev/shm/github_test_staging' - } + 'staging': { + 'directory': os.getenv('STAGING_DIR', tempfile.mkdtemp(prefix='github_test_staging_')) + } @@ - import shutil - shutil.rmtree(config['staging']['directory'], ignore_errors=True) + import shutil + shutil.rmtree(config['staging']['directory'], ignore_errors=True)Add the required import:
@@ -import os -import json +import os +import json +import tempfileAlso applies to: 195-197
35-39: Avoid hardcoding clone_dir; make it configurable with an env default. Improves portability and avoids requiring a specific mount path.
- github_manager = GitHubDocumentManager( - clone_dir="/bulk-store/git", - cleanup=False # Keep repos for analysis - ) + github_manager = GitHubDocumentManager( + clone_dir=os.getenv('GITHUB_CLONE_DIR', '/tmp/github_repos'), + cleanup=False # Keep repos for analysis + )
14-14: Avoid fragile sys.path hacks; compute the project root robustly. Keeps absolute imports while preventing accidental wrong paths.
-# Add parent directories to path -sys.path.insert(0, str(Path(__file__).parent.parent.parent)) +# Ensure project root on sys.path (tests expect package-style imports) +project_root = Path(__file__).resolve().parents[1] +if str(project_root) not in sys.path: + sys.path.insert(0, str(project_root))
123-125: Make GPU devices configurable via env; default to CPU. Prevents failures on machines without GPUs.
- 'embedding': { - 'workers': 1, - 'gpu_devices': [0] - } + 'embedding': { + 'workers': 1, + 'gpu_devices': json.loads(os.getenv('GPU_DEVICES', '[]')) + }
202-203: Add trailing newline (ruff W292). Satisfies linters and POSIX text file conventions.
tools/arxiv/arxiv_manager.py (8)
16-25: Prune unused imports and types to satisfy Ruff and reduce noise.
os,asyncio,asdict, andUnionaren’t used. This will trigger Ruff F401/F401/F401/F401.-import os import re import json import logging -import asyncio from pathlib import Path from datetime import datetime, timezone -from typing import Dict, Any, List, Optional, Tuple, Union -from dataclasses import dataclass, asdict +from typing import Dict, Any, List, Optional, Tuple +from dataclasses import dataclass
276-289: Open metadata with UTF-8 and log with traceback for resilience. Minor robustness: specify encoding and use exception logging with a stack trace.
- with open(metadata_path, 'r') as f: + with open(metadata_path, 'r', encoding='utf-8') as f: @@ - except Exception as e: - logger.warning(f"Failed to load metadata cache: {e}") + except Exception: + logger.exception("Failed to load metadata cache")
327-343: Harden metadata coercion for authors/categories. Metadata often uses
`authors_parsed` and space-delimited `categories`. Consider normalizing gracefully to avoid type surprises.
- Prefer
`authors = [" ".join(a[:2]).strip() for a in metadata.get("authors_parsed", [])]` as first choice, else parse string, else list fallback.
- Normalize
`categories` via `split()` if str; ensure a list of str. If you want, I can draft a small
`_normalize_metadata()` helper and wire it here; a sketch follows.
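A possible shape for that helper, sketched from the normalization rules above; the exact field handling should follow whatever the arXiv metadata snapshot actually provides, so treat this as illustrative rather than the final implementation.

```python
# Hypothetical helper; assumes arXiv-style metadata dicts with optional
# "authors_parsed" ([last, first, suffix] triples) and space-delimited "categories".
from typing import Any, Dict, List


def _normalize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    # Authors: prefer authors_parsed, else split a raw string, else keep a list.
    parsed = metadata.get("authors_parsed") or []
    if parsed:
        authors: List[str] = [" ".join(a[:2]).strip() for a in parsed]
    else:
        raw = metadata.get("authors", "")
        authors = [a.strip() for a in raw.split(",")] if isinstance(raw, str) else list(raw or [])

    # Categories: always return a list of strings.
    cats = metadata.get("categories", [])
    categories = cats.split() if isinstance(cats, str) else [str(c) for c in cats]

    return {**metadata, "authors": authors, "categories": categories}
```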
372-377: Potential event-loop blocking.
`process_document` appears CPU/IO heavy; calling it directly in `async` context can block. If this runs under asyncio with other tasks, consider offloading to a thread/process pool. Example pattern:
await asyncio.get_running_loop().run_in_executor(None, functools.partial(self.processor.process_document, ...))
431-435: Guard against None for extraction counts. If any of these are None,
len(None)will raise.- 'num_chunks': len(result.chunks), - 'num_tables': len(result.extraction.tables), - 'num_equations': len(result.extraction.equations), - 'num_images': len(result.extraction.images), + 'num_chunks': len(result.chunks), + 'num_tables': len(getattr(result.extraction, 'tables', []) or []), + 'num_equations': len(getattr(result.extraction, 'equations', []) or []), + 'num_images': len(getattr(result.extraction, 'images', []) or []),
457-467: Default structure lists to empty and keep the schema stable. Safer if any extractor component returns None; also future-proof if figures/images differ.
- structures_doc = { + structures_doc = { '_key': paper_info.sanitized_id, 'arxiv_id': paper_info.arxiv_id, - 'tables': result.extraction.tables, - 'equations': result.extraction.equations, - 'images': result.extraction.images, - 'figures': result.extraction.figures + 'tables': getattr(result.extraction, 'tables', []) or [], + 'equations': getattr(result.extraction, 'equations', []) or [], + 'images': getattr(result.extraction, 'images', []) or [], + 'figures': getattr(result.extraction, 'figures', []) or [], }
471-472: Log the full traceback on storage failures. Use
logger.exceptionfor better diagnostics.- except Exception as e: - logger.error(f"Failed to store ArXiv paper {paper_info.arxiv_id}: {e}") + except Exception: + logger.exception(f"Failed to store ArXiv paper {paper_info.arxiv_id}") raise
26-30: Avoid the function-local import; import ExtractionResult at module scope. Saves repeated import and matches style.
from core.processors.document_processor import ( DocumentProcessor, ProcessingConfig, - ProcessingResult + ProcessingResult, + ExtractionResult, )- from core.processors.document_processor import ProcessingResult, ExtractionResult + # imports available at module scopeAlso applies to: 500-501
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- core/processors/chunking_strategies.py (12 hunks)
- core/processors/generic_document_processor.py (11 hunks)
- experiments/word2vec_evolution/experiment_config.yaml (1 hunks)
- tests/test_c_files.py (1 hunks)
- tests/test_github_treesitter.py (1 hunks)
- tests/test_overnight_arxiv_2000.py (6 hunks)
- tools/arxiv/arxiv_document_manager.py (8 hunks)
- tools/arxiv/arxiv_manager.py (15 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Prefer absolute imports following the shown module structure (e.g., core.framework., tools.arxiv.pipelines.)
Files:
tests/test_github_treesitter.pytests/test_c_files.pycore/processors/chunking_strategies.pycore/processors/generic_document_processor.pytests/test_overnight_arxiv_2000.pytools/arxiv/arxiv_manager.pytools/arxiv/arxiv_document_manager.py
{tools/arxiv,core/framework}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
{tools/arxiv,core/framework}/**/*.py: Format Python code with Black for tools/arxiv/ and core/framework/
Run Ruff linting on tools/arxiv/ and core/framework/
Files:
tools/arxiv/arxiv_manager.pytools/arxiv/arxiv_document_manager.py
🧠 Learnings (2)
📚 Learning: 2025-09-02T23:17:37.712Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-02T23:17:37.712Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/check_db_status.py} : Use ArangoDB collections: arxiv_papers, arxiv_chunks, arxiv_embeddings, arxiv_structures (with specified fields)
Applied to files:
- tests/test_github_treesitter.py
- tools/arxiv/arxiv_manager.py
📚 Learning: 2025-09-02T23:17:37.712Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-02T23:17:37.712Z
Learning: Applies to tools/arxiv/{pipelines/arxiv_pipeline.py,utils/check_db_status.py} : Store all outputs in ArangoDB with atomic transactions
Applied to files:
tools/arxiv/arxiv_manager.py
🧬 Code graph analysis (5)
tests/test_github_treesitter.py (3)
- tools/github/github_document_manager.py (2): GitHubDocumentManager (52-312), prepare_repository (91-127)
- core/processors/generic_document_processor.py (2): GenericDocumentProcessor (46-261), process_documents (91-128)
- core/framework/extractors/code_extractor.py (1): CodeExtractor (18-192)
tests/test_c_files.py (3)
- tools/github/github_document_manager.py (1): GitHubDocumentManager (52-312)
- core/processors/generic_document_processor.py (3): GenericDocumentProcessor (46-261), DocumentTask (29-43), process_documents (91-128)
- core/framework/extractors/code_extractor.py (1): CodeExtractor (18-192)
core/processors/chunking_strategies.py (1)
- experiments/word2vec_evolution/generate_embeddings.py (1): chunk_text (30-57)
core/processors/generic_document_processor.py (3)
- core/framework/embedders.py (1): JinaV4Embedder (52-1003)
- core/database/arango_db_manager.py (1): ArangoDBManager (69-252)
- tools/arxiv/arxiv_manager.py (1): sanitized_id (60-67)
tests/test_overnight_arxiv_2000.py (1)
- tests/test_overnight_arxiv_phased.py (1): collect_arxiv_papers (57-101)
🪛 Ruff (0.12.2)
tests/test_github_treesitter.py
203-203: No newline at end of file
Add trailing newline
(W292)
tests/test_c_files.py
163-163: No newline at end of file
Add trailing newline
(W292)
core/processors/chunking_strategies.py
17-17: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
17-17: typing.List is deprecated, use list instead
(UP035)
17-17: typing.Dict is deprecated, use dict instead
(UP035)
17-17: typing.Tuple is deprecated, use tuple instead
(UP035)
111-111: Use X | Y for type annotations
Convert to X | Y
(UP007)
🪛 YAMLlint (1.37.1)
experiments/word2vec_evolution/experiment_config.yaml
[error] 112-112: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (11)
experiments/word2vec_evolution/experiment_config.yaml (2)
1-12: Overall: concise, readable config. Good use of document marker, structured sections, and wrapped descriptions.
34-36: No action needed: Jina v4 outputs 2048-dim embeddings as configured. Tests, pipeline code, and docs consistently assert EMBEDDING_DIM = 2048 for model "jina-v4", so the settings in experiment_config.yaml are correct.
tools/arxiv/arxiv_document_manager.py (1)
52-61: Docstring clarity: LGTM. The behavior description (skip invalid/missing, preserve order) is clear and helpful.
core/processors/generic_document_processor.py (1)
271-287: Cannot verify GPU sharding via mp.current_process()._identity.
The test script failed with a SemLock error in this environment, so please manually confirm that mp.current_process()._identity is reliably non-empty under your executor. If it isn't, switch to a simpler round-robin allocator (e.g. an atomic counter) for assigning CUDA_VISIBLE_DEVICES.
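A minimal sketch of the suggested atomic-counter round-robin, assuming a two-GPU box and a placeholder worker; this is not the project's GenericDocumentProcessor code.

import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor

NUM_GPUS = 2  # assumed number of visible GPUs


def _pin_gpu(counter, num_gpus):
    """Pool initializer: each worker takes the next GPU index, round-robin."""
    with counter.get_lock():
        gpu_id = counter.value % num_gpus
        counter.value += 1
    # Must run before any CUDA initialization in this worker process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)


def _work(i):
    return f"task {i}: pid={os.getpid()} CUDA_VISIBLE_DEVICES={os.environ['CUDA_VISIBLE_DEVICES']}"


if __name__ == "__main__":
    counter = mp.Value("i", 0)  # shared atomic counter
    with ProcessPoolExecutor(max_workers=4, initializer=_pin_gpu,
                             initargs=(counter, NUM_GPUS)) as pool:
        for line in pool.map(_work, range(4)):
            print(line)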
tools/arxiv/arxiv_manager.py (7)
26-31: Absolute imports: good move. Using absolute imports (core.*) aligns with repo guidelines and eases refactors.
61-66: Docstring clarity for sanitized_id looks good. Explains transformation and usage clearly.
117-132: PDF path docstring is precise and helpful. No issues.
155-158: LaTeX path docstring reads well. Clear precedence for main file detection.
183-197: parse_arxiv_id docs LGTM. Covers new/old formats and returned fields succinctly.
242-249: init docstring is clear about wiring and behavior. Nice.
480-490: Batch processing behavior documented well. Ordering guarantee and failure shaping are clear.
papers:
  word2vec:
    arxiv_id: "1301.3781"
    title: "Efficient Estimation of Word Representations in Vector Space"
    year: 2013
    github_repo: "https://github.com/dav/word2vec"

  doc2vec:
    arxiv_id: "1405.4053"
    title: "Distributed Representations of Sentences and Documents"
    year: 2014
    github_repo: "https://github.com/bnosac/doc2vec"

  code2vec:
    arxiv_id: "1803.09473"
    title: "code2vec: Learning Distributed Representations of Code"
    year: 2018
    github_repo: "https://github.com/tech-srl/code2vec"
🛠️ Refactor suggestion
Pin GitHub repos by commit for reproducibility.
Unpinned default branches are moving targets; runs won’t be reproducible.
papers:
word2vec:
arxiv_id: "1301.3781"
title: "Efficient Estimation of Word Representations in Vector Space"
year: 2013
github_repo: "https://github.com/tmikolov/word2vec"
+ commit: "<PIN_SHA_7+>"
doc2vec:
arxiv_id: "1405.4053"
title: "Distributed Representations of Sentences and Documents"
year: 2014
github_repo: "https://github.com/RaRe-Technologies/gensim"
+ commit: "<PIN_SHA_7+>"
code2vec:
arxiv_id: "1803.09473"
title: "code2vec: Learning Distributed Representations of Code"
year: 2018
github_repo: "https://github.com/tech-srl/code2vec"
+ commit: "<PIN_SHA_7+>"

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
papers:
  word2vec:
    arxiv_id: "1301.3781"
    title: "Efficient Estimation of Word Representations in Vector Space"
    year: 2013
    github_repo: "https://github.com/tmikolov/word2vec"
    commit: "<PIN_SHA_7+>"
  doc2vec:
    arxiv_id: "1405.4053"
    title: "Distributed Representations of Sentences and Documents"
    year: 2014
    github_repo: "https://github.com/RaRe-Technologies/gensim"
    commit: "<PIN_SHA_7+>"
  code2vec:
    arxiv_id: "1803.09473"
    title: "code2vec: Learning Distributed Representations of Code"
    year: 2018
    github_repo: "https://github.com/tech-srl/code2vec"
    commit: "<PIN_SHA_7+>"
🤖 Prompt for AI Agents
In experiments/word2vec_evolution/experiment_config.yaml around lines 13 to 31,
the github_repo fields point to unpinned default branches which makes runs
non-reproducible; update each paper entry to pin the source by adding a commit
field (e.g., github_commit: "<full-commit-sha>") or replace github_repo with a
github_repo_commit field that includes the repo URL plus the exact commit or tag
(full SHA preferable), and ensure any tooling that fetches the repo uses that
commit to checkout; repeat for word2vec, doc2vec and code2vec entries.
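If the config gains a commit field as suggested, the fetching side could honor it roughly like this; the clone_at_commit helper and the config keys are illustrative assumptions, not existing pipeline code.

import subprocess
from pathlib import Path


def clone_at_commit(repo_url: str, commit: str, dest: Path) -> Path:
    """Clone a repo and check out the exact pinned commit."""
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
    subprocess.run(["git", "-C", str(dest), "checkout", commit], check=True)
    return dest


if __name__ == "__main__":
    paper = {
        "github_repo": "https://github.com/tech-srl/code2vec",
        "commit": "<PIN_SHA_7+>",  # placeholder, to be filled from the config
    }
    clone_at_commit(paper["github_repo"], paper["commit"], Path("/tmp/code2vec"))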
# Display results
logger.info("\nPipeline Results:")
logger.info(f"Total processed: {results.get('total_processed', 0)}")
logger.info(f"Successful: {results.get('extraction_success', 0) + results.get('embedding_success', 0)}")
logger.info(f"Failed: {results.get('extraction_failed', 0) + results.get('embedding_failed', 0)}")

if results.get('processing_rate'):
    logger.info(f"Processing rate: {results['processing_rate']:.2f} docs/min")
Correct results parsing (GenericDocumentProcessor returns nested dicts).
Access extraction/embedding summaries via results['extraction'] and results['embedding'] instead of flattened keys; remove non-existent processing_rate.
- logger.info("\nPipeline Results:")
- logger.info(f"Total processed: {results.get('total_processed', 0)}")
- logger.info(f"Successful: {results.get('extraction_success', 0) + results.get('embedding_success', 0)}")
- logger.info(f"Failed: {results.get('extraction_failed', 0) + results.get('embedding_failed', 0)}")
-
- if results.get('processing_rate'):
- logger.info(f"Processing rate: {results['processing_rate']:.2f} docs/min")
+ logger.info("\nPipeline Results:")
+ logger.info(f"Total processed: {results.get('total_processed', 0)}")
+ extraction = results.get('extraction', {})
+ embedding = results.get('embedding', {})
+ logger.info(f"Extraction: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
+ logger.info(f"Embedding: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Display results
logger.info("\nPipeline Results:")
logger.info(f"Total processed: {results.get('total_processed', 0)}")
extraction = results.get('extraction', {})
embedding = results.get('embedding', {})
logger.info(f"Extraction: {len(extraction.get('success', []))} success, {len(extraction.get('failed', []))} failed")
logger.info(f"Embedding: {len(embedding.get('success', []))} success, {len(embedding.get('failed', []))} failed")
🤖 Prompt for AI Agents
In tests/test_github_treesitter.py around lines 151 to 159, the test is reading
flattened keys that don't exist because GenericDocumentProcessor returns nested
summaries; update the logging to read results.get('extraction', {}) and
results.get('embedding', {}) and pull totals from those dicts (e.g.
extraction.get('total_processed', 0), extraction.get('success', 0),
extraction.get('failed', 0) and similarly for embedding), compute aggregated
counts from those nested values for the "Total processed", "Successful" and
"Failed" lines, and remove the non-existent processing_rate check/logging.
def _normalize_yymm(yymm: str) -> tuple[int, int]:
    """Normalize YYMM to (year, month) tuple for correct comparison."""
    if len(yymm) != 4 or not yymm.isdigit():
        raise ValueError(f"Invalid YYMM: {yymm}")
    yy, mm = int(yymm[:2]), int(yymm[2:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month in YYMM: {yymm}")
    # Handle century: 90-99 = 1990s, 00-89 = 2000s
    year = 1900 + yy if yy >= 90 else 2000 + yy
    return (year, mm)
Handle YYYYMM and avoid crashes on 6-digit directory names.
Current code breaks when it encounters numeric dirs of length 6 (e.g., 201501). Make the normalizer accept YYMM or YYYYMM.
-def _normalize_yymm(yymm: str) -> tuple[int, int]:
- """Normalize YYMM to (year, month) tuple for correct comparison."""
- if len(yymm) != 4 or not yymm.isdigit():
- raise ValueError(f"Invalid YYMM: {yymm}")
- yy, mm = int(yymm[:2]), int(yymm[2:])
- if not 1 <= mm <= 12:
- raise ValueError(f"Invalid month in YYMM: {yymm}")
- # Handle century: 90-99 = 1990s, 00-89 = 2000s
- year = 1900 + yy if yy >= 90 else 2000 + yy
- return (year, mm)
+def _normalize_yymm(yymm: str) -> tuple[int, int]:
+ """Normalize YYMM or YYYYMM to (year, month) for correct comparison."""
+ if not (yymm.isdigit() and len(yymm) in (4, 6)):
+ raise ValueError(f"Invalid YYMM/ YYYYMM: {yymm}")
+ if len(yymm) == 4:
+ yy, mm = int(yymm[:2]), int(yymm[2:])
+ # 90-99 = 1990s, 00-89 = 2000s
+ year = 1900 + yy if yy >= 90 else 2000 + yy
+ else:
+ year, mm = int(yymm[:4]), int(yymm[4:])
+ if not 1 <= mm <= 12:
+ raise ValueError(f"Invalid month in YYMM/ YYYYMM: {yymm}")
+ return (year, mm)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def _normalize_yymm(yymm: str) -> tuple[int, int]:
    """Normalize YYMM or YYYYMM to (year, month) for correct comparison."""
    if not (yymm.isdigit() and len(yymm) in (4, 6)):
        raise ValueError(f"Invalid YYMM/YYYYMM: {yymm}")
    if len(yymm) == 4:
        yy, mm = int(yymm[:2]), int(yymm[2:])
        # 90-99 = 1990s, 00-89 = 2000s
        year = 1900 + yy if yy >= 90 else 2000 + yy
    else:
        year, mm = int(yymm[:4]), int(yymm[4:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month in YYMM/YYYYMM: {yymm}")
    return (year, mm)
🤖 Prompt for AI Agents
In tests/test_overnight_arxiv_2000.py around lines 54 to 64, the _normalize_yymm
function currently only accepts 4-digit YYMM and fails on 6-digit YYYYMM
directory names (e.g., 201501); update the function to accept either 4-digit
(YYMM) or 6-digit (YYYYMM) numeric strings, parse accordingly (if length==4
treat as YYMM and apply the existing century rule, if length==6 take the first
four digits as the full year and last two as month), validate that month is
between 1 and 12, and raise ValueError for any other lengths or non-numeric
input; ensure the function still returns (year, month) as ints.
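A quick standalone check of the extended normalizer proposed above, showing that both YYMM and YYYYMM inputs order correctly; this is a self-contained snippet, not the project's test file.

def _normalize_yymm(yymm: str) -> tuple[int, int]:
    # Accept 4-digit YYMM or 6-digit YYYYMM numeric strings.
    if not (yymm.isdigit() and len(yymm) in (4, 6)):
        raise ValueError(f"Invalid YYMM/YYYYMM: {yymm}")
    if len(yymm) == 4:
        yy, mm = int(yymm[:2]), int(yymm[2:])
        year = 1900 + yy if yy >= 90 else 2000 + yy  # 90-99 -> 1990s
    else:
        year, mm = int(yymm[:4]), int(yymm[4:])
    if not 1 <= mm <= 12:
        raise ValueError(f"Invalid month: {yymm}")
    return (year, mm)


assert _normalize_yymm("9912") == (1999, 12)
assert _normalize_yymm("2301") == (2023, 1)
assert _normalize_yymm("201501") == (2015, 1)
assert _normalize_yymm("9912") < _normalize_yymm("0001")  # century-aware ordering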
  yymm_dirs = sorted(
      [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
      key=lambda d: _normalize_yymm(d.name)
  )

- # Filter to start from specified year
- yymm_dirs = [d for d in yymm_dirs if d.name >= start_year]
+ # Filter to start from specified year (normalize century)
+ start_cutoff = _normalize_yymm(start_year)
+ yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
🛠️ Refactor suggestion
Filter dir names to 4 or 6 digits to prevent ValueError and sort once.
The current list may include 6+ digit numeric names and then crash during normalization.
- yymm_dirs = sorted(
- [d for d in ARXIV_PDF_BASE.iterdir() if d.is_dir() and d.name.isdigit()],
- key=lambda d: _normalize_yymm(d.name)
- )
+ yymm_dirs = [d for d in ARXIV_PDF_BASE.iterdir()
+ if d.is_dir() and d.name.isdigit() and len(d.name) in (4, 6)]
+ yymm_dirs.sort(key=lambda d: _normalize_yymm(d.name))
@@
- yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
+ yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
yymm_dirs = [d for d in ARXIV_PDF_BASE.iterdir()
             if d.is_dir() and d.name.isdigit() and len(d.name) in (4, 6)]
yymm_dirs.sort(key=lambda d: _normalize_yymm(d.name))

# Filter to start from specified year (normalize century)
start_cutoff = _normalize_yymm(start_year)
yymm_dirs = [d for d in yymm_dirs if _normalize_yymm(d.name) >= start_cutoff]
🤖 Prompt for AI Agents
In tests/test_overnight_arxiv_2000.py around lines 89 to 96, the directory
filtering and sorting can raise ValueError when numeric directory names have
more than 4 or 6 digits and also invokes normalization multiple times; restrict
the initial list comprehension to directories whose names are all digits and
length 4 or 6 (e.g., len(name) in (4,6)), then perform a single sorted(...) call
using key=lambda d: _normalize_yymm(d.name) and finally apply the start_year
cutoff by comparing _normalize_yymm(d.name) to the normalized start_cutoff so
normalization is only called where appropriate.
"""
Prepare DocumentTask objects for PDFs located in a specific ArXiv YYMM directory.

Searches the manager's PDF base directory under the given year_month (e.g., "2301"), finds up to `limit` PDF files, and builds DocumentTask objects for each PDF. The ArXiv ID passed to the task is constructed as "{year_month}.{paper_id}" where `paper_id` is derived from the PDF filename stem.

Parameters:
    year_month (str): Two-digit year and month string in YYMM format (e.g., "2301").
    limit (Optional[int]): Maximum number of PDFs to process from the directory. If None, all PDFs are processed.

Returns:
    List[DocumentTask]: A list of prepared DocumentTask objects. Returns an empty list if the directory does not exist or no tasks could be prepared.
"""
🛠️ Refactor suggestion
Fix limit=0 bug and unify arXiv ID derivation.
- limit=0 currently processes all PDFs (truthiness check). Should slice when limit is 0.
- Derive the ID consistently with prepare_recent_documents: use the full stem if it already contains a dot, else prefix it with YYMM. Avoid split('.')[-1], which can break on extra dots in filenames (a small sketch of both fixes follows the AI-agents prompt below).
- pdf_files = sorted(dir_path.glob("*.pdf"))
- if limit:
- pdf_files = pdf_files[:limit]
+ pdf_files = sorted(dir_path.glob("*.pdf"))
+ if limit is not None:
+ pdf_files = pdf_files[:limit]
@@
- for pdf_path in pdf_files:
- arxiv_id = f"{year_month}.{pdf_path.stem.split('.')[-1]}"
+ for pdf_path in pdf_files:
+ stem = pdf_path.stem
+ arxiv_id = stem if "." in stem else f"{year_month}.{stem}"
task = self._prepare_from_path(pdf_path, arxiv_id)
if task:
tasks.append(task)

Also applies to: 94-103
🤖 Prompt for AI Agents
tools/arxiv/arxiv_document_manager.py lines 78-88 and 94-103: fix two
issues—limit handling and arXiv ID derivation. For limit handling, do not rely
on truthiness; when limit is not None slice the list of PDF files with files =
files[:limit] so limit=0 yields zero files; when limit is None process all
files. For ID derivation, stop using split('.')[-1]; obtain the filename stem
and if it already contains a dot use the full stem as the arXiv id, otherwise
prefix it with the year_month and a dot (f"{year_month}.{stem}"), and apply the
same logic in both code regions (lines 78-88 and 94-103) to keep behavior
consistent.
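A small standalone sketch of the two fixes above (explicit None check for limit, stem-based ID derivation); the helper names and paths are illustrative only, not the manager's real methods.

from pathlib import Path
from typing import Optional


def select_pdfs(pdf_files: list[Path], limit: Optional[int]) -> list[Path]:
    # limit=0 should mean "zero files", so test for None rather than truthiness.
    return pdf_files if limit is None else pdf_files[:limit]


def derive_arxiv_id(pdf_path: Path, year_month: str) -> str:
    stem = pdf_path.stem
    # "2301.00001.pdf" -> "2301.00001"; "00002.pdf" -> "2301.00002"
    return stem if "." in stem else f"{year_month}.{stem}"


files = [Path("2301/2301.00001.pdf"), Path("2301/00002.pdf")]
print(select_pdfs(files, 0))        # []
print(select_pdfs(files, None))     # both files
print([derive_arxiv_id(p, "2301") for p in files])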
| """ | ||
| Prepare a DocumentTask for a single arXiv identifier. | ||
| Given an arXiv ID of the form "YYMM.NNNNN" (or with a version suffix, e.g. "YYMM.NNNNNv1"), this locates the corresponding PDF under the manager's base directory (pdf_base_dir/YYMM/<arxiv_id>.pdf) and, if found, delegates construction of the DocumentTask to _prepare_from_path. | ||
| Parameters: | ||
| arxiv_id (str): ArXiv identifier expected to contain a single dot separating the YYMM prefix from the paper number. | ||
| Returns: | ||
| Optional[DocumentTask]: A DocumentTask for the PDF when the ID is valid and the PDF exists; otherwise None. | ||
| """ |
🛠️ Refactor suggestion
Validate IDs and fall back to the versionless file when vN is missing on disk.
Docstrings allow a version suffix. If only the versionless PDF exists, we should still prepare the task.
- # Parse ArXiv ID to get YYMM directory
- parts = arxiv_id.split('.')
- if len(parts) != 2:
- logger.error(f"Invalid ArXiv ID format: {arxiv_id}")
- return None
-
- year_month = parts[0]
-
- # Construct PDF path
- pdf_path = self.pdf_base_dir / year_month / f"{arxiv_id}.pdf"
-
- if not pdf_path.exists():
- logger.warning(f"PDF not found: {pdf_path}")
- return None
-
- return self._prepare_from_path(pdf_path, arxiv_id)
+ # Validate and parse ArXiv ID
+ if not self.validate_arxiv_id(arxiv_id):
+ logger.error(f"Invalid ArXiv ID format: {arxiv_id}")
+ return None
+
+ year_month, paper_num = arxiv_id.split(".")
+
+ # Prefer exact match (including version, if any)
+ pdf_path = self.pdf_base_dir / year_month / f"{arxiv_id}.pdf"
+ if pdf_path.exists():
+ return self._prepare_from_path(pdf_path, arxiv_id)
+
+ # Fallback: try versionless if a version suffix was provided
+ base_num, _, _ = paper_num.partition("v")
+ if base_num != paper_num:
+ alt_id = f"{year_month}.{base_num}"
+ alt_pdf_path = self.pdf_base_dir / year_month / f"{alt_id}.pdf"
+ if alt_pdf_path.exists():
+ return self._prepare_from_path(alt_pdf_path, alt_id)
+
+ logger.warning(f"PDF not found: {pdf_path}")
+ return None

Also applies to: 164-179
"""
Persist ArXiv processing results and extracted artifacts to the configured ArangoDB collections.

If no database manager is configured, the call is a no-op (it logs a warning and returns). When a DB manager is present, this inserts:
- a main document into the `arxiv_papers` collection (keyed by paper_info.sanitized_id) containing ArXiv metadata and processing metrics;
- one document per text chunk with embeddings into `arxiv_embeddings`;
- an optional `arxiv_structures` document containing tables, equations, images, and figures when any are present.

Maintains separate collections for ArXiv papers to preserve source-specific metadata and enable specialized queries.

Errors raised by the DB manager or during document insertion are logged and re-raised.
"""
🛠️ Refactor suggestion
Make DB writes atomic and align collections with design.
Per retrieved learnings, writes to arxiv_papers, arxiv_chunks, arxiv_embeddings, and arxiv_structures should be done in a single Arango transaction to avoid partial state on failures.
- Wrap all inserts in one transaction (begin/commit/abort) via ArangoDBManager.
- Store chunks in arxiv_chunks (text/positions), and embeddings in arxiv_embeddings referencing the chunk doc by key.
- Ensure idempotency (upsert or conflict policy) for re-processing the same paper/version.
Do you want me to sketch a transaction wrapper API for ArangoDBManager and update this method accordingly?
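As a starting point, a rough sketch of the atomic write using python-arango stream transactions; the connection details, document shapes, and how this would sit behind ArangoDBManager are all assumptions.

from arango import ArangoClient

# Placeholder connection details (assumption).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("hades", username="root", password="change-me")

# Placeholder documents standing in for the processor's real outputs.
paper_doc = {"_key": "2301_00001", "arxiv_id": "2301.00001", "num_chunks": 1}
chunk_docs = [{"_key": "2301_00001_c0", "paper_key": "2301_00001", "text": "..."}]
embedding_docs = [{"_key": "2301_00001_e0", "chunk_key": "2301_00001_c0", "vector": [0.0] * 2048}]
structures_doc = {"_key": "2301_00001", "tables": [], "equations": [], "images": [], "figures": []}

txn = db.begin_transaction(
    write=["arxiv_papers", "arxiv_chunks", "arxiv_embeddings", "arxiv_structures"]
)
try:
    # overwrite=True replaces on key conflict, so re-processing the same paper stays idempotent.
    txn.collection("arxiv_papers").insert(paper_doc, overwrite=True)
    for chunk_doc, emb_doc in zip(chunk_docs, embedding_docs):
        txn.collection("arxiv_chunks").insert(chunk_doc, overwrite=True)
        txn.collection("arxiv_embeddings").insert(emb_doc, overwrite=True)
    if structures_doc:
        txn.collection("arxiv_structures").insert(structures_doc, overwrite=True)
    txn.commit_transaction()
except Exception:
    txn.abort_transaction()
    raise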
Summary by CodeRabbit
- New Features
- Improvements
- Bug Fixes
- Documentation
- Tests
- Chores