Experiement word2vec lineage #26
Conversation
…pipeline execution, embedding phase, and testing. These changes include the deletion of the following files:

- rebuild_postgresql_fixed.py
- run_acid_pipeline.sh
- run_embedding_phase_only.py
- run_pipeline_from_list.py
- run_test_pipeline.py
- run_weekend_test.sh

This cleanup is part of the transition to a new processing architecture and improves maintainability by removing unused code.
- Introduced `arxiv_example.py` to demonstrate citation extraction from ArXiv papers using ArangoDB.
- Created `custom_provider_example.py` showcasing custom DocumentProvider and CitationStorage implementations with SQLite.
- Added `filesystem_example.py` for citation extraction from local PDF/text files, highlighting the toolkit's versatility.
…r module execution
Walkthrough

Repository reorganizes ArXiv tooling from scripts/ to utils/, updates configs and docs, adjusts output/data paths, and removes many legacy scripts under tools/arxiv/scripts/.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as tools/arxiv/utils/lifecycle.py
    participant LCM as ArXivLifecycleManager
    participant Pipeline as pipelines/arxiv_pipeline.py
    participant DB as ArangoDB
    User->>CLI: batch <paper_list>
    CLI->>LCM: start processing (process/batch)
    LCM->>Pipeline: invoke pipeline (--arango-password)
    Pipeline->>DB: read/write papers & embeddings
    Pipeline-->>LCM: results/metrics
    LCM-->>CLI: summary/status
    CLI-->>User: completion & next steps
```
```mermaid
sequenceDiagram
    autonumber
    participant Dev
    participant Extractor as UniversalBibliographyExtractor
    participant Provider as DocumentProvider (Arango/FileSystem/WebAPI)
    participant Storage as CitationStorage (Arango/JSON/SQLite)
    Dev->>Extractor: extract_paper_bibliography(paper_id)
    Extractor->>Provider: get_document_text / get_document_chunks
    Provider-->>Extractor: paper text / chunks
    Extractor->>Extractor: detect references, parse entries
    Extractor-->>Dev: List[BibliographyEntry]
    Dev->>Storage: store_bibliography_entries(entries)
    Storage-->>Dev: success/failure
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
- Keep deletions of deprecated scripts in tools/arxiv/scripts/
- Functionality moved to tools/arxiv/utils/ with improvements
- Preserve docstring improvements from main in utils files
- Maintain architectural improvements from reorganization
@coderabbitai we have resolved the merge conflict, can you please review?
Docstrings generation was requested by @r3d91ll.

* #26 (comment)

The following files were modified:

* `tools/arxiv/db/export_ids.py`
* `tools/arxiv/tests/validate_pipeline.py`
* `tools/arxiv/utils/check_papers.py`
* `tools/arxiv/utils/merge_lists.py`
* `tools/arxiv/utils/rebuild_database.py`
* `tools/rag_utils/academic_citation_toolkit.py`
* `tools/rag_utils/examples/arxiv_example.py`
* `tools/rag_utils/examples/custom_provider_example.py`
* `tools/rag_utils/examples/filesystem_example.py`
Note: Generated docstrings for this pull request at #27
📝 Add docstrings to `experiement_word2vec_lineage`
Docstrings generation was requested by @r3d91ll.

* #26 (comment)

The following files were modified:

* `tools/arxiv/utils/rebuild_database.py`
* `tools/rag_utils/academic_citation_toolkit.py`
* `tools/rag_utils/examples/custom_provider_example.py`
* `tools/rag_utils/examples/filesystem_example.py`
Note: Generated docstrings for this pull request at #28
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tools/arxiv/utils/detect_latex.py (1)
278-285: Guard against division by zero in summary

If `results` is empty, the current percentage prints will raise ZeroDivisionError.

```diff
- print(f"Papers with LaTeX: {stats['has_latex']:,} ({stats['has_latex']/stats['total_papers']*100:.1f}%)")
- print(f"Papers without LaTeX: {stats['no_latex']:,} ({stats['no_latex']/stats['total_papers']*100:.1f}%)")
- print(f"Unknown status: {stats['unknown']:,} ({stats['unknown']/stats['total_papers']*100:.1f}%)")
+ total = max(stats['total_papers'], 1)
+ print(f"Papers with LaTeX: {stats['has_latex']:,} ({stats['has_latex']/total*100:.1f}%)")
+ print(f"Papers without LaTeX: {stats['no_latex']:,} ({stats['no_latex']/total*100:.1f}%)")
+ print(f"Unknown status: {stats['unknown']:,} ({stats['unknown']/total*100:.1f}%)")
```

tools/arxiv/utils/check_papers.py (2)
16-21: Fix sys.path root calculation (currently points one level above repo)

parents[3] from utils/ escapes the repo; imports may break.

```diff
- project_root = Path(__file__).parent.parent.parent.parent  # Goes up to HADES-Lab
- sys.path.insert(0, str(project_root))
+ project_root = Path(__file__).resolve().parents[2]  # HADES-Lab/
+ if str(project_root) not in sys.path:
+     sys.path.insert(0, str(project_root))
```
116-121: Guard percentage calculations against empty lists

If the input file exists but is empty, division by zero occurs.

```diff
- print(f"  ✅ Already processed: {already_processed:,} ({already_processed/total_papers*100:.1f}%)")
- print(f"  ❌ Failed previously: {already_failed:,} ({already_failed/total_papers*100:.1f}%)")
- print(f"  ⏳ Not yet processed: {not_processed:,} ({not_processed/total_papers*100:.1f}%)")
+ if total_papers > 0:
+     print(f"  ✅ Already processed: {already_processed:,} ({already_processed/total_papers*100:.1f}%)")
+     print(f"  ❌ Failed previously: {already_failed:,} ({already_failed/total_papers*100:.1f}%)")
+     print(f"  ⏳ Not yet processed: {not_processed:,} ({not_processed/total_papers*100:.1f}%)")
+ else:
+     print(f"  ✅ Already processed: {already_processed:,} (N/A)")
+     print(f"  ❌ Failed previously: {already_failed:,} (N/A)")
+     print(f"  ⏳ Not yet processed: {not_processed:,} (N/A)")
```

tools/arxiv/tests/run_large_scale_test.sh (2)
101-117: set -e prevents custom failure message for small batch; capture exit code explicitly

With set -e, the script exits immediately on a non-zero Python exit, skipping your friendly error message. Temporarily disable -e, capture the exit code, then re-enable.

```diff
-# Run with limited papers first
-python test_large_scale_processing.py \
-    --config ../configs/large_scale_test.yaml \
-    --papers "$PAPER_LIST" \
-    --limit 100
-
-# Check if small batch succeeded
-if [ $? -ne 0 ]; then
+set +e
+python test_large_scale_processing.py \
+    --config ../configs/large_scale_test.yaml \
+    --papers "$PAPER_LIST" \
+    --limit 100
+small_exit=$?
+set -e
+
+# Check if small batch succeeded
+if [ $small_exit -ne 0 ]; then
     echo -e "${RED}Small batch test failed. Aborting full test.${NC}"
     exit 1
 fi
```
6-7: Propagate failures through pipelines (tee) with pipefail

Without pipefail, the full run's exit code reflects tee, not the Python process. Enable pipefail near set -e.

```diff
-set -e  # Exit on error
+set -e  # Exit on error
+set -o pipefail  # Fail on pipeline errors (e.g., python | tee)
```
🧹 Nitpick comments (53)
tools/arxiv/db/export_ids.py (3)
263-267: Fix pre-2000 yymm year mapping in stats

`2000 + int(yymm[:2])` mislabels 1990s papers (e.g., '9912' → 2099). Use a pivot to disambiguate.

Apply:

```diff
- y = 2000 + int(str(ym)[:2])
- m = int(str(ym)[2:])
+ yy = int(str(ym)[:2])
+ m = int(str(ym)[2:])
+ # Treat 91–99 as 1991–1999; 00–30 as 2000–2030 (adjust pivot as needed)
+ y = (1900 + yy) if yy >= 91 else (2000 + yy)
```
271-279: Avoid fragile string replacements when aliasing WHERE

Rebuild the WHERE clause using the existing helper with `table_alias='p'` instead of `.replace(...)`.

Example:

```diff
- where_p_sql = where_sql.replace("year", "p.year").replace("month", "p.month").replace(
-     "yymm", "p.yymm"
- ).replace("has_pdf", "p.has_pdf").replace("primary_category", "p.primary_category").replace(
-     "title", "p.title"
- ).replace("abstract", "p.abstract")
+ where_p_sql, _ = build_where(
+     start_year, end_year, months, yymm_range, categories, keywords, with_pdf, missing_pdf, table_alias="p"
+ )
```
1-2: Add theory-anchored module docstring

Guideline: include docstrings linking the implementation to the theoretical framework.

Example (add at top of file):

```python
"""
ArXiv ID export utilities.

Maps to Information Reconstructionism:
- WHERE: SQL filtering (years/months/categories) selects topological subsets.
- WHAT: optional keyword search (tsvector).
- CONVEYANCE/TIME: batching, caps, and symlinked outputs for fast downstream use.
"""
```

CLAUDE.md (1)
197-199: Prefer explicit paths in docs

`cd ../utils/` can be context-dependent. Suggest `cd tools/arxiv/utils/` for clarity.

```diff
-cd ../utils/
+cd tools/arxiv/utils/
```

tools/arxiv/utils/detect_latex.py (3)
268-270: Create parent dir before writing results

Prevents failures when the logs directory doesn't exist.

```diff
-    with open(output_file, 'w') as f:
+    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
+    with open(output_file, 'w') as f:
         json.dump(output_data, f, indent=2)
```
120-121: Use HTTPS and a descriptive User-Agent for arXiv API

Improves security and aligns with arXiv API etiquette.

```diff
- url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
- response = requests.get(url, timeout=30)
+ url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
+ headers = {"User-Agent": "HADES-Lab LaTeXDetector (contact: your-email@example.com)"}
+ response = requests.get(url, headers=headers, timeout=30)
```
305-327: Avoid hard-coded paths; accept CLI args

Parameterize sample and output paths with argparse for portability.

If helpful, I can provide a patch to add `--sample-file` and `--out-file` flags.
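A minimal sketch of what such a patch could look like, assuming the flag names suggested above and a simplified entry point (the script's real `main()` and defaults may differ):

```python
import argparse
from pathlib import Path


def parse_args() -> argparse.Namespace:
    # Flag names follow the reviewer's suggestion; defaults are illustrative only.
    parser = argparse.ArgumentParser(
        description="Detect LaTeX availability for sampled arXiv papers"
    )
    parser.add_argument("--sample-file", type=Path, required=True,
                        help="JSON file with sampled arXiv IDs to check")
    parser.add_argument("--out-file", type=Path,
                        default=Path("logs/latex_detection_results.json"),
                        help="Where to write detection results")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Reading samples from {args.sample_file}, writing results to {args.out_file}")
```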
tools/arxiv/utils/__init__.py (1)

1-13: Add trailing newline and tie docstring to the stated theoretical framework

Comply with Ruff W292 and the guideline to connect to Information Reconstructionism/Conveyance.

```diff
 """
 ArXiv utilities package.

 This package contains utility scripts for ArXiv paper processing,
 database operations, and paper lifecycle management. All tools have
 been consolidated here for better discoverability.

 Main utilities:
 - lifecycle.py: Primary interface for ArXiv paper processing
 - rebuild_database.py: Database maintenance and reconstruction
 - check_db_status.py: Database status verification
 - check_papers.py: Paper validation in ArangoDB
+
+This package structures utilities to preserve document context and provenance in line
+with the Information Reconstructionism/Conveyance framework used across the pipeline.
 """
+
```

README.md (3)
204-206: Clarify working directory for the LaTeX extraction command

Readers may run this from the repo root. Consider showing a repo-relative path to avoid cwd confusion.

Suggested doc tweak:

```diff
- cd tools/arxiv/utils/
- python extract_latex_archives.py  # Script processes .tar files
+ # from repo root:
+ python tools/arxiv/utils/extract_latex_archives.py  # processes .tar files
```
212-215: Prefer repo-relative invocation to avoid cwd dependency

Same guidance as above; also mention the required env var explicitly before running.

```diff
- cd tools/arxiv/utils/
- export PGPASSWORD="your-postgres-password"
- python rebuild_database.py
+ export PGPASSWORD="your-postgres-password"
+ python tools/arxiv/utils/rebuild_database.py
```
340-344: Make lifecycle invocations cwd-agnostic

Use repo-relative paths or add a preceding cd to tools/arxiv/utils for consistency with CLAUDE.md.

```diff
- python lifecycle.py process 2503.10150
+ python tools/arxiv/utils/lifecycle.py process 2503.10150
@@
- python lifecycle.py batch paper_list.txt --hirag-extraction
+ python tools/arxiv/utils/lifecycle.py batch paper_list.txt --hirag-extraction
```

tools/arxiv/utils/check_papers.py (4)
43-47: Deduplicate root resolution and use the already computed project_root

Avoid recomputing with a different depth; it risks drift.

```diff
- script_dir = Path(__file__).parent.resolve()
- project_root = script_dir.parents[2]  # Go up from utils -> arxiv -> tools -> HADES-Lab
- data_dir = project_root / "data" / "arxiv_collections"
+ data_dir = project_root / "data" / "arxiv_collections"
```
71-77: Normalize ARANGO_HOST with/without scheme to prevent mismatches

Docs export ARANGO_HOST as a bare host; here the default includes http:// and port. Normalize to a full URL consistently.

```diff
- 'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
+ # Accept "host[:port]" or "http[s]://host:port"
+ '_host_env': os.getenv('ARANGO_HOST', '192.168.1.69'),
+ 'host': _host_env if _host_env.startswith('http') else f"http://{_host_env}:8529",
```
144-161: Make the follow-up command repo-relative to avoid cwd surprises

The current message assumes you're in utils/.

```diff
- print(f"   python lifecycle.py batch {unprocessed_file} --count 100")
+ print(f"   python tools/arxiv/utils/lifecycle.py batch {unprocessed_file} --count 100")
```
1-7: Optional: tie docstring to the Conveyance framework per repo guidelines

Add one line linking this utility's purpose (status introspection) to Information Reconstructionism/Conveyance.
tools/arxiv/utils/rebuild_database.py (3)
120-124: Use a precise exception in date parsing

A bare `except:` risks swallowing unrelated errors; narrow it to ValueError.

```diff
-        except:
+        except ValueError:
             continue
```
659-665: Avoid hardcoded, machine-specific log paths in CLI output

Use the computed log_path so users don't copy a wrong path.

```diff
- print("   Monitor: tail -f /home/todd/olympus/HADES-Lab/tools/arxiv/logs/postgresql_rebuild_complete.log")
+ print(f"   Monitor: tail -f {log_path}")
```
81-89: Add a connect timeout for DB resilience

Prevents indefinite hangs if Postgres is unreachable.

```diff
- return psycopg2.connect(**self.pg_config)
+ return psycopg2.connect(connect_timeout=10, **self.pg_config)
```

tools/arxiv/CLAUDE.md (1)
10-19: Docs read well; one minor improvement for cwd context

Prepend a short note "from tools/arxiv/" before `cd utils/` to make the relative `../pipelines/` step unambiguous.

tools/arxiv/utils/merge_lists.py (5)
23-37: Make output_dir robust (default currently depends on caller's cwd)

Defaulting to "../../../..." is fragile. Derive the path from __file__ or accept None and compute it.

```diff
-def merge_id_files(*id_files, output_dir: str = "../../../data/arxiv_collections") -> Path:
+def merge_id_files(*id_files, output_dir: str | None = None) -> Path:
@@
-    output_dir = Path(output_dir)
+    if output_dir is None:
+        output_dir = Path(__file__).resolve().parents[2] / "data" / "arxiv_collections"
+    output_dir = Path(output_dir)
```
55-70: Ruff cleanups: remove redundant mode and use OSError

Minor polish per UP015 and UP024.

```diff
-    try:
-        logger.info(f"Loading IDs from {id_path}")
-        with open(id_path, 'r', encoding='utf-8') as f:
+    try:
+        logger.info(f"Loading IDs from {id_path}")
+        with open(id_path, encoding='utf-8') as f:
@@
-    except IOError as e:
+    except OSError as e:
         logger.error(f"Error reading {id_path}: {e}")
         continue
```
90-100: Apply the same robust defaulting for JSON merges

Mirror the id-file behavior.

```diff
-def merge_json_collections(*json_files, output_dir: str = "../../../data/arxiv_collections") -> Path:
+def merge_json_collections(*json_files, output_dir: str | None = None) -> Path:
@@
-    output_dir = Path(output_dir)
+    if output_dir is None:
+        output_dir = Path(__file__).resolve().parents[2] / "data" / "arxiv_collections"
+    output_dir = Path(output_dir)
```
111-114: Specify encoding when reading JSON collections

Avoid locale-dependent decoding issues.

```diff
-        with open(json_path, 'r') as f:
+        with open(json_path, encoding='utf-8') as f:
             data = json.load(f)
```
192-194: Reflect new default behavior in CLI help

If adopting the file-relative default, update the help text accordingly.

```diff
-    parser.add_argument('--output-dir', default='../../../data/arxiv_collections',
-                        help='Output directory')
+    parser.add_argument('--output-dir',
+                        help='Output directory (default: repo_root/data/arxiv_collections)')
```

tools/rag_utils/academic_citation_toolkit.md (1)
5-5: Use consistent "arXiv" capitalization throughout the doc

Multiple instances use "ArXiv"; prefer "arXiv" (brand style).
Also applies to: 66-69, 86-89, 143-151, 262-264
tools/rag_utils/academic_citation_toolkit.py (7)
257-266: Also set UTF-8 when writing citations

```diff
-        with open(f"{self.output_dir}/citations.json", 'w') as f:
+        with open(f"{self.output_dir}/citations.json", 'w', encoding='utf-8') as f:
             json.dump(data, f, indent=2, ensure_ascii=False)
```
576-576: Clean up lint issues (unused var, f-strings, EOF newline)

```diff
- arango_password = os.getenv('ARANGO_PASSWORD')
+ os.getenv('ARANGO_PASSWORD')  # Ensures provider raises if missing
- if storage.store_bibliography_entries(entries):
-     print(f"   💾 Stored bibliography entries")
+ if storage.store_bibliography_entries(entries):
+     print("   💾 Stored bibliography entries")
-     print(f"   ❌ No bibliography found")
+     print("   ❌ No bibliography found")
```

Ensure the file ends with a trailing newline.
Also applies to: 614-614, 617-617, 619-620
141-146: Minor: mode='r' is the default; it can be dropped for style

```diff
-        with open(file_path, 'r', encoding='utf-8') as f:
+        with open(file_path, encoding='utf-8') as f:
             return f.read()
```
287-304: Precompile hot regexes to reduce repeated compile overhead

Define patterns at module scope and reuse them inside methods; a sketch is shown below. I can provide a patch if desired.
Also applies to: 306-315, 317-326
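A minimal sketch of the suggested pattern, assuming hypothetical pattern names and a simplified parsing helper (the toolkit's actual regexes and method structure may differ):

```python
import re

# Compile once at module scope instead of inside hot methods.
# Pattern names and regexes here are illustrative, not the toolkit's actual ones.
NUMBERED_ENTRY_RE = re.compile(r"^\s*\[(\d+)\]\s+(.+)$", re.MULTILINE)
ARXIV_ID_RE = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+\b")


def parse_numbered_entries(bibliography_text: str) -> list[tuple[int, str]]:
    """Reuse the precompiled pattern; no re.compile call per invocation."""
    return [(int(num), text.strip())
            for num, text in NUMBERED_ENTRY_RE.findall(bibliography_text)]
```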
91-99: Consider preserving paragraph boundaries when joining chunks

Using '\n\n' instead of a space keeps structure for downstream parsing.

```diff
-        return ' '.join(chunks) if chunks else None
+        return '\n\n'.join(chunks) if chunks else None
```

Also applies to: 151-159
544-552: In-text citation extraction is unimplemented

Add a simple numeric [n] extractor as a first pass, then map citations to entries by `entry_number`.

I can add a minimal implementation with tests on request.
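A minimal sketch of such a first-pass extractor, assuming a simplified citation structure (the field names here are illustrative and may not match the toolkit's InTextCitation dataclass):

```python
import re
from dataclasses import dataclass

# Matches [1], [1, 2], [3-5] style markers.
BRACKETED_NUM_RE = re.compile(r"\[(\d+(?:\s*[,-]\s*\d+)*)\]")


@dataclass
class SimpleCitation:
    source_paper_id: str
    entry_number: int   # maps to BibliographyEntry.entry_number
    context: str        # surrounding text kept for later analysis


def extract_numeric_citations(paper_id: str, text: str, window: int = 60) -> list[SimpleCitation]:
    """First-pass extractor for numeric bracketed citations."""
    citations: list[SimpleCitation] = []
    for match in BRACKETED_NUM_RE.finditer(text):
        start, end = match.span()
        context = text[max(0, start - window):end + window]
        for num in re.findall(r"\d+", match.group(1)):
            citations.append(SimpleCitation(paper_id, int(num), context))
    return citations
```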
33-36: Modernize typing: prefer built-in generics

Keeps Ruff happy (UP035) and aligns with Python 3.10+.

```diff
-from typing import List, Dict, Optional, Tuple, Union
+from typing import Optional
+# Use built-in generics: list, dict, tuple instead of typing.List/Dict/Tuple
```

tools/arxiv/tests/validate_pipeline.py (2)
11-11: Remove unused import

```diff
-import subprocess
```
13-13: Optional: switch to built-in generics and drop typing where possible

```diff
-from typing import Tuple, List
+from typing import Tuple, List  # or use built-ins in annotations: tuple[bool, list[str]]
```

Update annotations to `tuple[bool, list[str]]` when you touch this file next.

Also applies to: 22-25, 68-83, 174-176
tools/arxiv/utils/lifecycle.py (1)
147-164: Map PROCESSING status in emoji and description dictionaries

If PaperStatus.PROCESSING occurs, the CLI shows "⚪ Unknown status." Add explicit mappings.

```diff
     status_emoji = {
         PaperStatus.ERROR: "❌",
         PaperStatus.NOT_FOUND: "❓",
         PaperStatus.METADATA_ONLY: "📋",
         PaperStatus.DOWNLOADED: "📥",
+        PaperStatus.PROCESSING: "⏳",
         PaperStatus.PROCESSED: "⚙️",
         PaperStatus.HIRAG_INTEGRATED: "🎯"
     }
@@
     status_descriptions = {
         PaperStatus.NOT_FOUND: "Paper not found in system",
         PaperStatus.METADATA_ONLY: "Metadata available, files not downloaded",
         PaperStatus.DOWNLOADED: "Files downloaded, not processed",
+        PaperStatus.PROCESSING: "Processing in progress",
         PaperStatus.PROCESSED: "Fully processed through ACID pipeline",
         PaperStatus.HIRAG_INTEGRATED: "Integrated into HiRAG system",
         PaperStatus.ERROR: "Error occurred during processing"
     }
```

tools/arxiv/tests/run_large_scale_test.sh (1)
46-49: Adjust Step 1 messaging to reflect discovery of prebuilt lists, not collection

The current text says "Collecting… from ArXiv API" but the code only discovers existing lists. Tweak the wording to avoid confusion.

```diff
-echo -e "\n${GREEN}Step 1: Collecting papers from ArXiv API${NC}"
-echo "This will search for papers on AI, RAG, LLMs, and Actor Network Theory"
+echo -e "\n${GREEN}Step 1: Discovering existing paper lists${NC}"
+echo "Looking for prebuilt arxiv_ids_*.txt lists (AI, RAG, LLMs, ANT)"
```

tools/rag_utils/README.md (1)
1-252: Polish wording and examples; add run-as-module note

Minor grammar and list-formatting nits were flagged, and the examples would benefit from a note that they should be executed as modules (python -m …) due to package-relative imports.
- Normalize “arXiv” casing and bullet spacing.
- Add: “Run examples as modules, e.g., python -m tools.rag_utils.examples.arxiv_example”.
- Consider running markdownlint and LanguageTool on this file to batch-fix micro issues.
tools/rag_utils/examples/arxiv_example.py (6)
32-36: Parameterize ArangoDB host via env var (default localhost)

Avoid hardcoding a private IP; this improves portability.

```diff
-client = ArangoClient(hosts='http://192.168.1.69:8529')
+arango_host = os.getenv('ARANGO_HOST', 'http://localhost:8529')
+client = ArangoClient(hosts=arango_host)
```
52-57: Remove unused loop variable per Ruff B007

Title isn't used in the loop body.

```diff
-for paper_id, title in core_papers.items():
+for paper_id in core_papers:
```
89-90: Drop extraneous f-string

```diff
- print(f"   ❌ No bibliography entries found")
+ print("   ❌ No bibliography entries found")
```
123-127: Drop extraneous f-string

```diff
- print(f"   Collection: bibliography_entries")
+ print("   Collection: bibliography_entries")
```
14-16: Add helpful import guard when run as a script (mirror filesystem_example)

Running this file directly will fail due to package-relative imports. Add the same try/except guidance used in filesystem_example.

```python
# Replace the simple import with:
try:
    from ..academic_citation_toolkit import create_arxiv_citation_toolkit
except ImportError as e:
    if __name__ == "__main__" and (__package__ is None or __package__ == ""):
        raise SystemExit(
            "Run as a module:\n  python -m tools.rag_utils.examples.arxiv_example"
        ) from e
    raise
```
1-8: Optional: tie docstring to Information Reconstructionism/Conveyance

A one-liner noting how citation extraction supports information conveyance across networks would align with repo guidelines.
tools/rag_utils/examples/filesystem_example.py (2)
134-136: Unnecessary mode argument in open()

Reading is the default; remove 'r'.

```diff
-    with open(f"{output_dir}/bibliography.json", 'r') as f:
+    with open(f"{output_dir}/bibliography.json") as f:
```
137-167: Drop extraneous f-strings where no interpolation occurs

Cleans up Ruff F541 warnings.

```diff
-    print(f"   📊 Storage summary:")
+    print("   📊 Storage summary:")
@@
-    print(f"   Sample stored entry:")
+    print("   Sample stored entry:")
@@
-    print(f"   ❌ No bibliography entries found")
+    print("   ❌ No bibliography entries found")
@@
-    print(f"\n📂 Output files created:")
+    print("\n📂 Output files created:")
```

tools/rag_utils/__init__.py (2)
1-11: Add brief theoretical-framework note to the package docstring

Per guidelines, connect the implementation to Information Reconstructionism/Conveyance.

Apply:

```diff
 """
 RAG Utils - Universal Academic Tools
 ====================================

 Source-agnostic utilities for building Retrieval-Augmented Generation (RAG)
 systems from academic corpora. These tools work with any academic paper source:
 ArXiv, SSRN, PubMed, Harvard Law Library, or any other collection.

+Theoretical note (Information Reconstructionism/Conveyance):
+these utilities reconstruct citation/bibliography structures from raw texts
+and convey them as structured knowledge into downstream RAG pipelines.
+
 Key Modules:
 - academic_citation_toolkit: Universal citation and bibliography extraction
 """
```
64-64: Add trailing newline (Ruff W292)

```diff
-]
+]
+
```

tools/rag_utils/examples/custom_provider_example.py (6)
11-15: Modernize type hints and drop unused import

Use built-in generics (list[str]) and remove the unused sys import. Also satisfies Ruff UP035.

```diff
-import sys
-import json
+import json
 import sqlite3
-from typing import List, Optional
+from typing import Optional
@@
-    def get_document_chunks(self, document_id: str) -> List[str]:
+    def get_document_chunks(self, document_id: str) -> list[str]:
@@
-    def store_bibliography_entries(self, entries: List[BibliographyEntry]) -> bool:
+    def store_bibliography_entries(self, entries: list[BibliographyEntry]) -> bool:
@@
-    def store_citations(self, citations: List[InTextCitation]) -> bool:
+    def store_citations(self, citations: list[InTextCitation]) -> bool:
@@
-    def get_document_chunks(self, document_id: str) -> List[str]:
+    def get_document_chunks(self, document_id: str) -> list[str]:
```

Also applies to: 63-71, 140-146, 173-179, 331-337
357-361: Remove f-strings without placeholders (Ruff F541)

```diff
-    print(f"   DocumentProvider: MockAPIDocumentProvider")
+    print("   DocumentProvider: MockAPIDocumentProvider")
@@
-    print(f"   Extractor: UniversalBibliographyExtractor")
+    print("   Extractor: UniversalBibliographyExtractor")
@@
-        print(f"   ❌ No bibliography entries found")
+        print("   ❌ No bibliography entries found")
@@
-    print(f"📊 Database Statistics:")
+    print("📊 Database Statistics:")
@@
-    print(f"   Confidence distribution:")
+    print("   Confidence distribution:")
@@
-    print(f"\n📂 Files Created:")
+    print("\n📂 Files Created:")
@@
-    print(f"   Tables: bibliography_entries, in_text_citations")
+    print("   Tables: bibliography_entries, in_text_citations")
```

Also applies to: 390-397, 401-416, 430-433
44-61: Harden Web API fetch: raise for HTTP errors, normalize return to Optional[str]

Improves robustness and keeps return typing consistent.

```diff
     def get_document_text(self, document_id: str) -> Optional[str]:
         """Fetch full document text from web API."""
         try:
             import requests
-
-            url = f"{self.api_base_url}/documents/{document_id}/fulltext"
-            response = requests.get(url, headers=self.headers, timeout=30)
-
-            if response.status_code == 200:
-                data = response.json()
-                return data.get('full_text', data.get('content', ''))
-            else:
-                print(f"API Error {response.status_code} for document {document_id}")
-                return None
-
-        except Exception as e:
-            print(f"Error fetching document {document_id}: {e}")
+            url = f"{self.api_base_url}/documents/{document_id}/fulltext"
+            response = requests.get(url, headers=self.headers, timeout=30)
+            response.raise_for_status()
+            data = response.json()
+            text = data.get('full_text') or data.get('content')
+            return text or None
+        except requests.RequestException as e:
+            print(f"HTTP error for document {document_id}: {e}")
+            return None
+        except Exception as e:
+            print(f"Error fetching/decoding document {document_id}: {e}")
             return None
```
150-161: Prefer ON CONFLICT DO UPDATE over OR REPLACE to preserve row identity/timestamps

Avoids deleting and reinserting rows (which resets created_at and autoincrement ids).

```diff
-                INSERT OR REPLACE INTO bibliography_entries
+                INSERT INTO bibliography_entries
                 (source_paper_id, entry_number, raw_text, title, authors, venue,
                  year, arxiv_id, doi, pmid, ssrn_id, url, confidence)
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                ON CONFLICT(source_paper_id, entry_number) DO UPDATE SET
+                    raw_text=excluded.raw_text,
+                    title=excluded.title,
+                    authors=excluded.authors,
+                    venue=excluded.venue,
+                    year=excluded.year,
+                    arxiv_id=excluded.arxiv_id,
+                    doi=excluded.doi,
+                    pmid=excluded.pmid,
+                    ssrn_id=excluded.ssrn_id,
+                    url=excluded.url,
+                    confidence=excluded.confidence
```
3-9: Add brief theoretical-framework context to module docstring

Tie the example to Information Reconstructionism/Conveyance as required.

```diff
 Demonstrates creating custom DocumentProvider and CitationStorage implementations
 for the Academic Citation Toolkit. Shows how to extend the toolkit for any
 academic corpus or storage system.
+
+Conceptual note (Information Reconstructionism/Conveyance):
+this example reconstructs citation structures from raw text and conveys them
+into a structured store to support downstream RAG workflows.
```
440-440: Add trailing newline (Ruff W292)

```diff
 if __name__ == "__main__":
     main()
+
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (36)
- .gitignore (1 hunks)
- CLAUDE.md (2 hunks)
- README.md (3 hunks)
- tools/arxiv/CLAUDE.md (3 hunks)
- tools/arxiv/configs/arxiv_search.yaml (1 hunks)
- tools/arxiv/configs/arxiv_search_minimal.yaml (1 hunks)
- tools/arxiv/configs/arxiv_search_nokeywords.yaml (1 hunks)
- tools/arxiv/configs/arxiv_search_practical.yaml (1 hunks)
- tools/arxiv/database/__init__.py (0 hunks)
- tools/arxiv/db/export_ids.py (1 hunks)
- tools/arxiv/scripts/collect_ai_papers.py (0 hunks)
- tools/arxiv/scripts/collect_ai_papers_extended.py (0 hunks)
- tools/arxiv/scripts/pdf_scanner.py (0 hunks)
- tools/arxiv/scripts/pdf_scanner_fixed.py (0 hunks)
- tools/arxiv/scripts/rebuild_postgresql.py (0 hunks)
- tools/arxiv/scripts/rebuild_postgresql_fixed.py (0 hunks)
- tools/arxiv/scripts/run_embedding_phase_only.py (0 hunks)
- tools/arxiv/scripts/run_pipeline_from_list.py (0 hunks)
- tools/arxiv/scripts/run_test_pipeline.py (0 hunks)
- tools/arxiv/scripts/run_weekend_test.sh (0 hunks)
- tools/arxiv/tests/run_large_scale_test.sh (1 hunks)
- tools/arxiv/tests/validate_pipeline.py (1 hunks)
- tools/arxiv/utils/__init__.py (1 hunks)
- tools/arxiv/utils/check_papers.py (4 hunks)
- tools/arxiv/utils/detect_latex.py (1 hunks)
- tools/arxiv/utils/lifecycle.py (2 hunks)
- tools/arxiv/utils/merge_lists.py (4 hunks)
- tools/arxiv/utils/rebuild_database.py (4 hunks)
- tools/arxiv/utils/run_acid_pipeline.sh (2 hunks)
- tools/rag_utils/README.md (1 hunks)
- tools/rag_utils/__init__.py (1 hunks)
- tools/rag_utils/academic_citation_toolkit.md (1 hunks)
- tools/rag_utils/academic_citation_toolkit.py (1 hunks)
- tools/rag_utils/examples/arxiv_example.py (1 hunks)
- tools/rag_utils/examples/custom_provider_example.py (1 hunks)
- tools/rag_utils/examples/filesystem_example.py (1 hunks)
💤 Files with no reviewable changes (11)
- tools/arxiv/scripts/run_embedding_phase_only.py
- tools/arxiv/database/__init__.py
- tools/arxiv/scripts/run_weekend_test.sh
- tools/arxiv/scripts/run_test_pipeline.py
- tools/arxiv/scripts/pdf_scanner.py
- tools/arxiv/scripts/rebuild_postgresql_fixed.py
- tools/arxiv/scripts/run_pipeline_from_list.py
- tools/arxiv/scripts/collect_ai_papers_extended.py
- tools/arxiv/scripts/pdf_scanner_fixed.py
- tools/arxiv/scripts/collect_ai_papers.py
- tools/arxiv/scripts/rebuild_postgresql.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Include docstrings in code that connect implementation to the theoretical framework (Information Reconstructionism/Conveyance)
Within the same module, use relative imports (e.g., from .utils import helper_function)
Files:
- tools/arxiv/utils/__init__.py
- tools/arxiv/utils/detect_latex.py
- tools/rag_utils/examples/arxiv_example.py
- tools/arxiv/db/export_ids.py
- tools/rag_utils/__init__.py
- tools/arxiv/utils/lifecycle.py
- tools/rag_utils/examples/custom_provider_example.py
- tools/arxiv/tests/validate_pipeline.py
- tools/arxiv/utils/check_papers.py
- tools/arxiv/utils/merge_lists.py
- tools/arxiv/utils/rebuild_database.py
- tools/rag_utils/examples/filesystem_example.py
- tools/rag_utils/academic_citation_toolkit.py
tools/arxiv/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tools/arxiv/**/*.py: Format code with Black for ArXiv tooling
Run Ruff lint checks on ArXiv tooling
Late chunking: process full documents before chunking to preserve context
Ensure database operations are atomic (success or rollback)
Maintain phase separation: complete extraction before embedding
Process files directly from the filesystem without database queries where specified
Implement error recovery with support for resuming from checkpoints
Preserve document structure and context throughout the processing pipeline
Files:
- tools/arxiv/utils/__init__.py
- tools/arxiv/utils/detect_latex.py
- tools/arxiv/db/export_ids.py
- tools/arxiv/utils/lifecycle.py
- tools/arxiv/tests/validate_pipeline.py
- tools/arxiv/utils/check_papers.py
- tools/arxiv/utils/merge_lists.py
- tools/arxiv/utils/rebuild_database.py
{tools,experiments}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Import from core framework when in tools/ or experiments/ (e.g., from core.framework.embedders import JinaV4Embedder)
Files:
- tools/arxiv/utils/__init__.py
- tools/arxiv/utils/detect_latex.py
- tools/rag_utils/examples/arxiv_example.py
- tools/arxiv/db/export_ids.py
- tools/rag_utils/__init__.py
- tools/arxiv/utils/lifecycle.py
- tools/rag_utils/examples/custom_provider_example.py
- tools/arxiv/tests/validate_pipeline.py
- tools/arxiv/utils/check_papers.py
- tools/arxiv/utils/merge_lists.py
- tools/arxiv/utils/rebuild_database.py
- tools/rag_utils/examples/filesystem_example.py
- tools/rag_utils/academic_citation_toolkit.py
{experiments,tools}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
When in experiments/ or other tools/, import modules from tools/ as needed (e.g., from tools.arxiv.pipelines.arxiv_pipeline import AcidPipeline)
Files:
- tools/arxiv/utils/__init__.py
- tools/arxiv/utils/detect_latex.py
- tools/rag_utils/examples/arxiv_example.py
- tools/arxiv/db/export_ids.py
- tools/rag_utils/__init__.py
- tools/arxiv/utils/lifecycle.py
- tools/rag_utils/examples/custom_provider_example.py
- tools/arxiv/tests/validate_pipeline.py
- tools/arxiv/utils/check_papers.py
- tools/arxiv/utils/merge_lists.py
- tools/arxiv/utils/rebuild_database.py
- tools/rag_utils/examples/filesystem_example.py
- tools/rag_utils/academic_citation_toolkit.py
{core,tools}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Infrastructure (core/, tools/) must not depend on experiments/
Files:
- tools/arxiv/utils/__init__.py
- tools/arxiv/utils/detect_latex.py
- tools/rag_utils/examples/arxiv_example.py
- tools/arxiv/db/export_ids.py
- tools/rag_utils/__init__.py
- tools/arxiv/utils/lifecycle.py
- tools/rag_utils/examples/custom_provider_example.py
- tools/arxiv/tests/validate_pipeline.py
- tools/arxiv/utils/check_papers.py
- tools/arxiv/utils/merge_lists.py
- tools/arxiv/utils/rebuild_database.py
- tools/rag_utils/examples/filesystem_example.py
- tools/rag_utils/academic_citation_toolkit.py
tools/arxiv/configs/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
Place ArXiv processing configuration files under tools/arxiv/configs/
Files:
- tools/arxiv/configs/arxiv_search_practical.yaml
- tools/arxiv/configs/arxiv_search_nokeywords.yaml
- tools/arxiv/configs/arxiv_search_minimal.yaml
- tools/arxiv/configs/arxiv_search.yaml
🧠 Learnings (15)
📓 Common learnings
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/configs/**/*.yaml : Place ArXiv processing configuration files under tools/arxiv/configs/
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/**/*.py : Run Ruff lint checks on ArXiv tooling
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/{scripts/lifecycle_cli.py,pipelines/arxiv_pipeline.py} : Use local file storage paths: /bulk-store/arxiv-data/pdf/YYMM for PDFs and latex/YYMM for LaTeX sources
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Implement ArXiv Lifecycle Manager CLI with subcommands process, status, batch, metadata, executing the unified workflow (PostgreSQL check, download missing content, sync PostgreSQL/ArangoDB, run ACID, generate Jina v4 embeddings, integrate HiRAG)
Applied to files:
- tools/arxiv/utils/__init__.py
- tools/arxiv/utils/lifecycle.py
- README.md
- CLAUDE.md
- tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:38:36.857Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/configs/**/*.yaml : Place ArXiv processing configuration files under tools/arxiv/configs/
Applied to files:
- tools/arxiv/configs/arxiv_search_practical.yaml
- tools/arxiv/configs/arxiv_search_nokeywords.yaml
- tools/arxiv/configs/arxiv_search_minimal.yaml
- .gitignore
- tools/arxiv/configs/arxiv_search.yaml
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/{scripts/lifecycle_cli.py,pipelines/arxiv_pipeline.py} : Use local file storage paths: /bulk-store/arxiv-data/pdf/YYMM for PDFs and latex/YYMM for LaTeX sources
Applied to files:
- tools/arxiv/configs/arxiv_search_minimal.yaml
- tools/arxiv/utils/lifecycle.py
- .gitignore
- tools/arxiv/utils/check_papers.py
- README.md
- tools/arxiv/configs/arxiv_search.yaml
- tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Lifecycle Manager should query PostgreSQL for metadata/files and perform idempotent operations with audit trail and error recovery
Applied to files:
- tools/arxiv/utils/lifecycle.py
- README.md
- CLAUDE.md
- tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Expose status levels NOT_FOUND, METADATA_ONLY, DOWNLOADED, PROCESSED, HIRAG_INTEGRATED from lifecycle_cli.py status command (including --json output)
Applied to files:
- tools/arxiv/utils/lifecycle.py
- tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Lifecycle Manager should perform PostgreSQL checks and downloads (not the pipeline) before invoking ACID processing
Applied to files:
- README.md
- CLAUDE.md
- tools/arxiv/utils/run_acid_pipeline.sh
- tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:38:36.857Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/**/*.py : Process files directly from the filesystem without database queries where specified
Applied to files:
README.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : ACID pipeline must be phase-separated: complete Extraction before Embedding
Applied to files:
CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Perform ArangoDB operations atomically (all-or-nothing transactions)
Applied to files:
- CLAUDE.md
- tools/arxiv/utils/run_acid_pipeline.sh
📚 Learning: 2025-09-04T00:38:36.857Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/**/*.py : Ensure database operations are atomic (success or rollback)
Applied to files:
CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Implement checkpointing with an atomic checkpoint file named acid_phased_checkpoint.json to resume on failure
Applied to files:
- CLAUDE.md
- tools/arxiv/utils/run_acid_pipeline.sh
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Use /dev/shm/acid_staging as the inter-phase staging directory (RamFS) and clean GPU memory between phases
Applied to files:
CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Process PDFs directly from filesystem without database queries (no DB dependency in pipeline)
Applied to files:
CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/{scripts/lifecycle_cli.py,pipelines/arxiv_pipeline.py} : Respect environment variables: ARANGO_PASSWORD, ARANGO_HOST, USE_GPU, CUDA_VISIBLE_DEVICES
Applied to files:
tools/arxiv/utils/run_acid_pipeline.sh
🧬 Code graph analysis (7)
tools/rag_utils/examples/arxiv_example.py (1)
tools/rag_utils/academic_citation_toolkit.py (6)
- create_arxiv_citation_toolkit (555-560)
- main (570-617)
- extract_paper_bibliography (512-531)
- store_bibliography_entries (173-175)
- store_bibliography_entries (193-230)
- store_bibliography_entries (244-255)
tools/rag_utils/__init__.py (1)
tools/rag_utils/academic_citation_toolkit.py (12)
- BibliographyEntry (41-60)
- InTextCitation (63-76)
- DocumentProvider (78-98)
- CitationStorage (160-180)
- ArangoDocumentProvider (100-131)
- FileSystemDocumentProvider (133-158)
- ArangoCitationStorage (182-235)
- JSONCitationStorage (237-266)
- UniversalBibliographyExtractor (268-531)
- UniversalCitationExtractor (533-552)
- create_arxiv_citation_toolkit (555-560)
- create_filesystem_citation_toolkit (562-567)
tools/arxiv/utils/lifecycle.py (2)
tools/arxiv/utils/arxiv_lifecycle_manager.py (3)
- ArXivLifecycleManager (80-570)
- PaperStatus (44-52)
- LifecycleResult (56-77)

tools/arxiv/utils/arxiv_api_client.py (1)
ArXivAPIClient(74-462)
tools/rag_utils/examples/custom_provider_example.py (1)
tools/rag_utils/academic_citation_toolkit.py (18)
- DocumentProvider (78-98)
- CitationStorage (160-180)
- UniversalBibliographyExtractor (268-531)
- BibliographyEntry (41-60)
- InTextCitation (63-76)
- get_document_text (91-93)
- get_document_text (113-116)
- get_document_text (141-149)
- get_document_chunks (96-98)
- get_document_chunks (118-131)
- get_document_chunks (151-158)
- store_bibliography_entries (173-175)
- store_bibliography_entries (193-230)
- store_bibliography_entries (244-255)
- store_citations (178-180)
- store_citations (232-235)
- store_citations (257-266)
- extract_paper_bibliography (512-531)
tools/arxiv/utils/rebuild_database.py (1)
tools/arxiv/utils/lifecycle.py (1)
setup_logging(39-52)
tools/rag_utils/examples/filesystem_example.py (1)
tools/rag_utils/academic_citation_toolkit.py (6)
- create_filesystem_citation_toolkit (562-567)
- main (570-617)
- extract_paper_bibliography (512-531)
- store_bibliography_entries (173-175)
- store_bibliography_entries (193-230)
- store_bibliography_entries (244-255)
tools/rag_utils/academic_citation_toolkit.py (3)
tools/rag_utils/examples/custom_provider_example.py (7)
- get_document_text (44-61)
- get_document_text (327-329)
- get_document_chunks (63-71)
- get_document_chunks (331-336)
- store_bibliography_entries (140-171)
- store_citations (173-201)
- main (338-437)

tools/rag_utils/examples/filesystem_example.py (1)

- main (63-172)

tools/rag_utils/examples/arxiv_example.py (1)

- main (17-136)
🪛 Ruff (0.12.2)
tools/arxiv/utils/__init__.py
13-13: No newline at end of file
Add trailing newline
(W292)
tools/rag_utils/examples/arxiv_example.py
52-52: Loop control variable title not used within loop body
(B007)
89-89: f-string without any placeholders
Remove extraneous f prefix
(F541)
125-125: f-string without any placeholders
Remove extraneous f prefix
(F541)
139-139: No newline at end of file
Add trailing newline
(W292)
tools/rag_utils/__init__.py
64-64: No newline at end of file
Add trailing newline
(W292)
tools/rag_utils/examples/custom_provider_example.py
15-15: typing.List is deprecated, use list instead
(UP035)
358-358: f-string without any placeholders
Remove extraneous f prefix
(F541)
360-360: f-string without any placeholders
Remove extraneous f prefix
(F541)
392-392: f-string without any placeholders
Remove extraneous f prefix
(F541)
404-404: f-string without any placeholders
Remove extraneous f prefix
(F541)
411-411: f-string without any placeholders
Remove extraneous f prefix
(F541)
430-430: f-string without any placeholders
Remove extraneous f prefix
(F541)
432-432: f-string without any placeholders
Remove extraneous f prefix
(F541)
440-440: No newline at end of file
Add trailing newline
(W292)
tools/arxiv/utils/merge_lists.py
61-61: Unnecessary mode argument
Remove mode argument
(UP015)
67-67: Replace aliased errors with OSError
Replace IOError with builtin OSError
(UP024)
tools/rag_utils/examples/filesystem_example.py
134-134: Unnecessary mode argument
Remove mode argument
(UP015)
137-137: f-string without any placeholders
Remove extraneous f prefix
(F541)
144-144: f-string without any placeholders
Remove extraneous f prefix
(F541)
153-153: f-string without any placeholders
Remove extraneous f prefix
(F541)
164-164: f-string without any placeholders
Remove extraneous f prefix
(F541)
175-175: No newline at end of file
Add trailing newline
(W292)
tools/rag_utils/academic_citation_toolkit.py
33-33: typing.List is deprecated, use list instead
(UP035)
33-33: typing.Dict is deprecated, use dict instead
(UP035)
33-33: typing.Tuple is deprecated, use tuple instead
(UP035)
145-145: Unnecessary mode argument
Remove mode argument
(UP015)
576-576: Local variable arango_password is assigned to but never used
Remove assignment to unused variable arango_password
(F841)
614-614: f-string without any placeholders
Remove extraneous f prefix
(F541)
617-617: f-string without any placeholders
Remove extraneous f prefix
(F541)
620-620: No newline at end of file
Add trailing newline
(W292)
🪛 LanguageTool
tools/rag_utils/README.md
[grammar] ~9-~9: There might be a mistake here.
Context: ...from: - Computer Science papers (ArXiv) - Economics papers (SSRN) - Medical papers...
(QB_NEW_EN)
[grammar] ~10-~10: There might be a mistake here.
Context: ...pers (ArXiv) - Economics papers (SSRN) - Medical papers (PubMed) - Legal papers (...
(QB_NEW_EN)
[grammar] ~11-~11: There might be a mistake here.
Context: ... papers (SSRN) - Medical papers (PubMed) - Legal papers (Harvard Law Library) - Any...
(QB_NEW_EN)
[grammar] ~12-~12: There might be a mistake here.
Context: ...ed) - Legal papers (Harvard Law Library) - Any academic corpus ## Available Tools ...
(QB_NEW_EN)
[grammar] ~17-~17: There might be a mistake here.
Context: ...Tools ### 🕸️ Academic Citation Toolkit File: academic_citation_toolkit.py ...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ...d citations, author-year, hybrid formats - Pluggable architecture: Easy to extend...
(QB_NEW_EN)
[grammar] ~50-~50: There might be a mistake here.
Context: ...ks for: - ArXiv computer science papers - SSRN economics papers - PubMed medical...
(QB_NEW_EN)
[grammar] ~51-~51: There might be a mistake here.
Context: ...r science papers - SSRN economics papers - PubMed medical papers - Harvard Law Libr...
(QB_NEW_EN)
[grammar] ~52-~52: There might be a mistake here.
Context: ...onomics papers - PubMed medical papers - Harvard Law Library legal papers ### 2....
(QB_NEW_EN)
[grammar] ~59-~59: There might be a mistake here.
Context: ...*: ArangoDB, filesystem, APIs, databases - Storage Backend: ArangoDB, PostgreSQL,...
(QB_NEW_EN)
[grammar] ~60-~60: There might be a mistake here.
Context: ...ckend**: ArangoDB, PostgreSQL, JSON, CSV - Format Parser: Different citation form...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ...liography sections** (formal references) - In-text citations (contextual pointers...
(QB_NEW_EN)
[grammar] ~68-~68: There might be a mistake here.
Context: ...n-text citations** (contextual pointers) - Citation networks (paper-to-paper rela...
(QB_NEW_EN)
[grammar] ~69-~69: There might be a mistake here.
Context: ...etworks** (paper-to-paper relationships) - Author networks (collaboration pattern...
(QB_NEW_EN)
[grammar] ~77-~77: There might be a mistake here.
Context: ...) - Geographic region (US vs EU vs Asia) - Time period (1990s vs 2020s) - Publicati...
(QB_NEW_EN)
[grammar] ~78-~78: There might be a mistake here.
Context: ...s Asia) - Time period (1990s vs 2020s) - Publication venue (journal vs conference...
(QB_NEW_EN)
[grammar] ~170-~170: There might be a mistake here.
Context: ...RMES for: - Citation network enrichment - Bibliography metadata extraction - Acade...
(QB_NEW_EN)
[grammar] ~171-~171: There might be a mistake here.
Context: ...hment - Bibliography metadata extraction - Academic relationship mapping ### HADES...
(QB_NEW_EN)
[grammar] ~178-~178: There might be a mistake here.
Context: ...nal analysis (WHERE × WHAT × CONVEYANCE) - Observer-dependent citation networks - C...
(QB_NEW_EN)
[grammar] ~179-~179: There might be a mistake here.
Context: ...) - Observer-dependent citation networks - Context amplification measurement ### H...
(QB_NEW_EN)
[grammar] ~186-~186: There might be a mistake here.
Context: ...terns: - Configuration-driven operation - Reusable across modules - Tool gifting b...
(QB_NEW_EN)
[grammar] ~187-~187: There might be a mistake here.
Context: ...iven operation - Reusable across modules - Tool gifting between modules ## Perform...
(QB_NEW_EN)
[grammar] ~194-~194: There might be a mistake here.
Context: ...tweight**: Processes papers individually - Streaming: No need to load entire corp...
(QB_NEW_EN)
[grammar] ~195-~195: There might be a mistake here.
Context: ... No need to load entire corpus in memory - Configurable: Adjustable chunk sizes a...
(QB_NEW_EN)
[grammar] ~200-~200: There might be a mistake here.
Context: ...phy extraction**: ~1-2 seconds per paper - Citation parsing: ~0.5-1 seconds per p...
(QB_NEW_EN)
[grammar] ~201-~201: There might be a mistake here.
Context: ...tion parsing**: ~0.5-1 seconds per paper - Network construction: Scales with corp...
(QB_NEW_EN)
[grammar] ~202-~202: There might be a mistake here.
Context: ... construction**: Scales with corpus size - Parallelizable: Easy to distribute acr...
(QB_NEW_EN)
[grammar] ~207-~207: There might be a mistake here.
Context: ...itations**: 90%+ for numbered references - Medium confidence for author-year: 70-...
(QB_NEW_EN)
[grammar] ~208-~208: There might be a mistake here.
Context: ...: 70-85% depending on format consistency - Robust error handling: Graceful degrad...
(QB_NEW_EN)
[grammar] ~215-~215: There might be a mistake here.
Context: ...xtractor**: Build collaboration networks - Topic Evolution Tracker: Track concept...
(QB_NEW_EN)
[grammar] ~216-~216: There might be a mistake here.
Context: ...r**: Track concept development over time - Cross-Corpus Linker: Connect papers ac...
(QB_NEW_EN)
[grammar] ~217-~217: There might be a mistake here.
Context: ... Connect papers across different sources - Citation Context Analyzer: Understand ...
(QB_NEW_EN)
[grammar] ~222-~222: There might be a mistake here.
Context: ...cholar API**: Academic graph integration - OpenCitations: Citation database integ...
(QB_NEW_EN)
[grammar] ~223-~223: There might be a mistake here.
Context: ...tations**: Citation database integration - Crossref API: DOI resolution and metad...
(QB_NEW_EN)
[grammar] ~224-~224: There might be a mistake here.
Context: ...ssref API**: DOI resolution and metadata - ORCID API: Author disambiguation ## C...
(QB_NEW_EN)
tools/rag_utils/academic_citation_toolkit.md
[grammar] ~66-~66: There might be a mistake here.
Context: ..." pass ``` Implementations: - ArangoDocumentProvider: For ArangoDB (our ArXiv setup) - `File...
(QB_NEW_EN)
[grammar] ~67-~67: There might be a mistake here.
Context: ...rovider: For ArangoDB (our ArXiv setup) - FileSystemDocumentProvider: For local files #### CitationStorage...
(QB_NEW_EN)
[grammar] ~86-~86: There might be a mistake here.
Context: ..." pass ``` Implementations: - ArangoCitationStorage: For ArangoDB storage - `JSONCitation...
(QB_NEW_EN)
[grammar] ~87-~87: There might be a mistake here.
Context: ...goCitationStorage: For ArangoDB storage - JSONCitationStorage`: For JSON file storage ### Main Proces...
(QB_NEW_EN)
[grammar] ~110-~110: There might be a mistake here.
Context: ... Bibliography Extraction Strategies: 1. Explicit Headers: Look for "References...
(QB_NEW_EN)
[grammar] ~116-~116: There might be a mistake here.
Context: ...disciplines Entry Parsing Features: - Identifier Extraction: ArXiv IDs, DOIs...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...*: ArXiv IDs, DOIs, PubMed IDs, SSRN IDs - Metadata Extraction: Authors, titles, ...
(QB_NEW_EN)
[grammar] ~118-~118: There might be a mistake here.
Context: ...action**: Authors, titles, venues, years - Confidence Scoring: Based on extracted...
(QB_NEW_EN)
[grammar] ~119-~119: There might be a mistake here.
Context: ...: Based on extracted information quality - Format Flexibility: Works with differe...
(QB_NEW_EN)
[grammar] ~338-~338: There might be a mistake here.
Context: ... of crashing ``` ### Confidence Scoring Every extracted entry includes a confide...
(QB_NEW_EN)
[grammar] ~365-~365: There might be a mistake here.
Context: ...rformance Optimization ### Memory Usage - Streaming processing: Process papers i...
(QB_NEW_EN)
[grammar] ~366-~366: There might be a mistake here.
Context: ...g**: Process papers individually, don't load entire corpus - **Configurable chunk si...
(QB_NEW_EN)
[grammar] ~370-~370: There might be a mistake here.
Context: ...t text when needed ### Processing Speed - Regex optimization: Pre-compile freque...
(QB_NEW_EN)
🪛 Shellcheck (0.10.0)
tools/arxiv/tests/run_large_scale_test.sh
[warning] 52-52: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
🔇 Additional comments (18)
tools/arxiv/configs/arxiv_search_practical.yaml (1)
44-44: Output base_dir relocation — LGTM

Consistent with the repo-wide move to data/arxiv_collections. Please verify all tooling resolves this path relative to the repo root (not CWD).
tools/arxiv/db/export_ids.py (1)
304-307: Default --out-dir aligned to data/arxiv_collections — LGTM

Matches YAML configs and utils defaults.
tools/arxiv/configs/arxiv_search_minimal.yaml (1)
31-31: Minimal config base_dir update — LGTM

In sync with other configs and CLI defaults.
CLAUDE.md (1)
369-371: Utils path update — LGTM

Docs now reference rebuild_database.py under utils/.
tools/arxiv/utils/detect_latex.py (1)
26-26: Remove sys.path hacks — LGTM

Cleaner; package-relative imports only.
.gitignore (2)
379-383: Safe to remove old script-data ignores — no tracked files found.
388-390: No tracked analysis docs found — the specified internal analysis documents aren't in version control, so ignoring them is safe.

README.md (1)
38-38: Reorg note looks good

"utils/ # Utility scripts" aligns with the repo move. No issues.
tools/arxiv/utils/rebuild_database.py (2)
371-381: Verify schema for pdf_size_bytes to avoid UPDATE failures

The UPDATE sets pdf_size_bytes, but the insert schema doesn't include this column and the docs don't mention it. Confirm the column exists or add a migration.

If the column is missing, either:

- add it: `ALTER TABLE papers ADD COLUMN pdf_size_bytes BIGINT;`, or
- drop it from the UPDATE (lower fidelity).
I can generate a migration if you confirm the intended schema.
335-341: Nice: robust arXiv ID normalization for PDFs

Regex-based version stripping covers v-suffixed files reliably.
tools/arxiv/CLAUDE.md (1)
22-29: Pipeline section pathing is consistent with the preceding cd

Good consistency between utils → ../pipelines. No changes needed.
tools/arxiv/utils/merge_lists.py (1)
43-49: Good: explicit directory writability check

Nice defensive step with clear error messaging.
tools/arxiv/tests/validate_pipeline.py (2)
217-219: Update to new lifecycle CLI LGTM

Matches the repo reorg to tools/arxiv/utils and lifecycle.py.
217-219: Verify CLI path and `process` command exist

The test prints:

```sh
cd tools/arxiv/utils
python lifecycle.py process [arxiv_id]
```

Confirm that `tools/arxiv/utils/lifecycle.py` actually registers a `process` subcommand (via click, argparse, etc.) and update these instructions if the script or command name has changed.

tools/arxiv/utils/lifecycle.py (2)
32-34: Good: switched to proper package-relative imports, removed path hacks

This aligns with our packaging and import guidelines.
287-304: Ignore unnecessary pdf_url change
The `ArXivMetadata` dataclass declares `pdf_url: str = ""` and its `__post_init__` populates it with `https://arxiv.org/pdf/{arxiv_id}.pdf`, so `metadata.pdf_url` is always defined and no substitution is needed. Likely an incorrect or invalid review comment.
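A minimal sketch of the pattern this comment relies on, assuming the real ArXivMetadata class behaves as described (it almost certainly carries more fields than shown here):

```python
# Illustrative reduction of the dataclass default-population pattern.
from dataclasses import dataclass

@dataclass
class ArXivMetadata:
    arxiv_id: str
    pdf_url: str = ""

    def __post_init__(self) -> None:
        if not self.pdf_url:
            self.pdf_url = f"https://arxiv.org/pdf/{self.arxiv_id}.pdf"

meta = ArXivMetadata("2508.21038")
assert meta.pdf_url == "https://arxiv.org/pdf/2508.21038.pdf"
```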
tools/rag_utils/examples/filesystem_example.py (1)
13-23: Nice: import guard for module execution UX. Clear, helpful guidance when run directly.
tools/rag_utils/__init__.py (1)
13-35: Public re-exports via relative imports look good. Matches the repo guideline to use relative imports within the module and provides a clean public API surface.
| output: | ||
| # Base directory for output files | ||
| base_dir: tools/arxiv/scripts/data/arxiv_collections | ||
| base_dir: data/arxiv_collections |
💡 Verification agent
🧩 Analysis chain
Path update LGTM; verify no stale references to old base_dir.
The move to data/arxiv_collections is consistent with the reorg and with .gitignore’s /data/ rule.
Run to confirm nothing still points at the old path:
🏁 Script executed:
#!/bin/bash
rg -nP 'tools/arxiv/scripts/data/arxiv_collections'
Length of output: 377
Update documentation paths
- In `docs/prd/arxiv_metadata_service_prd.md` (lines 34, 188), replace `tools/arxiv/scripts/data/arxiv_collections/` with `data/arxiv_collections/`.
🤖 Prompt for AI Agents
In tools/arxiv/configs/arxiv_search_nokeywords.yaml around line 44 the base_dir
is set to data/arxiv_collections which requires updating references in the
documentation; open docs/prd/arxiv_metadata_service_prd.md and on lines 34 and
188 replace any occurrences of tools/arxiv/scripts/data/arxiv_collections/ with
data/arxiv_collections/ so the docs match the config path exactly.
| output: | ||
| # Base directory for output files | ||
| base_dir: tools/arxiv/scripts/data/arxiv_collections | ||
| base_dir: data/arxiv_collections |
💡 Verification agent
🧩 Analysis chain
Path update LGTM; verify downstream code and docs are aligned.
Consistent with reorg to data/arxiv_collections and existing ignore rules.
Quick sweep for stale references:
🏁 Script executed:
#!/bin/bash
rg -nP 'tools/arxiv/scripts/data/arxiv_collections'
Length of output: 377
🏁 Script executed:
#!/bin/bash
rg -nP -nC2 'tools/arxiv/scripts/data/arxiv_collections' docs/prd/arxiv_metadata_service_prd.md
Length of output: 518
Update documentation paths in docs/prd/arxiv_metadata_service_prd.md at lines 34 and 188: replace tools/arxiv/scripts/data/arxiv_collections/ with data/arxiv_collections.
🤖 Prompt for AI Agents
In tools/arxiv/configs/arxiv_search.yaml around line 72, the base_dir was
changed to data/arxiv_collections but the documentation still references the old
path; update docs/prd/arxiv_metadata_service_prd.md at lines 34 and 188
replacing occurrences of tools/arxiv/scripts/data/arxiv_collections/ with
data/arxiv_collections so the documentation matches the new config path.
| # Use compgen to safely check for matching files | ||
| PAPER_LIST_FILES=($(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt")) | ||
|
|
||
| if [ ${#PAPER_LIST_FILES[@]} -eq 0 ]; then | ||
| echo "No existing paper lists found. You need to create a paper list first." | ||
| echo -e "${YELLOW}Note: Use the lifecycle manager to process papers:${NC}" | ||
| echo "python lifecycle.py batch <paper_list.txt>" | ||
| exit 1 | ||
| else | ||
| # Use the most recent file by modification time | ||
| PAPER_LIST=$(ls -t "${PAPER_LIST_FILES[@]}" | head -1) | ||
| PAPER_COUNT=$(wc -l < "$PAPER_LIST") | ||
| echo -e "${GREEN}Found $PAPER_COUNT papers in $PAPER_LIST${NC}" |
🛠️ Refactor suggestion
Use mapfile and robust ls to avoid word-splitting and SC2207
Array assignment from command substitution can split on spaces; prefer mapfile. Also pass “--” to ls for correctness.
-# Use compgen to safely check for matching files
-PAPER_LIST_FILES=($(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt"))
+# Use compgen to list matching files without glob errors
+mapfile -t PAPER_LIST_FILES < <(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt" || true)
@@
- # Use the most recent file by modification time
- PAPER_LIST=$(ls -t "${PAPER_LIST_FILES[@]}" | head -1)
+ # Use the most recent file by modification time
+ PAPER_LIST=$(ls -t -- "${PAPER_LIST_FILES[@]}" | head -1)
PAPER_COUNT=$(wc -l < "$PAPER_LIST")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Use compgen to safely check for matching files | |
| PAPER_LIST_FILES=($(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt")) | |
| if [ ${#PAPER_LIST_FILES[@]} -eq 0 ]; then | |
| echo "No existing paper lists found. You need to create a paper list first." | |
| echo -e "${YELLOW}Note: Use the lifecycle manager to process papers:${NC}" | |
| echo "python lifecycle.py batch <paper_list.txt>" | |
| exit 1 | |
| else | |
| # Use the most recent file by modification time | |
| PAPER_LIST=$(ls -t "${PAPER_LIST_FILES[@]}" | head -1) | |
| PAPER_COUNT=$(wc -l < "$PAPER_LIST") | |
| echo -e "${GREEN}Found $PAPER_COUNT papers in $PAPER_LIST${NC}" | |
| # Use compgen to list matching files without glob errors | |
| mapfile -t PAPER_LIST_FILES < <(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt" || true) | |
| if [ ${#PAPER_LIST_FILES[@]} -eq 0 ]; then | |
| echo "No existing paper lists found. You need to create a paper list first." | |
| echo -e "${YELLOW}Note: Use the lifecycle manager to process papers:${NC}" | |
| echo "python lifecycle.py batch <paper_list.txt>" | |
| exit 1 | |
| else | |
| # Use the most recent file by modification time | |
| PAPER_LIST=$(ls -t -- "${PAPER_LIST_FILES[@]}" | head -1) | |
| PAPER_COUNT=$(wc -l < "$PAPER_LIST") | |
| echo -e "${GREEN}Found $PAPER_COUNT papers in $PAPER_LIST${NC}" |
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 52-52: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 51-63, the current
array assignment uses command substitution which causes word-splitting (SC2207)
and the ls call isn’t robust for filenames starting with “-”. Replace the array
assignment with mapfile -t PAPER_LIST_FILES < <(compgen -G
"../../../data/arxiv_collections/arxiv_ids_*.txt") to safely populate the array
without word-splitting, ensure you quote expansions (e.g.
"${PAPER_LIST_FILES[@]}"), and call ls with the end-of-options marker (ls -t --
"${PAPER_LIST_FILES[@]}") when selecting the most recent file; keep the existing
length check and wc -l logic but ensure all expansions are quoted to avoid
splitting issues.
| Usage Examples: | ||
| # Process a single paper | ||
| python lifecycle_cli.py process 2508.21038 | ||
| python lifecycle.py process 2508.21038 | ||
| # Check status of a paper | ||
| python lifecycle_cli.py status 2508.21038 | ||
| python lifecycle.py status 2508.21038 | ||
| # Process multiple papers | ||
| python lifecycle_cli.py batch papers.txt | ||
| python lifecycle.py batch papers.txt | ||
| # Force reprocessing | ||
| python lifecycle_cli.py process 2508.21038 --force | ||
| python lifecycle.py process 2508.21038 --force | ||
| """ |
🛠️ Refactor suggestion
Fix CLI usage examples — current examples will fail due to relative imports
Because this module uses relative imports (from .arxiv_lifecycle_manager …), invoking it as a file (python lifecycle.py …) will raise “attempted relative import with no known parent package.” Update the examples to module execution.
- python lifecycle.py process 2508.21038
+ python -m tools.arxiv.utils.lifecycle process 2508.21038
@@
- python lifecycle.py status 2508.21038
+ python -m tools.arxiv.utils.lifecycle status 2508.21038
@@
- python lifecycle.py batch papers.txt
+ python -m tools.arxiv.utils.lifecycle batch papers.txt
@@
- python lifecycle.py process 2508.21038 --force
+ python -m tools.arxiv.utils.lifecycle process 2508.21038 --force
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Usage Examples: | |
| # Process a single paper | |
| python lifecycle_cli.py process 2508.21038 | |
| python lifecycle.py process 2508.21038 | |
| # Check status of a paper | |
| python lifecycle_cli.py status 2508.21038 | |
| python lifecycle.py status 2508.21038 | |
| # Process multiple papers | |
| python lifecycle_cli.py batch papers.txt | |
| python lifecycle.py batch papers.txt | |
| # Force reprocessing | |
| python lifecycle_cli.py process 2508.21038 --force | |
| python lifecycle.py process 2508.21038 --force | |
| """ | |
| """ | |
| Usage Examples: | |
| # Process a single paper | |
| python -m tools.arxiv.utils.lifecycle process 2508.21038 | |
| # Check status of a paper | |
| python -m tools.arxiv.utils.lifecycle status 2508.21038 | |
| # Process multiple papers | |
| python -m tools.arxiv.utils.lifecycle batch papers.txt | |
| # Force reprocessing | |
| python -m tools.arxiv.utils.lifecycle process 2508.21038 --force | |
| """ |
🤖 Prompt for AI Agents
In tools/arxiv/utils/lifecycle.py around lines 10 to 22, the CLI usage examples
show running the file directly (python lifecycle.py ...) which fails due to
relative imports; update the examples to use module execution under the package
path (python -m tools.arxiv.utils.lifecycle ...) for each example (process,
status, batch, force) so the module is executed with a proper package context
and relative imports resolve.
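An alternative (or complement) to rewriting the examples is to fail fast with guidance when the file is run directly. This is only a sketch of the import-guard pattern already praised in the rag_utils examples, applied here under the assumption that lifecycle.py imports from .arxiv_lifecycle_manager as the comment describes:

```python
# Hypothetical guard at the top of lifecycle.py: exit with a hint instead of a
# confusing ImportError when the script is executed outside its package.
if __package__ in (None, ""):
    raise SystemExit(
        "Run this as a module so relative imports resolve:\n"
        "  python -m tools.arxiv.utils.lifecycle process <arxiv_id>"
    )

from .arxiv_lifecycle_manager import ArXivLifecycleManager  # noqa: E402
```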
| from rag_utils.academic_citation_toolkit import ( | ||
| UniversalBibliographyExtractor, | ||
| DocumentProvider, | ||
| BibliographyEntry | ||
| ) | ||
|
|
||
| class SSRNProvider(DocumentProvider): | ||
| def __init__(self, api_key: str): | ||
| self.api_key = api_key | ||
|
|
||
| def get_document_text(self, document_id: str) -> Optional[str]: | ||
| # Fetch from SSRN API | ||
| response = requests.get(f"https://ssrn.com/api/papers/{document_id}", | ||
| headers={"Authorization": f"Bearer {self.api_key}"}) | ||
| return response.json().get("full_text") if response.ok else None | ||
|
|
Example won’t run: missing requests import in SSRN provider snippet.
Add the import to prevent NameError.
from rag_utils.academic_citation_toolkit import (
UniversalBibliographyExtractor,
DocumentProvider,
BibliographyEntry
)
+import requests
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from rag_utils.academic_citation_toolkit import ( | |
| UniversalBibliographyExtractor, | |
| DocumentProvider, | |
| BibliographyEntry | |
| ) | |
| class SSRNProvider(DocumentProvider): | |
| def __init__(self, api_key: str): | |
| self.api_key = api_key | |
| def get_document_text(self, document_id: str) -> Optional[str]: | |
| # Fetch from SSRN API | |
| response = requests.get(f"https://ssrn.com/api/papers/{document_id}", | |
| headers={"Authorization": f"Bearer {self.api_key}"}) | |
| return response.json().get("full_text") if response.ok else None | |
| from rag_utils.academic_citation_toolkit import ( | |
| UniversalBibliographyExtractor, | |
| DocumentProvider, | |
| BibliographyEntry | |
| ) | |
| import requests | |
| class SSRNProvider(DocumentProvider): | |
| def __init__(self, api_key: str): | |
| self.api_key = api_key | |
| def get_document_text(self, document_id: str) -> Optional[str]: | |
| # Fetch from SSRN API | |
| response = requests.get( | |
| f"https://ssrn.com/api/papers/{document_id}", | |
| headers={"Authorization": f"Bearer {self.api_key}"} | |
| ) | |
| return response.json().get("full_text") if response.ok else None |
🤖 Prompt for AI Agents
In tools/rag_utils/academic_citation_toolkit.md around lines 171 to 186, the
SSRNProvider example references requests but does not import it; add an import
statement for requests at the top of the snippet (alongside the other imports)
so the example runs without raising NameError.
| logging.basicConfig(level=logging.INFO) | ||
| logger = logging.getLogger(__name__) |
🛠️ Refactor suggestion
Don’t configure root logging in a library module.
Move basicConfig under main() to avoid affecting host apps importing this module.
-logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
Add in main():
def main():
"""Test the universal citation toolkit."""
+ logging.basicConfig(level=logging.INFO,
+ format="%(asctime)s %(levelname)s %(name)s: %(message)s")
Also applies to: 581-585
🤖 Prompt for AI Agents
In tools/rag_utils/academic_citation_toolkit.py around lines 37-38 (and also
lines 581-585), the module currently calls logging.basicConfig at import time
which configures the root logger for any host application; move the basicConfig
call into a main() or if __name__ == "__main__" guard so that root logging is
only configured when the module is run as a script. Remove or replace top-level
basicConfig and ensure logger = logging.getLogger(__name__) remains, then add
logging.basicConfig(...) inside main() before any logging calls when executed
directly.
| # Extract DOI | ||
| doi_match = re.search(r'doi:?\s*([10]\.\d+/[^\s,]+)', entry_text, re.IGNORECASE) | ||
| doi = doi_match.group(1) if doi_match else None | ||
|
|
DOI regex is wrong ([10] matches “1” or “0”); fix to the standard DOI form.
Current pattern will miss valid DOIs and match invalid ones.
- doi_match = re.search(r'doi:?\s*([10]\.\d+/[^\s,]+)', entry_text, re.IGNORECASE)
+ doi_match = re.search(r'\b(10\.\d{4,9}/[^\s"<>]+)\b', entry_text, re.IGNORECASE)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Extract DOI | |
| doi_match = re.search(r'doi:?\s*([10]\.\d+/[^\s,]+)', entry_text, re.IGNORECASE) | |
| doi = doi_match.group(1) if doi_match else None | |
| # Extract DOI | |
| doi_match = re.search(r'\b(10\.\d{4,9}/[^\s"<>]+)\b', entry_text, re.IGNORECASE) | |
| doi = doi_match.group(1) if doi_match else None |
🤖 Prompt for AI Agents
In tools/rag_utils/academic_citation_toolkit.py around lines 415 to 418, the DOI
extraction regex uses [10]\.\d+ which incorrectly matches only '1' or '0' before
the dot; replace it with a proper DOI pattern such as using 10\.\d{4,9}/ to
match standard DOIs and capture the suffix up to whitespace or punctuation (for
example: doi:?\s*(10\.\d{4,9}/[^\s,;]+) with re.IGNORECASE), so
doi_match.group(1) returns a valid DOI when present.
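A quick sanity check of the corrected pattern suggested above; only the regex is exercised here, and the surrounding parsing code in academic_citation_toolkit.py is not reproduced.

```python
# Standalone check of the suggested DOI pattern against a sample reference string.
import re

DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)\b', re.IGNORECASE)

text = "Mikolov et al., Efficient Estimation of Word Representations, doi:10.48550/arXiv.1301.3781"
match = DOI_RE.search(text)
assert match is not None and match.group(1) == "10.48550/arXiv.1301.3781"
```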
Summary by CodeRabbit