
Conversation

r3d91ll (Owner) commented Sep 4, 2025

Summary by CodeRabbit

  • New Features
    • Added a Retrieval-Augmented Generation (RAG) toolkit for citation extraction with example demos and storage backends.
  • Documentation
    • Updated READMEs and guides to reflect reorganized tooling, new commands, and comprehensive toolkit docs.
  • Refactor
    • Moved tooling into a utils package, switched to package-relative usage, and updated default output locations.
  • Chores
    • Removed deprecated scripts, adjusted .gitignore entries, updated configs, and improved logging/path handling.
  • Tests
    • Updated large-scale test/validation flows to use lifecycle-based paper lists and new guidance.

…pipeline execution, embedding phase, and testing. These changes include the deletion of the following files:

- rebuild_postgresql_fixed.py
- run_acid_pipeline.sh
- run_embedding_phase_only.py
- run_pipeline_from_list.py
- run_test_pipeline.py
- run_weekend_test.sh

This cleanup is part of the transition to a new processing architecture and improves maintainability by removing unused code.
- Introduced `arxiv_example.py` to demonstrate citation extraction from ArXiv papers using ArangoDB.
- Created `custom_provider_example.py` showcasing custom DocumentProvider and CitationStorage implementations with SQLite.
- Added `filesystem_example.py` for citation extraction from local PDF/text files, highlighting the toolkit's versatility.
coderabbitai bot (Contributor) commented Sep 4, 2025

Warning

Rate limit exceeded

@r3d91ll has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 1 minute and 33 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between e84c2ab and 7fc37dd.

📒 Files selected for processing (9)
  • tools/arxiv/db/export_ids.py (2 hunks)
  • tools/arxiv/tests/validate_pipeline.py (2 hunks)
  • tools/arxiv/utils/check_papers.py (5 hunks)
  • tools/arxiv/utils/merge_lists.py (3 hunks)
  • tools/arxiv/utils/rebuild_database.py (6 hunks)
  • tools/rag_utils/academic_citation_toolkit.py (1 hunks)
  • tools/rag_utils/examples/arxiv_example.py (1 hunks)
  • tools/rag_utils/examples/custom_provider_example.py (1 hunks)
  • tools/rag_utils/examples/filesystem_example.py (1 hunks)

Walkthrough

Repository reorganizes ArXiv tooling from scripts/ to utils/, updates configs and docs, adjusts output/data paths, removes many legacy scripts under tools/arxiv/scripts/, adds a new tools/rag_utils Academic Citation Toolkit (code, docs, and examples), and updates utils for package imports, logging, and CLI usage.

Changes

Cohort / File(s) and summary:

  • Repo housekeeping (.gitignore): Removed ignores for tools/arxiv/scripts/data/* and tools/arxiv/scripts/data/arxiv_collections/; added ignores for three analysis Markdown files under tools/arxiv/.
  • Docs updates (README.md, CLAUDE.md, tools/arxiv/CLAUDE.md, tools/rag_utils/README.md, tools/rag_utils/academic_citation_toolkit.md): Updated references from scripts/ to utils/, replaced lifecycle_cli.py usage with lifecycle.py, added RAG Utils docs and toolkit spec.
  • Config path updates (tools/arxiv/configs/*arxiv_search*.yaml): Changed output.base_dir from tools/arxiv/scripts/data/arxiv_collections to data/arxiv_collections.
  • Database package cleanup (tools/arxiv/database/__init__.py): Removed module content and two public constants (__version__, DATABASE_NAME).
  • Legacy scripts removed (tools/arxiv/scripts/*, many): Deleted numerous legacy scripts: collectors (collect_ai_papers*), PDF scanners (pdf_scanner*.py), PostgreSQL rebuilders (rebuild_postgresql*.py), pipeline runners/tests (run_pipeline_from_list.py, run_test_pipeline.py, run_weekend_test.sh), and the embedding-phase runner.
  • Tests and validation updates (tools/arxiv/tests/run_large_scale_test.sh, tools/arxiv/tests/validate_pipeline.py): Switched discovery and guidance to utils/lifecycle.py flows; updated paper-list discovery logic and user guidance.
  • ArXiv utils package & helpers (tools/arxiv/utils/*, tools/arxiv/db/export_ids.py): Added tools/arxiv/utils/__init__.py; removed sys.path hacks; converted imports to relative package imports (e.g., lifecycle.py); standardized project-root Path-based data paths; changed defaults/output dirs (merge/exports); added logging setup in rebuild_database.py; updated CLI invocations to accept --arango-password; adjusted export_ids default out-dir to data/arxiv_collections. A minimal sketch of the import/path pattern appears after this list.
  • RAG Utils package, new (tools/rag_utils/*): Added the tools/rag_utils package: academic_citation_toolkit.py (data models, providers, storages, extractor, factories), __init__.py (public exports), detailed docs, and three example scripts demonstrating Arango, filesystem, and custom-provider workflows.
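
A minimal sketch of the package-relative import and project-root path pattern noted in the ArXiv utils bullet above; the module, function, and constant names are illustrative assumptions, not the actual code:

from pathlib import Path

from .lifecycle import setup_logging  # package-relative import instead of a sys.path hack

# Walk up from this file to the repository root; the parents index depends on nesting
# depth (utils -> arxiv -> tools -> repo root for a module inside tools/arxiv/utils/).
PROJECT_ROOT = Path(__file__).resolve().parents[3]
DATA_DIR = PROJECT_ROOT / "data" / "arxiv_collections"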

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as tools/arxiv/utils/lifecycle.py
  participant LCM as ArXivLifecycleManager
  participant Pipeline as pipelines/arxiv_pipeline.py
  participant DB as ArangoDB

  User->>CLI: batch <paper_list>
  CLI->>LCM: start processing (process/batch)
  LCM->>Pipeline: invoke pipeline (--arango-password)
  Pipeline->>DB: read/write papers & embeddings
  Pipeline-->>LCM: results/metrics
  LCM-->>CLI: summary/status
  CLI-->>User: completion & next steps
sequenceDiagram
  autonumber
  participant Dev
  participant Extractor as UniversalBibliographyExtractor
  participant Provider as DocumentProvider (Arango/FileSystem/WebAPI)
  participant Storage as CitationStorage (Arango/JSON/SQLite)

  Dev->>Extractor: extract_paper_bibliography(paper_id)
  Extractor->>Provider: get_document_text / get_document_chunks
  Provider-->>Extractor: paper text / chunks
  Extractor->>Extractor: detect references, parse entries
  Extractor-->>Dev: List[BibliographyEntry]
  Dev->>Storage: store_bibliography_entries(entries)
  Storage-->>Dev: success/failure

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Poem

In burrows of code I twitch my nose,
Old scripts fade where the new path goes.
Utils bloom, a toolkit springs,
Citations hop on scholarly wings.
From ArXiv fields to JSON nests—
Thump! I ship these tidy tests. 🐇✨

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch experiement_word2vec_lineage

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

- Keep deletions of deprecated scripts in tools/arxiv/scripts/
- Functionality moved to tools/arxiv/utils/ with improvements
- Preserve docstring improvements from main in utils files
- Maintain architectural improvements from reorganization
r3d91ll (Owner, Author) commented Sep 4, 2025

@coderabbitai we have resolved the merge conflict, can you please review?

coderabbitai bot (Contributor) commented Sep 4, 2025

@r3d91ll I'll review the changes in PR #26 now that the merge conflict has been resolved.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Docstrings generation was requested by @r3d91ll.

* #26 (comment)

The following files were modified:

* `tools/arxiv/db/export_ids.py`
* `tools/arxiv/tests/validate_pipeline.py`
* `tools/arxiv/utils/check_papers.py`
* `tools/arxiv/utils/merge_lists.py`
* `tools/arxiv/utils/rebuild_database.py`
* `tools/rag_utils/academic_citation_toolkit.py`
* `tools/rag_utils/examples/arxiv_example.py`
* `tools/rag_utils/examples/custom_provider_example.py`
* `tools/rag_utils/examples/filesystem_example.py`
coderabbitai bot (Contributor) commented Sep 4, 2025

Note

Generated docstrings for this pull request at #27

📝 Add docstrings to `experiement_word2vec_lineage`
r3d91ll merged commit 3a313de into main Sep 4, 2025
coderabbitai bot added a commit that referenced this pull request Sep 4, 2025
Docstrings generation was requested by @r3d91ll.

* #26 (comment)

The following files were modified:

* `tools/arxiv/utils/rebuild_database.py`
* `tools/rag_utils/academic_citation_toolkit.py`
* `tools/rag_utils/examples/custom_provider_example.py`
* `tools/rag_utils/examples/filesystem_example.py`
coderabbitai bot (Contributor) commented Sep 4, 2025

Note

Generated docstrings for this pull request at #28

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tools/arxiv/utils/detect_latex.py (1)

278-285: Guard against division by zero in summary

If results is empty, current percentage prints will raise ZeroDivisionError.

-        print(f"Papers with LaTeX: {stats['has_latex']:,} ({stats['has_latex']/stats['total_papers']*100:.1f}%)")
-        print(f"Papers without LaTeX: {stats['no_latex']:,} ({stats['no_latex']/stats['total_papers']*100:.1f}%)")
-        print(f"Unknown status: {stats['unknown']:,} ({stats['unknown']/stats['total_papers']*100:.1f}%)")
+        total = max(stats['total_papers'], 1)
+        print(f"Papers with LaTeX: {stats['has_latex']:,} ({stats['has_latex']/total*100:.1f}%)")
+        print(f"Papers without LaTeX: {stats['no_latex']:,} ({stats['no_latex']/total*100:.1f}%)")
+        print(f"Unknown status: {stats['unknown']:,} ({stats['unknown']/total*100:.1f}%)")
tools/arxiv/utils/check_papers.py (2)

16-21: Fix sys.path root calculation (currently points one level above repo).

parents[3] from utils/ escapes the repo; imports may break.

- project_root = Path(__file__).parent.parent.parent.parent  # Goes up to HADES-Lab
- sys.path.insert(0, str(project_root))
+ project_root = Path(__file__).resolve().parents[2]  # HADES-Lab/
+ if str(project_root) not in sys.path:
+     sys.path.insert(0, str(project_root))

116-121: Guard percentage calculations against empty lists.

If the input file exists but is empty, division by zero occurs.

- print(f"  ✅ Already processed:     {already_processed:,} ({already_processed/total_papers*100:.1f}%)")
- print(f"  ❌ Failed previously:     {already_failed:,} ({already_failed/total_papers*100:.1f}%)")
- print(f"  ⏳ Not yet processed:     {not_processed:,} ({not_processed/total_papers*100:.1f}%)")
+ if total_papers > 0:
+     print(f"  ✅ Already processed:     {already_processed:,} ({already_processed/total_papers*100:.1f}%)")
+     print(f"  ❌ Failed previously:     {already_failed:,} ({already_failed/total_papers*100:.1f}%)")
+     print(f"  ⏳ Not yet processed:     {not_processed:,} ({not_processed/total_papers*100:.1f}%)")
+ else:
+     print(f"  ✅ Already processed:     {already_processed:,} (N/A)")
+     print(f"  ❌ Failed previously:     {already_failed:,} (N/A)")
+     print(f"  ⏳ Not yet processed:     {not_processed:,} (N/A)")
tools/arxiv/tests/run_large_scale_test.sh (2)

101-117: set -e prevents custom failure message for small batch; capture exit code explicitly

With set -e, the script exits immediately on a non-zero Python exit, skipping your friendly error message. Temporarily disable -e, capture the exit code, then re-enable.

-# Run with limited papers first
-python test_large_scale_processing.py \
-    --config ../configs/large_scale_test.yaml \
-    --papers "$PAPER_LIST" \
-    --limit 100
-
-# Check if small batch succeeded
-if [ $? -ne 0 ]; then
+set +e
+python test_large_scale_processing.py \
+    --config ../configs/large_scale_test.yaml \
+    --papers "$PAPER_LIST" \
+    --limit 100
+small_exit=$?
+set -e
+
+# Check if small batch succeeded
+if [ $small_exit -ne 0 ]; then
     echo -e "${RED}Small batch test failed. Aborting full test.${NC}"
     exit 1
 fi

6-7: Propagate failures through pipelines (tee) with pipefail

Without pipefail, the full run’s exit code reflects tee, not the Python process. Enable pipefail near set -e.

-set -e  # Exit on error
+set -e  # Exit on error
+set -o pipefail  # Fail on pipeline errors (e.g., python | tee)
🧹 Nitpick comments (53)
tools/arxiv/db/export_ids.py (3)

263-267: Fix pre-2000 yymm year mapping in stats

2000 + int(yymm[:2]) mislabels 1990s (e.g., '9912' → 2099). Use a pivot to disambiguate.

Apply:

-                    y = 2000 + int(str(ym)[:2])
-                    m = int(str(ym)[2:])
+                    yy = int(str(ym)[:2])
+                    m = int(str(ym)[2:])
+                    # Treat 91–99 as 1991–1999; 00–30 as 2000–2030 (adjust pivot as needed)
+                    y = (1900 + yy) if yy >= 91 else (2000 + yy)

271-279: Avoid fragile string replacements when aliasing WHERE

Rebuild the WHERE using the existing helper with table_alias='p' instead of .replace(...).

Example:

-            where_p_sql = where_sql.replace("year", "p.year").replace("month", "p.month").replace(
-                "yymm", "p.yymm"
-            ).replace("has_pdf", "p.has_pdf").replace("primary_category", "p.primary_category").replace(
-                "title", "p.title"
-            ).replace("abstract", "p.abstract")
+            where_p_sql, _ = build_where(
+                start_year, end_year, months, yymm_range, categories, keywords, with_pdf, missing_pdf, table_alias="p"
+            )

1-2: Add theory-anchored module docstring

Guideline: include docstrings linking implementation to the theoretical framework.

Example (add at top of file):

"""
ArXiv ID export utilities.

Maps to Information Reconstructionism:
- WHERE: SQL filtering (years/months/categories) selects topological subsets.
- WHAT: optional keyword search (tsvector).
- CONVEYANCE/TIME: batching, caps, and symlinked outputs for fast downstream use.
"""
CLAUDE.md (1)

197-199: Prefer explicit paths in docs

cd ../utils/ can be context-dependent. Suggest cd tools/arxiv/utils/ for clarity.

-cd ../utils/
+cd tools/arxiv/utils/
tools/arxiv/utils/detect_latex.py (3)

268-270: Create parent dir before writing results

Prevents failures when logs directory doesn’t exist.

-        with open(output_file, 'w') as f:
+        Path(output_file).parent.mkdir(parents=True, exist_ok=True)
+        with open(output_file, 'w') as f:
             json.dump(output_data, f, indent=2)

120-121: Use HTTPS and a descriptive User-Agent for arXiv API

Improves security and aligns with arXiv API etiquette.

-            url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
-            response = requests.get(url, timeout=30)
+            url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
+            headers = {"User-Agent": "HADES-Lab LaTeXDetector (contact: your-email@example.com)"}
+            response = requests.get(url, headers=headers, timeout=30)

305-327: Avoid hard-coded paths; accept CLI args

Parameterize sample and output paths with argparse for portability.

If helpful, I can provide a patch to add --sample-file and --out-file flags.
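
A rough argparse sketch of those flags, purely illustrative (the flag names follow the suggestion above; the defaults and help text are assumptions):

import argparse
from pathlib import Path

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Detect LaTeX source availability for arXiv papers")
    parser.add_argument("--sample-file", type=Path, required=True,
                        help="JSON file listing the sampled paper IDs to check")
    parser.add_argument("--out-file", type=Path, default=Path("latex_detection_results.json"),
                        help="Where to write the detection results")
    return parser.parse_args()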

tools/arxiv/utils/__init__.py (1)

1-13: Add trailing newline and tie docstring to the stated theoretical framework.

Comply with Ruff W292 and the guideline to connect to Information Reconstructionism/Conveyance.

 """
 ArXiv utilities package.

 This package contains utility scripts for ArXiv paper processing, database operations,
 and paper lifecycle management. All tools have been consolidated here for better
 discoverability.

 Main utilities:
 - lifecycle.py: Primary interface for ArXiv paper processing
 - rebuild_database.py: Database maintenance and reconstruction
 - check_db_status.py: Database status verification
 - check_papers.py: Paper validation in ArangoDB
+
+This package structures utilities to preserve document context and provenance in line
+with the Information Reconstructionism/Conveyance framework used across the pipeline.
 """
+
README.md (3)

204-206: Clarify working directory for the LaTeX extraction command.

Readers may run this from the repo root. Consider showing an absolute repo-relative path to avoid cwd confusion.

Suggested doc tweak:

- cd tools/arxiv/utils/
- python extract_latex_archives.py  # Script processes .tar files
+ # from repo root:
+ python tools/arxiv/utils/extract_latex_archives.py  # processes .tar files

212-215: Prefer repo-relative invocation to avoid cwd dependency.

Same guidance as above; also mention required env var explicitly before running.

- cd tools/arxiv/utils/
- export PGPASSWORD="your-postgres-password"
- python rebuild_database.py
+ export PGPASSWORD="your-postgres-password"
+ python tools/arxiv/utils/rebuild_database.py

340-344: Make lifecycle invocations cwd-agnostic.

Use repo-relative paths or add a preceding cd to tools/arxiv/utils for consistency with CLAUDE.md.

- python lifecycle.py process 2503.10150
+ python tools/arxiv/utils/lifecycle.py process 2503.10150
@@
- python lifecycle.py batch paper_list.txt --hirag-extraction
+ python tools/arxiv/utils/lifecycle.py batch paper_list.txt --hirag-extraction
tools/arxiv/utils/check_papers.py (4)

43-47: Deduplicate root resolution and use the already computed project_root.

Avoid recomputing with a different depth; it risks drift.

- script_dir = Path(__file__).parent.resolve()
- project_root = script_dir.parents[2]  # Go up from utils -> arxiv -> tools -> HADES-Lab
- data_dir = project_root / "data" / "arxiv_collections"
+ data_dir = project_root / "data" / "arxiv_collections"

71-77: Normalize ARANGO_HOST with/without scheme to prevent mismatches.

Docs export ARANGO_HOST as a bare host; here default includes http:// and port. Normalize to a full URL consistently.

- 'host': os.getenv('ARANGO_HOST', 'http://192.168.1.69:8529'),
+ # Accept "host[:port]" or "http[s]://host:port"; compute this before building the config dict
+ host_env = os.getenv('ARANGO_HOST', '192.168.1.69')
+ 'host': host_env if host_env.startswith('http') else f"http://{host_env}:8529",

144-161: Make the follow-up command repo-relative to avoid cwd surprises.

Current message assumes you’re in utils/.

- print(f"   python lifecycle.py batch {unprocessed_file} --count 100")
+ print(f"   python tools/arxiv/utils/lifecycle.py batch {unprocessed_file} --count 100")

1-7: Optional: tie docstring to the Conveyance framework per repo guidelines.

Add one line linking this utility’s purpose (status introspection) to Information Reconstructionism/Conveyance.
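
One possible wording, shown only as an illustration (the exact text is the author's call):

"""Check which papers from a list are already present in ArangoDB.

Conveyance framing: this status probe measures how much of a collection has already
been reconstructed into the graph, so batch runs can skip papers whose information
has already been conveyed downstream.
"""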

tools/arxiv/utils/rebuild_database.py (3)

120-124: Use a precise exception in date parsing.

Bare except: risks swallowing unrelated errors; narrow to ValueError.

-                except:
+                except ValueError:
                     continue

659-665: Avoid hardcoded, machine-specific log paths in CLI output.

Use the computed log_path so users don’t copy a wrong path.

- print("   Monitor: tail -f /home/todd/olympus/HADES-Lab/tools/arxiv/logs/postgresql_rebuild_complete.log")
+ print(f"   Monitor: tail -f {log_path}")

81-89: Add a connect timeout for DB resilience.

Prevents indefinite hangs if Postgres is unreachable.

-        return psycopg2.connect(**self.pg_config)
+        return psycopg2.connect(connect_timeout=10, **self.pg_config)
tools/arxiv/CLAUDE.md (1)

10-19: Docs read well; one minor improvement for cwd context.

Prepend a short note "from tools/arxiv/" before cd utils/ to make the relative ../pipelines/ step unambiguous.

tools/arxiv/utils/merge_lists.py (5)

23-37: Make output_dir robust (default currently depends on caller’s cwd).

Defaulting to "../../../..." is fragile. Derive the default from __file__ or accept None and compute it.

-def merge_id_files(*id_files, output_dir: str = "../../../data/arxiv_collections") -> Path:
+def merge_id_files(*id_files, output_dir: str | None = None) -> Path:
@@
-    output_dir = Path(output_dir)
+    if output_dir is None:
+        output_dir = Path(__file__).resolve().parents[2] / "data" / "arxiv_collections"
+    output_dir = Path(output_dir)

55-70: Ruff cleanups: remove redundant mode and use OSError.

Minor polish per UP015 and UP024.

-        try:
-            logger.info(f"Loading IDs from {id_path}")
-            with open(id_path, 'r', encoding='utf-8') as f:
+        try:
+            logger.info(f"Loading IDs from {id_path}")
+            with open(id_path, encoding='utf-8') as f:
@@
-        except IOError as e:
+        except OSError as e:
             logger.error(f"Error reading {id_path}: {e}")
             continue

90-100: Apply the same robust defaulting for JSON merges.

Mirror the id-file behavior.

-def merge_json_collections(*json_files, output_dir: str = "../../../data/arxiv_collections") -> Path:
+def merge_json_collections(*json_files, output_dir: str | None = None) -> Path:
@@
-    output_dir = Path(output_dir)
+    if output_dir is None:
+        output_dir = Path(__file__).resolve().parents[2] / "data" / "arxiv_collections"
+    output_dir = Path(output_dir)

111-114: Specify encoding when reading JSON collections.

Avoid locale-dependent decoding issues.

-            with open(json_path, 'r') as f:
+            with open(json_path, encoding='utf-8') as f:
                 data = json.load(f)

192-194: Reflect new default behavior in CLI help.

If adopting the file-relative default, update the help text accordingly.

-    parser.add_argument('--output-dir', default='../../../data/arxiv_collections', 
-                       help='Output directory')
+    parser.add_argument('--output-dir',
+                        help='Output directory (default: repo_root/data/arxiv_collections)')
tools/rag_utils/academic_citation_toolkit.md (1)

5-5: Use consistent “arXiv” capitalization throughout the doc.

Multiple instances use “ArXiv”; prefer “arXiv” (brand style).

Also applies to: 66-69, 86-89, 143-151, 262-264

tools/rag_utils/academic_citation_toolkit.py (7)

257-266: Also set UTF-8 when writing citations.

-            with open(f"{self.output_dir}/citations.json", 'w') as f:
+            with open(f"{self.output_dir}/citations.json", 'w', encoding='utf-8') as f:
                 json.dump(data, f, indent=2, ensure_ascii=False)

576-576: Clean up lint issues (unused var, f-strings, EOF newline).

-    arango_password = os.getenv('ARANGO_PASSWORD')
+    os.getenv('ARANGO_PASSWORD')  # Ensures provider raises if missing

-            if storage.store_bibliography_entries(entries):
-                print(f"  💾 Stored bibliography entries")
+            if storage.store_bibliography_entries(entries):
+                print("  💾 Stored bibliography entries")

-            print(f"  ❌ No bibliography found")
+            print("  ❌ No bibliography found")

Ensure file ends with a trailing newline.

Also applies to: 614-614, 617-617, 619-620


141-146: Minor: mode='r' is default; can be dropped for style.

-            with open(file_path, 'r', encoding='utf-8') as f:
+            with open(file_path, encoding='utf-8') as f:
                 return f.read()

287-304: Precompile hot regexes to reduce repeated compile overhead.

Define patterns at module scope and reuse them inside methods. I can provide a patch if desired.

Also applies to: 306-315, 317-326
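
As an illustration of module-scope precompilation (the pattern strings below are placeholders, not the toolkit's actual regexes):

import re

# Compiled once at import time and reused across calls
REFERENCES_HEADER_RE = re.compile(r"^\s*(references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE)
NUMBERED_ENTRY_RE = re.compile(r"^\s*\[(\d+)\]\s+", re.MULTILINE)

def find_reference_section(text: str) -> str | None:
    match = REFERENCES_HEADER_RE.search(text)
    return text[match.end():] if match else None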


91-99: Consider preserving paragraph boundaries when joining chunks.

Using '\n\n' instead of a space keeps structure for downstream parsing.

-        return ' '.join(chunks) if chunks else None
+        return '\n\n'.join(chunks) if chunks else None

Also applies to: 151-159


544-552: In-text citation extraction is unimplemented.

Add a simple numeric [n] extractor as a first pass, then map to entries by entry_number.

I can add a minimal implementation with tests on request.
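
A minimal sketch of that first pass, assuming numeric bracket citations that are later joined to bibliography entries by entry_number (the dict keys are illustrative, not the exact InTextCitation schema):

import re

BRACKET_CITATION_RE = re.compile(r"\[(\d+)\]")

def extract_numeric_citations(paper_id: str, text: str) -> list[dict]:
    """Collect [n] markers plus a little surrounding context for later resolution."""
    citations = []
    for match in BRACKET_CITATION_RE.finditer(text):
        citations.append({
            "source_paper_id": paper_id,
            "entry_number": int(match.group(1)),  # join key against BibliographyEntry.entry_number
            "context": text[max(0, match.start() - 80):match.end() + 80],
        })
    return citations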


33-36: Modernize typing: prefer built-in generics.

Keeps Ruff happy (UP035) and aligns with 3.10+.

-from typing import List, Dict, Optional, Tuple, Union
+from typing import Optional
+# Use built-in generics: list, dict, tuple instead of typing.List/Dict/Tuple
tools/arxiv/tests/validate_pipeline.py (2)

11-11: Remove unused import.

-import subprocess

13-13: Optional: switch to built-in generics and drop typing where possible.

-from typing import Tuple, List
+from typing import Tuple, List  # or use built-ins in annotations: tuple[bool, list[str]]

And update annotations to tuple[bool, list[str]] when you touch this file next.

Also applies to: 22-25, 68-83, 174-176
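
For reference, the modernized annotations would look roughly like this (function and variable names are illustrative):

def validate_paper_list(paths: list[str]) -> tuple[bool, list[str]]:
    errors: list[str] = []
    # ... collect validation errors here ...
    return (len(errors) == 0, errors)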

tools/arxiv/utils/lifecycle.py (1)

147-164: Map PROCESSING status in emoji and description dictionaries

If PaperStatus.PROCESSING occurs, the CLI shows “⚪ Unknown status.” Add explicit mapping.

     status_emoji = {
         PaperStatus.ERROR: "❌",
         PaperStatus.NOT_FOUND: "❓",
         PaperStatus.METADATA_ONLY: "📋",
         PaperStatus.DOWNLOADED: "📥",
+        PaperStatus.PROCESSING: "⏳",
         PaperStatus.PROCESSED: "⚙️",
         PaperStatus.HIRAG_INTEGRATED: "🎯"
     }
@@
     status_descriptions = {
         PaperStatus.NOT_FOUND: "Paper not found in system",
         PaperStatus.METADATA_ONLY: "Metadata available, files not downloaded",
         PaperStatus.DOWNLOADED: "Files downloaded, not processed",
+        PaperStatus.PROCESSING: "Processing in progress",
         PaperStatus.PROCESSED: "Fully processed through ACID pipeline",
         PaperStatus.HIRAG_INTEGRATED: "Integrated into HiRAG system",
         PaperStatus.ERROR: "Error occurred during processing"
     }
tools/arxiv/tests/run_large_scale_test.sh (1)

46-49: Adjust Step 1 messaging to reflect discovery of prebuilt lists, not collection

Current text says “Collecting… from ArXiv API” but the code only discovers existing lists. Tweak wording to avoid confusion.

-echo -e "\n${GREEN}Step 1: Collecting papers from ArXiv API${NC}"
-echo "This will search for papers on AI, RAG, LLMs, and Actor Network Theory"
+echo -e "\n${GREEN}Step 1: Discovering existing paper lists${NC}"
+echo "Looking for prebuilt arxiv_ids_*.txt lists (AI, RAG, LLMs, ANT)"
tools/rag_utils/README.md (1)

1-252: Polish wording and examples; add run-as-module note

Minor grammar/list formatting nits flagged, and examples would benefit from a note that examples should be executed as modules (python -m …) due to package-relative imports.

  • Normalize “arXiv” casing and bullet spacing.
  • Add: “Run examples as modules, e.g., python -m tools.rag_utils.examples.arxiv_example”.
  • Consider running markdownlint and LanguageTool on this file to batch-fix micro issues.
tools/rag_utils/examples/arxiv_example.py (6)

32-36: Parameterize ArangoDB host via env var (default localhost)

Avoid hardcoding a private IP; improves portability.

-client = ArangoClient(hosts='http://192.168.1.69:8529')
+arango_host = os.getenv('ARANGO_HOST', 'http://localhost:8529')
+client = ArangoClient(hosts=arango_host)

52-57: Remove unused loop variable per Ruff B007

Title isn’t used in the loop body.

-for paper_id, title in core_papers.items():
+for paper_id in core_papers:

89-90: Drop extraneous f-string

-            print(f"  ❌ No bibliography entries found")
+            print("  ❌ No bibliography entries found")

123-127: Drop extraneous f-string

-            print(f"   Collection: bibliography_entries")
+            print("   Collection: bibliography_entries")

14-16: Add helpful import guard when run as a script (mirror filesystem_example)

Running this file directly will fail due to package-relative imports. Add the same try/except guidance used in filesystem_example.

# Replace the simple import with:
try:
    from ..academic_citation_toolkit import create_arxiv_citation_toolkit
except ImportError as e:
    if __name__ == "__main__" and (__package__ is None or __package__ == ""):
        raise SystemExit(
            "Run as a module:\n  python -m tools.rag_utils.examples.arxiv_example"
        ) from e
    raise

1-8: Optional: tie docstring to Information Reconstructionism/Conveyance

A one-liner noting how citation extraction supports information conveyance across networks would align with repo guidelines.

tools/rag_utils/examples/filesystem_example.py (2)

134-136: Unnecessary mode argument in open()

Reading is default; remove 'r'.

-            with open(f"{output_dir}/bibliography.json", 'r') as f:
+            with open(f"{output_dir}/bibliography.json") as f:

137-167: Drop extraneous f-strings where no interpolation occurs

Cleans up Ruff F541 warnings.

-            print(f"  📊 Storage summary:")
+            print("  📊 Storage summary:")
@@
-                print(f"     Sample stored entry:")
+                print("     Sample stored entry:")
@@
-        print(f"  ❌ No bibliography entries found")
+        print("  ❌ No bibliography entries found")
@@
-    print(f"\n📂 Output files created:")
+    print("\n📂 Output files created:")
tools/rag_utils/__init__.py (2)

1-11: Add brief theoretical-framework note to the package docstring.

Per guidelines, connect implementation to Information Reconstructionism/Conveyance.

Apply:

@@
 """
 RAG Utils - Universal Academic Tools
 ====================================
 
 Source-agnostic utilities for building Retrieval-Augmented Generation (RAG)
 systems from academic corpora. These tools work with any academic paper source:
 ArXiv, SSRN, PubMed, Harvard Law Library, or any other collection.
 
+Theoretical note (Information Reconstructionism/Conveyance):
+these utilities reconstruct citation/bibliography structures from raw texts
+and convey them as structured knowledge into downstream RAG pipelines.
+
 Key Modules:
 - academic_citation_toolkit: Universal citation and bibliography extraction
 """

64-64: Add trailing newline (Ruff W292).

-]
+]
+
tools/rag_utils/examples/custom_provider_example.py (6)

11-15: Modernize type hints and drop unused import.

Use built-in generics (list[str]) and remove unused sys import. Also satisfies Ruff UP035.

-import sys
-import json
+import json
 import sqlite3
-from typing import List, Optional
+from typing import Optional
@@
-    def get_document_chunks(self, document_id: str) -> List[str]:
+    def get_document_chunks(self, document_id: str) -> list[str]:
@@
-    def store_bibliography_entries(self, entries: List[BibliographyEntry]) -> bool:
+    def store_bibliography_entries(self, entries: list[BibliographyEntry]) -> bool:
@@
-    def store_citations(self, citations: List[InTextCitation]) -> bool:
+    def store_citations(self, citations: list[InTextCitation]) -> bool:
@@
-    def get_document_chunks(self, document_id: str) -> List[str]:
+    def get_document_chunks(self, document_id: str) -> list[str]:

Also applies to: 63-71, 140-146, 173-179, 331-337


357-361: Remove f-strings without placeholders (Ruff F541).

-    print(f"   DocumentProvider: MockAPIDocumentProvider")
+    print("   DocumentProvider: MockAPIDocumentProvider")
@@
-    print(f"   Extractor: UniversalBibliographyExtractor")
+    print("   Extractor: UniversalBibliographyExtractor")
@@
-            print(f"   ❌ No bibliography entries found")
+            print("   ❌ No bibliography entries found")
@@
-            print(f"📊 Database Statistics:")
+            print("📊 Database Statistics:")
@@
-            print(f"   Confidence distribution:")
+            print("   Confidence distribution:")
@@
-    print(f"\n📂 Files Created:")
+    print("\n📂 Files Created:")
@@
-    print(f"   Tables: bibliography_entries, in_text_citations")
+    print("   Tables: bibliography_entries, in_text_citations")

Also applies to: 390-397, 401-416, 430-433


44-61: Harden Web API fetch: raise for HTTP errors, normalize return to Optional[str].

Improves robustness and keeps return typing consistent.

     def get_document_text(self, document_id: str) -> Optional[str]:
         """Fetch full document text from web API."""
         try:
             import requests
-            
-            url = f"{self.api_base_url}/documents/{document_id}/fulltext"
-            response = requests.get(url, headers=self.headers, timeout=30)
-            
-            if response.status_code == 200:
-                data = response.json()
-                return data.get('full_text', data.get('content', ''))
-            else:
-                print(f"API Error {response.status_code} for document {document_id}")
-                return None
-                
-        except Exception as e:
-            print(f"Error fetching document {document_id}: {e}")
+            url = f"{self.api_base_url}/documents/{document_id}/fulltext"
+            response = requests.get(url, headers=self.headers, timeout=30)
+            response.raise_for_status()
+            data = response.json()
+            text = data.get('full_text') or data.get('content')
+            return text or None
+        except requests.RequestException as e:
+            print(f"HTTP error for document {document_id}: {e}")
+            return None
+        except Exception as e:
+            print(f"Error fetching/decoding document {document_id}: {e}")
             return None

150-161: Prefer ON CONFLICT DO UPDATE over OR REPLACE to preserve row identity/timestamps.

Avoids deleting/reinserting rows (which resets created_at and autoincrement ids).

-                INSERT OR REPLACE INTO bibliography_entries 
+                INSERT INTO bibliography_entries 
                 (source_paper_id, entry_number, raw_text, title, authors, venue, 
                  year, arxiv_id, doi, pmid, ssrn_id, url, confidence)
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                ON CONFLICT(source_paper_id, entry_number) DO UPDATE SET
+                  raw_text=excluded.raw_text,
+                  title=excluded.title,
+                  authors=excluded.authors,
+                  venue=excluded.venue,
+                  year=excluded.year,
+                  arxiv_id=excluded.arxiv_id,
+                  doi=excluded.doi,
+                  pmid=excluded.pmid,
+                  ssrn_id=excluded.ssrn_id,
+                  url=excluded.url,
+                  confidence=excluded.confidence

3-9: Add brief theoretical-framework context to module docstring.

Tie the example to Information Reconstructionism/Conveyance as required.

 Demonstrates creating custom DocumentProvider and CitationStorage implementations
 for the Academic Citation Toolkit. Shows how to extend the toolkit for any
 academic corpus or storage system.
+
+Conceptual note (Information Reconstructionism/Conveyance):
+this example reconstructs citation structures from raw text and conveys them
+into a structured store to support downstream RAG workflows.

440-440: Add trailing newline (Ruff W292).

 if __name__ == "__main__":
     main()
+
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0ed34d7 and e84c2ab.

📒 Files selected for processing (36)
  • .gitignore (1 hunks)
  • CLAUDE.md (2 hunks)
  • README.md (3 hunks)
  • tools/arxiv/CLAUDE.md (3 hunks)
  • tools/arxiv/configs/arxiv_search.yaml (1 hunks)
  • tools/arxiv/configs/arxiv_search_minimal.yaml (1 hunks)
  • tools/arxiv/configs/arxiv_search_nokeywords.yaml (1 hunks)
  • tools/arxiv/configs/arxiv_search_practical.yaml (1 hunks)
  • tools/arxiv/database/__init__.py (0 hunks)
  • tools/arxiv/db/export_ids.py (1 hunks)
  • tools/arxiv/scripts/collect_ai_papers.py (0 hunks)
  • tools/arxiv/scripts/collect_ai_papers_extended.py (0 hunks)
  • tools/arxiv/scripts/pdf_scanner.py (0 hunks)
  • tools/arxiv/scripts/pdf_scanner_fixed.py (0 hunks)
  • tools/arxiv/scripts/rebuild_postgresql.py (0 hunks)
  • tools/arxiv/scripts/rebuild_postgresql_fixed.py (0 hunks)
  • tools/arxiv/scripts/run_embedding_phase_only.py (0 hunks)
  • tools/arxiv/scripts/run_pipeline_from_list.py (0 hunks)
  • tools/arxiv/scripts/run_test_pipeline.py (0 hunks)
  • tools/arxiv/scripts/run_weekend_test.sh (0 hunks)
  • tools/arxiv/tests/run_large_scale_test.sh (1 hunks)
  • tools/arxiv/tests/validate_pipeline.py (1 hunks)
  • tools/arxiv/utils/__init__.py (1 hunks)
  • tools/arxiv/utils/check_papers.py (4 hunks)
  • tools/arxiv/utils/detect_latex.py (1 hunks)
  • tools/arxiv/utils/lifecycle.py (2 hunks)
  • tools/arxiv/utils/merge_lists.py (4 hunks)
  • tools/arxiv/utils/rebuild_database.py (4 hunks)
  • tools/arxiv/utils/run_acid_pipeline.sh (2 hunks)
  • tools/rag_utils/README.md (1 hunks)
  • tools/rag_utils/__init__.py (1 hunks)
  • tools/rag_utils/academic_citation_toolkit.md (1 hunks)
  • tools/rag_utils/academic_citation_toolkit.py (1 hunks)
  • tools/rag_utils/examples/arxiv_example.py (1 hunks)
  • tools/rag_utils/examples/custom_provider_example.py (1 hunks)
  • tools/rag_utils/examples/filesystem_example.py (1 hunks)
💤 Files with no reviewable changes (11)
  • tools/arxiv/scripts/run_embedding_phase_only.py
  • tools/arxiv/database/__init__.py
  • tools/arxiv/scripts/run_weekend_test.sh
  • tools/arxiv/scripts/run_test_pipeline.py
  • tools/arxiv/scripts/pdf_scanner.py
  • tools/arxiv/scripts/rebuild_postgresql_fixed.py
  • tools/arxiv/scripts/run_pipeline_from_list.py
  • tools/arxiv/scripts/collect_ai_papers_extended.py
  • tools/arxiv/scripts/pdf_scanner_fixed.py
  • tools/arxiv/scripts/collect_ai_papers.py
  • tools/arxiv/scripts/rebuild_postgresql.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Include docstrings in code that connect implementation to the theoretical framework (Information Reconstructionism/Conveyance)
Within the same module, use relative imports (e.g., from .utils import helper_function)

Files:

  • tools/arxiv/utils/__init__.py
  • tools/arxiv/utils/detect_latex.py
  • tools/rag_utils/examples/arxiv_example.py
  • tools/arxiv/db/export_ids.py
  • tools/rag_utils/__init__.py
  • tools/arxiv/utils/lifecycle.py
  • tools/rag_utils/examples/custom_provider_example.py
  • tools/arxiv/tests/validate_pipeline.py
  • tools/arxiv/utils/check_papers.py
  • tools/arxiv/utils/merge_lists.py
  • tools/arxiv/utils/rebuild_database.py
  • tools/rag_utils/examples/filesystem_example.py
  • tools/rag_utils/academic_citation_toolkit.py
tools/arxiv/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tools/arxiv/**/*.py: Format code with Black for ArXiv tooling
Run Ruff lint checks on ArXiv tooling
Late chunking: process full documents before chunking to preserve context
Ensure database operations are atomic (success or rollback)
Maintain phase separation: complete extraction before embedding
Process files directly from the filesystem without database queries where specified
Implement error recovery with support for resuming from checkpoints
Preserve document structure and context throughout the processing pipeline

Files:

  • tools/arxiv/utils/__init__.py
  • tools/arxiv/utils/detect_latex.py
  • tools/arxiv/db/export_ids.py
  • tools/arxiv/utils/lifecycle.py
  • tools/arxiv/tests/validate_pipeline.py
  • tools/arxiv/utils/check_papers.py
  • tools/arxiv/utils/merge_lists.py
  • tools/arxiv/utils/rebuild_database.py
{tools,experiments}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Import from core framework when in tools/ or experiments/ (e.g., from core.framework.embedders import JinaV4Embedder)

Files:

  • tools/arxiv/utils/__init__.py
  • tools/arxiv/utils/detect_latex.py
  • tools/rag_utils/examples/arxiv_example.py
  • tools/arxiv/db/export_ids.py
  • tools/rag_utils/__init__.py
  • tools/arxiv/utils/lifecycle.py
  • tools/rag_utils/examples/custom_provider_example.py
  • tools/arxiv/tests/validate_pipeline.py
  • tools/arxiv/utils/check_papers.py
  • tools/arxiv/utils/merge_lists.py
  • tools/arxiv/utils/rebuild_database.py
  • tools/rag_utils/examples/filesystem_example.py
  • tools/rag_utils/academic_citation_toolkit.py
{experiments,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

When in experiments/ or other tools/, import modules from tools/ as needed (e.g., from tools.arxiv.pipelines.arxiv_pipeline import AcidPipeline)

Files:

  • tools/arxiv/utils/__init__.py
  • tools/arxiv/utils/detect_latex.py
  • tools/rag_utils/examples/arxiv_example.py
  • tools/arxiv/db/export_ids.py
  • tools/rag_utils/__init__.py
  • tools/arxiv/utils/lifecycle.py
  • tools/rag_utils/examples/custom_provider_example.py
  • tools/arxiv/tests/validate_pipeline.py
  • tools/arxiv/utils/check_papers.py
  • tools/arxiv/utils/merge_lists.py
  • tools/arxiv/utils/rebuild_database.py
  • tools/rag_utils/examples/filesystem_example.py
  • tools/rag_utils/academic_citation_toolkit.py
{core,tools}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Infrastructure (core/, tools/) must not depend on experiments/

Files:

  • tools/arxiv/utils/__init__.py
  • tools/arxiv/utils/detect_latex.py
  • tools/rag_utils/examples/arxiv_example.py
  • tools/arxiv/db/export_ids.py
  • tools/rag_utils/__init__.py
  • tools/arxiv/utils/lifecycle.py
  • tools/rag_utils/examples/custom_provider_example.py
  • tools/arxiv/tests/validate_pipeline.py
  • tools/arxiv/utils/check_papers.py
  • tools/arxiv/utils/merge_lists.py
  • tools/arxiv/utils/rebuild_database.py
  • tools/rag_utils/examples/filesystem_example.py
  • tools/rag_utils/academic_citation_toolkit.py
tools/arxiv/configs/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Place ArXiv processing configuration files under tools/arxiv/configs/

Files:

  • tools/arxiv/configs/arxiv_search_practical.yaml
  • tools/arxiv/configs/arxiv_search_nokeywords.yaml
  • tools/arxiv/configs/arxiv_search_minimal.yaml
  • tools/arxiv/configs/arxiv_search.yaml
🧠 Learnings (15)
📓 Common learnings
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/configs/**/*.yaml : Place ArXiv processing configuration files under tools/arxiv/configs/
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/**/*.py : Run Ruff lint checks on ArXiv tooling
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/{scripts/lifecycle_cli.py,pipelines/arxiv_pipeline.py} : Use local file storage paths: /bulk-store/arxiv-data/pdf/YYMM for PDFs and latex/YYMM for LaTeX sources
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Implement ArXiv Lifecycle Manager CLI with subcommands process, status, batch, metadata, executing the unified workflow (PostgreSQL check, download missing content, sync PostgreSQL/ArangoDB, run ACID, generate Jina v4 embeddings, integrate HiRAG)

Applied to files:

  • tools/arxiv/utils/__init__.py
  • tools/arxiv/utils/lifecycle.py
  • README.md
  • CLAUDE.md
  • tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:38:36.857Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/configs/**/*.yaml : Place ArXiv processing configuration files under tools/arxiv/configs/

Applied to files:

  • tools/arxiv/configs/arxiv_search_practical.yaml
  • tools/arxiv/configs/arxiv_search_nokeywords.yaml
  • tools/arxiv/configs/arxiv_search_minimal.yaml
  • .gitignore
  • tools/arxiv/configs/arxiv_search.yaml
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/{scripts/lifecycle_cli.py,pipelines/arxiv_pipeline.py} : Use local file storage paths: /bulk-store/arxiv-data/pdf/YYMM for PDFs and latex/YYMM for LaTeX sources

Applied to files:

  • tools/arxiv/configs/arxiv_search_minimal.yaml
  • tools/arxiv/utils/lifecycle.py
  • .gitignore
  • tools/arxiv/utils/check_papers.py
  • README.md
  • tools/arxiv/configs/arxiv_search.yaml
  • tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Lifecycle Manager should query PostgreSQL for metadata/files and perform idempotent operations with audit trail and error recovery

Applied to files:

  • tools/arxiv/utils/lifecycle.py
  • README.md
  • CLAUDE.md
  • tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Expose status levels NOT_FOUND, METADATA_ONLY, DOWNLOADED, PROCESSED, HIRAG_INTEGRATED from lifecycle_cli.py status command (including --json output)

Applied to files:

  • tools/arxiv/utils/lifecycle.py
  • tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/scripts/lifecycle_cli.py : Lifecycle Manager should perform PostgreSQL checks and downloads (not the pipeline) before invoking ACID processing

Applied to files:

  • README.md
  • CLAUDE.md
  • tools/arxiv/utils/run_acid_pipeline.sh
  • tools/arxiv/CLAUDE.md
📚 Learning: 2025-09-04T00:38:36.857Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/**/*.py : Process files directly from the filesystem without database queries where specified

Applied to files:

  • README.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : ACID pipeline must be phase-separated: complete Extraction before Embedding

Applied to files:

  • CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Perform ArangoDB operations atomically (all-or-nothing transactions)

Applied to files:

  • CLAUDE.md
  • tools/arxiv/utils/run_acid_pipeline.sh
📚 Learning: 2025-09-04T00:38:36.857Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-04T00:38:36.857Z
Learning: Applies to tools/arxiv/**/*.py : Ensure database operations are atomic (success or rollback)

Applied to files:

  • CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Implement checkpointing with an atomic checkpoint file named acid_phased_checkpoint.json to resume on failure

Applied to files:

  • CLAUDE.md
  • tools/arxiv/utils/run_acid_pipeline.sh
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Use /dev/shm/acid_staging as the inter-phase staging directory (RamFS) and clean GPU memory between phases

Applied to files:

  • CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/pipelines/arxiv_pipeline.py : Process PDFs directly from filesystem without database queries (no DB dependency in pipeline)

Applied to files:

  • CLAUDE.md
📚 Learning: 2025-09-04T00:39:22.896Z
Learnt from: CR
PR: r3d91ll/HADES-Lab#0
File: tools/arxiv/CLAUDE.md:0-0
Timestamp: 2025-09-04T00:39:22.896Z
Learning: Applies to tools/arxiv/{scripts/lifecycle_cli.py,pipelines/arxiv_pipeline.py} : Respect environment variables: ARANGO_PASSWORD, ARANGO_HOST, USE_GPU, CUDA_VISIBLE_DEVICES

Applied to files:

  • tools/arxiv/utils/run_acid_pipeline.sh
🧬 Code graph analysis (7)
tools/rag_utils/examples/arxiv_example.py (1)
tools/rag_utils/academic_citation_toolkit.py (6)
  • create_arxiv_citation_toolkit (555-560)
  • main (570-617)
  • extract_paper_bibliography (512-531)
  • store_bibliography_entries (173-175)
  • store_bibliography_entries (193-230)
  • store_bibliography_entries (244-255)
tools/rag_utils/__init__.py (1)
tools/rag_utils/academic_citation_toolkit.py (12)
  • BibliographyEntry (41-60)
  • InTextCitation (63-76)
  • DocumentProvider (78-98)
  • CitationStorage (160-180)
  • ArangoDocumentProvider (100-131)
  • FileSystemDocumentProvider (133-158)
  • ArangoCitationStorage (182-235)
  • JSONCitationStorage (237-266)
  • UniversalBibliographyExtractor (268-531)
  • UniversalCitationExtractor (533-552)
  • create_arxiv_citation_toolkit (555-560)
  • create_filesystem_citation_toolkit (562-567)
tools/arxiv/utils/lifecycle.py (2)
tools/arxiv/utils/arxiv_lifecycle_manager.py (3)
  • ArXivLifecycleManager (80-570)
  • PaperStatus (44-52)
  • LifecycleResult (56-77)
tools/arxiv/utils/arxiv_api_client.py (1)
  • ArXivAPIClient (74-462)
tools/rag_utils/examples/custom_provider_example.py (1)
tools/rag_utils/academic_citation_toolkit.py (18)
  • DocumentProvider (78-98)
  • CitationStorage (160-180)
  • UniversalBibliographyExtractor (268-531)
  • BibliographyEntry (41-60)
  • InTextCitation (63-76)
  • get_document_text (91-93)
  • get_document_text (113-116)
  • get_document_text (141-149)
  • get_document_chunks (96-98)
  • get_document_chunks (118-131)
  • get_document_chunks (151-158)
  • store_bibliography_entries (173-175)
  • store_bibliography_entries (193-230)
  • store_bibliography_entries (244-255)
  • store_citations (178-180)
  • store_citations (232-235)
  • store_citations (257-266)
  • extract_paper_bibliography (512-531)
tools/arxiv/utils/rebuild_database.py (1)
tools/arxiv/utils/lifecycle.py (1)
  • setup_logging (39-52)
tools/rag_utils/examples/filesystem_example.py (1)
tools/rag_utils/academic_citation_toolkit.py (6)
  • create_filesystem_citation_toolkit (562-567)
  • main (570-617)
  • extract_paper_bibliography (512-531)
  • store_bibliography_entries (173-175)
  • store_bibliography_entries (193-230)
  • store_bibliography_entries (244-255)
tools/rag_utils/academic_citation_toolkit.py (3)
tools/rag_utils/examples/custom_provider_example.py (7)
  • get_document_text (44-61)
  • get_document_text (327-329)
  • get_document_chunks (63-71)
  • get_document_chunks (331-336)
  • store_bibliography_entries (140-171)
  • store_citations (173-201)
  • main (338-437)
tools/rag_utils/examples/filesystem_example.py (1)
  • main (63-172)
tools/rag_utils/examples/arxiv_example.py (1)
  • main (17-136)
🪛 Ruff (0.12.2)
tools/arxiv/utils/__init__.py

13-13: No newline at end of file

Add trailing newline

(W292)

tools/rag_utils/examples/arxiv_example.py

52-52: Loop control variable title not used within loop body

(B007)


89-89: f-string without any placeholders

Remove extraneous f prefix

(F541)


125-125: f-string without any placeholders

Remove extraneous f prefix

(F541)


139-139: No newline at end of file

Add trailing newline

(W292)

tools/rag_utils/__init__.py

64-64: No newline at end of file

Add trailing newline

(W292)

tools/rag_utils/examples/custom_provider_example.py

15-15: typing.List is deprecated, use list instead

(UP035)


358-358: f-string without any placeholders

Remove extraneous f prefix

(F541)


360-360: f-string without any placeholders

Remove extraneous f prefix

(F541)


392-392: f-string without any placeholders

Remove extraneous f prefix

(F541)


404-404: f-string without any placeholders

Remove extraneous f prefix

(F541)


411-411: f-string without any placeholders

Remove extraneous f prefix

(F541)


430-430: f-string without any placeholders

Remove extraneous f prefix

(F541)


432-432: f-string without any placeholders

Remove extraneous f prefix

(F541)


440-440: No newline at end of file

Add trailing newline

(W292)

tools/arxiv/utils/merge_lists.py

61-61: Unnecessary mode argument

Remove mode argument

(UP015)


67-67: Replace aliased errors with OSError

Replace IOError with builtin OSError

(UP024)

tools/rag_utils/examples/filesystem_example.py

134-134: Unnecessary mode argument

Remove mode argument

(UP015)


137-137: f-string without any placeholders

Remove extraneous f prefix

(F541)


144-144: f-string without any placeholders

Remove extraneous f prefix

(F541)


153-153: f-string without any placeholders

Remove extraneous f prefix

(F541)


164-164: f-string without any placeholders

Remove extraneous f prefix

(F541)


175-175: No newline at end of file

Add trailing newline

(W292)

tools/rag_utils/academic_citation_toolkit.py

33-33: typing.List is deprecated, use list instead

(UP035)


33-33: typing.Dict is deprecated, use dict instead

(UP035)


33-33: typing.Tuple is deprecated, use tuple instead

(UP035)


145-145: Unnecessary mode argument

Remove mode argument

(UP015)


576-576: Local variable arango_password is assigned to but never used

Remove assignment to unused variable arango_password

(F841)


614-614: f-string without any placeholders

Remove extraneous f prefix

(F541)


617-617: f-string without any placeholders

Remove extraneous f prefix

(F541)


620-620: No newline at end of file

Add trailing newline

(W292)

🪛 LanguageTool
tools/rag_utils/README.md

[grammar] ~9-~9: There might be a mistake here.
Context: ...from: - Computer Science papers (ArXiv) - Economics papers (SSRN) - Medical papers...

(QB_NEW_EN)


[grammar] ~10-~10: There might be a mistake here.
Context: ...pers (ArXiv) - Economics papers (SSRN) - Medical papers (PubMed) - Legal papers (...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ... papers (SSRN) - Medical papers (PubMed) - Legal papers (Harvard Law Library) - Any...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...ed) - Legal papers (Harvard Law Library) - Any academic corpus ## Available Tools ...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...Tools ### 🕸️ Academic Citation Toolkit File: academic_citation_toolkit.py ...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ...d citations, author-year, hybrid formats - Pluggable architecture: Easy to extend...

(QB_NEW_EN)


[grammar] ~50-~50: There might be a mistake here.
Context: ...ks for: - ArXiv computer science papers - SSRN economics papers - PubMed medical...

(QB_NEW_EN)


[grammar] ~51-~51: There might be a mistake here.
Context: ...r science papers - SSRN economics papers - PubMed medical papers - Harvard Law Libr...

(QB_NEW_EN)


[grammar] ~52-~52: There might be a mistake here.
Context: ...onomics papers - PubMed medical papers - Harvard Law Library legal papers ### 2....

(QB_NEW_EN)


[grammar] ~59-~59: There might be a mistake here.
Context: ...*: ArangoDB, filesystem, APIs, databases - Storage Backend: ArangoDB, PostgreSQL,...

(QB_NEW_EN)


[grammar] ~60-~60: There might be a mistake here.
Context: ...ckend**: ArangoDB, PostgreSQL, JSON, CSV - Format Parser: Different citation form...

(QB_NEW_EN)


[grammar] ~67-~67: There might be a mistake here.
Context: ...liography sections** (formal references) - In-text citations (contextual pointers...

(QB_NEW_EN)


[grammar] ~68-~68: There might be a mistake here.
Context: ...n-text citations** (contextual pointers) - Citation networks (paper-to-paper rela...

(QB_NEW_EN)


[grammar] ~69-~69: There might be a mistake here.
Context: ...etworks** (paper-to-paper relationships) - Author networks (collaboration pattern...

(QB_NEW_EN)


[grammar] ~77-~77: There might be a mistake here.
Context: ...) - Geographic region (US vs EU vs Asia) - Time period (1990s vs 2020s) - Publicati...

(QB_NEW_EN)


[grammar] ~78-~78: There might be a mistake here.
Context: ...s Asia) - Time period (1990s vs 2020s) - Publication venue (journal vs conference...

(QB_NEW_EN)


[grammar] ~170-~170: There might be a mistake here.
Context: ...RMES for: - Citation network enrichment - Bibliography metadata extraction - Acade...

(QB_NEW_EN)


[grammar] ~171-~171: There might be a mistake here.
Context: ...hment - Bibliography metadata extraction - Academic relationship mapping ### HADES...

(QB_NEW_EN)


[grammar] ~178-~178: There might be a mistake here.
Context: ...nal analysis (WHERE × WHAT × CONVEYANCE) - Observer-dependent citation networks - C...

(QB_NEW_EN)


[grammar] ~179-~179: There might be a mistake here.
Context: ...) - Observer-dependent citation networks - Context amplification measurement ### H...

(QB_NEW_EN)


[grammar] ~186-~186: There might be a mistake here.
Context: ...terns: - Configuration-driven operation - Reusable across modules - Tool gifting b...

(QB_NEW_EN)


[grammar] ~187-~187: There might be a mistake here.
Context: ...iven operation - Reusable across modules - Tool gifting between modules ## Perform...

(QB_NEW_EN)


[grammar] ~194-~194: There might be a mistake here.
Context: ...tweight**: Processes papers individually - Streaming: No need to load entire corp...

(QB_NEW_EN)


[grammar] ~195-~195: There might be a mistake here.
Context: ... No need to load entire corpus in memory - Configurable: Adjustable chunk sizes a...

(QB_NEW_EN)


[grammar] ~200-~200: There might be a mistake here.
Context: ...phy extraction**: ~1-2 seconds per paper - Citation parsing: ~0.5-1 seconds per p...

(QB_NEW_EN)


[grammar] ~201-~201: There might be a mistake here.
Context: ...tion parsing**: ~0.5-1 seconds per paper - Network construction: Scales with corp...

(QB_NEW_EN)


[grammar] ~202-~202: There might be a mistake here.
Context: ... construction**: Scales with corpus size - Parallelizable: Easy to distribute acr...

(QB_NEW_EN)


[grammar] ~207-~207: There might be a mistake here.
Context: ...itations**: 90%+ for numbered references - Medium confidence for author-year: 70-...

(QB_NEW_EN)


[grammar] ~208-~208: There might be a mistake here.
Context: ...: 70-85% depending on format consistency - Robust error handling: Graceful degrad...

(QB_NEW_EN)


[grammar] ~215-~215: There might be a mistake here.
Context: ...xtractor**: Build collaboration networks - Topic Evolution Tracker: Track concept...

(QB_NEW_EN)


[grammar] ~216-~216: There might be a mistake here.
Context: ...r**: Track concept development over time - Cross-Corpus Linker: Connect papers ac...

(QB_NEW_EN)


[grammar] ~217-~217: There might be a mistake here.
Context: ... Connect papers across different sources - Citation Context Analyzer: Understand ...

(QB_NEW_EN)


[grammar] ~222-~222: There might be a mistake here.
Context: ...cholar API**: Academic graph integration - OpenCitations: Citation database integ...

(QB_NEW_EN)


[grammar] ~223-~223: There might be a mistake here.
Context: ...tations**: Citation database integration - Crossref API: DOI resolution and metad...

(QB_NEW_EN)


[grammar] ~224-~224: There might be a mistake here.
Context: ...ssref API**: DOI resolution and metadata - ORCID API: Author disambiguation ## C...

(QB_NEW_EN)

tools/rag_utils/academic_citation_toolkit.md

[grammar] ~66-~66: There might be a mistake here.
Context: ..." pass ``` Implementations: - ArangoDocumentProvider: For ArangoDB (our ArXiv setup) - `File...

(QB_NEW_EN)


[grammar] ~67-~67: There might be a mistake here.
Context: ...rovider: For ArangoDB (our ArXiv setup) - FileSystemDocumentProvider: For local files #### CitationStorage...

(QB_NEW_EN)


[grammar] ~86-~86: There might be a mistake here.
Context: ..." pass ``` Implementations: - ArangoCitationStorage: For ArangoDB storage - `JSONCitation...

(QB_NEW_EN)


[grammar] ~87-~87: There might be a mistake here.
Context: ...goCitationStorage: For ArangoDB storage - JSONCitationStorage`: For JSON file storage ### Main Proces...

(QB_NEW_EN)


[grammar] ~110-~110: There might be a mistake here.
Context: ... Bibliography Extraction Strategies: 1. Explicit Headers: Look for "References...

(QB_NEW_EN)


[grammar] ~116-~116: There might be a mistake here.
Context: ...disciplines Entry Parsing Features: - Identifier Extraction: ArXiv IDs, DOIs...

(QB_NEW_EN)


[grammar] ~117-~117: There might be a mistake here.
Context: ...*: ArXiv IDs, DOIs, PubMed IDs, SSRN IDs - Metadata Extraction: Authors, titles, ...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...action**: Authors, titles, venues, years - Confidence Scoring: Based on extracted...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...: Based on extracted information quality - Format Flexibility: Works with differe...

(QB_NEW_EN)


[grammar] ~338-~338: There might be a mistake here.
Context: ... of crashing ``` ### Confidence Scoring Every extracted entry includes a confide...

(QB_NEW_EN)


[grammar] ~365-~365: There might be a mistake here.
Context: ...rformance Optimization ### Memory Usage - Streaming processing: Process papers i...

(QB_NEW_EN)


[grammar] ~366-~366: There might be a mistake here.
Context: ...g**: Process papers individually, don't load entire corpus - **Configurable chunk si...

(QB_NEW_EN)


[grammar] ~370-~370: There might be a mistake here.
Context: ...t text when needed ### Processing Speed - Regex optimization: Pre-compile freque...

(QB_NEW_EN)

🪛 Shellcheck (0.10.0)
tools/arxiv/tests/run_large_scale_test.sh

[warning] 52-52: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)

🔇 Additional comments (18)
tools/arxiv/configs/arxiv_search_practical.yaml (1)

44-44: Output base_dir relocation — LGTM

Consistent with repo-wide move to data/arxiv_collections. Please verify all tooling resolves this path relative to repo root (not CWD).
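
For illustration, one way to make repo-root resolution explicit is a small helper along these lines (a sketch only; the helper name and the `parents[...]` depth are assumptions, not code from this PR):

```python
from pathlib import Path

import yaml

# Assumed layout: this helper lives two levels below the repo root.
REPO_ROOT = Path(__file__).resolve().parents[2]

def resolve_output_dir(config_path: str) -> Path:
    """Resolve output.base_dir against the repo root instead of the caller's CWD."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    base_dir = Path(config["output"]["base_dir"])
    return base_dir if base_dir.is_absolute() else REPO_ROOT / base_dir
```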

tools/arxiv/db/export_ids.py (1)

304-307: Default --out-dir aligned to data/arxiv_collections — LGTM

Matches YAML configs and utils defaults.

tools/arxiv/configs/arxiv_search_minimal.yaml (1)

31-31: Minimal config base_dir update — LGTM

In sync with other configs and CLI defaults.

CLAUDE.md (1)

369-371: Utils path update — LGTM

Docs now reference rebuild_database.py under utils/.

tools/arxiv/utils/detect_latex.py (1)

26-26: Remove sys.path hacks — LGTM

Cleaner, package-relative imports only.

.gitignore (2)

379-383: Safe to remove old script-data ignores — no tracked files found.


388-390: No tracked analysis docs found — the specified internal analysis documents aren’t in version control, so ignoring them is safe.

README.md (1)

38-38: Reorg note looks good.

"utils/ # Utility scripts" aligns with the repo move. No issues.

tools/arxiv/utils/rebuild_database.py (2)

371-381: Verify schema for pdf_size_bytes to avoid UPDATE failures.

The UPDATE sets pdf_size_bytes, but the insert schema doesn’t include this column and docs don’t mention it. Confirm the column exists or add a migration.

If the column is missing, either:

  • add it: ALTER TABLE papers ADD COLUMN pdf_size_bytes BIGINT;, or
  • drop it from the UPDATE (lower fidelity).

I can generate a migration if you confirm the intended schema.
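
If the column does turn out to be missing, an idempotent migration could look roughly like this (a sketch assuming PostgreSQL 9.6+ and psycopg2; the function name and DSN handling are placeholders):

```python
import psycopg2

def ensure_pdf_size_column(dsn: str) -> None:
    """Add papers.pdf_size_bytes if it does not exist yet (safe to re-run)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "ALTER TABLE papers ADD COLUMN IF NOT EXISTS pdf_size_bytes BIGINT"
        )
    # The connection context manager commits on success and rolls back on error.
```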


335-341: Nice: robust arXiv ID normalization for PDFs.

Regex-based version stripping covers v-suffixed files reliably.
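
For readers skimming the diff, the idea reduces to something like this (illustrative snippet, not the exact code in rebuild_database.py):

```python
import re
from pathlib import Path

VERSION_SUFFIX = re.compile(r"v\d+$")

def arxiv_id_from_pdf(path: str) -> str:
    """Turn a filename like 2508.21038v2.pdf into the canonical ID 2508.21038."""
    stem = Path(path).stem             # "2508.21038v2"
    return VERSION_SUFFIX.sub("", stem)

assert arxiv_id_from_pdf("data/pdfs/2508.21038v2.pdf") == "2508.21038"
assert arxiv_id_from_pdf("2508.21038.pdf") == "2508.21038"   # no version suffix
```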

tools/arxiv/CLAUDE.md (1)

22-29: Pipeline section pathing is consistent with the preceding cd.

Good consistency between utils → ../pipelines. No changes needed.

tools/arxiv/utils/merge_lists.py (1)

43-49: Good: explicit directory writability check.

Nice defensive step with clear error messaging.
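
For context, the check amounts to roughly the following (paraphrased sketch; merge_lists.py's actual messages and structure may differ):

```python
import os
import sys
from pathlib import Path

def require_writable_dir(path: str) -> Path:
    """Fail fast with a clear message if the output directory cannot be used."""
    out_dir = Path(path).expanduser().resolve()
    if not out_dir.is_dir():
        sys.exit(f"Output directory does not exist: {out_dir}")
    if not os.access(out_dir, os.W_OK):
        sys.exit(f"Output directory is not writable: {out_dir}")
    return out_dir
```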

tools/arxiv/tests/validate_pipeline.py (2)

217-219: Update to new lifecycle CLI LGTM.

Matches the repo reorg to tools/arxiv/utils and lifecycle.py.


217-219: Verify CLI path and process command exist
The test prints:

cd tools/arxiv/utils  
python lifecycle.py process [arxiv_id]  

Confirm that tools/arxiv/utils/lifecycle.py actually registers a process subcommand (via click, argparse, etc.) and update these instructions if the script or command name has changed.
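
As a reference point for that check, a `process` subcommand registered with argparse would look roughly like this (illustrative only; lifecycle.py may use click or different option names):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lifecycle")
    sub = parser.add_subparsers(dest="command", required=True)

    process = sub.add_parser("process", help="Process a single paper")
    process.add_argument("arxiv_id")
    process.add_argument("--force", action="store_true")

    status = sub.add_parser("status", help="Show a paper's processing status")
    status.add_argument("arxiv_id")

    batch = sub.add_parser("batch", help="Process papers listed in a file")
    batch.add_argument("paper_list")

    return parser
```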

tools/arxiv/utils/lifecycle.py (2)

32-34: Good: switched to proper package-relative imports, removed path hacks

This aligns with our packaging and import guidelines.


287-304: Ignore unnecessary pdf_url change
The ArXivMetadata dataclass declares pdf_url: str = "" and its __post_init__ populates it with https://arxiv.org/pdf/{arxiv_id}.pdf, so metadata.pdf_url is always defined and no substitution is needed.

Likely an incorrect or invalid review comment.
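
The behavior described corresponds to a dataclass along these lines (reconstructed from the comment above; the real class has more fields):

```python
from dataclasses import dataclass

@dataclass
class ArXivMetadata:
    arxiv_id: str
    title: str = ""
    pdf_url: str = ""

    def __post_init__(self) -> None:
        # pdf_url is always populated, so callers never see an empty value.
        if not self.pdf_url:
            self.pdf_url = f"https://arxiv.org/pdf/{self.arxiv_id}.pdf"
```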

tools/rag_utils/examples/filesystem_example.py (1)

13-23: Nice: import guard for module execution UX

Clear, helpful guidance when run directly.
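
The guard in question is essentially this pattern (paraphrased; the exact imports and wording in filesystem_example.py differ):

```python
import sys

try:
    from rag_utils.academic_citation_toolkit import (
        UniversalBibliographyExtractor,
        FileSystemDocumentProvider,
    )
except ImportError:
    sys.exit(
        "Run this example as a module from the repository root, e.g.:\n"
        "  python -m tools.rag_utils.examples.filesystem_example"
    )
```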

tools/rag_utils/__init__.py (1)

13-35: Public re-exports via relative imports look good.

Matches repo guideline to use relative imports within the module and provides a clean public API surface.
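
For anyone new to the pattern, a re-exporting `__init__.py` of this shape looks roughly like the following (abridged sketch; the actual file exports more names):

```python
# tools/rag_utils/__init__.py (abridged)
from .academic_citation_toolkit import (
    BibliographyEntry,
    CitationStorage,
    DocumentProvider,
    UniversalBibliographyExtractor,
)

__all__ = [
    "BibliographyEntry",
    "CitationStorage",
    "DocumentProvider",
    "UniversalBibliographyExtractor",
]
```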

 output:
   # Base directory for output files
-  base_dir: tools/arxiv/scripts/data/arxiv_collections
+  base_dir: data/arxiv_collections

💡 Verification agent

🧩 Analysis chain

Path update LGTM; verify no stale references to old base_dir.

The move to data/arxiv_collections is consistent with the reorg and with .gitignore’s /data/ rule.

Run to confirm nothing still points at the old path:


🏁 Script executed:

#!/bin/bash
rg -nP 'tools/arxiv/scripts/data/arxiv_collections'

Length of output: 377


Update documentation paths

  • In docs/prd/arxiv_metadata_service_prd.md (lines 34, 188), replace tools/arxiv/scripts/data/arxiv_collections/ with data/arxiv_collections/.
🤖 Prompt for AI Agents
In tools/arxiv/configs/arxiv_search_nokeywords.yaml around line 44 the base_dir
is set to data/arxiv_collections which requires updating references in the
documentation; open docs/prd/arxiv_metadata_service_prd.md and on lines 34 and
188 replace any occurrences of tools/arxiv/scripts/data/arxiv_collections/ with
data/arxiv_collections/ so the docs match the config path exactly.

 output:
   # Base directory for output files
-  base_dir: tools/arxiv/scripts/data/arxiv_collections
+  base_dir: data/arxiv_collections

💡 Verification agent

🧩 Analysis chain

Path update LGTM; verify downstream code and docs are aligned.

Consistent with reorg to data/arxiv_collections and existing ignore rules.

Quick sweep for stale references:


🏁 Script executed:

#!/bin/bash
rg -nP 'tools/arxiv/scripts/data/arxiv_collections'

Length of output: 377


🏁 Script executed:

#!/bin/bash
rg -nP -nC2 'tools/arxiv/scripts/data/arxiv_collections' docs/prd/arxiv_metadata_service_prd.md

Length of output: 518


Update documentation paths in docs/prd/arxiv_metadata_service_prd.md at lines 34 and 188: replace tools/arxiv/scripts/data/arxiv_collections/ with data/arxiv_collections.

🤖 Prompt for AI Agents
In tools/arxiv/configs/arxiv_search.yaml around line 72, the base_dir was
changed to data/arxiv_collections but the documentation still references the old
path; update docs/prd/arxiv_metadata_service_prd.md at lines 34 and 188
replacing occurrences of tools/arxiv/scripts/data/arxiv_collections/ with
data/arxiv_collections so the documentation matches the new config path.

Comment on lines +51 to +63
# Use compgen to safely check for matching files
PAPER_LIST_FILES=($(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt"))

if [ ${#PAPER_LIST_FILES[@]} -eq 0 ]; then
    echo "No existing paper lists found. You need to create a paper list first."
    echo -e "${YELLOW}Note: Use the lifecycle manager to process papers:${NC}"
    echo "python lifecycle.py batch <paper_list.txt>"
    exit 1
else
    # Use the most recent file by modification time
    PAPER_LIST=$(ls -t "${PAPER_LIST_FILES[@]}" | head -1)
    PAPER_COUNT=$(wc -l < "$PAPER_LIST")
    echo -e "${GREEN}Found $PAPER_COUNT papers in $PAPER_LIST${NC}"

🛠️ Refactor suggestion

Use mapfile and robust ls to avoid word-splitting and SC2207

Array assignment from command substitution can split on spaces; prefer mapfile. Also pass “--” to ls for correctness.

-# Use compgen to safely check for matching files
-PAPER_LIST_FILES=($(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt"))
+# Use compgen to list matching files without glob errors
+mapfile -t PAPER_LIST_FILES < <(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt" || true)
@@
-    # Use the most recent file by modification time
-    PAPER_LIST=$(ls -t "${PAPER_LIST_FILES[@]}" | head -1)
+    # Use the most recent file by modification time
+    PAPER_LIST=$(ls -t -- "${PAPER_LIST_FILES[@]}" | head -1)
     PAPER_COUNT=$(wc -l < "$PAPER_LIST")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Use compgen to safely check for matching files
PAPER_LIST_FILES=($(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt"))
if [ ${#PAPER_LIST_FILES[@]} -eq 0 ]; then
echo "No existing paper lists found. You need to create a paper list first."
echo -e "${YELLOW}Note: Use the lifecycle manager to process papers:${NC}"
echo "python lifecycle.py batch <paper_list.txt>"
exit 1
else
# Use the most recent file by modification time
PAPER_LIST=$(ls -t "${PAPER_LIST_FILES[@]}" | head -1)
PAPER_COUNT=$(wc -l < "$PAPER_LIST")
echo -e "${GREEN}Found $PAPER_COUNT papers in $PAPER_LIST${NC}"
# Use compgen to list matching files without glob errors
mapfile -t PAPER_LIST_FILES < <(compgen -G "../../../data/arxiv_collections/arxiv_ids_*.txt" || true)
if [ ${#PAPER_LIST_FILES[@]} -eq 0 ]; then
echo "No existing paper lists found. You need to create a paper list first."
echo -e "${YELLOW}Note: Use the lifecycle manager to process papers:${NC}"
echo "python lifecycle.py batch <paper_list.txt>"
exit 1
else
# Use the most recent file by modification time
PAPER_LIST=$(ls -t -- "${PAPER_LIST_FILES[@]}" | head -1)
PAPER_COUNT=$(wc -l < "$PAPER_LIST")
echo -e "${GREEN}Found $PAPER_COUNT papers in $PAPER_LIST${NC}"
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 52-52: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)

🤖 Prompt for AI Agents
In tools/arxiv/tests/run_large_scale_test.sh around lines 51-63, the current
array assignment uses command substitution which causes word-splitting (SC2207)
and the ls call isn’t robust for filenames starting with “-”. Replace the array
assignment with mapfile -t PAPER_LIST_FILES < <(compgen -G
"../../../data/arxiv_collections/arxiv_ids_*.txt") to safely populate the array
without word-splitting, ensure you quote expansions (e.g.
"${PAPER_LIST_FILES[@]}"), and call ls with the end-of-options marker (ls -t --
"${PAPER_LIST_FILES[@]}") when selecting the most recent file; keep the existing
length check and wc -l logic but ensure all expansions are quoted to avoid
splitting issues.

Comment on lines 10 to 22
 Usage Examples:
     # Process a single paper
-    python lifecycle_cli.py process 2508.21038
+    python lifecycle.py process 2508.21038

     # Check status of a paper
-    python lifecycle_cli.py status 2508.21038
+    python lifecycle.py status 2508.21038

     # Process multiple papers
-    python lifecycle_cli.py batch papers.txt
+    python lifecycle.py batch papers.txt

     # Force reprocessing
-    python lifecycle_cli.py process 2508.21038 --force
+    python lifecycle.py process 2508.21038 --force
 """

🛠️ Refactor suggestion

Fix CLI usage examples — current examples will fail due to relative imports

Because this module uses relative imports (from .arxiv_lifecycle_manager …), invoking it as a file (python lifecycle.py …) will raise “attempted relative import with no known parent package.” Update the examples to module execution.

-    python lifecycle.py process 2508.21038
+    python -m tools.arxiv.utils.lifecycle process 2508.21038
@@
-    python lifecycle.py status 2508.21038
+    python -m tools.arxiv.utils.lifecycle status 2508.21038
@@
-    python lifecycle.py batch papers.txt
+    python -m tools.arxiv.utils.lifecycle batch papers.txt
@@
-    python lifecycle.py process 2508.21038 --force
+    python -m tools.arxiv.utils.lifecycle process 2508.21038 --force
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Usage Examples:
# Process a single paper
python lifecycle_cli.py process 2508.21038
python lifecycle.py process 2508.21038
# Check status of a paper
python lifecycle_cli.py status 2508.21038
python lifecycle.py status 2508.21038
# Process multiple papers
python lifecycle_cli.py batch papers.txt
python lifecycle.py batch papers.txt
# Force reprocessing
python lifecycle_cli.py process 2508.21038 --force
python lifecycle.py process 2508.21038 --force
"""
"""
Usage Examples:
# Process a single paper
python -m tools.arxiv.utils.lifecycle process 2508.21038
# Check status of a paper
python -m tools.arxiv.utils.lifecycle status 2508.21038
# Process multiple papers
python -m tools.arxiv.utils.lifecycle batch papers.txt
# Force reprocessing
python -m tools.arxiv.utils.lifecycle process 2508.21038 --force
"""
🤖 Prompt for AI Agents
In tools/arxiv/utils/lifecycle.py around lines 10 to 22, the CLI usage examples
show running the file directly (python lifecycle.py ...) which fails due to
relative imports; update the examples to use module execution under the package
path (python -m tools.arxiv.utils.lifecycle ...) for each example (process,
status, batch, force) so the module is executed with a proper package context
and relative imports resolve.

Comment on lines +171 to +186
from rag_utils.academic_citation_toolkit import (
    UniversalBibliographyExtractor,
    DocumentProvider,
    BibliographyEntry
)

class SSRNProvider(DocumentProvider):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def get_document_text(self, document_id: str) -> Optional[str]:
        # Fetch from SSRN API
        response = requests.get(f"https://ssrn.com/api/papers/{document_id}",
                                headers={"Authorization": f"Bearer {self.api_key}"})
        return response.json().get("full_text") if response.ok else None


⚠️ Potential issue

Example won’t run: missing requests import in SSRN provider snippet.

Add the import to prevent NameError.

 from rag_utils.academic_citation_toolkit import (
     UniversalBibliographyExtractor,
     DocumentProvider,
     BibliographyEntry
 )
+import requests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-from rag_utils.academic_citation_toolkit import (
-    UniversalBibliographyExtractor,
-    DocumentProvider,
-    BibliographyEntry
-)
-
-class SSRNProvider(DocumentProvider):
-    def __init__(self, api_key: str):
-        self.api_key = api_key
-
-    def get_document_text(self, document_id: str) -> Optional[str]:
-        # Fetch from SSRN API
-        response = requests.get(f"https://ssrn.com/api/papers/{document_id}",
-                                headers={"Authorization": f"Bearer {self.api_key}"})
-        return response.json().get("full_text") if response.ok else None
+from rag_utils.academic_citation_toolkit import (
+    UniversalBibliographyExtractor,
+    DocumentProvider,
+    BibliographyEntry
+)
+import requests
+
+class SSRNProvider(DocumentProvider):
+    def __init__(self, api_key: str):
+        self.api_key = api_key
+
+    def get_document_text(self, document_id: str) -> Optional[str]:
+        # Fetch from SSRN API
+        response = requests.get(
+            f"https://ssrn.com/api/papers/{document_id}",
+            headers={"Authorization": f"Bearer {self.api_key}"}
+        )
+        return response.json().get("full_text") if response.ok else None
🤖 Prompt for AI Agents
In tools/rag_utils/academic_citation_toolkit.md around lines 171 to 186, the
SSRNProvider example references requests but does not import it; add an import
statement for requests at the top of the snippet (alongside the other imports)
so the example runs without raising NameError.

Comment on lines +37 to +38
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

🛠️ Refactor suggestion

Don’t configure root logging in a library module.

Move basicConfig under main() to avoid affecting host apps importing this module.

-logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)

Add in main():

 def main():
     """Test the universal citation toolkit."""
+    logging.basicConfig(level=logging.INFO,
+                        format="%(asctime)s %(levelname)s %(name)s: %(message)s")

Also applies to: 581-585

🤖 Prompt for AI Agents
In tools/rag_utils/academic_citation_toolkit.py around lines 37-38 (and also
lines 581-585), the module currently calls logging.basicConfig at import time
which configures the root logger for any host application; move the basicConfig
call into a main() or if __name__ == "__main__" guard so that root logging is
only configured when the module is run as a script. Remove or replace top-level
basicConfig and ensure logger = logging.getLogger(__name__) remains, then add
logging.basicConfig(...) inside main() before any logging calls when executed
directly.

Comment on lines +415 to +418
# Extract DOI
doi_match = re.search(r'doi:?\s*([10]\.\d+/[^\s,]+)', entry_text, re.IGNORECASE)
doi = doi_match.group(1) if doi_match else None


⚠️ Potential issue

DOI regex is wrong ([10] matches “1” or “0”); fix to the standard DOI form.

Current pattern will miss valid DOIs and match invalid ones.

-            doi_match = re.search(r'doi:?\s*([10]\.\d+/[^\s,]+)', entry_text, re.IGNORECASE)
+            doi_match = re.search(r'\b(10\.\d{4,9}/[^\s"<>]+)\b', entry_text, re.IGNORECASE)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Extract DOI
doi_match = re.search(r'doi:?\s*([10]\.\d+/[^\s,]+)', entry_text, re.IGNORECASE)
doi = doi_match.group(1) if doi_match else None
# Extract DOI
doi_match = re.search(r'\b(10\.\d{4,9}/[^\s"<>]+)\b', entry_text, re.IGNORECASE)
doi = doi_match.group(1) if doi_match else None
🤖 Prompt for AI Agents
In tools/rag_utils/academic_citation_toolkit.py around lines 415 to 418, the DOI
extraction regex uses [10]\.\d+ which incorrectly matches only '1' or '0' before
the dot; replace it with a proper DOI pattern such as using 10\.\d{4,9}/ to
match standard DOIs and capture the suffix up to whitespace or punctuation (for
example: doi:?\s*(10\.\d{4,9}/[^\s,;]+) with re.IGNORECASE), so
doi_match.group(1) returns a valid DOI when present.
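
A quick standalone sanity check of the corrected pattern (uses the regex from the suggestion above; the sample citation string is illustrative):

```python
import re

DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)\b', re.IGNORECASE)

entry = "Vaswani et al. Attention Is All You Need. doi:10.48550/arXiv.1706.03762"
match = DOI_PATTERN.search(entry)
assert match and match.group(1) == "10.48550/arXiv.1706.03762"

# The old pattern r'doi:?\s*([10]\.\d+/[^\s,]+)' fails on the same string,
# because [10] matches only a single character ("1" or "0").
```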
