Privacy-preserving citation verification for unpublished research manuscripts. The manuscript text never leaves your local machine.
Three-stage pipeline:
- Extract (local, no internet) -- Parses the PDF reference section using regex. Auto-detects citation style (APA, IEEE, Vancouver, Harvard, Chicago). Outputs structured JSON.
- Verify (online, metadata only) -- Checks each reference title/author against CrossRef, Semantic Scholar, and Google Scholar APIs. Only minimal metadata is sent. Computes confidence scores via fuzzy matching. Also fetches paper abstracts and summaries (when available) for correctness checking in Stage 3.
- Audit (local, no internet) -- Uses a local LLM (Ollama) to compare the manuscript body against verified references. Flags uncited references, missing citations, year mismatches, misquoted claims, and unsupported claims. Uses fetched abstracts/summaries to verify that the manuscript accurately represents the cited work.
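Stage 1's regex parsing can be sketched roughly as follows. The pattern below is an illustrative simplification for one APA-style shape, not the tool's actual regex, and `parse_apa` is a hypothetical helper:

```python
import re
from typing import Optional

# Illustrative (simplified) APA-style pattern:
# "Author, A. B. (2020). Title. Journal."
APA_REF = re.compile(
    r"(?P<authors>[^()]+)\s\((?P<year>\d{4})\)\.\s(?P<title>[^.]+)\."
)

def parse_apa(line: str) -> Optional[dict]:
    """Parse one APA-style reference line into structured fields."""
    m = APA_REF.search(line)
    if not m:
        return None
    return {
        "authors": m.group("authors").strip().rstrip(","),
        "year": int(m.group("year")),
        "title": m.group("title").strip(),
    }
```

A real extractor needs one pattern family per supported style (IEEE's bracketed numbers, Vancouver's numeric lists, and so on), which is what the style auto-detection selects between.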
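Stage 2's confidence score comes from fuzzy matching of extracted fields against API metadata. A minimal sketch using the stdlib `difflib` (the real tool may use a different matcher, and the thresholds here are illustrative, not the tool's):

```python
from difflib import SequenceMatcher

def title_confidence(extracted: str, candidate: str) -> float:
    """Similarity in [0, 1] between an extracted title and an API candidate."""
    a = extracted.lower().strip()
    b = candidate.lower().strip()
    return SequenceMatcher(None, a, b).ratio()

def classify(score: float) -> str:
    """Map a confidence score to a verification status (thresholds are illustrative)."""
    if score >= 0.9:
        return "verified"
    if score >= 0.6:
        return "ambiguous"
    return "not_found"
```

Normalizing case and whitespace before matching keeps trivial formatting differences (title case, trailing periods) from dragging down the score.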
Install from source:
pip install -e ".[dev]"
Requires Ollama for Stage 3 only:
ollama pull llama3.1
Run the full pipeline:
ref-verifier run paper.pdf -o output/ -m llama3.1
Or run stages independently:
ref-verifier extract paper.pdf -o refs.json
ref-verifier verify refs.json -o verified.json
ref-verifier audit paper.pdf verified.json -o report.json -m llama3.1
Options:
- `-s` / `--style` -- Force citation style (`apa`, `ieee`, `vancouver`, `harvard`, `chicago`). Auto-detected if omitted.
- `-m` / `--model` -- Ollama model name (default: `llama3.1`). Only used by `audit` and `run`.
- `--google-scholar` -- Enable Google Scholar fallback (slow, rate-limited).
- `-v` / `--verbose` -- Verbose logging.
Each stage produces a JSON file:
- `extracted_references.json` -- Parsed references with authors, title, year, journal, volume, pages, DOI.
- `verification_results.json` -- Verification status (verified/ambiguous/not_found), confidence scores, canonical metadata, abstracts, and TLDR summaries.
- `audit_report.json` -- Citation issues list with severity and a human-readable summary.
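Because each stage emits plain JSON, the outputs are easy to post-process. A hypothetical snippet for pulling out weakly verified references; the field names (`confidence`, `title`) mirror the descriptions above, but check them against your actual files:

```python
import json

def low_confidence_refs(path: str, threshold: float = 0.9) -> list:
    """Return references from verification_results.json below a confidence threshold.

    Field names ("confidence", "title") mirror the output descriptions
    above; verify them against your actual output files.
    """
    with open(path) as f:
        results = json.load(f)
    return [r for r in results if r.get("confidence", 0.0) < threshold]
```

This kind of filtering is useful for triage: references below the threshold are the ones worth checking by hand before trusting the Stage 3 audit.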
Run tests (excluding slow live-API tests):
pytest -m "not slow"
Run all tests including live API verification:
pytest -m slow
The test suite includes 13 real research papers across all 5 citation styles (IEEE, Vancouver, APA, Harvard, Chicago) with single-column and two-column layouts. Each paper has a companion JSON with 3 injected fake citations for verifier testing.
- Support LaTeX and .docx import
- Benchmark performance across different local LLMs
- Add optional support for paid/premium scholar APIs
- Improve front-end interface
- Add JSON file cleanup
- Improve the test suite: download more real papers with genuine citations and inject fake ones
