local-llm-ref-verifier

Privacy-preserving citation verification for unpublished research manuscripts. References are extracted and normalized locally into JSON, verified for existence and canonical metadata against scholarly APIs and public sources, and the literature review is then audited by an air-gapped local LLM for citation accuracy. The manuscript text never leaves your local machine.

How it works

Three-stage pipeline:

(Pipeline diagram: Extract → Verify → Audit)

  1. Extract (local, no internet) -- Parses the PDF reference section using regex. Auto-detects citation style (APA, IEEE, Vancouver, Harvard, Chicago). Outputs structured JSON.
  2. Verify (online, metadata only) -- Checks each reference title/author against CrossRef, Semantic Scholar, and Google Scholar APIs. Only minimal metadata is sent. Computes confidence scores via fuzzy matching. Also fetches paper abstracts and summaries (when available) for correctness checking in Stage 3.
  3. Audit (local, no internet) -- Uses a local LLM (Ollama) to compare the manuscript body against verified references. Flags uncited references, missing citations, year mismatches, misquoted claims, and unsupported claims. Uses fetched abstracts/summaries to verify that the manuscript accurately represents the cited work.
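Stage 2's fuzzy-matched confidence score can be sketched with the standard library (a minimal illustration of the idea, not the project's actual implementation; the `confidence` helper and its normalization rules are assumptions):

```python
from difflib import SequenceMatcher

def confidence(extracted_title: str, canonical_title: str) -> float:
    """Score how closely an extracted title matches a canonical one.

    Normalizes case and whitespace, then returns a similarity ratio
    in [0.0, 1.0]; 1.0 means the normalized titles are identical.
    """
    a = " ".join(extracted_title.lower().split())
    b = " ".join(canonical_title.lower().split())
    return SequenceMatcher(None, a, b).ratio()
```

A title that differs only in capitalization scores 1.0, while an unrelated title scores near 0, which is why a threshold on this ratio can separate `verified` from `ambiguous` results.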

Install

pip install -e ".[dev]"

Requires Ollama for Stage 3 only:

ollama pull llama3.1

Usage

Run the full pipeline:

ref-verifier run paper.pdf -o output/ -m llama3.1

Or run stages independently:

ref-verifier extract paper.pdf -o refs.json
ref-verifier verify refs.json -o verified.json
ref-verifier audit paper.pdf verified.json -o report.json -m llama3.1

Options:

  • -s / --style -- Force citation style (apa, ieee, vancouver, harvard, chicago). Auto-detected if omitted.
  • -m / --model -- Ollama model name (default: llama3.1). Only used by audit and run.
  • --google-scholar -- Enable Google Scholar fallback (slow, rate-limited).
  • -v / --verbose -- Verbose logging.

Output

Each stage produces a JSON file:

  • extracted_references.json -- Parsed references with authors, title, year, journal, volume, pages, DOI.
  • verification_results.json -- Verification status (verified/ambiguous/not_found), confidence scores, canonical metadata, abstracts, and TLDR summaries.
  • audit_report.json -- Citation issues list with severity and a human-readable summary.

Testing

Run tests (excluding slow live-API tests):

pytest -m "not slow"

Run all tests including live API verification:

pytest -m slow

The test suite includes 13 real research papers across all 5 citation styles (IEEE, Vancouver, APA, Harvard, Chicago) with single-column and two-column layouts. Each paper has a companion JSON with 3 injected fake citations for verifier testing.
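The slow/fast split above relies on a pytest marker; registering it keeps pytest from warning about unknown marks. A minimal sketch of the relevant configuration (assumed, not necessarily the repo's actual file):

```ini
# pytest.ini (assumed): register the `slow` marker used for live-API tests
[pytest]
markers =
    slow: tests that hit live scholarly APIs (skipped by `pytest -m "not slow"`)
```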

To do list

  • Support LaTeX and .docx import
  • Benchmark performance across different local LLMs
  • Add an option for paid/premium scholarly APIs
  • Improve the front-end interface
  • Add JSON file cleanup
  • Improve the test suite: download real papers with real citations and inject fake citations
