ywatanabe1989/crossref-local

CrossRef Local Database

Local hosting and analysis tools for CrossRef 2025 Public Data File (167M papers, 1.4TB).

IF Validation

Components

| Directory | Description |
| --- | --- |
| impact_factor/ | Journal impact factor calculator |
| vendor/dois2sqlite/ | JSON to SQLite converter (from CrossRef Labs) |
| vendor/labs-data-file-api/ | REST API server (from CrossRef Labs) |
| data/ | Database storage (gitignored) |

Quick Start

# Calculate Impact Factor
cd impact_factor
python cli/calculate_if.py --journal "Nature" --year 2024

# Query API
curl "http://localhost:3333/api/search/?doi=10.1038/nature12373"

Setup Guide

1. Download CrossRef Data

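# Download via torrent (168GB compressed)
# Note: the URL below is the torrent's details page. If aria2c only saves an HTML
# page, pass it the .torrent file or magnet link offered on that page instead.
aria2c --continue=true --max-connection-per-server=16 \
  "https://academictorrents.com/details/e0eda0104902d61c025e27e4846b66491d4c9f98"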

2. Create Database

cd vendor/dois2sqlite
python3.11 -m venv .env && source .env/bin/activate
pip install -e .

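# Create and load database
# (paths point at the repository-level data/ directory, the same location
# that step 3 symlinks to as ../../data/crossref.db)
dois2sqlite create ../../data/crossref.db
dois2sqlite load "../../data/March 2025 Public Data File from Crossref" ../../data/crossref.db \
  --n-jobs 8 --commit-size 100000
dois2sqlite index ../../data/crossref.db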
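
Before moving on, it can be worth confirming that the load produced a usable database. The sketch below lists the tables in a SQLite file using only the standard `sqlite3` module. It makes no assumption about the schema that dois2sqlite creates; the helper name `list_tables` is illustrative and not part of this repository.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the connected database, sorted."""
    # sqlite_master is SQLite's built-in catalog of schema objects.
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

For the real database, one option is to open it read-only, for example `sqlite3.connect("file:path/to/crossref.db?mode=ro", uri=True)`, and pass that connection to `list_tables`, so a sanity check can never modify a file that took many hours to build.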

3. Run API Server

cd vendor/labs-data-file-api
python3 -m venv .env && source .env/bin/activate
pip install -r requirements.txt
ln -s ../../data/crossref.db crossref.db
python3 manage.py migrate
python main.py index-all-with-location --data-directory "../../data/March 2025 Public Data File from Crossref"
python3 manage.py runserver 0.0.0.0:3333

Impact Factor Calculator

Results: Strong rank correlation (Spearman r = 0.736) with JCR values across 33 journals.
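
For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation computed on ranks rather than raw values, so it measures whether two lists order the journals the same way. The following is a minimal standard-library sketch of that definition, not the validation code used in this repository. It assumes each input contains at least two distinct values.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

With toy inputs such as `spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])` the two lists mostly agree on ordering, so the coefficient is high but below 1. These numbers are illustrative only and are not results from this project.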

Important Limitation: Some publishers don't deposit complete reference lists to CrossRef. For affected journals (notably The Lancet, published by Elsevier, and NEJM), citation coverage is low (<10%) and the calculated IF is unreliable. Journals with >10% coverage agree with JCR to within about 50% (ratio 0.96-1.46).

| Coverage | Journals | Accuracy |
| --- | --- | --- |
| >10% | Nature, Science, Cell, most neuroscience | Reliable (within 50% of JCR) |
| <10% | The Lancet, NEJM, IEEE, eLife | Unreliable (use with caution) |
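
The rule of thumb in the table above can be written down as a small check. This is a sketch under stated assumptions: `coverage` is taken to be a fraction between 0 and 1, the ratio is taken to be the calculated IF divided by the JCR IF, and the function name `assess` is hypothetical. The source does not say how a journal at exactly 10% coverage is treated, so it is counted as unreliable here.

```python
COVERAGE_THRESHOLD = 0.10  # the 10% cut-off from the table above


def assess(calculated_if: float, jcr_if: float, coverage: float) -> dict:
    """Compare a locally calculated IF with the JCR value (assumed definitions)."""
    ratio = calculated_if / jcr_if
    return {
        "ratio": ratio,
        # Coverage above the threshold is the README's criterion for "Reliable".
        "reliable": coverage > COVERAGE_THRESHOLD,
        # "Within 50% of JCR" read as a ratio between 0.5 and 1.5.
        "within_50_percent": abs(ratio - 1.0) <= 0.5,
    }
```

A call such as `assess(48.0, 50.0, 0.35)` uses made-up inputs and would be flagged as reliable, while the same IF values with `coverage=0.05` would not.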

Run validation: ./examples/impact_factor/run_all_demos.sh (sample output)

Setup (One-Time)

Rebuild citations table for fast IF calculations:

cd impact_factor
screen -S citations-rebuild
python scripts/database/rebuild_citations_table.py \
  --db ../data/crossref.db --batch-size 8192
# Takes 12-48 hours, reduces IF calculation from 5+ min to < 1 sec
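
The speed-up comes from precomputing citation links once and indexing them, so each IF query becomes an index lookup instead of a scan over raw reference lists. The sketch below illustrates the general pattern of batched inserts followed by index creation. The table and column names (`citations`, `citing_doi`, `cited_doi`, `citing_year`) are hypothetical and are not taken from `rebuild_citations_table.py`.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def build_citations_table(conn, edges, batch_size=8192):
    """Load (citing_doi, cited_doi, citing_year) rows in batches, then index."""
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations ("
        "citing_doi TEXT, cited_doi TEXT, citing_year INTEGER)"
    )
    for batch in batched(edges, batch_size):
        conn.executemany("INSERT INTO citations VALUES (?, ?, ?)", batch)
        conn.commit()  # one commit per batch bounds memory use and lost work
    # Building the index after the bulk load avoids updating it row by row.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_citations_cited "
        "ON citations (cited_doi, citing_year)"
    )
    conn.commit()


def count_citations(conn, cited_doi, year):
    """Count citations to one DOI from works published in `year`."""
    row = conn.execute(
        "SELECT COUNT(*) FROM citations WHERE cited_doi = ? AND citing_year = ?",
        (cited_doi, year),
    ).fetchone()
    return row[0]
```

The `batch_size` default mirrors the `--batch-size 8192` value in the command above. Larger batches mean fewer commits at the cost of more memory per batch.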

Usage

# Single journal
python cli/calculate_if.py --journal "Nature" --year 2024

# Using ISSN (faster)
python cli/calculate_if.py --issn "0028-0836" --year 2024

# Batch processing
echo -e "Nature\nScience\nCell" > journals.txt
python cli/calculate_if.py --journal-file journals.txt --year 2024 --output results.csv

# 5-year impact factor
python cli/calculate_if.py --journal "Nature" --year 2024 --window 5

See impact_factor/docs/ for detailed documentation.
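
As background for the `--window` option, the conventional impact factor for a year divides the citations received in that year by items published in the preceding window (2 or 5 years) by the number of citable items published in that window. The sketch below implements that textbook definition on plain dictionaries. How `calculate_if.py` decides which items count as citable is not stated here, so treat this as an illustration rather than the tool's exact logic.

```python
def impact_factor(citations_by_pub_year, items_by_pub_year, year, window=2):
    """Textbook impact factor for `year` over the preceding `window` years.

    citations_by_pub_year: citations received in `year`, keyed by the
        publication year of the cited item.
    items_by_pub_year: number of citable items, keyed by publication year.
    """
    pub_years = range(year - window, year)  # e.g. 2022 and 2023 for year=2024
    citations = sum(citations_by_pub_year.get(y, 0) for y in pub_years)
    items = sum(items_by_pub_year.get(y, 0) for y in pub_years)
    if items == 0:
        raise ValueError("no citable items in the window")
    return citations / items
```

For example, a journal with 100 citable items in each of 2022 and 2023 that received 300 and 200 citations to them in 2024 would have a 2-year value of 500 / 200 = 2.5. These figures are invented for the arithmetic only.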

API Endpoints

Search by DOI

curl "http://localhost:3333/api/search/?doi=10.1001/.387"

Search by Title

curl "http://localhost:3333/api/search/?title=deep%20learning&year=2020"

Search by Author

curl "http://localhost:3333/api/search/?authors=smith&year=2020"

Combined Search

curl "http://localhost:3333/api/search/?title=medicine&year=2020&authors=jones"
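
The same queries can be issued from Python. The sketch below builds the search URL with the standard library and percent-encodes parameter values in the same way as the curl examples. The parameter names (`doi`, `title`, `authors`, `year`) come from the examples above. That the server replies with JSON is an assumption, and the function names are illustrative.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:3333/api/search/"


def build_search_url(base_url=BASE_URL, **params):
    """Build a search URL, percent-encoding each parameter value."""
    # Drop parameters that were not supplied.
    query = {k: v for k, v in params.items() if v is not None}
    # quote (rather than the default quote_plus) encodes spaces as %20;
    # safe="/" keeps the slash inside DOIs readable.
    return f"{base_url}?{urlencode(query, safe='/', quote_via=quote)}"


def search(**params):
    """Send the request and decode the body, assuming a JSON response."""
    with urlopen(build_search_url(**params), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from step 3 running, `search(title="deep learning", year=2020)` would request the same URL as the title example above.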

Project Structure

crossref_local/
├── README.md                 # This file
├── impact_factor/            # Impact factor calculator
│   ├── cli/                  # Command-line tools
│   ├── src/                  # Core library
│   ├── scripts/              # Database maintenance
│   ├── tests/                # Test suite
│   └── docs/                 # Documentation
├── vendor/                   # External tools (vendored)
│   ├── dois2sqlite/          # JSON to SQLite converter
│   └── labs-data-file-api/   # REST API server
├── data/                     # Database (gitignored)
│   └── crossref.db           # 1.4TB SQLite database
├── docs/                     # Root documentation
└── legacy/                   # Historical files (gitignored)

Data Sources

Vendored Dependencies

Original repositories (preserved locally in case of upstream changes):

License

For academic and research purposes. CrossRef data usage subject to CrossRef terms.


SciTeX
AGPL-3.0 · ywatanabe@scitex.ai
