This repository has been archived by the owner on Mar 23, 2021. It is now read-only.

elifesciences/datacapsule-crossref

DataCapsule Crossref

Retrieve and extract citations from Crossref data.

The links to the latest dumps can be found in the notebook.

If the notebook doesn't render within GitHub, you could try the following URL using nbviewer.

Prerequisites

  • Python 3
  • Pipe Viewer (pv) to show progress in some shell scripts (Ubuntu: sudo apt-get install pv)

Setup

pip install -r requirements.txt

Data Retrieval

Data is retrieved via Crossref's Works API (doc).

The download starts with the cursor *. Should the download be interrupted for any reason (which is likely), the data/crossref-works.zip.meta file contains the next cursor to use. The download currently takes at least about 90 hours and can't be run in parallel due to the way the cursor works.
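The resume behaviour described above can be sketched in Python as follows. This is a minimal illustration, not the code behind download_crossref_works.sh: the names `fetch_page`, `handle_page`, `load_cursor`, `save_cursor` and `download_works`, as well as the JSON layout of the state file, are hypothetical. The only details taken from the Works API are the `*` start cursor and the `next-cursor` field in the response `message`.

```python
import json


def load_cursor(meta_path, default="*"):
    """Return the saved cursor, or the start cursor "*" if nothing was saved."""
    try:
        with open(meta_path) as f:
            return json.load(f).get("next-cursor", default)
    except FileNotFoundError:
        return default


def save_cursor(meta_path, cursor):
    """Persist the cursor to use for the next request (hypothetical format)."""
    with open(meta_path, "w") as f:
        json.dump({"next-cursor": cursor}, f)


def download_works(fetch_page, meta_path, handle_page):
    """Fetch pages sequentially until a page has no items.

    fetch_page(cursor) is assumed to return one parsed Works API response.
    handle_page(page) stores the raw response. Returns the number of pages
    handled in this run.
    """
    cursor = load_cursor(meta_path)
    pages = 0
    while True:
        page = fetch_page(cursor)
        message = page.get("message", {})
        if not message.get("items"):
            break  # an empty page marks the end of the result set
        handle_page(page)
        pages += 1
        cursor = message.get("next-cursor")
        if not cursor:
            break
        # Save only after the page was handled, so a restart repeats
        # at most the request that was in flight.
        save_cursor(meta_path, cursor)
    return pages
```

Each request needs the cursor returned by the previous response, which is why a loop like this cannot be split across parallel workers.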

To start or resume the download run:

./download_crossref_works.sh

The files data/crossref-works.zip and data/crossref-works.zip.meta will be created and updated. crossref-works.zip will contain files with the raw responses.
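To inspect the downloaded archive, something like the following sketch may help. It assumes that every member of the zip file is one complete JSON response from the Works API, with the works listed under `message.items`; the member naming and the exact layout are not documented here, so treat this as an assumption to verify against your own download. The function name `iter_work_items` is hypothetical.

```python
import json
import zipfile


def iter_work_items(zip_path):
    """Yield every work item found in the raw responses of the archive.

    Assumes each zip member holds one JSON response with the works
    under message.items.
    """
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as f:
                response = json.load(f)
            yield from response.get("message", {}).get("items", [])
```

For example, `sum(1 for _ in iter_work_items("data/crossref-works.zip"))` would count the works without loading the whole archive into memory.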

Extract Citations

Run:

./extract_citations_from_crossref_works.sh [--multi-processing]

Note the --multi-processing flag is optional and may make the processing faster.

That will create data/crossref-works-citations.tsv.gz, a compressed TSV file with the following columns:

  • citing_doi
  • cited_doi
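The resulting file can be read with the Python standard library. The sketch below counts how often each DOI is cited. It assumes that the file starts with a header row naming the two columns above; if your file has no header row, pass `fieldnames=["citing_doi", "cited_doi"]` to `csv.DictReader`. The function name `count_citations` is hypothetical.

```python
import csv
import gzip
from collections import Counter


def count_citations(tsv_gz_path):
    """Return a Counter mapping each cited DOI to its number of citations."""
    counts = Counter()
    with gzip.open(tsv_gz_path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters in DOIs as ordinary text
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["cited_doi"]:
                counts[row["cited_doi"]] += 1
    return counts
```

`count_citations("data/crossref-works-citations.tsv.gz").most_common(10)` would then list the ten most-cited DOIs.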

Create Summary Stats

Run:

./extract_summaries_from_crossref_works.sh [--multi-processing] [--debug]

Note the --multi-processing flag is optional and may make the processing faster. The --debug flag is currently required because it adds a debug column containing more details about references.

That will create data/crossref-works-summaries.tsv.gz, a compressed TSV file with a number of key features of works (used by the following step).

Run:

./citations_stats.sh

That will create the following files with summary stats:

  • data/crossref-works-summaries-stat.tsv
  • data/crossref-works-summaries-by-type-and-publisher-stat.tsv.gz
  • data/crossref-works-reference-stat.tsv.gz
  • data/crossref-works-citations*.tsv
