opencitations_Cleanup

Script that shrinks the http://opencitations.net CSV data dump so that it fits in memory on a machine with 64 GB of RAM (it could work with 32 GB or 16 GB as well, although authorsc should be dropped). The dump needs to be downloaded and extracted separately into a folder called Data. For future releases, just add the names of the newly added files to the names array used for extraction.

To run, download the dataset (roughly 120 GB of disk space), extract it into Data, then run

PYTHONHASHSEED=0 python3 DownsizeCitationData.py

This produces AllCitations.parq and HashedDoiMap.parq, which contain the hashed pairs of references and the hash-to-DOI map respectively; total size is 18 GB. On machines with less RAM, run
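The relationship between the two outputs can be pictured with a small in-memory sketch. This is not the script's actual code: the helper name `build_hashed_tables`, the tuple layout, and the use of Python's built-in `hash()` are assumptions made for illustration, and the DOIs are made up.

```python
from typing import Dict, Iterable, List, Tuple


def build_hashed_tables(
    citations: Iterable[Tuple[str, str]],
) -> Tuple[List[Tuple[int, int]], Dict[int, str]]:
    """Turn (citing DOI, cited DOI) pairs into integer pairs plus a lookup table.

    Hypothetical helper: it mirrors the idea of AllCitations.parq (hashed pairs)
    and HashedDoiMap.parq (hash -> DOI), not the script's real implementation.
    """
    pairs: List[Tuple[int, int]] = []
    hash_to_doi: Dict[int, str] = {}
    for citing, cited in citations:
        citing_hash, cited_hash = hash(citing), hash(cited)
        # Remember which DOI each hash stands for.
        hash_to_doi[citing_hash] = citing
        hash_to_doi[cited_hash] = cited
        pairs.append((citing_hash, cited_hash))
    return pairs, hash_to_doi


# Made-up DOIs, used only to exercise the helper.
example = [("10.1000/a", "10.1000/b"), ("10.1000/c", "10.1000/b")]
pairs, hash_to_doi = build_hashed_tables(example)
# Each hashed pair can be translated back to DOIs through the map.
decoded = [(hash_to_doi[a], hash_to_doi[b]) for a, b in pairs]
```

Storing two integers per citation instead of two DOI strings is what keeps the pair table small; the DOI strings are kept once each in the map.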

PYTHONHASHSEED=0 python3 DownsizeCitationData.py n

where the n at the end is chosen according to how much you think your RAM can manage at a time (i.e. 32/n).

REMEMBER TO SET PYTHONHASHSEED=0 WHEN RUNNING, OTHERWISE THE HASH SEED IS RANDOM and the result will be garbage. In Jupyter or PyCharm this can easily be set for the whole interpreter.
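A minimal sketch of why the seed matters, assuming the hashes come from Python's built-in `hash()` on strings (an assumption; the script itself is not shown here). The helper name and the DOI are hypothetical. It starts two fresh interpreters with PYTHONHASHSEED=0 and compares the hash of the same string.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text: str, seed: str) -> int:
    """Start a new Python process with PYTHONHASHSEED=seed and return hash(text)."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


doi = "10.1000/example"  # made-up DOI for illustration
first = string_hash_in_fresh_interpreter(doi, "0")
second = string_hash_in_fresh_interpreter(doi, "0")
# With the seed fixed to 0, separate runs agree on the hash of the same string.
same_across_runs = first == second
```

Without a fixed seed, each interpreter start picks its own random seed for string hashing, so hashes written in one run would not match DOIs hashed in a later one.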

The hash map should be fine up to hundreds of billions of papers. It currently holds 60 million papers, so false positives (hash collisions) are extremely unlikely.
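As a rough check, the expected number of colliding pairs among n uniformly distributed b-bit hashes is about n(n-1)/2 divided by 2^b. The sketch below assumes 64-bit hash values, which is what Python's built-in `hash()` gives for strings on a 64-bit build; the width actually used by the script is an assumption here, and the function name is hypothetical.

```python
def expected_colliding_pairs(n_items: int, hash_bits: int = 64) -> float:
    """Birthday-bound estimate: n(n-1)/2 pairs, each colliding with probability 2**-bits."""
    return n_items * (n_items - 1) / 2 / 2 ** hash_bits


# 60 million papers, the figure quoted for the current dataset.
estimate = expected_colliding_pairs(60_000_000)
```

Under that assumption the estimate is on the order of 1e-4 for the current data, so a collision is improbable rather than strictly impossible.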

DownsizeCitationData.py
README.md
reference_comparison_utils.py
reference_extractor.py

JakubJDolezal/opencitations_Cleanup
