Updated: Feb 23, 2022
- Create the `sci-sci` conda environment from `environment.yml`.
- Download the OpenAlex snapshots from this link to a directory of your choosing (say, `basedir`).
- Open `preprocessing/flatten_openalex_files.py` and update the `BASEDIR` variable to the above directory.
- Uncomment and run the `flatten_<entity>` functions to generate the flattened compressed CSV files.
  - The `flatten_works()` function generates CSV and Parquet files at the same time.
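
To give a rough idea of what the flattening step involves, here is a minimal sketch using only the standard library. It assumes the snapshot files are gzip-compressed JSON Lines (one entity per line) and keeps only a few top-level fields. The function name `flatten_jsonl_gz` and the column names are illustrative assumptions, not the actual code in `preprocessing/flatten_openalex_files.py`, which handles nested fields and more.

```python
import csv
import gzip
import json
from pathlib import Path


def flatten_jsonl_gz(src, dst, columns):
    """Copy selected top-level fields from a gzipped JSON Lines file
    into a gzipped CSV file. Returns the number of records written.

    Hypothetical helper for illustration; not part of the repository.
    """
    src, dst = Path(src), Path(dst)
    n_rows = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
            gzip.open(dst, "wt", encoding="utf-8", newline="") as fout:
        # extrasaction="ignore" drops any keys not listed in `columns`
        writer = csv.DictWriter(fout, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Missing fields become empty cells in the CSV
            writer.writerow({col: record.get(col) for col in columns})
            n_rows += 1
    return n_rows
```

Calling `flatten_jsonl_gz("part_000.gz", "authors.csv.gz", ["id", "display_name"])` would, under these assumptions, produce a compressed CSV with one row per record in the input file.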
Warnings:
- Flattening authors and works takes anywhere between 15 and 30 hours. The code caches the files, so you should consider running it in batches by setting the `files_to_process` variable.
- Filtering CSVs based on concepts, publication years, and venues
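
The batching advice in the warning above can be sketched as follows. This is only an illustration of the idea: skip inputs whose output is already cached and process a limited number per run. The helper `pending_files`, the `.csv.gz` output naming, and the `batch_size` parameter are assumptions made for this example; the actual semantics of `files_to_process` are defined in the repository's script.

```python
from pathlib import Path


def pending_files(input_files, out_dir, batch_size):
    """Return up to `batch_size` input files with no cached output yet.

    Hypothetical helper: assumes each input `<name>.gz` is cached as
    `<out_dir>/<name>.csv.gz`, and that `batch_size` is a positive integer.
    """
    out_dir = Path(out_dir)
    pending = []
    for path in map(Path, input_files):
        cached = out_dir / f"{path.stem}.csv.gz"
        if cached.exists():
            continue  # already flattened in an earlier run
        pending.append(path)
        if len(pending) == batch_size:
            break
    return pending
```

Running the flattening repeatedly on the list returned by such a helper lets a 15 to 30 hour job be split across several sessions without redoing finished files.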