To reproduce the figures and analysis in this paper:
- Run `query_semanticscholar.py` on one of the domains {CS, Chemistry, Economics, Medicine, Physics}, or on your own list of scientists (a hedged end-to-end sketch of these data-preparation steps follows this list)
- Run `filter_by_year` using the birth/death dates file; output to `abstracts_filtered_year`
- Clean the abstracts with `get_vectors.py`; output to the `abstracts-cleaned` directory
- Encode the abstracts with SBERT via `sbert.py`; output to `sbert-abstracts`
- Order the abstracts and convert their dates to timestamps via `emergence_order.py`; output to the `abstracts-ordered` directory
- Run the models on the `abstracts-ordered` directory
- Hyperparameters must be tuned for the models on each scientist: run `opt_hyperparam_exemplar.py` for each model/field combination; individual parameter values are written to `/individual-s-vals/`
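Concretely, a full data-preparation pass for one field might look like the sketch below. This is an assumption-laden sketch, not the scripts' documented interface: the flag names (`--field`, `--dates`, `--input`, `--output`) and the dates filename are guesses, so check each script's `--help` or source for the real arguments.

```bash
# Hypothetical end-to-end data preparation for one field (Physics).
# All flag names and the dates filename below are assumptions.
python query_semanticscholar.py --field physics           # fetch abstracts per scientist
python filter_by_year.py --dates birth_death_dates.csv \
       --output abstracts_filtered_year                   # keep papers within each lifespan
python get_vectors.py --input abstracts_filtered_year \
       --output abstracts-cleaned                         # clean the abstracts
python sbert.py --input abstracts-cleaned \
       --output sbert-abstracts                           # SBERT-encode the abstracts
python emergence_order.py --input sbert-abstracts \
       --output abstracts-ordered                         # order/convert dates to timestamps
```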
- Run the comparison between models (worked example invocations follow this list):
  `src/models/predict.py --type <nobel/turing> --field <field> --measure ll -i`
- Run shuffle tests between models:
  `src/models/predict.py --type <nobel/turing> --field <field> --measure ll -i -s --sy`
- Run the authorship analysis:
  `src/models/predict_k_author_papers.py --type <nobel/turing> --field <field> -k <max authors, or -1 for first author>`
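For example, to run the full set of analyses for the physics Nobel laureates (the flags are as given above; the lowercase field value `physics` is an assumption, so substitute whatever field names the scripts accept):

```bash
# Model comparison on log-likelihood (ll) for one type/field combination.
src/models/predict.py --type nobel --field physics --measure ll -i

# The same comparison with the shuffle test enabled.
src/models/predict.py --type nobel --field physics --measure ll -i -s --sy

# Authorship analysis for papers with at most 3 authors
# (pass -k -1 to restrict to first-author papers).
src/models/predict_k_author_papers.py --type nobel --field physics -k 3
```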
Most figures are generated through functions in `rain_plots.py`, based on the simulation outputs generated through the "Running models" section. The stacked authorship charts can be generated with `stacked_bar_authorship.py`, and the t-SNE visualizations with `make_tsne_figure.py`.
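A minimal sketch of the figure-generation step, assuming `stacked_bar_authorship.py` and `make_tsne_figure.py` can be run directly (their arguments, if any, are not documented here; `rain_plots.py` exposes plotting functions rather than a command-line entry point, so its figures come from calling those functions on the model outputs):

```bash
# Hypothetical invocations; both scripts' argument lists are assumptions.
python stacked_bar_authorship.py   # stacked authorship charts
python make_tsne_figure.py         # t-SNE visualizations
```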