This repo contains a pipeline that parses the submissions and reviews of all OpenReview venues into a unified format and annotates them with metadata.
🔍 Metadata includes:
- 🧪 Research Hypothesis (annotated via LLM)
- 🔗 References (from Semantic Scholar)
- 📊 Citation Counts for accepted papers (from Semantic Scholar)
Create a new conda environment and install the dependencies:
```bash
conda create -n openreview_parser python=3.11
conda activate openreview_parser
pip install -e .
python -c "import nltk; nltk.download('punkt_tab')"
```

We parse PDF submissions using GROBID.
⚙️ From the ./openreview_parser/pipeline directory:
```bash
bash ./scripts/setup_grobid.sh
bash ./scripts/run_grobid.sh
```

💡 Tip: GROBID requires Java (≥11). Check with `java -version`.
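To verify that the GROBID service is up before parsing, you can query its health endpoint. This sketch assumes GROBID's default port 8070; adjust it if your setup uses a different port:

```bash
# Returns "true" when the GROBID service is ready to accept requests.
curl http://localhost:8070/api/isalive
```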
🐛 If you run into trouble, check issues in the s2orc-doc2json repo.
🪟 tmux is also needed:
```bash
sudo apt update
sudo apt install tmux
```

📖 Overview of the unified data model: README
If you do not have an OpenReview account, run the pipeline with the guest client by setting the corresponding field in the config files in ./openreview_parser/pipeline/configs/.
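The exact field name depends on the config schema, so as a rough sketch (the `client`/`guest` key names below are assumptions, not taken from the repo) you can locate the relevant setting like this:

```bash
# Find the client-related field in a pipeline config; the key name shown in
# the file is authoritative -- the search terms here are only a guess.
grep -inE "client|guest" ./openreview_parser/pipeline/configs/pipeline_v1.yaml
```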
🛠️ Pipeline steps:
- 📥 Retrieve venues from OpenReview
- 🧩 Map venue schemas to a unified data model
- 📄 Parse all submissions and reviews
- 🏷️ Annotate submissions with metadata:
  - 🔗 References (Title & Abstract)
  - 🧪 Research Hypothesis
  - 📊 Citation Counts
From the ./openreview_parser/pipeline directory:
```bash
bash ./scripts/run_pipeline.sh
```

💡 This will run the pipeline twice: once for API V1 venues and once for API V2, each using its corresponding pipeline.yaml config.
- 🧪 Hypothesis Annotation: Set `OPENAI_API_KEY` in your environment (see the sketch after this list).
- 📊 Citation Count: Recommended to use a Semantic Scholar API key to avoid rate limits.
- 🔗 References: Run from ./openreview_parser/scientific_databases:

  ```bash
  python s2_datasets.py
  ```

  💾 Requires ~140GB of disk space and a Semantic Scholar API key.
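A minimal sketch for providing the keys via environment variables. `OPENAI_API_KEY` is named in this README, while the Semantic Scholar variable name (`S2_API_KEY`) is an assumption; check the pipeline configs and code for the exact mechanism:

```bash
# Key for the LLM-based hypothesis annotation (named in this README).
export OPENAI_API_KEY="sk-..."
# Semantic Scholar key to avoid rate limits; the variable name below is an
# assumption -- the pipeline may expect it in a config file instead.
export S2_API_KEY="..."
```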
Make sure GROBID is running. Then:
API V1 venue:
```bash
python pipeline.py --config ./configs/pipeline_v1.yaml --venue "ICLR.cc/2022/Conference"
```

API V2 venue:

```bash
python pipeline.py --config ./configs/pipeline_v2.yaml --venue "ICLR.cc/2024/Conference"
```

📄 A list of API V1 venues: venue_strings_api_v1.json
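To pick a valid --venue string, you can peek at that file, for example:

```bash
# Pretty-print the list of API V1 venue identifiers and show the first entries.
python -m json.tool venue_strings_api_v1.json | head -n 20
```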
You can find the resulting dataset on HuggingFace:
👉 scientific-quality-score-prediction
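To fetch the released dataset locally, here is a sketch using the Hugging Face CLI; the namespace below is a placeholder, so use the organization shown on the dataset page linked above:

```bash
pip install -U huggingface_hub
# Download the dataset snapshot; replace <org> with the actual namespace.
huggingface-cli download --repo-type dataset <org>/scientific-quality-score-prediction --local-dir ./data
```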
🗓️ Pipeline run timestamps:
- ✅ Latest: 1.1.2025
If you use this codebase, please cite:
```bibtex
@article{hopner2025automatic,
  title={Automatic Evaluation Metrics for Artificially Generated Scientific Research},
  author={H{\"o}pner, Niklas and Eshuijs, Leon and Alivanistos, Dimitrios and Zamprogno, Giacomo and Tiddi, Ilaria},
  journal={arXiv preprint arXiv:2503.05712},
  year={2025}
}
```