This repo contains a pipeline that parses the submissions and reviews of all OpenReview venues into a unified format and annotates them with metadata.
🔍 Metadata includes:
- 🧪 Research Hypothesis (annotated via LLM)
- 🔗 References (from Semantic Scholar)
- 📊 Citation Counts for accepted papers (from Semantic Scholar)
Create a new conda environment and install the dependencies:
```bash
conda create -n openreview_parser python=3.11
conda activate openreview_parser
pip install -e .
python -c "import nltk; nltk.download('punkt_tab')"
```

We parse PDF submissions using GROBID.
⚙️ From the ./openreview_parser/pipeline directory:
```bash
bash ./scripts/setup_grobid.sh
bash ./scripts/run_grobid.sh
```

💡 Tip: GROBID requires Java (≥11). Check with `java -version`.
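To verify that the GROBID service is up before parsing, you can query its health endpoint. This sketch assumes GROBID's default port 8070; adjust it if your setup uses a different port:

```bash
# Returns "true" when the GROBID service is ready to accept requests.
curl http://localhost:8070/api/isalive
```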
🐛 If you run into trouble, check issues in the s2orc-doc2json repo.
🪟 tmux is also needed:
```bash
sudo apt update
sudo apt install tmux
```

📖 Overview of the unified data model: README
If you do not have an OpenReview account, run the pipeline with the guest client by setting the corresponding field in the config files in ./openreview_parser/pipeline/configs/.
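The exact field name depends on the config schema, so as a rough sketch (the `client`/`guest` key names below are assumptions, not taken from the repo) you can locate the relevant setting like this:

```bash
# Find the client-related field in a pipeline config; the key name shown in
# the file is authoritative -- the search terms here are only a guess.
grep -inE "client|guest" ./openreview_parser/pipeline/configs/pipeline_v1.yaml
```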
🛠️ Pipeline steps:
- 📥 Retrieve venues from OpenReview
- 🧩 Map venue schemas to a unified data model
- 📄 Parse all submissions and reviews
- 🏷️ Annotate submissions with metadata:
  - 🔗 References (Title & Abstract)
  - 🧪 Research Hypothesis
  - 📊 Citation Counts
From the ./openreview_parser/pipeline directory:
```bash
bash ./scripts/run_pipeline.sh
```

💡 This will run the pipeline twice: once for API V1 venues and once for API V2, each using its corresponding pipeline.yaml config.
- 🧪 Hypothesis Annotation: Set `OPENAI_API_KEY` in your environment (see the sketch after this list).
- 📊 Citation Count: Recommended to use a Semantic Scholar API key to avoid rate limits.
- 🔗 References: Run from ./openreview_parser/scientific_databases:

  ```bash
  python s2_datasets.py
  ```

  💾 Requires ~140GB of disk space and a Semantic Scholar API key.
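A minimal sketch for providing the keys via environment variables. `OPENAI_API_KEY` is named in this README, while the Semantic Scholar variable name (`S2_API_KEY`) is an assumption; check the pipeline configs and code for the exact mechanism:

```bash
# Key for the LLM-based hypothesis annotation (named in this README).
export OPENAI_API_KEY="sk-..."
# Semantic Scholar key to avoid rate limits; the variable name below is an
# assumption -- the pipeline may expect it in a config file instead.
export S2_API_KEY="..."
```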
Make sure GROBID is running. Then:
API V1 venue:
```bash
python pipeline.py --config ./configs/pipeline_v1.yaml --venue "ICLR.cc/2022/Conference"
```

API V2 venue:

```bash
python pipeline.py --config ./configs/pipeline_v2.yaml --venue "ICLR.cc/2024/Conference"
```

📄 A list of API V1 venues: venue_strings_api_v1.json
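To pick a valid --venue string, you can peek at that file, for example:

```bash
# Pretty-print the list of API V1 venue identifiers and show the first entries.
python -m json.tool venue_strings_api_v1.json | head -n 20
```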
You can find the resulting dataset on HuggingFace:
👉 scientific-quality-score-prediction
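To fetch the released dataset locally, here is a sketch using the Hugging Face CLI; the namespace below is a placeholder, so use the organization shown on the dataset page linked above:

```bash
pip install -U huggingface_hub
# Download the dataset snapshot; replace <org> with the actual namespace.
huggingface-cli download --repo-type dataset <org>/scientific-quality-score-prediction --local-dir ./data
```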
🗓️ Pipeline run timestamps:
- ✅ Latest: 1.1.2025
If you use this codebase, please cite:
```bibtex
@article{hopner2025automatic,
  title={Automatic Evaluation Metrics for Artificially Generated Scientific Research},
  author={H{\"o}pner, Niklas and Eshuijs, Leon and Alivanistos, Dimitrios and Zamprogno, Giacomo and Tiddi, Ilaria},
  journal={arXiv preprint arXiv:2503.05712},
  year={2025}
}
```