RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

📢 News

[2026-02-03] 🔥 Training recipes (external) are now available. For RuRL, refer to RuscaRL (sync) or verl-rubric (async). For RuFT/SFT, refer to LlamaFactory. Our RubricHub rule-based scorer/grader integration for RuRL (incl. instruction-following rules) is being cleaned up and will be released soon.
[2026-02-03] 🔥 Data synthesis code released. See data_synthesis_final/README.md.
[2026-01-17] RubricHub dataset is released, see https://huggingface.co/datasets/sojuL/RubricHub_v1.
[2026-01-12] RubricHub paper is released, see https://arxiv.org/abs/2601.08430.

📖 Introduction

Reinforcement Learning with Verifiable Rewards (RLVR) has shown great success in math and coding. However, open-ended generation remains challenging due to the lack of ground truth.

We introduce RubricHub, a large-scale (~110k) and multi-domain rubric dataset constructed via an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces highly discriminative criteria capable of capturing subtle nuances in model responses.

Based on RubricHub, we propose a two-stage post-training pipeline:

RuFT (Rubric-based Rejection Sampling Fine-Tuning)
RuRL (Rubric-based Reinforcement Learning)
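
The rejection-sampling idea behind RuFT can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function name `rejection_sample`, the `threshold` parameter, and the precomputed rubric scores are hypothetical and are not taken from the RubricHub codebase.

```python
from typing import List, Optional, Tuple


def rejection_sample(
    scored_candidates: List[Tuple[str, float]],
    threshold: float = 0.6,
) -> Optional[str]:
    """Return the highest-scoring response if it clears `threshold`, else None.

    `scored_candidates` holds (response_text, rubric_score) pairs. The rubric
    score is assumed to be already normalized to the range [0, 1].
    """
    if not scored_candidates:
        return None
    best_text, best_score = max(scored_candidates, key=lambda pair: pair[1])
    return best_text if best_score >= threshold else None


# Hypothetical candidates for one query, each with a rubric score.
candidates = [("draft A", 0.42), ("draft B", 0.81), ("draft C", 0.77)]
print(rejection_sample(candidates, threshold=0.6))  # draft B
```

Responses that survive this filter would then serve as SFT targets; queries where no candidate clears the threshold are simply dropped in this sketch.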

Experimental results show that our post-trained Qwen3-14B achieves SOTA results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5.

🚀 Methodology

Automated Coarse-to-Fine Rubric Generation

Existing rubrics often suffer from scalability bottlenecks and low discriminability. Our framework addresses this through three stages:

Principle-Guided & Response-Grounded Generation: Synthesizing criteria anchored to specific response contexts and guided by meta-principles to prevent generic or hallucinatory criteria.
Multi-Model Aggregation: Aggregating perspectives from heterogeneous frontier models (e.g., GPT-5.1, Gemini 3 Pro) to eliminate single-source bias.
Difficulty Evolution: Evolving criteria to capture discriminative nuances between "excellent" and "exceptional" responses, preventing score saturation.
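
As a rough illustration of the aggregation stage, the sketch below merges criteria proposed by several models and drops near-verbatim duplicates. This is a deliberately simplified stand-in: it uses plain text normalization rather than a model-based merge, and the function name `merge_criteria` and the example model names are hypothetical.

```python
from typing import Dict, List


def merge_criteria(per_model: Dict[str, List[str]]) -> List[str]:
    """Union criteria from several models, dropping case/whitespace duplicates."""
    seen = set()
    merged: List[str] = []
    for model_name in sorted(per_model):  # sorted for deterministic output
        for criterion in per_model[model_name]:
            # Normalize case and collapse runs of whitespace before comparing.
            key = " ".join(criterion.lower().split())
            if key and key not in seen:
                seen.add(key)
                merged.append(criterion.strip())
    return merged


# Hypothetical criteria from two generator models for the same query.
proposals = {
    "model_a": ["Cites a dosage range", "States contraindications"],
    "model_b": ["cites a  dosage range", "Recommends follow-up"],
}
print(merge_criteria(proposals))
# ['Cites a dosage range', 'States contraindications', 'Recommends follow-up']
```

Exact-text deduplication cannot catch paraphrased criteria, which is one reason a semantic merge is preferable in practice.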

📊 RubricHub Dataset

RubricHub contains approximately 110k high-quality query-rubric pairs across five major domains:

🏥 Medical: 27.1%
🔬 Science: 27.1%
📝 Instruction Following: 20.9%
✍️ Writing: 15.9%
💬 Chat: 9.0%

The dataset features high-density supervision, with complex domains like Writing and Medical averaging over 30 fine-grained criteria per query.
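
To make the idea of fine-grained criteria concrete, here is a minimal sketch of how a list of weighted criteria could be turned into a single score. The `Criterion` class, its `points` field, and the normalization rule (earned points over achievable positive points) are assumptions for illustration and do not describe the actual RubricHub schema or scorer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    description: str
    points: float  # hypothetical weight; a negative value would act as a penalty


def rubric_score(criteria: List[Criterion], met: List[bool]) -> float:
    """Earned points divided by achievable positive points, clipped to [0, 1]."""
    if len(criteria) != len(met):
        raise ValueError("criteria and met must have the same length")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric with three criteria; a grader judged the first and third as met.
rubric = [
    Criterion("Names the first-line treatment", 5),
    Criterion("Mentions when to seek urgent care", 3),
    Criterion("Uses plain, non-technical language", 2),
]
score = rubric_score(rubric, [True, False, True])
print(f"{score:.2f}")  # 0.70
```

The per-criterion `met` judgments would come from a grader (a model or a rule); producing them is outside the scope of this sketch.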

📈 Experiments

We validated RubricHub using Qwen3 base models. The results demonstrate significant improvements across all domains.

Key Result: On HealthBench, our Qwen3-14B (post-trained with RuFT → RuRL) achieves a score of 69.3, outperforming GPT-5 (67.2).

🛠️ Usage

Data Synthesis (Coarse-to-Fine Rubric Generation)

(Recommended) Create a clean env:

conda create -n rubrichub python=3.10 -y
conda activate rubrichub

Install deps for the data synthesis pipeline:

pip install -U openai tqdm pyarrow

Prepare an input JSONL, and set QUESTION_COLUMN to the field name that contains your prompt text (it can be question, prompt, instruction, etc.).
Edit the “1) Fill here” block at the top of run_data_synthesis.sh:

fill input/output paths and QUESTION_COLUMN
for each model slot (REFERENCE_*, RESPONSE_*, RUBRIC_*, MERGE_*, AUGMENT_*), fill its *_BASE_URL, *_API_KEY, and *_MODEL
- if your OpenAI-compatible server ignores API keys, set *_API_KEY to any non-empty string (e.g., "dummy") since the OpenAI SDK requires it
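
If you do not have an input file yet, a small script along these lines can produce one. It is a sketch only: the file name `input.jsonl`, the helper `write_input_jsonl`, and the sample prompts are placeholders, and the field name must match whatever you set for QUESTION_COLUMN.

```python
import json
from pathlib import Path
from typing import List

# Placeholder value: keep it identical to QUESTION_COLUMN in run_data_synthesis.sh.
QUESTION_COLUMN = "question"


def write_input_jsonl(path: str, prompts: List[str], column: str = QUESTION_COLUMN) -> int:
    """Write one JSON object per line, storing each prompt under `column`."""
    with Path(path).open("w", encoding="utf-8") as f:
        for text in prompts:
            f.write(json.dumps({column: text}, ensure_ascii=False) + "\n")
    return len(prompts)


if __name__ == "__main__":
    sample_prompts = [
        "Explain how vaccines train the immune system.",
        "Write a short cover letter for a data analyst role.",
    ]
    count = write_input_jsonl("input.jsonl", sample_prompts)
    print(f"wrote {count} records")
```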

Run:

./run_data_synthesis.sh

Outputs will be written to $OUTPUT_DIR/:

final.parquet (main artifact)
final.jsonl (same content, easier to inspect)
step0_reference.jsonl through step4_augmented.jsonl (intermediate files for resuming or debugging)

For pipeline architecture and implementation details, see data_synthesis_final/README.md.
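
For a quick sanity check of the output, a sketch like the following counts how often each top-level field appears in final.jsonl. It makes no assumption about the record schema; the helper names `iter_jsonl` and `summarize` are hypothetical, and reading OUTPUT_DIR from the environment is only a convenience of this example.

```python
import json
import os
from pathlib import Path
from typing import Dict, Iterator


def iter_jsonl(path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path) -> Dict[str, int]:
    """Count how many records contain each top-level field."""
    counts: Dict[str, int] = {}
    for record in iter_jsonl(path):
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    final_path = Path(os.environ.get("OUTPUT_DIR", ".")) / "final.jsonl"
    if final_path.exists():
        print(summarize(final_path))
    else:
        print(f"{final_path} not found; run the pipeline first")
```

Fields that appear in fewer records than the total line count point to rows worth inspecting by hand.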

Training (RuFT & RuRL)

You can reproduce the RuFT/RuRL pipeline using existing open-source trainers:

RuFT (Rubric-based Rejection Sampling Fine-Tuning): run SFT with LlamaFactory.
RuRL (Rubric-based Reinforcement Learning): run RL with RuscaRL (sync) or verl-rubric (async).

🖊️ Citation

If you find RubricHub useful for your research, please cite our paper:

@article{li2026rubrichub,
  title={RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation},
  author={Li, Sunzhu and Zhao, Jiale and Wei, Miteto and Ren, Huimin and Zhou, Yang and Yang, Jingwen and Liu, Shunyu and Zhang, Kaike and Chen, Wei},
  journal={arXiv preprint arXiv:2601.08430},
  year={2026}
}

📄 License

This project is licensed under the Apache 2.0 License.
