RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
- [2026-02-03] 🔥 Training recipes (external) are now available. For RuRL, refer to RuscaRL (sync) or verl-rubric (async). For RuFT/SFT, refer to LlamaFactory. Our RubricHub rule-based scorer/grader integration for RuRL (incl. instruction-following rules) is being cleaned up and will be released soon.
- [2026-02-03] 🔥 Data synthesis code released. See data_synthesis_final/README.md.
- [2026-01-17] RubricHub dataset is released, see https://huggingface.co/datasets/sojuL/RubricHub_v1.
- [2026-01-12] RubricHub paper is released, see https://arxiv.org/abs/2601.08430.
Reinforcement Learning with Verifiable Rewards (RLVR) has shown great success in math and coding. However, open-ended generation remains challenging due to the lack of ground truth.
We introduce RubricHub, a large-scale (~110k) and multi-domain rubric dataset constructed via an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces highly discriminative criteria capable of capturing subtle nuances in model responses.
Based on RubricHub, we propose a two-stage post-training pipeline:
- RuFT (Rubric-based Rejection Sampling Fine-Tuning)
- RuRL (Rubric-based Reinforcement Learning)
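To make the idea behind rubric-based rejection sampling concrete, here is a minimal sketch. It is not the RubricHub scorer (that integration is not released yet); the `Criterion` class, the weights, the `keyword_judge` stand-in, and the `threshold` value are all assumptions for illustration. In practice the judge would be an LLM grader or a rule-based checker.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Criterion:
    description: str  # what the response is expected to contain (hypothetical format)
    weight: float     # hypothetical importance weight


Judge = Callable[[str, Criterion], bool]


def rubric_score(response: str, rubric: List[Criterion], judge: Judge) -> float:
    """Weighted fraction of criteria that the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in rubric if judge(response, c))
    return earned / total


def rejection_sample(candidates: Sequence[str], rubric: List[Criterion],
                     judge: Judge, threshold: float = 0.8) -> Optional[str]:
    """Return the best-scoring candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    scored = [(rubric_score(r, rubric, judge), r) for r in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


def keyword_judge(response: str, criterion: Criterion) -> bool:
    """Toy stand-in for a real grader: case-insensitive substring match."""
    return criterion.description.lower() in response.lower()


rubric = [
    Criterion("dosage", 2.0),
    Criterion("side effects", 1.0),
    Criterion("see a doctor", 1.0),
]
candidates = [
    "Take it daily.",
    "The dosage is 5 mg; side effects include nausea; see a doctor if symptoms persist.",
    "Dosage is 5 mg.",
]
kept = rejection_sample(candidates, rubric, keyword_judge)
```

Here only the second candidate satisfies enough of the rubric to be kept as SFT data; a query whose candidates all fall below the threshold would simply be dropped.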
Experimental results show that our post-trained Qwen3-14B achieves SOTA results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5.
Existing rubrics often suffer from scalability bottlenecks and low discriminability. Our framework addresses both through three stages:
- Principle-Guided & Response-Grounded Generation: Synthesizing criteria anchored to specific response contexts and guided by meta-principles to prevent generic or hallucinatory criteria.
- Multi-Model Aggregation: Aggregating perspectives from heterogeneous frontier models (e.g., GPT-5.1, Gemini 3 Pro) to eliminate single-source bias.
- Difficulty Evolution: Evolving criteria to capture discriminative nuances between "excellent" and "exceptional" responses, preventing score saturation.
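As an illustration of what "score saturation" means, the sketch below computes per-criterion pass rates over a set of graded responses and flags criteria that every response passes (or fails), since those cannot separate one response from another. This is not the paper's difficulty-evolution procedure; the function names, the criterion names, and the grade format are hypothetical.

```python
from typing import Dict, List


def criterion_pass_rates(grades: List[Dict[str, bool]]) -> Dict[str, float]:
    """grades[i][name] is True if response i satisfies criterion `name`."""
    if not grades:
        return {}
    names = grades[0].keys()
    return {n: sum(g[n] for g in grades) / len(grades) for n in names}


def saturated_criteria(grades: List[Dict[str, bool]], eps: float = 0.0) -> List[str]:
    """Criteria whose pass rate is within eps of 0 or 1, i.e. non-discriminative."""
    rates = criterion_pass_rates(grades)
    return sorted(n for n, r in rates.items() if r <= eps or r >= 1.0 - eps)


# Hypothetical grades for four strong responses to the same query.
grades = [
    {"mentions_topic": True, "cites_evidence": True, "quantifies_uncertainty": True},
    {"mentions_topic": True, "cites_evidence": True, "quantifies_uncertainty": False},
    {"mentions_topic": True, "cites_evidence": False, "quantifies_uncertainty": False},
    {"mentions_topic": True, "cites_evidence": False, "quantifies_uncertainty": False},
]
rates = criterion_pass_rates(grades)
flat = saturated_criteria(grades)
```

In this toy data, `mentions_topic` is passed by every response and so contributes nothing to ranking them, while the harder criteria still spread the scores out.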
RubricHub contains approximately 110k high-quality query-rubric pairs across five major domains:
- 🏥 Medical: 27.1%
- 🔬 Science: 27.1%
- 📝 Instruction Following:
- ✍️ Writing: 15.9%
- 💬 Chat: 9.0%
The dataset features high-density supervision, with complex domains like Writing and Medical averaging over 30 fine-grained criteria per query.
We validated RubricHub using Qwen3 base models. The results demonstrate significant improvements across all domains.
Key Result: On HealthBench, our Qwen3-14B (post-trained with RuFT → RuRL) achieves a score of 69.3, outperforming GPT-5 (67.2).
- (Recommended) Create a clean env:
  conda create -n rubrichub python=3.10 -y
  conda activate rubrichub
- Install deps for the data synthesis pipeline:
  pip install -U openai tqdm pyarrow
- Prepare an input JSONL, and set QUESTION_COLUMN to the field name that contains your prompt text (it can be question, prompt, instruction, etc.).
- Edit run_data_synthesis.sh (top “1) Fill here”):
  - fill input/output paths and QUESTION_COLUMN
  - for each model slot (REFERENCE_*, RESPONSE_*, RUBRIC_*, MERGE_*, AUGMENT_*), fill its *_BASE_URL, *_API_KEY, and *_MODEL
    - if your OpenAI-compatible server ignores API keys, set *_API_KEY to any non-empty string (e.g., "dummy") since the OpenAI SDK requires it
- Run:
  ./run_data_synthesis.sh
Outputs will be written to $OUTPUT_DIR/:
- final.parquet (main artifact)
- final.jsonl (same content, easier to inspect)
- step0_reference.jsonl ~ step4_augmented.jsonl (intermediates for resume/debug)
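For the input-preparation step, a small helper like the following can write a JSONL file and check that every row has a non-empty string in the column you plan to pass as QUESTION_COLUMN. The helper names and the sample prompts are hypothetical, and the pipeline may accept additional fields; this sketch only covers the one-prompt-field-per-line case described above.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_input_jsonl(prompts: Iterable[str], path, question_column: str = "question") -> int:
    """Write one JSON object per line, with the prompt stored under question_column."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({question_column: prompt}, ensure_ascii=False) + "\n")
            count += 1
    return count


def check_input_jsonl(path, question_column: str = "question") -> List[int]:
    """Return 1-based line numbers of rows lacking a non-empty string in question_column."""
    bad = []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            value = json.loads(line).get(question_column)
            if not isinstance(value, str) or not value.strip():
                bad.append(lineno)
    return bad


# Demo in a temporary directory; the file uses "prompt" as its prompt field.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "input.jsonl"
    n_written = write_input_jsonl(
        ["Explain how vaccines train the immune system.",
         "Write a limerick about compilers."],
        demo_path,
        question_column="prompt",
    )
    missing = check_input_jsonl(demo_path, question_column="prompt")
    wrong_column = check_input_jsonl(demo_path, question_column="question")
```

Running the check with the wrong column name flags every row, which is a quick way to catch a mismatch between the file and QUESTION_COLUMN before spending API calls.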
For pipeline architecture and implementation details, see data_synthesis_final/README.md.
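To inspect a finished or interrupted run, one option is to count the rows in each JSONL file under the output directory and look at the field names of the first record. The sketch below assumes only that the outputs are JSONL files as listed above; the function names and the demo field names (`question`, `reference`, `rubric`) are hypothetical and may not match the real schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def summarize_outputs(output_dir) -> Dict[str, int]:
    """Map each *.jsonl file name in output_dir to its number of non-empty lines."""
    summary = {}
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            summary[path.name] = sum(1 for line in f if line.strip())
    return summary


def first_record_keys(jsonl_path) -> List[str]:
    """Sorted field names of the first non-empty record, or [] for an empty file."""
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                return sorted(json.loads(line).keys())
    return []


# Demo with stand-in files; real runs would point at $OUTPUT_DIR instead.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "step0_reference.jsonl").write_text(
        json.dumps({"question": "q1", "reference": "r1"}) + "\n"
        + json.dumps({"question": "q2", "reference": "r2"}) + "\n",
        encoding="utf-8",
    )
    (out / "final.jsonl").write_text(
        json.dumps({"question": "q1", "rubric": []}) + "\n",
        encoding="utf-8",
    )
    summary = summarize_outputs(out)
    keys = first_record_keys(out / "final.jsonl")
```

A step file with fewer rows than the one before it is a reasonable first place to look when a run stopped early.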
You can reproduce the RuFT/RuRL pipeline using existing open-source trainers:
- RuFT (Rubric-based Rejection Sampling Fine-Tuning): run SFT with LlamaFactory.
- RuRL (Rubric-based Reinforcement Learning): run RL with RuscaRL (sync) or verl-rubric (async).
If you find RubricHub useful for your research, please cite our paper:
@article{li2026rubrichub,
title={RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation},
author={Li, Sunzhu and Zhao, Jiale and Wei, Miteto and Ren, Huimin and Zhou, Yang and Yang, Jingwen and Liu, Shunyu and Zhang, Kaike and Chen, Wei},
journal={arXiv preprint arXiv:2601.08430},
year={2026}
}

This project is licensed under the Apache 2.0 License.

