This is the official repository for the paper "Training and Evaluating Language Models with Template-based Data Generation", published at the ICLR 2025 DATA-FM Workshop.
Our work introduces Template-based Data Generation (TDG), a scalable paradigm for addressing the critical data bottleneck in training LLMs for complex reasoning tasks. We use TDG to create TemplateGSM, a large-scale dataset for training and evaluating mathematical reasoning in language models.
TemplateGSM is a foundational dataset containing over 7.4 million grade school math problems. Each problem is synthetically generated and comes with both a natural language explanation and a programmatically verifiable code solution.
Unlike existing resources, TemplateGSM is built on a framework of programmatic verification, ensuring every single problem-solution pair is correct. This provides an unprecedented level of quality at a massive scale, making it ideal for both supervised fine-tuning (SFT) and emerging alignment techniques like Reinforcement Learning with Verifiable Rewards (RLVR).
At >500x the size of the widely-used MATH benchmark, TemplateGSM provides the community with a powerful new resource to train and evaluate more capable and reliable models.
- Massive Scale: Over 7.4 million problem-solution pairs, with the potential to generate a virtually infinite amount more using our open-source code.
- Programmatic Verification: Every solution is accompanied by executable Python code that has been run to verify its correctness. This guarantees data quality and eliminates the noise found in web-scraped datasets.
- Rich Diversity: Generated from 7,473 unique meta-templates (authored by GPT-4), the dataset covers a wide range of mathematical structures and linguistic styles.
- Enables Verifiable Rewards: The dataset's structure provides a direct, binary reward signal (correct/incorrect) for training models with reinforcement learning, a concept we term Reinforcement Learning with Verifiable Rewards (RLVR); see the reward-function sketch after this list.
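As an illustration of the RLVR signal mentioned above, here is a minimal reward-function sketch. It is not the paper's implementation: it assumes the policy marks its final answer with a hypothetical `####` delimiter (adapt the extraction to your own output format) and compares it against the dataset's verified `result` field:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 if the model's final answer matches
    the dataset's verified `result` field, else 0.0.

    Assumes the model marks its final answer with a '####' delimiter,
    a hypothetical convention; adapt the extraction to your format.
    """
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", model_output)
    if match is None:
        return 0.0  # no parseable final answer counts as incorrect
    predicted = match.group(1).replace(",", "")
    try:
        # Compare numerically so "42", "42.0", and "42.00" all match.
        return float(float(predicted) == float(ground_truth))
    except ValueError:
        return 0.0

# Example: ground-truth result is "42".
print(verifiable_reward("... so the total is #### 42", "42"))  # 1.0
print(verifiable_reward("... the answer is #### 41", "42"))    # 0.0
```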
You can easily access and use TemplateGSM directly from the Hugging Face Hub.
```python
from datasets import load_dataset

# Load the full dataset (7.47M problems)
dataset = load_dataset("math-ai/TemplateGSM", "templategsm-7473-1k")

# Or, load a smaller configuration
# dataset = load_dataset("math-ai/TemplateGSM", "templategsm-1000-1k")  # 1M problems

print(dataset['train'][0])
```
Each record contains the following fields:

- `problem` (string): The mathematical word problem.
- `solution_code` (string): A commented Python solution that programmatically solves the problem.
- `result` (string): The final numerical answer.
- `solution_wocode` (string): A step-by-step solution explained in natural language.
- `template_id` (int): The ID of the meta-template used for generation.
- `problem_id` (int): A unique index for the problem within its template.
- `source` (string): The original data source used to inspire the template.
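Because each `solution_code` is executable, you can re-verify any record locally. The sketch below shows one way to do this, assuming the solution code prints its final answer to stdout; inspect a few records and adapt the comparison if the dataset's convention differs:

```python
import contextlib
import io

from datasets import load_dataset

dataset = load_dataset("math-ai/TemplateGSM", "templategsm-1000-1k")
example = dataset["train"][0]

# Run the solution in a fresh namespace and capture anything it prints.
# (exec is fine here because the code ships with the dataset, but use a
# sandbox if you extend this to model-generated code.)
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(example["solution_code"], {})

printed = buffer.getvalue().strip()
print("code output :", printed)
print("ground truth:", example["result"])
print("match       :", example["result"] in printed)
```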
The dataset is organized into several configurations based on the number of templates used:
- `templategsm-1000-1k`: 1,000,000 problems from the first 1,000 templates.
- `templategsm-2000-1k`: 2,000,000 problems from the first 2,000 templates.
- `templategsm-4000-1k`: 4,000,000 problems from the first 4,000 templates.
- `templategsm-7473-1k`: 7,473,000 problems from all 7,473 templates (the full dataset).
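For the full 7,473,000-problem configuration, you may not want to download everything up front; the `datasets` library's streaming mode lets you iterate over records lazily. A minimal sketch:

```python
from datasets import load_dataset

# Stream the full configuration instead of downloading it in one go.
stream = load_dataset(
    "math-ai/TemplateGSM", "templategsm-7473-1k",
    split="train", streaming=True,
)

# Peek at the first few problems.
for i, example in enumerate(stream):
    print(example["problem"][:80])
    if i >= 2:
        break
```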
If you use the TemplateGSM dataset or the Template-based Data Generation (TDG) paradigm in your research, please cite our paper. Your citation allows us to continue building and sharing impactful resources with the community!
```bibtex
@misc{zhang2024training,
  title={Training and Evaluating Language Models with Template-based Data Generation},
  author={Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew Chi-Chih},
  year={2024}
}
```