A collection of recent open-source math datasets for training and evaluating Math LLMs

amao0o0/awesome-AI-Math-Datasets

AI Math Datasets

This repo contains recent open-source math datasets (mainly English) for training and evaluating Math Large Language Models (LLMs).

Note

This repo is under active development and is updated regularly.


Pre-training

📝 Text Only

| Dataset | Description | References |
| --- | --- | --- |
| Open-Web-Math | An open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| Open-Web-Math-Pro | Refined from open-web-math using the ProX refining framework. It contains about 5B high-quality math-related tokens, ready for pre-training. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| AMPS | Auxiliary Mathematics Problems and Solutions. A collection of mathematical problems and step-by-step solutions, comprising over 100,000 problems from Khan Academy and approximately 5 million problems generated using Mathematica scripts. | 🐙 Repo |
| NaturalProofs | A dataset designed to study mathematical reasoning in natural language, comprising approximately 32,000 theorem statements and proofs, 14,000 definitions, and 2,000 additional pages sourced from diverse mathematical domains. | 📄 Paper, 🐙 Repo |
| MathPile | A math-centric corpus comprising about 9.5 billion tokens. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| AlgebraicStack | A dataset of 11B tokens of code specifically related to mathematics. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MathCode-Pile | Containing 19.2B tokens, with math-related data covering web pages, textbooks, model-synthesized text, and math-related code. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| FineMath | Consisting of 34B tokens (FineMath-3+) and 54B tokens (FineMath-3+ with InfiMM-WebMath-3+) of mathematical educational content filtered from CommonCrawl. | 📄 Paper, 🤗 Dataset |
| Proof-Pile-2 | A 55 billion token dataset of mathematical and scientific documents from arXiv, open-web-math and algebraic-stack. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| AutoMathText | A dataset encompassing around 200 GB of mathematical texts. It's a compilation sourced from a diverse range of platforms including various websites, arXiv, and GitHub (OpenWebMath, RedPajama, Algebraic Stack). | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MegaMath | An open math pretraining dataset curated from diverse, math-focused sources, with over 300B tokens. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
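
To compare the text-only corpora above at a glance, the token counts quoted in their descriptions can be collected and sorted. This is only a minimal sketch: the figures are copied from the table (in billions of tokens), MegaMath's "over 300B" is treated as a lower bound of 300, FineMath uses its 34B FineMath-3+ figure, and entries without a stated token count (AMPS, NaturalProofs, AutoMathText) are left out. The names `TOKENS_B` and `rank_by_size` are illustrative and not part of any listed project.

```python
# Token counts in billions, as quoted in the descriptions above.
TOKENS_B = {
    "Open-Web-Math": 14.7,
    "Open-Web-Math-Pro": 5.0,
    "MathPile": 9.5,
    "AlgebraicStack": 11.0,
    "MathCode-Pile": 19.2,
    "FineMath (FineMath-3+)": 34.0,
    "Proof-Pile-2": 55.0,
    "MegaMath": 300.0,  # assumption: "over 300B" used as a lower bound
}


def rank_by_size(tokens_b):
    """Return (name, billions of tokens) pairs, largest corpus first."""
    return sorted(tokens_b.items(), key=lambda item: item[1], reverse=True)


for name, size in rank_by_size(TOKENS_B):
    print(f"{name:<24} {size:>6.1f}B")
```

The counts should not be summed into a single budget, because the corpora overlap: Proof-Pile-2, for instance, is described as including open-web-math and algebraic-stack.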

🖼️ Vision-Text Modality

| Dataset | Description | References |
| --- | --- | --- |
| InfiMM-WebMath-40B | A dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. | 📄 Paper, 🤗 Dataset |

Supervised Fine-Tuning

📝 Text Only

| Dataset | Description | References |
| --- | --- | --- |
| SVAMP | A collection of 1,000 elementary-level math word problems. | 📄 Paper, 🐙 Repo |
| GSM8K | A dataset of 8.5K high-quality grade school math word problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer. | 📄 Paper, 🔗 Project, 🐙 Repo |
| MathQA | A dataset of 37k English multiple-choice math word problems covering multiple math domains, built by annotating word problems from the AQuA dataset with operation programs. | 🔗 Project |
| MATH | A challenging dataset that extends beyond the high school level and covers diverse topics, including algebra, precalculus, and number theory. Each problem in MATH has a full step-by-step solution. | 🔗 Project |
| NuminaMath | A comprehensive collection of 860,000 problem-solution pairs ranging from high-school level to advanced-competition level. The dataset has both CoT and PoT rationales (NuminaMath-CoT and NuminaMath-TIR, where TIR stands for tool-integrated reasoning). | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MetaMath | A dataset with 395K samples created by bootstrapping questions from MATH and GSM8K. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| MathInstruct | An instruction tuning dataset that combines data from 13 mathematical rationale datasets, uniquely focusing on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| CoinMath | A dataset designed to enhance mathematical reasoning in large language models by incorporating diverse coding styles into code-based rationales. It includes math questions annotated with code-based solutions that feature concise comments, descriptive naming conventions, and hardcoded solutions. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| OpenMathInstruct-2 | A math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model. | 📄 Paper, 🤗 Dataset |
| CAMEL Math | Containing 50K problem-solution pairs obtained using GPT-4. The pairs were generated from 25 math topics, with 25 subtopics for each topic. | 📄 Paper, 🤗 Dataset |
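
Several entries above (NuminaMath, MathInstruct) combine chain-of-thought (CoT) and program-of-thought (PoT) rationales. The sketch below illustrates the difference on an invented word problem: a CoT rationale is free text from which the final answer has to be parsed, whereas a PoT rationale is a program whose execution produces the answer. The problem, the `Answer:` marker, and the `answer` variable are assumptions made for illustration; they are not the actual format of any dataset listed here.

```python
import re

# Hypothetical problem: "A box holds 12 pencils. How many pencils are left
# from 7 boxes after giving away 5?"
cot_rationale = (
    "Each box holds 12 pencils, so 7 boxes hold 7 * 12 = 84 pencils. "
    "After giving away 5, 84 - 5 = 79 remain. Answer: 79"
)

pot_rationale = """
pencils_per_box = 12
boxes = 7
given_away = 5
answer = pencils_per_box * boxes - given_away
"""


def answer_from_cot(text):
    """Parse the number after 'Answer:' (an assumed convention)."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def answer_from_pot(program):
    """Run a program rationale and read its `answer` variable (assumed convention)."""
    namespace = {}
    # exec runs arbitrary code: acceptable for this trusted snippet only;
    # generated programs should be executed in a sandbox.
    exec(program, namespace)
    return float(namespace["answer"])


print(answer_from_cot(cot_rationale), answer_from_pot(pot_rationale))
```

Both routes yield the same number here; the practical difference is that the PoT route delegates the arithmetic to the interpreter instead of relying on the model's written calculation.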

🖼️ Vision-Text Modality

| Dataset | Description | References |
| --- | --- | --- |
| GeoQA | Containing 4,998 Chinese geometric multiple-choice questions with rich domain-specific program annotations. | 📄 Paper, 🐙 Repo |
| UniGeo | Containing 4,998 calculation problems and 9,543 proving problems. | 📄 Paper, 🐙 Repo |
| Geo170K | A synthetic dataset that contains around 60,000 geometric image-caption pairs and more than 110,000 question-answer pairs. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MAVIS | Containing two datasets: (1) MAVIS-Caption: 588K high-quality caption-diagram pairs, spanning geometry and function; (2) MAVIS-Instruct: 834K instruction-tuning data with CoT rationales in a text-lite version. | 📄 Paper, 🐙 Repo |
| Geometry3K | Consisting of 3,002 geometry problems with dense annotation in formal language. | 📄 Paper, 🔗 Project, 🐙 Repo |
| MathV360K | Consisting of 40K images from 24 datasets and 360K question-answer pairs. | 🔗 Project, 🤗 Dataset |
| MultiMath300K | A multimodal, multilingual, multi-level, and multistep mathematical reasoning dataset that encompasses a wide range of K-12 level mathematical problems. | 🔗 Project |

Reinforcement Learning

While many datasets listed in Supervised Fine-Tuning can be adapted for reinforcement learning, we specifically highlight datasets explicitly designed for RL as indicated in their respective references.

📝 Text Only

| Dataset | Description | References |
| --- | --- | --- |
| PRM800K | A process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset. | 📄 Paper, 🔗 Project, 🐙 Repo |
| Big-Math | A dataset of over 250,000 high-quality math questions with verifiable answers, purposefully made for reinforcement learning (RL). Extracted questions satisfy three desiderata: (1) problems with uniquely verifiable solutions, (2) problems that are open-ended, and (3) problems with a closed-form solution. | 🐙 Repo, 🤗 Dataset |
| Math-Shepherd | Problems and step-by-step solutions with automatic labels. | 📄 Paper, 🔗 Project, 🤗 Dataset |
| OpenR1-Math-220k | Consisting of 220k math problems from NuminaMath 1.5, each with two to four reasoning traces generated by DeepSeek R1. | 🤗 Dataset |
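
Datasets with verifiable answers, such as Big-Math above, lend themselves to rule-based rewards: the model's final answer is extracted and compared against the reference. Below is a minimal sketch of such a check. It assumes the model writes its final answer inside `\boxed{...}` (a common convention, not something this list specifies) and that comparing whitespace-stripped strings is good enough. The function names are hypothetical, and a real verifier would typically also accept mathematically equivalent forms such as `1/2` and `0.5`.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close brace of \boxed{
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer):
    """Drop all whitespace so 'x + 1' and 'x+1' compare equal."""
    return "".join(answer.split())


def reward(completion, reference):
    """1.0 if the boxed answer matches the reference string, else 0.0."""
    predicted = extract_boxed(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


print(reward(r"So the result is \boxed{x + 1}.", "x+1"))
```

A binary reward of this kind only works when each problem has a single checkable final answer, which is the property the Big-Math desiderata aim for.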

Benchmark

📝 Text Only

| Dataset | Description | References |
| --- | --- | --- |
| Lila | A mathematical reasoning benchmark consisting of over 140K natural language questions from 23 diverse tasks. | 🔗 Project, 🤗 Dataset |
| MathBench | A benchmark that tests large language models on math across five difficulty levels. It evaluates both theory and problem-solving skills in English and Chinese. | 📄 Paper, 🐙 Repo |
| MathOdyssey | A collection of 387 mathematical problems for evaluating the general mathematical capacities of LLMs, featuring a spectrum of questions from Olympiad-level competitions, advanced high school curricula, and university-level mathematics. | 📄 Paper, 🔗 Project, 🐙 Repo |
| Omni-MATH | A challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| HARP | A math reasoning dataset consisting of 4,780 short answer questions from US national math competitions. | 📄 Paper, 🐙 Repo |

🖼️ Vision-Text Modality

| Dataset | Description | References |
| --- | --- | --- |
| MathVerse | A collection of 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. | 🔗 Project, 🤗 Dataset |
| MathVista | A benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets. | 🔗 Project, 🤗 Dataset |
| MATH-Vision | A collection of 3,040 mathematical problems with visual contexts sourced from real math competitions, spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty. | 🔗 Project, 🤗 Dataset |
| We-Math | A collection of 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| OlympiadBench | An Olympiad-level bilingual multimodal scientific benchmark containing math and physics problems sourced from the International Olympiads, the Chinese Olympiad, and the Chinese College Entrance Exam (GaoKao). | 📄 Paper, 🐙 Repo, 🤗 Dataset |
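
Benchmarks that grade problems by difficulty (for example, the five levels mentioned for MathBench and MATH-Vision) are often reported as accuracy per level as well as overall. The following is a minimal sketch of that bookkeeping. It assumes each evaluation record is a dict with hypothetical `level`, `prediction`, and `answer` fields and that an exact string match counts as correct. The records are invented for illustration and do not come from any benchmark.

```python
from collections import defaultdict


def is_correct(record):
    """Exact match after trimming whitespace (a simplifying assumption)."""
    return record["prediction"].strip() == record["answer"].strip()


def accuracy_by_level(records):
    """Map each difficulty level to the fraction of correct predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["level"]] += 1
        if is_correct(record):
            correct[record["level"]] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}


# Invented evaluation records.
records = [
    {"level": 1, "prediction": "4", "answer": "4"},
    {"level": 1, "prediction": "7", "answer": "7"},
    {"level": 3, "prediction": "12", "answer": "15"},
    {"level": 3, "prediction": "9", "answer": "9"},
    {"level": 5, "prediction": "0", "answer": "1"},
]

per_level = accuracy_by_level(records)
overall = sum(is_correct(r) for r in records) / len(records)
print(per_level, overall)
```

Reporting both views helps because a single overall score can hide the fact that most correct answers come from the easiest level.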

Related Repo

https://github.com/tongyx361/Awesome-LLM4Math

https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md
