This repo contains recent open-source math datasets (mainly English) for training and evaluating Math Large Language Models (LLMs).
Note
This repo is currently under development and updated regularly.
Dataset | Descriptions | References |
---|---|---|
Open-Web-Math | An open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. | Paper, Repo, Dataset |
Open-Web-Math-Pro | Refined from open-web-math using the ProX refining framework. It contains about 5B high-quality math-related tokens, ready for pre-training. | Paper, Repo, Dataset |
AMPS | Auxiliary Mathematics Problems and Solutions. A collection of mathematical problems and step-by-step solutions, comprising over 100,000 problems from Khan Academy and approximately 5 million problems generated using Mathematica scripts. | Repo |
NaturalProofs | A dataset designed to study mathematical reasoning in natural language, comprising approximately 32,000 theorem statements and proofs, 14,000 definitions, and 2,000 additional pages sourced from diverse mathematical domains. | Paper, Repo |
MathPile | A math-centric corpus comprising about 9.5 billion tokens. | Paper, Repo, Dataset |
AlgebraicStack | A dataset of 11B tokens of code specifically related to mathematics. | Paper, Repo, Dataset |
MathCode-Pile | Containing 19.2B tokens, with math-related data covering web pages, textbooks, model-synthesized text, and math-related code. | Paper, Repo, Dataset |
FineMath | Consisting of 34B tokens (FineMath-3+) and 54B tokens (FineMath-3+ with InfiMM-WebMath-3+) of mathematical educational content filtered from CommonCrawl. | Paper, Dataset |
Proof-Pile-2 | A 55 billion token dataset of mathematical and scientific documents from arXiv, open-web-math and algebraic-stack. | Paper, Repo, Dataset |
AutoMathText | A dataset encompassing around 200 GB of mathematical texts. It's a compilation sourced from a diverse range of platforms including various websites, arXiv, and GitHub (OpenWebMath, RedPajama, Algebraic Stack). | Paper, Repo, Dataset |
MegaMath | An open math pretraining dataset curated from diverse, math-focused sources, with over 300B tokens. | Paper, Repo, Dataset |
Dataset | Descriptions | References |
---|---|---|
InfiMM-WebMath-40B | A dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. | Paper, Dataset |
Dataset | Descriptions | References |
---|---|---|
SVAMP | A collection of 1,000 elementary-level math word problems. | Paper, Repo |
GSM8K | A dataset consisting of 8.5K high-quality grade school math word problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer. | Paper, Project, Repo |
MathQA | A dataset of 37k English multiple-choice math word problems covering multiple math domain categories by modeling operation programs corresponding to word problems in the AQuA dataset. | Project |
MATH | A challenging dataset that extends beyond the high school level and covers diverse topics, including algebra, precalculus, and number theory. Each problem in MATH has a full step-by-step solution. | Project |
NuminaMath | A comprehensive collection of 860,000 pairs ranging from high-school-level to advanced-competition-level. The dataset has both CoT and PoT rationales (NuminaMath-CoT and NuminaMath-TIR, where TIR stands for tool-integrated reasoning). | Paper, Repo, Dataset |
MetaMath | A dataset with 395K samples created by bootstrapping questions from MATH and GSM8K. | Paper, Project, Repo, Dataset |
MathInstruct | An instruction tuning dataset that combines data from 13 mathematical rationale datasets, uniquely focusing on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales. | Paper, Project, Repo, Dataset |
CoinMath | A dataset designed to enhance mathematical reasoning in large language models by incorporating diverse coding styles into code-based rationales. It includes math questions annotated with code-based solutions that feature concise comments, descriptive naming conventions, and hardcoded solutions. | Paper, Repo, Dataset |
OpenMathInstruct-2 | A math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model. | Paper, Dataset |
CAMEL Math | Containing 50K problem-solution pairs obtained using GPT-4. The problem-solution pairs were generated from 25 math topics, with 25 subtopics for each topic. | Paper, Dataset |
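
Several entries above distinguish chain-of-thought (CoT) from program-of-thought (PoT) rationales. The sketch below illustrates the difference on a made-up word problem that is not taken from any listed dataset. The helper `run_pot` and the convention that the program stores its result in a variable named `answer` are assumptions for illustration only; individual datasets may use different formats.

```python
# A chain-of-thought (CoT) rationale: the reasoning is written in natural language.
cot_rationale = (
    "There are 3 boxes with 12 pencils each, so 3 * 12 = 36 pencils. "
    "After giving away 5, 36 - 5 = 31 pencils remain. The answer is 31."
)

# A program-of-thought (PoT) rationale: the reasoning is a program, and
# executing it produces the answer (stored in `answer` by assumption).
pot_rationale = """
boxes = 3
pencils_per_box = 12
given_away = 5
answer = boxes * pencils_per_box - given_away
"""


def run_pot(program: str):
    """Execute a PoT rationale and return the value it binds to `answer`."""
    namespace = {}
    exec(program, namespace)  # only safe for trusted programs
    return namespace["answer"]


print(run_pot(pot_rationale))
```

Model-generated programs should be run in a sandbox rather than with a bare `exec`, which is used here only to keep the sketch short.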
Dataset | Descriptions | References |
---|---|---|
GeoQA | Containing 4,998 Chinese geometric multiple-choice questions with rich domain-specific program annotations. | Paper, Repo |
UniGeo | Containing 4,998 calculation problems and 9,543 proving problems. | Paper, Repo |
Geo170K | A synthesized dataset that contains around 60,000 geometric image-caption pairs and more than 110,000 question-answer pairs. | Paper, Repo, Dataset |
MAVIS | Containing two datasets: (1) MAVIS-Caption, with 588K high-quality caption-diagram pairs spanning geometry and function, and (2) MAVIS-Instruct, with 834K instruction-tuning data with CoT rationales in a text-lite version. | Paper, Repo |
Geometry3K | Consisting of 3,002 geometry problems with dense annotation in formal language. | Paper, Project, Repo |
MathV360K | Consisting of 40K images from 24 datasets and 360K question-answer pairs. | Project, Dataset |
MultiMath300K | A multimodal, multilingual, multi-level, and multistep mathematical reasoning dataset that encompasses a wide range of K-12 level mathematical problems. | Project |
While many datasets listed in Supervised Fine-Tuning can be adapted for reinforcement learning, we specifically highlight datasets explicitly designed for RL as indicated in their respective references.
Dataset | Descriptions | References |
---|---|---|
PRM800K | A process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset. | Paper, Project, Repo |
Big-Math | A dataset of over 250,000 high-quality math questions with verifiable answers, purposefully made for reinforcement learning (RL). Extracted questions satisfy three desiderata: (1) problems with uniquely verifiable solutions, (2) problems that are open-ended, and (3) problems with a closed-form solution. | Repo, Dataset |
Math-Shepherd | Problems and step-by-step solutions with automatic labels. | Paper, Project, Dataset |
OpenR1-Math-220k | Consisting of 220k math problems with two to four reasoning traces generated by DeepSeek R1 for problems from NuminaMath 1.5. | Dataset |
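
To make "verifiable answers" concrete, the following is a minimal sketch of a binary reward that compares a model's final answer with a reference answer. It is not the checker used by any dataset above; the function names `normalize_answer`, `to_number`, and `verifiable_reward` and the normalization rules are assumptions chosen for illustration.

```python
from fractions import Fraction


def normalize_answer(text: str) -> str:
    """Strip whitespace, surrounding dollar signs, and a trailing period."""
    return text.strip().strip("$").strip().rstrip(".")


def to_number(text: str):
    """Return the answer as an exact Fraction, or None if it is not numeric."""
    try:
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the answers match after normalization, else 0.0."""
    predicted = normalize_answer(model_answer)
    expected = normalize_answer(reference_answer)
    predicted_num, expected_num = to_number(predicted), to_number(expected)
    if predicted_num is not None and expected_num is not None:
        # Numeric answers are compared by value, so "0.5" matches "1/2".
        return 1.0 if predicted_num == expected_num else 0.0
    # Non-numeric answers fall back to exact string comparison.
    return 1.0 if predicted == expected else 0.0


print(verifiable_reward("0.5", "1/2"))
print(verifiable_reward(" $42$ ", "42"))
```

The string fallback is deliberately strict: symbolic answers such as `x+1` and `x + 1` would not match, so a practical checker would likely need symbolic or LaTeX-aware comparison.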
Dataset | Descriptions | References |
---|---|---|
Lila | A mathematical reasoning benchmark consisting of over 140K natural language questions from 23 diverse tasks. | Project, Dataset |
MathBench | A benchmark that tests large language models on math across five difficulty levels. It evaluates both theory and problem-solving skills in English and Chinese. | Paper, Repo |
MathOdyssey | A collection of 387 mathematical problems for evaluating the general mathematical capacities of LLMs, featuring a spectrum of questions from Olympiad-level competitions, advanced high school curricula, and university-level mathematics. | Paper, Project, Repo |
Omni-MATH | A challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. | Paper, Project, Repo, Dataset |
HARP | A math reasoning dataset consisting of 4,780 short answer questions from US national math competitions. | Paper, Repo |
Dataset | Descriptions | References |
---|---|---|
MathVerse | A collection of 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. | Project, Dataset |
MathVista | A benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets. | Project, Dataset |
MATH-Vision | A collection of 3,040 mathematical problems with visual contexts sourced from real math competitions, spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty. | Project, Dataset |
We-Math | A collection of 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. | Paper, Project, Repo, Dataset |
OlympiadBench | An Olympiad-level bilingual multimodal scientific benchmark containing math and physics problems sourced from the International Olympiads, the Chinese Olympiad, and the Chinese College Entrance Exam (GaoKao). | Paper, Repo, Dataset |