Code for the paper Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions. In this work, we propose letting the model sample solutions during training and learn from the self-sampled correct or partially-correct ones, which are automatically identified by comparing the final or intermediate execution states. An example is shown below.
- 2023-03-08: Initial code release
- 2023-02-17: Camera-ready version updated on arxiv
- 2023-01-20: Paper is accepted at ICLR 2023
Note: all of the following has only been tested on Linux machines; you may need to build your own tree-sitter parsers if a different platform is used.
(Recommended) Create a new conda environment
conda create -n trace-codegen python=3.8
conda activate trace-codegen
Clone the code and install the dependencies
git clone git@github.com:microsoft/TraceCodegen
cd TraceCodegen
pip install -r requirements.txt
(Optional) Set up wandb for experiment tracking. First follow the wandb documentation to log in, then change the following lines in the trainer.logger+ fields of the yaml config file you would like to run:
entity: <your-user/org-name>
project: <your-project-name>
(Optional) At any point, if you run into a Python import problem (e.g., ModuleNotFoundError), try running this in the main (TraceCodegen) directory:
export PYTHONPATH=`pwd`
We conduct experiments on the MathQA-Python and GSM8k datasets. As they have different licenses and preprocessing pipelines, we describe them separately here. But first, let's make a data directory:
mkdir data
First follow this script to generate the MathQA-Python dataset from the original MathQA dataset. After that, make sure your data directory looks something like this:
data
|-- mathqa
| |-- train-python.jsonl
| |-- val-python.jsonl
| |-- test-python.jsonl
|---...
We preprocess MathQA-Python by resplitting the data with template-based deduplication (see details in the paper). To do this, run the preprocessing script as follows:
python resplit_mathqa_python.py
After this, your data directory should now look something like this:
data
|-- mathqa
| |-- train-python.jsonl
| |-- val-python.jsonl
| |-- test-python.jsonl
| |-- train_dedup.jsonl
| |-- val_dedup.jsonl
|---...
Note that we only combine and resplit the original train and validation sets; the test set is kept untouched.
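To give a rough idea of what a template-based resplit can look like, here is a minimal sketch. It is not the code in resplit_mathqa_python.py: the `template_of` and `resplit_by_template` helpers, the `code` field name, the `<num>` placeholder, and the validation fraction are all assumptions made for illustration. The sketch masks numeric constants to obtain a template, then assigns whole template groups to either train or validation so that no template appears on both sides.

```python
import random
import re
from collections import defaultdict

# Match numeric literals, but not digits that are part of identifiers like n0.
NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?")


def template_of(program: str) -> str:
    """Mask every numeric constant so programs differing only in numbers collide."""
    return NUMBER.sub("<num>", program)


def resplit_by_template(examples, val_fraction=0.1, seed=0, code_field="code"):
    """Split examples so that each template lands entirely in train or in val.

    `code_field` is a hypothetical key holding the program text.
    Note that `val_fraction` is applied to templates, not to examples.
    """
    groups = defaultdict(list)
    for example in examples:
        groups[template_of(example[code_field])].append(example)

    templates = sorted(groups)
    random.Random(seed).shuffle(templates)
    n_val = max(1, int(len(templates) * val_fraction))

    val = [ex for t in templates[:n_val] for ex in groups[t]]
    train = [ex for t in templates[n_val:] for ex in groups[t]]
    return train, val


examples = [
    {"code": "n0 = 3\nanswer = n0 * 2.5"},
    {"code": "n0 = 7\nanswer = n0 * 4"},   # same template as the first example
    {"code": "n0 = 7\nanswer = n0 + 4"},
]
train, val = resplit_by_template(examples, val_fraction=0.5)
```

Grouping by template before splitting is what prevents near-duplicate programs from leaking between the train and validation sets.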
As the solutions to GSM8k questions are originally annotated as math formulas, we used a script to automatically extract MathQA-Python style programs as solutions. To replicate this, first download the data from the original GSM8k repo. After that, your data directory should look like this:
data
|-- gsmath
| |-- train.jsonl
| |-- test.jsonl
| |-- ...
|---...
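For intuition about the extraction step, the sketch below turns the calculator annotations found in GSM8k solutions (of the form `<<expression=result>>`) into a straight-line Python program. This is only an illustration and not the logic of preprocessing/preprocess_gsm8k.py: the `formulas_to_program` helper and the `t0, t1, ...` / `answer` variable naming are assumptions, and the real script may handle variables and question numbers differently.

```python
import re

# GSM8k solutions embed calculator annotations such as <<48/2=24>>.
ANNOTATION = re.compile(r"<<([^=<>]+)=([^<>]+)>>")


def formulas_to_program(solution: str) -> str:
    """Emit one assignment per annotated formula; the last one is the answer.

    Hypothetical helper: variable names t0, t1, ... are made up for this sketch.
    """
    lines = []
    for i, (expression, _result) in enumerate(ANNOTATION.findall(solution)):
        lines.append(f"t{i} = {expression.strip()}")
    if lines:
        lines.append(f"answer = t{len(lines) - 1}")
    return "\n".join(lines)


solution = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether.\n"
    "#### 72"
)
program = formulas_to_program(solution)
print(program)
```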
Now run the preprocessing script for GSM8k:
python preprocessing/preprocess_gsm8k.py
After this, your data directory should look like this:
data
|-- gsmath
| |-- train.jsonl
| |-- test.jsonl
| |-- gsmath_train.jsonl
| |-- gsmath_val.jsonl
| |-- gsmath_test.jsonl
| |-- ...
|---...
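Before training, it can help to confirm that the preprocessed files listed in the directory trees above are in place. The following is a small optional check, not part of the repository; the `missing_files` helper is hypothetical, and the file list simply mirrors the layouts shown in this README.

```python
from pathlib import Path

# File names copied from the directory layouts shown in this README.
EXPECTED = {
    "mathqa": ["train_dedup.jsonl", "val_dedup.jsonl", "test-python.jsonl"],
    "gsmath": ["gsmath_train.jsonl", "gsmath_val.jsonl", "gsmath_test.jsonl"],
}


def missing_files(data_dir: str = "data"):
    """Return the expected files (relative to data_dir) that do not exist."""
    root = Path(data_dir)
    return [
        f"{subdir}/{name}"
        for subdir, names in EXPECTED.items()
        for name in names
        if not (root / subdir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files("data")
    if missing:
        print("Missing preprocessed files:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```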
Our training framework is built on top of PyTorch-Lightning (version 1.5.10). In most cases, you only need to change the yaml configuration files to adjust the hyperparameters (e.g., batch size, GPUs, dataset file).
Note: To run model training, we recommend using GPUs that have at least 32GiB of memory, or decreasing the training batch size accordingly. All our experiments were conducted on 8x V100-32GB GPUs.
python trainer.py fit --config <config_file_path>.yaml
Existing yaml config files can be found in training_configs. You can also find all the hyperparameter settings (e.g., batch size) in the Appendix of the paper.
If you would like to switch between the GPT-Neo-125M and GPT-Neo-2.7B models, be sure to change the following fields in the yaml config file:
trainer:
  ...
  strategy: deepspeed_stage_2_offload # for 2.7B, or "ddp_find_unused_parameters_false" for 125M
  ...
model:
  class_path: ...
  init_args:
    transformer_model_name: &transformer EleutherAI/gpt-neo-2.7B # or EleutherAI/gpt-neo-125M
    ...
data:
  class_path: ...
  init_args:
    ...
    batch_size: 2 # [Optional] change this according to the GPU memory
    val_batch_size: 4 # [Optional] change this according to the GPU memory
    ...
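The pairing between model size and training strategy described above can be written down as a small lookup. This is only a convenience sketch: the `settings_for` helper is hypothetical and not part of the repository, and it returns a plain dictionary mirroring the yaml fields rather than editing any config file.

```python
def settings_for(model_name: str) -> dict:
    """Return the yaml fields to change for a given GPT-Neo checkpoint.

    Hypothetical helper; the strategy names are the ones given in this README.
    """
    if model_name.endswith("gpt-neo-2.7B"):
        strategy = "deepspeed_stage_2_offload"
    elif model_name.endswith("gpt-neo-125M"):
        strategy = "ddp_find_unused_parameters_false"
    else:
        raise ValueError(f"No strategy documented for {model_name!r}")

    return {
        "trainer": {"strategy": strategy},
        "model": {"init_args": {"transformer_model_name": model_name}},
    }


print(settings_for("EleutherAI/gpt-neo-125M"))
```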
Since the MathQA-Python and GSM8k datasets are in the same format after preprocessing, you just need to change the file paths in the following fields of the yaml config file:
data:
  ...
  train_file_path: data/mathqa/train_dedup.jsonl # or "data/gsmath/gsmath_train.jsonl" for gsm8k
  val_file_path: data/mathqa/val_dedup.jsonl # or "data/gsmath/gsmath_val.jsonl" for gsm8k
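As an alternative to editing the yaml file by hand, the dataset paths could be passed on the command line. The sketch below only builds the command string. It assumes that trainer.py accepts dotted overrides such as `--data.init_args.train_file_path`, by analogy with the `--model.init_args.load_ckpt_file` flag used for inference in this README; that assumption is untested here, and the `fit_command` helper is hypothetical.

```python
import shlex

# Paths taken from the yaml snippet above.
DATASETS = {
    "mathqa": {
        "train_file_path": "data/mathqa/train_dedup.jsonl",
        "val_file_path": "data/mathqa/val_dedup.jsonl",
    },
    "gsm8k": {
        "train_file_path": "data/gsmath/gsmath_train.jsonl",
        "val_file_path": "data/gsmath/gsmath_val.jsonl",
    },
}


def fit_command(config_path: str, dataset: str) -> list:
    """Build the argv for a training run on the chosen dataset.

    Assumption: the data.init_args.* overrides are accepted by trainer.py.
    """
    paths = DATASETS[dataset]
    return [
        "python", "trainer.py", "fit",
        "--config", config_path,
        "--data.init_args.train_file_path", paths["train_file_path"],
        "--data.init_args.val_file_path", paths["val_file_path"],
    ]


print(shlex.join(fit_command("training_configs/gpt_mle.yaml", "gsm8k")))
```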
To train with self-sampled solutions, you just need to use one of the following yaml config files in training_configs:
training_configs/gpt_self_sampling.yaml # for using self-sampled fully-correct solutions only
training_configs/gpt_self_sampling_partial.yaml # for also using self-sampled partially-correct solutions
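To make the distinction between fully-correct and partially-correct samples concrete, here is a toy sketch of comparing execution states. It is not the repository's implementation. It assumes straight-line programs with numeric values, treats a state as the sorted tuple of variable values (ignoring names), and uses made-up helper names (`run_program`, `state_key`, `is_fully_correct`, `correct_prefix_length`). Error handling for samples that fail to execute is omitted.

```python
def run_program(program: str):
    """Execute a straight-line program line by line.

    Returns a snapshot of the variables after each line.
    """
    env = {}
    snapshots = []
    for line in program.strip().splitlines():
        # Toy sandbox only; do not run untrusted code this way.
        exec(line, {"__builtins__": {}}, env)
        snapshots.append(dict(env))
    return snapshots


def state_key(env: dict):
    """Assumed notion of state: the values computed so far, ignoring names."""
    return tuple(sorted(env.values()))


def is_fully_correct(sample: str, gold_answer) -> bool:
    """A sample is fully correct if its final `answer` matches the gold answer."""
    snapshots = run_program(sample)
    return bool(snapshots) and snapshots[-1].get("answer") == gold_answer


def correct_prefix_length(sample: str, known_correct: str) -> int:
    """Length of the longest sample prefix whose state matches an intermediate
    state of a known correct solution (0 if no prefix matches)."""
    reference_states = {state_key(env) for env in run_program(known_correct)}
    best = 0
    for n_lines, env in enumerate(run_program(sample), start=1):
        if state_key(env) in reference_states:
            best = n_lines
    return best


reference = "n0 = 3\nn1 = n0 * 2\nanswer = n1 + 1"
sample = "a = 3\nb = a + a\nanswer = b - 1"

print(is_fully_correct(sample, 7))               # final answer is 5, not 7
print(correct_prefix_length(sample, reference))  # first two lines match
```

In this example the sample reaches the wrong final answer, but its first two lines produce the same values as the reference, so that prefix could be kept as a partially-correct solution.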
- For the MLE baseline, just run with the config file training_configs/gpt_mle.yaml
- For running MML, set the following in the yaml config file:
model:
  ...
  init_args:
    ...
    mle_lambda: 0.0
    mml_lambda: 1.0
    ...
- For running $\beta$-MML, keep the above and set beta_smoothing: <beta>
- For running MLE-Aug, set mle_lambda: 1.0 and mml_lambda: 0.0 in the above.
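The objectives selected by these settings can be sketched numerically. The snippet below is an illustration under stated assumptions, not the repository's loss code: it takes the sequence log-probabilities of the saved solutions for one question, uses the standard forms of MLE-Aug (sum of log-likelihoods), MML (log of the summed likelihoods), and $\beta$-MML ($\frac{1}{\beta}\log\sum_y P(y|x)^\beta$), and assumes that mle_lambda and mml_lambda act as simple weights on the two terms. All function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mle_aug_loss(logps):
    """Negative sum of log-likelihoods over all saved solutions."""
    return -sum(logps)


def beta_mml_loss(logps, beta=1.0):
    """Negative (1/beta) * log sum_y P(y|x)^beta; beta=1 recovers plain MML."""
    return -logsumexp([beta * lp for lp in logps]) / beta


def combined_loss(logps, mle_lambda, mml_lambda, beta=1.0):
    """Assumed weighting of the two terms by mle_lambda and mml_lambda."""
    return mle_lambda * mle_aug_loss(logps) + mml_lambda * beta_mml_loss(logps, beta)


# Two saved solutions with probabilities 0.5 and 0.25 under the model.
logps = [math.log(0.5), math.log(0.25)]

mml = combined_loss(logps, mle_lambda=0.0, mml_lambda=1.0)                 # MML
mle_aug = combined_loss(logps, mle_lambda=1.0, mml_lambda=0.0)             # MLE-Aug
beta_mml = combined_loss(logps, mle_lambda=0.0, mml_lambda=1.0, beta=0.5)  # beta-MML
print(mml, mle_aug, beta_mml)
```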
For all other hyperparameters, please read the rest of the fields in the yaml file and the __init__ function of the corresponding class, or refer to the PyTorch-Lightning documentation.
For running model inference (e.g., on the test set), use the following command:
python trainer.py validate --config <your-config-file> --model.init_args.load_ckpt_file <saved_ckpt_file>
If you use the code in this repository, please consider citing:
@inproceedings{ni2023selfsampling,
title={Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions},
author={Ni, Ansong and Inala, Jeevana Priya and Wang, Chenglong and Polozov, Alex and Meek, Christopher and Radev, Dragomir and Gao, Jianfeng},
booktitle={The 2023 International Conference on Learning Representations}
}
For any questions, please open an issue. PRs are welcome; please check the following section about contributing to this repo.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.