This repo contains a rough draft of a benchmark for language model training. It is similar in spirit to the "gpt3" task in the MLPerf Training benchmark, but scaled down. The nanogpt_model_lib.py and nanogpt_trainer_lib.py files are adapted from commit 9755682 of nanoGPT.
The agent working on this task should be provided with this codebase, the task description (see the "Task description" section below), a data.bin file, a checkpoint0.bin file, and an expected_checkpoint20.bin file, all inside the Docker container created using Dockerfile.
The data.bin file included in this repo is the val.bin file generated by following the instructions to prepare the openwebtext dataset in nanoGPT.
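Since data.bin comes from nanoGPT's openwebtext preparation script, it is presumably a flat binary file of uint16 GPT-2 token IDs. The snippet below is a sketch of how such a file could be inspected under that assumption (the dtype and layout are inferred from nanoGPT's convention, not documented by this repo); it writes a tiny stand-in file so the read pattern can be demonstrated:

```python
import numpy as np

# Assumption: like nanoGPT's val.bin, data.bin is a flat array of uint16
# GPT-2 token IDs. A tiny stand-in file illustrates the read pattern:
np.array([50256, 11, 198, 464], dtype=np.uint16).tofile("example_data.bin")

# Memory-map the file read-only, the same way the real data.bin could be read.
tokens = np.memmap("example_data.bin", dtype=np.uint16, mode="r")
print(len(tokens), tokens[:4].tolist())  # → 4 [50256, 11, 198, 464]
```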
To generate the checkpoint0.bin file, run:

```shell
python train.py \
  --trainer="nanogpt" \
  --max_num_steps=0 \
  --seed=123 \
  --final_checkpoint_file="checkpoint0.bin"
```

To generate the expected_checkpoint20.bin file, run:

```shell
python train.py \
  --trainer="nanogpt" \
  --init_checkpoint_file="checkpoint0.bin" \
  --data_file="data.bin" \
  --max_num_steps=20 \
  --final_checkpoint_file="expected_checkpoint20.bin"
```

To prevent the agent from cheating, two sets of .bin files can be generated with the same formats and similar sizes. One set can be provided to the agent to help it experiment, and the other set can be hidden from the agent and used for scoring.
To run tests:

```shell
pytest
```

## Task description

Modify new_trainer_lib.py to implement a trainer.
Your trainer should adhere to the API defined in trainer_lib.py.
You should NOT modify:
- any other code in this codebase
- the .bin files
- the name of new_trainer_lib.py
- the name of the Trainer class in new_trainer_lib.py
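The authoritative interface is whatever trainer_lib.py defines; the sketch below is only a hypothetical illustration of what a checkpoint-in, train, checkpoint-out trainer API might look like (every method name here is invented, so check trainer_lib.py for the real contract):

```python
# Hypothetical sketch of a trainer shaped like the ones this repo describes.
# The method names (load_checkpoint, train, save_checkpoint) are invented for
# illustration; trainer_lib.py defines the real API.
class Trainer:
    def __init__(self, data_file: str):
        self.data_file = data_file
        self.state = None  # stand-in for model/optimizer state

    def load_checkpoint(self, path: str) -> None:
        # Restore state from a .bin checkpoint.
        with open(path, "rb") as f:
            self.state = f.read()

    def train(self, max_num_steps: int) -> None:
        # Run the training loop. Any speedup (fused optimizers, flash
        # attention, torch.compile) must still reproduce the expected weights.
        for _ in range(max_num_steps):
            pass  # one optimization step would go here

    def save_checkpoint(self, path: str) -> None:
        with open(path, "wb") as f:
            f.write(self.state or b"")
```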
Your trainer will be scored by running:

```shell
python train.py \
  --trainer="new" \
  --init_checkpoint_file="checkpoint0.bin" \
  --expected_checkpoint_file="expected_checkpoint20.bin" \
  --data_file="data.bin" \
  --max_num_steps=20 \
  --score_file="score.txt"
```

The score is the time in seconds it takes for your trainer to perform 20 steps of training (excluding the initialization of the trainer).
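The timing scheme this describes (measure the steps, not the setup) can be sketched as follows; `DummyTrainer` is a stand-in for illustration, not the repo's API:

```python
import time

class DummyTrainer:
    """Stand-in trainer; the real one would come from new_trainer_lib.py."""
    def step(self):
        pass  # one training step

trainer = DummyTrainer()       # trainer initialization is NOT timed
start = time.perf_counter()    # the clock starts only after init
for _ in range(20):
    trainer.step()             # only these 20 steps count toward the score
score = time.perf_counter() - start
print(f"score: {score:.4f} seconds")
```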
The model weights at the end of training must match the model weights in expected_checkpoint20.bin; otherwise, an error is thrown and no score is generated.
The goal is to get the lowest score possible.
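A sketch of the kind of exact-match check this implies is below; the real comparison lives in train.py, and byte-for-byte equality is only an assumption here (the scoring code may well compare tensors instead):

```python
def checkpoints_match(path_a: str, path_b: str) -> bool:
    """Illustrative exact-match check between two checkpoint files.

    Assumption for this sketch: a weight mismatch shows up as byte
    inequality. train.py's actual comparison is authoritative.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return fa.read() == fb.read()

# Tiny demonstration with stand-in files:
with open("a.bin", "wb") as f: f.write(b"\x01\x02")
with open("b.bin", "wb") as f: f.write(b"\x01\x02")
with open("c.bin", "wb") as f: f.write(b"\x01\x03")
print(checkpoints_match("a.bin", "b.bin"))  # True
print(checkpoints_match("a.bin", "c.bin"))  # False
```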
## Benchmarks

This section contains benchmarking results from running the command in the "Task description" section with "nanogpt" instead of "new" for the trainer.
The benchmarks were run with commit 86e0ba12 of this codebase, in a Docker container generated using Dockerfile, on a "gpu_1x_a100_sxm4" instance from Lambda Labs ("1x A100 (40 GB SXM4), 30 CPU cores, 205.4 GB RAM, 525.8 GB SSD") in the "Arizona, USA (us-west-2)" region, on 9/8/2024. checkpoint0.bin and expected_checkpoint20.bin were generated with the default config.
| Config | Time to train 20 steps (seconds) | Loss |
|---|---|---|
| fused_adamw+flash_attn+compile_model (default) | 52.35 | 9.8413 |
| fused_adamw+flash_attn | 66.77 | 9.8402 |
| fused_adamw | 176.29 | 9.8404 |
| none | 176.33 | 9.8404 |
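From the table, rough speedup factors relative to the unoptimized run can be computed directly (times copied from above):

```python
# Times in seconds for 20 training steps, copied from the benchmark table.
baseline = 176.33  # no optimizations enabled
times = {
    "fused_adamw+flash_attn+compile_model": 52.35,
    "fused_adamw+flash_attn": 66.77,
    "fused_adamw": 176.29,
}
for config, t in times.items():
    print(f"{config}: {baseline / t:.2f}x faster than baseline")
```

Note how flash attention accounts for most of the gain over the baseline, with torch.compile adding a further improvement on top.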