Implement step based checkpointing #2384

joecummings · 2025-02-11T21:27:15Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Closes #2105. This is a widely requested feature that allows users to have greater control over checkpointing frequency in torchtune.

TODO: Add commentary on design decisions. Acknowledge spaghetti code. Beg forgiveness.

Changelog

Update FullModelHFCheckpointer to accept a step parameter when saving a checkpoint. Use that step to designate the checkpoint folder name. Keep epoch_{} as a fall-back for BC.
Modify the full_finetune_single_device.py recipe to utilize step-based checkpointing.
Add tests for the `full_finetune_single_device.py` recipe w/ step-based checkpointing.
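
As a rough illustration of the first changelog item, the folder-naming rule could look like the sketch below. This is not the code from this PR: `checkpoint_dir` and its parameters are hypothetical names chosen for the example, and the only behavior taken from the description is "use `step_{}` when a step is given, otherwise fall back to `epoch_{}`".

```python
from pathlib import Path
from typing import Optional


def checkpoint_dir(output_dir: str, *, epoch: int, step: Optional[int] = None) -> Path:
    """Hypothetical helper (not torchtune API): choose the checkpoint folder.

    Uses ``step_{N}`` when a step is passed and falls back to ``epoch_{N}``
    otherwise, so callers that never pass a step keep the old layout.
    """
    name = f"step_{step}" if step is not None else f"epoch_{epoch}"
    return Path(output_dir) / name


# Step-based caller vs. a caller that only knows the epoch.
print(checkpoint_dir("/tmp/out", epoch=0, step=25))
print(checkpoint_dir("/tmp/out", epoch=1))
```

Keeping `step` optional is what would let existing callers stay unchanged, which seems to be the intent of the BC note.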

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Evidence of correct number of checkpoints being saved

(joe-torchtune) [jrcummings@devvm4767.pnb0 ~/projects/joe-torchtune (impl-step-based-ckpt)]$ ls /tmp/torchtune/llama3_2_1B/full_single_device/
step_100  step_125  step_150  step_175  step_200  step_25  step_50  step_75  torchtune_config.yaml
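
Note that `ls` sorts the folders lexicographically, which is why `step_25` appears after `step_200`. A small sketch for checking the listing numerically follows. The `saved_steps` helper is a hypothetical name, and the 25-step interval is inferred from the listing above rather than stated in the PR.

```python
import re

# Directory listing copied from the output above.
listing = (
    "step_100  step_125  step_150  step_175  step_200  "
    "step_25  step_50  step_75  torchtune_config.yaml"
)

STEP_DIR = re.compile(r"^step_(\d+)$")


def saved_steps(names):
    """Return the step numbers found in `names`, sorted numerically."""
    steps = []
    for name in names:
        match = STEP_DIR.match(name)
        if match:  # skips non-checkpoint entries such as the config file
            steps.append(int(match.group(1)))
    return sorted(steps)


steps = saved_steps(listing.split())
# Gaps between consecutive checkpoints; a single value means a fixed interval.
intervals = {later - earlier for earlier, later in zip(steps, steps[1:])}
print(steps, intervals)
```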

Evidence of correct resuming from ckpt mid-epoch

Evidence of correct resuming from ckpt at epoch boundary

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

I did not change any public API
I have added an example to docs or docstrings

pytorch-bot · 2025-02-11T21:27:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2384

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs

As of commit 755a7ac with merge base fa92c96:

NEW FAILURE - The following job has failed:

GPU tests / gpu_test (3.11, stable) (gh)
tests/recipes/test_qat_lora_finetune_distributed.py::TestQATLoRAFinetuneDistributedRecipe::test_save_and_load_merged_weights[llama3/8B_qat_lora-llama3-tune]

CANCELLED JOBS - The following jobs were cancelled. Please retry:

GPU tests / gpu_test (3.10, stable) (gh)
tests/recipes/test_qat_lora_finetune_distributed.py::TestQATLoRAFinetuneDistributedRecipe::test_save_and_load_merged_weights[llama3/8B_qat_lora-llama3-tune]
GPU tests / gpu_test (3.9, stable) (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

bogdansalyp · 2025-02-13T00:07:23Z

recipe_state is still saved to ${output_dir}, not ${output_dir}/step_XXX
resume_from_checkpoint logic should be updated
- Right now it looks for the checkpoint in ${output_dir}, not in step_XXX
- Maybe replace the top-level cfg.resume_from_checkpoint with cfg.checkpointer.resume_from, which is either "latest" (default) or the path to the checkpoint to resume from. Or use separate, mutually exclusive options: resume_from: /path/ and resume_from_latest: True
- Off topic, but cfg.resume_from_checkpoint is marked in the code as deprecated and replaced by should_load_recipe_state, yet de facto resume_from_checkpoint is mandatory and should_load_recipe_state doesn't work
recipe_state has the proper step and epoch to continue from, but the training loop still starts from 0, so logging and checkpointing also start from 0
lr schedulers aren't synced with the resume step
maybe save the wandb run?..... 🥺
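
One way to read the `resume_from` suggestion above is sketched below. It is purely illustrative: `resolve_resume_path` is a hypothetical function, not something in torchtune, and it assumes the `step_XXX` folder layout shown earlier in this PR.

```python
import re
from pathlib import Path
from typing import Optional

_STEP_DIR = re.compile(r"^step_(\d+)$")


def resolve_resume_path(output_dir: str, resume_from: str = "latest") -> Optional[Path]:
    """Hypothetical sketch: pick the checkpoint directory to resume from.

    "latest" selects the step_XXX folder with the highest step number under
    `output_dir`; any other value is treated as an explicit checkpoint path.
    Returns None when "latest" is requested but nothing has been saved yet.
    """
    if resume_from != "latest":
        return Path(resume_from)

    root = Path(output_dir)
    if not root.is_dir():
        return None

    candidates = []
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    # Compare step numbers as integers so step_200 beats step_75.
    return max(candidates, key=lambda pair: pair[0])[1]
```

A single option with a "latest" default avoids having to validate two mutually exclusive flags, at the cost of reserving the string "latest" as a special value.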


facebook-github-bot added the CLA Signed label Feb 11, 2025

joecummings force-pushed the impl-step-based-ckpt branch from 20f2acf to 46f7c67 Compare February 11, 2025 21:51

joecummings mentioned this pull request Feb 18, 2025

Add core dependency on stable torchdata #2408

Merged

RedTachyon mentioned this pull request Feb 22, 2025

GRPO Improvement checklist #2421

Open

15 tasks

joecummings and others added 24 commits February 27, 2025 13:41

Add helper functions for pruning old checkpoints

9d45c6b

Step based checkpointing and tests

de08541

Fix helper function tests

19a50af

Silly linting rule

0656313

Allow kwargs in all checkpointers for BC with step

857bec1

Stub

1e2ac1f

Resume from recipe state in step_x

d1d79da

Correct step for resume from checkpoint

0e531ad

fix: Fixed global step restoration from recipe_state

53a6d38

At some point, god will make me pay for my sins

fa927c9

Remove the recipe state checkpointing *only* on intermediate paths and get resume working w/ StatefulDataLoader

ab18acd

Introduce RecipeStateCheckpointPeriod

33a788c

Remove the need for sampler

8bbe463

Update TOML file with more description matching README (pytorch#2409)

0537afd

Add core dependency on stable torchdata (pytorch#2408)

380da38

Update QAT tutorial (pytorch#2396)

c8c1027

MPS memory usage support (pytorch#2406)

721502f

Update docs and docstrings related to Llama3VisionTransform (pytorch#2382)

d2eefb1

R1-Style distributed GRPO (pytorch#2326)

9e61aaf

Co-authored-by: Felipe Mello <fmellomascarenhas@gmail.com> Co-authored-by: ebsmothers <ebs@meta.com> Co-authored-by: salman <salman.mohammadi@outlook.com>

Add support for StatefulDataLoader (pytorch#2410)

79c0001

Update KVCache maximum sequence length configuration in PPO recipe (pytorch#2412)

abb34fc

Refactor load_image to return torch.Tensor instead of PIL.Image (pytorch#2366)

2803b92

Add StatefulDataLoader to select other recipes (pytorch#2431)

52d1d0c

Update README.md w/ GRPO (pytorch#2443)

9832db1

joecummings added 30 commits February 28, 2025 11:33

Most recent checkpoint + train loop changes

7860b07

Make sure input and output checkpoint names match

a3840c7

Hacks, literal hacks

593815c

Remove get_adapter_checkpoint_path

6287542

Add pruning to Meta checkpointer

686948d

Merge remote-tracking branch 'upstream/main' into impl-step-based-ckpt

10ff7cd

batch_count

1fb4ca9

Merge remote-tracking branch 'upstream/main' into impl-step-based-ckpt

c72888d

More cleanup

62143e4

Incorporate fs work

27a16e4

Add back missing functions, update for steps, add comments

90957f9

Undo changes to _utils

d24cbcc

Add back get_most_recent_checkpoint

f080ac8

Re-add async checkpointing test

10a9d46

Remove reference to serialization format

794fc26

Use huggingface_hub function to save files

5557320

Set epoch for dataloader

5065e3a

Wow, bad merge

16ef715

Stub: currently async on resume does not match loss values

8815ede

Fix tests

a031c9a

Update last test

053d64b

Try/catch with intermediate checkpoints

7d5069f

Remove f string messup

f8c9b83

Merge remote-tracking branch 'upstream/main' into impl-step-based-ckpt

57c91fb

Fix more tests

23219bd

I am the problem

8216728

Remove extra checkpoint when testing async

ca86fb2

create output path for adapter ckpt

fae3797

Put this in your pipe and smoke it

229f4bc

Update tests

755a7ac
