
Revert hack that leads to OOM during fine-tuning #3858

Merged: 2 commits into master from revert_oom_commit on Jan 4, 2024

Conversation

arnavgarg1 (Contributor)

A few weeks ago, we merged #3830 to temporarily work around a fine-tuning issue with gradient checkpointing that was introduced in Transformers 4.36. The original issue can be seen here: huggingface/transformers#28023

Since then, they've released a patch release (Transformers 4.36.2) that fixes the original issue for all model types, including Llama-2, Mixtral, Phi, etc. However, the interaction between the new transformers version and our hacky fix makes memory balloon during fine-tuning: we manually map each module to what we believe is the correct mode, and in the process set some modules to train mode when they shouldn't be.
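For context, the reverted workaround amounted to walking the model and setting each submodule's train/eval mode by hand. Below is a minimal, hypothetical sketch of that pattern (not the actual code from #3830); the function name and the keyword-based heuristic are illustrative only.

```python
import torch.nn as nn


def force_module_modes(model: nn.Module, trainable_keyword: str = "lora") -> None:
    """Hypothetical sketch of a manual per-module train/eval mapping.

    The fragile part is the heuristic: any submodule left in train mode when
    it should not be interacts badly with gradient checkpointing and, as
    described above, makes fine-tuning memory balloon.
    """
    for name, module in model.named_modules():
        if trainable_keyword in name.lower():
            module.train()  # modules we actually fine-tune (e.g. adapters)
        else:
            module.eval()  # everything else should stay in eval mode
```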

That workaround is no longer needed once transformers is at 4.36.2, since that release contains the patch. I've pinned the minimum version of transformers to 4.36.2 as part of this PR.
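As a hedged, standalone illustration of the new version floor (the real pin lives in the project's dependency metadata, not in application code like this), an environment check could look like:

```python
from packaging import version

import transformers

# Minimum version carrying the upstream gradient checkpointing fix
# (huggingface/transformers#28023); this constant mirrors the pin in this PR.
MIN_TRANSFORMERS = "4.36.2"

if version.parse(transformers.__version__) < version.parse(MIN_TRANSFORMERS):
    raise RuntimeError(
        f"transformers>={MIN_TRANSFORMERS} is required, "
        f"found {transformers.__version__}"
    )
```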

alexsherstinsky (Collaborator) left a comment:


LGTM

github-actions bot commented on Jan 4, 2024:

Unit Test Results

6 files ±0, 6 suites ±0, duration 14m 12s ⏱️ (+20s)
12 tests ±0: 9 passed ✔️ ±0, 3 skipped 💤 ±0, 0 failed ±0
60 runs ±0: 42 passed ✔️ ±0, 18 skipped 💤 ±0, 0 failed ±0

Results for commit afe1867; comparison against base commit d45566b.

arnavgarg1 merged commit 29ad837 into master on Jan 4, 2024, with 18 checks passed. The revert_oom_commit branch was deleted on January 4, 2024 at 20:37.