
Padding free dpo #2437

Open · wants to merge 57 commits into base: main

Conversation

@dame-cell commented Dec 4, 2024

What does this PR do?

New feature #2422

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

For now this is just a draft; I will continue working on it.

@osanseviero

@dame-cell (Author) commented Dec 10, 2024

Not really done yet, but for now everything seems to be working: if padding_free is set to True, the trainer will not pad, and attention_mask will not be used.

For now, here are some tasks still to be done (a rough sketch of the general idea follows this list):

  • Ensure that when padding_free=True the trainer does not pad
  • Ensure that when padding_free=True the trainer does not use or return attention_mask
  • Ensure that when padding_free=True we use position_ids
  • Write tests
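
To make the idea concrete, here is a rough sketch of padding-free batching in general (illustrative only, not the PR's actual collator): instead of padding every sequence to a common length and passing an attention_mask, the sequences are concatenated into a single row, and position_ids restart at 0 for each example so the model knows where each sequence begins.

import torch

def padding_free_collate(features):
    # `features` is assumed to be a list of dicts, each holding an "input_ids" list.
    input_ids, position_ids = [], []
    for example in features:
        ids = example["input_ids"]
        input_ids.extend(ids)
        # positions restart at 0 for every sequence, replacing the attention_mask
        position_ids.extend(range(len(ids)))
    return {
        "input_ids": torch.tensor([input_ids]),        # shape (1, total_length), no pad tokens
        "position_ids": torch.tensor([position_ids]),  # marks sequence boundaries
    }

batch = padding_free_collate([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
# batch["position_ids"] -> tensor([[0, 1, 2, 0, 1]])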

@dame-cell (Author):

Most of it is done; just some small things are left, like dealing with lists and converting them to tensors.

@dame-cell marked this pull request as ready for review December 11, 2024 14:43
@dame-cell (Author) commented Dec 11, 2024

Hey @osanseviero,

The main idea for using padding_free is mostly in place now, but there are still a few things that need to be done. It would be awesome if you could take a look at the code and let me know if there's anything else I should address or add.

I've made it so the user can directly do this:

trainer = DPOTrainer(
    model=self.model,
    ref_model=None,
    args=training_args,
    tokenizer=self.tokenizer,
    padding_free=True,  # when True, no padding is used
    train_dataset=dummy_dataset["train"],
    eval_dataset=dummy_dataset["test"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    training_args = DPOConfig(
        output_dir=tmp_dir,
        per_device_train_batch_size=2,
        max_steps=3,
        remove_unused_columns=False,
        gradient_accumulation_steps=4,
Review comment (Contributor):

Would be good to have tests for this with gradient accumulation too, perhaps using pytest.mark.parametrize?
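
A minimal sketch of what such a parametrized test could look like (the model, tokenizer, and dummy_dataset fixtures are assumed to exist elsewhere in the test suite; the config values are illustrative):

import tempfile

import pytest
from trl import DPOConfig, DPOTrainer

@pytest.mark.parametrize("gradient_accumulation_steps", [1, 4])
@pytest.mark.parametrize("padding_free", [False, True])
def test_dpo_padding_free(model, tokenizer, dummy_dataset, padding_free, gradient_accumulation_steps):
    with tempfile.TemporaryDirectory() as tmp_dir:
        training_args = DPOConfig(
            output_dir=tmp_dir,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=gradient_accumulation_steps,
            max_steps=3,
            remove_unused_columns=False,
        )
        trainer = DPOTrainer(
            model=model,
            ref_model=None,
            args=training_args,
            tokenizer=tokenizer,
            padding_free=padding_free,  # the new argument this PR adds
            train_dataset=dummy_dataset["train"],
            eval_dataset=dummy_dataset["test"],
        )
        trainer.train()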

@dame-cell (Author):

All right, will do so. Thanks for reviewing 😎

@@ -53,6 +53,9 @@ class PPOConfig(OnPolicyConfig):
Discount factor.
Review comment (Member):

This modification shouldn't be here, right?

@dame-cell (Author):

Oh, my bad, I'll fix it right now.

@qgallouedec (Member) commented Dec 21, 2024:

You still have modifications in the PPO files.

@qgallouedec (Member):

You can push it, no worries, we can still refine it afterwards.

@dame-cell (Author) commented Dec 15, 2024

Thank you for your understanding! I wanted to let you know that I’m a bit tied up today and tomorrow, so I might not be able to push the code right away. I’ll try to get to it as soon as possible, but please feel free to let me know if there’s a hard deadline I should prioritize.

Thanks for your patience!
I'll keep working on it, so I'll try to push it by tomorrow if I can.

@qgallouedec (Member):

No rush on our side :)

@dame-cell (Author) commented Dec 17, 2024

All right, I think this does it. I checked whether we can train with this on a single T4 GPU in a Colab notebook. Using the provided example script trl/scripts/dpo.py with a small update, I was able to train a model with padding_free=True:

python trl/examples/scripts/dpo.py \
    --dataset_name trl-lib/ultrafeedback_binarized \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --learning_rate 5.0e-6 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --logging_steps 1 \
    --output_dir Qwen2-0.5B-DPO \
    --no_remove_unused_columns \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16

Without padding_free it kept running out of memory (OOM), is that normal?
I have not updated the docs yet because I'm not 100% sure this works correctly until it has been reviewed.
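
The "bit of update" to the example script presumably just forwards the new argument into the trainer; a hypothetical version of that change (variable names are approximations, not the actual diff) could look like:

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    tokenizer=tokenizer,
    padding_free=True,  # new argument introduced by this PR
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()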

@dame-cell marked this pull request as ready for review December 17, 2024 14:58
@dame-cell (Author):

@osanseviero Just wanted to follow up on this PR and see if there’s any feedback so far. I’m happy to clarify anything or make updates if needed. Let me know whenever you get a chance—thanks so much for your time! 🙌

@qgallouedec (Member):

You still need to revert the changes applied to the PPO files, and run the pre-commit checks.

@dame-cell (Author) commented Dec 22, 2024

@qgallouedec
I trained a Qwen model for only 10 steps with both padding_free=True and padding_free=False, using a batch size of 1 and no gradient accumulation, on a Google Colab notebook with a T4 GPU.

GPU peak memory:

Metric              Padding Free=False    Padding Free=True
Peak memory (GB)    13.9                  9.0

Here are the metrics from the last training step:

Metric               Padding Free=False    Padding Free=True
Loss                 0.6939                0.6914
Grad Norm            4.1626                5.1364
Rewards/Chosen       -0.00055              0.00207
Rewards/Rejected     0.00093               -0.00152
Rewards/Accuracies   0.0                   1.0
Rewards/Margins      -0.00148              0.00359
LogPs/Chosen         -57.3851              -57.3589
LogPs/Rejected       -29.9984              -27.7588
Logits/Chosen        -2.8899               -2.8894
Logits/Rejected      -2.7367               -2.3579
Epoch                0.01                  0.01

There still appear to be noticeable differences between the Rewards/Chosen and Rewards/Rejected metrics. Despite my efforts to resolve this, I just could not fix it.
With gradient accumulation of 8, you can fit a batch size of up to 4 with padding_free=True.
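
For reference, peak-memory numbers like those above can be collected with a short snippet along these lines (a sketch of one way to measure it, not necessarily how these figures were obtained):

import torch

torch.cuda.reset_peak_memory_stats()
trainer.train()  # `trainer` is a DPOTrainer configured as in the earlier snippets
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.1f} GB")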

@qgallouedec (Member):

(Please stop tagging osanseviero? Unless you've a good reason. He is not involved here, please don't bother him 🙏)


Successfully merging this pull request may close these issues.

Let DPOTrainer Support padding_free