Give example on how to handle gradient accumulation with cross-entropy #3193
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Looks great! (We still need to do the gather() + division by the number of processes in the Trainer.)
Left a few nits. I think it'd be really cool if we could show full training graphs: after doing stuff with FP8, I don't fully trust just taking "the end result is the same" at face value :)
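Not the Trainer code itself, just a rough sketch of the gather step being referred to, assuming each rank has already counted its local non-padded tokens (all names and numbers are illustrative):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Each rank counts its own non-padded target tokens, then we gather and sum
# across ranks so every process normalizes its loss by the same global total.
# The extra per-process scaling mentioned above is deliberately left out here.
local_num_items = torch.tensor([36], device=accelerator.device)  # pretend local count
total_num_items = accelerator.gather(local_num_items).sum().item()
```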
Results on a single device:
```
initial model weight is tensor([-0.0075, 0.5364])
initial model clone weight is tensor([-0.0075, 0.5364])
Step 0 - Device 0 - num items in the local batch 36
Total num items 36
Device 0 - w/ accumulation, the final model weight is tensor([0.0953, 0.4337])
w/o accumulation, the final model weight is tensor([0.0953, 0.4337])
```
Results on a two-device setup:
```
initial model weight is tensor([-0.0075, 0.5364])
initial model clone weight is tensor([-0.0075, 0.5364])
Step 0 - Device 0 - num items in the local batch 52
Step 0 - Device 1 - num items in the local batch 84
Total num items 136
Device 1 - w/ accumulation, the final model weight is tensor([0.2117, 0.3172])
Device 0 - w/ accumulation, the final model weight is tensor([0.2117, 0.3172])
w/o accumulation, the final model weight is tensor([0.2117, 0.3172])
```
Honestly, if we can, let's even toss up some wandb graphs 🔥
Indeed, it'd be great, but here we only do a single global batch, so I don't think it's worth adding a graph. Maybe I should modify the current code snippet to run multiple global steps?
Or add some wandb graphs from the upcoming modification of examples/by_feature/gradient_accumulation?
model_optimizer.zero_grad()

logger.warning(f"Device {accelerator.process_index} - w/ accumulation, the final model weight is {accelerator.unwrap_model(model).weight.detach().cpu().squeeze()}", main_process_only=False)
Rather than logger.warning, we can do print() here or change the default logging level :) (Using logging.warning rather than logging.info just weirds me out.)
Nice job @ylacombe! Left a few suggestions!
num_samples_in_epoch = len(dataloader)
remainder = num_samples_in_epoch % gradient_accumulation_steps
remainder = remainder if remainder != 0 else gradient_accumulation_steps
total_gradient_updates = math.ceil(num_samples_in_epoch / gradient_accumulation_steps)

total_batched_samples = 0
for update_step in range(total_gradient_updates):
    # In order to correctly compute the total number of non-padded tokens on which we'll compute the cross-entropy loss
    # we need to pre-load the full local batch - i.e. the next per_device_batch_size * accumulation_steps samples
    batch_samples = []
    num_batches_in_step = gradient_accumulation_steps if update_step != (total_gradient_updates - 1) else remainder
    for _ in range(num_batches_in_step):
        batch_samples += [next(training_iterator)]
This only works when we know the size of the dataloader. Can we think of a solution that doesn't require this information? I think we can just iterate on the dataloader until we have gradient_accumulation_steps batches to create batch_samples; if we can't iterate anymore, we stop there too. I think that code will be easier to understand.
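A hypothetical sketch of that suggestion (the helper name and the loop shape are mine, not from the PR):

```python
from typing import Any, Iterator, List

def next_accumulation_batches(data_iter: Iterator[Any], gradient_accumulation_steps: int) -> List[Any]:
    """Pull up to `gradient_accumulation_steps` micro-batches from the iterator.

    Returns a shorter (possibly empty) list once the iterator is exhausted, so the
    caller never needs to know len(dataloader) up front.
    """
    batches: List[Any] = []
    for _ in range(gradient_accumulation_steps):
        try:
            batches.append(next(data_iter))
        except StopIteration:
            break
    return batches

# usage sketch:
# training_iterator = iter(dataloader)
# while batch_samples := next_accumulation_batches(training_iterator, gradient_accumulation_steps):
#     ...  # count the non-padded tokens in batch_samples, then run the micro-batches
```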
Yes agreed :) (What we do in the Trainer)
# Since we performed prefetching, we need to manually set sync_gradients
if total_batched_samples % gradient_accumulation_steps != 0:
    accelerator.gradient_state._set_sync_gradients(False)
else:
    accelerator.gradient_state._set_sync_gradients(True)
The issue here is due to end_of_dataloader, which incorrectly modifies sync_gradient because of the prefetching.
Maybe we can add an option to disable do_sync in accumulate? That way we won't have to put this specific piece of code under accumulate, and the user will have total control over when the gradients are synced. cc @muellerzr
Agreed, you can put it as part of this PR if you want @ylacombe
I hadn't taken this case into account, but we should also set sync_gradient=True when reaching the very last total_batched_samples, btw.
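Purely illustrative, reusing the same private `_set_sync_gradients` call as the snippet above; one way to fold in that last-micro-batch case could be a small predicate like this (the names are made up):

```python
def should_sync(total_batched_samples: int, num_samples_in_epoch: int, gradient_accumulation_steps: int) -> bool:
    """Sync when an accumulation window is complete, or when this is the very
    last micro-batch of the epoch (a possibly shorter remainder window)."""
    window_complete = total_batched_samples % gradient_accumulation_steps == 0
    last_of_epoch = total_batched_samples == num_samples_in_epoch
    return window_complete or last_of_epoch

# usage sketch, mirroring the snippet above:
# accelerator.gradient_state._set_sync_gradients(
#     should_sync(total_batched_samples, num_samples_in_epoch, gradient_accumulation_steps)
# )
```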
Not sure I understand your point exactly @SunMarc: reaching end_of_dataloader also sets accelerator.step to 0. If I disable it, we'd have issues when saving the accelerator state, right?
I think it should be fine, as we don't care about step in the Trainer either. cc @muellerzr, but we can leave that for a follow-up PR if you want!
A follow-up would be fine by me :)
Results on a single device:
Maybe we can specify the exact setup? I think we are doing the following:
- dp=1 grad_acc=2 batch_size=4 vs dp=1 grad_acc=1 batch_size=8?

If we are only doing one update, then we won't be able to get a graph. Maybe we could do this on a larger dataset where batch_size != len(data_loader) and add the graphs.
Results on a two-device setup:
On a two-device setup, the modification you made to take dp into account won't be reflected here, since we are only changing grad_acc and batch_size, so the loss will be the same regardless. However, it's nice to see that the total_num_items really changed:
- dp=2 grad_acc=2 batch_size=4 vs dp=2 grad_acc=1 batch_size=8

Maybe we should do a separate section/experiment to show that the following have the same loss graph:
- dp=2 batch_size=2 is the same as dp=1 batch_size=4. See this experiment for clarification.
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
def test_gradient_accumulation_for_autoregressive_models(self):
    testargs = ["examples/by_feature/gradient_accumulation_for_autoregressive_models.py"]
    run_command(self.launch_args + testargs)
Just a nit: this doesn't use gradient accumulation here since it uses the default of 1
"--per_device_batch_size", | ||
type=int, | ||
default=2, | ||
help="The number of minibatches to be ran before gradients are accumulated.", |
Shouldn't this be "The size of each minibatch"?
help="The number of minibatches to be ran before gradients are accumulated.", | |
help="The size of each minibatch", |
What does this PR do?
Following the recent highlights on how gradient accumulation with the cross-entropy loss is usually done incorrectly, it would be great to have this mentioned in the docs. I've thus added some code and an explanation of it in the gradient accumulation page.
cc @SunMarc and @muellerzr, let me know what you think of it or if I can make this any clearer!
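For anyone reading this thread without the diff open, here is a minimal single-device sketch of the idea being documented (illustrative shapes and values, not the code added to the page): compute the cross-entropy with reduction="sum" and divide by the number of non-padded tokens in the whole accumulated batch, instead of averaging each micro-batch independently.

```python
import torch
import torch.nn.functional as F

# Illustrative only: two micro-batches that together form one global batch.
logits_chunks = [torch.randn(4, 10, requires_grad=True) for _ in range(2)]
target_chunks = [torch.randint(0, 10, (4,)) for _ in range(2)]
pad_id = -100  # targets equal to this id are ignored and not counted

# Normalize by the token count of the whole accumulated batch, not per micro-batch.
total_num_items = sum((t != pad_id).sum().item() for t in target_chunks)

for logits, targets in zip(logits_chunks, target_chunks):
    loss = F.cross_entropy(logits, targets, ignore_index=pad_id, reduction="sum") / total_num_items
    loss.backward()  # gradients accumulate, matching a single pass over the full batch
```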