[QEff Finetune]: Made fixes to training script #439
base: main
Conversation
def get_preprocessed_samsum(dataset_config, tokenizer, split, context_length=None):
-    dataset = datasets.load_dataset("Samsung/samsum", split=split, trust_remote_code=True)
+    dataset = datasets.load_dataset("knkarthick/samsum", split=split, trust_remote_code=True)
Please check if this dataset can be used.
We are not distributing this dataset; hence, it should not be a problem.
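For reference, a minimal sketch of loading the replacement dataset and inspecting one record, assuming the Hugging Face datasets API, Hub access, and the standard SAMSum field names (dialogue, summary); the actual preprocessing in the PR may differ:

    import datasets

    # Load the SAMSum mirror used in this PR; trust_remote_code mirrors the call in the diff.
    dataset = datasets.load_dataset("knkarthick/samsum", split="train", trust_remote_code=True)

    # Each record carries a dialogue and its reference summary.
    sample = dataset[0]
    print(sample["dialogue"])
    print(sample["summary"])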
Force-pushed from 6d833cf to d269d0c
@@ -235,11 +241,23 @@ def train(
        train_step_metric.append(step_metric_val)

        if train_config.grad_scaler:
            scaler.scale(loss).backward()  # backward pass
        if train_config.enable_ddp:
            with model.no_sync():
This will result in no syncing of gradients at any step.
Yes, correct. Removed this no_sync change from this PR. We will raise a separate PR for that.
        if train_config.enable_ddp:
            # FIXME: We cannot stop the transfer of gradients across devices every time.
            # In gradient accumulation, the last step should transfer gradients across devices.
            with model.no_sync():
This will result in no syncing of gradients at any step here as well.
Yes, correct. Removed this no_sync change from this PR. We will raise a separate PR for that.
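For context, a minimal sketch of the pattern the FIXME points at, assuming PyTorch DDP: no_sync() suppresses the gradient all-reduce only on intermediate micro-steps, and the final micro-step of each accumulation window syncs and steps the optimizer. The gradient_accumulation_steps field is hypothetical, and model, optimizer, train_dataloader, and train_config are assumed to exist as in the training script; this is not the code proposed in the PR.

    import contextlib

    # model is assumed to be wrapped in torch.nn.parallel.DistributedDataParallel.
    for step, batch in enumerate(train_dataloader):
        is_last_micro_step = (step + 1) % train_config.gradient_accumulation_steps == 0

        # Skip the gradient all-reduce on intermediate micro-steps; sync on the last one.
        if train_config.enable_ddp and not is_last_micro_step:
            sync_context = model.no_sync()
        else:
            sync_context = contextlib.nullcontext()

        with sync_context:
            loss = model(**batch).loss / train_config.gradient_accumulation_steps
            loss.backward()

        if is_last_micro_step:
            optimizer.step()
            optimizer.zero_grad()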
Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Force-pushed from 12e48a9 to 085052e
…ght parameter to make the loss for padded samples as zero. Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
…well. Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
… and zero the loss for padded samples. Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
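One of the commits above mentions a weight parameter used to make the loss for padded samples zero. A minimal sketch of that idea, assuming PyTorch's CrossEntropyLoss and a hypothetical pad token id; the exact loss setup in the PR may differ:

    import torch
    import torch.nn as nn

    PAD_TOKEN_ID = 0  # hypothetical; use the tokenizer's actual pad token id

    # Option 1: ignore_index drops padded positions from the loss entirely.
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)

    # Option 2: per-token masking zeroes the contribution of padded positions,
    # then normalizes by the number of real (non-padded) tokens.
    logits = torch.randn(4, 8, 32000)          # (batch, seq_len, vocab)
    labels = torch.randint(1, 32000, (4, 8))   # (batch, seq_len)
    labels[:, 6:] = PAD_TOKEN_ID               # pretend the tail is padding

    per_token_loss = nn.CrossEntropyLoss(reduction="none")(
        logits.view(-1, logits.size(-1)), labels.view(-1)
    )
    mask = (labels.view(-1) != PAD_TOKEN_ID).float()
    loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)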
Force-pushed from 085052e to 21eb82d
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Force-pushed from dfa26af to 19697d0
…e loss fn. Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Force-pushed from dde5ebb to eee5328
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Force-pushed from 07b5f70 to 7f5f3b4
Below are the numbers with this PR:

Run 1: Dataset: Samsum, Model: Llama-3.2-1B, Epoch: 1
Run 2: Dataset: Samsum, Model: Llama-3.1-8B, Epoch: 1