fix loss masking and padding #287

Open · wants to merge 21 commits into main

Conversation

@sohamparikh (Member) commented Jun 4, 2025

✨ Description

Fixes and improvements for loss masking and padding

Closes #

πŸ” Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

πŸ“ Changes

List the key changes introduced in this PR:

  • Discard masked tokens when computing the loss mean
  • Mask negative token ids in the embedding lookup
  • Fix cached yaml comparison when truncate_documents=False
  • Handle the edge case when no padding is required in a sequence
  • Speed up the sum(long_docs_filter) computation, which was extremely slow for large datasets (see the sketch after this list)
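
A minimal sketch of the slowdown in the last bullet, assuming long_docs_filter is a NumPy boolean array marking documents that exceed the sequence length (the surrounding code is illustrative, not the actual Fast-LLM implementation):

```python
import numpy as np

# Hypothetical filter over a large dataset: True where a document is "too long".
long_docs_filter = np.random.rand(10_000_000) > 0.999

# Slow: Python's built-in sum() iterates element by element in the interpreter.
# n_long = sum(long_docs_filter)

# Fast: a single vectorized reduction in C.
n_long = int(long_docs_filter.sum())
print(n_long)
```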

@tobyzl2 (Contributor) left a comment

LGTM!

@sohamparikh requested a review from tobyzl2 on June 4, 2025 at 18:06
@@ -525,8 +529,8 @@ def _load_yaml_data(self, data: dict[str, typing.Any]) -> None:
elif "unshuffled_tokens" not in data:
# Backward compatibility
# TODO v0.x: Remove
assert self._truncate_documents
data["unshuffled_tokens"] = data["tokens_per_epoch"] * data["unshuffled_epochs"]
assert not self._truncate_documents
Collaborator

That looks wrong; the old format only supported _truncate_documents=True.

assert self._truncate_documents
data["unshuffled_tokens"] = data["tokens_per_epoch"] * data["unshuffled_epochs"]
assert not self._truncate_documents
data["unshuffled_tokens"] = data["dataset"]["tokens_per_epoch"] * data["unshuffled_epochs"]
Collaborator

I believe the backward compatibility is from before we moved things to dataset, so it was right before.

Member Author

Yes, I got a bit confused about the purpose of this. There's still an issue with yaml_data not containing unshuffled_tokens (we can't get it without building the padded cumsum). I pushed a hack to copy it from loaded_yaml_data instead to avoid breaking the flow; not sure if there's a cleaner way.

@@ -145,7 +145,7 @@ def _fused_cross_entropy_forward_backward(

per_sample_loss = sum_exp_logits.log() - predicted_logits
if loss_mask is not None:
per_sample_loss = per_sample_loss * loss_mask
per_sample_loss = per_sample_loss[loss_mask]
Collaborator

Why this change? loss_mask is an integer so multiplication should work.

@sohamparikh (Member Author) commented Jun 4, 2025

The loss is not as interpretable or comparable when we include the loss from masked tokens (0) in the average. We start seeing a lot of variance in the reported loss when mixing samples with and without masked tokens.

Collaborator

Oh, so it's to have the right denominator in the mean? Indexing is a bad idea because it introduces a CUDA synchronization point (really slow), but you can divide by loss_mask.sum() instead.

We probably also want to handle the case loss_mask.sum() == 0.

@@ -99,7 +99,10 @@ def _forward(self, input_: torch.Tensor, position_ids: torch.Tensor | None) -> t
input_ = split(input_, group=group, dim=0)
if self._use_absolute_position_embeddings:
position_ids = split(position_ids, group=group, dim=0)
embeddings = torch.embedding(self.word_embeddings_weight, input_)
# mask padded tokens
input_mask = input_ >= 0
Collaborator

I'd prefer not to do this unless padding is enabled because of the extra compute involved. Why do we have negative input anyway?

Member Author

We set the padded tokens to -100 mainly to mask the loss on them. Many tokenizers don't have a pad token, so it's not straightforward to take one from the config either.

Member Author

Let me know if the change is OK now, or whether we need to check for padding via a flag (maybe in kwargs).

Collaborator

We can add a flag, something like what we already do for labels: https://github.com/ServiceNow/Fast-LLM/blob/main/fast_llm/models/gpt/model.py#L332.
Also, I just noticed the method is within @torch.compile, so the overhead shouldn't be too noticeable.

@sohamparikh (Member Author) commented Jun 5, 2025

something like this? (truncate_documents might also fit better in the batch config)
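
Purely as an illustration (not the author's actual pushed change), a sketch of what flag-gated masking of negative token ids could look like in the embedding forward; the mask_inputs flag and function signature are assumptions:

```python
import torch


def embed_tokens(word_embeddings_weight: torch.Tensor, input_: torch.Tensor, mask_inputs: bool) -> torch.Tensor:
    if mask_inputs:
        # Padded / loss-masked positions carry a negative id (e.g. -100), which is
        # not a valid embedding index, so clamp them to 0 for the lookup ...
        input_mask = input_ >= 0
        masked_input = input_ * input_mask
        embeddings = torch.embedding(word_embeddings_weight, masked_input)
        # ... and zero out the resulting embeddings at those positions.
        embeddings = embeddings * input_mask.unsqueeze(-1)
    else:
        # No padding possible: skip the extra compute entirely.
        embeddings = torch.embedding(word_embeddings_weight, input_)
    return embeddings
```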

@@ -467,6 +468,12 @@ def __getitem__(self, index: int) -> typing.Any:
else:
# Move on to the next sample.
token_count += padding_size
elif document_size + tokens_in_sample == self._parameters.sequence_length + 1:
if token_count + document_size == token_start:
# Document belongs to the current sample but the condition below will include it for the next sample
Collaborator

I'm not following: why are we ignoring the document if it belongs to the current sample? (Also, it clearly belongs to the previous sample.)

Collaborator

From what I understand, in this scenario we'll have token_start_index_in_document == token_end_index_in_document == document_size, so we'll load 0 tokens from the sample. That seems unnecessary but not wrong, and it also doesn't seem related to document_size + tokens_in_sample == self._parameters.sequence_length + 1.

It seems to me the actual fix would be to replace >= with > in the condition below.

Member Author

Oh yes, I got confused because I faced this issue in the multimodal branch, but it only occurs when there are images right after the text tokens. I'll handle it there.

@@ -146,8 +146,13 @@ def _fused_cross_entropy_forward_backward(
per_sample_loss = sum_exp_logits.log() - predicted_logits
if loss_mask is not None:
per_sample_loss = per_sample_loss * loss_mask

loss = per_sample_loss.mean()
unmasked_inputs = loss_mask.sum()
Collaborator

This still causes a CUDA sync. You can just do loss = (per_sample_loss * loss_mask).sum() / torch.maximum(loss_mask.sum(), 1).
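
A sketch of the sync-free masked mean being suggested here, using clamp_min in place of torch.maximum so the scalar floor of 1 is accepted directly; the names are illustrative:

```python
import torch


def masked_mean_loss(per_sample_loss: torch.Tensor, loss_mask: torch.Tensor | None) -> torch.Tensor:
    if loss_mask is None:
        return per_sample_loss.mean()
    loss_mask = loss_mask.to(per_sample_loss.dtype)
    # Multiply-and-divide keeps everything on the GPU; boolean indexing
    # (per_sample_loss[loss_mask]) would force a host-device sync because the
    # output shape depends on the mask's contents.
    # clamp_min(1) guards against a fully masked batch (loss_mask.sum() == 0).
    return (per_sample_loss * loss_mask).sum() / loss_mask.sum().clamp_min(1)
```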

Member Author

For my own understanding, how can I check whether a PyTorch op causes a CUDA sync?
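
One way to check (a sketch, assuming a CUDA build of PyTorch): torch.cuda.set_sync_debug_mode makes PyTorch warn or raise whenever an operation forces the host to synchronize with the GPU.

```python
import torch

torch.cuda.set_sync_debug_mode("warn")  # "error" raises instead of warning

x = torch.randn(1024, device="cuda")
mask = x > 0

y = x * mask   # stays asynchronous: no warning
z = x[mask]    # boolean indexing has a data-dependent output shape, so it syncs and warns
```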

@@ -75,6 +75,7 @@ class GPTSamplingParameters(SamplingParameters):
use_loss_masking_spans: bool = False
use_preference_loss_spans: bool = False
cross_document_attention: bool = True
truncate_documents: bool = True
Collaborator

Not sure we need to move this, but if we do we need to add backward compatibility.

@@ -48,14 +48,15 @@ class GPTDataConfig(DataConfig, GPTLegacyConfig):
desc="Multiprocessing context. Do not touch.",
hint=FieldHint.expert,
)
truncate_documents: bool = Field(
default=True,
truncate_documents: bool | None = Field(
Collaborator

That works, but we normally do backward compatibility in _from_dict; see the example in lines 73-90 below. This one can go in GPTTrainerConfig._from_dict.
It also needs a TODO for removal.
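
Purely as an illustration of the migration pattern being suggested (the dict keys and helper name are assumptions, not the actual GPTTrainerConfig._from_dict code):

```python
import typing


def migrate_truncate_documents(config: dict[str, typing.Any]) -> dict[str, typing.Any]:
    """Move the legacy data.truncate_documents flag into the sampling parameters."""
    # TODO v0.x: remove this backward-compatibility shim.
    data = config.get("data", {})
    sampling = config.setdefault("sampling", {})
    if "truncate_documents" in data and "truncate_documents" not in sampling:
        sampling["truncate_documents"] = data.pop("truncate_documents")
    return config
```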
