
LM Head FLOPs #78

Open
Muennighoff opened this issue Sep 28, 2023 · 2 comments

Muennighoff commented Sep 28, 2023

Why are we not multiplying the LM Head FLOPs per iteration by the checkpoint_activations_factor?

flops_per_iteration += (6 * batch_size * seq_len * num_layers * (hidden_size**2)) * (vocab_size / (num_layers * hidden_size))

Afaik the factor of 4 means 1 forward + 2 backward + 1 forward, where the extra forward is needed for checkpointed activations. Don't we also need all 4 for the LM Head? cc @RaymondLi0 @NouamaneTazi
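
For reference, the LM Head term above simplifies once num_layers cancels; a minimal sketch (the function name is mine, not the repo's):

```python
# The LM Head term as written above, with the algebra made explicit:
# 6 * B * s * L * h^2 * (V / (L * h))  ==  6 * B * s * h * V
def lm_head_term(batch_size, seq_len, hidden_size, vocab_size):
    return 6 * batch_size * seq_len * hidden_size * vocab_size
```

The per-layer transformer FLOPs elsewhere in this estimate are scaled by checkpoint_activations_factor, while this term is not, which is what the question is about.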

@NouamaneTazi

In selective recomputation, we only checkpoint attention, so it shouldn't affect the LM Head.
In full recomputation, we assume that the last hidden_states are the LM Head activations, so there's no need to recompute them.

hidden_states = self._checkpointed_forward(hidden_states,

@Muennighoff
Author

> In selective recomputation, we only checkpoint attention, so it shouldn't affect the LM Head.
> In full recomputation, we assume that the last hidden_states are the LM Head activations, so there's no need to recompute them.
>
> hidden_states = self._checkpointed_forward(hidden_states,

I see, but then don't we at least need to multiply it by 3 to account for the backward pass?
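
For concreteness, a hedged sketch of what the suggested factor of 3 would mean, assuming the forward-only LM Head matmul costs 2 * batch_size * seq_len * hidden_size * vocab_size FLOPs (2 FLOPs per multiply-accumulate); the function and parameter names are illustrative, not the repo's:

```python
def lm_head_flops(batch_size, seq_len, hidden_size, vocab_size, factor=3):
    # factor = 3 -> 1 forward + 2 backward, with no recomputation of the
    # LM Head forward; factor = 4 would add the recomputed forward that
    # checkpoint_activations_factor accounts for in the transformer layers.
    forward_flops = 2 * batch_size * seq_len * hidden_size * vocab_size
    return factor * forward_flops
```

Note that 3 * (2 * B * s * h * V) equals the 6 * B * s * h * V the existing term already yields, so the open question is whether that coefficient of 6 is meant to include the backward pass.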
