Refactor the quantization modification logic #2233

dbogunowicz · 2024-04-09T10:56:56Z

Feature Description

The sparseml.transformers.sparsification.modification package contains a set of modifications that are applied to some transformer models to make them compatible with our quantization flows.

This PR moves the act of modification away from SparseAutoModel and into the QuantizationModifier. This means we modify the model only when necessary, i.e. when we apply the quantization structure. This helps us avoid clashes with the new behavior in transformers, where models are initialized with SDPA attention by default.

Notable changes

Moved the high-level modification logic to sparseml.modifiers.quantization, keeping only the transformers-specific logic in the original directory.
Only the models supported by the new, "post-refactor" quantization modifiers participate in the modification. If the user wishes to initialize one of the old models, the modification happens on the initialization of SparseAutoModel.
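As a rough illustration of the idea of deferring the modification until the quantization structure is applied, here is a minimal sketch. Every name in it (ToyModel, MODIFICATION_REGISTRY, register, ToyQuantizationModifier, apply_structure) is hypothetical and chosen for illustration only; it is not the sparseml API and does not reproduce the code in this PR.

```python
# Hypothetical sketch: none of these names come from sparseml.

class ToyModel:
    """Stand-in for a transformer model that needs modification before quantization."""

    def __init__(self):
        # Construction leaves the model untouched.
        self.modified = False
        self.quantized = False


# Maps a model class to the function that makes it quantization-friendly.
MODIFICATION_REGISTRY = {}


def register(model_cls):
    """Decorator that registers a modification function for model_cls."""
    def decorator(fn):
        MODIFICATION_REGISTRY[model_cls] = fn
        return fn
    return decorator


@register(ToyModel)
def modify_toy(model):
    # Placeholder for swapping in quantization-friendly submodules.
    model.modified = True
    return model


class ToyQuantizationModifier:
    """Applies the (toy) quantization structure, modifying the model first if needed."""

    def apply_structure(self, model):
        # Exact-type lookup: only explicitly registered classes are modified.
        modify_fn = MODIFICATION_REGISTRY.get(type(model))
        if modify_fn is not None:
            model = modify_fn(model)
        model.quantized = True
        return model
```

With this layout, a model that is only loaded (and never quantized) keeps its default modules, and the modification cost is paid only on the quantization path.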

To the best of my knowledge, the failing tests are unrelated to the contents of this PR.

Legacy PR description

Keeping the original PR message (an analysis of the problem that led me to the final approach taken in this PR) for posterity, as it contains a lot of useful context:

As reported by @Satrat, after upgrading the transformers version we did not see the expected training speedups during, e.g., the sparse fine-tuning process. It turned out this was caused by the modify_model(...) function called during the initialization of SparseAutoModelForCausalLM.

Let's explain what was happening using LLaMA as an example.

The model as of transformers==4.39.1 can be initialized with three types of attention:

LlamaSdpaAttention - the default in the current transformers version; it uses the CUDA-optimized torch.nn.functional.scaled_dot_product_attention for faster attention computation.
LlamaAttention - the default before the transformers upgrade; this attention type uses torch.matmul and is therefore the one we modify through the modify_model(...) function.
LlamaFlashAttention - irrelevant in the context of this write-up.

The current, erroneous behavior was the following:

We were initializing the model with the default SDPA attention.
Because of the misuse of isinstance(), we were overriding this attention module's forward method, effectively replacing the SDPA attention's forward method with the original attention class's forward method.
This is what was slowing down the training of the LLaMA model: unknowingly, we were no longer using the fast, CUDA-optimized attention but the traditional, torch.matmul-based attention computation.

This PR hardens the modification logic: it uses the more restrictive type() check instead of isinstance() to pick the correct attention type to modify. Now there is no difference in iterations per second when sparse fine-tuning with or without the modify_model(...) function.
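The difference between the two checks can be sketched with plain Python classes. This is a minimal illustration, not the code from this PR: Attention and SdpaAttention are stand-ins for LlamaAttention and LlamaSdpaAttention, assuming (as in transformers 4.39.1) that the SDPA class subclasses the eager one; quantizable_forward, modify_loose and modify_strict are hypothetical names.

```python
import types


class Attention:
    """Stand-in for the eager, matmul-based attention class."""

    def forward(self):
        return "eager"


class SdpaAttention(Attention):
    """Stand-in for the SDPA attention class, which subclasses the eager one."""

    def forward(self):
        return "sdpa"


def quantizable_forward(self):
    # Stand-in for the modified, matmul-based forward used for quantization.
    return "eager-quantizable"


def modify_loose(module):
    # isinstance() also matches subclasses, so SdpaAttention is caught too
    # and its fast forward is silently replaced.
    if isinstance(module, Attention):
        module.forward = types.MethodType(quantizable_forward, module)
    return module


def modify_strict(module):
    # An exact type check leaves subclasses such as SdpaAttention untouched.
    if type(module) is Attention:
        module.forward = types.MethodType(quantizable_forward, module)
    return module
```

With the loose check, an SdpaAttention instance ends up running the matmul-based forward, which mirrors the slowdown described above; with the strict check it keeps its own forward, while a plain Attention instance is still modified.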

tests/sparseml/transformers/sparsification/modification/conftest.py

Satrat

LGTM, but two main thoughts: can we document this environment variable somewhere with its intended usage? And could we add a test script to this PR that demonstrates the speed issue being fixed?

src/sparseml/modifiers/quantization/pytorch.py

src/sparseml/transformers/sparsification/sparse_model.py

tests/sparseml/transformers/sparsification/modification/test_modifying_llama.py

initial commit

31705bc

dbogunowicz requested review from shubhra, Satrat and bfineran April 9, 2024 10:57

dbogunowicz and others added 3 commits April 9, 2024 11:14

...and harden tests

116dbc0

Merge branch 'main' into feature/damian/modifications_bug

e71dbf6

Merge branch 'main' into feature/damian/modifications_bug

afbd0f4

Satrat reviewed Apr 15, 2024

tests/sparseml/transformers/sparsification/modification/conftest.py (outdated, resolved)

Satrat reviewed Apr 15, 2024

tests/sparseml/transformers/sparsification/modification/conftest.py (outdated, resolved)

bfineran previously approved these changes Apr 15, 2024

move the model modification into the quantization modifier logic

2797e07

dbogunowicz dismissed bfineran’s stale review via 2797e07 April 16, 2024 09:38

dbogunowicz added 2 commits April 16, 2024 10:18

harden tests

02a438a

harden tests

265aae9

dbogunowicz changed the title ~~[Fix] Remove hidden issue in modification repo that causes training slowdown~~ Refactor the quantization modification logic Apr 16, 2024

Merge branch 'main' into feature/damian/modifications_bug

4030b1b

Satrat reviewed Apr 18, 2024

src/sparseml/modifiers/quantization/pytorch.py (resolved)

src/sparseml/transformers/sparsification/sparse_model.py (resolved)

dbogunowicz added 2 commits April 22, 2024 12:51

Merge branch 'main' into feature/damian/modifications_bug

1723b8f

Merge branch 'main' into feature/damian/modifications_bug

ec52c30

Satrat reviewed Apr 29, 2024

tests/sparseml/transformers/sparsification/modification/test_modifying_llama.py (resolved)

add tests suggested by sara

35e1d2e

bfineran approved these changes Apr 29, 2024

Satrat approved these changes Apr 29, 2024

bfineran merged commit 7cd2feb into main Apr 29, 2024
16 of 17 checks passed

bfineran deleted the feature/damian/modifications_bug branch April 29, 2024 17:11
