
ClipQKV #197

Merged 11 commits from clip_qkv into mosaicml:main on Mar 1, 2023

Conversation

@vchiley (Contributor) commented Feb 28, 2023

This PR

  • enables qk_ln in the triton attn variant
  • enables clipping qkv before attn is applied (see the sketch after this list)
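A minimal sketch of both options, assuming a fused QKV projection (the module and argument names here are illustrative, not the actual code in attention.py): clip_qkv clamps the projection output to [-clip_qkv, clip_qkv] before attention, and qk_ln applies LayerNorm to the queries and keys before the dot product.

```python
from typing import Optional

import torch
import torch.nn as nn

class SketchAttention(nn.Module):
    """Hypothetical sketch of clip_qkv and qk_ln; not the repo's code."""

    def __init__(self, d_model: int, n_heads: int,
                 clip_qkv: Optional[float] = None, qk_ln: bool = False):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.clip_qkv = clip_qkv  # e.g. 2.0; None disables clipping
        self.qk_ln = qk_ln
        self.Wqkv = nn.Linear(d_model, 3 * d_model)
        if qk_ln:
            self.q_ln = nn.LayerNorm(d_model)
            self.k_ln = nn.LayerNorm(d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        qkv = self.Wqkv(x)
        if self.clip_qkv is not None:
            # ClipQKV: clamp activations before attention is applied
            qkv = qkv.clamp(-self.clip_qkv, self.clip_qkv)
        q, k, v = qkv.chunk(3, dim=-1)
        if self.qk_ln:
            # qk_ln: LayerNorm on queries and keys before the dot product
            q, k = self.q_ln(q), self.k_ln(k)
        # split heads: (b, s, d) -> (b, n_heads, s, head_dim)
        q = q.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, d)
        return self.out_proj(out)
```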

On its own, Alibi does not fully solve the stability issue; when clipping is added to qkv, training becomes more stable.
[Screenshot: loss curves showing improved stability when qkv clipping is added to Alibi]

Without Alibi, clipping qkv still trains very stably (it just doesn't perform as well).
[Screenshot: loss curves for qkv clipping without Alibi]

I tried clipping values of {0.1, 1, 2, 10}; {0.1, 1} were too aggressive, so {2, 10} are shown. With the higher clipping value (10), the network is slightly less stable and exhibits pseudo loss spikes from which it recovers gracefully.

ClipQKV+Alibi outperforms QKLN
[Screenshot: loss curves, ClipQKV+Alibi vs. QKLN]

ClipQKV+Alibi has the exact same performance as QKLN+Alibi
[Screenshot: loss curves, ClipQKV+Alibi vs. QKLN+Alibi]

@vchiley self-assigned this Feb 28, 2023
@abhi-mosaic (Contributor) left a comment

Gonna leave this to others to review + approve, but could I suggest adding some unit tests to check equality between the torch/causal/triton attention blocks?
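For context, one self-contained shape such an equality test could take; the PR's torch/causal/triton blocks are not reproduced here, so this stands in a manual implementation against PyTorch's fused scaled_dot_product_attention:

```python
import torch
import torch.nn.functional as F

def test_attn_impls_match():
    # Stand-in for the suggested check: two implementations of the same
    # attention should agree on identical inputs.
    torch.manual_seed(0)
    q = torch.randn(2, 4, 16, 8)  # (batch, heads, seq, head_dim)
    k = torch.randn(2, 4, 16, 8)
    v = torch.randn(2, 4, 16, 8)
    manual = torch.softmax(q @ k.transpose(-2, -1) / 8**0.5, dim=-1) @ v
    fused = F.scaled_dot_product_attention(q, k, v)
    assert torch.allclose(manual, fused, atol=1e-5)
```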

Review comments on examples/llm/src/models/layers/attention.py and examples/llm/src/models/mosaic_gpt.py (outdated, resolved)
vchiley and others added 2 commits February 28, 2023 10:46
Co-authored-by: Abhi Venigalla <77638579+abhi-mosaic@users.noreply.github.com>
@vchiley (Contributor, Author) commented Feb 28, 2023

> adding some unit tests to check equality between the torch/causal/triton attention blocks?

created issue: https://mosaicml.atlassian.net/browse/RESEARCH-468

@dakinggg (Collaborator) left a comment

LGTM

Two review comments on examples/llm/src/models/layers/attention.py (outdated, resolved)
@dakinggg (Collaborator) commented Mar 1, 2023

nvm, i misread the output. looks like your added tests are failing

@vchiley (Contributor, Author) commented Mar 1, 2023

#197 (comment)

yeah they're gpu tests. I need to xfail them on cpu. Forgot to do that...
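For context, gating a test to GPU-only runs in pytest looks roughly like this (the test name is illustrative; pytest.mark.skipif is the common alternative that skips rather than expecting failure):

```python
import pytest
import torch

# Illustrative test name; the PR's actual tests differ.
@pytest.mark.xfail(not torch.cuda.is_available(),
                   reason='triton attention requires a CUDA device')
def test_gpu_only_attn():
    x = torch.randn(4, 4, device='cuda')  # raises on CPU-only machines
    assert x.is_cuda
```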

@bcui19 (Contributor) left a comment

LGTM after gating the tests to be on GPU only

@vchiley force-pushed the clip_qkv branch 3 times, most recently from 1b02a56 to 7b6f3ae, March 1, 2023 17:10
@vchiley enabled auto-merge (squash) March 1, 2023 17:11
@vchiley force-pushed the clip_qkv branch 2 times, most recently from 49392da to 62d9749, March 1, 2023 18:33
@codestar12 (Contributor) left a comment

Looks good!

@vchiley merged commit e63c50c into mosaicml:main on Mar 1, 2023