
Conversation

@mayank31398

Flash Attention is working correctly: I see errors between the HF model's layers and the Megatron model's layers as low as 1e-3 to 1e-4 with fp16 precision.
However, in the non-flash case there are large errors, due to incorrect shape handling during training.
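
A minimal sketch of the kind of per-layer comparison described above, assuming the layer outputs of both models have already been captured (e.g. with forward hooks); the function and the toy tensors are illustrative, not code from this PR:

import torch

def report_layer_errors(hf_outputs, megatron_outputs):
    """Print the max abs error between corresponding layer outputs.

    Both arguments are assumed to be lists of same-shaped fp16 tensors,
    e.g. collected with forward hooks on each model's transformer layers.
    """
    for i, (a, b) in enumerate(zip(hf_outputs, megatron_outputs)):
        err = (a.float() - b.float()).abs().max().item()  # reduce in fp32
        print(f"layer {i}: max abs error = {err:.1e}")

# Toy usage with random tensors standing in for captured layer outputs:
outs = [torch.randn(32, 2, 512, dtype=torch.float16) for _ in range(2)]
report_layer_errors(outs, [o + 1e-4 for o in outs])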

@jlamypoirier
Collaborator

The existing implementation looks fine to me; see the layer outputs below. Of course, the different shape will cause the intermediate values and the dropout mask to differ, but the end result is the same when dropout is disabled.

CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc-per-node=1 Megatron-LM/pretrain_gpt.py \
--tokenizer-type=TokenizerFromFile \
--tokenizer-file=[...] \
--make-vocab-size-divisible-by=128 \
--num-workers=0 \
--valid-num-workers=0 \
--data-path=[...] \
--num-layers=1 \
--hidden-size=512 \
--num-attention-heads=4 \
--attention-head-type=multiquery \
--max-position-embeddings=32 \
--seq-length=32 \
--init-method-std=0.022 \
--DDP-impl=local  \
--initial-loss-scale=65536 \
--fp16 \
--train-iters=1 \
--micro-batch-size=2 \
--log-interval=1 \
--eval-iters=0 \
--lr=0.0002 \
[--use-flash-attn \]
--hidden-dropout=0 \
--attention-dropout=0 \
--lr-decay-style=constant

With flash (printing transformer layer output stats and every 997th value):

LAYER 1, name=None, shape=[32, 2, 512], dtype=torch.float16, device=cuda:0, stats=(8.166, 210.750), storage=140294837774848, storage size=65536, storage stats=(8.166, 210.750)
[-0.427978515625, 0.037841796875, 0.058929443359375, 0.26025390625, -0.2462158203125, -0.172119140625, 0.08416748046875, 0.2264404296875, -0.332763671875, 0.12109375, 0.107177734375, -0.071044921875, 0.189697265625, 0.178955078125, 0.239990234375, -0.1292724609375, -0.2047119140625, 0.28662109375, 0.0889892578125, -0.1063232421875, -0.115478515625, -0.16552734375, -0.145751953125, -0.10693359375, 0.388671875, -0.08074951171875, -0.14697265625, 0.183837890625, -0.1710205078125, -0.03802490234375, -0.11138916015625, 0.10986328125, 0.048828125]

Without flash:

LAYER 1, name=None, shape=[32, 2, 512], dtype=torch.float16, device=cuda:0, stats=(8.164, 210.750), storage=140037864947712, storage size=65536, storage stats=(8.164, 210.750)
[-0.427978515625, 0.037841796875, 0.05889892578125, 0.260009765625, -0.246337890625, -0.172119140625, 0.0838623046875, 0.226318359375, -0.332763671875, 0.12109375, 0.107666015625, -0.0709228515625, 0.18994140625, 0.178955078125, 0.2401123046875, -0.1290283203125, -0.2047119140625, 0.28662109375, 0.0887451171875, -0.1063232421875, -0.1156005859375, -0.1656494140625, -0.1458740234375, -0.1070556640625, 0.388671875, -0.080810546875, -0.14697265625, 0.1837158203125, -0.1712646484375, -0.0379638671875, -0.111328125, 0.10986328125, 0.04888916015625]
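
A quick way to quantify how close the two runs above are is to diff the dumped activations directly; a small sketch assuming the layer-1 output of each run was saved to disk (the file names are hypothetical):

import torch

# Hypothetical dumps of the layer-1 output from the flash and non-flash runs,
# each of shape [32, 2, 512] (seq, batch, hidden) in fp16.
flash_out = torch.load("layer1_flash.pt")
noflash_out = torch.load("layer1_noflash.pt")

diff = (flash_out.float() - noflash_out.float()).abs()
print(f"max abs diff = {diff.max().item():.2e}, mean abs diff = {diff.mean().item():.2e}")
# With dropout disabled, the two runs should agree to within fp16 rounding.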

@jlamypoirier
Collaborator

The interest of the HF format (other than being simpler) is to reduce the number of copying transposes, but that requires a batch-first data format. I'm not sure there is much to be done with sequence-first (which is needed for sequence parallelism).
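
To illustrate the point about copying transposes (illustrative only, not code from the PR): Megatron keeps activations sequence-first, [sq, b, h], while an HF-style batch-first kernel expects [b, sq, h]; the transpose between the two is a free view, but making it contiguous materializes a copy of the whole activation:

import torch

sq, b, h = 32, 2, 512
hidden = torch.randn(sq, b, h, dtype=torch.float16)  # Megatron layout: [sq, b, h]

# Batch-first layout for an HF-style kernel: the transpose itself is a view,
# but most kernels need contiguous memory, so .contiguous() copies the tensor.
batch_first = hidden.transpose(0, 1).contiguous()  # [b, sq, h]
print(batch_first.shape, batch_first.is_contiguous())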

@jlamypoirier
Collaborator

Also, I think ALiBi would need to be adjusted.

@mayank31398
Author

mayank31398 commented Jul 18, 2023

@jlamypoirier @RaymondLi0 I found the bug.
The ALiBi tensor was being repeated incorrectly.
The sequence of steps in this PR instead produces the correct ALiBi tensor.

Earlier the tensor was [b, sq * np, sk] rather than [b, np * sq, sk].
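
A hedged illustration of the shape issue (not the PR's actual code, and using the standard geometric ALiBi slopes as an assumption): the bias has to be laid out head-major so that each head keeps its own slope after the reshape used by the batched attention matmul; repeating along the wrong axis interleaves sequence positions and heads, giving the [b, sq * np, sk] ordering instead of [b, np * sq, sk]:

import torch

b, np_, sq, sk = 2, 4, 8, 8

# Per-head ALiBi slopes and position bias, broadcast over the query dimension.
slopes = 2.0 ** (-torch.arange(1, np_ + 1, dtype=torch.float32))   # [np]
positions = torch.arange(sk, dtype=torch.float32).view(1, 1, sk)   # [1, 1, sk]
alibi = (slopes.view(np_, 1, 1) * positions).expand(np_, sq, sk)   # [np, sq, sk]

# Correct: keep the head dimension outermost, then repeat per batch
# -> rows ordered head-major, i.e. the [b, np * sq, sk] layout.
correct = alibi.reshape(1, np_ * sq, sk).repeat(b, 1, 1)

# Buggy: putting the sequence dimension outermost interleaves heads and
# positions, i.e. the [b, sq * np, sk] ordering described above.
buggy = alibi.transpose(0, 1).reshape(1, sq * np_, sk).repeat(b, 1, 1)

print(torch.equal(correct, buggy))  # False: rows no longer line up per head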

@RaymondLi0
Collaborator

Nice catch, thank you @mayank31398 !

@RaymondLi0 RaymondLi0 merged commit c82a5a1 into bigcode-project:multi-query-attention Jul 19, 2023
@mayank31398 mayank31398 deleted the mqa branch July 19, 2023 14:25