
fix OOM when inference with llama-3.1-70b #1302

Merged: 4 commits into huggingface:main from fix-oom-infer-llama on Sep 26, 2024

Conversation

@harborn (Contributor) commented Aug 30, 2024

What does this PR do?

Background

When I run inference with the following command:

INPUT=32768
OUTPUT=32768
BATCH_SIZE=12

python gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
    --model_name_or_path Meta-Llama-3.1-70B-Instruct/ \
    --max_input_tokens ${INPUT} \
    --max_new_tokens ${OUTPUT} \
    --bf16 \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size ${BATCH_SIZE} \
    --attn_softmax_bf16 \
    --limit_hpu_graphs \
    --trim_logits \
    --flash_attention_causal_mask \
    --flash_attention_recompute \
    --warmup 1 \
    --n_iteration 1 \
    --bucket_internal \
    --bucket_size=512 \
    --use_flash_attention

it runs out of device memory (OOM), while BATCH_SIZE=11 does not.

After debugging with a memory analysis tool, I found that the first creation of the causal attention mask tensor needs too much device memory, which leads to device memory exhaustion.

Details of creating the causal mask tensor

Converts 2D attention mask to 4D attention mask by expanding mask to (bsz, head_dim=1, query_length, key_value_length) shape and by adding a large negative bias to not-attended positions.

If attention_mask is causal, a causal mask will be added.

The first time this tensor is created, its shape is very large (in my case, [12, 1, 32768, 32768]).
Creating it requires an intermediate mask tensor. That mask tensor's dtype could be torch.bool, but it is actually torch.int, which uses four times the memory (for shape [12, 1, 32768, 32768] it needs about 48 GB of device memory, which causes the peak memory usage).
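
As a sanity check on that figure (a quick back-of-the-envelope calculation, not part of the PR), the element count of a [12, 1, 32768, 32768] mask times the element size gives roughly 48 GiB for a 4-byte integer mask versus roughly 12 GiB for a torch.bool mask:

```python
# Rough peak-memory estimate for the intermediate mask tensor (illustrative only).
bsz, heads, q_len, kv_len = 12, 1, 32768, 32768
numel = bsz * heads * q_len * kv_len          # 12,884,901,888 elements

gib = 2 ** 30
print(f"torch.int32 mask: {numel * 4 / gib:.1f} GiB")  # 4 bytes/elem -> ~48 GiB
print(f"torch.bool mask:  {numel * 1 / gib:.1f} GiB")  # 1 byte/elem  -> ~12 GiB
```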

Fixes

This PR aims to explicitly make the computation of the causal attention mask tensor use less device memory by using a torch.bool mask tensor.

For the code changes, it is enough to override the base class's to_4d function (a sketch of the idea follows).
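
For illustration, here is a minimal sketch of the idea, not the exact diff of this PR: build the padding and causal masks as torch.bool and convert only once into the additive bias in the target dtype. The helper name to_4d_bool_mask and its signature are assumptions made for this sketch; the actual change overrides the base class's to_4d function.

```python
import torch

# Hypothetical helper illustrating the bool-mask approach; the name and
# signature are assumptions, not the actual code of this PR.
def to_4d_bool_mask(attention_mask_2d: torch.Tensor, query_length: int,
                    key_value_length: int, dtype: torch.dtype) -> torch.Tensor:
    """Expand a (bsz, key_value_length) padding mask to a causal
    (bsz, 1, query_length, key_value_length) additive bias, keeping all
    intermediates in torch.bool to reduce peak device memory."""
    device = attention_mask_2d.device

    # Causal part as a bool mask: True means "masked out".
    causal = torch.triu(
        torch.ones((query_length, key_value_length), dtype=torch.bool, device=device),
        diagonal=key_value_length - query_length + 1,
    )

    # Padding part as a bool mask: positions where the 2D mask is 0 are masked.
    padding = (attention_mask_2d == 0)[:, None, None, :]   # (bsz, 1, 1, kv_len)

    # Combine in bool, then convert once into the large negative additive bias.
    masked = causal[None, None, :, :] | padding             # (bsz, 1, q_len, kv_len)
    return masked.to(dtype) * torch.finfo(dtype).min
```

The final output still has the full (bsz, 1, q_len, kv_len) shape, but the integer intermediate that previously caused the ~48 GB peak is replaced by bool intermediates at a quarter of the size.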

Others

But why did BATCH_SIZE=11 not cause device memory exhaustion?
I think this is a bug in the LAZY GRAPH.
In lazy mode, the graph should be able to optimize the computation of this big tensor so that it uses less device memory.
So the best solution for this bug would be further optimization in the LAZY GRAPH.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@harborn requested a review from regisss as a code owner on August 30, 2024 at 07:06
@harborn changed the title from "fix oom when infernece with llama-3.1-70b" to "fix OOM when inference with llama-3.1-70b" on Aug 30, 2024
@yafshar (Contributor) commented Sep 5, 2024

@harborn

  • would you please add more info about this PR and the issue it addresses to the README?
  • please run GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/transformers/tests/models/ -s -v before and after the changes and make sure no new failures are introduced.

@harborn (Contributor, Author) commented Sep 12, 2024

> @harborn
>
> • would you please add more info about this PR and the issue it addresses to the README?
> • please run GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/transformers/tests/models/ -s -v before and after the changes and make sure no new failures are introduced.

I have updated the PR description with the necessary information about these changes.

@yafshar (Contributor) commented Sep 12, 2024

> @harborn
>
> • would you please add more info about this PR and the issue it addresses to the README?
> • please run GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/transformers/tests/models/ -s -v before and after the changes and make sure no new failures are introduced.
>
> I have updated the PR description with the necessary information about these changes.

Thanks, very nice explanation!

@yafshar (Contributor) left a comment

LGTM!

@regisss, would you please check this PR?

@harborn (Contributor, Author) commented Sep 14, 2024

Can you merge this PR? @yafshar @regisss

@yafshar (Contributor) commented Sep 18, 2024

@libinta can you label this PR

@yafshar (Contributor) commented Sep 19, 2024

> @libinta can you label this PR

This PR is ready; can we label it with run-test?

@libinta added the run-test label (Run CI for PRs from external contributors) and removed the review and wip labels on Sep 24, 2024
@harborn (Contributor, Author) commented Sep 25, 2024

Are there any remaining tasks before this PR can be merged? @yafshar

The code quality check failed; please run make style.

@yafshar (Contributor) commented Sep 25, 2024

@harborn, can you run make style and fix any related issues? Also, rebase the code.

@harborn force-pushed the fix-oom-infer-llama branch from df58e06 to 0cdca2e on September 26, 2024 at 02:47
@harborn (Contributor, Author) commented Sep 26, 2024

> @harborn, can you run make style and fix any related issues? Also, rebase the code.

Done.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss merged commit 4baaf3d into huggingface:main on Sep 26, 2024
4 checks passed