
enable llava static generation. #767

Merged · 10 commits · Apr 25, 2024

Conversation

lkk12014402 (Contributor)

What does this PR do?

Support LLaVA image-to-text generation.

lkk12014402 (Contributor, Author) commented Mar 5, 2024

Based on the image-to-text generation PR #738.

I tested it on a single-card Gaudi2 with --use_hpu_graphs:

python3 run_pipeline.py \
        --model_name_or_path "llava-hf/llava-1.5-7b-hf" \
        --image_path "https://llava-vl.github.io/static/images/view.jpg" \
        --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
        --max_new_tokens 20 \
        --use_hpu_graphs \
        --bf16

result = [[{'generated_text': "[\nUSER: What's the content of the image?\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n"}]], time = 264.1947269439697ms

Input/outputs:
Throughput (including tokenization) = 75.80511326513157 tokens/second
Number of HPU graphs = 22
Memory allocated = 14.06 GB
Max memory allocated = 14.06 GB
Total memory available = 94.62 GB

lkk12014402 (Contributor, Author) commented Mar 5, 2024

  • For batch_size = 4

Input/outputs 1:
USER: What's the content of the image?
ASSISTANT: The image features a pier extending out into a large body of water, likely a lake. The pier

Input/outputs 2:
USER: What's the content of the image?
ASSISTANT: The image features a pier extending out over a large body of water, likely a lake. The pier

Input/outputs 3:
USER: describe the image?
ASSISTANT: The image features a pier extending out into a large body of water, likely a lake. The pier

Input/outputs 4:
USER: Is there a brige in the image?
ASSISTANT: Yes, there is a bridge in the image.
USER: Is the bridge over water?

Input/outputs:
Throughput (including tokenization) = 191.2782859461739 tokens/second

Number of HPU graphs = 26
Memory allocated = 16.26 GB
Max memory allocated = 16.27 GB
Total memory available = 94.62 GB

jiminha self-requested a review on March 5, 2024, 22:26
JoeyTPChou

Just want to let you know this works like a charm!

ssarkar2 (Collaborator) left a comment

@lkk12014402, could you please provide a brief description of the changes needed in optimum/habana/transformers/models/llava/modeling_llava.py with respect to the base model in transformers?

I see a couple of single-input where calls, which are usually dynamic on HPU. If these run on CPU, that's fine, but if they are on HPU, they might need rewriting.

batch_indices, image_indices = torch.where(input_ids == image_token_index)

image_token_indices = torch.where(cur_input_ids == image_token_index)[0].tolist() + \
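
To illustrate the concern (a generic snippet, not code from this PR): the single-input form of torch.where returns one index per match, so its output shape depends on the tensor's values rather than on its shape, which is what makes it dynamic.

import torch

image_token_index = 32000
input_ids = torch.tensor([[1, 32000, 5, 6], [1, 32000, 32000, 6]])
# One entry per match (3 here): the output shape is data-dependent.
batch_indices, image_indices = torch.where(input_ids == image_token_index)
print(batch_indices.shape)  # torch.Size([3])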

lkk12014402 (Contributor, Author)

(quoting ssarkar2's review comment above)

Hi @ssarkar2,

I will provide a description and check the torch.where() operation as soon as possible.

lkk12014402 (Contributor, Author) commented Mar 15, 2024

(quoting ssarkar2's review comment above)

Hi @ssarkar2,

Description

Let's assume the input text is ["hey", "<image>", "how", "are"], plus one image.

Generation with Hugging Face transformers directly

With llava-1.5-7b-hf, transformers computes a text embedding of shape [1, 4, 4096] and an image embedding of shape [1, 576, 4096]. The two embeddings are then merged into the final input embedding of shape [1, 579, 4096] by the merge function in modeling_llava.py.

That merge function contains many dynamic ops, such as torch.where, and the input shape changes during generation.

So when we run generation on Gaudi2, there are two problems:

  1. Generation is very slow. We tested one example:
python3 run_pipeline.py \
    --model_name_or_path llava-hf/llava-1.5-7b-hf \
    --image_path https://llava-vl.github.io/static/images/view.jpg \
    --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
    --max_new_tokens 20 \
    --bf16
 
Output is:
03/04/2024 05:22:07 - INFO - __main__ - result = [[{'generated_text': "\\nUSER: What's the content of the image?\\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n"}]], time = 1148.6382484436035ms

Note: to reproduce, you can use the image-to-text example from PR #738.

  2. We cannot apply --use_hpu_graphs, because it raises errors.

My optimization

To keep the transformers usage (same input, same generation script) and enable static shapes by padding and by inserting token_idx for generation, I add a new function _pad_inputs that pads the input. It extends the special token <image> to the number of image-feature patches, so the padded input text can be regarded as ["hey", "<image>", "<image>", ..., "<image>", "how", "are"], whose sequence length is 579. The text embedding shape is therefore [1, 579, 4096]. When we merge the two embeddings (text embedding and image embedding), we no longer need the complex computation of the original transformers merge function; I simplify that function, as you can see in the modeling_llava.py file.

To keep the same input shape during generation, I also use token_idx, and I create two auxiliary variables, tokens_pos and image_offset. tokens_pos records the original input text positions used to select logits, which keeps the shapes of the input text/ids and the output logits aligned. image_offset records the offset introduced by the extended special tokens, and it is added to token_idx during the model forward. A simplified sketch of the idea is shown below.
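
A simplified sketch of this padding idea (illustrative only; the token ids below are made up, and the actual _pad_inputs in modeling_llava.py works on the real tokenizer output and differs in detail):

import torch

def pad_image_tokens(input_ids, image_token_id, num_patches):
    # Expand each <image> token into num_patches copies so the text embedding
    # already reserves room for the image features (static shape afterwards).
    padded, tokens_pos = [], []
    for tok in input_ids:
        tokens_pos.append(len(padded))  # where this original token lands in the padded sequence
        if tok == image_token_id:
            padded.extend([image_token_id] * num_patches)
        else:
            padded.append(tok)
    image_offset = len(padded) - len(input_ids)  # extra positions introduced by the expansion
    return torch.tensor(padded), torch.tensor(tokens_pos), image_offset

# ["hey", "<image>", "how", "are"] with 576 patches: 4 + 576 - 1 = 579 positions
ids, tokens_pos, image_offset = pad_image_tokens([101, 32000, 102, 103], 32000, 576)
print(ids.shape, tokens_pos.tolist(), image_offset)  # torch.Size([579]) [0, 1, 577, 578] 575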

The explanation for keeping torch.where

We need this function to compute the special-token indices, because we do not preprocess the input.

Ideally, we would preprocess the input (padding and extending the special <image> tokens) in advance, but that would require more changes relative to transformers; in particular, we would need a new generation script instead of reusing the script from this PR.

After the optimization, we can set --use_hpu_graphs:

python3 run_pipeline.py \
        --model_name_or_path "llava-hf/llava-1.5-7b-hf" \
        --image_path https://llava-vl.github.io/static/images/view.jpg \
        --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
        --max_new_tokens 20 \
        --use_hpu_graphs \
        --bf16
 
The output is:
result = [[{'generated_text': "\\nUSER: What's the content of the image?\\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n"}]], time = 264.1947269439697ms 

And here is the comparison with an A100:

A100 card perf:
03/06/2024 05:07:56 - INFO - __main__ - result = [[{'generated_text': "\\nUSER: What's the content of the image?\\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake.\n\n"}]], time = 575.2068996429443ms

lkk12014402 (Contributor, Author)

@ssarkar2 please help review~

mandy-li self-requested a review on April 9, 2024, 21:37
libinta (Collaborator) left a comment

@lkk12014402 can you add a CI test case and rebase?

lkk12014402 (Contributor, Author)

(quoting libinta's comment above)

@libinta I will update the PR per your comments soon.

lkk12014402 force-pushed the enable_llava_generation branch from d44c540 to 1a1ee0b on April 22, 2024, 07:53
lkk12014402 (Contributor, Author)

(quoting libinta's comment above)

Hi @libinta, I have resolved the conflicting files. I haven't seen an image-to-text example test case like test_text_generation_example.py, though.

libinta (Collaborator) commented Apr 22, 2024

@lkk12014402 can you add a file like test_image2text_generation_example.py to include image2text generation, and change [...] to include it?

lkk12014402 (Contributor, Author)

(quoting libinta's comment above)

Hi @libinta, please help review/check the image-to-text UT. Thanks!



@pytest.mark.parametrize("model_name, batch_size, reuse_cache, baseline", MODELS_TO_TEST["bf16"])
def test_text_generation_bf16(model_name: str, baseline: float, batch_size: int, reuse_cache: bool, token: str):
Collaborator:

better to have image_to_text rather than text_generation

Contributor (Author):

fixed

f"--model_name_or_path {model_name}",
f"--batch_size {batch_size}",
"--use_kv_cache",
"--max_new_tokens 20",
Collaborator:

Have you run the test with

GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/test_image_to_text_example.py -v -s

If so, you will see: run_pipeline.py: error: unrecognized arguments: --use_kv_cache --output_dir /tmp/tmpsp9f6li_ --token None
You should include the same arguments as the python3 run_pipeline.py command, i.e.

python3 run_pipeline.py \
    --model_name_or_path "llava-hf/llava-1.5-7b-hf" \
    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
    --prompt "<image>\nUSER: What's the content of the image?\nASSISTANT:" \
    --max_new_tokens 20 \
    --use_hpu_graphs \
    --bf16
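
For reference, a rough sketch of such a test (simplified; the parametrization, the test name, and the baseline value are placeholders rather than what was finally merged):

import re
import subprocess

import pytest

# The baseline throughput below is a placeholder, not a measured CI value.
MODELS_TO_TEST = {"bf16": [("llava-hf/llava-1.5-7b-hf", 1, 75.0)]}


@pytest.mark.parametrize("model_name, batch_size, baseline", MODELS_TO_TEST["bf16"])
def test_image_to_text_bf16(model_name: str, batch_size: int, baseline: float):
    command = [
        "python3", "run_pipeline.py",
        "--model_name_or_path", model_name,
        "--image_path", "https://llava-vl.github.io/static/images/view.jpg",
        "--prompt", "<image>\\nUSER: What's the content of the image?\\nASSISTANT:",
        "--batch_size", str(batch_size),
        "--max_new_tokens", "20",
        "--use_hpu_graphs",
        "--bf16",
    ]
    proc = subprocess.run(command, capture_output=True, text=True)
    assert proc.returncode == 0, proc.stderr
    # The pipeline logs "Throughput (including tokenization) = X tokens/second".
    match = re.search(r"Throughput \(including tokenization\) = ([\d.]+)", proc.stdout + proc.stderr)
    assert match is not None, "throughput line not found in the pipeline output"
    assert float(match.group(1)) >= 0.95 * baseline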

Contributor (Author):

updated

pattern = re.compile(r"([\"\'].+?[\"\'])|\s")
command = [x for y in command for x in re.split(pattern, y) if x]

if fp8:
libinta (Collaborator) commented Apr 23, 2024

remove fp8 section for now

Contributor (Author):

removed

regisss (Collaborator) left a comment

It seems there are some merge conflicts to solve, can you update your main branch and merge it into this one?
Also, please run

pip install -U ruff
make style

to have the code style check pass.

lkk12014402 (Contributor, Author)

(quoting regisss's comment above)

Updated the code style with that command.


lkk12014402 (Contributor, Author) commented Apr 24, 2024

Hi @regisss, I updated the code per your comments.
And the UT results:

04/24/2024 15:19:05 - INFO - __main__ - result = [[{'generated_text': "\nUSER: What's the content of the image?\nASSISTANT: The image features a pier extending out into a large body of water, likely a lake. The pier"}]], time = 245.72740799630992ms, Throughput (including tokenization) = 81.39100218035239 tokens/second
PASSED

please review~ Thanks~

regisss added the run-test label (Run CI for PRs from external contributors) on Apr 25, 2024
regisss merged commit 91a5e57 into huggingface:main on Apr 25, 2024 (11 of 12 checks passed)
ccrhx4 pushed a commit to ccrhx4/ccrhx4.optimum-habana that referenced this pull request May 11, 2024