The llava example in python/llm/example/GPU/PyTorch-Models/Model/llava does not work correctly when the environment variable BIGDL_QUANTIZE_KV_CACHE is set to 1.
Running generate.py after completing all the steps in README.md produces a model with the following structure:
The script crashes after reading input from the terminal, when model.generate is called. The final part of the traceback is:
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 1068, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\ipex_llm\transformers\models\llama.py", line 323, in llama_decoder_forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\ipex_llm\transformers\models\llama.py", line 1539, in llama_attention_forward_4_38
return forward_function(
^^^^^^^^^^^^^^^^^
File "C:\Users\arda\miniforge3\envs\llava=test\Lib\site-packages\ipex_llm\transformers\models\llama.py", line 1743, in llama_attention_forward_4_38_quantized
attn_output = xe_addons.sdp_fp8(query_states, key_states, value_states, new_attn_mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: expected scalar type Byte but found Float
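For reference, a minimal sketch of how the failing configuration is enabled (this is an illustration, not the example's actual code; model loading and the generate call are omitted):

```python
import os

# Set before the model is loaded: ipex-llm reads this environment variable
# to decide whether to route attention through the quantized (FP8) KV-cache
# path, which is where the "expected scalar type Byte but found Float"
# error above is raised.
os.environ["BIGDL_QUANTIZE_KV_CACHE"] = "1"

# The crash then occurs inside model.generate(...) in generate.py
# (see the example's README for the full loading steps).
```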