
llama_int8 does not support do_sample=True #430

Open
@markluofd

Description

Describe the bug

Running the demo run_llama_int8.py with generate_kwargs["do_sample"] set to True fails with the error below:

command:
python run_llama_int8.py -m ${MODEL_ID} --quantized-model-path "/workspace/saved_results/best_model.pt" --benchmark --jit --int8-bf16-mixed --num-iter 5 --prompt "hello"
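
For reference, sampling was enabled roughly like this (a minimal sketch; the exact kwargs in run_llama_int8.py may differ, and temperature/top_p here are illustrative values, not ones from the script):

import torch

generate_kwargs = dict(do_sample=True, temperature=0.9, top_p=0.95)
with torch.inference_mode():
    # user_model and input_ids as set up by the demo script
    output = user_model.generate(input_ids, max_new_tokens=32, **generate_kwargs)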

error log:
/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py:1405: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cpu, whereas the model is on meta. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('meta') before running .generate().
  warnings.warn(
Traceback (most recent call last):
  File "/lzw/run_llama_int8.py", line 378, in <module>
    output = user_model.generate(
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/opt/conda/lib/python3.9/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/models.py", line 624, in LlamaForCausalLM_forward
    outputs = self.model(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1522, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1531, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/models.py", line 283, in LlamaModel_forward
    attention_mask = self._prepare_decoder_attention_mask(
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py", line 65, in _prepare_decoder_attention_mask
    combined_attention_mask = _make_causal_mask(
  File "/opt/conda/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py", line 18, in _make_causal_mask
    mask = torch.full(
NotImplementedError: Could not run 'aten::_local_scalar_dense' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::_local_scalar_dense' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradMeta, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
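
The eager model is on the meta device (as the UserWarning above notes), and the failure appears to come from materializing a scalar there: converting a 0-dim tensor to a Python number dispatches aten::_local_scalar_dense, which has no meta implementation. A minimal sketch that reproduces the same error, assuming _make_causal_mask passes the fill value to torch.full as a 0-dim tensor (as older transformers versions did):

import torch

# A plain Python float fill value works on the meta device:
torch.full((4, 4), torch.finfo(torch.float32).min, device="meta")

# But a 0-dim tensor fill value must be converted via .item(), which
# dispatches aten::_local_scalar_dense and fails on meta (assumption:
# this matches the pattern inside _make_causal_mask in this version):
fill = torch.tensor(torch.finfo(torch.float32).min, device="meta")
torch.full((4, 4), fill, device="meta")  # NotImplementedError: aten::_local_scalar_dense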

do_sample is an important feature for me.

Versions

[pip3] intel-extension-for-pytorch==2.1.0.dev0+cpu.llm
[pip3] numpy==1.24.1
[pip3] torch==2.1.0.dev20230711+cpu
[pip3] torchaudio==2.1.0.dev20230711+cpu
[pip3] torchvision==0.16.0.dev20230711+cpu
[conda] intel-extension-for-pytorch 2.1.0.dev0+cpu.llm pypi_0 pypi
[conda] numpy 1.24.1 pypi_0 pypi
[conda] torch 2.1.0.dev20230711+cpu pypi_0 pypi
[conda] torchaudio 2.1.0.dev20230711+cpu pypi_0 pypi
[conda] torchvision 0.16.0.dev20230711+cpu pypi_0 pypi

Labels

CPU, Crash, LLM
