Description
System Info
- transformers version: 4.42.3
- Platform: Linux-6.1.85+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.3
- Accelerate version: 0.32.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): 2.15.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
- Jax version: 0.4.26
- JaxLib version: 0.4.26
- Using distributed or parallel set-up in script?: Yes
- Using GPU in script?: Yes
- GPU type: Tesla T4
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am following the tutorial below to fine-tune Microsoft Phi-3 on my custom dataset: https://github.com/microsoft/Phi-3CookBook/blob/main/code/04.Finetuning/Phi-3-finetune-lora-python.ipynb
Since I am running on a T4 GPU in Colab, Flash Attention is not supported [FlashAttention only supports Ampere GPUs or newer.]
Thus, per the code below from the tutorial, attn_implementation is selected as 'sdpa' with compute dtype torch.float16:
```python
import torch

# Pick dtype and attention backend based on GPU capability
if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    attn_implementation = 'flash_attention_2'
else:
    compute_dtype = torch.float16
    attn_implementation = 'sdpa'
```
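For context, a quick diagnostic (my own sketch, not from the tutorial) confirms why the else branch is taken on this runtime:

```python
# Diagnostic sketch (not part of the tutorial): check the GPU's compute capability.
# A Tesla T4 is Turing, compute capability 7.5, i.e. below the Ampere (8.0)
# requirement for FlashAttention.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")               # 7.5 on a Tesla T4
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")  # False here, so the float16 / 'sdpa' branch runs
```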
Loading the model:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,  # the Phi-3 checkpoint from the tutorial
    torch_dtype=compute_dtype, trust_remote_code=True, device_map='auto',
    attn_implementation=attn_implementation,
)
```
Error
It gives me the following error:
ValueError: Phi3ForCausalLM does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: #28005.
Setting attn_implementation='eager' instead leads to a CUDA out-of-memory error.
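As a possible stopgap (a sketch under my own assumptions, not taken from the tutorial excerpt above), quantizing the model to 4-bit via bitsandbytes might let attn_implementation='eager' fit on the T4, but SDPA support would be the proper fix:

```python
# Possible stopgap, not a fix (assumes bitsandbytes is installed; config values are illustrative).
# 4-bit NF4 quantization shrinks the weights so 'eager' attention may fit in the T4's 16 GB.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,                        # same checkpoint as above
    quantization_config=bnb_config,
    attn_implementation="eager",     # SDPA is rejected, FA2 needs Ampere+
    trust_remote_code=True,
    device_map="auto",
)
```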
Expected behavior
SDPA should be supported as an attention implementation for the Microsoft Phi-3 model.