
Add support for mistral type Model to use Mistral and Zephyr #1553

Closed
manjunathshiva opened this issue Nov 27, 2023 · 8 comments

@manjunathshiva

Feature request

Using airllm with a 4GB GPU for a mistral-type model gives me the error below:

File "C:\model.py", line 5, in
model = AirLLMLlama2("./modles/zephyr-7b-beta")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\LLM\venv\Lib\site-packages\airllm\airllm.py", line 184, in init
self.init_model()
File "C:\LLM\venv\Lib\site-packages\airllm\airllm.py", line 197, in init_model
self.model = BetterTransformer.transform(self.model) # enable flash attention
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\305031856\AppData\Local\Programs\Python\Python311\Lib\contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "C:\LLM\venv\Lib\site-packages\optimum\bettertransformer\transformation.py", line 228, in transform
raise NotImplementedError(
NotImplementedError: The model type mistral is not yet supported to be used with BetterTransformer. Feel free to open an issue at https://github.com/huggingface/optimum/issues if you would like this model type to be supported. Currently supported models are: dict_keys(['albert', 'bark', 'bart', 'bert', 'bert-generation', 'blenderbot', 'bloom', 'camembert', 'blip-2', 'clip', 'codegen', 'data2vec-text', 'deit', 'distilbert', 'electra', 'ernie', 'fsmt', 'falcon', 'gpt2', 'gpt_bigcode', 'gptj', 'gpt_neo', 'gpt_neox', 'hubert', 'layoutlm', 'llama', 'm2m_100', 'marian', 'markuplm', 'mbart', 'opt', 'pegasus', 'rembert', 'prophetnet', 'roberta', 'roc_bert', 'roformer', 'splinter', 'tapas', 't5', 'vilt', 'vit', 'vit_mae', 'vit_msn', 'wav2vec2', 'whisper', 'xlm-roberta', 'yolos']).
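For anyone reproducing this outside of airllm, here is a minimal sketch (illustrative, not AirLLM's code; the model ID is an example) of the underlying call that raises the error: Optimum's `BetterTransformer.transform()` rejects model types it does not support, and `mistral` is not on the list.

```python
# Minimal sketch (not AirLLM's actual code) of the call that fails inside
# airllm's init_model(): Optimum raises NotImplementedError for model types
# BetterTransformer does not support yet, which includes "mistral" here.
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

# Zephyr is a fine-tune of Mistral, so its model type is "mistral".
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

try:
    model = BetterTransformer.transform(model)  # the same call airllm makes to enable flash attention
except NotImplementedError as err:
    print(f"BetterTransformer does not support this architecture yet: {err}")
```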

Motivation

Zephyr is currently the leading model on Hugging Face, so support is very much needed!

Your contribution

Yes, I can help if any help is needed! I am a Senior Software Engineer with 17 years of industry experience.

@Govind-S-B

I looked through various related issues as well, and it seems this has been unaddressed for more than a month. I was thinking of finally adding support for the mistral architecture on my own, even though I don't know much about it.
Found this resource in the docs which might help: https://huggingface.co/docs/optimum/bettertransformer/tutorials/contribute.
I am also trying to get AirLLM working with Mistral, so it's good to see others working on the same thing.

@manjunathshiva
Author

Thank you very much! Mistral 7B is a top model that outperforms Llama 13B in some cases. Zephyr-7b-beta from Hugging Face, which is fine-tuned from Mistral, is the best one and even beats Llama 70B in some cases. Adding support for Mistral will open up both the Mistral and Zephyr models. Thanks for the contribution link.

@Govind-S-B

By the way, I don't think pursuing performance improvements with airllm is worth it. I tried it with a 34B-parameter model and it's really, really slow on my 8GB card; the bottleneck is going to be processing power. A quantized model loaded straight onto the card is better, in my opinion.
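For reference, a rough sketch of that alternative: loading a 4-bit quantized model directly onto the GPU with Transformers and bitsandbytes (this assumes the bitsandbytes and accelerate packages are installed; the model ID is only an example, not a benchmark).

```python
# Rough sketch: load a 4-bit quantized model straight onto the GPU.
# Assumes bitsandbytes and accelerate are installed; model ID is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("What is Mistral 7B?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```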

@manjunathshiva
Author

Thanks for the update! I think this update may not be required until the model becomes faster!

@fxmarty
Contributor

fxmarty commented Dec 13, 2023

Hi @manjunathshiva, in the Transformers 4.36 release we started adding native torch.nn.functional.scaled_dot_product_attention support for decoder models (see https://github.com/huggingface/transformers/releases/tag/v4.36.0 & https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention).

As decoder models do not use nested tensors and simply rely on SDPA, let's add this directly in Transformers.
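For illustration, this is the PyTorch op that the native integration calls under the hood for decoder models (available in torch >= 2.0); the tensor shapes below are arbitrary placeholders.

```python
# Illustrative only: the SDPA op used for decoder models (torch >= 2.0).
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# is_causal=True applies the causal mask used by decoder-only models.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```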

I opened the issue huggingface/transformers#28005 in Transformers to track the support. Please continue the discussion there!

@fxmarty fxmarty closed this as completed Dec 13, 2023
@jesulo

jesulo commented Jan 19, 2024

Hi, does BetterTransformer support Mistral? Or Solar Mistral? Regards

@pradeepdev-1995

Any updates on this? Does BetterTransformer support Mistral?

@fxmarty
Contributor

fxmarty commented Jan 22, 2024

Hi @jesulo @pradeepdev-1995, BetterTransformer optimization for Mistral (which in our case is simply calling PyTorch's SDPA op instead of manual attention) has been integrated in Transformers natively, see https://huggingface.co/docs/transformers/v4.37.0/en/perf_infer_gpu_one#bettertransformer and https://huggingface.co/docs/transformers/v4.37.0/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention, as long as you use torch>=2.1.1.
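For anyone landing here, a minimal sketch of that native path (assuming transformers >= 4.36 and torch >= 2.1.1; the model ID is just an example):

```python
# Minimal sketch of the native SDPA path described above.
# Assumes transformers >= 4.36 and torch >= 2.1.1; model ID is illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # uses torch.nn.functional.scaled_dot_product_attention
    device_map="auto",
)
```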
