Change load format for Mixtral #2028

Merged on Dec 11, 2023 (1 commit)

Conversation

WoosukKwon (Collaborator)

Closes #2018

This is a band-aid solution for Mixtral.
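
For context, a minimal sketch of the user-side workaround from the linked issue, assuming vLLM's load_format engine argument (the --load-format pt flag on the CLI); the model id, GPU count, and prompt are illustrative and not part of this PR's diff:

from vllm import LLM, SamplingParams

# Force Mixtral to be loaded from the PyTorch (.bin/.pt) checkpoint rather than
# safetensors; equivalent to passing --load-format pt to the API server.
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",  # illustrative model id
    load_format="pt",
    tensor_parallel_size=2,               # adjust to the available GPUs
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)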

mlinmg commented Dec 18, 2023

It still gives error #2024 when trying to load the GPTQ version of Mixtral, and I'm unable to specify loading it with pt; it says to load it with safetensors.

WoosukKwon (Collaborator, Author)

Hi @mlinmg, this part of the logic was fixed by #2133. Could you upgrade vLLM to v0.2.6 and try again? If it still doesn't work, could you please share the model name so that we can reproduce the issue?

mlinmg commented Dec 18, 2023

Yes, I'll do it later today or tomorrow.
However, the model is Mixtral-GPTQ.

Sobsz commented Dec 21, 2023

"However the model is Mixtral-GPTQ"

The problem there is that the model is only in Safetensors format, not PyTorch. You'll have to either wait for vLLM to support Mixtral in Safetensors or download the full-size model and quantize it yourself.
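
For reference, a rough sketch of the "quantize it yourself" route, assuming AutoGPTQ's AutoGPTQForCausalLM and BaseQuantizeConfig API; the calibration sample and quantization parameters are placeholders, and quantizing Mixtral this way needs substantial GPU memory:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"  # full-size base model (illustrative)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ settings chosen as an example; pick values appropriate to your use case.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A single toy calibration sample; real calibration needs a proper dataset.
examples = [tokenizer("vLLM is a fast LLM inference engine.", return_tensors="pt")]
model.quantize(examples)

# Save PyTorch .bin weights instead of safetensors so --load-format pt can use them.
model.save_quantized("mixtral-8x7b-gptq-pt", use_safetensors=False)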

skt7 (Contributor) commented Dec 23, 2023

Yes, right. The issue seems to persist with the TheBloke/Mixtral-8x7B-v0.1-GPTQ model. @TheBloke, any chance you can provide the pt weights for it?

TheBloke

I've not uploaded non-safetensors for GPTQ for 6+ months. I'm not sure if AutoGPTQ even supports it any more. I guess Transformers does.

That's a major pain though and I'm certainly not going to upload PT for models on an ongoing basis. I'm willing to do it as a once-off for Mixtral 8x7B and the Instruct version, for testing purposes. I will upload them in separate branches of the main repos.

Is this a problem unique to Mixtral GPTQ, or to all vLLM GPTQs? I've not yet tested vLLM GPTQ support myself; I'm planning to do so soon, and then I'll mention it in my GPTQ READMEs.

Sobsz commented Dec 23, 2023

It's unique to Mixtral, as far as I know. I believe I've read somewhere that it's caused by the layer names being different between the PyTorch and Safetensors versions, but don't take my word for it.

Someone did upload a GPTQ version of Mixtral, but only the base model: https://huggingface.co/IbuNai/Mixtral-8x7B-v0.1-gptq-4bit-pth
I haven't tested it, so it might not work for the same reason as the Safetensors model.

TheBloke

Hmm, that sounds different then. If I make .pt versions of my GPTQs, the layer names will be identical.

This issue sounds like vLLM only supports the non-HF version of Mixtral, i.e. the version distributed as consolidated.xx.pth files, which did indeed have different layer names.

Looks like that IbuNai user converted my Mixtral GPTQ to pth; maybe he renamed the layers at the same time.

Let me know if that works and if it does I'll see about doing an Instruct version.

But hopefully vLLM will support the HF version of Mixtral as GPTQ soon?

I know it will support the Mixtral AWQ versions soon, as Casper Hansen has made a PR for that, so that might be the easier option for you to use.

Sobsz commented Dec 23, 2023

Just checked: their model doesn't work either, though seemingly for a different reason. While your version raises a KeyError at model.layers.0.block_sparse_moe.experts.4.w1.g_idx, theirs instead fails at model.layers.18.block_sparse_moe.gate.qweight (which is listed in the index, oddly enough).

skt7 (Contributor) commented Dec 25, 2023

Thanks @TheBloke and @Sobsz for the prompt response. It seems there was an issue when serving with TP>1, which was fixed in #2208. As that fix is not released yet, I built vLLM from source using the latest code and was then able to serve the TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ main model.
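
For reference, a sketch of how such a setup could be reproduced with vLLM's offline LLM API, assuming the quantization="gptq" engine argument in that build from source; the prompt and GPU count are illustrative:

from vllm import LLM, SamplingParams

# Load the 4-bit GPTQ Mixtral checkpoint with tensor parallelism.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,  # TP>1 works once the fix from #2208 is included
)

out = llm.generate(["[INST] Say hello. [/INST]"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)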

skt7 (Contributor) commented Dec 25, 2023

Checked the other variants as well, but it seems only 4-bit models are supported right now; the 3-bit and 8-bit variants fail with the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/vllm_2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/vllm_2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/usr/vllm/vllm/entrypoints/api_server.py", line 82, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/usr/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/usr/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/usr/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 113, in __init__
    self._init_workers_ray(placement_group)
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 200, in _init_workers_ray
    self._run_workers(
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 768, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 745, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/home/usr/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/usr/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/usr/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerVllm.execute_method() (pid=160685, ip=10.128.0.3, actor_id=090d22607e677ff90c8b386b01000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f068c68cf70>)
  File "/home/usr/vllm/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
  File "/home/usr/vllm/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/home/usr/vllm/vllm/worker/model_runner.py", line 60, in load_model
    self.model = get_model(self.model_config)
  File "/home/usr/vllm/vllm/model_executor/model_loader.py", line 41, in get_model
    quant_config = get_quant_config(model_config.quantization,
  File "/home/usr/vllm/vllm/model_executor/weight_utils.py", line 95, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
  File "/home/usr/vllm/vllm/model_executor/layers/quantization/gptq.py", line 64, in from_config
    return cls(weight_bits, group_size, desc_act)
  File "/home/usr/vllm/vllm/model_executor/layers/quantization/gptq.py", line 33, in __init__
    raise ValueError(
ValueError: Currently, only 4-bit weight quantization is supported for GPTQ, but got 3 bits.
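
A hypothetical pre-flight check, not part of vLLM, that reads the GPTQ repo's quantize_config.json (assuming it exposes a "bits" field, as AutoGPTQ-style repos typically do) and fails fast on anything other than 4-bit:

import json

from huggingface_hub import hf_hub_download


def gptq_bits(repo_id: str) -> int:
    """Return the GPTQ bit width declared in the repo's quantize_config.json."""
    path = hf_hub_download(repo_id, "quantize_config.json")
    with open(path) as f:
        return json.load(f)["bits"]


repo = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # per the thread, the main branch is the 4-bit build
bits = gptq_bits(repo)
if bits != 4:
    raise SystemExit(f"{repo} uses {bits}-bit GPTQ; vLLM currently supports 4-bit only")
print(f"{repo}: {bits}-bit GPTQ, OK to load with vLLM")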

Successfully merging this pull request may close these issues.

Specify --load-format pt for Mixtral