Change load format for Mixtral #2028

Merged on Dec 11, 2023 (1 commit)

Conversation

WoosukKwon (Collaborator)

Closes #2018

This is a band-aid solution for Mixtral.
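
For context, a minimal sketch of the user-side workaround from the linked issue, assuming vLLM's load_format engine argument (the --load-format pt flag on the CLI); the model id, GPU count, and prompt are illustrative and not part of this PR's diff:

from vllm import LLM, SamplingParams

# Force Mixtral to be loaded from the PyTorch (.bin/.pt) checkpoint rather than
# safetensors; equivalent to passing --load-format pt to the API server.
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",  # illustrative model id
    load_format="pt",
    tensor_parallel_size=2,               # adjust to the available GPUs
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)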

mlinmg commented Dec 18, 2023

It still gives error #2024 when trying to load the GPTQ version of Mixtral, and I'm unable to specify loading it with pt; it says to load it with safetensors.

WoosukKwon (Collaborator, Author)

Hi @mlinmg, this part of the logic was fixed by #2133. Could you upgrade vLLM to v0.2.6 and try again? If it still doesn't work, could you please share the model name so that we can reproduce the issue?

mlinmg commented Dec 18, 2023

Yes, I'll do it later today or tomorrow.
However, the model is Mixtral-GPTQ.

Sobsz commented Dec 21, 2023

"However the model is Mixtral-GPTQ"

The problem there is that the model is only in Safetensors format, not PyTorch. You'll have to either wait for vLLM to support Mixtral in Safetensors or download the full-size model and quantize it yourself.
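
For reference, a rough sketch of the "quantize it yourself" route, assuming AutoGPTQ's AutoGPTQForCausalLM and BaseQuantizeConfig API; the calibration sample and quantization parameters are placeholders, and quantizing Mixtral this way needs substantial GPU memory:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"  # full-size base model (illustrative)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ settings chosen as an example; pick values appropriate to your use case.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A single toy calibration sample; real calibration needs a proper dataset.
examples = [tokenizer("vLLM is a fast LLM inference engine.", return_tensors="pt")]
model.quantize(examples)

# Save PyTorch .bin weights instead of safetensors so --load-format pt can use them.
model.save_quantized("mixtral-8x7b-gptq-pt", use_safetensors=False)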

skt7 (Contributor) commented Dec 23, 2023

Yes, right. The issue seems to persist with the TheBloke/Mixtral-8x7B-v0.1-GPTQ model. @TheBloke, any chance you can provide the pt weights for it?

TheBloke

I've not uploaded non-safetensors for GPTQ for 6+ months. I'm not sure if AutoGPTQ even supports it any more. I guess Transformers does.

That's a major pain though and I'm certainly not going to upload PT for models on an ongoing basis. I'm willing to do it as a once-off for Mixtral 8x7B and the Instruct version, for testing purposes. I will upload them in separate branches of the main repos.

Is this a problem unique to Mixtral GPTQ, or to all vLLM GPTQs? I've not yet tested vLLM GPTQ support myself; I'm planning to do so soon, and then I'll mention it in my GPTQ READMEs.

Sobsz commented Dec 23, 2023

It's unique to Mixtral, as far as I know. I believe I've read somewhere that it's caused by the layer names being different between the PyTorch and Safetensors versions, but don't take my word for it.

Someone did upload a GPTQ version of Mixtral, but only the base model: https://huggingface.co/IbuNai/Mixtral-8x7B-v0.1-gptq-4bit-pth
I haven't tested it, so it might not work for the same reason as the Safetensors model.

TheBloke

Hmm, that sounds different then. If I make .pt versions of my GPTQs, the layer names will be identical.

This issue sounds like vLLM only supports the non-HF version of Mixtral, i.e. the version distributed as consolidated.xx.pth files, which did indeed have different layer names.

Looks like that IbuNai user converted my Mixtral GPTQ to pth; maybe he renamed the layers at the same time.

Let me know if that works and if it does I'll see about doing an Instruct version.

But hopefully vLLM will support the HF version of Mixtral as GPTQ soon?

I know it will support the Mixtral AWQ versions soon, as Casper Hansen has made a PR for that, so that might be the easier option for you to use.

Sobsz commented Dec 23, 2023

Just checked: their model doesn't work either, though seemingly for a different reason. While your version raises a KeyError at model.layers.0.block_sparse_moe.experts.4.w1.g_idx, theirs instead fails at model.layers.18.block_sparse_moe.gate.qweight (which is listed in the index, oddly enough).

skt7 (Contributor) commented Dec 25, 2023

Thanks @TheBloke and @Sobsz for the prompt response. It seems there was an issue when serving with TP>1, which was fixed in #2208. As that fix is not released yet, I built vLLM from source using the latest code and was then able to serve the TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ main model.
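
For reference, a sketch of how such a setup could be reproduced with vLLM's offline LLM API, assuming the quantization="gptq" engine argument in that build from source; the prompt and GPU count are illustrative:

from vllm import LLM, SamplingParams

# Load the 4-bit GPTQ Mixtral checkpoint with tensor parallelism.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    tensor_parallel_size=2,  # TP>1 works once the fix from #2208 is included
)

out = llm.generate(["[INST] Say hello. [/INST]"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)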

skt7 (Contributor) commented Dec 25, 2023

Checked the other variants as well, but it seems only 4-bit models are supported right now; the 3-bit and 8-bit variants fail with the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/vllm_2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/vllm_2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/usr/vllm/vllm/entrypoints/api_server.py", line 82, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/usr/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/usr/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/usr/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 113, in __init__
    self._init_workers_ray(placement_group)
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 200, in _init_workers_ray
    self._run_workers(
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 768, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/home/usr/vllm/vllm/engine/llm_engine.py", line 745, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/home/usr/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/usr/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/usr/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerVllm.execute_method() (pid=160685, ip=10.128.0.3, actor_id=090d22607e677ff90c8b386b01000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f068c68cf70>)
  File "/home/usr/vllm/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
  File "/home/usr/vllm/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/home/usr/vllm/vllm/worker/model_runner.py", line 60, in load_model
    self.model = get_model(self.model_config)
  File "/home/usr/vllm/vllm/model_executor/model_loader.py", line 41, in get_model
    quant_config = get_quant_config(model_config.quantization,
  File "/home/usr/vllm/vllm/model_executor/weight_utils.py", line 95, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
  File "/home/usr/vllm/vllm/model_executor/layers/quantization/gptq.py", line 64, in from_config
    return cls(weight_bits, group_size, desc_act)
  File "/home/usr/vllm/vllm/model_executor/layers/quantization/gptq.py", line 33, in __init__
    raise ValueError(
ValueError: Currently, only 4-bit weight quantization is supported for GPTQ, but got 3 bits.
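
A hypothetical pre-flight check, not part of vLLM, that reads the GPTQ repo's quantize_config.json (assuming it exposes a "bits" field, as AutoGPTQ-style repos typically do) and fails fast on anything other than 4-bit:

import json

from huggingface_hub import hf_hub_download


def gptq_bits(repo_id: str) -> int:
    """Return the GPTQ bit width declared in the repo's quantize_config.json."""
    path = hf_hub_download(repo_id, "quantize_config.json")
    with open(path) as f:
        return json.load(f)["bits"]


repo = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # per the thread, the main branch is the 4-bit build
bits = gptq_bits(repo)
if bits != 4:
    raise SystemExit(f"{repo} uses {bits}-bit GPTQ; vLLM currently supports 4-bit only")
print(f"{repo}: {bits}-bit GPTQ, OK to load with vLLM")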

Successfully merging this pull request may close these issues.

Specify --load-format pt for Mixtral