This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Fix sparsity arg in Engine/ModelArgs #179

Merged
mgoin merged 2 commits into main from fix-sparsity-engine-arg on Apr 11, 2024

Conversation

@mgoin (Member) commented Apr 10, 2024

The sparsity argument was being ignored because it was switched around with the new quantization_param_path arg.

>>> from vllm import LLM
>>> model = LLM("nm-testing/OpenHermes-2.5-Mistral-7B-pruned50", sparsity="sparse_w16a16", max_model_len=1024)
INFO 04-10 18:02:12 llm_engine.py:81] Initializing an LLM engine (v0.2.0) with config: model='nm-testing/OpenHermes-2.5-Mistral-7B-pruned50', speculative_config=None, tokenizer='nm-testing/OpenHermes-2.5-Mistral-7B-pruned50', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, sparsity=None, enforce_eager=8192, kv_cache_dtype=auto, quantization_param_path=False, device_config=cuda, seed=0)

Notice the sparsity=None in the log, which doesn't match the arg we specified.

Because the ModelConfig construction in create_engine_config() doesn't use named arguments, these args got flipped around. See the mass of positional arguments:

    def create_engine_config(self, ) -> EngineConfig:
        device_config = DeviceConfig(self.device)
        model_config = ModelConfig(
            self.model, self.tokenizer, self.tokenizer_mode,
            self.trust_remote_code, self.download_dir, self.load_format,
            self.dtype, self.seed, self.revision, self.code_revision,
            self.tokenizer_revision, self.max_model_len, self.quantization,
            self.quantization_param_path, self.sparsity, self.enforce_eager,
            self.max_context_len_to_capture, self.max_logprobs)
            ...
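A standalone sketch of the failure mode, using a simplified stand-in class rather than the real ModelConfig signature: when a call site passes everything positionally, inserting or reordering a parameter does not raise an error, it just silently shifts every later value onto the wrong attribute.

# Simplified stand-in; not the real ModelConfig signature.
class FakeModelConfig:
    def __init__(self, model, quantization, sparsity, enforce_eager):
        self.model = model
        self.quantization = quantization
        self.sparsity = sparsity
        self.enforce_eager = enforce_eager

# Call site still written against an older order (model, sparsity,
# quantization, enforce_eager): it runs without error, but every value
# after the first mismatched position lands on the wrong attribute.
cfg = FakeModelConfig("mistral-7b", "sparse_w16a16", None, False)
print(cfg.quantization)  # "sparse_w16a16" -- was meant to be sparsity
print(cfg.sparsity)      # None            -- the value we passed is lost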

Comment on lines +435 to +453
self.model,
self.tokenizer,
self.tokenizer_mode,
self.trust_remote_code,
self.download_dir,
self.load_format,
self.dtype,
self.seed,
self.revision,
self.code_revision,
self.tokenizer_revision,
self.max_model_len,
self.quantization,
self.quantization_param_path,
# UPSTREAM SYNC: keep sparsity argument
self.sparsity,
self.enforce_eager,
self.max_context_len_to_capture,
self.max_logprobs)
@bnellnm (Member) commented Apr 10, 2024


Considering how this broke, should we be using names for some/most of these arguments?

@mgoin (Member, Author) replied

I do want to, but that would further diverge from upstream, so I didn't go for it.

A collaborator replied

Let's do an upstream PR for this (i.e., for Bill's idea of making these named args).
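A minimal sketch of what that upstream change might look like, assuming ModelConfig's keyword parameter names match the EngineArgs attribute names used at this call site (not the merged code, just an illustration): with keyword arguments, a future reordering of ModelConfig's signature raises a TypeError instead of silently misassigning values.

    def create_engine_config(self) -> EngineConfig:
        device_config = DeviceConfig(self.device)
        # Hypothetical keyword-argument form of the same construction.
        model_config = ModelConfig(
            model=self.model,
            tokenizer=self.tokenizer,
            tokenizer_mode=self.tokenizer_mode,
            trust_remote_code=self.trust_remote_code,
            download_dir=self.download_dir,
            load_format=self.load_format,
            dtype=self.dtype,
            seed=self.seed,
            revision=self.revision,
            code_revision=self.code_revision,
            tokenizer_revision=self.tokenizer_revision,
            max_model_len=self.max_model_len,
            quantization=self.quantization,
            quantization_param_path=self.quantization_param_path,
            # UPSTREAM SYNC: keep sparsity argument
            sparsity=self.sparsity,
            enforce_eager=self.enforce_eager,
            max_context_len_to_capture=self.max_context_len_to_capture,
            max_logprobs=self.max_logprobs)
        ...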

@mgoin merged commit dcd4973 into main on Apr 11, 2024
2 checks passed
@mgoin deleted the fix-sparsity-engine-arg branch on April 11, 2024 at 00:26
3 participants