Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately will be deprecated in the next release.
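As a sketch of the new style, all speculative-decoding options can be collected into a single dict and passed as one JSON argument. The draft-model repo name and token count below are illustrative, not prescribed by this document:

```python
import json

# New-style configuration: every speculative-decoding option lives in one
# dict instead of separate flags such as --speculative_model and
# --num_speculative_tokens. Values here are illustrative.
speculative_config = {
    "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # illustrative draft model
    "num_speculative_tokens": 5,                  # illustrative value
}

# On the command line, the same dict is passed as a single JSON string,
# e.g. (hypothetical invocation):
#   vllm serve <target-model> --speculative_config '<json string>'
cli_arg = json.dumps(speculative_config)
print(cli_arg)
```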
A few important things to consider when using EAGLE-based draft models:
be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
If you are using a vLLM version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
[script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
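A minimal sketch of the resulting configuration, assuming a checkpoint converted with the script above (the target model name and token count are illustrative; the conversion step is only needed on vLLM versions before PR 12304):

```python
# The draft-model path points at the checkpoint produced by the
# conversion script; num_speculative_tokens is an illustrative value.
speculative_config = {
    "model": "path/to/modified/eagle/model",  # converted EAGLE checkpoint
    "num_speculative_tokens": 5,              # illustrative value
}

# With vLLM installed, this dict would be passed to the engine, e.g.:
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative
#             speculative_config=speculative_config)
```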
2. EAGLE-based draft models must be run without tensor parallelism
(i.e., `draft_tensor_parallel_size` is set to 1 in `speculative_config`), although
it is possible to run the main model using tensor parallelism (see example above).
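For example, the mixed-parallelism setup above can be sketched as a single engine configuration. The model names, paths, and degree of parallelism are illustrative assumptions:

```python
# Sketch: the target model is sharded across GPUs while the EAGLE draft
# model runs unsharded, per the constraint above. All values illustrative.
engine_args = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative target
    "tensor_parallel_size": 4,  # main model may use tensor parallelism
    "speculative_config": {
        "model": "path/to/eagle/draft",     # illustrative draft path
        "num_speculative_tokens": 5,        # illustrative value
        "draft_tensor_parallel_size": 1,    # required: no TP for EAGLE draft
    },
}
```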
3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is