[UX] sampling with vllm #1624

Merged 2 commits on Mar 13, 2024
@@ -86,15 +86,19 @@ def translate_vllm_params(self, parameters: dict) -> dict:

:return: The same parameters dict, but with VLLM style parameter names.
"""
parameters.pop('do_sample', None)
parameters["max_tokens"] = parameters.pop("max_new_tokens", 30)
Contributor: Why 30? Is this the same default for other backends as well?

Contributor Author: Yes, this is the same default for other backends in our handlers.

if "seed" in parameters.keys():
parameters["seed"] = int(parameters["seed"])
if "max_new_tokens" in parameters.keys():
parameters["max_tokens"] = parameters.pop("max_new_tokens")
if not parameters.pop('do_sample', False):
Contributor: Is do_sample also an alias for sampling with some default temperature in other backends? If so, are the default temperature values when do_sample is set in parity across backends?

Contributor: +1, we probably want to define the do_sample behavior.

Contributor Author: Yes, do_sample means sampling here. But the default temperature for greedy in other backends is 1.0.

And vLLM seems to do sampling only when temperature is greater than 0. I guess they go for the logic that setting temp=0 means disabling the randomness, so they choose greedy when temp is set to 0.

vLLM chooses sampling as its default method for choosing the next token, but all other frameworks have greedy as their default. So to unify our handlers' behavior, here we are trying to bring the sampling method into parity. But yes, the default temp value would not be the same as in other backends.

Contributor: I see. It can be in another PR, or it might not be feasible at all, but if the default sampling behavior is different across backends when do_sample=True, that could cause confusion when users switch between backends. I wonder if it's possible (or worth it) to try to unify the behavior when do_sample=True across all backends (meaning temperature and other params set to the same values).

Contributor Author: Agreed, will write an internal Quip doc for it.
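
For illustration, a minimal standalone sketch of the parity logic discussed in this thread (the helper name is hypothetical and this block is not part of the PR's diff), assuming, as the code and comments above state, that vLLM falls back to greedy decoding when temperature is 0 while the other backends default to greedy unless do_sample=True:

```
def map_do_sample_for_vllm(parameters: dict) -> dict:
    """Hypothetical helper: align vLLM's do_sample semantics with other backends."""
    if not parameters.pop("do_sample", False):
        # No sampling requested: force greedy by zeroing the temperature,
        # since vLLM samples whenever temperature > 0.
        parameters["temperature"] = 0
    # If do_sample is True, the temperature is left as supplied (or at vLLM's
    # default), which is where the cross-backend default gap remains.
    return parameters


print(map_do_sample_for_vllm({"max_tokens": 30}))
# {'max_tokens': 30, 'temperature': 0}
print(map_do_sample_for_vllm({"do_sample": True, "temperature": 0.7}))
# {'temperature': 0.7}
```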

    # if temperature is zero, vLLM does greedy sampling
    parameters['temperature'] = 0
if "stop_sequences" in parameters.keys():
    parameters["stop"] = parameters.pop("stop_sequences")
if "ignore_eos_token" in parameters.keys():
    parameters["ignore_eos"] = parameters.pop("ignore_eos_token")
if "num_beams" in parameters.keys():
    parameters["best_of"] = parameters.pop("num_beams")
    parameters["use_beam_search"] = True
return parameters

@stop_on_any_exception
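
For a concrete sense of the translation above, a hedged usage sketch: the standalone function below reproduces the mapping from the hunk above (outside its class, so the example runs on its own), and the request values are made up.

```
def translate_vllm_params(parameters: dict) -> dict:
    # Standalone reproduction of the translation shown above; in the handler
    # this logic lives as a method on the rolling batch class.
    if "seed" in parameters:
        parameters["seed"] = int(parameters["seed"])
    if "max_new_tokens" in parameters:
        parameters["max_tokens"] = parameters.pop("max_new_tokens")
    if not parameters.pop("do_sample", False):
        # if temperature is zero, vLLM does greedy sampling
        parameters["temperature"] = 0
    if "stop_sequences" in parameters:
        parameters["stop"] = parameters.pop("stop_sequences")
    if "ignore_eos_token" in parameters:
        parameters["ignore_eos"] = parameters.pop("ignore_eos_token")
    if "num_beams" in parameters:
        parameters["best_of"] = parameters.pop("num_beams")
        parameters["use_beam_search"] = True
    return parameters


# Made-up request parameters in the common (HF-style) schema.
request = {"max_new_tokens": 64, "do_sample": True, "seed": "42", "stop_sequences": ["\n\n"]}
print(translate_vllm_params(request))
# {'seed': 42, 'max_tokens': 64, 'stop': ['\n\n']}
```
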
15 changes: 7 additions & 8 deletions serving/docs/lmi/user_guides/lmi_input_output_schema.md
@@ -51,6 +51,8 @@ When providing inputs following the input schema as a string, the output's gener

### Common rolling batch input parameters
```
'do_sample' : boolean (default = False),
'seed' : integer (default = random value),
'temperature' : float (default= 1.0),
'repetition_penalty': float (default= 1.0),
'top_k' : integer (default = 0),
@@ -68,9 +70,7 @@ Apart from these common parameters, there are other parameters that are specific

```
DeepSpeedRollingBatchParameters : {
'typical_p' : float (default= 1.0),
'do_sample' : boolean (default = False),
'seed' : integer (default = 0),
'stop_sequences' : list (default = None),
'truncate' : integer (default = None),
}
@@ -83,8 +83,6 @@
```
LmiDistRollingBatchParameters : {
'typical_p' : float (default= 1.0),
'do_sample' : boolean (default = false),
'seed' : integer (default = 0),
'stop_sequences' : list (default = None),
'truncate' : integer (default = None),
'ignore_eos_token' : boolean (default = false)
@@ -98,11 +96,13 @@
```
vLLMRollingBatchParameters : {
'stop_sequences' : list,
'best_of' : int (default = None),
'temperature' : float (default= 0),
'top_k' : integer (default = -1)

'min_p': float (default = 0.0),
'presence_penalty': float (default = 0.0),
'frequency_penalty' : float (default = 0.0),
'use_beam_search': boolean (default = false),
'num_beams': integer (default = 1), (set this greater than 1 to enable beam search)
'stop_token_ids': list (default = None),
'include_stop_str_in_output' : boolean (default = false),
'ignore_eos_token' : boolean (default = false),
@@ -124,7 +124,6 @@
'max_new_tokens' : integer (default = 128),
'top_k' : integer (default = 5),
'top_p' : float (default= 0.85),
'seed' : integer (default = None),
'details' : boolean (default = false),
'stop' : boolean,
'presence_penalty': float,