Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) #3290
Conversation
Add non-MI300 compatible alternative for bulk conversions
Removed bf8 (e5m2) and renamed f8 to fp8 to explicitly specify that it is e4m3
Removed stochastic rounding for simplicity
Put bulk fp8 conversion hip intrinsics behind a define. Disabled by default
Using types from the proper vllm headers. Added namespace
Move amd specific headers under amd_detail
Greg/fp8 tests
Reduce fp8 range in the conversion test to match e4m3
Add other MI300 architectures to the list
Simplify device guard use in conversion kernel
Rename remaining fp8_e5m2 to general fp8
…m3-kvcache-rocm
Enable FP8 E4M3 KV Cache
…ing factor instead of Tensor, should be working out the box
…; Isolating math works
…, remove PT support from scales extraction utility
Overall the PR LGTM now. @zhuohan123 could you please take some time to review it?
Approving per @zhaoyang-star's review.
3rdparty/README.md
Outdated
@@ -0,0 +1,32 @@
### Quantizer Utilities
I'm confused about why we need to vendor the script. Is this the same script from AMMO, or written by you?
@simon-mo, this is a script taken from NVIDIA's quantizer examples. We kept it under 3rdparty with its license banner unchanged. It is included only for reference and convenience.
If there's no change, let's not include it, please. We should be comfortable referring users to the AMMO/quantizer repo to perform the quantization.
#include "../quantization/fp8_e5m2_kvcache/quant_utils.cuh" | ||
#elif defined(ENABLE_FP8_E4M3) | ||
#include "../quantization/fp8/amd_detail/quant_utils.cuh" |
I thought E4M3 is NVIDIA-compatible as well? Also, is it possible to enable both?
@simon-mo, OCP E4M3 is NVIDIA-compatible, but we don't address the NVIDIA platform in this PR.
These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If
this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an
unquantized model through a quantizer tool (e.g. AMD quantizer or nVIDIA AMMO).
unquantized model through a quantizer tool (e.g. AMD quantizer or nVIDIA AMMO).
unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
Also add a link to these?
The details for fetching NVIDIA AMMO are in the README; do you want that here as well?
unquantized model through a quantizer tool (e.g. AMD quantizer or nVIDIA AMMO).

Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent
silicon offerings e.g. AMD MI300, nVIDIA Hopper or later support native hardware conversion to/from fp32, fp16, bf16, etc.
silicon offerings e.g. AMD MI300, nVIDIA Hopper or later support native hardware conversion to/from fp32, fp16, bf16, etc.
silicon offerings e.g. AMD MI300, NVIDIA Hopper or later support native hardware conversion to/from fp32, fp16, bf16, etc.
Okay, will do a full replacement. By the way, nVIDIA is official.
it's a retired name: https://forums.tomshardware.com/threads/nvidias-name-change.3644596/
@simon-mo, thanks for updating me on this 👍 those were changed.
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(temperature=1.2, top_p=0.9)
llm = LLM(model="/data/models/llama-2-7b-chat-hf",
Use an off-the-shelf one from Hugging Face so users can use it.
/data/models/llama-2-7b-chat-hf is the local path of a converted HF model (converted from the Meta-released Llama 2 model in the standard manner). Pointing to an HF model path instead will behave the same; will update after some verification.
Ping on this?
Changed to HuggingFace off-the-shelf model.
Why are we only doing this for Llama? Are other models supported as well?
@simon-mo, we plan to enable all other models (that both vLLM and the quantizer/AMMO support) once this PR (design) is approved. At present, other models will only use the default scaling factor of 1.0. We will send a follow-up PR for that.
In that case, please note this in the documentation.
# quantized_value * scaling_factor ~= true_value
# which is consistent with the practice of setting
# scaling_factor = tensor_amax / FPtype_max
scaling_factor *= 2
The comment makes sense, but not the *= 2.
We do the *2 only for HIP, to deal with a difference in the FP8 numerics of our chip. After the *2, the overall effect is identical to running without it on NVIDIA.
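For context, here is a minimal Python sketch of how a per-tensor KV cache scaling factor is typically derived, mirroring the comment above; the constant and helper name are illustrative, not code from this PR:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in OCP FP8 E4M3


def compute_kv_scale(tensor: torch.Tensor, is_hip: bool = False) -> float:
    # scaling_factor = tensor_amax / FPtype_max, so that
    # quantized_value * scaling_factor ~= true_value
    scaling_factor = tensor.abs().max().item() / FP8_E4M3_MAX
    if is_hip:
        # Mirrors the *2 above: applied only on HIP to account for the
        # FP8 numerics of the AMD hardware, as explained in the reply.
        scaling_factor *= 2
    return scaling_factor
```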
@@ -275,6 +275,107 @@ def hf_model_weights_iterator(
torch.cuda.empty_cache()


def kv_cache_scales_loader(
Is this the best way to distribute the scaling factors? I wonder whether there can be a convention to include them in the model's weights dictionary, similar to how other quantization methods support it.
Most other methods are weight quantization only; we did look. Here we first introduce scaling factors for the KV cache (which are not weights); soon we will also add scaling factors for activations (tensor Xs). We found that adding them as extended parameters this way makes sense and is also lightweight.
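To make the "extended parameters" idea concrete, here is a sketch of what a per-TP-rank, per-layer KV cache scales JSON could look like; the field names are illustrative assumptions, not necessarily the exact schema shipped in this PR:

```python
import json

# Hypothetical layout: the outer key is the TP rank, the inner key is the
# layer index, and each value is a scalar scaling factor for that layer's
# KV cache.
example = {
    "model_type": "llama",
    "kv_cache": {
        "dtype": "float8_e4m3fn",
        "scaling_factor": {
            "0": {"0": 0.0152, "1": 0.0187, "2": 0.0174},
        },
    },
}

with open("kv_cache_scales.json", "w") as f:
    json.dump(example, f, indent=2)
```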
vllm/model_executor/weight_utils.py
Outdated
# Since the number of layers is small and (for now) we use scalar
# scaling factors (so the size they use is also small), this is
# not a concern at present.
schema = json.load(f, parse_int=int, parse_constant=float)
schema = json.load(f, parse_int=int, parse_constant=float)
schema = json.load(f)
@@ -275,6 +275,107 @@ def hf_model_weights_iterator(
torch.cuda.empty_cache()


def kv_cache_scales_loader(
This function is not simple. We should not do this validation ourselves. Please use the pydantic library to define the schema to be read into: https://docs.pydantic.dev/latest/
I was looking to refactor the logic with jsonschema/pydantic once a formal schema is established for quantization params, so we can generalize across models and also integrate it elsewhere (potential candidates: the scales extraction utility, quantization config). But good call on at least doing the basic structure checking with Pydantic for now.
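For illustration, a stripped-down sketch of the kind of Pydantic structure checking discussed here; the real KVCacheQuantSchema/QuantParamSchema in the PR carry additional fields and validators (e.g. checking that the expected TP rank and all layers are present):

```python
from typing import Dict

from pydantic import BaseModel


class KVCacheQuantSchema(BaseModel):
    dtype: str
    # scaling_factor[tp_rank][layer_index] -> scalar KV cache scaling factor
    scaling_factor: Dict[int, Dict[int, float]]


class QuantParamSchema(BaseModel):
    kv_cache: KVCacheQuantSchema


# Usage sketch: parse and validate a quantization param JSON file.
# with open(path) as f:
#     schema = QuantParamSchema.model_validate_json(f.read())
```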
Concerns about the API naming (I would like to hear @zhuohan123's and @WoosukKwon's thoughts on this):
- --quantization-param-path: should it be named more narrowly, something like --scaling-factor-per-layer-json? Even with this, I find it difficult for users who don't understand the JSON format.
- Naming of kv_scale in the paged attention kernel: should it be called scaling_factor or fp8_scaling?
Overall I think once the remaining comments are settled and these two API design questions are resolved, this PR is in good shape to merge.
For the future, it would be a lot easier to review if the renaming part were isolated into a separate PR.
csrc/attention/attention_kernels.cu
Outdated
Quant_vec k_vec_quant = *reinterpret_cast<const Quant_vec*>(k_ptr + offset1 * BLOCK_SIZE * x + offset2);
// Vector conversion from Quant_vec to K_vec.
k_vecs[j] = fp8_e5m2_unscaled::vec_conversion<K_vec, Quant_vec>(k_vec_quant);
#elif defined(ENABLE_FP8_E4M3)
Quant_vec k_vec_quant = *reinterpret_cast<const Quant_vec*>(k_ptr + offset1 * BLOCK_SIZE * x + offset2);
// Vector conversion from Quant_vec to K_vec. Scaled conversion: FP8 => higher precision
What does "Scaled conversion: FP8 => higher precision" mean here. Please use full sentence here to help maintainers understand. Do you mean "We use the scaled_vec_conversion
library function for better precision"?
solved
==================

Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the
cache, improving throughput. OCP specifies two common floating point data formats: E5M2 (5 exponent bits and 2 mantissa
What is OCP? Explain it in this doc for users.
solved
These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If
this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an
unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO [pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo]).
The formatting is off. Please use a code block properly:
https://vllm--3290.org.readthedocs.build/en/3290/quantization/fp8_e4m3_kvcache.html
solved
unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO [pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo]).

Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent
silicon offerings e.g. AMD MI300, NVIDIA Hopper or later support native hardware conversion to/from fp32, fp16, bf16, etc.
MI300 or MI300x?
to/from -> to and from
In the codebase we use MI300 so as not to differentiate between MI300 sub-models, though MI300X is a common SKU.
solved
# two float8_e4m3fn kv cache scaling factor files are provided at tests/fp8_kv,
# refer to tests/fp8_kv/README.md to generate kv_cache_scales.json of your own.
Please provide github link instead. Something like
https://github.com/vllm-project/vllm/blob/main/tests/fp8_kv/README.md
It will 404 now but works later.
What is the "Ping on this" below about? The code seems to be different now; previously we added a comment line to show the non-local HF model path to llama-2-7b-chat-hf.
Changed to HF-remote llama-2-7b path.
vllm/model_executor/models/llama.py
Outdated
@@ -402,3 +414,28 @@ def load_weights(self,
weight_loader = getattr(param, "weight_loader",
default_weight_loader)
weight_loader(param, loaded_weight)

# Should not be called unless the KV cache dtype is FP8 on ROCm (AMD GPU)
In that case, please use an assert to confirm the invariant.
I've removed the comment, it was meant to be more informative ("FYI, this function is for scaled KV cache, which is currently enabled on ROCm only") rather than bad behavior we should guard against. In particular, the scaling factor is only used when the KV cache dtype is FP8 and on ROCm, so calling this function in other settings has no observable side effects. So there's no need for an assert here.
More broadly, the current design largely decouples the KV cache implementation from the model implementation (which makes sense, as KV caches are not theoretically necessary). IMO, guarding against potential misuse (which is side effect free anyway) isn't a strong enough reason to newly endow the model with the ability to introspect KV cache details.
vllm/model_executor/weight_utils.py
Outdated
logger.warning("Defaulting to KV cache scaling factors = 1.0 "
f"for all layers in TP rank {tp_rank} "
"as an error occurred during loading.")
return ()
Returning an empty tuple is not common in Python; it took me a while to understand.
return ()
return []
Done for the list-stans.
vllm/model_executor/weight_utils.py
Outdated
# schemas out into a separate file that deals solely with quantization
# params and their related schemas so that they can be generalized and
# shared across various use cases.
class KVCacheQuantSchema(BaseModel):
move out of this function please
Done.
vllm/model_executor/weight_utils.py
Outdated
f"TP rank {tp_rank}.") | ||
return self | ||
|
||
class QuantParamSchema(BaseModel): |
Move this out of the function, please. An inline class definition performs badly because it is evaluated each time.
Performance should not be a significant consideration here. This is called at most tp_size times during the loading process and the runtime of this function is far eclipsed by the time it takes to load GB-scale weights. Good object-oriented decomposition considerations are more salient. In any case, they've been moved out of the function.
vllm/model_executor/weight_utils.py
Outdated
@computed_field
@property
def rank_keyword(self) -> str:
# Each TP rank key should be prefixed by a common rank_keyword.
I didn't realize this is TP-dependent. Please add this to the documentation!
Rank keyword has been removed: the extra processing was more trouble than it's worth.
…mments, bring in NCCL fixes from upstream
Seems like this PR is failing against the main branch now.
Adding functionality to ingest scaling factors upon merge of the PR vllm-project#3290
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Can the files …
That was the plan, once we have the common fp8 header released.
As part of a series of FP8 developments in vLLM, we address an OCP-format (NVIDIA-compatible) FP8 KV cache in this pull request. We elaborated upon the previous #2279, but made the following changes, enhancements, and extensions:
Design reference:
Scope:
- The quantization param JSON follows the schema proposed in #2461, without the activation and weights sections. The quantizer's output may need to be formatted into a JSON file based on that schema for the vLLM code to consume; a utility script (3rdparty/quantizer/extract_scales.py) is provided to generate the JSON from AMMO's output.
- Another script (3rdparty/quantizer/quantize.py) is provided for using AMMO to quantize an HF model to FP8 with an FP8 KV cache, so that KV cache scaling factors are generated (over a calibration dataset, which you can change to your domain of interest); details are in 3rdparty/README.md.
- e4m3fn is enabled; this comes with hardware support (so it is performant) on AMD MI3xx GPUs. The same design is still functional, but less performant, on earlier AMD GPUs. The current design does not cover CUDA devices.
Scaling semantics:
scaled_to_fp8_quant: fp8_tensor = fp8_quant(higher_precision_tensor / scaling_factor)
scaled_fr_fp8_dequant: higher_precision_tensor = fp8_dequant(fp8_tensor) * scaling factor
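A minimal PyTorch sketch of these two operations, for illustration only (it assumes torch.float8_e4m3fn is available, i.e. PyTorch 2.1+; the actual conversions in this PR happen inside the CUDA/HIP kernels):

```python
import torch


def scaled_to_fp8_quant(x: torch.Tensor, scaling_factor: float) -> torch.Tensor:
    # fp8_tensor = fp8_quant(higher_precision_tensor / scaling_factor)
    return (x / scaling_factor).to(torch.float8_e4m3fn)


def scaled_fr_fp8_dequant(x_fp8: torch.Tensor, scaling_factor: float) -> torch.Tensor:
    # higher_precision_tensor = fp8_dequant(fp8_tensor) * scaling_factor
    return x_fp8.to(torch.float16) * scaling_factor


x = torch.randn(8, dtype=torch.float16)
x_rec = scaled_fr_fp8_dequant(scaled_to_fp8_quant(x, 0.02), 0.02)  # approximately x
```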
Usage:
To start, please refer to:
Two example JSON files are provided under:
If you run vLLM with kv_cache_dtype="fp8" but do not provide a JSON file containing scaling factors, no scaling will be applied for FP8 (e4m3fn) quantization, which may lead to less accurate results.
Manual execution:
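As a rough usage sketch (the model id and the scales-JSON path below are placeholders, and quantization_param_path is assumed to be the engine-level counterpart of the --quantization-param-path option discussed above):

```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=1.2, top_p=0.9)
# Placeholder model id and scales path; point these at your own checkpoint and
# the kv_cache_scales.json generated by the quantizer utilities.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          kv_cache_dtype="fp8",
          quantization_param_path="./tests/fp8_kv/kv_cache_scales.json")
outputs = llm.generate("Hello, my name is", sampling_params)
print(outputs[0].outputs[0].text)
```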
Performance:
We observed 20~30% performance increases over the FP16 baseline just by turning the KV cache to FP8 (e4m3fn), even for the 70B model served by a single MI300X.
WizardCoder-34b score, dataset: HumanEval-Python-EN on 1-GPU MI300X
Contributors:
@HaiShaw, @AdrianAbeyta, @gshtras, @mawong-amd, @Alexei-V-Ivanov-AMD