Add nvfp4 support #56
Conversation
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the vllm framework by adding full support for NVIDIA FP4 (NVFP4) quantization. It provides the necessary infrastructure for quantizing and dequantizing tensors, and seamlessly integrates this new low-precision format into the model executor's linear and Mixture-of-Experts (MoE) layers. The changes are designed to optimize model inference, especially on Habana Processing Units, by enabling efficient handling and computation with NVFP4 weights.
Highlights
- New Quantization Scheme: Introduced comprehensive support for NVIDIA FP4 (NVFP4) quantization, enabling models to leverage this low-precision format for improved efficiency.
- Core Quantization Utilities: Added a new utility file (nvfp4_qdq.py) containing the fundamental logic for NVFP4 quantization and dequantization, including functions for packing/unpacking, scaling, and casting to FP4 (see the sketch after this list).
- MoE Layer Integration: Integrated NVFP4 support into the Mixture-of-Experts (MoE) layers, modifying weight processing to unpack NVFP4 weights into FP8 format for efficient execution, particularly on Habana Processing Units (HPUs).
- Linear Layer Implementation: Updated the compressed_tensors_w4a4_nvfp4 scheme to directly handle NVFP4 linear layer operations, including weight unpacking and the application of quantization-aware matrix multiplications.
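For orientation before the review below: NVFP4 stores two FP4 (E2M1) codes per byte, with the sign in the top bit of each nibble and a 3-bit index into the E2M1 magnitude table. A minimal sketch of that unpack convention, mirroring the logic in the nvfp4_qdq.py diff further down (the helper name here is illustrative, not part of the PR):

import torch

# E2M1 magnitude table used by NVFP4; the sign lives in bit 3 of each nibble.
kE2M1ToFloat = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def unpack_fp4_nibbles(packed: torch.Tensor) -> torch.Tensor:
    """Expand each uint8 into two FP4 values (low nibble first, then high)."""
    assert packed.dtype == torch.uint8
    flat = packed.flatten()
    low, high = flat & 0x0F, (flat & 0xF0) >> 4
    codes = torch.stack((low, high), dim=1).flatten()
    sign = torch.where((codes & 0x08).bool(), -1.0, 1.0)
    return sign * kE2M1ToFloat.to(packed.device)[(codes & 0x07).long()]

Per-block FP8 (E4M3) scales and a per-tensor FP32 global scale are then applied on top of these raw E2M1 values, which is what the dequantization helpers in the diff handle.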
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Code Review
The pull request introduces nvfp4 support, including a new utility file and modifications to MoE and linear layers. However, there are critical issues in the CUDA path of compressed_tensors_w4a4_nvfp4.py and in the new nvfp4_qdq.py file, which contains module-level side effects and unused/broken code. Addressing these points will greatly improve the robustness and maintainability of the new functionality.
def process_weights_after_loading(self, layer) -> None:
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        unpacked_nvfp4_to_fp8,
    )

    logger.debug(
        f"start processing weights for {getattr(layer, 'prefix', 'unknown')}"
    )
    global_input_scale = layer.input_global_scale.max().to(torch.float32)
    layer.input_global_scale = Parameter(global_input_scale,
                                         requires_grad=False)
    layer.input_global_scale = Parameter(
        global_input_scale, requires_grad=False
    )

    layer.weight_global_scale = Parameter(
        layer.weight_global_scale.max().to(torch.float32),
        requires_grad=False)

    swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
    layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
                                            requires_grad=False)

    # required by cutlass kernel; need Parameter, not ModelWeightParameter
    layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)

    if self.cutlass_nvfp4_supported:
        layer.alpha = Parameter(layer.input_global_scale *
                                layer.weight_global_scale,
                                requires_grad=False)
            requires_grad=False,
        )

    # weight_global_scale_data = layer.weight_global_scale.data
    weight_scale_data = layer.weight_scale.data
    weight_packed_data = layer.weight_packed.data
    weight_unpacked = unpacked_nvfp4_to_fp8(weight_packed_data)
    layer.weight_scale = torch.nn.Parameter(
        weight_scale_data, requires_grad=False
    )

    layer.weight_unpacked = torch.nn.Parameter(
        weight_unpacked, requires_grad=False
    )
    del layer.weight_packed
    torch.hpu.synchronize()

    # swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
    # layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
    #                                         requires_grad=False)

    # # required by cutlass kernel; need Parameter, not ModelWeightParameter
    # layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)

    # if self.cutlass_nvfp4_supported:
    #     layer.alpha = Parameter(layer.input_global_scale *
    #                             layer.weight_global_scale,
    #                             requires_grad=False)
The original CUDA path in process_weights_after_loading has been commented out and replaced with HPU-specific logic, breaking functionality for CUDA devices using the nvfp4 quantization scheme. Preserve the CUDA path and add the new HPU logic within a conditional block, checking for self.cutlass_nvfp4_supported to execute the appropriate logic for each platform.
def process_weights_after_loading(self, layer) -> None:
    global_input_scale = layer.input_global_scale.max().to(torch.float32)
    layer.input_global_scale = Parameter(
        global_input_scale, requires_grad=False
    )
    layer.weight_global_scale = Parameter(
        layer.weight_global_scale.max().to(torch.float32),
        requires_grad=False,
    )
    if self.cutlass_nvfp4_supported:
        swizzled_weight_scale = self.swizzle_blockscale(layer.weight_scale)
        layer.weight_scale_swizzled = Parameter(swizzled_weight_scale,
                                                requires_grad=False)
        # required by cutlass kernel; need Parameter, not ModelWeightParameter
        layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)
        layer.alpha = Parameter(layer.input_global_scale *
                                layer.weight_global_scale,
                                requires_grad=False)
    else:
        from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
            unpacked_nvfp4_to_fp8,
        )
        logger.debug(
            f"start processing weights for {getattr(layer, 'prefix', 'unknown')}"
        )
        weight_scale_data = layer.weight_scale.data
        weight_packed_data = layer.weight_packed.data
        weight_unpacked = unpacked_nvfp4_to_fp8(weight_packed_data)
        layer.weight_scale = torch.nn.Parameter(
            weight_scale_data, requires_grad=False
        )
        layer.weight_unpacked = torch.nn.Parameter(
            weight_unpacked, requires_grad=False
        )
        del layer.weight_packed
        torch.hpu.synchronize()

os.environ["PT_HPU_LAZY_MODE"] = "1"
| os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "1" |
import os

os.environ["PT_HPU_LAZY_MODE"] = "1"
def unpack_fp4_from_uint8(
    a: torch.Tensor,
    m: int,
    n: int,
    dtype: Optional[torch.dtype] = torch.bfloat16,
) -> torch.Tensor:
    """
    Unpacks uint8 values into fp4. Each uint8 consists of two fp4 values
    (i.e. first four bits correspond to one fp4 value, last four correspond to a consecutive
    fp4 value). The bits represent an index, which are mapped to an fp4 value.
    :param a: tensor to unpack
    :param m: original dim 0 size of the unpacked tensor
    :param n: original dim 1 size of the unpacked tensor
    :param dtype: dense dtype to cast the unpacked tensor to
    """
    assert a.dtype == torch.uint8, f"Ex got{a.dtype}"

    # Vectorized nibble processing
    a_flat = a.flatten()
    high = (a_flat & 0xF0) >> 4  # Upper nibbles
    low = a_flat & 0x0F  # Lower nibbles

    # Combine nibbles for batch processing
    combined = torch.stack((low, high), dim=1).flatten()

    # Vectorized sign and magnitude extraction
    signs = (combined & 0x08).to(torch.bool)  # Sign bits
    abs_vals = (combined & 0x07).to(torch.long)  # Magnitude indices

    # Device-aware lookup and sign application
    kE2M1 = kE2M1ToFloat.to(device=a.device)
    values = kE2M1[abs_vals] * torch.where(signs, -1.0, 1.0)
    # breakpoint()
    # Reshape to final form
    return values.reshape(m, n).to(dtype=dtype)


# >>>>>>>>>>>>>>>>>>

SBITS, EBITS_F32, MBITS_F32 = 1, 8, 23
EBITS_BF16, MBITS_BF16 = 8, 7
EBITS_F4_E2M1, MBITS_F4_E2M1 = 2, 1
EBITS_F6_E2M3, MBITS_F6_E2M3 = 2, 3
EBITS_F6_E3M2, MBITS_F6_E3M2 = 3, 2
EBITS_F8_E4M3, MBITS_F8_E4M3 = 4, 3
EBITS_F8_E5M2, MBITS_F8_E5M2 = 5, 2
from torchao.prototype.mx_formats.mx_tensor import to_dtype as ao_to_dtype

import os

_USE_CT_UNPACK = os.getenv("USE_CT_UNPACK", "0").lower() in ("1", "true", "yes")

from enum import Enum, auto
from typing import Callable, Dict, Union

import torch
import os
from torchao.prototype.mx_formats.constants import (
    F4_E2M1_MAX,
    F8E4M3_MAX,
)


def per_tensor_amax_to_scale(amax: torch.Tensor) -> torch.Tensor:
    """Convert per-tensor amax to per-tensor scale.
    Used to scale fp32 scales down to fp8 scales
    Args:
        amax: Per-tensor amax tensor
    Returns:
        torch.Tensor: Per-tensor scale tensor
    """
    return torch.clamp(amax / F8E4M3_MAX, min=E4M3_EPS, max=F8E4M3_MAX).to(
        torch.float32
    )


E4M3_EPS = torch.finfo(torch.float8_e4m3fn).tiny
from compressed_tensors.quantization.quant_args import FP4_E2M1_DATA
from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized import (
    pack_fp4_to_uint8,
    unpack_fp4_from_uint8,
)
This module contains several issues related to imports and unused code:
- The function unpack_fp4_from_uint8 is defined locally (line 51) but is then re-imported from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized (line 134), shadowing the local definition. If the imported version is the one intended for use, the local definition should be removed.
- There are many unused imports, such as FP4_E2M1_DATA, ao_to_dtype, dequant_mx_fp8, quant_mx_fp8, etc. Remove these to clean up the code.
- _USE_CT_UNPACK is defined but never used (a trimmed header is sketched below).
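As a rough illustration of the cleanup this comment asks for, the module header might shrink to something like the following. This is a sketch only; it assumes the compressed_tensors helpers are the intended implementation and that F8E4M3_MAX is still needed by per_tensor_amax_to_scale.

import torch
from compressed_tensors.compressors.quantized_compressors.nvfp4_quantized import (
    pack_fp4_to_uint8,
    unpack_fp4_from_uint8,
)
from torchao.prototype.mx_formats.constants import F8E4M3_MAX

# Smallest positive normal value of float8_e4m3fn, used to clamp per-tensor scales.
E4M3_EPS = torch.finfo(torch.float8_e4m3fn).tiny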
def test_qdq(args):
    hidden_size = args.hidden_size
    intermediate_size = args.intermediate_size
    cpu_tensor = torch.randn(
        hidden_size, intermediate_size, dtype=torch.bfloat16
    )
    hpu_tensor = cpu_tensor.to("hpu")
    # scale_e8m0_biased, quant_tensor = quant_mx_fp8(hpu_tensor)
    # dequant_tensor = dequant_mx_fp8(quant_tensor, scale_e8m0_biased, block_size=32)
    # print(f"quant_tensor shape: {quant_tensor.shape}, dtype: {quant_tensor.dtype}")
    dequant_tensor = quant_dequant_mxfp4(hpu_tensor)

    diff = torch.abs(hpu_tensor - dequant_tensor)
    print(f"diff: max: {diff.max()}, min: {diff.min()}, mean: {diff.mean()}")
    ht.hpu.synchronize()


def time_fn(func, times, warmup=50):
    torch.hpu.synchronize()
    for _ in range(warmup):
        func()
    torch.hpu.synchronize()
    gc.collect()

    start = time.time()
    for _ in range(times):
        func()
    torch.hpu.synchronize()
    end = time.time()
    return (end - start) / times


def test_linear(args):
    hidden_size = args.hidden_size
    intermediate_size = args.intermediate_size
    batch_size = args.bs
    cpu_tensor = torch.randn(batch_size, hidden_size, dtype=torch.bfloat16)
    hpu_tensor = cpu_tensor.to("hpu")
    linear = torch.nn.Linear(
        in_features=hidden_size, out_features=intermediate_size, bias=False
    ).to("hpu")
    nvfp4_linear = NVFP4Linear.from_linear(linear)
    nvfp4_linear = nvfp4_linear.to("hpu")
    nvfp4_unpacked_linear = NVFP4LinearUnpacked.from_linear(linear)
    nvfp4_unpacked_linear = nvfp4_unpacked_linear.to("hpu")

    linear = wrap_in_hpu_graph(linear)
    nvfp4_linear = wrap_in_hpu_graph(nvfp4_linear)
    nvfp4_unpacked_linear = wrap_in_hpu_graph(nvfp4_unpacked_linear)

    out_ref = linear(hpu_tensor)
    out1 = nvfp4_linear(hpu_tensor)
    out2 = nvfp4_unpacked_linear(hpu_tensor)
    print(
        f"diff ref vs nvfp4: max: {torch.abs(out_ref - out1).max()}, min: {torch.abs(out_ref - out1).min()}, mean: {torch.abs(out_ref - out1).mean()}"
    )
    print(
        f"diff ref vs nvfp4 unpacked: max: {torch.abs(out_ref - out2).max()}, min: {torch.abs(out_ref - out2).min()}, mean: {torch.abs(out_ref - out2).mean()}"
    )
    print(
        f"diff nvfp4 vs nvfp4 unpacked: max: {torch.abs(out1 - out2).max()}, min: {torch.abs(out1 - out2).min()}, mean: {torch.abs(out1 - out2).mean()}"
    )
    torch.hpu.synchronize()
    latency0 = time_fn(lambda: linear(hpu_tensor), args.bench_steps)
    latency1 = time_fn(lambda: nvfp4_linear(hpu_tensor), args.bench_steps)
    latency2 = time_fn(
        lambda: nvfp4_unpacked_linear(hpu_tensor), args.bench_steps
    )
    print(
        f"latency linear: {latency0:.6f} s, nvfp4_linear: {latency1:.6f} s, nvfp4_unpacked_linear: {latency2:.6f} s"
    )

    print(
        f"speed up nvfp4_linear: {latency0 / latency1:.2f}x, nvfp4_unpacked_linear: {latency0 / latency2:.2f}x"
    )
    print(
        f"speed up nvfp4_unpacked_linear vs nvfp4_linear: {latency1 / latency2:.2f}x"
    )
    exit()
    profile_steps = args.profile_steps
    warmup_steps = args.warmup_steps
    if profile_steps > 0:
        activities = [
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.HPU,
        ]
        schedule = torch.profiler.schedule(
            wait=0, warmup=warmup_steps, active=profile_steps, repeat=1
        )
        print(f"Profiling steps {profile_steps} with warmup {warmup_steps}")
        with torch.profiler.profile(
            activities=activities,
            schedule=schedule,
            on_trace_ready=torch.profiler.tensorboard_trace_handler(
                f"./profile_mxfp4/"
            ),
            record_shapes=True,
            with_stack=True,
        ) as profiler:
            for i in range(warmup_steps + profile_steps):
                out_ref = linear(hpu_tensor)
                out1 = nvfp4_linear(hpu_tensor)
                out2 = nvfp4_unpacked_linear(hpu_tensor)
                # result2 = dist.all_reduce(result2, op=dist.ReduceOp.SUM)
                ht.hpu.synchronize()
                profiler.step()
            profiler.stop()
        ht.hpu.synchronize()


# if __name__ == "__main__":
#     parser = argparse.ArgumentParser()
#     parser.add_argument("--bs", default=32, type=int)
#     parser.add_argument("--hidden_size", "-H", default=7168, type=int)
#     parser.add_argument("--intermediate_size", "-I", default=2048, type=int)
#     parser.add_argument("--num_total_experts", "-E", default=256, type=int)
#     parser.add_argument("--ep_size", "-P", default=8, type=int)
#     parser.add_argument("--topk", "-K", default=8, type=int)
#     parser.add_argument("--warmup_steps", default=5, type=int)
#     parser.add_argument("--bench_steps", "-S", default=1000, type=int)
#     parser.add_argument("--profile_steps", default=0, type=int)
#     args = parser.parse_args()

#     test_linear(args)
The code inside the if __name__ == "__main__": block appears to be for testing or experimentation and is currently commented out. It also references an undefined function quant_dequant_mxfp4 and contains an exit() call. Move this block to a dedicated test file under the tests/ directory or remove it entirely if it's just a scratchpad.
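If the correctness check is worth keeping, it could become a small pytest module. The following is a sketch only: the file path, shapes, tolerance, and the CPU input are illustrative placeholders, and the real test would likely need to target an HPU device.

# tests/quantization/test_nvfp4_qdq.py (hypothetical location)
import pytest
import torch

from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import qdq_nvfp4


@pytest.mark.parametrize("shape", [(128, 256), (1024, 512)])
def test_nvfp4_qdq_roundtrip(shape):
    x = torch.randn(*shape, dtype=torch.bfloat16)
    x_qdq = qdq_nvfp4(x)
    # Quant/dequant is lossy; only check the shape and that the error stays bounded.
    assert x_qdq.shape == x.shape
    assert torch.abs(x.float() - x_qdq.float()).mean() < 0.5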
model_path = "/software/users/yiliu4/HF_HOME/weiweiz1/DeepSeek-V2-Lite-MXFP4-autoround"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/DeepSeek-V2-Lite-NVFP4-llm-compressor/"
model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"
# model_path = "/software/users/yiliu4/HF_HOME/Yi30/Llama-3.2-1B-Instruct-NVFP4-llm-compressor"
def nvfp4_unpacked_weight_gemm(
    x, weight_unpacked, weight_scale, weight_global_scale
):
    # return self.run_nvfp4_emulations(x, layer)
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        unpacked_nvfp4_to_fp8,
        dequant_nvfp4,
        qdq_nvfp4,
    )

    # bs, seq_len, hidden_size = x.shape
    # x = x.reshape(bs * seq_len, hidden_size)
    hp_weight = dequant_nvfp4(
        data_lp=weight_unpacked,
        out_scales=weight_scale,
        per_tensor_scale=weight_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )

    # breakpoint()
    x = qdq_nvfp4(x)
    out = x @ hp_weight.t()
    # out = out.reshape(bs, seq_len, -1)
    return out
This new function contains a few items that should be cleaned up:
- The from vllm.model_executor... import is inside the function. While sometimes necessary, it's generally better to have imports at the top of the file for readability.
- unpacked_nvfp4_to_fp8 is imported but not used within this function and should be removed.
- There are several commented-out lines (e.g., # breakpoint(), reshaping logic) that should be removed.
- A docstring explaining the function's purpose, parameters, and return values would improve maintainability.
def nvfp4_unpacked_weight_gemm(
    x, weight_unpacked, weight_scale, weight_global_scale
):
    """Performs GEMM for nvfp4 using unpacked weights.

    This is an emulation path for HPU.
    """
    from vllm.model_executor.layers.quantization.utils.nvfp4_qdq import (
        dequant_nvfp4,
        qdq_nvfp4,
    )

    hp_weight = dequant_nvfp4(
        data_lp=weight_unpacked,
        out_scales=weight_scale,
        per_tensor_scale=weight_global_scale,
        original_dtype=x.dtype,
        packed=False,
    )
    x = qdq_nvfp4(x)
    out = x @ hp_weight.t()
    return out

w13_weight_unpacked_lst = []
w2_weight_unpacked_lst = []
for expert_index in range(num_experts):
    w13_weight_unpacked = unpacked_nvfp4_to_fp8(
        w13_weight_packed[expert_index]
    )
    w13_weight_unpacked_lst.append(w13_weight_unpacked)
    w2_weight_unpacked = unpacked_nvfp4_to_fp8(
        layer.w2_weight_packed.data[expert_index]
    )
    w2_weight_unpacked_lst.append(w2_weight_unpacked)
The loops for creating w13_weight_unpacked_lst and w2_weight_unpacked_lst can be made more concise using list comprehensions to improve readability.
w13_weight_unpacked_lst = [
    unpacked_nvfp4_to_fp8(w13_weight_packed[expert_index])
    for expert_index in range(num_experts)
]
w2_weight_unpacked_lst = [
    unpacked_nvfp4_to_fp8(layer.w2_weight_packed.data[expert_index])
    for expert_index in range(num_experts)
]

    packed=False,
)

# breakpoint()
# x[(x >= 0.0) & (x <= 0.25)] = 0.0
# x[(x > 0.25) & (x < 0.75)] = 0.5
# x[(x >= 0.75) & (x <= 1.25)] = 1.0
# x[(x > 1.25) & (x < 1.75)] = 1.5
# x[(x >= 1.75) & (x <= 2.5)] = 2.0
# x[(x > 2.5) & (x < 3.5)] = 3.0
# x[(x >= 3.5) & (x <= 5.0)] = 4.0
# x[x > 5.0] = 6.0
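The commented-out chain above rounds magnitudes onto the FP4 E2M1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, using the midpoints between neighbouring grid values as thresholds. A vectorized sketch of the same idea with torch.bucketize is shown below; it is illustrative only, and tie-breaking at the exact midpoints differs slightly (the masks above appear to round halves to even).

import torch

# E2M1 magnitudes and the midpoints between neighbours
# (0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0 — the same thresholds as the masks above).
E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_MIDPOINTS = (E2M1_VALUES[:-1] + E2M1_VALUES[1:]) / 2

def round_to_e2m1(x: torch.Tensor) -> torch.Tensor:
    """Round |x| to the nearest E2M1 grid value, preserving the sign of x."""
    values = E2M1_VALUES.to(x.device)
    midpoints = E2M1_MIDPOINTS.to(x.device)
    idx = torch.bucketize(x.abs().float(), midpoints)
    return (torch.sign(x) * values[idx]).to(x.dtype)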
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Essential Elements of an Effective PR Description Checklist
Purpose
Test Plan
Test Result