Description
Describe the bug
I found four related zero-handling issues where divisors/scales are not validated before use.
Three can raise `ZeroDivisionError`; one can silently propagate `inf`/`nan`.
Affected locations
- `deepspeed/utils/groups.py`: `_ensure_divisibility(numerator, denominator)` uses `numerator % denominator` without guarding `denominator == 0`.
- `deepspeed/utils/timer.py`: `ThroughputTimer._is_report_boundary()` checks for `None` but not `0` before `self.global_step_count % self.steps_per_output`.
- `deepspeed/inference/v2/inference_utils.py`: `ceil_div(a, b)` returns `-(-a // b)` without guarding `b == 0`.
- `op_builder/hpu/fp_quantizer.py`: `FPQuantizer.dequantize()` computes `(1.0 / scale)` without guarding zero elements in `scale`, which can produce non-finite values and silently corrupt outputs.
To Reproduce
- `_ensure_divisibility(8, 0)` -> `ZeroDivisionError`
- `_is_report_boundary()` with `steps_per_output=0` -> `ZeroDivisionError`
- `ceil_div(10, 0)` -> `ZeroDivisionError`
- `1.0 / torch.tensor([0.0, 1.0])` -> `tensor([inf, 1.])` (same pattern used in the HPU dequantize path)
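The integer cases can be reproduced with minimal stand-ins that mirror the reported patterns (these are sketches of the patterns, not the actual DeepSpeed source):

```python
def ensure_divisibility(numerator, denominator):
    # Reported pattern: modulo with no zero guard, so denominator == 0
    # raises ZeroDivisionError instead of a clear validation error.
    assert numerator % denominator == 0

def ceil_div(a, b):
    # Reported pattern: ceiling division via -(-a // b), no guard on b == 0.
    return -(-a // b)

for fn, args in [(ensure_divisibility, (8, 0)), (ceil_div, (10, 0))]:
    try:
        fn(*args)
    except ZeroDivisionError as e:
        print(f"{fn.__name__}{args} -> ZeroDivisionError: {e}")
```

Both calls surface a raw `ZeroDivisionError` rather than a user-facing message.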
Expected behavior
- Explicit validation for invalid zero values.
- Clear user-facing error messages (or safe clamping where appropriate).
- No raw modulo/division-by-zero exceptions.
- No silent non-finite propagation in dequantization.
Suggested fixes
- `_ensure_divisibility`: add a guard before the modulo (`denominator != 0`).
- `_is_report_boundary`: treat `0` as invalid/disabled (`if not self.steps_per_output:` or explicit `<= 0` validation).
- `ceil_div`: reject `b == 0` with a clear error.
- `FPQuantizer.dequantize`: clamp `scale` to `torch.finfo(scale.dtype).tiny` (or validate and fail clearly) before inversion.
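A minimal sketch of the guarded versions (illustrative helpers, not a patch against the DeepSpeed code; the clamp uses a pure-Python `sys.float_info.min` analogue of `torch.finfo(scale.dtype).tiny`):

```python
import sys

def ensure_divisibility(numerator, denominator):
    # Validate the divisor explicitly instead of letting % raise.
    if denominator == 0:
        raise ValueError(f"denominator must be non-zero, got {denominator}")
    if numerator % denominator != 0:
        raise ValueError(f"{numerator} is not divisible by {denominator}")

def is_report_boundary(global_step_count, steps_per_output):
    # Falsy check covers both None and 0, treating them as "reporting disabled".
    if not steps_per_output:
        return False
    return global_step_count % steps_per_output == 0

def ceil_div(a, b):
    if b == 0:
        raise ValueError("ceil_div: divisor b must be non-zero")
    return -(-a // b)

def safe_inverse_scale(scale):
    # Clamp zero elements to the smallest positive normal float before
    # inversion so the result stays finite (tensor version would be
    # scale.clamp(min=torch.finfo(scale.dtype).tiny)).
    return [1.0 / max(s, sys.float_info.min) for s in scale]
```

With these guards, the repro cases above raise descriptive `ValueError`s (or return `False` for a disabled timer) instead of `ZeroDivisionError`, and the inverse scale never produces `inf`.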
System info
- Can provide full `ds_report` output if needed.
- The repros above are minimal and mostly pure Python, except item 4, which is on the HPU backend path.