Description
Describe the bug
I found four related zero-handling issues where divisors/scales are not validated before use.
Three can raise `ZeroDivisionError`; one can silently propagate `inf`/`nan`.
Affected locations
- `deepspeed/utils/groups.py`: `_ensure_divisibility(numerator, denominator)` uses `numerator % denominator` without guarding `denominator == 0`.
- `deepspeed/utils/timer.py`: `ThroughputTimer._is_report_boundary()` checks for `None` but not `0` before `self.global_step_count % self.steps_per_output`.
- `deepspeed/inference/v2/inference_utils.py`: `ceil_div(a, b)` returns `-(-a // b)` without guarding `b == 0`.
- `op_builder/hpu/fp_quantizer.py`: `FPQuantizer.dequantize()` computes `(1.0 / scale)` without guarding zero elements in `scale`, which can produce non-finite values and silently corrupt outputs.
To Reproduce
- `_ensure_divisibility(8, 0)` -> `ZeroDivisionError`
- `_is_report_boundary()` with `steps_per_output=0` -> `ZeroDivisionError`
- `ceil_div(10, 0)` -> `ZeroDivisionError`
- `1.0 / torch.tensor([0.0, 1.0])` -> `tensor([inf, 1.])` (same pattern used in the HPU dequantize path)
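The integer cases can be reproduced with minimal stand-ins that mirror the reported patterns (these are sketches of the patterns, not the actual DeepSpeed source):

```python
def ensure_divisibility(numerator, denominator):
    # Reported pattern: modulo with no zero guard, so denominator == 0
    # raises ZeroDivisionError instead of a clear validation error.
    assert numerator % denominator == 0

def ceil_div(a, b):
    # Reported pattern: ceiling division via -(-a // b), no guard on b == 0.
    return -(-a // b)

for fn, args in [(ensure_divisibility, (8, 0)), (ceil_div, (10, 0))]:
    try:
        fn(*args)
    except ZeroDivisionError as e:
        print(f"{fn.__name__}{args} -> ZeroDivisionError: {e}")
```

Both calls surface a raw `ZeroDivisionError` rather than a user-facing message.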
Expected behavior
- Explicit validation for invalid zero values.
- Clear user-facing error messages (or safe clamping where appropriate).
- No raw modulo/division-by-zero exceptions.
- No silent non-finite propagation in dequantization.
Suggested fixes
- `_ensure_divisibility`: add a guard before the modulo (`denominator != 0`).
- `_is_report_boundary`: treat `0` as invalid/disabled (`if not self.steps_per_output:` or explicit `<= 0` validation).
- `ceil_div`: reject `b == 0` with a clear error.
- `FPQuantizer.dequantize`: clamp `scale` to `torch.finfo(scale.dtype).tiny` (or validate and fail clearly) before inversion.
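A minimal sketch of the guarded versions (illustrative helpers, not a patch against the DeepSpeed code; the clamp uses a pure-Python `sys.float_info.min` analogue of `torch.finfo(scale.dtype).tiny`):

```python
import sys

def ensure_divisibility(numerator, denominator):
    # Validate the divisor explicitly instead of letting % raise.
    if denominator == 0:
        raise ValueError(f"denominator must be non-zero, got {denominator}")
    if numerator % denominator != 0:
        raise ValueError(f"{numerator} is not divisible by {denominator}")

def is_report_boundary(global_step_count, steps_per_output):
    # Falsy check covers both None and 0, treating them as "reporting disabled".
    if not steps_per_output:
        return False
    return global_step_count % steps_per_output == 0

def ceil_div(a, b):
    if b == 0:
        raise ValueError("ceil_div: divisor b must be non-zero")
    return -(-a // b)

def safe_inverse_scale(scale):
    # Clamp zero elements to the smallest positive normal float before
    # inversion so the result stays finite (tensor version would be
    # scale.clamp(min=torch.finfo(scale.dtype).tiny)).
    return [1.0 / max(s, sys.float_info.min) for s in scale]
```

With these guards, the repro cases above raise descriptive `ValueError`s (or return `False` for a disabled timer) instead of `ZeroDivisionError`, and the inverse scale never produces `inf`.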
System info
- Can provide full `ds_report` output if needed.
- The repros above are minimal and mostly pure Python, except item 4, which is on the HPU backend path.