
Commit fdadfd8

Fix flash decoding in GPU. (apple#999)
target_positions used to be time_step, but after PR apple#995 it now represents the actual target positions with shape [batch, step_len]. This commit updates the GPU decoding code to align with that change. CI did not cover GPU unit tests.

TEST=test_extend_step10 of axlearn/common/flash_attention/layer_test.py on GPU
1 parent 9e64388 commit fdadfd8

File tree

1 file changed: 1 addition, 1 deletion


axlearn/common/flash_attention/utils.py

Lines changed: 1 addition & 1 deletion
@@ -212,7 +212,7 @@ def get_segment_ids(segment_ids: SegmentIdAttentionBias) -> Optional[Tensor]:
     if mask is None or mask.target_positions is None:
         raise RuntimeError("Cannot retrive MaskFnAttentionBias or target_positions.")
     mask_fn = mask.mask
-    kv_seq_len = mask.target_positions + 1
+    kv_seq_len = mask.target_positions[:, -1] + 1
     logging.info("Using mask_fn=%s for FlashDecoding.", mask_fn)

     bias = explicit_bias.value()
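A minimal sketch of why the indexing changed, using plain numpy rather than the actual axlearn tensor types (the array values here are hypothetical): before apple#995, target_positions was a scalar time_step, so adding 1 gave the KV sequence length directly; afterwards it is a [batch, step_len] array of absolute positions, so the per-batch KV length must come from the last position in each row.

```python
import numpy as np

# Hypothetical example values, not axlearn code: a batch of 2
# sequences, each decoding step_len=2 positions.
target_positions = np.array([[3, 4], [7, 8]])  # shape [batch=2, step_len=2]

# Old expression: with the new shape this adds 1 elementwise and
# keeps a 2-D result, which is not a per-batch sequence length.
old_result = target_positions + 1              # shape [2, 2]

# Fixed expression: take the last decoded position per batch entry,
# then add 1 to get the KV sequence length for each sequence.
kv_seq_len = target_positions[:, -1] + 1       # shape [2] -> [5, 9]

print(kv_seq_len.tolist())  # [5, 9]
```

With the scalar time_step both expressions would have agreed; the `[:, -1]` indexing is what restores that behavior for the batched shape.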
