DeepSpeed Ulysses sequence parallelism is not working for Gemma4 #8002
Replies: 1 comment
The root issue is that Gemma4 uses different head dimensions for local and global attention (256 and 512 respectively), while Ulysses assumes a uniform head_dim.

One workaround: intercept the attention forward call and handle local and global heads separately. Split QKV by head type before the all-to-all, then run the Ulysses scatter independently for each group (within each group the head_dim is uniform).

There is an open issue somewhere about non-uniform head_dim support in Ulysses. It is worth checking whether there has been any movement there, since hybrid attention with different head sizes is showing up in a lot of recent architectures.

If you can share the debug log with the exact reshape error, I can probably point to the specific line that needs patching.
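To make the mismatch concrete, here is a rough pure-Python sketch of the per-rank shape bookkeeping around the Ulysses all-to-all, computed separately for each attention type. It is not DeepSpeed code: `AttnShape`, `ulysses_shapes`, `numel`, the head count of 8 and the sequence-parallel size of 4 are all made up for illustration. Only the 256/512 head dims and the 64k context come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    """Hypothetical description of one attention layer type."""
    seq_len: int    # full sequence length, before sharding
    num_heads: int  # attention heads in this layer type (assumed value)
    head_dim: int   # per-head dimension for this layer type


def ulysses_shapes(shape, sp_size):
    """Return (before, after) per-rank shapes around the all-to-all.

    before: each rank holds a sequence shard with all heads.
    after:  each rank holds the full sequence with a shard of the heads.
    """
    if shape.seq_len % sp_size or shape.num_heads % sp_size:
        raise ValueError("seq_len and num_heads must be divisible by sp_size")
    before = (shape.seq_len // sp_size, shape.num_heads, shape.head_dim)
    after = (shape.seq_len, shape.num_heads // sp_size, shape.head_dim)
    return before, after


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


SP_SIZE = 4  # assumed sequence-parallel degree
LAYER_SHAPES = {
    # head_dim values are from the thread; num_heads is an assumption
    "local": AttnShape(seq_len=65536, num_heads=8, head_dim=256),
    "global": AttnShape(seq_len=65536, num_heads=8, head_dim=512),
}

# Compute the target shape per layer type instead of once for the whole model.
per_type = {name: ulysses_shapes(s, SP_SIZE) for name, s in LAYER_SHAPES.items()}

# The failure mode described in the question: a local-attention tensor
# reshaped to a target that was sized from the global head_dim.
local_before = per_type["local"][0]
wrong_target = per_type["global"][1]
mismatch = numel(local_before) != numel(wrong_target)

for name, (before, after) in per_type.items():
    print(name, before, "->", after)
print("local tensor vs global-sized target mismatches:", mismatch)
```

With the target derived per layer type, the element counts before and after the all-to-all agree for both groups. The reshape only breaks when a single head_dim is applied to every layer.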
I have a long context (> 64k) and need to use DeepSpeed ZeRO-3 with DeepSpeed Ulysses sequence parallelism. However, because of the model architecture, the local and global head dimensions differ (256 and 512), so my QKV tensor does not match the target size: the target always uses the global head dim, while my QKV has the shape of the local head dim. I would appreciate any insights on this. I can share my debug logs if needed.