
Support XPU DDP training and autocast for LowBitMatmul #9167

Merged (3 commits) on Oct 17, 2023

Conversation

yangw1234 (Contributor)

Description

  1. Support XPU DDP training by patching transformers
  2. Add autocast support for LowBitMatmul to speed up mixed-precision training
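For context (this illustration is not part of the PR): under torch.autocast, eligible ops such as matmul run in the region's lower-precision dtype automatically; the PR's change lets LowBitMatmul participate in that same mechanism. A minimal CPU sketch:

```python
import torch

x = torch.randn(4, 4)  # float32 inputs
w = torch.randn(4, 4)

# Inside an autocast region, autocast-eligible ops like matmul run in
# the region's lower-precision dtype without explicit casts.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ w

print(y.dtype)  # torch.bfloat16
```

On XPU the same pattern applies with `device_type="xpu"`; a custom op that ignores the autocast state would keep running in full precision, which is what this PR addresses for LowBitMatmul.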

1. Why the change?

Faster end-to-end training time when using multiple Intel GPUs
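As background (not the PR's code): the DDP pattern is the same across backends; on XPU the process group would typically use the oneCCL backend and `xpu` devices via Intel Extension for PyTorch. A runnable single-process sketch using the gloo backend on CPU, for illustration:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group for illustration; real training launches one
# process per device (e.g. via torchrun or mpirun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# On XPU this would wrap a model moved to the device, e.g. model.to("xpu").
model = DDP(torch.nn.Linear(4, 2))
out = model(torch.randn(3, 4))  # gradients sync across ranks in backward

dist.destroy_process_group()
print(out.shape)  # torch.Size([3, 2])
```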

2. User API changes

None

3. Summary of the change

4. How to test?

Manually tested using 4 PVC 1100 GPUs on the Alpaca dataset

    if isinstance(value, torch.Tensor):
        is_eligible = (
            value.is_floating_point()
            and value.is_cuda
Reviewer (Contributor): why is_cuda?

yangw1234 (Author): fixed
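The fix presumably broadens the eligibility check beyond CUDA tensors so XPU tensors also qualify for autocast. A hypothetical device-agnostic version (the helper name and exact logic are assumptions, not the actual patch):

```python
import torch

def _is_autocast_eligible(value):
    # Hypothetical helper: treat floating-point tensors on CUDA or XPU
    # devices as eligible for autocast, instead of CUDA only.
    if isinstance(value, torch.Tensor):
        return value.is_floating_point() and value.device.type in ("cuda", "xpu")
    return False

print(_is_autocast_eligible(torch.randn(2)))  # False: CPU tensor
```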

jason-dai (Contributor): LGTM

@yangw1234 yangw1234 merged commit 3fd05fe into intel-analytics:main Oct 17, 2023
16 checks passed
@@ -378,8 +385,7 @@ def forward(self, x: torch.Tensor):
         result = result.view(new_shape)
         if self.bias is not None:
             result += self.bias

-        return result.to(x.dtype)
+        return result
Reviewer (Contributor): This line change causes this issue: https://github.com/analytics-zoo/nano/issues/639
@yangw1234 Please take a look.
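For context: the removed `.to(x.dtype)` restored the caller's dtype. When the internal computation runs in a lower-precision dtype, dropping that cast leaks the lower-precision result to downstream code that expects the input dtype. A small illustration (not the PR's code; the bfloat16 compute here stands in for the low-bit kernel's internal precision):

```python
import torch

x = torch.randn(2, 2)                   # caller works in float32
result = x.bfloat16() @ x.bfloat16()    # internal compute in bfloat16

print(result.dtype)              # torch.bfloat16: leaks to the caller
print(result.to(x.dtype).dtype)  # torch.float32: the removed cast restored this
```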

liu-shaojun pushed a commit that referenced this pull request on Mar 25, 2024:
* support autocast in low bit matmul

* Support XPU DDP training

* fix amp