Patch ORTTrainer's compatibility with DeepSpeed #148

Merged · 26 commits · May 24, 2022

Conversation

@JingyaHuang (Contributor) commented Apr 15, 2022

What does this PR do?

Fixes #145 and #146

Contents

  • Add missing dependencies
  • Fix compatibility with DeepSpeed (see the launch sketch after this list)
  • Fix compatibility with Fairscale (simple ✅, dp2/dp3 ❌)
  • Test ZeRO: stage 1
  • Test ZeRO: stage 2
  • Test ZeRO: stage 3 ❌
  • Test ZeRO: NVMe support ❌
  • Test ZeRO: BF16 ❌
  • ZeRO Inference when ORT inference is enabled (needs coordination with transformers.onnx.export_pytorch)
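
For reference, a minimal sketch of how the ZeRO tests above can be driven. The config values, file name, and the pre-built `model`/`train_dataset` are illustrative assumptions, not the exact setup used in these tests:

```python
# Hedged sketch: running ORTTrainer under DeepSpeed ZeRO stage 1.
# Config values and file names are illustrative; `model` and `train_dataset`
# are assumed to have been prepared elsewhere.
import json

from optimum.onnxruntime import ORTTrainer
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "fp16": {"enabled": "auto"},        # mirrors the fp16 flag below
    "zero_optimization": {"stage": 1},  # stage 2 is analogous; stage 3 fails here
}
with open("ds_config_zero1.json", "w") as f:
    json.dump(ds_config, f)

args = TrainingArguments(
    output_dir="out",
    fp16=True,
    deepspeed="ds_config_zero1.json",  # inherited from transformers.Trainer
)
trainer = ORTTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The script would be launched with the DeepSpeed launcher, e.g. `deepspeed run_script.py`, so that the distributed environment is set up before the trainer starts.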

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@jambayk (Contributor) commented Apr 15, 2022

Hi, I saw that this PR is linked to issue #145, which I opened.

I do not think this resolves the issue, which comes from the model being used as if it were of type ORTModule; it isn't when DeepSpeed or any other wrapper such as DDP is used.

EDIT: I apologize if this is still a work in progress and you have a fix in the works. Please ignore this comment if so. I just wanted to make sure I provided enough context for the issue.
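
To illustrate the failure mode described here: under DeepSpeed or DDP the trainer's model is a wrapper object, so code that assumes it is an ORTModule breaks. The unwrap helper below is a hedged sketch of the kind of guard that avoids this, not the fix that landed in trainer.py:

```python
# Hedged sketch: DeepSpeedEngine and DistributedDataParallel both expose the
# wrapped network as `.module`, so an isinstance check must first unwrap.
from onnxruntime.training import ORTModule

def unwrap(model):
    """Strip nested wrappers (DeepSpeedEngine, DDP, ...) to reach the inner module."""
    while hasattr(model, "module"):
        model = model.module
    return model

def is_ort_module(model) -> bool:
    # isinstance(model, ORTModule) is False on the wrapper itself,
    # which is exactly what issue #145 hit.
    return isinstance(unwrap(model), ORTModule)
```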

@JingyaHuang (Contributor, Author)

> Hi, I saw that this PR is linked to issue #145, which I opened.
>
> I do not think this resolves the issue, which comes from the model being used as if it were of type ORTModule; it isn't when DeepSpeed or any other wrapper such as DDP is used.
>
> EDIT: I apologize if this is still a work in progress and you have a fix in the works. Please ignore this comment if so. I just wanted to make sure I provided enough context for the issue.

Hi @jambayk, no worries: I opened this PR so that there would be more transparency on the progress. Sorry for any confusion it may have caused, and thanks a lot for the additional context on the issue.

@JingyaHuang (Contributor, Author)

Compatibility of ORTTrainer with DeepSpeed/Fairscale is now implemented. Check the results of the tests here.

@lewtun (Member) left a comment


Thanks for adding these warnings @JingyaHuang! LGTM 🚀

optimum/onnxruntime/trainer.py: 6 review threads, all resolved (outdated)
@@ -773,6 +780,13 @@ def evaluation_loop_ort(
        )

        logger.info("[INFO] Exporting the model to ONNX...")
        if args.deepspeed and args.fp16:
            warnings.warn(
@lewtun (Member)

I wonder if we should check the transformers version and then raise a warning if the detected version doesn't match the required one for ONNX export on CUDA?

For now, this warning is OK but we'll probably want to revisit this once we bump transformers with your PR :)

@JingyaHuang (Contributor, Author)

@lewtun Yes, exactly. I put it this way since I am not sure which version of transformers will include it. I will check the transformers version once I have that information.

@lewtun (Member)

Sounds good!
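
The version gate discussed above could look like the following. This is a hedged sketch: the 4.19.2 bound comes from the merge notes later in this thread, and the constant and helper names are illustrative, not the code that landed:

```python
# Hedged sketch of the suggested transformers version check.
import warnings

import transformers
from packaging import version

MIN_TF_FOR_CUDA_EXPORT = version.parse("4.19.2")  # bound cited later in this thread

def warn_if_cuda_export_unsupported():
    current = version.parse(transformers.__version__)
    if current <= MIN_TF_FOR_CUDA_EXPORT:
        warnings.warn(
            f"transformers {current} may not support ONNX export on CUDA; "
            "a version newer than 4.19.2 is needed to export fp16 models "
            "trained with DeepSpeed."
        )
```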

JingyaHuang and others added 6 commits May 11, 2022 16:46
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> (all six commits)
@JingyaHuang (Contributor, Author)

PR #17183 adds support for ONNX export on CUDA. I will refactor the code once transformers includes it in a release.

@echarlaix self-requested a review on May 23, 2022 at 16:42
@JingyaHuang (Contributor, Author)

Merging this enables DeepSpeed with ORTTrainer. Pay attention to the following two points:

  • To export a mixed-precision trained (fp16) model to ONNX, the export needs to be done on CUDA, which requires transformers > 4.19.2 (to be included in the next release); see the sketch after this list.
  • DeepSpeed BF16 is not compatible with onnxruntime-training. We will track progress on the ONNX Runtime side toward supporting bf16 for the essential ops.
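
A hedged sketch of the fp16 export path, using the `device` argument that PR #17183 (mentioned above) adds to transformers.onnx.export. The checkpoint name and feature are illustrative, and exact signatures may differ between transformers versions:

```python
# Hedged sketch: exporting an fp16 model on CUDA. The .half() call here just
# stands in for a model trained with mixed precision.
from pathlib import Path

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.onnx import FeaturesManager, export

name = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).half().to("cuda")

_, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(
    model, feature="sequence-classification"
)
onnx_config = onnx_config_cls(model.config)

export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
    device="cuda",  # fp16 export must run on CUDA, per the note above
)
```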


Successfully merging this pull request may close these issues.

ORTTrainer doesn't work with distributed training and/or DeepSpeed