Patch ORTTrainer's compatibility with DeepSpeed #148

Merged · 26 commits · May 24, 2022

Conversation

@JingyaHuang (Contributor) commented Apr 15, 2022

What does this PR do?

Fixes #145 and #146

Contents

  • Add missing dependencies
  • Fix compatibility with DeepSpeed (see the launch sketch after this list)
  • Fix compatibility with Fairscale (simple ✅, dp2/dp3 ❌)
  • Test ZeRO: stage 1
  • Test ZeRO: stage 2
  • Test ZeRO: stage 3 ❌
  • Test ZeRO: NVMe support ❌
  • Test ZeRO: BF16 ❌
  • ZeRO Inference when ORT inference is enabled (needs coordination with transformers.onnx.export_pytorch)
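
For reference, a minimal sketch of how the ZeRO tests above can be driven. The config values, file name, and the pre-built `model`/`train_dataset` are illustrative assumptions, not the exact setup used in these tests:

```python
# Hedged sketch: running ORTTrainer under DeepSpeed ZeRO stage 1.
# Config values and file names are illustrative; `model` and `train_dataset`
# are assumed to have been prepared elsewhere.
import json

from optimum.onnxruntime import ORTTrainer
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "fp16": {"enabled": "auto"},        # mirrors the fp16 flag below
    "zero_optimization": {"stage": 1},  # stage 2 is analogous; stage 3 fails here
}
with open("ds_config_zero1.json", "w") as f:
    json.dump(ds_config, f)

args = TrainingArguments(
    output_dir="out",
    fp16=True,
    deepspeed="ds_config_zero1.json",  # inherited from transformers.Trainer
)
trainer = ORTTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The script would be launched with the DeepSpeed launcher, e.g. `deepspeed run_script.py`, so that the distributed environment is set up before the trainer starts.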

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@jambayk (Contributor) commented Apr 15, 2022

Hi, I saw that this PR is linked to issue #145, which I opened.

I do not think this resolves the issue, which comes from the model being used as if it were of type ORTModule; it isn't when DeepSpeed or any other wrapper such as DDP is used.

EDIT: I apologize if this is still a work in progress and you have a fix in the works. Please ignore this comment if so. I just wanted to make sure I provided enough context for the issue.
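
To illustrate the failure mode described here: under DeepSpeed or DDP the trainer's model is a wrapper object, so code that assumes it is an ORTModule breaks. The unwrap helper below is a hedged sketch of the kind of guard that avoids this, not the fix that landed in trainer.py:

```python
# Hedged sketch: DeepSpeedEngine and DistributedDataParallel both expose the
# wrapped network as `.module`, so an isinstance check must first unwrap.
from onnxruntime.training import ORTModule

def unwrap(model):
    """Strip nested wrappers (DeepSpeedEngine, DDP, ...) to reach the inner module."""
    while hasattr(model, "module"):
        model = model.module
    return model

def is_ort_module(model) -> bool:
    # isinstance(model, ORTModule) is False on the wrapper itself,
    # which is exactly what issue #145 hit.
    return isinstance(unwrap(model), ORTModule)
```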

@JingyaHuang (Contributor, Author)

> Hi, I saw that this PR is linked to issue #145, which I opened.
>
> I do not think this resolves the issue, which comes from the model being used as if it were of type ORTModule; it isn't when DeepSpeed or any other wrapper such as DDP is used.
>
> EDIT: I apologize if this is still a work in progress and you have a fix in the works. Please ignore this comment if so. I just wanted to make sure I provided enough context for the issue.

Hi @jambayk, no worries: I opened this PR so that there would be more transparency on the progress. Sorry for any confusion it may have caused, and thanks a lot for the additional context on the issue.

@JingyaHuang (Contributor, Author)

Compatibility of ORTTrainer with DeepSpeed/Fairscale is now implemented. Check the results of the tests here.

@lewtun (Member) left a comment


Thanks for adding these warnings @JingyaHuang! LGTM 🚀

optimum/onnxruntime/trainer.py: 6 review threads, all resolved (outdated)
@@ -773,6 +780,13 @@ def evaluation_loop_ort(
        )

        logger.info("[INFO] Exporting the model to ONNX...")
        if args.deepspeed and args.fp16:
            warnings.warn(
@lewtun (Member)

I wonder if we should check the transformers version and then raise a warning if the detected version doesn't match the required one for ONNX export on CUDA?

For now, this warning is OK but we'll probably want to revisit this once we bump transformers with your PR :)

@JingyaHuang (Contributor, Author)

@lewtun Yes, exactly. I put it this way since I am not sure which version of transformers will include it. I will check the transformers version once I have that information.

@lewtun (Member)

Sounds good!
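
The version gate discussed above could look like the following. This is a hedged sketch: the 4.19.2 bound comes from the merge notes later in this thread, and the constant and helper names are illustrative, not the code that landed:

```python
# Hedged sketch of the suggested transformers version check.
import warnings

import transformers
from packaging import version

MIN_TF_FOR_CUDA_EXPORT = version.parse("4.19.2")  # bound cited later in this thread

def warn_if_cuda_export_unsupported():
    current = version.parse(transformers.__version__)
    if current <= MIN_TF_FOR_CUDA_EXPORT:
        warnings.warn(
            f"transformers {current} may not support ONNX export on CUDA; "
            "a version newer than 4.19.2 is needed to export fp16 models "
            "trained with DeepSpeed."
        )
```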

JingyaHuang and others added 6 commits May 11, 2022 16:46
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> (all six commits)
@JingyaHuang (Contributor, Author)

PR #17183 adds support for ONNX export on CUDA. I will refactor the code once transformers includes it in a release.

@echarlaix self-requested a review on May 23, 2022 at 16:42
@JingyaHuang (Contributor, Author)

Merging this enables DeepSpeed with ORTTrainer. Pay attention to the following two points:

  • To export a mixed-precision trained (fp16) model to ONNX, the export needs to be done on CUDA, which requires transformers > 4.19.2 (to be included in the next release); see the sketch after this list.
  • DeepSpeed BF16 is not compatible with onnxruntime-training. We will track progress on the ONNX Runtime side toward supporting bf16 for the essential ops.
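
A hedged sketch of the fp16 export path, using the `device` argument that PR #17183 (mentioned above) adds to transformers.onnx.export. The checkpoint name and feature are illustrative, and exact signatures may differ between transformers versions:

```python
# Hedged sketch: exporting an fp16 model on CUDA. The .half() call here just
# stands in for a model trained with mixed precision.
from pathlib import Path

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.onnx import FeaturesManager, export

name = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).half().to("cuda")

_, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(
    model, feature="sequence-classification"
)
onnx_config = onnx_config_cls(model.config)

export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,
    output=Path("model.onnx"),
    device="cuda",  # fp16 export must run on CUDA, per the note above
)
```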


Successfully merging this pull request may close these issues.

ORTTrainer doesn't work with distributed training and/or DeepSpeed