add reproducible compilation environment #3943

Merged
merged 10 commits into from Jul 31, 2023
Conversation

fecet
Contributor

@fecet fecet commented Jul 13, 2023

Currently, deepspeed only verifies the compilation process on Docker, which may not work on many clusters where users lack the required privileges. This makes precompiling deepspeed ops very challenging, especially since the compilation toolchain can vary significantly between systems. There have been many issues reporting an inability to compile ops in users' own environments, e.g. #3890, pytorch/pytorch#100557, #3358, #3067, #3944.

Conda-forge provides a cross-platform compilation toolchain; if we maintain a robust Conda environment on top of it, precompiled ops can be made available to everyone and the reproducibility issue is solved.
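
For illustration, such an environment could be described with a conda-forge environment file along these lines. This is only a sketch: the file name, package names, and versions below are assumptions, not necessarily the exact file added by this PR.

# Hypothetical sketch of a conda-forge build environment; adjust packages/versions to your setup.
cat > environment.yml <<'EOF'
name: ds-build
channels:
  - conda-forge
dependencies:
  - python=3.10
  - compilers        # conda-forge's cross-platform C/C++/Fortran toolchain
  - ninja            # required for JIT-compiled ops
  - libaio           # needed by the async_io op
  - pytorch=2.0      # or a nightly build, installed separately
# An nvcc matching torch's CUDA version (11.8 in the logs below) is also required;
# how it is provided is left out of this sketch.
EOF
conda env create -f environment.yml
conda activate ds-build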

I verified the environment on Arch Linux (CUDA_PATH should be unset first; this is caused by https://archlinux.org/packages/extra/x86_64/cuda/) and on Ubuntu 20.04, for both pytorch and pytorch-nightly. For pytorch-nightly, DS_BUILD_AIO should be used to skip the async_io op, since that op doesn't seem to support C++17 yet (#3944), and the parallel build option should be disabled, as in #2885. The command is

DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 \
    pip install . --global-option="build_ext" \
    2>&1 | tee out # GOOD for release
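
For the pytorch-nightly build on Arch Linux described above, a variant along these lines should work; this is a sketch assembled from the notes in this description, and the parallel-build workaround from #2885 is only referenced in a comment rather than spelled out:

# Sketch: nightly build on Arch Linux (assumptions noted inline).
unset CUDA_PATH        # Arch's cuda package exports CUDA_PATH, which breaks the build
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_AIO=0 \
    pip install . --global-option="build_ext" \
    2>&1 | tee out     # also disable the parallel build, see #2885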

result:
nightly

[2023-07-13 12:37:36,308] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0+3c400e7818), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0.dev20230712
deepspeed install path ........... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0+aef6c65c, aef6c65c, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8

release

[2023-07-13 13:12:48,317] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0+f3467c95, f3467c95, doc/conda-env
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

@fecet fecet requested review from jeffra and mrwyattii as code owners July 13, 2023 04:36
@fecet
Contributor Author

fecet commented Jul 13, 2023

@microsoft-github-policy-service agree

@tjruwase
Contributor

@fecet, thanks for this amazing PR that also includes documentation. We will review right away, apologies for the delay.

@@ -0,0 +1,20 @@
channels:

We may want to find somewhere not at the root of the repo to put this file, @mrwyattii - thoughts?

@loadams loadams enabled auto-merge July 31, 2023 15:37
@loadams loadams added this pull request to the merge queue Jul 31, 2023
Merged via the queue into microsoft:master with commit f763b93 Jul 31, 2023
polisettyvarma pushed a commit to polisettyvarma/DeepSpeed that referenced this pull request Aug 7, 2023
* add reproducible compilation environment

* fix ci

* fix typo for formatting check

* Fix casing for format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
@fecet fecet deleted the doc/conda-env branch October 29, 2023 07:42