add reproducible compilation environment #3943

Merged
merged 10 commits into from Jul 31, 2023
Conversation

fecet
Contributor

@fecet fecet commented Jul 13, 2023

Currently, deepspeed only verifies the compilation process on Docker, which may not work on many clusters where users lack the required privileges. This makes precompiling deepspeed ops very challenging, especially since the compilation toolchain can vary significantly between systems. There have been many issues reporting an inability to compile ops in users' own environments, e.g. #3890, pytorch/pytorch#100557, #3358, #3067, #3944.

Conda-forge provides a cross-platform compilation toolchain; if we maintain a robust Conda environment on top of it, precompiled ops can be made available to everyone and the reproducibility issue is solved.
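
For illustration, such an environment could be described with a conda-forge environment file along these lines. This is only a sketch: the file name, package names, and versions below are assumptions, not necessarily the exact file added by this PR.

# Hypothetical sketch of a conda-forge build environment; adjust packages/versions to your setup.
cat > environment.yml <<'EOF'
name: ds-build
channels:
  - conda-forge
dependencies:
  - python=3.10
  - compilers        # conda-forge's cross-platform C/C++/Fortran toolchain
  - ninja            # required for JIT-compiled ops
  - libaio           # needed by the async_io op
  - pytorch=2.0      # or a nightly build, installed separately
# An nvcc matching torch's CUDA version (11.8 in the logs below) is also required;
# how it is provided is left out of this sketch.
EOF
conda env create -f environment.yml
conda activate ds-build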

I verified the environment on Arch Linux (CUDA_PATH should be unset first; this is caused by https://archlinux.org/packages/extra/x86_64/cuda/) and on Ubuntu 20.04, for both pytorch and pytorch-nightly. For pytorch-nightly, DS_BUILD_AIO should be used to skip the async_io op, since that op doesn't seem to support C++17 yet (#3944), and the parallel build option should be disabled, as in #2885. The command is

DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 \
    pip install . --global-option="build_ext" \
    2>&1 | tee out # GOOD for release
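
For the pytorch-nightly build on Arch Linux described above, a variant along these lines should work; this is a sketch assembled from the notes in this description, and the parallel-build workaround from #2885 is only referenced in a comment rather than spelled out:

# Sketch: nightly build on Arch Linux (assumptions noted inline).
unset CUDA_PATH        # Arch's cuda package exports CUDA_PATH, which breaks the build
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_AIO=0 \
    pip install . --global-option="build_ext" \
    2>&1 | tee out     # also disable the parallel build, see #2885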

result:
nightly

[2023-07-13 12:37:36,308] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0+3c400e7818), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0.dev20230712
deepspeed install path ........... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0+aef6c65c, aef6c65c, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8

release

[2023-07-13 13:12:48,317] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1
deepspeed install path ........... ['/home/rok/.conda/envs/dl-dev2/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.0+f3467c95, f3467c95, doc/conda-env
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

@fecet fecet requested review from jeffra and mrwyattii as code owners July 13, 2023 04:36
@fecet
Contributor Author

fecet commented Jul 13, 2023

@microsoft-github-policy-service agree

@tjruwase
Contributor

@fecet, thanks for this amazing PR that also includes documentation. We will review right away, apologies for the delay.

@@ -0,0 +1,20 @@
channels:

We may want to find somewhere not at the root of the repo to put this file, @mrwyattii - thoughts?

@loadams loadams enabled auto-merge July 31, 2023 15:37
@loadams loadams added this pull request to the merge queue Jul 31, 2023
Merged via the queue into microsoft:master with commit f763b93 Jul 31, 2023
polisettyvarma pushed a commit to polisettyvarma/DeepSpeed that referenced this pull request Aug 7, 2023
* add reproducible compilation environment

* fix ci

* fix typo for formatting check

* Fix casing for format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
@fecet fecet deleted the doc/conda-env branch October 29, 2023 07:42