enable cpu adam op on powerpc architectures #1213

adammoody · 2021-07-09T02:06:26Z

I am trying to build DeepSpeed on a PowerPC architecture (Power9). I run into two issues.

First, the -march option is not supported in gcc on this platform, though it seems like -mcpu could be a substitute for it. To support that, one change replaces that C++ flag.

Second, the cpuid.h and x86intrin.h headers do not exist on this system. To work around that, I've protected including those behind an x86 compile guard.

This allows the CPU_ADAM op to build, however, I don't have a good way to verify that the resulting build is actually valid.

ghost · 2021-07-09T02:06:38Z

All CLA requirements met.

adammoody · 2021-07-12T05:27:00Z

The asynchronous I/O op similarly has an -march=native flag in op_builder/async_io.py and includes for cpuid.h and x86intin.h in csrc/aio/py_lib/deepspeed_py_copy.h.

tjruwase · 2021-07-12T13:34:59Z

@adammoody, thanks so much for porting DeepSpeed to PowerPC, this is very important and greatly appreciated. In terms of validating the builds for both cpu_adam and aio, is it possible to run the unit tests? To get started you may want to initially focus on the specific unit tests for cpu_adam and aio.

tjruwase · 2021-07-12T14:24:20Z

@adammoody, this PR is failing CI now because of code formatting issues. Please see this. Thanks!

adammoody · 2021-07-13T08:16:36Z

@adammoody, thanks so much for porting DeepSpeed to PowerPC, this is very important and greatly appreciated.

Thanks. I'm happy to be help where I can.

In terms of validating the builds for both cpu_adam and aio, is it possible to run the unit tests? To get started you may want to initially focus on the specific unit tests for cpu_adam and aio.

Yes, I'll try to give this a shot.

stas00 · 2021-07-14T21:13:59Z

Has this by chance changed something in the nvme extension build process, as our nvme test started failing to jit build after 0.4.3 release: huggingface/transformers#12715

stas00 · 2021-07-14T23:51:51Z

update: false alarm, the problem was solved by rm -rf ~/.cache/torch_extensions/ as suggested by
@tjruwase

adammoody · 2021-07-15T23:47:41Z

@tjruwase , is this the proper way to run the cpu_adam test?

>>: pytest tests/unit/test_cpu_adam.py
============================================================ test session starts =============================================================
platform linux -- Python 3.7.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /path/to/DeepSpeed
collected 6 items                                                                                                                            

tests/unit/test_cpu_adam.py ......                                                                                                     [100%]

============================================================= 6 passed in 11.89s =============================================================
Config: alpha=0.001000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1

tjruwase · 2021-07-16T00:00:11Z

Yes, it is. And it looks like all 6 tests passed.

Also, you can run verbose mode by adding -sv flags: pytest -sv tests/unit/test_cpu_adam.py.

adammoody · 2021-07-16T00:06:06Z

For the aio test, I get:

============================================================ test session starts =============================================================
platform linux -- Python 3.7.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /path/to/DeepSpeed
collected 28 items                                                                                                                           

tests/unit/test_aio.py ............................                                                                                    [100%]

======================================================= 28 passed in 129.68s (0:02:09) =======================================================

I'm guessing both of these might be slow compared to x86, since on PowerPC, both fall back to use a non-SIMD implementation. Can you judge performance from this?

adammoody · 2021-07-16T00:12:06Z

To get aio to actually build on my Redhat system, I also had to make this change:

diff --git a/op_builder/async_io.py b/op_builder/async_io.py
index 78aa2fe..9c6dde6 100644
--- a/op_builder/async_io.py
+++ b/op_builder/async_io.py
@@ -50,7 +50,7 @@ class AsyncIOBuilder(OpBuilder):
         return ['-laio']
 
     def is_compatible(self):
-        aio_libraries = ['libaio-dev']
+        aio_libraries = ['libaio-dev', 'libaio-devel']
         aio_compatible = self.libraries_installed(aio_libraries)
         if not aio_compatible:
             self.warning(
diff --git a/op_builder/builder.py b/op_builder/builder.py
index 3eeb4e4..bd6bfa8 100644
--- a/op_builder/builder.py
+++ b/op_builder/builder.py
@@ -151,9 +151,8 @@ class OpBuilder(ABC):
 
     def libraries_installed(self, libraries):
         valid = False
-        check_cmd = 'dpkg -l'
         for lib in libraries:
-            result = subprocess.Popen(f'dpkg -l {lib}',
+            result = subprocess.Popen(f'rpm -q -l {lib}',
                                       stdout=subprocess.PIPE,
                                       stderr=subprocess.PIPE,
                                       shell=True)

That's related to the issue @stas00 opened here: #1126

And actually, I have DeepSpeed compiling and linking to a libaio that is installed via conda, rather than the system install from Redhat. I just hacked the builder.py to find the Redhat rpm to enable the build.

stas00 · 2021-07-16T00:16:16Z

I wrote the code to handle 3 flavors of linux here: #1126 (comment)
I think solution #2 would be the easiest in the long run IMHO.

@adammoody, if you'd like to integrate and PR it that would be super helpful! Thank you! That is if @tjruwase is in favor of using that solution and not others (or some other solutions).

tjruwase · 2021-07-16T00:33:40Z

Solution #2 sounds good to me, and such a PR would be greatly appreciated. Thanks @stas00 and @adammoody.

tjruwase · 2021-07-16T00:35:00Z

I'm guessing both of these might be slow compared to x86, since on PowerPC, both fall back to use a non-SIMD implementation. Can you judge performance from this?

SIMD perf is not important to aio as it is an NVMe library.

For cpu_adam, understanding perf implication is a bit more tricky as it heavily uses SIMD and MKL. So it might require experiments with multi-billion parameter models with offloading to get the practical impact. You might find these scripts useful for starters.

adammoody · 2021-07-16T00:48:17Z

Sure, I'll work on solution #2 to get started.

However, I suspect my conda setup doesn't quite fit properly with those three package managers, since my build isn't actually using the libaio from the rpm. Solution #4 to test a compile/link might work. Maybe something with compiler.has_function like described on this page would work?

https://www.cac.cornell.edu/wiki/index.php?title=Python_Distutils_Tips#How_to_find_if_a_library_exists

compiler=distutils.ccompiler.new_compiler()
if compiler.has_function('timer_create',libraries=('rt',)):
  user_macros.append(('HAVE_POSIX_TIMER','1'))

I can investigate that, as well. In my build of DeepSpeed, I'm externally setting CC, CXX, and CFLAGS to point to conda-specific items (like the include path to libaio.h) and that gets picked up in the build.

stas00 · 2021-07-16T00:53:51Z

I think there should be an easy way to do this for a typical user, which is what approach 2 was trying to do and then allow anybody to do it in their own way by overriding the default checking.

The problem with solution 4 is that we can't tell the user what to do if it fails, so we still need to know what system they are on and what library to tell them to install - as you can see they are differently named on different flavors of linux.

Chances are very low that users will already have this library installed.

So it might be a combination of different solutions.

Also may I ask why the rpm library doesn't work for you? No sudo access to install it?

adammoody · 2021-07-16T03:55:08Z

Thanks @stas00 , I'll stick with the plan to implement solution 2.

In my case, I don't have sudo access on this particular system. The system admins have installed the libaio-devel rpm on the host system.

Having said that, in this case, I'm trying to avoid using the existing system install. That's because I'm using a base install of IBM's OpenCE for PyTorch and other software, which uses conda. Under this conda environment, it's best practice to only build against software installed within the same conda environment rather than using "external" system-installed packages. I don't think this will be a common problem for others, though other OpenCE users might be interested.

stas00 · 2021-07-16T18:16:09Z

Thank you for explaining your particular needs.

This is great, so as suggested earlier the solution should include the generic automatic package-manager-based approach which should cater to the majority, and the manual solution for when either the automatic version doesn't cover the yet to be supported platform and when one doesn't want to use the former and wants to provide their own (your case). So there needs to be a way to tell the builder - I'm taking over, here is all the details you need.

FarzanT · 2023-02-14T15:35:34Z

Hi,
I wanted to add that a minor issue still persists on PowerPC architectures, cause by the following:

DeepSpeed/deepspeed/ops/adam/cpu_adam.py

Line 78 in 98cc35b

self.cpu_vendor = get_cpu_info()["vendor_id_raw"].lower()

Which results in KeyError: 'vendor_id_raw'.
I fixed it by simply replacing it with self.cpu_vensor="PowerPC". I guess the string doesn't really matter.
Should I create a new pull request?

stas00 · 2023-02-14T18:25:00Z

Should I create a new pull request?

This is an excellent idea, @FarzanT

* #1213: Fix CPUAdam for when `vendor_id_raw` is not provided * formatting (yapf) fix --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* microsoft#1213: Fix CPUAdam for when `vendor_id_raw` is not provided * formatting (yapf) fix --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* refactor to use mem_access (#2317) * add quant unit test (#2315) * add quant unit test * add codeowner * format fix * fix undefined symbol: curandSetPseudoRandomGeneratorSeed * modify ref fn name and add comment * add comments * add 4bit quant 16groups * fix * modify groups in ref code * parameterize tensor shape * single param * detach tensor * remove -lcurand flag * add back -lcurand flag Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * only override forward if using cuda-graph (#2291) * Add more options to inference benchmark (#2325) * bump to 0.7.4 * MOE residual matmult unit test (#2323) MOE residual matmul unit tests Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * MOE matmult with memaccess (#2336) * Fix formatting * Remove redundant variable * Refactor residual add kernels (#2333) Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * mem access for quantize kernel (#2331) * mem access for quantize kernel * format * format fp32 * modify quant kernel * modify quant kernel2 * modify format * format * fix comments in pytest * fix comments in pytest * format * rerun Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * increase min pre-commit versions (#2346) * Extend scratch buffer for long prompts (#2212) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fix zero docs (#2350) * Inference profiling updates/fixes (#2348) (#2349) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Kernel Data Conversion Utility (#2327) * Unify macro definitions and constants in a single file * Conversion utility implementation. * Fix reversion from formatting * Bugfixes after testing with correct DeepSpeed * Inline markers are available on both HIP + CUDA * Add Onebit Optimzers in __init__ (#2340) Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * docs(mixture-of-experts-inference): fix typo in tuto (#2345) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * download cifar to blob storage (#2342) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Refactor gptj_residual_add kernels for better readability (#2358) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * Updated issue templates (#2363) * Update issue templates * fix cuda invalid config error in dequant kernel (#2362) * format * remove round fn * Add missing pytest fixture scope (#2353) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Extend residual_add kernel tests to conver pre_attn_norm (#2354) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Refactor fused_bias_residual kernels for better readability (#2356) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Capture error message during sweep tests (#2351) * Collect error messages in results.csv Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * fix an exception when recursively casting dicts to fp16 (#2370) * Refactor remaining distributed tests (#2216) * batch of refactored tests * more test refactoring * fp16 test refactor * more refactors * added DistributedFixture class * applied DistributedFixture to first batch of tests as a trial * added DistributedFixture test and documentation * last tests * fixes for refactored tests * remove subdirs in workflow files * fix pytest syntax error * fix another syntax error * update imports * use DistFixture with elastic checkpoint test * missing import * update to shared class tmpdir for elastic test * moved test files * avoid duplicate test file name * last refactor and moving test files * formatting * fix broken import * testing forked AMD tests * update abstract method * use blob storage for accelerate and transformers tests * upgrade torch for acclerate CI Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix the MLP output tensor's shape (#2380) * allow building with latest CUDA (11.8), it is backwards compatible (#2390) * pin transformers version for unit tests (#2402) * Change type to tuple in replace_wo_policy isinstance check (#2387) Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type. Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Molly Smith <mosm@microsoft.com> Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Checkpoint backwards-compatbility workaround (#2384) * Add predicated global load (#2373) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * MII blog post (#2418) Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Fix figure reference (#2419) * [docs] update news items * [docs] add mii repo link * Add SLURM Multinode Runner (#2404) Signed-off-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix issue with corrupted output on long generation for GPT (#2359) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * MII blog title update on Readme * DeepSpeed-MII title change in website * Fix GPT Neo-X multi-gpu inference (#2401) Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * MII-Public and MII-Azure subheading in mii post * CI fixes related to triton (#2422) * [docs] update mii blog title (#2423) * add SD injection policy (#2381) Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * fix checkpoint loading when it is a dictionary (#2425) * Make error regex more generic in collect_results.py (#2415) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fixes #2389 (#2411) truncating expert param storage for checkpointing Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Fix for inference gpt-j test (#2430) * fix for gpt-j failing due to tokenizer error * limit number of gpt-j tokens generated due to low memory * Fixing bug 2361 (#2410) * fixing bug 2361 * adding pytest for config initialization * chaning expected output to FusedAdam * remove print statement * running yapf on modified files * running pre-commit formatting Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Universal checkpoint for zero stage 1 (#2284) * Refactor universal checkpointing and tensor fragments * Formatting * Support zero stage1; Expand TP dim * Remove debug prints * Detect sharded optimizer state * Format fixes * Encode reshaping guide * More symbolic constants Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * only add deps if extra is explictly called (#2432) * Add TestInjectionPolicy inference unittest class for testing custom injection policies (#2426) This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies. This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API. The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified. This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see GH-2387). Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * [memory estimators] new config args sync (#2431) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * parallelize writing of layer checkpoint files across data parallel instances (#1419) * parallelize layer checkpoints across data parallel groups * use partition_uniform to determine start/end index values * formatting fix * config: add option for parallel write of layer checkpoints in pipeline stage * yapf fixes * enable parallel layer write according to config param * avoid extraneous makedir when rank 0 writes all layers Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix broken link to DeepSpeed Megatron fork (#2440) Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> * bump to 0.7.5 * Fix Bug #2319 (#2438) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * update pytorch pool operator function signiture (#2443) * update pytorch pool operator function signiture * fix the case where kwargs is None * Fix build issues on Windows (#2428) * Fix build issues on Windows * small fix to complie with new version of Microsoft C++ Build Tools Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * rollback ds config changes (#2395) * rollback ds config changes * fix format * Fix error when output_file is a relative path without a prefix (#2397) Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com> * fix restuls and exprs path to use absolute path * write out optimial config after tuning * fix format * assert tuning result dir creation Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Use CUDA events for inference model profiling (#2371) * use cuda event timers for model profiling * Fixing a mismatch in basic adam test. (#2447) * Reduction Kernel Utility (#2436) * Initial reduction_utils.h implementation * Add initialization helper, ensures correct min/max behavior * Remove unnecessary warp sync * deepspeed/launcher/launch.py: add option '--enable_each_rank_log logdir' (#2409) * Fixes for various CI problems (#2457) * check only major CUDA version in CI * update expected torch latest version * pin torch latest to 1.12 until issues with 1.13 are resolve * wrong expected torch version * Update nv-torch18-v100.yml * remove forked from pytest option due to cuda re-initialization errors * removed expected torch version from inference tests, causing errors currently * fix various bugs that popped up * move all tests over to cu111 runners, cu113 runners having problems * Cache Allocation and Softmax Fixes (#2433) Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fixing the checkpoint loading at inference-engine (#2429) Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> * Create a new folder structure to isolate model-specific code in DS (#2464) * don't gather partitioned activations for mp size 1 (#2454) * don't gather partitioned activations for mp size 1 * add inline comment for the change Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Updating autotune json default in docs. (#2476) * Updating autotune default in docs. * Running pre-commit. * Added MLFLOW environment variables for logging metrics within trainig… (#2477) * Added MLFLOW environment variables for logging metrics within trainign script * exporting MLFlow env variables from AML env Co-authored-by: Cheng Li <pistasable@gmail.com> * fix accelerate link (#2481) * Add correct memory-allocation at DeepSpeed-Attention (#2474) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * Fix CI issues related to cupy install (#2483) * remove any cupy install when setting up environments * revert previous changes to run on cu111 runners * fix for when no cupy is installed * remove cupy uninstall for workflows not using latest torch version * update to cu116 for inference tests * fix pip uninstall line * move python environment list to after DS install * remove cupy uninstall * re-add --forked * fix how we get cupy version (should be based on nvcc version) * [docs] add SD tutorial to news * [docs] add SD tutorial to deepspeed.ai news * Add `scale_attn_by_inverse_layer_idx` feature (#2486) * Add scale_attn_by_inverse_layer_idx feature * Fix layer_id bug * Fix scaling value Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * Stable Diffusion Enhancements (#2491) Co-authored-by: cmikeh2 <connorholmes@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> * stage_1_and_2.py: no allreduce needed when mp size is 1 (#2494) * Make bf16_optimizer work for non pipeline (#2470) * Fix nightly CI tests (#2493) * fix for lm-eval nightly tests and add gpt-j to MPtest because OOM on single GPU * add nv-nightly badge * Make data contiguous before the inplace reshape-copy_ function (#2489) Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Fix typos: deepseed -> deepspeed (#2499) * bump to 0.7.6 * DeepSpeed inference config. (#2459) (#2472) Changes to inference API to use accept a config dict and cleaning up Inference Engine to utilize the newly added inference config. Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Update docs to autogenerate pydantic config model docs (#2509) * update zero config docs * add autogenerated docs for pydantic models used in ZeRO and Inference configs * Add max_tokens alias to max_out_tokens arg to maintain backwards compatibility (#2508) This PR adds a max_tokens alias to the max_out_tokens argument in the init_inference API to support backwards compatibility after the config refactor PR https://github.com/microsoft/DeepSpeed/pull/2472. Thanks @molly-smith and @mrwyattii. * Deepspeed quantization library v0.1 (#2450) * Initial commit Deepspeed quantization library * Match function signatures * Add Quantization Kernel * adding offset comparision and precommit changes * format fixes * FIle name changes * pt_binding_changes * test name change * Integer quantization, minor refactors * Add directed test_case * format fixes * Move param calculation to constructor of params class * Use local function and add elemsPerBlock * change function to be specalized * sub block reduce * add new schedule * Add new schedule test case * fix illegal writes in sch1 * Style fixes in comments Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * Fix backward compatibility for InferenceConfig (#2516) * Make new InferenceConfig backwards compatible with previous init_inference API Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Add missing Inference sub-configs (#2518) * Add note about nvcc/hipcc requirement (#2519) * Update codeowners (#2525) * Initial dequant library implementation (#2521) * Fixes for torch 1.14 due to new torch.numel return type (#2522) * fixes for new torch.numel return type * address comment * Ensure is initialized for SD (#2534) * Make DS-Inference config readable from JSON (#2537) * Add MII tests (#2533) Adding MII tests to ensure changes to DS-Inference do not break MII * Remove mutable default parameter in init_inference() (#2540) A mutable default value is dangerous because editing it will change the value for all future calls to the function. The value is itself edited later in the function, so this problem will likely be encountered sooner or later. Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Change Where DS/Triton is Used in Stable Diffusion (#2536) * Change utilization of DS/Triton kernels * add config at Clip-encoder Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * Pass down the new DS inference config to replace_transformer_layer. (#2539) * pass down the new DS inference config to replace_transformer_layer. * remove quantize_settings and rename the ep_mp_group. * Fix model_config passing. Fixes gptj issue with wrong output. * fix small bug in gpt-neo. Co-authored-by: Reza Yazdani and Michael Wyatt * Adding Gradient Accumulation Data Type Config (#2512) * Adding gradient accumulation dtype config. * Switching to new DtypeEnum * Adding standalone check function, and unit tests * Variable disambiguation * Adding checks for unsupported states. * Updating for PR comments. * Reorganizing unit test. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Report progress at gradient accumulation boundary (#2553) * report progress at gradient accumulation boundary * format * format * encoded ds config into command line argument when launching child processes in autotuning (#2524) * rollback ds config changes * fix format * Fix error when output_file is a relative path without a prefix (#2397) Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com> * fix restuls and exprs path to use absolute path * use base64 encoded ds config as cmd arg * fix format * remove assert * write out optimial config after tuning * fix format * no need to update ds config path when encoding ds config * udpate * do not use abs path for result and expr dir * fix conflicts * fix run mode * fix format * fix format Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * add missing moe deprecated fields to inference config (#2556) * Abstract accelerator (step 1) (#2504) * Establish building block of abstract accelerator * Change .*Tensor variable to @property * [op builder] add op builder reflection to allow enumerate of builders in all_ops.py and builder_names.py * change @abstractproperty to @property @abstractmethod Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix invalid check of recorded parameter orders in zero stage3. (#2550) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * bump to 0.7.7 * docs: Update the recent url for Megatron-LM (#2564) * use get_global_rank if available (#2567) * Add Determined to open-source DL frameworks (#2573) * Support fp32 gradaccum for bf16 model (#2566) * allow bf16 model with fp32 gradient accumulation datatype * allow fp32 gradient accumulation and bfloat16 model in amp mode * alternative fix for grad accumulation type mismatch. In the case of zero optimizer we should have grad accum type == model data type Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Drop Maxwell Support (#2574) * Officially drop Maxwell support * Formatting * Comparison mismatch fix * Fix quantized-inference & Add generic support of checkpoint loading (#2547) * fix checkpoint loading when it is a dictionary * fix some issues with saving ckpt & int8 inference * fix quantized-inference & add generic support of checkpoint loading * remove int8 hard-coded flag * fix mlp return tensors * fix several issue to load checkpoints of GPT-J, GPT-NEOX, and OPT with different TP-size * add more comments & description for checkpoint-loading module Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Fix MegatronLayerPolicy to have megatron_v2=True (#2579) This PR updates the MegatronLayerPolicy to set megatron_v2=True, which is required in order to properly transpose in the replace_with_policy() function. After the change in this PR, in conjunction with PR #99 in the Megatron-DeepSpeed fork, the Megatron text-generation example works with DS inference. * Update barrier and reduce_scatter_base to conform to PyTorch signatures (#2570) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Support N-dimension input in quantization kernel (#2575) * Add support for inputs > 2D * use vec * Add N-Dim support to Dequant kernel * merge master and fix format * Bug Fix * fix num_bits * Fix dequant Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * Add checkpoint sharding unit tests (#2561) * added checkpopint sharding tests * Updating docs README (#2587) * Updating docs README with API update procedure. * Addressing comments. Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Updating API docs (#2586) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix issues w. python 3.6 + add py-version checks to CI (#2589) * get mask token from tokenizer (#2592) * bump to 0.7.8 * DeepSpeed Data Efficiency Library (#2585) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fix blog link (#2600) * Migrate ops tests to new inference_ops marker (#2599) * Migrate ops tests to new inference_ops marker * Disable by default * Add missing test cases * Reorder such that inference_ops will run[fail] first * Move layer norm to new schedule (#2590) * Move layer norm to new schedule * Pre-commit fixes * fix comments * format fixes * Merge unrolls * format fixes * camelCase * format fixes * revert unwanted file * move pow2 function * format fixes Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * [deepspeed/autotuner] Bug fix for binary search for batch size (#2162) * bug fix for binary search for batch size * fix binary search termination condition * add fix for older pydantic versions (#2611) * Use rocm/pytorch:latest (#2613) * skip torch.zeros and tensor.copy_ when model parallel is not used (#2479) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * call empty_cache to really free up GPU memory as described in comment (#2620) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Remove GatheredParameters context from replace_with_policy (#2591) This PR removes the zero-infernece GatheredParameters context from replace_with_policy due to no longer needing zero-inference after the introduction of meta tensor support for BLOOM. * fixes #2498 (#2603) taking gradient accumulation steps into account for throughput calculation Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Update AVX512 Detection (#2621) * Update cpuinfo AVX512 detection * Missing conversion from `_mm256` to `_mm256i` Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Add Megatron CI workflow (#2614) * added megatron unit test * Update nv-megatron.yml Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * [inference] check for unsupported model generate args (#2627) * [launcher] parse hostfile via regex and added error checks (#2626) * Unit tests setup own venv (#2628) add reusable workflow that sets up fresh venv for each test and prints relevant environment info * add enable_each_rank_log to deepspeed/launcher/runner.py (#2571) * Fix typo in autotuner.py (#2639) * [zero-3] Handle forward parameter return correctly in nested cases (#2642) Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * [inference] ds-attention refactor w.r.t. ops (#2623) * Fix issue w. bloom when changing tp size (#2645) * fix assertion error in zero stage 3 (#2647) * tweaks to ds-attn, distilbert policy, and mup (#2649) * [doc] fix `min_loss_scale` default (#2660) * [doc] fix `min_loss_scale` default * align * [launcher] fail gracefully if hostname -i doesn't work as expected (#2631) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix Opt injection (#2541) * fix Opt injection & add injection verification check at inference test * fix several issues * remove fixture * remove check_injection when no kerenl is injected Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Abstract accelerator (step 2) (#2560) * Abstract accelerator (step 2) * more flex op_builder path for both installation and runtime * add SpatialInferenceBuilder into cuda_accelerator.py * use reflection to make cuda_accelerator adapt to CUDA op builder change automatically * clean up deepspeed/__init__.py * add comments in cuda_accelerator for no torch path * Update deepspeed/env_report.py Change env_report.py according to suggestion Co-authored-by: Michael Wyatt <mrwyattii@gmail.com> * reduce the range of try...except for better code clarity * Add porting for deepspeed/ops/random_ltd/dropping_utils.py * move accelerator to top directory and create symlink under deepspeed Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Michael Wyatt <mrwyattii@gmail.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Remove unnecessary device synchronization for stage 2 (#2500) * Remove unnecessary device synchronization for stage 2 * Remove unnecessary device synchronization for stage 2 Co-authored-by: liyidong.lyd <liyidong.lyd@alibaba-inc.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * [Bug Fixed] use torch.cuda.is_available() (#2661) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * [fp16] lower initial_scale_power (#2663) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * fix Tensor contiguous bug in model_compression (#2671) double check the unit tests * [inference] ds-mlp refactor w.r.t. ops (#2668) * real_accelerator validation check for both accelerator and deepspeed.accelerator path (#2685) * remove duplicated code in ZeRO (#2655) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Add mlflow logging for aml (#2495) * add logging changes * try w/out abspath * undo last change * start mlflow debug * remove mlflow from export_envs * add mlflow logging for reversed * remove mlflow.start_run * add back start run * don't clean cmd * print os environment variables * remove first start run * add run_id to mlflow star * remove context managers * move last end run * add extra parent start_runs * add run id logging * add logging to run_ds_config * change run_id to run_name * add back context managers and run_id logs * remove context mng * debug environment variable * reset environment variables * add env variable deletion * clean up * remove unused import * fix yapf/whitespace errors Co-authored-by: Cheng Li <pistasable@gmail.com> * fix import path to op_builder (#2687) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Pass training flag to forward call from Eval (#2604) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> * Extend quantization utils features (#2683) * Extend quantization utils features * remove unwanted files * fix cahce setting Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * [GatheredParameters] add support for any iterator (#2664) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * fix for latest diffusers (#2699) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * exclude benchmarks during install (#2698) * using correct loss scale in zero step (#2695) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * non-MoE stage 1 requires CG disabled (#2703) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * remove print side effect from importing deepspeed (#2704) * ZeRO3 handling frozen weights] (#2653) * bump to 0.8.1 * CUDA optional deepspeed ops (#2507) * CPU-Adam: add compile-flag to enable param-copy from CPU to GPU * guarde the CUDA-related include files and variables * remove CUDA dependency from op_builder when building against CPU * fixing the builder issues * fix formatting * return true when there is no mismatch on the cuda version * guard for when cuda is not available & test with cpu-only environment * Update cpu_adam and cpu_adagrad * Format fixes * Add configurable half precision type; Build/run in CUDA environment * Run cpu_adam and cpu_adagrad in cpu only environment * Mark CUDA only unit tests * CPU environment CI * Format fixes * Remove --forked * Add --forked * CPU only CI should pass * Format fixes * Format fixes * Remove scattered pytest.skip * Fix cpu_adam unit test * Update .github/workflows/nv-torch-latest-cpu.yml Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Update .github/workflows/nv-torch-latest-cpu.yml Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Address PR feedback * OpenMP linking * Fix unit tests Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * remove master branch from CI triggers (#2712) * [install] only add deepspeed pkg at install (#2714) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * update for lm-eval==0.3.0 (#2713) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * BF16 optimizer for BF16+ZeRO Stage 1 (#2706) * BF16 optimizer only with ZeRO stage 1. * Updating to grad accum of fp32 for BF16 ZeRO1 case. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * fix typo (#2718) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Inference Refactor (replace_with_policy, model_implementations) (#2554) Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Change zero_grad() argument to match pytorch (#2741) * Automatic tensor parallelism v2 (#2670) * loop through pipe.model * tp_parser first draft * client_module must be type object * Simplify layernorm tracking. Add unittest. * cleanup * Add more models to unittest * cleanup inference pytest for merging * Add unittest * cleanup * pre-commit * unittest id and pytest marker * try marian for unittest * precommit * Move tp code to seperate file * Add new auto tp file * pre-commit and type * Update deepspeed/module_inject/auto_tp.py Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Update deepspeed/module_inject/auto_tp.py Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Update tests/unit/inference/test_inference.py Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * remove unused fillmask function Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * fixing optimizer sanity check (#2742) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * [GatheredParameters] fix memory leak (#2665) * [GatheredParameters] fix memory leak * simplify * cleanup and move * style * Formatting * fix test * fix test * fix test take 2 * Trigger CI Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> * Abstract accelerator (step 3) (#2677) * Integrate accelerator abstraction interface into deepspeed/ * Fix error message in fp16/fused_optimizer * fix error message in fp16/unfused_optimizer.py * assign get_accelerator().pin_memory() result to input Tensor name * no need to check cuda and whether nvtx supported * move try-except into inner most block * call Event() and Stream() in get_accelerator() for data type * Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed * Apply op_builder backend api change from #2705 from @jeffra * fix tests where Builder NAME is used * keep original ...Builder.NAME interface instead of ...Builder().NAME interface * fix builder closure for installation * fix randomltd builder * add comments to clarify create_op_builder and get_op_builder * fix compatibility with pip install -e Co-authored-by: Cheng Li <pistasable@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix autotuning so that it records Floating Point Operations per second, not microsecond (#2711) * Fix how autotuning reports TFLOPS so that they are reported in FLOPS per second, not millisecond Co-authored-by: Nick Sarkauskas <nsarka00@gmail.com> Co-authored-by: Quentin Anthony <anthony.301@osu.edu> Signed-off-by: Dashiell Stander <dstander@protonmail.com> * Actually it is microseconds -> seconds Signed-off-by: Dashiell Stander <dstander@protonmail.com> * Actually it is microseconds -> seconds Signed-off-by: Dashiell Stander <dstander@protonmail.com> Signed-off-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Nick Sarkauskas <nsarka00@gmail.com> Co-authored-by: Quentin Anthony <anthony.301@osu.edu> * fix a mispelled attribute (#2750) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * [zero] remove misleading dtype log (#2732) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Fix softmax backward (#2709) * Reset KV-cache at the beginning of text-generation * Add new backward kernel to handle large softmax-length * remove unrelated changes Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> * Skip test_bias_gelu unit test if torch < 1.12 (#2754) This PR adds a torch version check in the test_bias_gelu unit test to skip if the torch version < 1.12. This is due to gelu implementation differences in versions prior to 1.12. * Add environment variable to make nvcc compilation more verbose (#2759) * Bing/formatting correction (#2764) * modify engine.py for formatting * commit formatting changes on engine.py * Add links to new azureML examples (#2756) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743) * Remove hardcoded instances to fp16 in log messages. * Add model_dtype to print the correct format * Respond to PR feedback --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Refactor/Pydantify monitoring config (#2640) * pydantify monitoring configs --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Pin minimum `packaging` requirement (#2771) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix for diffusers v0.12.0 (#2753) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * some fix in flops_profiler (#2068) * bugs in profiler: 1. Tensor.bmm missed in _patch_tensor_methods function 2. missed funtions in _reload_functionals and _reload_tensor_methods functions 3. torch.mm and torch.Tensor.mm will have same __name__ in wrapFunc, my suggustion is use __str__ instead. * formatting --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Cheng Li <pistasable@gmail.com> * fix upsample flops compute by skipping unused kargs (#2773) * fix upsample flops compute by skipping unused kargs * fix format * Fix broken kernel inject bug (#2776) * Fix Checkpoint-loading with Meta-tensor (#2781) * Reset KV-cache at the beginning of text-generation * Pass the ckpt-loading arguments to work with meta-tensor * remove unrelated changes * add support for hjson config files (#2783) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Reset KV-cache at the beginning of text-generation (#2669) Co-authored-by: Martin Cai <martincai@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Container param cleanup + remove qkv_merging (#2780) This PR cleans up some container items and removes an unused qkv_merging parameter: - Remove qkv_merging=True from BERT containers - Change containers config object to ds_model_config - Remove qkv_merging param * Common location to install libaio-dev (#2779) * Common location to install libaio-dev * Update .github/workflows/setup-venv/action.yml Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> --------- Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Fixing broken link to azureml-examples recipes (#2795) * remove outdated comment (#2786) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Enable page-locked tensors without CUDA (#2775) * Enable page-locked memory in cpu only env * Enable page-locked memory in cpu only env * Formatting * Add TODOs; Release page-locked memory * Update perf microbenchmark; Reduce unit test memory * Reduce CI mem usage * Add container load checkpoint error reporting + refactor (#2792) This PR refactors the organization of meta tensor checkpoint loading as follows: - Move get_param_names() abstract method definition from TransformerPolicy into MetaTensorContainer - Model-specific get_param_names() definitions moved from policy into model-specific container - selected_policy_g, megatron_v2_g, and transformer_config_g globals replaced with a single container_g global, since the container will contain all of the information those globals previously captured - ckpt_load_enabled flag added to containers that's set to False by default in the base.py container and gets set to True when the MetaTensorContainer feature is inherited - Assertion added to replace_transformer_layer before performing checkpoint loading to check if ckpt_load_enabled ==True, otherwise an error message will be printed saying that the container does not support meta tensor checkpoint loading. The aim of these changes is to more closely couple meta tensor checkpoint loading code to the MetaTensorContainer and to allow for better error reporting of load checkpoint use on model types that don't support this feature. * Add user defined launcher args for PDSH launcher (#2804) * Add user defined launcher args for PDSH launcher * Formatting fixes * Fix Slurm launcher user args (#2806) Fix missing connections from --launcher_args to Slurm srun command. * Handle hanged tests in CI (#2808) * Fix inference CI device error (#2824) * Fix permissions issue with pip upgrade (#2823) * fix permissions issue with pip upgrade * install to .local instead of use sudo * upgrade pip in venv * Update action.yml * fix typos * Fix cpu-only CI hangs (#2825) * don't run tests in parallel * make AsyncIO test sequential * Fix Pipeline Parallel resize unit test (#2833) * fix overlapping checkpoint names in unit tests * remove running cpu-only on master merge * Fix auto TP for duplicate modules with different gems (#2784) * Fix auto TP for duplicate modules with different gems * precommit and comments * Comment * Combine gem list of same named modules * remove duplicates from gem_list before updating policy * Add module attribute with name variation for ProphetNet --------- Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Refactor DS inference API. No longer need replace_method. (#2831) Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Port Reza's INT8-quantization fix to container architecture (#2725) Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Heyang Qin <heyangqin@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Fix gpt-Neox rotary embedding implementation (#2782) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * fix for cpu-only tests (#2849) * bump to 0.8.2 * add auto-generated PR workflow (#2822) * add auto-generated PR for private repo * change variable names * fix typo in autosync workflow (#2850) * Fix example command when building wheel with dev version specified (#2815) * Create tensor parallelism blog/tutorial (#2766) Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Data efficiency library update (#2866) * data efficiency library update * data efficiency library update * data efficiency update * data efficiency update * Make z3 respect comm dtype (#2807) * Make z3 respect comm dtype * Support fp32 comm dtype * Remove obsolete assert * Code cleanup * Automatic Tensor Parallelism Blog Links (#2877) * Modify table for compatible web format * Add tutorial links to navigation * Add news bit to main readme * Update docs/_tutorials/automatic-tensor-parallelism.md Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> --------- Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * Check device count before running dist tests (#2799) * Check device count before running dist tests * fixing format for "Check device count before running dist tests" * Check device count against max world size * Check GPU count before launching dist tests * double-check GPU actually exists --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * AutoTP tutorial web formatting and news (#2883) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Remove deprecated `torch._six` imports (#2863) * Remove deprecated `torch._six` imports Closes #2845. * Support older versions of PyTorch as well. --------- Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Reduce I/O size (#2814) * add missing license info to top of all source code (#2889) Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Enable tensor fragments for zero 2 & 3 (#2727) * Enable tensor fragments for zero 2 * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Support offload * Support multi-gpu * Cleanup * WIP * Update deepspeed/runtime/zero/stage3.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Support padding * Update deepspeed/runtime/zero/stage3.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * z3 optimizer state support; aligned api * Support frozen z3 params * Unit tests * Check NVMe offload capability * Formatting * Docs * More docs * More docs * Update docs/code-docs/source/zero3.rst Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * More docs * Update docs/code-docs/source/zero3.rst Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * More docs * More docs * Update docs/code-docs/source/zero3.rst Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * More docs * Support unsharded fp32 grad * Remove debug prints * Fix off-by-one detection of empty grads * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Update deepspeed/utils/tensor_fragment.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Update deepspeed/runtime/zero/stage3.py Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> * Fix off-by-one error * Skip ranks with no gradient data * Formatting * Add license * Fix license --------- Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> * better eval sampler (#2907) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * using container when loading inference checkpoints (#2875) This PR updates the replace_fn function when loading inference checkpoints. The container will now be passed to the load_model_with_checkpoint() so we can call load_params() from there. load_params() is also updated to access the variables in the policy. * Fix CPUAdam for when `vendor_id_raw` is not provided (#2836) * #1213: Fix CPUAdam for when `vendor_id_raw` is not provided * formatting (yapf) fix --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Always convert input mask to half (#2851) * Fixes `AttributeError` in #2853 (#2854) Updates `deepspeed/monitor/monitor.py` to instantiate objects with correct configs Relevant issue: https://github.com/microsoft/DeepSpeed/issues/2853 Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Add MPICH Multinode Runner (#2839) * MPICH support * MPICH changes * MPICH changes * MPICH changes * MPICH changes * accelerator runtime modifications * Accelerator runtime changes * Accelerator runtime modifications * Remove redundant print from single node * Move hostfile to tmp * Code cleanup for MPICH class * Code cleanup, rm whitespace * Removing mpiexec environment check details * Not needed tmp hostfile as pass directly * Remove debugging comments * rm print statement * Revert comm changes as WA not needed * Use MPICHRunner name for class * Use MPICHRunner as class name * No need to use args.force_multi and args.launcher . This should be set in deepspeedexamples gpt-3.6b .sh script as: $launcher=MPICH run_cmd=" deepspeed --hostfile=${hostfile_ds} --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}" * Adhere to code pattern * Rm empty lines in MPICHRunner class * Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh * pass MPICH hostfile through launcher_args in gpt-3.6b.sh * Clean code and remove args hostfile * fix merge * fix merge --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> * clean up and fix format * add ut --------- Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * TP unsupported models and assertions (#2810) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * AutoTP Assert Kernel Injection Support (#2939) * check kernel injection supported models * Clarify why user should use kernel injection * Check for local CUDA graphs when enable_cuda_graph=True (#2941) * Improve overflow handling (#2944) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * [RFC] add device abstraction to allow other device than CUDA be used (#2221) Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * deepspeed.init_distributed() support for TCP protocols (#2905) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * bump to 0.8.3 * bug fix for skipping mbs (#2171) Co-authored-by: Rajhans Samdani <rajhans@gmail.com> * Fix issue between our abstract accelerator and colossalai's version of op_builder (#2963) Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> * [zero] prevent poor configs from running w. zero-offload (#2971) --------- Signed-off-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com> Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Saeyeol Lee <78332687+l4d2boomer@users.noreply.github.com> Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org> Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com> Co-authored-by: Matt Smith <matt@mjksmith.com> Co-authored-by: Thomas-MMJ <112830596+Thomas-MMJ@users.noreply.github.com> Co-authored-by: lekurile <113481193+lekurile@users.noreply.github.com> Co-authored-by: Lev Kurilenko <lekurile@microsoft.com> Co-authored-by: Molly Smith <mosm@microsoft.com> Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Dashiell Stander <dstander@protonmail.com> Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal> Co-authored-by: Andrey Chernykh <andrew.chernyh@gmail.com> Co-authored-by: Alexander Jipa <alexander.jipa@gmail.com> Co-authored-by: Alexander Jipa <azzhipa@amazon.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Adam Moody <moody20@llnl.gov> Co-authored-by: Cheng Li <pistasable@gmail.com> Co-authored-by: eltonzheng <eltonz@microsoft.com> Co-authored-by: Benjamin Steenhoek <benjaminjsteenhoek@gmail.com> Co-authored-by: Guo Yejun <yejun.guo@intel.com> Co-authored-by: savitamittal1 <39776179+savitamittal1@users.noreply.github.com> Co-authored-by: kyoto7250 <50972773+kyoto7250@users.noreply.github.com> Co-authored-by: Kevin Ko <gusdnd852@naver.com> Co-authored-by: lokoppakmsft <112720551+lokoppakmsft@users.noreply.github.com> Co-authored-by: iLeGend <youzhi.jin@intel.com> Co-authored-by: Alex Hedges <aphedges@users.noreply.github.com> Co-authored-by: ShijieZZZZ <116392778+ShijieZZZZ@users.noreply.github.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com> Co-authored-by: AGUL <mingzhi.liu@intel.com> Co-authored-by: Jeongseok Kang <jskang@lablup.com> Co-authored-by: Hayden <hayden.barnes@hpe.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Rahil Bathwal <87332510+rahilbathwal5@users.noreply.github.com> Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Co-authored-by: Ikko Ashimine <eltociear@gmail.com> Co-authored-by: Michael Wyatt <mrwyattii@gmail.com> Co-authored-by: li-yi-dong <73142299+li-yi-dong@users.noreply.github.com> Co-authored-by: liyidong.lyd <liyidong.lyd@alibaba-inc.com> Co-authored-by: JackieWu <wkcn@live.cn> Co-authored-by: Xiaoxia (Shirley) Wu <94406484+xiaoxiawu-microsoft@users.noreply.github.com> Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: loadams <114770087+loadams@users.noreply.github.com> Co-authored-by: Nick Sarkauskas <nsarka00@gmail.com> Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com> Co-authored-by: Martin Cai <martincai@users.noreply.github.com> Co-authored-by: Razvan Tanase <ratanase@microsoft.com> Co-authored-by: Heyang Qin <heyangqin@microsoft.com> Co-authored-by: Yasyf Mohamedali <yasyfm@gmail.com> Co-authored-by: Mayank Mishra <32954280+mayank31398@users.noreply.github.com> Co-authored-by: Farzan Taj <farzantaj@outlook.com> Co-authored-by: Sam Foreman <saforem2@gmail.com> Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com> Co-authored-by: noabauma <62301037+noabauma@users.noreply.github.com> Co-authored-by: Rajhans Samdani <rajhans@gmail.com>

adammoody requested review from awan-10, cli99, conglongli, eltonzheng, jeffra, minjiaz, niumanar, RezaYazdaniAminabadi, samyam, ShadenSmith and tjruwase as code owners July 9, 2021 02:06

enable cpu adam operation on powerpc

9c7fadb

adammoody force-pushed the ppcadam branch from 10e444d to 9c7fadb Compare July 9, 2021 02:09

adammoody mentioned this pull request Jul 9, 2021

[FEEDSTOCK REQUEST] DeepSpeed v0.3.16 open-ce/open-ce#415

Closed

Merge branch 'master' into ppcadam

4297089

jeffra added 2 commits July 12, 2021 09:41

fix formatting

6bdfbe6

Merge branch 'master' into ppcadam

dde4dff

tjruwase approved these changes Jul 12, 2021

View reviewed changes

tjruwase merged commit f65ff90 into microsoft:master Jul 12, 2021

adammoody deleted the ppcadam branch July 13, 2021 08:14

stas00 mentioned this pull request Jul 14, 2021

[testing] failing tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_stage3_nvme_offload huggingface/transformers#12715

Closed

adammoody mentioned this pull request Jul 23, 2021

test link against libaio using distutils #1247

Merged

FarzanT added a commit to FarzanT/DeepSpeed that referenced this pull request Feb 15, 2023

microsoft#1213: Fix CPUAdam for when vendor_id_raw is not provided

7408f8c

FarzanT mentioned this pull request Feb 15, 2023

Fix CPUAdam for when vendor_id_raw is not provided #2836

Merged

tjruwase added a commit that referenced this pull request Feb 28, 2023

Fix CPUAdam for when vendor_id_raw is not provided (#2836)

9886d6d

* #1213: Fix CPUAdam for when `vendor_id_raw` is not provided * formatting (yapf) fix --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enable cpu adam op on powerpc architectures #1213

enable cpu adam op on powerpc architectures #1213

adammoody commented Jul 9, 2021

ghost commented Jul 9, 2021 •

edited by ghost

Loading

adammoody commented Jul 12, 2021

tjruwase commented Jul 12, 2021

tjruwase commented Jul 12, 2021

adammoody commented Jul 13, 2021

stas00 commented Jul 14, 2021

stas00 commented Jul 14, 2021

adammoody commented Jul 15, 2021

tjruwase commented Jul 16, 2021 •

edited

Loading

adammoody commented Jul 16, 2021

adammoody commented Jul 16, 2021 •

edited

Loading

stas00 commented Jul 16, 2021 •

edited

Loading

tjruwase commented Jul 16, 2021

tjruwase commented Jul 16, 2021 •

edited

Loading

adammoody commented Jul 16, 2021

stas00 commented Jul 16, 2021 •

edited

Loading

adammoody commented Jul 16, 2021

stas00 commented Jul 16, 2021

FarzanT commented Feb 14, 2023

stas00 commented Feb 14, 2023

enable cpu adam op on powerpc architectures #1213

enable cpu adam op on powerpc architectures #1213

Conversation

adammoody commented Jul 9, 2021

ghost commented Jul 9, 2021 • edited by ghost Loading

adammoody commented Jul 12, 2021

tjruwase commented Jul 12, 2021

tjruwase commented Jul 12, 2021

adammoody commented Jul 13, 2021

stas00 commented Jul 14, 2021

stas00 commented Jul 14, 2021

adammoody commented Jul 15, 2021

tjruwase commented Jul 16, 2021 • edited Loading

adammoody commented Jul 16, 2021

adammoody commented Jul 16, 2021 • edited Loading

stas00 commented Jul 16, 2021 • edited Loading

tjruwase commented Jul 16, 2021

tjruwase commented Jul 16, 2021 • edited Loading

adammoody commented Jul 16, 2021

stas00 commented Jul 16, 2021 • edited Loading

adammoody commented Jul 16, 2021

stas00 commented Jul 16, 2021

FarzanT commented Feb 14, 2023

stas00 commented Feb 14, 2023

ghost commented Jul 9, 2021 •

edited by ghost

Loading

tjruwase commented Jul 16, 2021 •

edited

Loading

adammoody commented Jul 16, 2021 •

edited

Loading

stas00 commented Jul 16, 2021 •

edited

Loading

tjruwase commented Jul 16, 2021 •

edited

Loading

stas00 commented Jul 16, 2021 •

edited

Loading