
Samyamr/full precision for ZeRO Stage2 and Stage3 #1004

Merged
21 commits merged into master on Apr 29, 2021

Conversation

samyam (Contributor) commented Apr 23, 2021

No description provided.

stas00 (Collaborator) commented Apr 29, 2021

And just to indicate the priority of this PR: we have all those bfloat16 models that won't train under fp16/mixed precision, and users want to use DeepSpeed to overcome GPU memory limitations, so they badly need this. Thank you!

samyam changed the title from Samyamr/full precision for stage3 to Samyamr/full precision for ZeRO Stage2 and Stage3 on Apr 29, 2021
stas00 (Collaborator) commented Apr 29, 2021

When you feel this looks good enough to test, please let me know and I will start testing this branch on the transformers side. Thank you.

samyam and others added 3 commits April 29, 2021 13:24
The assert that checks whether param.dtype is torch.half for ZeRO3 should only fire if the model was initialized in the ZeRO3 context.
jeffra merged commit dad2642 into master on Apr 29, 2021
jeffra deleted the samyamr/full-precision-for-stage3 branch on April 29, 2021 22:06
stas00 (Collaborator) commented Apr 30, 2021

This is awesome - thank you!

I encountered only one issue:

As I am writing HF transformers tests for fp32, I found that zero.Init doesn't get the dtype from the config file; I have to do it explicitly:

```
ds_config = deepspeed_config()
# XXX: Fixme - we shouldn't need to figure dtype out, it should be in the config file
dtype = torch.float16 if ds_config.get("fp16", {}).get("enabled", True) else torch.float
with deepspeed.zero.Init(dtype=dtype, config=ds_config):
    model = cls(config, *model_args, **model_kwargs)
```

I thought the whole point of passing config to zero.Init was so that we don't need to manually parse the file in multiple places; weren't we discussing making this work:

```
ds_config = deepspeed_config()
with deepspeed.zero.Init(config=ds_config):
    model = cls(config, *model_args, **model_kwargs)
```
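Until zero.Init picks the dtype up from the config itself, a small caller-side helper could at least centralize the lookup. A minimal sketch, assuming the config is the usual dict with an "fp16" section; `dtype_from_ds_config` is a hypothetical name, not part of the DeepSpeed or transformers API:

```python
import torch

def dtype_from_ds_config(ds_config: dict) -> torch.dtype:
    """Hypothetical helper: derive the training dtype from a DeepSpeed config dict.

    The default of True for fp16.enabled mirrors the workaround above and is
    an assumption, not documented DeepSpeed behavior.
    """
    fp16_enabled = ds_config.get("fp16", {}).get("enabled", True)
    return torch.float16 if fp16_enabled else torch.float32

# Usage, mirroring the workaround above:
# with deepspeed.zero.Init(dtype=dtype_from_ds_config(ds_config), config=ds_config):
#     model = cls(config, *model_args, **model_kwargs)
```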

stas00 added a commit to stas00/DeepSpeed that referenced this pull request Apr 30, 2021
I'm not sure if this is the best approach but with microsoft#1004 I still have to pass `zero.Init(dtype)` because this branch never gets executed:
```python
    def _set_dtype(self, ds_config, dtype):
        if ds_config is not None and dtype is None:
            _ds_config = DeepSpeedConfig(ds_config)
            self.dtype = torch.half if _ds_config.fp16_enabled else torch.float
```
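For comparison, a minimal sketch of how that fallback might be structured so an explicitly passed dtype, a config-derived dtype, and a default are all handled. This is only an illustration, assuming DeepSpeedConfig accepts the dict form of the config and that torch.half is the library default; it is not necessarily how the fix landed:

```python
def _set_dtype(self, ds_config, dtype):
    # Sketch only: prefer an explicitly passed dtype, otherwise fall back
    # to the fp16 setting from the DeepSpeed config, otherwise keep the
    # assumed default of torch.half.
    if dtype is not None:
        self.dtype = dtype
    elif ds_config is not None:
        _ds_config = DeepSpeedConfig(ds_config)
        self.dtype = torch.half if _ds_config.fp16_enabled else torch.float
    else:
        self.dtype = torch.half
```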
stas00 mentioned this pull request on Apr 30, 2021