1 file changed: +11 -1 lines changed

@@ -10,7 +10,11 @@ efficient, and effective.
DeepSpeed can train DL models with over a hundred billion parameters on current
generation of GPU clusters, while achieving over 5x in system performance
- compared to the state-of-art.
+ compared to the state-of-art. Early adopters of DeepSpeed have already produced
+ a language model (LM) with over 17B parameters called
+ [Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
+ establishing a new SOTA in the LM category.
+
# Table of Contents
@@ -84,6 +88,12 @@ replicated across data-parallel processes, ZeRO partitions model states to save
significant memory. The current implementation (stage 1 of ZeRO) reduces memory by up to
4x relative to the state-of-art. You can read more about ZeRO in our [paper](https://arxiv.org/abs/1910.02054).
+ With this impressive memory reduction, early adopters of DeepSpeed have already
+ produced a language model (LM) with over 17B parameters called
+ [Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
+ establishing a new SOTA in the LM category.
+
+
## Scalability
DeepSpeed supports efficient data parallelism, model parallelism, and their
combination. ZeRO boosts the scaling capability and efficiency further.
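The ZeRO behavior described in the hunk above is, in practice, driven by DeepSpeed's JSON configuration. As a rough illustration (not an example taken from this PR or the repository), the following Python sketch shows how ZeRO stage 1 might be enabled and the engine constructed with `deepspeed.initialize`. The config keys, the file name `ds_config.json`, the toy model, and the hand-built `argparse.Namespace` are all assumptions for illustration, and the exact config schema has varied across DeepSpeed releases.

```python
# Hedged sketch: enabling ZeRO stage 1 through a DeepSpeed config and
# initializing the training engine. Keys and values are illustrative only.
import argparse
import json

import torch
import deepspeed

# Assumed config: follows DeepSpeed's documented JSON format, but the
# ZeRO-related keys have changed between releases, so check your version.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    # Stage 1 partitions optimizer states across data-parallel ranks,
    # which is the memory reduction the section above describes.
    "zero_optimization": {"stage": 1},
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

# Stand-in model; a real run would use a large transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# deepspeed.initialize expects an args object carrying local_rank and the
# config path; the deepspeed launcher normally supplies these on the command line.
args = argparse.Namespace(local_rank=0, deepspeed_config="ds_config.json")

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)

# Training then runs through the engine, which handles the data-parallel
# communication and the ZeRO partitioning, schematically:
#   loss = loss_fn(model_engine(inputs), targets)
#   model_engine.backward(loss)
#   model_engine.step()
```

Under these assumptions, the same engine object is also what drives the data-parallel, model-parallel, and combined setups mentioned in the Scalability section above.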