https://arxiv.org/abs/2210.02969
- distillation done via KL divergence loss on **unlabeled data** (e.g. pseudo-labeling); see the sketch after this list
- q: would it perform even better if the smaller model were fine-tuned on the labeled data afterwards? would be interesting to check
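
A minimal sketch of this kind of distillation, assuming PyTorch; `teacher`, `student`, and `unlabeled_loader` are hypothetical names, and the temperature `T` is the usual (assumed) softening knob, not something specified in the note above:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions;
    # the T**2 factor keeps gradient scale comparable across temperatures.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T ** 2

# Pseudo-labeling loop: the frozen teacher's outputs act as soft labels,
# so no ground-truth targets are needed for this data.
# for x in unlabeled_loader:        # hypothetical DataLoader
#     with torch.no_grad():
#         t_logits = teacher(x)
#     loss = distillation_loss(student(x), t_logits)
#     loss.backward(); opt.step(); opt.zero_grad()
```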

## MPT-7B
https://www.mosaicml.com/blog/mpt-7b

- 7B-param model better than, or at least competitive with, LLaMA-7B
  - because it was trained on 1T tokens, like LLaMA
- Uses the GPT-NeoX-20B tokenizer (slightly better than the standard GPT-2 tokenizer)
  - set vocab size from 50,257 -> 50,432 (= 394 × 128, a multiple of 128) and improved MFU (model FLOPs utilization) by 4 percentage points
- Uses MosaicML StreamingDataset
- Uses ALiBi instead of positional embeddings (improves stability); sketch below
- Uses the Lion optimizer instead of AdamW; sketch below
  - more stable update magnitudes AND less optimizer state memory
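
A minimal sketch of the ALiBi bias from the note above, assuming the power-of-two slope schedule from the ALiBi paper; `n_heads` and `seq_len` are illustrative, and this is not MPT's actual implementation:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (the ALiBi paper's schedule for power-of-two head counts).
    ratio = 2.0 ** (-8.0 / n_heads)
    slopes = torch.tensor([ratio ** (i + 1) for i in range(n_heads)])
    # Relative position j - i (<= 0 for causal keys at or before the query),
    # so more distant tokens get a larger negative bias.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]               # (seq, seq)
    return slopes[:, None, None] * distance[None, :, :]  # (heads, seq, seq)

# Usage: add to raw attention scores before the causal mask/softmax,
# with no positional embeddings added to the token embeddings:
#   scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
#   scores = scores + alibi_bias(n_heads, seq_len)
```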
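
And a minimal sketch of the Lion update rule as described in the Lion paper, not MosaicML's actual training code; hyperparameters are illustrative. The `sign()` bounds every coordinate of the update at `lr` (the stability point), and only one momentum buffer is kept per parameter versus AdamW's two (the memory point):

```python
import torch

@torch.no_grad()
def lion_step(param, grad, momentum, lr=1e-4, betas=(0.9, 0.99), wd=0.1):
    beta1, beta2 = betas
    # Update direction: sign of an interpolation between the gradient
    # and the momentum buffer -> every coordinate moves by exactly lr,
    # which is the "more stable update magnitudes" point.
    update = (beta1 * momentum + (1 - beta1) * grad).sign()
    # Decoupled weight decay, as in AdamW.
    param.mul_(1 - lr * wd).add_(update, alpha=-lr)
    # A single momentum buffer per parameter (AdamW keeps two:
    # exp_avg and exp_avg_sq) -> roughly half the optimizer state memory.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)
```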