
[ENHANCEMENT] New MPT 30B + CUDA support. #1971

Closed
@casper-hansen

MosaicML released its MPT 30B model today with an 8k context window and an Apache 2.0 license.

Why you should support MPT 30B

Let me present my argument for why MPT should be supported, including CUDA support. LLaMa and Falcon models are great on paper and in evaluation, but what they really lack is commercial licensing (in the case of LLaMa) and an actively maintained tech stack (in the case of Falcon).

Tech stack:

  1. MosaicML has 8 employees actively contributing to their own open-source repo LLM-Foundry and a few more researching improvements. They recently upgraded to PyTorch 2.0 and added H100 support just before this 30B version was released.
  2. A streaming library: train and fine-tune models while streaming your dataset from S3/GCP/Azure storage. This reduces costs at training time and lets you easily resume after hardware failures.
  3. They have developed tools like Composer that let you train and fine-tune models much faster and cheaper (e.g. training GPT-2 for roughly $145 with Composer versus $255 with vanilla PyTorch).

Performance:

Evaluation: On generic benchmarks, the performance of LLaMa 33B, Falcon 40B, and MPT 30B is mostly the same. Although MPT 30B is the smallest of the three, it stays incredibly close, and the difference is negligible except on HumanEval, where MPT 30B (base) scores 25%, LLaMa 33B scores 20%, and Falcon 40B scores 1.2% (it did not generate usable code) in MPT's own tests.

Inference speed: MPT models are roughly 1.5-2.0x faster at inference than LLaMa models thanks to FlashAttention and low-precision LayerNorm.
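
For reference, this is roughly how that optimized attention path is selected today when running MPT through Hugging Face transformers (a minimal sketch following the mosaicml/mpt-30b model card; the config keys are assumptions taken from that card, not something this repo would need to copy):

```python
# Rough sketch based on the mosaicml/mpt-30b model card: select the triton
# FlashAttention kernel through the model config before loading.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

name = "mosaicml/mpt-30b"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config["attn_impl"] = "triton"  # FlashAttention via triton kernels
config.init_device = "cuda:0"               # materialize weights directly on the GPU

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,  # 16-bit weights
    trust_remote_code=True,
)
```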

Memory usage: MPT 30B fits on a single A100-80GB at 16-bit precision. Falcon 40B requires 85-100 GB of VRAM at 16 bits, which means it conventionally needs two GPUs unless it is quantized.
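
A quick back-of-the-envelope check of the weight memory alone (2 bytes per parameter at 16-bit precision, ignoring KV cache, activations, and runtime overhead; parameter counts are the headline figures):

```python
# VRAM needed for the weights alone at 16-bit precision (2 bytes/param).
def weight_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

print(f"MPT-30B:    {weight_gib(30):5.1f} GiB")    # ~55.9 GiB -> fits one A100-80GB
print(f"Falcon-40B: {weight_gib(40):5.1f} GiB")    # ~74.5 GiB -> tight even before overhead
print(f"LLaMa-33B:  {weight_gib(32.5):5.1f} GiB")  # ~60.5 GiB
```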

Cost:

Training LLaMa took roughly 1.44x and Falcon roughly 1.27x the compute used to train the full MPT model. This is remarkable because it means the MPT models achieve the same performance as more expensive models at a lower cost.

MPT-30B FLOPs    ~= 6 * 30e9 [params] * 1.05e12 [tokens] = 1.89e23 FLOPs
LLaMa-30B FLOPs  ~= 6 * 32.5e9 [params] * 1.4e12 [tokens] = 2.73e23 FLOPs (1.44x more)
Falcon-40B FLOPs ~= 6 * 40e9 [params] * 1e12 [tokens]     = 2.40e23 FLOPs (1.27x more)
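
These numbers use the standard approximation FLOPs ≈ 6 × parameters × training tokens; a small script reproduces the ratios:

```python
# Training-compute comparison using the common approximation
# FLOPs ~= 6 * parameters * training tokens.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

mpt    = train_flops(30e9,   1.05e12)  # ~1.89e23
llama  = train_flops(32.5e9, 1.4e12)   # ~2.73e23
falcon = train_flops(40e9,   1e12)     # ~2.40e23

print(f"LLaMa-30B / MPT-30B:  {llama / mpt:.2f}x")   # ~1.44x
print(f"Falcon-40B / MPT-30B: {falcon / mpt:.2f}x")  # ~1.27x
```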

Conclusion

If the community decides to support MPT models, including CUDA support, we gain the following benefits:

  1. The ability to train and fine-tune LLMs at a lower cost than LLaMa models, with commercial usage allowed, using llama.cpp/ggml for inference.
  2. Faster LLMs than LLaMa, and faster still once quantized with CUDA support enabled.
  3. A much larger default context size (8k vs 2k), plus the ability to extend the context window further using ALiBi (see the sketch below).
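
On point 3, a minimal sketch of what context extension via ALiBi looks like with the current Hugging Face checkpoint (the max_seq_len override follows the MPT model cards and is an assumption here, included only to illustrate why ALiBi matters):

```python
# Minimal sketch, following the MPT model cards: because MPT uses ALiBi
# positional biases instead of learned positional embeddings, the sequence
# length can be raised at load time by overriding the config.
from transformers import AutoConfig, AutoModelForCausalLM

name = "mosaicml/mpt-30b"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 16384  # extend beyond the default 8192 thanks to ALiBi

model = AutoModelForCausalLM.from_pretrained(name, config=config, trust_remote_code=True)
```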

Links

https://www.mosaicml.com/blog/mpt-30b
https://huggingface.co/mosaicml/mpt-30b
