
Quantized Training #554

Open

@msaroufim

Description

Inspired by a recent back-and-forth with @gau-nernst, we should add some quantized training recipes to AO for small models (~600M parameter range).

Character.ai recently shared that they're working on quantized training (https://research.character.ai/optimizing-inference/), where, per @stephenroller, they train models from scratch in int8 (https://x.com/stephenroller/status/1816636257717436779).

Historically we've invested more in QAT (quantization-aware training), which @andrewor14 has led; it's a technique to reduce the perplexity hit when we do an eventual post-training quantization.
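For context, a minimal sketch of the QAT idea, assuming symmetric per-tensor int8 fake quantization with a straight-through estimator; the `fake_quant` helper below is illustrative, not torchao's actual API:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize to an int grid and immediately dequantize, so the forward
    pass sees quantization error while weights stay in high precision."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: backward treats quantization as identity.
    return w + (w_q - w).detach()
```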

Quantized training, on the other hand, actually quantizes the model at training time, so memory savings are observed for both training and inference.
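As a rough illustration of the difference, here's a minimal sketch of one flavor of quantized training, where the weights live in int8 for the whole run and updates are applied with stochastic rounding so steps smaller than one quantization bin aren't always rounded away; the helper names are hypothetical, not torchao's API:

```python
import torch

def stochastic_round(x: torch.Tensor) -> torch.Tensor:
    # Round up with probability equal to the fractional part, so tiny
    # updates still move the weight in expectation.
    floor = x.floor()
    return floor + (torch.rand_like(x) < (x - floor)).float()

@torch.no_grad()
def int8_sgd_step(weight_int8: torch.Tensor, scale: torch.Tensor,
                  grad: torch.Tensor, lr: float) -> torch.Tensor:
    # Dequantize, take the step in fp32, re-quantize with stochastic rounding.
    w = weight_int8.float() * scale
    w -= lr * grad
    q = stochastic_round(w / scale).clamp_(-128, 127)
    return q.to(torch.int8)
```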

So when discussing quantized training there are a few aspects (see the sketch after this list):

  1. Weights: can be one of fp16, fp8, int8, int4, and below
  2. Activations: most likely limited to fp8 or fp16
  3. Optimizer state: can be one of fp32, fp16, bf16, fp8, int8, and below
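A hypothetical enumeration of that design space (the dtype names here are just strings for illustration, not a torchao config schema):

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class QuantizedTrainingConfig:
    weight_dtype: str      # fp16, fp8, int8, int4, ...
    activation_dtype: str  # fp8, fp16
    optimizer_dtype: str   # fp32, fp16, bf16, fp8, int8, ...

configs = [QuantizedTrainingConfig(w, a, o)
           for w, a, o in product(("fp16", "fp8", "int8", "int4"),
                                  ("fp8", "fp16"),
                                  ("fp32", "fp16", "bf16", "fp8", "int8"))]
print(len(configs))  # 40 combinations, each needing its own loss curves
```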

And if one were to ship this work, a bad combination can be caught at small scale (~600M parameter range), but a good idea needs to be continuously tested at larger scales (8B to 405B range), so each of these combinations will need loss curves.

When choosing the starting point, we could either pretrain a model using quantized training or just finetune one; as long as the loss curves match the fp16 baselines, we are good. We'd also, of course, need to validate that the memory savings are there and measure the speedups/slowdowns.
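A minimal sketch of the kind of validation harness this implies; `make_model` and `train_step` are hypothetical placeholders for whichever recipe we land on:

```python
import torch

def measure_run(make_model, train_step, steps: int = 100):
    # Track peak GPU memory alongside the loss curve for one configuration.
    torch.cuda.reset_peak_memory_stats()
    model = make_model().cuda()
    losses = [train_step(model) for _ in range(steps)]
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return losses, peak_gib

# Run once for the fp16 baseline and once for the quantized model, then
# overlay the loss curves and compare peak memory and step time.
```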

And while we can merge a lot of the dtype conversion code and a toy training loop in AO, what I'm more optimistic about is having an end-to-end training recipe in https://github.com/pytorch/torchtitan (@awgu) and an end-to-end finetuning recipe in https://github.com/pytorch/torchtune (@ebsmothers, @joecummings).
