A score-based Diffusion LM approach.
Using simple-diffusion-lm as a starting point, this repository attempts a score-based approach. There are some additional design changes; the biggest is a multi-level diffusion step - more details below.
Aside from PyTorch, the training script requires tqdm and two other packages: SentencePiece and RotaryEmbedding (rotary-embedding-torch). They can be installed with the following commands:

```
pip install sentencepiece
pip install rotary-embedding-torch
pip install tqdm
```
First, generate a .txt corpus where each line is an example.
It's recommended to apply some normalisation to the text so the data is clean for the next step and for training, e.g. lower-casing, converting numbers to words, and removing unnecessary symbols.
The training script won't perform these normalisations, so the data should be cleaned externally.
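As a rough sketch of the kind of external clean-up meant here (the rules, file names, and the num2words dependency are illustrative assumptions, not part of the training script):

```python
import re

from num2words import num2words  # assumed helper for converting digits to words


def normalise(line: str) -> str:
    """Lower-case, spell out integers, and strip unnecessary symbols."""
    line = line.lower()
    # Replace runs of digits with their spelled-out form, e.g. "42" -> "forty-two".
    line = re.sub(r"\d+", lambda m: num2words(int(m.group())), line)
    # Drop anything that isn't a letter, an apostrophe, basic punctuation, or whitespace.
    line = re.sub(r"[^a-z' .,?!-]+", " ", line)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", line).strip()


with open("raw.txt", encoding="utf-8") as fin, open("corpus.txt", "w", encoding="utf-8") as fout:
    for raw_line in fin:
        cleaned = normalise(raw_line)
        if cleaned:
            fout.write(cleaned + "\n")
```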
With a clean text corpus, the SentencePiece model can then be trained. Follow the guides on their repository or here on PyPI. If the text corpus is very large, creating a subset of the text can get around memory issues. Here is an excerpt from the script that created the BPE model:
```python
import sentencepiece as spm

# text_path: path to the cleaned .txt corpus; name: output model prefix; size: vocabulary size.
spm.SentencePieceTrainer.train(
    input=text_path,
    model_prefix=name,
    model_type='bpe',
    vocab_size=size,
    user_defined_symbols=[str(i) for i in range(10)],  # keep the digits 0-9 as individual tokens
    bos_id=0,
    eos_id=1,
    pad_id=2,
    unk_id=3
)
```
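Once trained, the resulting `.model` file is what gets passed to the training script. As a quick sanity check (the file name here is assumed from the `model_prefix` above):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # i.e. f"{name}.model"
print(sp.encode("an example sentence", out_type=str))    # sub-word pieces
print(sp.encode("an example sentence", out_type=int))    # token ids
```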
The model can be trained with the command:

```
python train.py -d=TXT_CORPUS -spm=SPM_MODEL -mdir=MODEL_DIRECTORY
```
There are a number of other arguments that can be altered, but the above is enough to get the model working.
Many of the architectural details mirror simple-diffusion-lm, so only the differences are listed here.
When training on variable-length sequences, there can often be redundant computation due to the padding of short sequences. To combat this, sufficiently short sequences are concatenated together. These concatenated sequences will not be longer than the maximum sequence length for that batch - this max length is capped during training. The attention mask is created in the dataloader such that the concatenated sequences do not attend to one another (see the sketch after the table below).
This increases the number of trainable tokens in each batch:
| Batch Size | Max Length | Packing | Tokens per Batch | Efficiency |
|---|---|---|---|---|
| 128 | 64 | No | 6820 | 83% |
| 128 | 64 | Yes | 7420 | 90% |
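As a minimal sketch of the packing idea (greedy concatenation plus a block-diagonal attention mask; the helpers below are illustrative, not the repository's dataloader):

```python
import torch


def pack_sequences(seqs, max_len):
    """Greedily concatenate token sequences into packs of at most max_len tokens.

    Assumes each individual sequence is already no longer than max_len.
    """
    packs, current = [], []
    for seq in seqs:
        if current and sum(len(s) for s in current) + len(seq) > max_len:
            packs.append(current)
            current = []
        current.append(seq)
    if current:
        packs.append(current)
    return packs


def block_diagonal_mask(pack, max_len):
    """True where attention is allowed: tokens only attend within their own sequence."""
    mask = torch.zeros(max_len, max_len, dtype=torch.bool)
    start = 0
    for seq in pack:
        end = start + len(seq)
        mask[start:end, start:end] = True
        start = end
    return mask  # padded positions beyond the packed length stay fully masked
```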
One of the main inspirations for this approach is AR-Diffusion. The rate at which words diffuse depends on their complexity; less informative words typically diffuse faster (see here, and here). In AR-Diffusion, a multi-level diffusion strategy is used where each index in the generated sequence can have its own diffusion time-step. This means the velocity of diffusion can be made faster for some tokens than for others. In their case, the earlier a token is in the sequence, the faster it diffuses, hence AR-Diffusion.
What if, instead of AR-Diffusion, there were other diffusion strategies that change the velocity in a way that's independent of sequence position? For example, it may be good to accelerate the diffusion of positions that consistently predict the same token with high probability. Such strategies would require the model to be robust to a variety of noise levels within a sequence. To accommodate this possibility, the training loss uses a random amount of scheduled noise for each embedding vector: each batch uses a 2D array of perturbation scaling, instead of the conventional 1D.
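A minimal sketch of the difference (the shapes and the noise schedule are placeholders, not the repository's actual schedule):

```python
import torch

batch, seq_len, dim = 8, 64, 512
x0 = torch.randn(batch, seq_len, dim)     # clean token embeddings

# Conventional: one diffusion time-step per sequence (1D), broadcast over positions.
t_1d = torch.rand(batch, 1)

# Here: an independent time-step per position (2D), one per embedding vector.
t_2d = torch.rand(batch, seq_len)

sigma = t_2d.unsqueeze(-1)                # placeholder schedule: sigma(t) = t
x_t = x0 + sigma * torch.randn_like(x0)   # per-position perturbation scaling
```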
In previous works, such as CDCD, conditional embeddings are also fed into the model with a corresponding conditional mask. As this implementation targets a model that can use multi-level diffusion, conditional masking is handled differently: any conditional positions have their diffusion step set to 0, i.e. no noise, during training. The model isn't explicitly told that those positions are conditional, other than that their diffusion has "completed".
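Continuing the sketch above, conditioning then amounts to zeroing the time-step at the conditional positions (the prompt length here is made up):

```python
# cond_mask: True where the token is given as conditioning, e.g. the first 8 positions.
cond_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
cond_mask[:, :8] = True

t_2d = t_2d.masked_fill(cond_mask, 0.0)   # diffusion treated as already "completed"
sigma = t_2d.unsqueeze(-1)
x_t = x0 + sigma * torch.randn_like(x0)   # conditional positions receive no noise
```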
Democratized Diffusion Language Model suggests different noise levels from those proposed in CDCD. This implementation adopts the same, lower noise levels.