Description
🧐 Problem Description
Fast-LLM lacks support for Llama 3.x models due to missing compatibility with Llama-3-style RoPE scaling. This prevents us from effectively training or using Llama 3.x checkpoints on long contexts.
To support Llama 3's full long-context pretraining (up to 128k tokens), Fast-LLM eventually needs to implement RoPE scaling. This includes handling scaling parameters such as `factor`, `low_freq_factor`, and `high_freq_factor`, which allow the positional embeddings to adapt to long sequences.
As an interim solution, Fast-LLM could ignore the `rope_scaling` dictionary and proceed with training for contexts up to 8k tokens, allowing basic Llama 3 compatibility without the need for immediate scaling support.
Example config for Llama 3.x models:
```json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  ...
  "rope_scaling": {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  ...
}
```
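As a rough sketch of the interim approach, the config import path could simply drop the `rope_scaling` block and cap the usable context at the pre-scaling window. The helper below is hypothetical (the function name and field handling are illustrative, not Fast-LLM's actual import code):

```python
# Hypothetical sketch: ignore rope_scaling when importing a Llama 3.x config.
# Function and field names are illustrative, not Fast-LLM's actual API.
import json


def strip_rope_scaling(hf_config_path: str) -> dict:
    """Load a Hugging Face config dict and drop the rope_scaling block.

    The model then falls back to plain RoPE, which is valid for contexts
    up to original_max_position_embeddings (8192 for Llama 3.x).
    """
    with open(hf_config_path) as f:
        config = json.load(f)

    rope_scaling = config.pop("rope_scaling", None)
    if rope_scaling is not None:
        # Clamp the usable context to the pre-scaling window so training
        # never sees positions the unscaled RoPE was not built for.
        original = rope_scaling.get("original_max_position_embeddings", 8192)
        config["max_position_embeddings"] = min(
            config.get("max_position_embeddings", original), original
        )
    return config
```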
💡 Proposed Solution
- Interim Solution: Ignore the `rope_scaling` dictionary in the configuration and proceed with standard 8k token contexts for immediate compatibility with Llama 3.x. This enables Llama 3 models to be trained with the default positional embeddings.
- Long-Term Solution: Implement full Llama-3-style RoPE scaling in Fast-LLM to support training and inference with extended context windows of up to 128k tokens. This would involve adding support for the `rope_scaling` parameters and applying them consistently across training and inference; see the sketch after this list.
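For the long-term solution, the scaling itself is a per-frequency adjustment of the standard RoPE inverse frequencies: low-frequency (long-wavelength) components are divided by `factor`, high-frequency components are left unchanged, and a band in between is smoothly interpolated. The sketch below follows the `"llama3"` rope_type as found in common reference implementations; it is illustrative rather than a drop-in Fast-LLM implementation, and exact boundary handling may differ.

```python
import math

import torch


def llama3_scale_inv_freq(
    inv_freq: torch.Tensor,
    factor: float = 8.0,
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    original_max_position_embeddings: int = 8192,
) -> torch.Tensor:
    """Apply Llama-3-style scaling to RoPE inverse frequencies."""
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    wavelen = 2 * math.pi / inv_freq

    # Long wavelengths (low frequencies): stretch the rotation by `factor`.
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)

    # Medium wavelengths: interpolate smoothly between scaled and unscaled,
    # so the two regimes join without a discontinuity.
    smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    smoothed = (1 - smooth) * inv_freq / factor + smooth * inv_freq
    is_medium = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(is_medium, smoothed, scaled)
```

For example, with the Llama 3.1 defaults (`rope_theta = 500000`, head dimension 128), `inv_freq = 1.0 / (500000.0 ** (torch.arange(0, 128, 2, dtype=torch.float32) / 128))` would be passed through this function before building the rotary cos/sin caches, for both training and inference.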
🔄 Alternatives Considered
Not supporting Llama 3's long-context capabilities would limit Fast-LLM’s compatibility with these models. While ignoring the RoPE scaling parameters enables immediate training on up to 8k tokens, adding full RoPE scaling support is necessary for training on longer contexts.
📈 Potential Benefits
- Immediate Compatibility: The interim solution provides basic compatibility with Llama 3.x, allowing Fast-LLM to train models with up to 8k context lengths.
- Future-Ready: Implementing full RoPE scaling will enable Fast-LLM to support long-context adaptation, making it suitable for tasks requiring large context windows.
- Critical for StarDoc: This is especially important for projects like StarDoc, which depend on Llama 3.x support and effective handling of extended contexts.
📝 Additional Context
For reference, the RoPE scaling mechanism is implemented here: