
[feat] Llama 3.x rope scaling support #39

Closed
@tscholak

Description

🧐 Problem Description

Fast-LLM lacks support for Llama 3.x models due to missing compatibility with Llama-3-style RoPE scaling. This prevents us from effectively training or using Llama 3.x checkpoints on long contexts.

To support Llama 3's full long-context pretraining (up to 128k tokens), Fast-LLM will eventually need to implement Llama-3-style RoPE scaling. This includes handling the scaling parameters factor, low_freq_factor, high_freq_factor, and original_max_position_embeddings, which adjust the rotary positional embeddings so they adapt to long sequences.

As an interim solution, Fast-LLM could ignore the rope_scaling dictionary and proceed with training for contexts up to 8k tokens, allowing basic Llama 3 compatibility without the need for immediate scaling support.

Example config for Llama 3.x models:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  ...
  "rope_scaling": {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  ...
}
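
The scaling rule adjusts the RoPE inverse frequencies per frequency band: long-wavelength components are divided by factor, short-wavelength components are left untouched, and the band in between is interpolated smoothly. A rough Python sketch of this behavior (the function name is illustrative; the logic follows the transformers implementation linked under Additional Context):

import math
import torch

def llama3_scaled_inv_freq(
    inv_freq: torch.Tensor,
    factor: float = 8.0,
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    original_max_position_embeddings: int = 8192,
) -> torch.Tensor:
    """Apply Llama-3-style scaling to the RoPE inverse frequencies."""
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    # Wavelength (in positions) of each frequency component.
    wavelen = 2 * math.pi / inv_freq

    # Long wavelengths (low frequencies) are slowed down by `factor`;
    # short wavelengths (high frequencies) are kept as-is.
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)

    # Medium wavelengths are interpolated smoothly between the two regimes.
    smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    smoothed = (1 - smooth) / factor * inv_freq + smooth * inv_freq
    is_medium = (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen)
    return torch.where(is_medium, smoothed, scaled)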

💡 Proposed Solution

  1. Interim Solution: Ignore the rope_scaling dictionary in the configuration and proceed with standard 8k-token contexts for immediate compatibility with Llama 3.x. This enables Llama 3 models to be trained with the default positional embeddings (see the sketch after this list).

  2. Long-Term Solution: Implement full Llama-3-style RoPE scaling in Fast-LLM to support training and inference with extended context windows up to 128k tokens. This would involve adding support for the rope_scaling parameters and applying these consistently across training and inference.
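
A minimal sketch of the interim behavior, assuming a hypothetical hook in the Hugging Face checkpoint converter (function and parameter names are illustrative, not existing Fast-LLM APIs):

import logging

logger = logging.getLogger(__name__)

def check_rope_scaling(hf_config: dict) -> None:
    # Hypothetical converter hook: detect Llama-3-style rope_scaling and skip it.
    rope_scaling = hf_config.get("rope_scaling")
    if rope_scaling is None:
        return
    if rope_scaling.get("rope_type") == "llama3":
        max_len = rope_scaling.get("original_max_position_embeddings", 8192)
        logger.warning(
            "Ignoring Llama-3-style rope_scaling; training sequences should stay "
            "within the original context length of %d tokens.",
            max_len,
        )
    else:
        raise ValueError(f"Unsupported rope_scaling: {rope_scaling}")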

🔄 Alternatives Considered

Not supporting Llama 3's long-context capabilities would limit Fast-LLM’s compatibility with these models. While ignoring the RoPE scaling parameters enables immediate training on up to 8k tokens, adding full RoPE scaling support is necessary for training on longer contexts.

📈 Potential Benefits

  • Immediate Compatibility: The interim solution provides basic compatibility with Llama 3.x, allowing Fast-LLM to train models with up to 8k context lengths.
  • Future-Ready: Implementing full RoPE scaling will enable Fast-LLM to support long-context adaptation, making it suitable for tasks requiring large context windows.
  • Critical for StarDoc: This is especially important for projects like StarDoc, which depend on Llama 3.x support and effective handling of extended contexts.

📝 Additional Context

For reference, the RoPE scaling mechanism is implemented here:

https://github.com/huggingface/transformers/blob/3ea3ab62d80d91f9bdd16bd3cacd8133fb0d4566/src/transformers/modeling_rope_utils.py#L310-L350

Metadata

Labels: enhancement (New feature or request)