
[feat] Llama 3.x rope scaling support #39

Closed
@tscholak

Description

🧐 Problem Description

Fast-LLM lacks support for Llama 3.x models due to missing compatibility with Llama-3-style RoPE scaling. This prevents us from effectively training or using Llama 3.x checkpoints on long contexts.

To support Llama 3's full long-context pretraining (up to 128k tokens), Fast-LLM will eventually need to implement Llama-3-style RoPE scaling. This includes handling the scaling parameters factor, low_freq_factor, high_freq_factor, and original_max_position_embeddings, which adjust the rotary positional embeddings so they adapt to long sequences.

As an interim solution, Fast-LLM could ignore the rope_scaling dictionary and proceed with training for contexts up to 8k tokens, allowing basic Llama 3 compatibility without the need for immediate scaling support.

Example config for Llama 3.x models:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  ...
  "rope_scaling": {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  ...
}
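
The scaling rule adjusts the RoPE inverse frequencies per frequency band: long-wavelength components are divided by factor, short-wavelength components are left untouched, and the band in between is interpolated smoothly. A rough Python sketch of this behavior (the function name is illustrative; the logic follows the transformers implementation linked under Additional Context):

import math
import torch

def llama3_scaled_inv_freq(
    inv_freq: torch.Tensor,
    factor: float = 8.0,
    low_freq_factor: float = 1.0,
    high_freq_factor: float = 4.0,
    original_max_position_embeddings: int = 8192,
) -> torch.Tensor:
    """Apply Llama-3-style scaling to the RoPE inverse frequencies."""
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    # Wavelength (in positions) of each frequency component.
    wavelen = 2 * math.pi / inv_freq

    # Long wavelengths (low frequencies) are slowed down by `factor`;
    # short wavelengths (high frequencies) are kept as-is.
    scaled = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)

    # Medium wavelengths are interpolated smoothly between the two regimes.
    smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    smoothed = (1 - smooth) / factor * inv_freq + smooth * inv_freq
    is_medium = (wavelen >= high_freq_wavelen) & (wavelen <= low_freq_wavelen)
    return torch.where(is_medium, smoothed, scaled)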

💡 Proposed Solution

  1. Interim Solution: Ignore the rope_scaling dictionary in the configuration and proceed with standard 8k-token contexts for immediate compatibility with Llama 3.x. This enables Llama 3 models to be trained with the default positional embeddings (see the sketch after this list).

  2. Long-Term Solution: Implement full Llama-3-style RoPE scaling in Fast-LLM to support training and inference with extended context windows up to 128k tokens. This would involve adding support for the rope_scaling parameters and applying these consistently across training and inference.
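
A minimal sketch of the interim behavior, assuming a hypothetical hook in the Hugging Face checkpoint converter (function and parameter names are illustrative, not existing Fast-LLM APIs):

import logging

logger = logging.getLogger(__name__)

def check_rope_scaling(hf_config: dict) -> None:
    # Hypothetical converter hook: detect Llama-3-style rope_scaling and skip it.
    rope_scaling = hf_config.get("rope_scaling")
    if rope_scaling is None:
        return
    if rope_scaling.get("rope_type") == "llama3":
        max_len = rope_scaling.get("original_max_position_embeddings", 8192)
        logger.warning(
            "Ignoring Llama-3-style rope_scaling; training sequences should stay "
            "within the original context length of %d tokens.",
            max_len,
        )
    else:
        raise ValueError(f"Unsupported rope_scaling: {rope_scaling}")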

🔄 Alternatives Considered

Not supporting Llama 3's long-context capabilities would limit Fast-LLM’s compatibility with these models. While ignoring the RoPE scaling parameters enables immediate training on up to 8k tokens, adding full RoPE scaling support is necessary for training on longer contexts.

📈 Potential Benefits

  • Immediate Compatibility: The interim solution provides basic compatibility with Llama 3.x, allowing Fast-LLM to train models with up to 8k context lengths.
  • Future-Ready: Implementing full RoPE scaling will enable Fast-LLM to support long-context adaptation, making it suitable for tasks requiring large context windows.
  • Critical for StarDoc: This is especially important for projects like StarDoc, which depend on Llama 3.x support and effective handling of extended contexts.

📝 Additional Context

For reference, the RoPE scaling mechanism is implemented here:

https://github.com/huggingface/transformers/blob/3ea3ab62d80d91f9bdd16bd3cacd8133fb0d4566/src/transformers/modeling_rope_utils.py#L310-L350

Metadata

Labels: enhancement (New feature or request)