## 🎯 Goal (What & Why)
Enable fully modular, per-block configuration in Fast-LLM to follow up on hybrid architecture support introduced in #194.
Currently, hybrid models (e.g., interleaving Mamba 1, Mamba 2, and Transformer blocks) are limited by global block-type configurations: all transformer blocks share one config, and all SSM blocks another. This is too rigid.
We want to:
- Allow different configurations per block, even for the same type.
- Support named blocks with configurable weight sharing.
- Enable expressive, fine-grained architectures, useful for:
  - Exploring different attention mechanisms in a single model.
  - Tying weights across repeated block instances.
  - Designing sparse, pruned, or ablation-based stacks.
  - Preparing for model export pipelines with heterogeneous block stacks.
This would eliminate the current one-size-fits-all limitation and make model stacks in Fast-LLM truly composable and expressive.
## 🚀 Execution Plan
This is a config and model-construction feature. The main change is replacing the global `transformer` and `ssm` sections with a new per-block format.
### Key Ideas
- Add `model.blocks`: a dict of named block configs (e.g., `alice`, `bob`, `claire`, `potato`, etc.; it doesn't matter what they are called, see the examples below).
- Add `block_pattern`: a list specifying the block sequence by name.
- Add `num_layers`: total depth of the model. The pattern repeats to reach this.
- Allow block-level options like:
  - `kind: transformer | ssm | ...`
  - `attention_type`, `sliding_window`, `dt_rank`, etc.
  - `shared_weights: true` for parameter sharing
  - `lora: ...`
- Blocks reused by name will share configuration; if `shared_weights: true`, they'll also reuse parameters.
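
As a rough illustration of the shape this schema could take (a sketch only; the class and field names below are assumptions, not Fast-LLM's actual config classes):

```python
# Hypothetical schema sketch -- names and structure are illustrative, not Fast-LLM's real config API.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BlockConfig:
    kind: str                          # "transformer" | "ssm" | ...
    shared_weights: bool = False       # reuse one set of parameters for every occurrence of this block
    options: dict = field(default_factory=dict)  # kind-specific options: attention_type, dt_rank, lora, ...


@dataclass
class ModelConfig:
    blocks: dict[str, BlockConfig]     # named block configs, e.g. {"alice": ..., "bob": ...}
    block_pattern: list[str]           # block names in order; repeated until num_layers is reached
    num_layers: int
```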
### Minimal Implementation Path
- Define the new schema and validate it (e.g., every pattern entry must resolve to a block).
- Update model construction to instantiate blocks from `model.blocks`, repeating the pattern to reach `num_layers`.
- Add weight-sharing logic: instantiate shared blocks once, reuse parameters across layers.
- Add support for block-level LoRA injection.
- Maintain backwards compatibility: for existing models, fall back to the current global `transformer`/`ssm` layout if `model.blocks` is absent. Save new checkpoints using the new format.
- Extend test coverage:
  - Stacks with different transformer configs
  - Mixed MQA/GQA/sliding-window blocks
  - Interleaved SSM and transformer blocks
  - Shared and unshared weights
- Update documentation with examples and a migration guide.
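
A minimal sketch of that construction path, reusing the hypothetical `ModelConfig`/`BlockConfig` above (the `build_block` factory is a placeholder, not Fast-LLM's real layer builder):

```python
# Illustrative only: pattern validation, expansion to num_layers, and weight sharing.
import itertools


def build_block(block_cfg: BlockConfig):
    """Placeholder for the real layer factory (would return an nn.Module in practice)."""
    return object()


def expand_pattern(config: ModelConfig) -> list[str]:
    unknown = set(config.block_pattern) - set(config.blocks)
    if unknown:
        raise ValueError(f"block_pattern references undefined blocks: {sorted(unknown)}")
    # Repeat the pattern until num_layers block names have been produced.
    return list(itertools.islice(itertools.cycle(config.block_pattern), config.num_layers))


def build_stack(config: ModelConfig) -> list:
    shared_instances: dict[str, object] = {}
    layers = []
    for name in expand_pattern(config):
        block_cfg = config.blocks[name]
        if block_cfg.shared_weights:
            # Shared blocks are instantiated once; later occurrences reuse the same module and parameters.
            shared_instances.setdefault(name, build_block(block_cfg))
            layers.append(shared_instances[name])
        else:
            layers.append(build_block(block_cfg))
    return layers
```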
### Example Config: One block
```yaml
model:
  blocks:
    default_transformer:
      kind: transformer
      attention_type: mqa
      use_flash_attention: true
      num_heads: 16
      hidden_size: 4096
  block_pattern: ["default_transformer"]
  num_layers: 48
```
### Example Config: Many blocks
```yaml
model:
  blocks:
    alice:
      kind: transformer
      attention_type: mqa
      sliding_window: false
    bob:
      kind: transformer
      attention_type: gqa
      sliding_window: true
      shared_weights: true
    claire:
      kind: ssm
      variant: mamba1
      dt_rank: auto
    dave:
      kind: ssm
      variant: discrete_mamba2
      state_size: 16
  block_pattern: ["alice", "bob", "claire", "dave", "bob"]
  num_layers: 15
```
Here:

- The pattern repeats 3 times in the 15 layers of the model.
- `bob` appears 6 times, but defines weights once (shared).
- Each block can be configured independently.
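
For illustration, expanding this pattern by hand confirms the counts above (plain Python, no Fast-LLM code involved):

```python
pattern = ["alice", "bob", "claire", "dave", "bob"]
num_layers = 15

# Repeat the pattern and truncate to the requested depth.
expanded = (pattern * (num_layers // len(pattern) + 1))[:num_layers]
assert len(expanded) == 15
assert expanded.count("bob") == 6   # six layers, one shared set of weights if shared_weights: true
```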
## 📌 Acceptance Criteria
- `model.blocks` is supported with flexible per-block config.
- `block_pattern` resolves correctly and builds a full stack of layers.
- Shared weights reduce parameter count where `shared_weights: true` is set.
- Legacy config format (`transformer`, `ssm`) remains supported, with a deprecation warning, for the time being.
- Unit tests validate:
  - Per-block config behaviour
  - Mixed block types
  - Shared vs non-shared blocks
- Documentation updated with clear example configs and usage patterns.
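
One possible shape for the legacy fallback check (purely illustrative; the function names and warning text are assumptions, not existing Fast-LLM code):

```python
import warnings


def uses_legacy_layout(model_cfg: dict) -> bool:
    """True when a config still relies on the global `transformer`/`ssm` sections."""
    return "blocks" not in model_cfg and ("transformer" in model_cfg or "ssm" in model_cfg)


def warn_if_legacy(model_cfg: dict) -> None:
    if uses_legacy_layout(model_cfg):
        warnings.warn(
            "Global `transformer`/`ssm` sections are deprecated; please migrate to `model.blocks`.",
            DeprecationWarning,
        )
```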
## 🛠️ Project Management
- Assign the issue to the Fast-LLM project.
- Set the `Estimate` field (in days).
- Use the `Size` field to categorize the PR size (Large).
- Assign an owner when opening the issue.