## 🎯 Goal (What & Why)
Enable fully modular, per-block configuration in Fast-LLM to follow up on hybrid architecture support introduced in #194.
Currently, hybrid models (e.g., interleaving Mamba 1, Mamba 2, and Transformer blocks) are limited by global block-type configurations: all transformer blocks share one config, and all SSM blocks another. This is too rigid.
We want to:
- Allow different configurations per block, even for the same type.
- Support named blocks with configurable weight sharing.
- Enable expressive, fine-grained architectures, useful for:
  - Exploring different attention mechanisms in a single model.
  - Tying weights across repeated block instances.
  - Designing sparse, pruned, or ablation-based stacks.
  - Preparing for model export pipelines with heterogeneous block stacks.
This would eliminate the current one-size-fits-all limitation and make model stacks in Fast-LLM truly composable and expressive.
## 🚀 Execution Plan
This is a config and model-construction feature. The main change is replacing the global `transformer` and `ssm` sections with a new per-block format.
### Key Ideas
- Add `model.blocks`: a dict of named block configs (e.g., `alice`, `bob`, `claire`, `potato`, etc.; it doesn't matter what they are called, see the examples below).
- Add `block_pattern`: a list specifying the block sequence by name.
- Add `num_layers`: total depth of the model. The pattern repeats to reach this.
- Allow block-level options like:
  - `kind: transformer | ssm | ...`
  - `attention_type`, `sliding_window`, `dt_rank`, etc.
  - `shared_weights: true` for parameter sharing
  - `lora: ...`
- Blocks reused by name will share configuration; if `shared_weights: true`, they'll also reuse parameters.
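
As a rough illustration of the shape this schema could take (a sketch only; the class and field names below are assumptions, not Fast-LLM's actual config classes):

```python
# Hypothetical schema sketch -- names and structure are illustrative, not Fast-LLM's real config API.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BlockConfig:
    kind: str                          # "transformer" | "ssm" | ...
    shared_weights: bool = False       # reuse one set of parameters for every occurrence of this block
    options: dict = field(default_factory=dict)  # kind-specific options: attention_type, dt_rank, lora, ...


@dataclass
class ModelConfig:
    blocks: dict[str, BlockConfig]     # named block configs, e.g. {"alice": ..., "bob": ...}
    block_pattern: list[str]           # block names in order; repeated until num_layers is reached
    num_layers: int
```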
### Minimal Implementation Path
- Define the new schema and validate it (e.g., every pattern entry must resolve to a block).
- Update model construction to instantiate blocks from `model.blocks`, repeating the pattern to reach `num_layers`.
- Add weight-sharing logic: instantiate shared blocks once, reuse parameters across layers.
- Add support for block-level LoRA injection.
- Maintain backwards compatibility: for existing models, fall back to the current global `transformer`/`ssm` layout if `model.blocks` is absent. Save new checkpoints using the new format.
- Extend test coverage:
  - Stacks with different transformer configs
  - Mixed MQA/GQA/sliding-window blocks
  - Interleaved SSM and transformer blocks
  - Shared and unshared weights
- Update documentation with examples and a migration guide.
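
A minimal sketch of that construction path, reusing the hypothetical `ModelConfig`/`BlockConfig` above (the `build_block` factory is a placeholder, not Fast-LLM's real layer builder):

```python
# Illustrative only: pattern validation, expansion to num_layers, and weight sharing.
import itertools


def build_block(block_cfg: BlockConfig):
    """Placeholder for the real layer factory (would return an nn.Module in practice)."""
    return object()


def expand_pattern(config: ModelConfig) -> list[str]:
    unknown = set(config.block_pattern) - set(config.blocks)
    if unknown:
        raise ValueError(f"block_pattern references undefined blocks: {sorted(unknown)}")
    # Repeat the pattern until num_layers block names have been produced.
    return list(itertools.islice(itertools.cycle(config.block_pattern), config.num_layers))


def build_stack(config: ModelConfig) -> list:
    shared_instances: dict[str, object] = {}
    layers = []
    for name in expand_pattern(config):
        block_cfg = config.blocks[name]
        if block_cfg.shared_weights:
            # Shared blocks are instantiated once; later occurrences reuse the same module and parameters.
            shared_instances.setdefault(name, build_block(block_cfg))
            layers.append(shared_instances[name])
        else:
            layers.append(build_block(block_cfg))
    return layers
```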
### Example Config: One block
```yaml
model:
  blocks:
    default_transformer:
      kind: transformer
      attention_type: mqa
      use_flash_attention: true
      num_heads: 16
      hidden_size: 4096
  block_pattern: ["default_transformer"]
  num_layers: 48
```
### Example Config: Many blocks
```yaml
model:
  blocks:
    alice:
      kind: transformer
      attention_type: mqa
      sliding_window: false
    bob:
      kind: transformer
      attention_type: gqa
      sliding_window: true
      shared_weights: true
    claire:
      kind: ssm
      variant: mamba1
      dt_rank: auto
    dave:
      kind: ssm
      variant: discrete_mamba2
      state_size: 16
  block_pattern: ["alice", "bob", "claire", "dave", "bob"]
  num_layers: 15
```
Here:

- The pattern repeats 3 times in the 15 layers of the model.
- `bob` appears 6 times, but defines weights once (shared).
- Each block can be configured independently.
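
For illustration, expanding this pattern by hand confirms the counts above (plain Python, no Fast-LLM code involved):

```python
pattern = ["alice", "bob", "claire", "dave", "bob"]
num_layers = 15

# Repeat the pattern and truncate to the requested depth.
expanded = (pattern * (num_layers // len(pattern) + 1))[:num_layers]
assert len(expanded) == 15
assert expanded.count("bob") == 6   # six layers, one shared set of weights if shared_weights: true
```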
## 📌 Acceptance Criteria
- `model.blocks` is supported with flexible per-block config.
- `block_pattern` resolves correctly and builds a full stack of layers.
- Shared weights reduce parameter count where `shared_weights: true` is set.
- Legacy config format (`transformer`, `ssm`) remains supported, with a deprecation warning, for the time being.
- Unit tests validate:
  - Per-block config behaviour
  - Mixed block types
  - Shared vs non-shared blocks
- Documentation updated with clear example configs and usage patterns.
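
One possible shape for the legacy fallback check (purely illustrative; the function names and warning text are assumptions, not existing Fast-LLM code):

```python
import warnings


def uses_legacy_layout(model_cfg: dict) -> bool:
    """True when a config still relies on the global `transformer`/`ssm` sections."""
    return "blocks" not in model_cfg and ("transformer" in model_cfg or "ssm" in model_cfg)


def warn_if_legacy(model_cfg: dict) -> None:
    if uses_legacy_layout(model_cfg):
        warnings.warn(
            "Global `transformer`/`ssm` sections are deprecated; please migrate to `model.blocks`.",
            DeprecationWarning,
        )
```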
## 🛠️ Project Management
- Assign the issue to the Fast-LLM project.
- Set the `Estimate` field (in days).
- Use the `Size` field to categorize the PR size (Large).
- Assign an owner when opening the issue.