
Support block-modular architecture #277


Draft · wants to merge 158 commits into main
Conversation

@oleksost (Contributor) commented on May 29, 2025

✨ Description

This draft PR addresses #242 by introducing a flexible, modular configuration system for hybrid model architectures.

TODOs:

  • add more testing to make sure legacy behaviour is well supported
  • implement weight sharing
  • support block-specific learning rate scales
  • make sure model serialisation/conversion works as expected
  • review and unify naming conventions (block, layer) across the codebase
  • clean up and test
An example hybrid model configuration:

```yaml
model:
  base_model:
    cross_entropy_impl: fused
    blocks:
      bob:
        type: transformer
        hidden_size: 512
        share_weights: true

      mamba:
        type: discrete_mamba2
        state_size: 16
        expansion_factor: 2
        hidden_size: 512

    hybrid_block_layout: ["bob", "mamba", "mamba", "bob"]
    num_layers: 4
```

This results in the block layout ["bob", "mamba_1", "mamba_2", "bob"], where the bob blocks share weights and the mamba blocks do not.
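
For illustration, a minimal sketch of this expansion logic; the helper name expand_layout and the dict-based block specs below are hypothetical, not this PR's actual API:

```python
# Hypothetical sketch of layout expansion with weight sharing; the helper
# name and dict-based block specs are illustrative, not this PR's API.
def expand_layout(blocks: dict[str, dict], layout: list[str]) -> list[str]:
    counters: dict[str, int] = {}
    expanded = []
    for name in layout:
        if blocks[name].get("share_weights", False):
            # Shared blocks keep a single name, hence a single set of weights.
            expanded.append(name)
        else:
            # Independent blocks get a numbered suffix and their own weights.
            counters[name] = counters.get(name, 0) + 1
            expanded.append(f"{name}_{counters[name]}")
    return expanded


blocks = {
    "bob": {"type": "transformer", "share_weights": True},
    "mamba": {"type": "discrete_mamba2"},
}
print(expand_layout(blocks, ["bob", "mamba", "mamba", "bob"]))
# ['bob', 'mamba_1', 'mamba_2', 'bob']
```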

🔍 Type of change

Class hierarchy in the config system:

  • started moving functionality specific to BaseBlock into BaseBlockConfig in layers/common
  • transformer and SSM layer configs inherit from BaseBlockConfig, each holding the functionality specific to its dedicated block (TransformerLayer, LlambaBlock); a simplified sketch follows below
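
A simplified sketch of that hierarchy, using plain dataclasses and invented field names for brevity; the real configs are built with Fast-LLM's @config_class()/Field machinery:

```python
# Simplified, hypothetical view of the config hierarchy; field names are
# invented and the real configs use @config_class()/Field, not dataclasses.
import dataclasses


@dataclasses.dataclass
class BaseBlockConfig:
    # Hyperparameters shared by every block type (layers/common).
    hidden_size: int = 512
    lr_scale: float | None = None
    share_weights: bool = False


@dataclasses.dataclass
class TransformerBlockConfig(BaseBlockConfig):
    # Settings specific to TransformerLayer blocks.
    num_attention_heads: int = 8


@dataclasses.dataclass
class SSMBlockConfig(BaseBlockConfig):
    # Settings specific to SSM (LlambaBlock) blocks.
    state_size: int = 16
    expansion_factor: int = 2
```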

Block-specific hyperparameters & tensor space definition:

  • HybridBlockConfigs implemented under models/hybrid/config, allowing block-specific hyperparameter definitions
  • the names of the elements in the tensor space now include block suffixes; no suffixes are used for non-hybrid GPT models (see the sketch after this list)
  • legacy behaviour is still supported, both for blocks defined with lists like [t,m2d,m] and for non-hybrid GPT models
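
To make the suffix convention concrete, a tiny hypothetical example of how dimension names could be derived; the actual TensorSpace/TensorDim API may differ:

```python
# Hypothetical illustration of block-suffixed tensor-space names; not the
# actual TensorSpace/TensorDim API.
def tensor_dim_name(base: str, block_name: str | None = None) -> str:
    # Hybrid models suffix dimension names per block; plain GPT models do not.
    return base if block_name is None else f"{base}_{block_name}"


print(tensor_dim_name("hidden"))           # 'hidden'        (non-hybrid GPT)
print(tensor_dim_name("hidden", "mamba"))  # 'hidden_mamba'  (hybrid block)
```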

Layer freezing:

  • with PEFT, layer freezing must be explicit: if LoRA is used, it does not automatically freeze the other layers, and lr_scale values must be used instead
  • there is a per-block lr_scale as well as component-specific scales such as norm_lr_scale, mlp_lr_scale, etc. If both are passed, the resulting scale for a component is the block's lr_scale multiplied by the component-specific scale (see the get_lr_scale function and the sketch after this list)
  • for the non-hybrid GPT model, lr_scale should not be used, as it would apply to all layers, since all layers share the same config
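
A hedged sketch of the combination rule described above; the real get_lr_scale in this PR may have a different signature and handle more cases:

```python
# Sketch of the lr-scale combination rule; the real get_lr_scale may differ.
def get_lr_scale(block_lr_scale: float | None, component_lr_scale: float | None) -> float:
    # Missing scales default to 1.0; when both are given they multiply.
    scale = 1.0
    if block_lr_scale is not None:
        scale *= block_lr_scale
    if component_lr_scale is not None:
        scale *= component_lr_scale
    return scale


print(get_lr_scale(0.5, None))  # 0.5  (block-level scale only)
print(get_lr_scale(0.5, 0.1))   # 0.05 (block scale x mlp_lr_scale, for example)
print(get_lr_scale(0.0, None))  # 0.0  (effectively freezes the block)
```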

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

@oleksost requested a review from nandahkrishna on May 29, 2025, 12:30
@oleksost requested a review from jlamypoirier on June 10, 2025, 00:58
@jlamypoirier (Collaborator) left a comment


I had a quick look, will go deeper once updated with main.

```
@@ -380,8 +380,8 @@ def validate[T: Config](self: T, *, _is_validating: bool = False) -> T:

        if expected_class is not None:
            # Should be handled in `from_dict`, but can fail if instantiating directly.
            Assert.is_(self.__class__, expected_class)

            # TODO: is this ok? i.e. we want the assigned class to be a subclass of the expected class, not necessarily exactly the same class.
```
Collaborator: No, this is handled in `from_dict`. The expected class is not the same as the type hint.



```python
@config_class()
class BaseBlockConfig(BaseModelConfig):
```
Collaborator: This doesn't really belong in common. Maybe a base_block submodule?

```
@@ -41,10 +42,10 @@ class LanguageModelKwargs:

@config_class()
class LanguageModelBaseConfig(BaseModelConfig):
    transformer: TransformerConfig = Field(
```
Collaborator: Where did this go?

```python
        hint=FieldHint.feature,
        valid=check_field(Assert.geq, 0),
    )
    head_normalization: NormalizationConfig | None = Field(
```
Collaborator: Implicit convention: put sub-configs on top. I don't think we want None in the type hint, since it's not a valid value after validation.

```python
        if self.embeddings_hidden_dropout is None:
            self.embeddings_hidden_dropout = 0.0
        if self.head_normalization is None:
            self.head_normalization = NormalizationConfig()
```
Collaborator: I'd rather keep the transformer normalization as the default.

```
@@ -0,0 +1,55 @@
import typing
```
Collaborator: Rename the file to block for consistency.
