Take 2: use Hydra to build xformer model #93

Merged
jieru-hu merged 8 commits into main from hydra-2 on Dec 1, 2021

Conversation

jieru-hu (Contributor)

Addresses the feedback in #59.

At a high level:

  1. Move the Hydra dependency to be optional (see the sketch below).
  2. Refactor the model factory, mainly removing StackConfig to simplify the config.
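
For item 1, a common way to make a dependency optional is to guard the import and only fail when the Hydra-specific helper is actually used. A rough sketch of that pattern (function name and error message are illustrative, not necessarily what this PR does):

# hypothetical sketch of an optional-dependency guard
try:
    from hydra.core.config_store import ConfigStore
    _HYDRA_AVAILABLE = True
except ImportError:
    _HYDRA_AVAILABLE = False


def register_xformer_schemas() -> None:
    """Register the factory config schemas with Hydra, if hydra-core is installed."""
    if not _HYDRA_AVAILABLE:
        raise ImportError(
            "hydra-core>=1.1 is needed for the config-driven factory helpers"
        )
    cs = ConfigStore.instance()
    # schemas would be registered here, e.g.:
    # cs.store(group="stack", name="base_encoder", node=xFormerEncoderConfig)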

See the final config:

python examples/build_model/my_model.py --cfg job
Output:
xformer:
  stack_configs:
    encoder_local:
      _target_: xformers.factory.block_factory.xFormerEncoderConfig
      reversible: false
      num_layers: 4
      user_triton: true
      dim_model: ${emb}
      layer_norm_style: pre
      position_encoding_config:
        name: vocab
        seq_len: 1024
        vocab_size: ${vocab}
        dropout: 0
      multi_head_config:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: local
          dropout: 0.0
          causal: null
          window_size: null
          force_sparsity: null
      feedforward_config:
        name: MLP
        dropout: 0
        activation: relu
        hidden_layer_multiplier: 4
    encoder_random:
      _target_: xformers.factory.block_factory.xFormerEncoderConfig
      reversible: false
      num_layers: 4
      user_triton: true
      dim_model: ${emb}
      layer_norm_style: pre
      position_encoding_config:
        name: vocab
        seq_len: 1024
        vocab_size: ${vocab}
        dropout: 0
      multi_head_config:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: random
          dropout: 0.0
          r: 0.01
          constant_masking: true
          force_sparsity: false
      feedforward_config:
        name: MLP
        dropout: 0
        activation: relu
        hidden_layer_multiplier: 4
    decoder_nystrom_favor:
      _target_: xformers.factory.block_factory.xFormerDecoderConfig
      reversible: false
      num_layers: 3
      block_type: decoder
      dim_model: ${emb}
      layer_norm_style: pre
      position_encoding_config:
        name: vocab
        seq_len: ${seq}
        vocab_size: ${vocab}
        dropout: 0
      multi_head_config_masked:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: nystrom
          dropout: 0
          causal: true
          seq_len: ${seq}
      multi_head_config_cross:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: favor
          dropout: 0.0
          dim_features: null
          dim_head: null
          iter_before_redraw: null
          feature_map: null
      feedforward_config:
        name: MLP
        dropout: 0
        activation: relu
        hidden_layer_multiplier: 4
  _target_: xformers.factory.model_factory.xFormer
emb: 384
seq: 1024
vocab: 64

Model built:

python examples/build_model/my_model.py
Output:
xFormer(
  (encoders): ModuleList(
    (0): xFormerEncoderBlock(
      (pose_encoding): VocabEmbedding(
        (dropout): Dropout(p=0, inplace=False)
        (position_embeddings): Embedding(1024, 384)
        (word_embeddings): Embedding(64, 384)
      )
      (mha): MultiHeadDispatch(
        (attention): LocalAttention(
          (attn_drop): Dropout(p=0.0, inplace=False)
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (feedforward): MLP(
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Dropout(p=0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0, inplace=False)
        )
      )
      (wrap_att): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): LocalAttention(
              (attn_drop): Dropout(p=0.0, inplace=False)
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_ff): PostNorm(
        (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (sublayer): Residual(
          (layer): PreNorm(
            (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
            (sublayer): MLP(
              (mlp): Sequential(
                (0): Linear(in_features=384, out_features=1536, bias=True)
                (1): ReLU()
                (2): Dropout(p=0, inplace=False)
                (3): Linear(in_features=1536, out_features=384, bias=True)
                (4): Dropout(p=0, inplace=False)
              )
            )
          )
        )
      )
    )
    (1): xFormerEncoderBlock(
      (pose_encoding): VocabEmbedding(
        (dropout): Dropout(p=0, inplace=False)
        (position_embeddings): Embedding(1024, 384)
        (word_embeddings): Embedding(64, 384)
      )
      (mha): MultiHeadDispatch(
        (attention): RandomAttention(
          (attn_drop): Dropout(p=0.0, inplace=False)
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (feedforward): MLP(
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Dropout(p=0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0, inplace=False)
        )
      )
      (wrap_att): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): RandomAttention(
              (attn_drop): Dropout(p=0.0, inplace=False)
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_ff): PostNorm(
        (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (sublayer): Residual(
          (layer): PreNorm(
            (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
            (sublayer): MLP(
              (mlp): Sequential(
                (0): Linear(in_features=384, out_features=1536, bias=True)
                (1): ReLU()
                (2): Dropout(p=0, inplace=False)
                (3): Linear(in_features=1536, out_features=384, bias=True)
                (4): Dropout(p=0, inplace=False)
              )
            )
          )
        )
      )
    )
  )
  (decoders): ModuleList(
    (0): xFormerDecoderBlock(
      (pose_encoding): VocabEmbedding(
        (dropout): Dropout(p=0, inplace=False)
        (position_embeddings): Embedding(1024, 384)
        (word_embeddings): Embedding(64, 384)
      )
      (mha): MultiHeadDispatch(
        (attention): NystromAttention(
          (attn_drop): Dropout(p=0, inplace=False)
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (cross_mha): MultiHeadDispatch(
        (attention): FavorAttention(
          (attn_drop): Dropout(p=0.0, inplace=True)
          (feature_map_query): SMReg()
          (feature_map_key): SMReg()
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (feedforward): MLP(
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Dropout(p=0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0, inplace=False)
        )
      )
      (wrap_att): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): NystromAttention(
              (attn_drop): Dropout(p=0, inplace=False)
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_cross): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): FavorAttention(
              (attn_drop): Dropout(p=0.0, inplace=True)
              (feature_map_query): SMReg()
              (feature_map_key): SMReg()
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_ff): PostNorm(
        (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (sublayer): Residual(
          (layer): PreNorm(
            (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
            (sublayer): MLP(
              (mlp): Sequential(
                (0): Linear(in_features=384, out_features=1536, bias=True)
                (1): ReLU()
                (2): Dropout(p=0, inplace=False)
                (3): Linear(in_features=1536, out_features=384, bias=True)
                (4): Dropout(p=0, inplace=False)
              )
            )
          )
        )
      )
    )
  )
)
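
For readers new to Hydra, the entry point behind the two commands above is roughly of this shape - a minimal sketch that assumes a primary config at examples/build_model/conf/config.yaml like the one printed earlier; the actual my_model.py in this PR may differ in detail:

import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    # `--cfg job` is handled by Hydra itself: it prints the composed config
    # (the YAML shown above) and exits without calling this function.
    model = instantiate(cfg.xformer)  # recursively builds every _target_ entry
    print(model)


if __name__ == "__main__":
    my_app()

instantiate() walks the config, constructs the nested xFormerEncoderConfig/xFormerDecoderConfig dataclasses, and finally the xformers.factory.model_factory.xFormer object whose repr is printed above.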

@facebook-github-bot added the CLA Signed label Nov 11, 2021
@jieru-hu jieru-hu marked this pull request as ready for review November 11, 2021 01:39
@jieru-hu jieru-hu requested review from blefaudeux, dianaml0 and fmassa and removed request for blefaudeux November 11, 2021 01:44
reversible: False # Optionally make these layers reversible to save memory
num_layers: 3 # Optional this means that this config will repeat N times
block_type: decoder
dim_model: ${emb}
Contributor:

I'm new to Hydra, but does this mean that it will be inferred from some broader context?

jieru-hu (Author):

Yes, the interpolation here is absolute - meaning it resolves emb from the primary config file, examples/build_model/conf/config.yaml. Maybe it is not the best or most obvious way to configure this; it will be better supported once Hydra gets partial instantiation, which should land in an upcoming release.
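
To make the absolute interpolation concrete, here is a rough sketch of how the two files relate (values taken from the printed config above; the actual files may differ):

# examples/build_model/conf/config.yaml (primary config, sketch)
emb: 384
seq: 1024
vocab: 64

# a stack config such as conf/stack/encoder_local.yaml (sketch)
_target_: xformers.factory.block_factory.xFormerEncoderConfig
dim_model: ${emb}  # absolute interpolation: resolved against the composed
                   # (primary) config, not against this file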

blefaudeux (Contributor) left a comment:

LGTM, provided CI is happy! Looks cleaner to me, thank you @jieru-hu!

@@ -100,6 +100,8 @@ class xFormerBlockConfig:
layer_norm_style: LayerNormStyle
layer_position: LayerPosition
use_triton: bool
reversible: bool
num_layers: int
Contributor:

agree, feels like this is the right place

Contributor:

Thinking about it, it also means that some of the tests with respect to reversibility can be removed, since all the layers in the same block will now share the same setting.

xFormerDecoderBlock,
xFormerDecoderConfig,
xFormerEncoderBlock,
xFormerEncoderConfig,
)


@dataclass(init=False)
Contributor:

agree

# Check that the reversible setting is not alternating, which
# - makes little sense, since you loose all the reversible benefits
# - may break
# Reversible is only allowed on the encoder side

reversible = [
Contributor:

ah perfect, exactly what I had in mind in a comment above

codecov-commenter commented Nov 16, 2021

Codecov Report

Merging #93 (247dfff) into main (c08c620) will increase coverage by 0.13%.
The diff coverage is 98.18%.


@@            Coverage Diff             @@
##             main      #93      +/-   ##
==========================================
+ Coverage   87.61%   87.74%   +0.13%     
==========================================
  Files          50       51       +1     
  Lines        2567     2587      +20     
==========================================
+ Hits         2249     2270      +21     
+ Misses        318      317       -1     
Flag Coverage Δ
Python 87.74% <98.18%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
xformers/factory/model_factory.py 97.67% <96.15%> (+0.97%) ⬆️
xformers/components/__init__.py 100.00% <100.00%> (ø)
xformers/components/attention/local.py 100.00% <100.00%> (ø)
xformers/factory/block_factory.py 93.25% <100.00%> (+0.44%) ⬆️
xformers/factory/hydra_helper.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c08c620...247dfff.

blefaudeux (Contributor) commented Nov 16, 2021

Last-minute request @jieru-hu: would it be possible to have a look at the unit test coverage and make sure that it does not regress? It's probably just a matter of a missing test case within the config space. It looks like the regression is in the model factory.

@@ -98,7 +98,7 @@ Building full models
====================


This is the last example in the series, and goes one level up again, so that we consider building a whole Tranformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.
Now let's build a full Tranformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.
Contributor:

Suggested change
Now let's build a full Tranformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.
Now let's build a full Transformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.

.. code-block:: yaml

defaults:
- /stack@xformer.stack_configs:
SeanNaren (Contributor):

Would it be out of scope to add some more information as to what these do in this doc page for people who are unfamiliar? This is a pretty advanced usage of Hydra which may be alright for users of Hydra, but may require some onboarding.

Or maybe I'm completely wrong cc @omry

jieru-hu (Author):

Thanks for taking a look @SeanNaren - yes, you're right, this is "advanced" Hydra :) I've put more context in the example config files, and I can certainly add a bit more context in the doc.
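
For readers unfamiliar with this bit of Hydra: the /stack@xformer.stack_configs entry in the defaults list selects several options from the stack config group and places them under the xformer.stack_configs package, which is what produces the stack_configs mapping in the printed config. A rough sketch of the primary config, consistent with that output (the actual file may differ):

# examples/build_model/conf/config.yaml (sketch)
defaults:
  - /stack@xformer.stack_configs:
      - encoder_local
      - encoder_random
      - decoder_nystrom_favor
  - _self_

emb: 384
seq: 1024
vocab: 64

xformer:
  _target_: xformers.factory.model_factory.xFormer

Each selected option keeps its own name as a key (encoder_local, encoder_random, decoder_nystrom_favor), which is how the three stacks end up side by side under stack_configs.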

SeanNaren (Contributor):

Hey @jieru-hu, how about a more component-based conf? I.e., define the attention in a folder like:

conf/attention/local.yaml

num_heads: 4
residual_dropout: 0
attention: local

Set the default in the higher-level defaults list, then use interpolation to bring it into the larger stack, i.e. (not sure if this works, but with some fiddling I think it will):

# base encoder settings that can be extended and overriden
# we leave out the attention part for other config to override

_target_: xformers.factory.block_factory.xFormerEncoderConfig
reversible: False
num_layers: 4
user_triton: True
dim_model: ${emb}
layer_norm_style: pre
position_encoding_config:
  name: vocab
  seq_len: 1024
  vocab_size: ${vocab}
  dropout: 0
multi_head_config: ${attention}

You're then able to set the attention in a nicer way, I think, like:

python examples/build_model/my_model.py  attention@xformer.attention=local

Is there a reason why you can't just do this, btw:

python examples/build_model/my_model.py  attention=local
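
For illustration only: the piece that would make an attention=local override work under this suggestion is a defaults entry selecting from an attention group - an untested sketch, with hypothetical file names:

# conf/config.yaml (sketch of the suggestion)
defaults:
  - attention: local   # places conf/attention/local.yaml under the `attention` package
  - /stack@xformer.stack_configs:
      - base_encoder   # the base encoder above, which interpolates ${attention}
  - _self_

python examples/build_model/my_model.py attention=local (or any other option under conf/attention/) would then switch the attention for every stack that interpolates ${attention}.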

jieru-hu (Author):

> hey @jieru-hu how about a more components based conf? I.e define the attention in a folder like:

Thanks for the suggestion, Sean. We can certainly do something like this, but if I understand your example correctly, we won't be able to add multiple stacks with different attentions to the same model, right?

@jieru-hu jieru-hu marked this pull request as draft November 30, 2021 23:40
@jieru-hu jieru-hu force-pushed the hydra-2 branch 2 times, most recently from a6963e9 to d3894dc Compare December 1, 2021 00:20
jieru-hu (Author) commented Dec 1, 2021

I was hoping to see the Codecov coverage report with this new commit, but somehow that didn't work. So I ended up finding the link to the report in the "Run unit tests with coverage" step of the CI: https://codecov.io/github/facebookresearch/xformers/commit/d3894dc61c4945b0424173e557e59528c33880a4. From the report, click on the commit SHA to get the line-by-line coverage in the diff tab :)

https://codecov.io/gh/facebookresearch/xformers/compare/b0526eef53f44b23fea8a97947f80234c125318c...d3894dc61c4945b0424173e557e59528c33880a4/diff

@jieru-hu jieru-hu force-pushed the hydra-2 branch 2 times, most recently from 184a23b to 4420710 Compare December 1, 2021 20:46
@jieru-hu jieru-hu marked this pull request as ready for review December 1, 2021 21:18
@@ -16,3 +16,6 @@ pytest-cov == 2.10.0
pytest-mpi == 0.4
pytest-timeout == 1.4.2
timm >= 0.4.5

# Dependency for factory
hydra-core >= 1.1
Contributor:

nit @jieru-hu, but duplicate, right?

jieru-hu (Author):

Hmm, I don't think this is a duplicate - I only added the hydra-core dependency to the examples.

Contributor:

see, right?

Contributor:

Whoops, you're right @jieru-hu, I missed the fact that the other requirements file was under examples... all good!

blefaudeux (Contributor):

Nice, thanks for the updates related to the unit tests @jieru-hu! I think it would be good to iterate with @SeanNaren on the interface; I especially like the python examples/build_model/my_model.py attention@xformer.attention=local suggestion, for instance. Architecture searches are planned, and I think that will help gather some practical feedback. In the meantime it's probably a good idea to land this, since it touches a lot of files and is relatively isolated.

jieru-hu (Author) commented Dec 1, 2021

> Nice, thanks for the updates related to the unit tests @jieru-hu ! I think that it would be good to iterate with @SeanNaren on the interface, I especially like the python examples/build_model/my_model.py attention@xformer.attention=local suggestion for instance. Architecture searchs are planned, I think that it will help getting some practical feedback, in the meantime probably a good idea to land that I think since it's both touching a lot of files and relatively isolated.

Yep, definitely! These examples are really V0 and I'm definitely planning on iterating on them. Will work with @SeanNaren to gather more feedback :)

@@ -1 +1,2 @@
hydra-core>1.1
blefaudeux (Contributor), Dec 1, 2021:

@jieru-hu hydra-core is also required here, and requirements is used as part of requirements-test, so it looks like a duplicate to me?

@jieru-hu jieru-hu merged commit c1b0325 into main Dec 1, 2021
@jieru-hu jieru-hu deleted the hydra-2 branch December 1, 2021 22:49
xwhan pushed a commit to xwhan/xformers that referenced this pull request Feb 8, 2022
…research#104)

* Adding some helpers on SparseCS + small unit testing
* unit test fix
* adding a device test, checking for facebookresearch#93
* catching the padding bug, fixing