Take 2: use Hydra to build xformer model #93

Merged
jieru-hu merged 8 commits into main from hydra-2 on Dec 1, 2021

Conversation

jieru-hu (Contributor)

Addresses the feedback in #59.

At a high level:

  1. Move the Hydra dependency to be optional (see the sketch below).
  2. Refactor the model factory, mainly removing StackConfig to simplify the config.
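
For item 1, a common way to make a dependency optional is to guard the import and only fail when the Hydra-specific helper is actually used. A rough sketch of that pattern (function name and error message are illustrative, not necessarily what this PR does):

# hypothetical sketch of an optional-dependency guard
try:
    from hydra.core.config_store import ConfigStore
    _HYDRA_AVAILABLE = True
except ImportError:
    _HYDRA_AVAILABLE = False


def register_xformer_schemas() -> None:
    """Register the factory config schemas with Hydra, if hydra-core is installed."""
    if not _HYDRA_AVAILABLE:
        raise ImportError(
            "hydra-core>=1.1 is needed for the config-driven factory helpers"
        )
    cs = ConfigStore.instance()
    # schemas would be registered here, e.g.:
    # cs.store(group="stack", name="base_encoder", node=xFormerEncoderConfig)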

See the final config:

python examples/build_model/my_model.py --cfg job
Output:
xformer:
  stack_configs:
    encoder_local:
      _target_: xformers.factory.block_factory.xFormerEncoderConfig
      reversible: false
      num_layers: 4
      user_triton: true
      dim_model: ${emb}
      layer_norm_style: pre
      position_encoding_config:
        name: vocab
        seq_len: 1024
        vocab_size: ${vocab}
        dropout: 0
      multi_head_config:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: local
          dropout: 0.0
          causal: null
          window_size: null
          force_sparsity: null
      feedforward_config:
        name: MLP
        dropout: 0
        activation: relu
        hidden_layer_multiplier: 4
    encoder_random:
      _target_: xformers.factory.block_factory.xFormerEncoderConfig
      reversible: false
      num_layers: 4
      user_triton: true
      dim_model: ${emb}
      layer_norm_style: pre
      position_encoding_config:
        name: vocab
        seq_len: 1024
        vocab_size: ${vocab}
        dropout: 0
      multi_head_config:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: random
          dropout: 0.0
          r: 0.01
          constant_masking: true
          force_sparsity: false
      feedforward_config:
        name: MLP
        dropout: 0
        activation: relu
        hidden_layer_multiplier: 4
    decoder_nystrom_favor:
      _target_: xformers.factory.block_factory.xFormerDecoderConfig
      reversible: false
      num_layers: 3
      block_type: decoder
      dim_model: ${emb}
      layer_norm_style: pre
      position_encoding_config:
        name: vocab
        seq_len: ${seq}
        vocab_size: ${vocab}
        dropout: 0
      multi_head_config_masked:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: nystrom
          dropout: 0
          causal: true
          seq_len: ${seq}
      multi_head_config_cross:
        num_heads: 4
        residual_dropout: 0
        attention:
          name: favor
          dropout: 0.0
          dim_features: null
          dim_head: null
          iter_before_redraw: null
          feature_map: null
      feedforward_config:
        name: MLP
        dropout: 0
        activation: relu
        hidden_layer_multiplier: 4
  _target_: xformers.factory.model_factory.xFormer
emb: 384
seq: 1024
vocab: 64

Model built:

python examples/build_model/my_model.py
Output:
xFormer(
  (encoders): ModuleList(
    (0): xFormerEncoderBlock(
      (pose_encoding): VocabEmbedding(
        (dropout): Dropout(p=0, inplace=False)
        (position_embeddings): Embedding(1024, 384)
        (word_embeddings): Embedding(64, 384)
      )
      (mha): MultiHeadDispatch(
        (attention): LocalAttention(
          (attn_drop): Dropout(p=0.0, inplace=False)
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (feedforward): MLP(
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Dropout(p=0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0, inplace=False)
        )
      )
      (wrap_att): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): LocalAttention(
              (attn_drop): Dropout(p=0.0, inplace=False)
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_ff): PostNorm(
        (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (sublayer): Residual(
          (layer): PreNorm(
            (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
            (sublayer): MLP(
              (mlp): Sequential(
                (0): Linear(in_features=384, out_features=1536, bias=True)
                (1): ReLU()
                (2): Dropout(p=0, inplace=False)
                (3): Linear(in_features=1536, out_features=384, bias=True)
                (4): Dropout(p=0, inplace=False)
              )
            )
          )
        )
      )
    )
    (1): xFormerEncoderBlock(
      (pose_encoding): VocabEmbedding(
        (dropout): Dropout(p=0, inplace=False)
        (position_embeddings): Embedding(1024, 384)
        (word_embeddings): Embedding(64, 384)
      )
      (mha): MultiHeadDispatch(
        (attention): RandomAttention(
          (attn_drop): Dropout(p=0.0, inplace=False)
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (feedforward): MLP(
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Dropout(p=0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0, inplace=False)
        )
      )
      (wrap_att): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): RandomAttention(
              (attn_drop): Dropout(p=0.0, inplace=False)
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_ff): PostNorm(
        (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (sublayer): Residual(
          (layer): PreNorm(
            (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
            (sublayer): MLP(
              (mlp): Sequential(
                (0): Linear(in_features=384, out_features=1536, bias=True)
                (1): ReLU()
                (2): Dropout(p=0, inplace=False)
                (3): Linear(in_features=1536, out_features=384, bias=True)
                (4): Dropout(p=0, inplace=False)
              )
            )
          )
        )
      )
    )
  )
  (decoders): ModuleList(
    (0): xFormerDecoderBlock(
      (pose_encoding): VocabEmbedding(
        (dropout): Dropout(p=0, inplace=False)
        (position_embeddings): Embedding(1024, 384)
        (word_embeddings): Embedding(64, 384)
      )
      (mha): MultiHeadDispatch(
        (attention): NystromAttention(
          (attn_drop): Dropout(p=0, inplace=False)
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (cross_mha): MultiHeadDispatch(
        (attention): FavorAttention(
          (attn_drop): Dropout(p=0.0, inplace=True)
          (feature_map_query): SMReg()
          (feature_map_key): SMReg()
        )
        (in_proj_container): InProjContainer()
        (resid_drop): Dropout(p=0, inplace=False)
        (proj): Linear(in_features=384, out_features=384, bias=True)
      )
      (feedforward): MLP(
        (mlp): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Dropout(p=0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0, inplace=False)
        )
      )
      (wrap_att): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): NystromAttention(
              (attn_drop): Dropout(p=0, inplace=False)
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_cross): Residual(
        (layer): PreNorm(
          (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (sublayer): MultiHeadDispatch(
            (attention): FavorAttention(
              (attn_drop): Dropout(p=0.0, inplace=True)
              (feature_map_query): SMReg()
              (feature_map_key): SMReg()
            )
            (in_proj_container): InProjContainer()
            (resid_drop): Dropout(p=0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
          )
        )
      )
      (wrap_ff): PostNorm(
        (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
        (sublayer): Residual(
          (layer): PreNorm(
            (norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
            (sublayer): MLP(
              (mlp): Sequential(
                (0): Linear(in_features=384, out_features=1536, bias=True)
                (1): ReLU()
                (2): Dropout(p=0, inplace=False)
                (3): Linear(in_features=1536, out_features=384, bias=True)
                (4): Dropout(p=0, inplace=False)
              )
            )
          )
        )
      )
    )
  )
)
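
For readers new to Hydra, the entry point behind the two commands above is roughly of this shape - a minimal sketch that assumes a primary config at examples/build_model/conf/config.yaml like the one printed earlier; the actual my_model.py in this PR may differ in detail:

import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    # `--cfg job` is handled by Hydra itself: it prints the composed config
    # (the YAML shown above) and exits without calling this function.
    model = instantiate(cfg.xformer)  # recursively builds every _target_ entry
    print(model)


if __name__ == "__main__":
    my_app()

instantiate() walks the config, constructs the nested xFormerEncoderConfig/xFormerDecoderConfig dataclasses, and finally the xformers.factory.model_factory.xFormer object whose repr is printed above.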

@facebook-github-bot added the CLA Signed label Nov 11, 2021
@jieru-hu jieru-hu marked this pull request as ready for review November 11, 2021 01:39
@jieru-hu jieru-hu requested review from blefaudeux, dianaml0 and fmassa and removed request for blefaudeux November 11, 2021 01:44
reversible: False # Optionally make these layers reversible to save memory
num_layers: 3 # Optional this means that this config will repeat N times
block_type: decoder
dim_model: ${emb}
Contributor:

I'm new to Hydra, but does this mean that it will be inferred from some broader context?

jieru-hu (Author):

Yes, the interpolation here is absolute - meaning it resolves emb from the primary config file, examples/build_model/conf/config.yaml. Maybe it is not the best or most obvious way to configure this; it will be better supported once Hydra gets partial instantiation, which should land in an upcoming release.
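
To make the absolute interpolation concrete, here is a rough sketch of how the two files relate (values taken from the printed config above; the actual files may differ):

# examples/build_model/conf/config.yaml (primary config, sketch)
emb: 384
seq: 1024
vocab: 64

# a stack config such as conf/stack/encoder_local.yaml (sketch)
_target_: xformers.factory.block_factory.xFormerEncoderConfig
dim_model: ${emb}  # absolute interpolation: resolved against the composed
                   # (primary) config, not against this file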

blefaudeux (Contributor) left a comment:

LGTM, provided CI is happy! Looks cleaner to me, thank you @jieru-hu!

@@ -100,6 +100,8 @@ class xFormerBlockConfig:
layer_norm_style: LayerNormStyle
layer_position: LayerPosition
use_triton: bool
reversible: bool
num_layers: int
Contributor:

agree, feels like this is the right place

Contributor:

Thinking about it, it also means that some of the tests with respect to reversibility can be removed, since all the layers in the same block will now share the same setting.

xFormerDecoderBlock,
xFormerDecoderConfig,
xFormerEncoderBlock,
xFormerEncoderConfig,
)


@dataclass(init=False)
Contributor:

agree

# Check that the reversible setting is not alternating, which
# - makes little sense, since you loose all the reversible benefits
# - may break
# Reversible is only allowed on the encoder side

reversible = [
Contributor:

ah perfect, exactly what I had in mind in a comment above

codecov-commenter commented Nov 16, 2021

Codecov Report

Merging #93 (247dfff) into main (c08c620) will increase coverage by 0.13%.
The diff coverage is 98.18%.


@@            Coverage Diff             @@
##             main      #93      +/-   ##
==========================================
+ Coverage   87.61%   87.74%   +0.13%     
==========================================
  Files          50       51       +1     
  Lines        2567     2587      +20     
==========================================
+ Hits         2249     2270      +21     
+ Misses        318      317       -1     
Flag Coverage Δ
Python 87.74% <98.18%> (+0.13%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
xformers/factory/model_factory.py 97.67% <96.15%> (+0.97%) ⬆️
xformers/components/__init__.py 100.00% <100.00%> (ø)
xformers/components/attention/local.py 100.00% <100.00%> (ø)
xformers/factory/block_factory.py 93.25% <100.00%> (+0.44%) ⬆️
xformers/factory/hydra_helper.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c08c620...247dfff.

blefaudeux (Contributor) commented Nov 16, 2021

Last-minute request @jieru-hu: would it be possible to have a look at the unit test coverage and make sure that it does not regress? It's probably just a matter of a missing test case within the config space. It looks like the regression is in the model factory.

@@ -98,7 +98,7 @@ Building full models
====================


This is the last example in the series, and goes one level up again, so that we consider building a whole Tranformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.
Now let's build a full Tranformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.
Contributor:

Suggested change
Now let's build a full Tranformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.
Now let's build a full Transformer/xFormer model. Please note that this is just an example, because building the whole model from explicit parts is always an option, from pure PyTorch building blocks or adding some xFormers primitives.

.. code-block:: yaml

defaults:
- /stack@xformer.stack_configs:
SeanNaren (Contributor):

Would it be out of scope to add some more information as to what these do in this doc page for people who are unfamiliar? This is a pretty advanced usage of Hydra which may be alright for users of Hydra, but may require some onboarding.

Or maybe I'm completely wrong cc @omry

jieru-hu (Author):

Thanks for taking a look @SeanNaren - yes, you're right, this is "advanced" Hydra :) I've put more context in the example config files, and I can certainly add a bit more context in the doc.
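
For readers unfamiliar with this bit of Hydra: the /stack@xformer.stack_configs entry in the defaults list selects several options from the stack config group and places them under the xformer.stack_configs package, which is what produces the stack_configs mapping in the printed config. A rough sketch of the primary config, consistent with that output (the actual file may differ):

# examples/build_model/conf/config.yaml (sketch)
defaults:
  - /stack@xformer.stack_configs:
      - encoder_local
      - encoder_random
      - decoder_nystrom_favor
  - _self_

emb: 384
seq: 1024
vocab: 64

xformer:
  _target_: xformers.factory.model_factory.xFormer

Each selected option keeps its own name as a key (encoder_local, encoder_random, decoder_nystrom_favor), which is how the three stacks end up side by side under stack_configs.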

SeanNaren (Contributor):

Hey @jieru-hu, how about a more component-based conf? I.e., define the attention in a folder like:

conf/attention/local.yaml

num_heads: 4
residual_dropout: 0
attention: local

Set the default in the higher-level defaults list, then use interpolation to bring it into the larger stack, i.e. (not sure if this works, but with some fiddling I think it will):

# base encoder settings that can be extended and overriden
# we leave out the attention part for other config to override

_target_: xformers.factory.block_factory.xFormerEncoderConfig
reversible: False
num_layers: 4
user_triton: True
dim_model: ${emb}
layer_norm_style: pre
position_encoding_config:
  name: vocab
  seq_len: 1024
  vocab_size: ${vocab}
  dropout: 0
multi_head_config: ${attention}

You're then able to set the attention in a nicer way, I think, like:

python examples/build_model/my_model.py  attention@xformer.attention=local

Is there a reason why you can't just do this, btw:

python examples/build_model/my_model.py  attention=local
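
For illustration only: the piece that would make an attention=local override work under this suggestion is a defaults entry selecting from an attention group - an untested sketch, with hypothetical file names:

# conf/config.yaml (sketch of the suggestion)
defaults:
  - attention: local   # places conf/attention/local.yaml under the `attention` package
  - /stack@xformer.stack_configs:
      - base_encoder   # the base encoder above, which interpolates ${attention}
  - _self_

python examples/build_model/my_model.py attention=local (or any other option under conf/attention/) would then switch the attention for every stack that interpolates ${attention}.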

jieru-hu (Author):

> hey @jieru-hu how about a more components based conf? I.e define the attention in a folder like:

Thanks for the suggestion, Sean. We can certainly do something like this, but if I understand your example correctly, we won't be able to add multiple stacks with different attentions to the same model, right?

@jieru-hu jieru-hu marked this pull request as draft November 30, 2021 23:40
@jieru-hu jieru-hu force-pushed the hydra-2 branch 2 times, most recently from a6963e9 to d3894dc Compare December 1, 2021 00:20
jieru-hu (Author) commented Dec 1, 2021

I was hoping to see the Codecov coverage report with this new commit, but somehow that didn't work. So I ended up finding the link to the report in the "Run unit tests with coverage" step of the CI: https://codecov.io/github/facebookresearch/xformers/commit/d3894dc61c4945b0424173e557e59528c33880a4. From the report, click on the commit SHA to get the line-by-line coverage in the diff tab :)

https://codecov.io/gh/facebookresearch/xformers/compare/b0526eef53f44b23fea8a97947f80234c125318c...d3894dc61c4945b0424173e557e59528c33880a4/diff

@jieru-hu jieru-hu force-pushed the hydra-2 branch 2 times, most recently from 184a23b to 4420710 Compare December 1, 2021 20:46
@jieru-hu jieru-hu marked this pull request as ready for review December 1, 2021 21:18
@@ -16,3 +16,6 @@ pytest-cov == 2.10.0
pytest-mpi == 0.4
pytest-timeout == 1.4.2
timm >= 0.4.5

# Dependency for factory
hydra-core >= 1.1
Contributor:

nit @jieru-hu, but duplicate, right?

jieru-hu (Author):

Hmm, I don't think this is a duplicate - I only added the hydra-core dependency to the examples.

Contributor:

see, right?

Contributor:

Whoops, you're right @jieru-hu, I missed the fact that the other requirements file was under examples... all good!

blefaudeux (Contributor):

Nice, thanks for the updates related to the unit tests @jieru-hu! I think it would be good to iterate with @SeanNaren on the interface; I especially like the python examples/build_model/my_model.py attention@xformer.attention=local suggestion, for instance. Architecture searches are planned, and I think that will help gather some practical feedback. In the meantime it's probably a good idea to land this, since it touches a lot of files and is relatively isolated.

jieru-hu (Author) commented Dec 1, 2021

> Nice, thanks for the updates related to the unit tests @jieru-hu ! I think that it would be good to iterate with @SeanNaren on the interface, I especially like the python examples/build_model/my_model.py attention@xformer.attention=local suggestion for instance. Architecture searchs are planned, I think that it will help getting some practical feedback, in the meantime probably a good idea to land that I think since it's both touching a lot of files and relatively isolated.

Yep, definitely! These examples are really V0 and I'm definitely planning on iterating on them. Will work with @SeanNaren to gather more feedback :)

@@ -1 +1,2 @@
hydra-core>1.1
blefaudeux (Contributor), Dec 1, 2021:

@jieru-hu hydra-core is also required here, and requirements is used as part of requirements-test, so it looks like a duplicate to me?

@jieru-hu jieru-hu merged commit c1b0325 into main Dec 1, 2021
@jieru-hu jieru-hu deleted the hydra-2 branch December 1, 2021 22:49
xwhan pushed a commit to xwhan/xformers that referenced this pull request Feb 8, 2022
…research#104)

* Adding some helpers on SparseCS + small unit testing
* unit test fix
* adding a device test, checking for facebookresearch#93
* catching the padding bug, fixing