
Fairseq support #1915

Merged
merged 19 commits into from
May 11, 2022

Conversation

jeffra
Collaborator

@jeffra jeffra commented Apr 26, 2022

This is the first step towards deepspeed supporting fairseq. In order to use deepspeed with fairseq we still require some changes on the fairseq trainer side. Currently this pairs with a fork of fairseq that adds deepspeed trainer support: https://github.com/jeffra/fairseq/tree/wip-deepspeed. We'll update the documentation on the deepspeed side once this fork has a new home and is fully featured and tested.

This PR adds the following features:

  • Transparently expose client model attributes that are neither nn.Module nor DeepSpeedEngine attributes, without code changes. For example, BaseFairseqModel (and its children) exposes many attributes that are used throughout the fairseq codebase. Without this feature we would need to modify many places in the fairseq code with deepspeed-enabled checks and invoke the model with either model.<fairseq-method>() or model.module.<fairseq-method>().
  • Override deepspeed's loss scaling logic (zero_stage=0,1,2,3 and fp16 fused/unfused) by adding support for override_loss_scale(loss_scale). This can be called before each training step to manually set a specific loss scale value (see the sketch after this list).
  • Manually set/get a learning rate value across all param groups (zero_stage=0,1,2,3 and fp16 fused/unfused).
  • Support for wrapping FairseqOptimizer without needing to specify that it is untested.
  • Custom model_f for loading the model state_dict.
  • Add support for zero.Init in ZeRO-3 to allow kwargs in apply; this allows F.linear() usage, which was previously not allowed.
  • Created base classes (DeepSpeedOptimizer, ZeROOptimizer) for all of the major deepspeed optimizers; this makes it easier to detect on the client side that a deepspeed optimizer is in use. In the future it would be great to move common functionality into these base classes so that all of our optimizers can inherit it (e.g., loss scale logic, grad norm).
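
A minimal usage sketch of the loss-scale and learning-rate overrides from a client trainer's perspective. Only override_loss_scale comes from this PR; the variable names and the learning-rate update via plain param_groups below are illustrative, not a specific new API:

import deepspeed

# model, args, data_loader, client_loss_scale and client_lr are assumed to come
# from the client (e.g. fairseq) side.
engine, optimizer, _, _ = deepspeed.initialize(args=args,
                                               model=model,
                                               model_parameters=model.parameters())

for step, batch in enumerate(data_loader):
    # Force the engine to use the client's loss scale for this step instead of
    # DeepSpeed's own dynamic loss scaler.
    engine.override_loss_scale(client_loss_scale)

    # Manually set one learning rate across all param groups.
    for group in optimizer.param_groups:
        group['lr'] = client_lr

    loss = engine(batch)
    engine.backward(loss)
    engine.step()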

@jeffra jeffra marked this pull request as draft April 26, 2022 21:25
@jeffra jeffra changed the title Fairseq support [WIP] Fairseq support Apr 26, 2022
@jeffra jeffra marked this pull request as ready for review May 6, 2022 22:35
@jeffra jeffra changed the title [WIP] Fairseq support Fairseq support May 6, 2022
@stas00
Collaborator

stas00 commented Feb 14, 2023

Hi @jeffra,

The pass-through feature is very neat! I had no idea it was added. We have been unwrapping the model for things like generate in transformers to meet this need.

This magic does make things very difficult to debug, though, since the engine object is not passed along to the method the call is forwarded to, so it can be a puzzle for a user who expects a normal engine forward to be called.

class A:
    def __init__(self): pass
    def a(self): print(f"a() was called on {type(self)}")

class B:
    # Mimics DeepSpeedEngine: wraps A and passes unknown attributes through to it.
    def __init__(self, a):
        self.module = a

    def __getattr__(self, name):
        # Only reached when normal attribute lookup on B fails.
        _module = {}
        if "module" in self.__dict__:
            _module = self.__dict__['module']

        if name in dir(self):
            return getattr(self, name)
        elif name in dir(_module):
            # Resolve against the wrapped module, i.e. the *unwrapped* model.
            return getattr(_module, name)
        else:
            raise AttributeError(
                f"'{type(self).__name__}' object has no attribute '{name}'")

a = A()
b = B(a)
a.a()
b.module.a()
b.a()
$ python test-generate.py
a() was called on <class '__main__.A'>
a() was called on <class '__main__.A'>
a() was called on <class '__main__.A'>

So we can clearly see that the module's method will be called on the unwrapped model.

The confusing part is:

[screenshot: Capture d’écran 2023-02-13 à 10 38 28 PM]

Yet nobody monkeypatched deepspeed. I had to scour all the likely places to validate that nobody did any monkeypatching, until I thought of searching for an overridden __getattr__, and that's when I discovered this addition.
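
As an aside, a quick way to confirm this kind of pass-through without hunting for monkeypatching is inspect.getattr_static, which resolves attributes without triggering __getattr__. In this sketch, engine stands in for the DeepSpeedEngine instance:

import inspect

print(type(engine.generate))  # resolves via __getattr__ to the wrapped model's method
try:
    inspect.getattr_static(engine, "generate")  # static lookup, bypasses __getattr__
except AttributeError:
    print("'generate' is not defined on the engine itself -> it is passed through")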

@stas00
Collaborator

stas00 commented Feb 14, 2023

So @VictorSanh found that a model's forward breaks under ZeRO-3 when generate is invoked via the pass-through, and he narrowed it down to this missing code:

if self.zero_optimization_partition_weights():
    # Enable automated discovery of external parameters by indicating that
    # we are in a forward pass.
    for module in self.module.modules():
        module._parameters._in_forward = True
        pass

If he manually adds it before calling generate, everything works.

In other words, any utility that relies on auto-discovery of external parameters and gets invoked on an unwrapped model will break.

The same model works just fine under normal training mode or when not using HF transformers' generate.

@stas00
Collaborator

stas00 commented Feb 14, 2023

This, of course, also means that any HF transformers model with external params will run into this issue regardless of whether generate is called on an explicitly unwrapped model or via the __getattr__ magic - so this PR didn't break anything directly, and perhaps we ought to open a new Issue instead. Please let me know whether that would be more beneficial.

Ideally, the solution would be to simply pass the wrapped model to generate as a function rather than a method, but its class has a bunch of other methods that generate calls as well, so it would require rewriting everything as pure functions.

The other approach would be to pass two objects - one to resolve the methods against, and the other the wrapped model, which keeps getting passed along until the model(...) call is at hand, at which point we substitute the wrapped version instead. It would be quite an invasion into the API, but it would work perfectly.

@VictorSanh

VictorSanh commented Feb 14, 2023

Thanks Stas!

Some more context: the initial error arose when trying to perform generation (inference) on an HF transformers-like model (saying "like" because it has some custom components that I will describe below).

The model is wrapped in a DSEngine, and calling generate is possible because of the attribute pass-through described above.
The call to generate then goes into HF transformers' generation utilities, which eventually call the language model's forward (for instance, inside greedy_search):

https://github.com/huggingface/transformers/blob/8c5026628a29a89cf122bc1c95cff8101f78c7c0/src/transformers/generation/utils.py#L2195

At that point, self is the unwrapped model (no DS Engine anymore).

The error arises in the vocab projection module (i.e. at the very end of the network), which instead of a regular nn.Linear is a custom Linear that optionally freezes a slice of the weight while the other slice is always trainable:

import torch
from torch import nn
from torch.nn import functional as F


class DecoupledLinear(nn.Linear):
    # Derived from https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear
    """
    Implements a decoupling of parameters to allow freezing (or not) a subset of the parameters.
    In practice, the regular `weight` can be trained or frozen (i.e. `partially_freeze=True`), and if
    `out_additional_features > 0`, it will create `out_additional_features * in_features` additional
    parameters that are always trained.
    If `out_additional_features=0`, the module defaults back to the regular behavior of `nn.Linear`.
    """

    def __init__(
        self,
        in_features: int,
        out_features: int,
        out_additional_features: int = 0,
        bias: bool = True,
        partially_freeze: bool = True,
        device=None,
        dtype=None,
    ) -> None:
        """
        out_additional_features: int. Number of additional trainable dimensions. Only makes sense when `partially_freeze=True`.
        partially_freeze: bool. If True, the regular `weight` will be frozen and extra parameters (if any) will be trainable. If False, default to the regular behavior of nn.Linear.
        """
        super().__init__(in_features, out_features, bias, device, dtype)
        self.out_additional_features = out_additional_features
        self.partially_freeze = partially_freeze

        self.in_features = in_features
        self.out_features = out_features

        if partially_freeze:
            self.weight.requires_grad_(False)
            if bias:
                self.bias.requires_grad_(False)

        if out_additional_features > 0:
            self.additional_fc = nn.Linear(
                in_features=in_features,
                out_features=out_additional_features,
                bias=bias,
                device=device,
                dtype=dtype,
            )

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        output = F.linear(input, self.weight, self.bias)

        if self.out_additional_features > 0:
            additional_features = F.linear(input, self.additional_fc.weight, self.additional_fc.bias)
            output = torch.cat((output, additional_features), -1)

        return output

The error would typically be:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/victor/miniconda3/envs/victor-m4/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/victor/miniconda3/envs/victor-m4/lib/python3.8/site-packages/transformers/generation/utils.py", line 1518, in generate
    return self.greedy_search(
  File "/home/victor/miniconda3/envs/victor-m4/lib/python3.8/site-packages/transformers/generation/utils.py", line 2285, in greedy_search
    outputs = self(
  File "/home/victor/miniconda3/envs/victor-m4/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/victor/m4/m4/models/vopt/modeling_vopt.py", line 1293, in forward
    logits = self.lm_head(outputs[0]).contiguous()
  File "/home/victor/miniconda3/envs/victor-m4/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/victor/m4/m4/models/custom_modules.py", line 267, in forward
    additional_features = F.linear(input, self.additional_fc.weight, self.additional_fc.bias)
  File "/home/victor/DeepSpeed/deepspeed/runtime/zero/linear.py", line 122, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight)
  File "/home/victor/miniconda3/envs/victor-m4/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 110, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/victor/DeepSpeed/deepspeed/runtime/zero/linear.py", line 67, in forward
    output = input.matmul(weight.t())
RuntimeError: size mismatch, got 15, 15x2048,0

in the call to additional_features = F.linear(input, self.additional_fc.weight, self.additional_fc.bias).
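
For reference, under ZeRO-3 the partitioned additional_fc.weight has zero elements at this point because nothing gathered it before the bare F.linear call. A hedged sketch of the two documented workarounds, assuming model is the unwrapped module whose lm_head is the DecoupledLinear above (hidden_states is a placeholder for its input):

import deepspeed

# Option 1: register the parameter that forward() accesses outside of its owning
# submodule's own forward, so ZeRO-3 knows to gather it automatically.
deepspeed.zero.register_external_parameter(model.lm_head,
                                           model.lm_head.additional_fc.weight)

# Option 2: explicitly gather the partitioned parameters around the call site.
with deepspeed.zero.GatheredParameters(list(model.lm_head.parameters()),
                                       modifier_rank=None):
    logits = model.lm_head(hidden_states)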

@stas00
Collaborator

stas00 commented Feb 14, 2023

OK, so perhaps we can rescue this situation in a few ways (this and the following comments):

deepspeed solution a.
Have deepspeed provide a new method that the user calls before generate; it would do everything that needs to be done in lieu of calling the wrapped forward, like the code Victor found he had to paste in:

if self.zero_optimization_partition_weights():
    # Enable automated discovery of external parameters by indicating that
    # we are in a forward pass.
    for module in self.module.modules():
        module._parameters._in_forward = True
        pass

Let's call it def prepare_for_unwrapped_forward - the key is to make sure it does the same thing the engine's forward would have done.

Of course the user can follow https://deepspeed.readthedocs.io/en/latest/zero3.html#registering-external-parameters to overcome this, but the problem here is that we have an inconsistency between train/eval and generate which users are likely to find very confusing. A wrapper like the above could easily be added to the HF Trainer, for example.
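
A hedged sketch of what such a helper could look like, reusing the snippet above (the name prepare_for_unwrapped_forward and its placement are only a proposal, not an existing DeepSpeed API):

def prepare_for_unwrapped_forward(engine):
    """Replicate the pre-forward setup DeepSpeedEngine.forward would do, for code
    paths (e.g. HF generate) that end up calling the unwrapped module directly."""
    if engine.zero_optimization_partition_weights():
        # Enable automated discovery of external parameters by indicating that
        # we are in a forward pass.
        for module in engine.module.modules():
            module._parameters._in_forward = True

# e.g. in client code, before generating through the pass-through:
prepare_for_unwrapped_forward(engine)
outputs = engine.generate(input_ids, max_new_tokens=20)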

@stas00
Collaborator

stas00 commented Feb 14, 2023

I think this pass-through method magic is actually problematic, as it sweeps the problem under the carpet, as can be seen from the above comments.

model provider solution a.
Instead, fairseq, transformers, and other frameworks should accept a wrapped-model argument in methods like generate and use it at the point of __call__. The unwrapped model would still be used for method resolution, and at the point of forward/__call__ the wrapped engine would be used (see the sketch below).

This doesn't have to be deepspeed-specific and could be used by any framework that does the wrapping and needs to ensure the wrapping stays in effect during forward.
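
A rough sketch of what that could look like on the framework side; everything here is illustrative (the wrapped_model keyword does not exist in transformers/fairseq, and an HF-style output with a .logits field is assumed):

import torch

class GenerationMixinSketch:
    """Illustrative only: shows where a framework could accept the wrapped engine."""

    @torch.no_grad()
    def generate(self, input_ids, wrapped_model=None, max_new_tokens=20):
        # Method/attribute resolution stays on the unwrapped model (self); the
        # actual forward goes through the wrapper so that its hooks and
        # partitioning logic (e.g. a DeepSpeedEngine) remain active.
        model_to_call = wrapped_model if wrapped_model is not None else self
        for _ in range(max_new_tokens):
            outputs = model_to_call(input_ids)
            next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
        return input_ids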

@stas00
Collaborator

stas00 commented Feb 14, 2023

deepspeed solution b:
A possibly mad solution would be to stash the engine object somewhere in the first forward pre-hook when doing the partitioning, and use that pre-hook to run the engine's setup during an unwrapped forward, with some call-stack tracking in case it was already invoked directly so it won't run twice (rough sketch below).
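
A hedged sketch of that idea using a standard forward pre-hook; this is a proposal only, not existing DeepSpeed behaviour, and a real version would need the re-entrancy tracking mentioned above:

def attach_unwrapped_forward_hook(engine):
    """Proposal: have the top-level module itself re-run the engine's pre-forward
    setup, so calling the unwrapped model directly still behaves."""

    def _pre_forward(module, inputs):
        if engine.zero_optimization_partition_weights():
            for m in module.modules():
                # Flag that we are in a forward pass so external parameters are
                # auto-discovered, mirroring what DeepSpeedEngine.forward does.
                m._parameters._in_forward = True
        # Returning None leaves the inputs unchanged.

    engine.module.register_forward_pre_hook(_pre_forward)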

@mrwyattii mrwyattii deleted the jeffra/fs-v062 branch July 7, 2023 02:39