
New TF model design #7753


Closed · wants to merge 78 commits

Conversation

@jplu (Contributor) commented Oct 13, 2020

What does this PR do?

This PR aims to improve the TensorFlow code base of transformers. For the sake of simplicity the changes are made only on the BERT model, but the new features can be applied to all the others. Alongside some bug fixes, the PR brings the following main new features:

  1. It is now possible to train an LM model from scratch (see this notebook as an example).
  2. The default outputs of the models are now dictionaries. Tuples can only be returned in eager execution; otherwise a warning message is displayed saying that only dictionaries are allowed in graph mode (see the usage sketch after this list). This update fixes two issues: one in graph mode, where a layer cannot return outputs of varying size, and another where an output cannot have a None value.
  3. Better input handling. The inputs of each model and of the main layer are now parsed with a single generic function, bringing more robust parsing and better error handling in case of wrong input. This fixes an issue when the input was a list of symbolic inputs (i.e. tf.keras.layers.Input).
  4. TensorFlow models now look much more similar to their PyTorch counterparts, making it easier for users to switch from a PyTorch model to its TensorFlow implementation and vice versa.
  5. Old and new model implementations can coexist in the library, making this new implementation 100% backward compatible, tests included.
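
A minimal usage sketch of point 2, assuming the standard TFBertModel API (the checkpoint name, token ids and shapes below are illustrative only):

import tensorflow as tf
from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-cased")
input_ids = tf.constant([[101, 7592, 1010, 2088, 102]])  # illustrative token ids

outputs = model(input_ids)                            # dictionary-like ModelOutput by default
print(outputs.last_hidden_state.shape)                # (1, 5, 768)

tuple_outputs = model(input_ids, return_dict=False)   # tuples only allowed in eager mode
print(type(tuple_outputs))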

@jplu jplu marked this pull request as draft October 13, 2020 07:12
@LysandreJik (Member) left a comment

This is a great PR!! Lots of welcome improvements, and it will make it easier to develop TF models!

Not reviewing any doc or individual code since it's a draft PR, just the behavior.

I have a few questions:

From embedding building to tf.keras.layers.Embedding

The word embeddings currently rely on tf.name_scope("word_embeddings"): [...] self.add_weight [...], and you updated that to use tf.keras.layers.Embedding. That's cool, but now you're not passing these weights around to models with LM heads for the decoders to use, while the input and output embeddings should be tied in most ForMaskedLM and WithLMHead models.

  • Would we now be relying on the LM head weights being in the checkpoint?

Dense -> DenseEinsum

  • Does the change from tf.keras.layers.Dense to tf.keras.layers.experimental.EinsumDense imply a breaking change or some magic that we must do to ensure that the weights get correctly loaded?
  • Is that change the main one that unlocks better performance?

Tuples -> tf.TensorArray

You're stacking the tensor array in the model outputs, but that returns stacked tensors instead of the previous tuples, which is a big breaking change.

  • Why is that change necessary? Does it give better performance? Is it necessary for graph mode?

Joining the nsp + mlm in one cls

  • This is cool as it's more similar to what we do in PyTorch! However, you're now handling the names of the weights directly in the modeling_tf_utils.py file. I can't imagine that scaling well as more models are converted to this implementation?

@jplu (Contributor, Author) commented Oct 13, 2020

Thanks for your comments! I will try to answer as clearly as possible.

From embedding building to tf.keras.layers.Embedding

Would we now be relying on the LM head weights being in the checkpoint?

Exactly. Another difference is that the word_embeddings weights are initialized at model instantiation instead of at model building (after a first call). The weights are saved in the checkpoints anyway and shared with the layers that need access to them. You can also see that with the resize_token_embeddings methods. The only test with a decoder we have for BERT is create_and_check_bert_lm_head, and it passes. Do you see any other test we could apply to check that the sharing part works as expected?
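
For illustration, a minimal sketch (not the PR's actual code; the TiedLMHead class and its names are hypothetical) of how an LM head can reuse the weights of a tf.keras.layers.Embedding so the input and output embeddings stay tied:

import tensorflow as tf

class TiedLMHead(tf.keras.layers.Layer):
    """Hypothetical LM head that reuses the input embedding matrix as its decoder."""

    def __init__(self, embedding_layer: tf.keras.layers.Embedding, **kwargs):
        super().__init__(**kwargs)
        self.embedding_layer = embedding_layer  # shared reference, not a copy

    def build(self, input_shape):
        self.bias = self.add_weight(
            "bias", shape=(self.embedding_layer.input_dim,), initializer="zeros"
        )
        super().build(input_shape)

    def call(self, hidden_states):
        shape = tf.shape(hidden_states)
        x = tf.reshape(hidden_states, (-1, self.embedding_layer.output_dim))
        # Project back onto the vocabulary with the transposed embedding matrix.
        logits = tf.matmul(x, self.embedding_layer.embeddings, transpose_b=True)
        return tf.reshape(logits, (shape[0], shape[1], -1)) + self.bias

embeddings = tf.keras.layers.Embedding(30522, 768)
_ = embeddings(tf.zeros((1, 1), dtype=tf.int32))  # build the shared weights
lm_head = TiedLMHead(embeddings)
print(lm_head(tf.random.uniform((2, 16, 768))).shape)  # (2, 16, 30522)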

Dense -> DenseEinsum

Does the change from tf.keras.layers.Dense to tf.keras.layers.experimental.EinsumDense imply a breaking change or some magic that we must do to ensure that the weights get correctly loaded?

No breaking change at all. The old model format can be loaded into the new one and vice versa. And the magic is as simple as a single reshape call, because both must have compatible shapes:

if K.int_shape(symbolic_weight) != saved_weight_value.shape:
    try:
        array = np.reshape(saved_weight_value, K.int_shape(symbolic_weight))
    except AssertionError as e:
        e.args += (K.int_shape(symbolic_weight), saved_weight_value.shape)
        raise e
else:
    array = saved_weight_value

And this is already integrated in the current release https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_tf_utils.py#L290

Is that change the main one that unlocks better performance?

Exactly! That change alone is what unlocks the serving performance.
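
For reference, a small sketch (layer names and shapes are illustrative) showing a Dense projection and its EinsumDense equivalent over a [batch, seq, hidden] input:

import tensorflow as tf

hidden_size = 768

# Standard dense projection over the last axis.
dense = tf.keras.layers.Dense(hidden_size, name="dense")

# Equivalent projection expressed as an einsum: "abc,cd->abd" maps
# [batch, seq, hidden] x [hidden, hidden] -> [batch, seq, hidden].
einsum_dense = tf.keras.layers.experimental.EinsumDense(
    "abc,cd->abd", output_shape=(None, hidden_size), bias_axes="d", name="einsum_dense"
)

x = tf.random.uniform((2, 16, hidden_size))
print(dense(x).shape, einsum_dense(x).shape)  # both (2, 16, 768)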

Tuples -> tf.TensorArray

Why is that change necessary? Does it give better performance? Is it necessary for graph mode?

As you said, it is necessary to be able to use the output_hidden_states and output_attentions boolean parameters in graph mode. Otherwise we would have to disable them 100% of the time in graph mode, and it would be a pity to remove that feature. I don't think it gives better performance, but thanks to this we can have a variable output size. A second good reason to adopt this is serving: if a SavedModel has been created with output_attentions=True, the model will give you 12 outputs (one for each element of the tuple) instead of just one as is the case with tf.TensorArray. In the end it is not really a breaking change, as a tensor can be unpacked like a tuple; you can test this yourself simply by doing:

import tensorflow as tf
t = tf.constant([[1,2],[3,4]])
a, b = t
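
And, as a minimal sketch of the pattern (not the PR's actual code; the layer here is a stand-in), collecting per-layer outputs in a tf.TensorArray under AutoGraph and stacking them into a single tensor:

import tensorflow as tf

@tf.function
def collect_layer_outputs(x):
    num_layers = 3
    # Collect one tensor per "layer" in a TensorArray so the number of stored
    # outputs can vary in graph mode, then stack them into a single tensor.
    hidden_states = tf.TensorArray(tf.float32, size=num_layers)
    for i in tf.range(num_layers):
        x = x + 1.0  # stand-in for a real transformer layer
        hidden_states = hidden_states.write(i, x)
    return hidden_states.stack()  # shape: (num_layers, batch, hidden)

print(collect_layer_outputs(tf.zeros((2, 4))).shape)  # (3, 2, 4)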

Joining the nsp + mlm in one cls

This is cool as it's more similar to what we do in PyTorch! However, you're now handling the names of the weights directly in the modeling_tf_utils.py file. I can't imagine that scaling well as more models are converted to this implementation?

Exactly, everything will rely on the load_tf_weights() function in modeling_tf_utils.py. I didn't have major issues handling that case; it was quite simple, at least for BERT. Let's see how it goes for the others.

@sgugger (Collaborator) left a comment

Thanks for all your work on this! It looks great!
My main concern is that the tied weights between the embeddings and the decoder in the BERT layer seem to have disappeared (and, if that's indeed the case, that no test is failing).

super().__setitem__(name, value)
super().__setattr__(name, value)
if is_tf_available() and isinstance(value, tf.Tensor):
if name not in ("hidden_states", "attentions") or (value.shape[0] != 0 and value.shape[0] is not None):
Collaborator

Could you add a comment for that test? I find it hard to understand (from what I see, it catches empty tensors and does not set them). Also, why the specific test for the keys "hidden_states" and "attentions"? Can they be tensors?

@jplu (Contributor, Author) Oct 13, 2020

Basically, in graph mode all the tensors have an unknown first dim (=None), so even the last_hidden_states key would be ignored if the first part of the or were not there. So in the end, only the tensors whose key name is different from hidden_states or attentions (the keys that can be dropped or not) and whose first dim is neither 0 nor None will be set. Otherwise they will be ignored.

I will add a comment.
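
To illustrate the graph-mode behavior (a hypothetical standalone example, unrelated to the PR's code):

import tensorflow as tf

# With a relaxed input signature, the static batch dimension is None at trace
# time, so a check like `tensor.shape[0] is not None` filters out every tensor
# in graph mode.
@tf.function(input_signature=[tf.TensorSpec(shape=(None, None), dtype=tf.int32)])
def show_static_shape(x):
    print("static first dim:", x.shape[0])  # prints None during tracing
    return x

show_static_shape(tf.constant([[1, 2], [3, 4]]))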

Collaborator

Just looking at the comment, there is a mistake in the behavior: all keys can be optional, not just "hidden_states" or "attentions".

@jplu (Contributor, Author) Oct 15, 2020

What are the cases where keys other than loss, attentions, hidden_states, past_key_values, or use_cache are None?

Collaborator

use_cache is in no ModelOutput that I know of.
If you look at modeling_tf_outputs and modeling_outputs (for the output types currently missing in TF because the corresponding models have not been ported), you can see all the keys that can appear in the various modeling outputs, and apart from the logits, they're pretty much all optional. The Seq2Seq models have a lot, for instance.

Collaborator

This should be treated the same way None values are currently treated: not set in the inner dict. I think that's exactly what you're doing here, so this is good. It's the part of the test that looks at the hidden_states and attentions keys that I don't see as needed.

Note that you need the same test in the init, at the line if v is not None toward the end. You should not have the test a second time below in __setattr__, however.

Contributor (Author)

Something like this?

class ModelOutput(OrderedDict):
    def __post_init__(self):
        .....
        if other_fields_are_none and not is_tensor(first_field):
            ....
        else:
            for field in class_fields:
                v = getattr(self, field.name)
                if v is not None:
                    if is_tf_available() and isinstance(v, tf.Tensor):
                        if v.shape[0] != 0 and v.shape[0] is not None:
                            self[field.name] = v
                    else:
                        self[field.name] = v
    ......
    def __setattr__(self, name, value):
        if name in self.keys() and value is not None:
            # Don't call self.__setitem__ to avoid recursion errors
            if is_tf_available() and isinstance(value, tf.Tensor):
                if value.shape[0] != 0 and value.shape[0] is not None:
                    super().__setitem__(name, value)
            else:
                super().__setitem__(name, value)
        if is_tf_available() and isinstance(value, tf.Tensor):
            if value.shape[0] != 0 and value.shape[0] is not None:
                super().__setattr__(name, value)
        else:
            super().__setattr__(name, value)

Collaborator

Yes, I'd just remove the last 5 lines and always leave super().__setattr__(name, value): we always set the attribute but only register it in the inner dict if it's not None or an empty tensor.

Contributor (Author)

Ok, then it doesn't work. I get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_utils.py", line 825, in from_pretrained
    model(model.dummy_inputs, training=False)  # build the network with dummy inputs
  File "C:\Users\snake\miniconda3\envs\transformers\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_bert.py", line 958, in call
    outputs = self.bert(
  File "C:\Users\snake\miniconda3\envs\transformers\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_bert.py", line 761, in call
    sequence_output = encoder_outputs[0]
  File "c:\users\snake\dev\transformers\src\transformers\file_utils.py", line 1165, in __getitem__
    return self.to_tuple()[k]
IndexError: tuple index out of range

The reason is that in graph mode all the tensors have their first dim set to None, so no values are added at all; this is why I had added some exceptions on the key names 😢 I will try to find another solution.

Contributor (Author)

If you want to see for yourself more exactly what the issue is, run the following piece of code:

from transformers import TFBertModel
import tensorflow as tf

model = tf.function(TFBertModel.from_pretrained("bert-base-cased"))
res = model(tf.constant([[1,2,3,4]]), output_attentions=tf.constant(True))

In res you should have attentions=<a tensor> and hidden_states=None. Now with:

from transformers import TFBertModel
import tensorflow as tf

model = tf.function(TFBertModel.from_pretrained("bert-base-cased"))
res = model(tf.constant([[1,2,3,4]]))

In res you should have attentions=None and hidden_states=None.

@@ -206,7 +224,6 @@ def call(self, hidden_states, input_tensor, training=False):
return hidden_states


# Copied from transformers.modeling_tf_bert.TFBertLayer with Bert->Electra
Collaborator

Ultimately, we'll need those back ;-)

Contributor (Author)

Yes, no worries :)

@jplu (Contributor, Author) commented Oct 13, 2020

Ok @sgugger I should have addressed all your comments.

self.bert = TFBertMainLayer(config, name="bert")
self.nsp = TFBertNSPHead(config, name="nsp___cls")
self.bert = TFBertMainLayer(config=config, name="bert")
self.cls = TFBertOnlyNSPHead(config=config, name="cls")
Contributor

Renaming weight parameters should be avoided!!! Changing self.nsp to self.cls means that all models trained with transformers < 4.0 cannot be used with transformers 4.0. The renaming is definitely not worth such a HUGE breaking change.

Contributor

Or is there maybe something I don't see here? From what I understand, changing the name of a parameter forces all users with "old" BERT checkpoints to manually remap their weights to the new naming -> this should be avoided at all costs, I think.

Contributor

Okay, @jplu just showed me the regex in the load_tf_weights function that makes this backwards compatible. I'm still not sure that's the way to go, given that adding regexes for all models could explode the complexity and would make such a super important function as load_tf_weights very hard to read. I think I'd rather vote against this, IMO more "cosmetic", name change. (cc @LysandreJik @sgugger)

Contributor (Author)

Let's see how many naming exceptions we find between the PT and TF models; if there aren't many, it should be largely manageable, and sharing the same names helps PT users quite a lot in understanding the TF models and vice versa. Otherwise, if there are way too many changes, we can revert this.

@jplu (Contributor, Author) commented Oct 15, 2020

Ok @patrickvonplaten @sgugger @LysandreJik I should have made the changes you proposed. May I proceed the same way for the other models, or do you want me to make other updates?

@LysandreJik (Member) left a comment

This all looks good to me except two issues:

  • The issue about ModelOutputs that you're discussing with Sylvain
  • The issue with layer renaming, which looks scary to me regarding backwards (and, to a certain extent, forward) compatibility.

Comment on lines 547 to 596
return_dict=None,
return_dict=True,
Member

This prevents users from changing this in the config

Comment on lines 366 to 369
for layer_name in saved_h5_model_layers_name:
    name = layer_name.replace("nsp___", "").replace("mlm___", "")

    renamed_saved_h5_model_layers_names.add(name)
Member

The more I think about this, the more I think this is unmaintainable. This is not something that should be handled directly in modeling_tf_utils.py, imo. If this part isn't necessary for better performance, I would vote to revert it. What is the main reason you wanted to make this change?

@jplu (Contributor, Author) Oct 16, 2020

It can become unmaintainable, but let's first see if the other models need changes; if they don't, I don't see a problem with keeping two exceptions.

Anyway, reverting will be easy: it will be the same code but without the two .replace() calls.

The main reason I wanted to do this is to remove all these ugly names and to have something more understandable for people coming from PT.
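
As a sketch of the kind of backward-compatible mapping being discussed (not the PR's actual implementation; the helper name is hypothetical):

import re

# Old H5 layer names such as "nsp___cls" or "mlm___cls" are normalized to the
# new "cls" naming when a checkpoint is loaded.
LEGACY_PREFIXES = re.compile(r"(nsp|mlm)___")

def normalize_layer_name(layer_name: str) -> str:
    return LEGACY_PREFIXES.sub("", layer_name)

print(normalize_layer_name("nsp___cls"))  # -> "cls"
print(normalize_layer_name("mlm___cls"))  # -> "cls"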

Contributor

I also agree with @Lysandre here.

  1. Changing weight parameter names is super dangerous IMO; all kinds of problems can come up here. E.g. how do we deal with people training a model with the new weight names of transformers v4.0 but, for some reason, having to deploy it in v3.4? We had all kinds of issues when we added the position_ids as buffered weights in PyTorch, with people using different "model versions" and library versions.
  2. It's just a cosmetic name change at the cost of adding silent string renaming in the load weights function. I don't think it's worth it here.
  3. This PR is getting huge already and will be hard to review -> can we maybe leave the name changes out of this PR and discuss them in a new one?

@jplu (Contributor, Author) Nov 4, 2020

  1. This new way is 100% backward compatible.

  2. It's not only cosmetic, but also about naming and a better layer design.

  3. This is the tiniest part of this PR.

In case it is not clear enough (my bad for this): the load_weights method will not change at all, only the occurrences of .replace() will be removed; all the rest is necessary.

@jplu (Contributor, Author) commented Oct 16, 2020

I gave up on filtering out the empty tensors from the output; it was too complex to implement. So now we will always have 3 outputs, corresponding to attentions, hidden_states and logits. But if output_attentions or output_hidden_states equals False, they will be empty tensors (first dim equal to 0).
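
A tiny illustration of that convention (hypothetical, not the PR's code): disabled optional outputs become empty tensors rather than None, so the output structure stays identical in eager and graph mode.

import tensorflow as tf

attentions = tf.zeros((0,))      # stands for output_attentions=False
hidden_states = tf.zeros((0,))   # stands for output_hidden_states=False

def is_disabled(t):
    # An empty first dimension marks an output that was not requested.
    return t.shape[0] == 0

print(is_disabled(attentions), is_disabled(hidden_states))  # True True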

@jplu jplu force-pushed the tf-bert-pretraining branch 2 times, most recently from df2cb8b to e801fcf Compare October 22, 2020 15:38
@jplu jplu force-pushed the tf-bert-pretraining branch 3 times, most recently from 4cb7a1c to 60411b3 Compare October 30, 2020 19:41
@jplu (Contributor, Author) commented Oct 30, 2020

Ok, now BERT looks exactly as I expected: properly optimized + cleaner code base + full compliance with AutoGraph. The next step is to apply the same changes to the other models.

@jplu jplu force-pushed the tf-bert-pretraining branch from a6fa973 to 7b5da2e Compare November 11, 2020 14:48
@LysandreJik (Member)

Once you're happy with the state of this PR, could you close it and open a new one explaining the exact changes? This is a very big thread that takes a while to load, and it's hard to understand what state this PR is currently in. Thanks!

@jplu (Contributor, Author) commented Nov 16, 2020

You are totally right, and that was my plan. Since last week I've been shrinking this PR to keep only the optimization part and putting the other improvements into separate PRs. Once done, I will reopen a clean one 👍

@LysandreJik (Member)

Fantastic! Thanks @jplu!

@jplu jplu closed this Feb 8, 2021
@jplu jplu deleted the tf-bert-pretraining branch February 8, 2021 13:32