
New TF model design #7753


Closed · wants to merge 78 commits

Conversation

@jplu (Contributor) commented Oct 13, 2020

What does this PR do?

This PR aims to improve the TensorFlow code base of transformers. For the sake of simplicity the changes are made only on the BERT model, but the new features can be applied to all the others. Alongside some bug fixes, the PR brings the following main new features:

  1. It is now possible to train an LM model from scratch (see this notebook as an example).
  2. The default outputs of the models are now dictionaries. Tuples can only be returned in eager execution; otherwise a warning message is displayed saying that only dictionaries are allowed in graph mode (see the usage sketch after this list). This update fixes two issues: one in graph mode, where a layer cannot return outputs of varying size, and another where an output cannot have a None value.
  3. Better input handling. The inputs of each model and of the main layer are now parsed with a single generic function, bringing more robust parsing and better error handling in case of wrong input. This fixes an issue when the input was a list of symbolic inputs (i.e. tf.keras.layers.Input).
  4. TensorFlow models now look much more similar to their PyTorch counterparts, making it easier for users to switch from a PyTorch model to its TensorFlow implementation and vice versa.
  5. Old and new model implementations can coexist in the library, making this new implementation 100% backward compatible, tests included.
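
A minimal usage sketch of point 2, assuming the standard TFBertModel API (the checkpoint name, token ids and shapes below are illustrative only):

import tensorflow as tf
from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-cased")
input_ids = tf.constant([[101, 7592, 1010, 2088, 102]])  # illustrative token ids

outputs = model(input_ids)                            # dictionary-like ModelOutput by default
print(outputs.last_hidden_state.shape)                # (1, 5, 768)

tuple_outputs = model(input_ids, return_dict=False)   # tuples only allowed in eager mode
print(type(tuple_outputs))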

@jplu jplu marked this pull request as draft October 13, 2020 07:12
@LysandreJik (Member) left a comment

This is a great PR!! Lots of welcome improvements, and it will make it easier to develop TF models!

Not reviewing any doc or individual code since it's a draft PR, just the behavior.

I have a few questions:

From embedding building to tf.keras.layers.Embedding

The word embeddings currently rely on tf.name_scope("word_embeddings"): [...] self.add_weight [...], and you updated that to use tf.keras.layers.Embedding. That's cool, but now you're not passing these weights around to models with LM heads for the decoders to use, while the input and output embeddings should be tied in most ForMaskedLM and WithLMHead models.

  • Would we now be relying on the LM head weights being in the checkpoint?

Dense -> DenseEinsum

  • Does the change from tf.keras.layers.Dense to tf.keras.layers.experimental.EinsumDense imply a breaking change or some magic that we must do to ensure that the weights get correctly loaded?
  • Is that change the main one that unlocks better performance?

Tuples -> tf.TensorArray

You're stacking the tensor array in the model outputs, but that returns stacked tensors instead of the previous tuples, which is a big breaking change.

  • Why is that change necessary? Does it give better performance? Is it necessary for graph mode?

Joining the nsp + mlm in one cls

  • This is cool as it's more similar to what we do in PyTorch! However, you're now handling the names of the weights directly in the modeling_tf_utils.py file. I can't imagine that scaling well as more models are converted to this implementation?

@jplu (Contributor, Author) commented Oct 13, 2020

Thanks for your comments! I will try to answer as clearly as possible.

From embedding building to tf.keras.layers.Embedding

Would we now be relying on the LM head weights being in the checkpoint?

Exactly. Another difference is that the word_embeddings weights are initialized at model instantiation instead of at model building (after a first call). The weights are saved in the checkpoints anyway and shared with the layers that need access to them. You can also see that with the resize_token_embeddings methods. The only test with a decoder we have for BERT is create_and_check_bert_lm_head, and it passes. Do you see any other test we could apply to check that the sharing part works as expected?
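
For illustration, a minimal sketch (not the PR's actual code; the TiedLMHead class and its names are hypothetical) of how an LM head can reuse the weights of a tf.keras.layers.Embedding so the input and output embeddings stay tied:

import tensorflow as tf

class TiedLMHead(tf.keras.layers.Layer):
    """Hypothetical LM head that reuses the input embedding matrix as its decoder."""

    def __init__(self, embedding_layer: tf.keras.layers.Embedding, **kwargs):
        super().__init__(**kwargs)
        self.embedding_layer = embedding_layer  # shared reference, not a copy

    def build(self, input_shape):
        self.bias = self.add_weight(
            "bias", shape=(self.embedding_layer.input_dim,), initializer="zeros"
        )
        super().build(input_shape)

    def call(self, hidden_states):
        shape = tf.shape(hidden_states)
        x = tf.reshape(hidden_states, (-1, self.embedding_layer.output_dim))
        # Project back onto the vocabulary with the transposed embedding matrix.
        logits = tf.matmul(x, self.embedding_layer.embeddings, transpose_b=True)
        return tf.reshape(logits, (shape[0], shape[1], -1)) + self.bias

embeddings = tf.keras.layers.Embedding(30522, 768)
_ = embeddings(tf.zeros((1, 1), dtype=tf.int32))  # build the shared weights
lm_head = TiedLMHead(embeddings)
print(lm_head(tf.random.uniform((2, 16, 768))).shape)  # (2, 16, 30522)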

Dense -> DenseEinsum

Does the change from tf.keras.layers.Dense to tf.keras.layers.experimental.EinsumDense imply a breaking change or some magic that we must do to ensure that the weights get correctly loaded?

No breaking change at all. The old model format can be loaded into the new one and vice versa. And the magic is as simple as a single reshape call, because both must have compatible shapes:

if K.int_shape(symbolic_weight) != saved_weight_value.shape:
    try:
        array = np.reshape(saved_weight_value, K.int_shape(symbolic_weight))
    except AssertionError as e:
        e.args += (K.int_shape(symbolic_weight), saved_weight_value.shape)
        raise e
else:
    array = saved_weight_value

And this is already integrated in the current release https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_tf_utils.py#L290

Is that change the main one that unlocks better performance?

Exactly! That change alone is what unlocks the serving performance.
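
For reference, a small sketch (layer names and shapes are illustrative) showing a Dense projection and its EinsumDense equivalent over a [batch, seq, hidden] input:

import tensorflow as tf

hidden_size = 768

# Standard dense projection over the last axis.
dense = tf.keras.layers.Dense(hidden_size, name="dense")

# Equivalent projection expressed as an einsum: "abc,cd->abd" maps
# [batch, seq, hidden] x [hidden, hidden] -> [batch, seq, hidden].
einsum_dense = tf.keras.layers.experimental.EinsumDense(
    "abc,cd->abd", output_shape=(None, hidden_size), bias_axes="d", name="einsum_dense"
)

x = tf.random.uniform((2, 16, hidden_size))
print(dense(x).shape, einsum_dense(x).shape)  # both (2, 16, 768)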

Tuples -> tf.TensorArray

Why is that change necessary? Does it give better performance? Is it necessary for graph mode?

As you said, it is necessary to be able to use the output_hidden_states and output_attentions boolean parameters in graph mode. Otherwise we would have to disable them 100% of the time in graph mode, and it would be a pity to remove that feature. I don't think it gives better performance, but thanks to this we can have a variable output size. A second good reason to adopt this is serving: if a SavedModel has been created with output_attentions=True, the model will give you 12 outputs (one for each element of the tuple) instead of just one as is the case with tf.TensorArray. In the end it is not really a breaking change, as a tensor can be unpacked like a tuple; you can test this yourself simply by doing:

import tensorflow as tf
t = tf.constant([[1,2],[3,4]])
a, b = t
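
And, as a minimal sketch of the pattern (not the PR's actual code; the layer here is a stand-in), collecting per-layer outputs in a tf.TensorArray under AutoGraph and stacking them into a single tensor:

import tensorflow as tf

@tf.function
def collect_layer_outputs(x):
    num_layers = 3
    # Collect one tensor per "layer" in a TensorArray so the number of stored
    # outputs can vary in graph mode, then stack them into a single tensor.
    hidden_states = tf.TensorArray(tf.float32, size=num_layers)
    for i in tf.range(num_layers):
        x = x + 1.0  # stand-in for a real transformer layer
        hidden_states = hidden_states.write(i, x)
    return hidden_states.stack()  # shape: (num_layers, batch, hidden)

print(collect_layer_outputs(tf.zeros((2, 4))).shape)  # (3, 2, 4)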

Joining the nsp + mlm in one cls

This is cool as it's more similar to what we do in PyTorch! However, you're now handling the names of the weights directly in the modeling_tf_utils.py file. I can't imagine that scaling well as more models are converted to this implementation?

Exactly, everything will rely on the load_tf_weights() function in modeling_tf_utils.py. I didn't have major issues handling that case; it was quite simple, at least for BERT. Let's see how it goes for the others.

@sgugger (Collaborator) left a comment

Thanks for all your work on this! It looks great!
My main concern is that the tied weights between the embeddings and the decoder in the BERT layer seem to have disappeared (and, if that's indeed the case, that no test is failing).

super().__setitem__(name, value)
super().__setattr__(name, value)
if is_tf_available() and isinstance(value, tf.Tensor):
if name not in ("hidden_states", "attentions") or (value.shape[0] != 0 and value.shape[0] is not None):
Collaborator

Could you add a comment for that test? I find it hard to understand (from what I see, it catches empty tensors and does not set them). Also, why the specific test for the keys "hidden_states" and "attentions"? Can they be tensors?

@jplu (Contributor, Author) Oct 13, 2020

Basically, in graph mode all the tensors have an unknown first dim (=None), so even the last_hidden_states key would be ignored if the first part of the or were not there. So in the end, only the tensors whose key name is different from hidden_states or attentions (the keys that can be dropped or not) and whose first dim is neither 0 nor None will be set. Otherwise they will be ignored.

I will add a comment.
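
To illustrate the graph-mode behavior (a hypothetical standalone example, unrelated to the PR's code):

import tensorflow as tf

# With a relaxed input signature, the static batch dimension is None at trace
# time, so a check like `tensor.shape[0] is not None` filters out every tensor
# in graph mode.
@tf.function(input_signature=[tf.TensorSpec(shape=(None, None), dtype=tf.int32)])
def show_static_shape(x):
    print("static first dim:", x.shape[0])  # prints None during tracing
    return x

show_static_shape(tf.constant([[1, 2], [3, 4]]))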

Collaborator

Just looking at the comment, there is a mistake in the behavior: all keys can be optional, not just "hidden_states" or "attentions".

@jplu (Contributor, Author) Oct 15, 2020

What are the cases where keys other than loss, attentions, hidden_states, past_key_values, or use_cache are None?

Collaborator

use_cache is in no ModelOutput that I know of.
If you look at modeling_tf_outputs and modeling_outputs (for the output types currently missing in TF because the corresponding models have not been ported), you can see all the keys that can appear in the various modeling outputs, and apart from the logits, they're pretty much all optional. The Seq2Seq models have a lot, for instance.

Collaborator

This should be treated the same way None values are currently treated: not set in the inner dict. I think that's exactly what you're doing here, so this is good. It's the part of the test that looks at the hidden_states and attentions keys that I don't see as needed.

Note that you need the same test in the init, at the line if v is not None toward the end. You should not have the test a second time below in __setattr__, however.

Contributor (Author)

Something like this?

class ModelOutput(OrderedDict):
    def __post_init__(self):
        .....
        if other_fields_are_none and not is_tensor(first_field):
            ....
        else:
            for field in class_fields:
                v = getattr(self, field.name)
                if v is not None:
                    if is_tf_available() and isinstance(v, tf.Tensor):
                        if v.shape[0] != 0 and v.shape[0] is not None:
                            self[field.name] = v
                    else:
                        self[field.name] = v
    ......
    def __setattr__(self, name, value):
        if name in self.keys() and value is not None:
            # Don't call self.__setitem__ to avoid recursion errors
            if is_tf_available() and isinstance(value, tf.Tensor):
                if value.shape[0] != 0 and value.shape[0] is not None:
                    super().__setitem__(name, value)
            else:
                super().__setitem__(name, value)
        if is_tf_available() and isinstance(value, tf.Tensor):
            if value.shape[0] != 0 and value.shape[0] is not None:
                super().__setattr__(name, value)
        else:
            super().__setattr__(name, value)

Collaborator

Yes, I'd just remove the last 5 lines and always leave super().__setattr__(name, value): we always set the attribute but only register it in the inner dict if it's not None or an empty tensor.

Contributor (Author)

Ok, then it doesn't work. I get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_utils.py", line 825, in from_pretrained
    model(model.dummy_inputs, training=False)  # build the network with dummy inputs
  File "C:\Users\snake\miniconda3\envs\transformers\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_bert.py", line 958, in call
    outputs = self.bert(
  File "C:\Users\snake\miniconda3\envs\transformers\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_bert.py", line 761, in call
    sequence_output = encoder_outputs[0]
  File "c:\users\snake\dev\transformers\src\transformers\file_utils.py", line 1165, in __getitem__
    return self.to_tuple()[k]
IndexError: tuple index out of range

The reason is that in graph mode all the tensors have their first dim set to None, so no values are added at all; this is why I had added some exceptions on the key names 😢 I will try to find another solution.

Contributor (Author)

If you want to see for yourself more exactly what the issue is, run the following piece of code:

from transformers import TFBertModel
import tensorflow as tf

model = tf.function(TFBertModel.from_pretrained("bert-base-cased"))
res = model(tf.constant([[1,2,3,4]]), output_attentions=tf.constant(True))

In res you should have attentions=<a tensor> and hidden_states=None. Now with:

from transformers import TFBertModel
import tensorflow as tf

model = tf.function(TFBertModel.from_pretrained("bert-base-cased"))
res = model(tf.constant([[1,2,3,4]]))

In res you should have attentions=None and hidden_states=None.

@@ -206,7 +224,6 @@ def call(self, hidden_states, input_tensor, training=False):
return hidden_states


# Copied from transformers.modeling_tf_bert.TFBertLayer with Bert->Electra
Collaborator

Ultimately, we'll need those back ;-)

Contributor (Author)

Yes, no worries :)

@jplu (Contributor, Author) commented Oct 13, 2020

Ok @sgugger I should have addressed all your comments.

self.bert = TFBertMainLayer(config, name="bert")
self.nsp = TFBertNSPHead(config, name="nsp___cls")
self.bert = TFBertMainLayer(config=config, name="bert")
self.cls = TFBertOnlyNSPHead(config=config, name="cls")
Contributor

Renaming weight parameters should be avoided!!! Changing self.nsp to self.cls means that all models trained with transformers < 4.0 cannot be used with transformers 4.0. The renaming is definitely not worth such a HUGE breaking change.

Contributor

Or is there maybe something I don't see here? From what I understand, changing the name of a parameter forces all users with "old" BERT checkpoints to manually remap their weights to the new naming -> this should be avoided at all costs, I think.

Contributor

Okay, @jplu just showed me the regex in the load_tf_weights function that makes this backwards compatible. I'm still not sure that's the way to go, given that adding regexes for all models could explode the complexity and would make such a super important function as load_tf_weights very hard to read. I think I'd rather vote against this, IMO more "cosmetic", name change. (cc @LysandreJik @sgugger)

Contributor (Author)

Let's see how many naming exceptions we find between the PT and TF models; if there aren't many, it should be largely manageable, and sharing the same names helps PT users quite a lot in understanding the TF models and vice versa. Otherwise, if there are way too many changes, we can revert this.

@jplu (Contributor, Author) commented Oct 15, 2020

Ok @patrickvonplaten @sgugger @LysandreJik I should have made the changes you proposed. May I proceed the same way for the other models, or do you want me to make other updates?

@LysandreJik (Member) left a comment

This all looks good to me except two issues:

  • The issue about ModelOutputs that you're discussing with Sylvain
  • The issue with layer renaming, which looks scary to me regarding backwards (and, to a certain extent, forward) compatibility.

Comment on lines 547 to 596
return_dict=None,
return_dict=True,
Member

This prevents users from changing this in the config

Comment on lines 366 to 369
for layer_name in saved_h5_model_layers_name:
    name = layer_name.replace("nsp___", "").replace("mlm___", "")

    renamed_saved_h5_model_layers_names.add(name)
Member

The more I think about this, the more I think this is unmaintainable. This is not something that should be handled directly in modeling_tf_utils.py, imo. If this part isn't necessary for better performance, I would vote to revert it. What is the main reason you wanted to make this change?

@jplu (Contributor, Author) Oct 16, 2020

It can become unmaintainable, but let's first see if the other models need changes; if they don't, I don't see a problem with keeping two exceptions.

Anyway, reverting will be easy: it will be the same code but without the two .replace() calls.

The main reason I wanted to do this is to remove all these ugly names and to have something more understandable for people coming from PT.
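
As a sketch of the kind of backward-compatible mapping being discussed (not the PR's actual implementation; the helper name is hypothetical):

import re

# Old H5 layer names such as "nsp___cls" or "mlm___cls" are normalized to the
# new "cls" naming when a checkpoint is loaded.
LEGACY_PREFIXES = re.compile(r"(nsp|mlm)___")

def normalize_layer_name(layer_name: str) -> str:
    return LEGACY_PREFIXES.sub("", layer_name)

print(normalize_layer_name("nsp___cls"))  # -> "cls"
print(normalize_layer_name("mlm___cls"))  # -> "cls"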

Contributor

I also agree with @Lysandre here.

  1. Changing weight parameter names is super dangerous IMO; all kinds of problems can come up here. E.g. how do we deal with people training a model with the new weight names of transformers v4.0 but, for some reason, having to deploy it in v3.4? We had all kinds of issues when we added the position_ids as buffered weights in PyTorch, with people using different "model versions" and library versions.
  2. It's just a cosmetic name change at the cost of adding silent string renaming in the load weights function. I don't think it's worth it here.
  3. This PR is getting huge already and will be hard to review -> can we maybe leave the name changes out of this PR and discuss them in a new one?

@jplu (Contributor, Author) Nov 4, 2020

  1. This new way is 100% backward compatible.

  2. It's not only cosmetic, but also about naming and a better layer design.

  3. This is the tiniest part of this PR.

In case it is not clear enough (my bad for this): the load_weights method will not change at all, only the occurrences of .replace() will be removed; all the rest is necessary.

@jplu (Contributor, Author) commented Oct 16, 2020

I gave up on filtering out the empty tensors from the output; it was too complex to implement. So now we will always have 3 outputs, corresponding to attentions, hidden_states and logits. But if output_attentions or output_hidden_states equals False, they will be empty tensors (first dim equal to 0).
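
A tiny illustration of that convention (hypothetical, not the PR's code): disabled optional outputs become empty tensors rather than None, so the output structure stays identical in eager and graph mode.

import tensorflow as tf

attentions = tf.zeros((0,))      # stands for output_attentions=False
hidden_states = tf.zeros((0,))   # stands for output_hidden_states=False

def is_disabled(t):
    # An empty first dimension marks an output that was not requested.
    return t.shape[0] == 0

print(is_disabled(attentions), is_disabled(hidden_states))  # True True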

@jplu jplu force-pushed the tf-bert-pretraining branch 2 times, most recently from df2cb8b to e801fcf Compare October 22, 2020 15:38
@jplu jplu force-pushed the tf-bert-pretraining branch 3 times, most recently from 4cb7a1c to 60411b3 Compare October 30, 2020 19:41
@jplu (Contributor, Author) commented Oct 30, 2020

Ok, now BERT looks exactly as I expected: properly optimized + cleaner code base + full compliance with AutoGraph. The next step is to apply the same changes to the other models.

@jplu jplu force-pushed the tf-bert-pretraining branch from a6fa973 to 7b5da2e Compare November 11, 2020 14:48
@LysandreJik (Member)

Once you're happy with the state of this PR, could you close it and open a new one explaining the exact changes? This is a very big thread that takes a while to load, and it's hard to understand what state this PR is currently in. Thanks!

@jplu (Contributor, Author) commented Nov 16, 2020

You are totally right, and that was my plan. Since last week I've been shrinking this PR to keep only the optimization part and putting the other improvements into separate PRs. Once done, I will reopen a clean one 👍

@LysandreJik (Member)

Fantastic! Thanks @jplu!

@jplu jplu closed this Feb 8, 2021
@jplu jplu deleted the tf-bert-pretraining branch February 8, 2021 13:32