New TF model design #7753

Conversation
This is a great PR!! Lots of welcome improvements, and it will make it easier to develop TF models!
I'm not reviewing any docs or individual code since it's a draft PR, just the behavior.
I have a few questions:
From embedding building to tf.keras.layers.Embedding
The word embeddings currently rely on `tf.name_scope('word_embeddings'): [...] self.add_weight [...]`, and you updated that to use `tf.keras.layers.Embedding`. That's cool, but now you're not passing these weights around to the models with LM heads to be used by the decoders, while the input and output embeddings should be tied in most ForMaskedLM and WithLMHead models (see the tying sketch below).
- Would we now be relying on the LM head weights being in the checkpoint?
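For reference, here is a minimal sketch of the kind of input/output embedding tying being discussed; `TiedLMHead` is a hypothetical name and the attribute access is illustrative, not the PR's actual implementation:

```python
import tensorflow as tf

# Minimal sketch: an LM head that reuses the encoder's input embedding
# matrix instead of owning its own output projection kernel. Assumes the
# hidden size equals the embedding dimension, as in BERT.
class TiedLMHead(tf.keras.layers.Layer):
    def __init__(self, input_embeddings: tf.keras.layers.Embedding, **kwargs):
        super().__init__(**kwargs)
        self.input_embeddings = input_embeddings  # shared with the encoder

    def call(self, hidden_states):
        # [batch, seq, hidden] x [vocab, hidden]^T -> [batch, seq, vocab]
        return tf.matmul(
            hidden_states, self.input_embeddings.embeddings, transpose_b=True
        )
```

With this structure, updating the output projection also updates the input embedding matrix, which is exactly what goes missing if the Embedding layer's weights are no longer passed to the LM head.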
Dense -> DenseEinsum
- Does the change from tf.keras.layers.Dense to tf.keras.layers.experimental.EinsumDense imply a breaking change, or some magic that we must do to ensure that the weights get correctly loaded? (See the comparison sketch below.)
- Is that change the main one that unlocks better performance?
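To make the question concrete, here is a small illustrative sketch (not the PR's code) showing that the two layers compute the same projection; the checkpoint-loading concern is whether their saved variables line up:

```python
import tensorflow as tf

# Both layers project [batch, seq, 768] -> [batch, seq, 768]; the question
# is whether kernels saved under one layer type load into the other.
dense = tf.keras.layers.Dense(768)
einsum_dense = tf.keras.layers.experimental.EinsumDense(
    equation="abc,cd->abd",    # a=batch, b=seq, c=in_features, d=out_features
    output_shape=(None, 768),  # None = dynamic sequence length
    bias_axes="d",
)

x = tf.random.uniform((2, 16, 768))
print(dense(x).shape, einsum_dense(x).shape)  # (2, 16, 768) (2, 16, 768)
```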
Tuples -> tf.TensorArray
You're stacking the tensor array in the model outputs, but that returns stacked tensors instead of the previous tuples, which is a big breaking change (sketched below).
- Why is that change necessary? Does it give better performance? Is it necessary for graph mode?
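For illustration, a minimal sketch (assumed shapes, a stand-in for the real layers) of what stacking a tf.TensorArray returns compared to the previous tuple of per-layer tensors:

```python
import tensorflow as tf

num_layers, batch, seq, hidden = 4, 2, 8, 16
states = tf.TensorArray(tf.float32, size=num_layers)
h = tf.zeros((batch, seq, hidden))
for i in range(num_layers):
    h = h + 1.0  # stand-in for a real transformer layer
    states = states.write(i, h)  # write() returns the updated TensorArray

stacked = states.stack()  # one tensor of shape (4, 2, 8, 16)
# Previously the model returned a tuple of 4 tensors of shape (2, 8, 16),
# so code relying on tuple semantics (len(), isinstance, unpacking) breaks.
```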
Joining the nsp + mlm in one cls
- This is cool as it's more similar to what we do in PyTorch! However, you're now handling the names of the weights directly in the modeling_tf_utils.py file. I can't see that scaling as more models are converted to this implementation?
Thanks for your comments! I will try to answer as clearly as possible. From embedding building to [...]
Thanks for all your work on this! It looks great!
My main concern is that it seems to me that the tied weights between the embeddings and the decoder in the BERT layer have disappeared (and, if that's indeed the case, that there is no test failing).
src/transformers/file_utils.py
Outdated
```python
super().__setitem__(name, value)
super().__setattr__(name, value)
if is_tf_available() and isinstance(value, tf.Tensor):
    if name not in ("hidden_states", "attentions") or (value.shape[0] != 0 and value.shape[0] is not None):
```
Could you add a comment for that test? I find it hard to understand (it catches empty tensors and does not set them, from what I see). Also, why the specific test for the keys "hidden_states" and "attentions"? Can they be tensors?
Basically, in graph mode all the tensors have an unknown first dim (=None), so even the last_hidden_state key would be ignored if the first part of the or were not there. So in the end, only the tensors whose key name differs from hidden_states or attentions (the names of the keys that may or may not be dropped) and whose first dim is neither 0 nor None will be set; otherwise they will be ignored.
I will add a comment.
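To illustrate, a minimal standalone sketch of the graph-mode behavior described above (toy shapes, not transformers code):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=(None, 4), dtype=tf.float32)])
def first_dim(x):
    # Runs at trace time: the static first dimension is None in graph mode,
    # so a check like `x.shape[0] != 0` cannot detect an empty tensor.
    print("static first dim at trace time:", x.shape[0])  # prints None
    return x

first_dim(tf.zeros((2, 4)))  # traced here
first_dim(tf.zeros((0, 4)))  # an empty tensor passes the same static check
```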
Just looking at the comment, there is a mistake in the behavior: all keys can be optional, not just "hidden_states" or "attentions".
What are the cases where keys other than loss, attentions, hidden_states, past_key_values, or use_cache are None?
use_cache is in no ModelOutput that I know of.
If you look at modeling_tf_outputs and modeling_outputs (for the output types currently missing in TF because the corresponding models have not been ported), you can see all the keys that can appear in the various modeling outputs, and apart from the logits, they're pretty much all optional. The Seq2Seq models have a lot of them, for instance.
This should be treated the same way None values are currently treated: not set in the inner dict. I think that's exactly what you're doing here, so this is good. It's the part of the test that looks at the hidden_states and attentions keys that I don't see as needed.
Note that you need the same test in the init, at the line if v is not None toward the end. You should not have the test a second time below in the __setattr__, however.
Something like this?

```python
class ModelOutput(OrderedDict):
    def __post_init__(self):
        .....
        if other_fields_are_none and not is_tensor(first_field):
            ....
        else:
            for field in class_fields:
                v = getattr(self, field.name)
                if v is not None:
                    if is_tf_available() and isinstance(v, tf.Tensor):
                        if v.shape[0] != 0 and v.shape[0] is not None:
                            self[field.name] = v
                    else:
                        self[field.name] = v
        ......

    def __setattr__(self, name, value):
        if name in self.keys() and value is not None:
            # Don't call self.__setitem__ to avoid recursion errors
            if is_tf_available() and isinstance(value, tf.Tensor):
                if value.shape[0] != 0 and value.shape[0] is not None:
                    super().__setitem__(name, value)
            else:
                super().__setitem__(name, value)
        if is_tf_available() and isinstance(value, tf.Tensor):
            if value.shape[0] != 0 and value.shape[0] is not None:
                super().__setattr__(name, value)
        else:
            super().__setattr__(name, value)
```
Yes, I'd just remove the last 5 lines and always leave super().__setattr__(name, value): we always set the attribute but only register it in the inner dict if it's not None/an empty tensor.
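Concretely, applying that suggestion to the snippet above would give (same code, with the final conditional collapsed):

```python
def __setattr__(self, name, value):
    if name in self.keys() and value is not None:
        # Don't call self.__setitem__ to avoid recursion errors.
        # Register in the inner dict only for non-None values that are not
        # empty TF tensors.
        if is_tf_available() and isinstance(value, tf.Tensor):
            if value.shape[0] != 0 and value.shape[0] is not None:
                super().__setitem__(name, value)
        else:
            super().__setitem__(name, value)
    # Always set the attribute itself, whatever the value.
    super().__setattr__(name, value)
```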
Ok, then it doesn't work. I get an error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_utils.py", line 825, in from_pretrained
    model(model.dummy_inputs, training=False)  # build the network with dummy inputs
  File "C:\Users\snake\miniconda3\envs\transformers\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_bert.py", line 958, in call
    outputs = self.bert(
  File "C:\Users\snake\miniconda3\envs\transformers\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 985, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "c:\users\snake\dev\transformers\src\transformers\modeling_tf_bert.py", line 761, in call
    sequence_output = encoder_outputs[0]
  File "c:\users\snake\dev\transformers\src\transformers\file_utils.py", line 1165, in __getitem__
    return self.to_tuple()[k]
IndexError: tuple index out of range
```
The reason is that in graph mode all the tensors have their first dim set to None, so no more values are added; this is why I added some exceptions with the key names 😢 I will try to find another solution.
If you want to see for yourself more exactly what the issue is, run the following piece of code:

```python
from transformers import TFBertModel
import tensorflow as tf

model = tf.function(TFBertModel.from_pretrained("bert-base-cased"))
res = model(tf.constant([[1, 2, 3, 4]]), output_attentions=tf.constant(True))
```

In res you should have attentions=<a tensor> and hidden_states=None. Now with

```python
from transformers import TFBertModel
import tensorflow as tf

model = tf.function(TFBertModel.from_pretrained("bert-base-cased"))
res = model(tf.constant([[1, 2, 3, 4]]))
```

In res you should have attentions=None and hidden_states=None.
```diff
@@ -206,7 +224,6 @@ def call(self, hidden_states, input_tensor, training=False):
         return hidden_states

-# Copied from transformers.modeling_tf_bert.TFBertLayer with Bert->Electra
```
Ultimately, we'll need those back ;-)
Yes, no worries :)
Ok @sgugger, I should have addressed all your comments.
src/transformers/modeling_tf_bert.py
Outdated
```diff
-        self.bert = TFBertMainLayer(config, name="bert")
-        self.nsp = TFBertNSPHead(config, name="nsp___cls")
+        self.bert = TFBertMainLayer(config=config, name="bert")
+        self.cls = TFBertOnlyNSPHead(config=config, name="cls")
```
Weight parameter renaming should be avoided!!! Changing self.nsp to self.cls means that all models trained before 4.0 cannot be used with transformers 4.0. The renaming is definitely not worth such a HUGE breaking change.
Or is there maybe something I don't see here? From what I understand, changing the name of a parameter forces all users with "old" BERT checkpoints to manually remap their weights to the new naming -> this should be avoided at all costs, I think.
Okay, @jplu just showed me the regex in the load_tf_weights function that makes this backwards compatible. I'm still not sure that's the way to go, given that adding regexes for all models could explode the complexity and would make such a super important function as load_tf_weights very unreadable. I think I'd rather vote against this, IMO more "cosmetic", name change. (cc @LysandreJik @sgugger)
Let's see how many naming exceptions we find between the PT and TF models; if there aren't many, it should be largely manageable, as sharing the same names helps PT users understand the TF models and vice versa, I think. Otherwise, if there are way too many changes, we can revert this.
Ok @patrickvonplaten @sgugger @LysandreJik, I should have made the changes you proposed. May I proceed the same way for the other models? Or do you want me to make other updates?
This all looks good to me except for two issues:
- The issue about ModelOutputs that you're discussing with Sylvain
- The issue with layer renaming, which looks scary to me regarding backwards (and forward, to a certain extent) compatibility.
src/transformers/modeling_tf_bert.py
Outdated
```diff
-            return_dict=None,
+            return_dict=True,
```
This prevents users from changing this in the config
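For reference, the usual pattern that keeps the config in control is a fallback like this sketch (assuming the config exposes a return_dict attribute holding the default):

```python
# Inside the model's call(): fall back to the config default only when the
# caller did not pass return_dict explicitly, instead of hard-coding True.
return_dict = return_dict if return_dict is not None else self.config.return_dict
```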
```python
for layer_name in saved_h5_model_layers_name:
    name = layer_name.replace("nsp___", "").replace("mlm___", "")

    renamed_saved_h5_model_layers_names.add(name)
```
The more I think about this, the more I think it is unmaintainable. This is not something that should be handled directly in modeling_tf_utils.py, imo. If this part isn't necessary for better performance, I would vote to revert it. What is the main reason you wanted to make this change?
It can become unmaintainable, but let's first see if the other models need changes; if they don't, I don't see a problem with keeping two exceptions.
Anyway, reverting will be easy: it will be the same code but without the two .replace() calls.
The main reason I wanted to make this change is to remove all these ugly names and to have something more understandable for people coming from PT.
I also agree with @Lysandre here.
- Changing weight parameter names is super dangerous IMO; all kinds of problems can come up here. E.g., how do we deal with people training a model with the new weight names of transformers v4.0 but, for some reason, having to deploy the model in v3.4? We had all kinds of issues when we added the position_ids as buffered weights in PyTorch, with people using different "model versions" and library versions.
- It's just a cosmetic name change at the cost of having to add silent string renaming in the load-weights function. I don't think it's worth it here.
- This PR is getting huge already and will be hard to review -> can we maybe leave the name changes out of this PR and discuss them in a new one?
- This new way is 100% backward compatible.
- It is not only cosmetic but also about naming and a better layer design.
- This is the tiniest part of this PR.

In case it is not clear enough (my bad for this): the load_weights method will not change at all, only the occurrences of .replace() will be removed; all the rest is necessary.

I gave up on filtering the empty tensors out of the output; it was too complex to implement. So now we will always have 3 outputs, corresponding to [...]
Ok, now BERT looks exactly like I expected: properly optimized + a cleaner code base + full compliance with AutoGraph. The next step is to apply the same changes to the other models.
Once you're happy with the state of this PR, could you close it and open a new one explaining the exact changes? This is a very big thread that takes a while to load, and it's hard to understand the state this PR is currently in. Thanks!
You are totally right, and this was my plan. Since last week I have been shrinking this PR to keep only the optimization part and putting the other improvements into separate PRs. Once done, I will reopen a clean one 👍
Fantastic! Thanks @jplu!
What does this PR do?
This PR aims to improve the TensorFlow code base of transformers. For the sake of simplicity, the changes are made only on the BERT model, but the new features can be applied to all the others. Among some bug fixes, the PR brings the following main new features:
- [...] None value.
- [...] (tf.keras.layers.Input).