
Commit f307068

sgugger authored and EduardoPach committed
Document check copies (huggingface#25291)
* Document check copies better and add tests
* Include header in check for copies
* Manual fixes
* Try autofix
* Fixes
* Clean tests
* Finalize doc
* Remove debug print
* More fixes
1 parent 7b7192b commit f307068

File tree: 51 files changed (+380, -164 lines)


docs/source/en/add_new_model.md

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ own regarding how code should be written :-)
 1. The forward pass of your model should be fully written in the modeling file while being fully independent of other
    models in the library. If you want to reuse a block from another model, copy the code and paste it with a
    `# Copied from` comment on top (see [here](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/models/roberta/modeling_roberta.py#L160)
-   for a good example).
+   for a good example and [there](pr_checks#check-copies) for more documentation on Copied from).
 2. The code should be fully understandable, even by a non-native English speaker. This means you should pick
    descriptive variable names and avoid abbreviations. As an example, `activation` is preferred to `act`.
    One-letter variable names are strongly discouraged unless it's an index in a for loop.
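
For readers skimming the diff, here is what the `# Copied from` convention referenced above looks like in practice. This is a hypothetical excerpt (the `MyNewModel` names and the file they would live in are invented for illustration); the copied body mirrors `BertSelfOutput` from `modeling_bert.py`, and only the class name changes, which is what the replacement syntax documented in `pr_checks.md` below expresses.

```py
# Hypothetical excerpt from a new model's modeling_mynewmodel.py, reusing Bert's self-output block.
# The body is copied verbatim from BertSelfOutput; only the class name differs.
import torch
from torch import nn


# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->MyNewModel
class MyNewModelSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
```

With the comment in place, `utils/check_copies.py` can verify that the block never silently drifts from the Bert original.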

docs/source/en/pr_checks.md

Lines changed: 55 additions & 0 deletions
@@ -142,3 +142,58 @@ Additional checks concern PRs that add new models, mainly that:
 - All checkpoints used actually exist on the Hub
 
 -->
+
+### Check copies
+
+Since the Transformers library is very opinionated with respect to model code, and each model should be fully implemented in a single file without relying on other models, we have added a mechanism that checks whether a copy of the code of a layer of a given model stays consistent with the original. This way, when there is a bug fix, we can see all the other impacted models and choose to trickle down the modification or break the copy.
+
+<Tip>
+
+If a file is a full copy of another file, you should register it in the constant `FULL_COPIES` of `utils/check_copies.py`.
+
+</Tip>
+
+This mechanism relies on comments of the form `# Copied from xxx`. The `xxx` should contain the whole path to the class or function that is being copied below. For instance, `RobertaSelfOutput` is a direct copy of the `BertSelfOutput` class, so you can see [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289) that it has the comment:
+
+```py
+# Copied from transformers.models.bert.modeling_bert.BertSelfOutput
+```
+
+Note that instead of applying this to a whole class, you can apply it to the relevant methods that are copied. For instance, [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L598) you can see how `RobertaPreTrainedModel._init_weights` is copied from the same method in `BertPreTrainedModel` with the comment:
+
+```py
+# Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
+```
+
+Sometimes the copy is exactly the same except for names: for instance, in `RobertaAttention`, we use `RobertaSelfAttention` instead of `BertSelfAttention`, but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the following syntax: `Copied from xxx with foo->bar`. This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it is used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment:
+
+```py
+# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta
+```
+
+Note that there shouldn't be any spaces around the arrow (unless the space is part of the pattern to replace, of course).
+
+You can add several patterns separated by a comma. For instance, here `CamembertForMaskedLM` is a direct copy of `RobertaForMaskedLM` with two replacements: `Roberta` to `Camembert` and `ROBERTA` to `CAMEMBERT`. You can see [here](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/camembert/modeling_camembert.py#L929) that this is done with the comment:
+
+```py
+# Copied from transformers.models.roberta.modeling_roberta.RobertaForMaskedLM with Roberta->Camembert, ROBERTA->CAMEMBERT
+```
+
+If the order matters (because one of the replacements might conflict with a previous one), the replacements are executed from left to right.
+
+<Tip>
+
+If the replacements change the formatting (if you replace a short name with a very long name, for instance), the copy is checked after applying the auto-formatter.
+
+</Tip>
+
+When the patterns are just different casings of the same replacement (with uppercased and lowercased variants), another option is to add `all-casing`. [Here](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237) is an example in `MobileBertForSequenceClassification` with the comment:
+
+```py
+# Copied from transformers.models.bert.modeling_bert.BertForSequenceClassification with Bert->MobileBert all-casing
+```
+
+In this case, the code is copied from `BertForSequenceClassification` by replacing:
+- `Bert` by `MobileBert` (for instance when using `MobileBertModel` in the init)
+- `bert` by `mobilebert` (for instance when defining `self.mobilebert`)
+- `BERT` by `MOBILEBERT` (in the constant `MOBILEBERT_INPUTS_DOCSTRING`)
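
To make the replacement rules in the new section above concrete, here is a minimal sketch of how the `foo->bar` and `all-casing` substitutions behave. It is only an illustration of the described semantics, not the actual logic in `utils/check_copies.py`, and the helper name is invented.

```py
def apply_patterns(code: str, patterns: str, all_casing: bool = False) -> str:
    """Apply comma-separated `foo->bar` patterns to copied code, left to right.

    Illustrative sketch only: the real check lives in utils/check_copies.py and
    also re-runs the auto-formatter before comparing the copy to its source.
    """
    for pattern in patterns.split(","):
        source, target = [part.strip() for part in pattern.split("->")]
        replacements = [(source, target)]
        if all_casing:
            # `all-casing` also covers the lowercased and uppercased variants.
            replacements += [(source.lower(), target.lower()), (source.upper(), target.upper())]
        for src, tgt in replacements:
            code = code.replace(src, tgt)
    return code


snippet = "class BertForSequenceClassification(BertPreTrainedModel):\n    config_class = BertConfig  # see BERT_INPUTS_DOCSTRING"
print(apply_patterns(snippet, "Bert->MobileBert", all_casing=True))
# class MobileBertForSequenceClassification(MobileBertPreTrainedModel):
#     config_class = MobileBertConfig  # see MOBILEBERT_INPUTS_DOCSTRING
```

Applying the patterns strictly left to right matches the ordering rule called out above; the checker then compares the transformed, formatted copy against the code below the `# Copied from` comment, and a stale copy can be overwritten with the repository's fix-copies tooling (e.g. `make fix-copies`).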

src/transformers/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -1169,6 +1169,7 @@
         "BartForSequenceClassification",
         "BartModel",
         "BartPretrainedModel",
+        "BartPreTrainedModel",
         "PretrainedBartModel",
     ]
 )
@@ -5093,6 +5094,7 @@
         BartForQuestionAnswering,
         BartForSequenceClassification,
         BartModel,
+        BartPreTrainedModel,
         BartPretrainedModel,
         PretrainedBartModel,
     )

src/transformers/models/albert/modeling_flax_albert.py

Lines changed: 0 additions & 1 deletion
@@ -173,7 +173,6 @@ def setup(self):
         self.LayerNorm = nn.LayerNorm(epsilon=self.config.layer_norm_eps, dtype=self.dtype)
         self.dropout = nn.Dropout(rate=self.config.hidden_dropout_prob)
 
-    # Copied from transformers.models.bert.modeling_flax_bert.FlaxBertEmbeddings.__call__
     def __call__(self, input_ids, token_type_ids, position_ids, deterministic: bool = True):
         # Embed
         inputs_embeds = self.word_embeddings(input_ids.astype("i4"))

src/transformers/models/albert/tokenization_albert.py

Lines changed: 2 additions & 2 deletions
@@ -183,10 +183,10 @@ def __init__(
         self.sp_model.Load(vocab_file)
 
     @property
-    def vocab_size(self):
+    def vocab_size(self) -> int:
         return len(self.sp_model)
 
-    def get_vocab(self):
+    def get_vocab(self) -> Dict[str, int]:
         vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
         vocab.update(self.added_tokens_encoder)
         return vocab

src/transformers/models/align/modeling_align.py

Lines changed: 1 addition & 1 deletion
@@ -286,7 +286,7 @@ def align_loss(similarity: torch.Tensor) -> torch.Tensor:
     return (caption_loss + image_loss) / 2.0
 
 
-# Copied from transformers.models.efficientnet.modeling_efficientnet.round_filters with EfficientNet -> AlignVision
+# Copied from transformers.models.efficientnet.modeling_efficientnet.round_filters with EfficientNet->AlignVision
 def round_filters(config: AlignVisionConfig, num_channels: int):
     r"""
     Round number of filters based on depth multiplier.

src/transformers/models/bart/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,7 @@
         "BartForQuestionAnswering",
         "BartForSequenceClassification",
         "BartModel",
+        "BartPreTrainedModel",
         "BartPretrainedModel",
         "PretrainedBartModel",
     ]
@@ -107,6 +108,7 @@
         BartForQuestionAnswering,
         BartForSequenceClassification,
         BartModel,
+        BartPreTrainedModel,
         BartPretrainedModel,
         PretrainedBartModel,
     )

src/transformers/models/bart/modeling_bart.py

Lines changed: 19 additions & 11 deletions
@@ -502,7 +502,7 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         return hidden_states
 
 
-class BartPretrainedModel(PreTrainedModel):
+class BartPreTrainedModel(PreTrainedModel):
     config_class = BartConfig
     base_model_prefix = "model"
     supports_gradient_checkpointing = True
@@ -536,10 +536,18 @@ def dummy_inputs(self):
         return dummy_inputs
 
 
-class PretrainedBartModel(BartPretrainedModel):
+class PretrainedBartModel(BartPreTrainedModel):
     def __init_subclass__(self):
         warnings.warn(
-            "The class `PretrainedBartModel` has been depreciated, please use `BartPretrainedModel` instead.",
+            "The class `PretrainedBartModel` has been depreciated, please use `BartPreTrainedModel` instead.",
+            FutureWarning,
+        )
+
+
+class BartPretrainedModel(BartPreTrainedModel):
+    def __init_subclass__(self):
+        warnings.warn(
+            "The class `PretrainedBartModel` has been depreciated, please use `BartPreTrainedModel` instead.",
             FutureWarning,
         )
 
@@ -700,7 +708,7 @@ def __init_subclass__(self):
 """
 
 
-class BartEncoder(BartPretrainedModel):
+class BartEncoder(BartPreTrainedModel):
     """
     Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer is a
     [`BartEncoderLayer`].
@@ -882,7 +890,7 @@ def custom_forward(*inputs):
                 )
 
 
-class BartDecoder(BartPretrainedModel):
+class BartDecoder(BartPreTrainedModel):
     """
     Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a [`BartDecoderLayer`]
 
@@ -1169,7 +1177,7 @@ def custom_forward(*inputs):
     "The bare BART Model outputting raw hidden-states without any specific head on top.",
     BART_START_DOCSTRING,
 )
-class BartModel(BartPretrainedModel):
+class BartModel(BartPreTrainedModel):
     _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"]
 
     def __init__(self, config: BartConfig):
@@ -1296,7 +1304,7 @@ def forward(
 @add_start_docstrings(
     "The BART Model with a language modeling head. Can be used for summarization.", BART_START_DOCSTRING
 )
-class BartForConditionalGeneration(BartPretrainedModel):
+class BartForConditionalGeneration(BartPreTrainedModel):
     base_model_prefix = "model"
     _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight", "lm_head.weight"]
     _keys_to_ignore_on_load_missing = ["final_logits_bias"]
@@ -1471,7 +1479,7 @@ def _reorder_cache(past_key_values, beam_idx):
     """,
     BART_START_DOCSTRING,
 )
-class BartForSequenceClassification(BartPretrainedModel):
+class BartForSequenceClassification(BartPreTrainedModel):
     _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"]
 
     def __init__(self, config: BartConfig, **kwargs):
@@ -1601,7 +1609,7 @@ def forward(
     """,
     BART_START_DOCSTRING,
 )
-class BartForQuestionAnswering(BartPretrainedModel):
+class BartForQuestionAnswering(BartPreTrainedModel):
     _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"]
 
     def __init__(self, config):
@@ -1719,7 +1727,7 @@ def forward(
         )
 
 
-class BartDecoderWrapper(BartPretrainedModel):
+class BartDecoderWrapper(BartPreTrainedModel):
     """
     This wrapper class is a helper class to correctly load pretrained checkpoints when the causal language model is
     used in combination with the [`EncoderDecoderModel`] framework.
@@ -1739,7 +1747,7 @@ def forward(self, *args, **kwargs):
     """,
     BART_START_DOCSTRING,
 )
-class BartForCausalLM(BartPretrainedModel):
+class BartForCausalLM(BartPreTrainedModel):
     _tied_weights_keys = ["lm_head.weight"]
 
     def __init__(self, config):

src/transformers/models/bit/modeling_bit.py

Lines changed: 1 addition & 1 deletion
@@ -300,7 +300,7 @@ def forward(self, pixel_values: Tensor) -> Tensor:
 
 
 # Copied from transformers.models.convnext.modeling_convnext.drop_path
-def drop_path(input, drop_prob: float = 0.0, training: bool = False):
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     """
     Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
 

src/transformers/models/blenderbot/modeling_flax_blenderbot.py

Lines changed: 1 addition & 2 deletions
@@ -22,7 +22,6 @@
 import flax.linen as nn
 import jax
 import jax.numpy as jnp
-import numpy as np
 from flax.core.frozen_dict import FrozenDict, freeze, unfreeze
 from flax.linen import combine_masks, make_causal_mask
 from flax.linen.attention import dot_product_attention_weights
@@ -205,7 +204,7 @@
 
 
 # Copied from transformers.models.bart.modeling_flax_bart.shift_tokens_right
-def shift_tokens_right(input_ids: np.array, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray:
+def shift_tokens_right(input_ids: jnp.array, pad_token_id: int, decoder_start_token_id: int) -> jnp.ndarray:
     """
     Shift input ids one token to the right.
     """
