This repository was archived by the owner on Feb 11, 2025. It is now read-only.

Merging from google-research's repo to update this one. #1

Open: wants to merge 14 commits into base: master
127 changes: 127 additions & 0 deletions README.md
@@ -1,5 +1,128 @@
# BERT

**\*\*\*\*\* New March 11th, 2020: Smaller BERT Models \*\*\*\*\***

This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in [Well-Read Students Learn Better: On the Importance of Pre-training Compact Models](https://arxiv.org/abs/1908.08962).

We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
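To make the distillation setting concrete, here is a minimal, self-contained sketch of how a teacher's logits can be turned into fine-tuning targets for a compact student. Everything in it (`distillation_targets`, `student_loss`, the temperature) is illustrative rather than code from this repository, and the paper's exact recipe may differ.

```python
import numpy as np

def softmax(logits, temperature=1.0):
  """Row-wise softmax with an optional temperature."""
  z = logits / temperature
  z = z - z.max(axis=-1, keepdims=True)  # numerical stability
  e = np.exp(z)
  return e / e.sum(axis=-1, keepdims=True)

def distillation_targets(teacher_logits, soft=True, temperature=1.0):
  """Turns teacher logits into training targets for the student.

  soft=False corresponds to the simple setup described above, where the
  teacher's predicted label becomes the fine-tuning label; soft=True is the
  classic soft-target variant. Illustrative only.
  """
  if soft:
    return softmax(teacher_logits, temperature)
  hard = np.argmax(teacher_logits, axis=-1)
  return np.eye(teacher_logits.shape[-1])[hard]

def student_loss(student_logits, targets):
  """Cross-entropy of the student against the (hard or soft) targets."""
  log_probs = np.log(softmax(student_logits) + 1e-12)
  return -np.mean(np.sum(targets * log_probs, axis=-1))

# Tiny usage example with random logits standing in for real model outputs.
rng = np.random.RandomState(0)
teacher_logits = rng.randn(4, 3)
student_logits = rng.randn(4, 3)
print(student_loss(student_logits,
                   distillation_targets(teacher_logits, soft=False)))
```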

Our goal is to enable research in institutions with fewer computational resources and to encourage the community to seek directions of innovation other than simply increasing model capacity.

You can download all 24 from [here][all], or individually from the table below:

| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| **L=2** |[**2/128 (BERT-Tiny)**][2_128]|[2/256][2_256]|[2/512][2_512]|[2/768][2_768]|
| **L=4** |[4/128][4_128]|[**4/256 (BERT-Mini)**][4_256]|[**4/512 (BERT-Small)**][4_512]|[4/768][4_768]|
| **L=6** |[6/128][6_128]|[6/256][6_256]|[6/512][6_512]|[6/768][6_768]|
| **L=8** |[8/128][8_128]|[8/256][8_256]|[**8/512 (BERT-Medium)**][8_512]|[8/768][8_768]|
| **L=10** |[10/128][10_128]|[10/256][10_256]|[10/512][10_512]|[10/768][10_768]|
| **L=12** |[12/128][12_128]|[12/256][12_256]|[12/512][12_512]|[**12/768 (BERT-Base)**][12_768]|

Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.

Here are the corresponding GLUE scores on the test set:

|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|

For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs:
- batch sizes: 8, 16, 32, 64, 128
- learning rates: 3e-4, 1e-4, 5e-5, 3e-5

If you use these models, please cite the following paper:

```
@article{turc2019,
title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
journal={arXiv preprint arXiv:1908.08962v2 },
year={2019}
}
```

[2_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-128_A-2.zip
[2_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-256_A-4.zip
[2_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-512_A-8.zip
[2_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-2_H-768_A-12.zip
[4_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-128_A-2.zip
[4_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-256_A-4.zip
[4_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-512_A-8.zip
[4_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-4_H-768_A-12.zip
[6_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-128_A-2.zip
[6_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-256_A-4.zip
[6_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-512_A-8.zip
[6_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-6_H-768_A-12.zip
[8_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-128_A-2.zip
[8_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-256_A-4.zip
[8_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-512_A-8.zip
[8_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-8_H-768_A-12.zip
[10_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-128_A-2.zip
[10_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-256_A-4.zip
[10_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-512_A-8.zip
[10_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-10_H-768_A-12.zip
[12_128]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-128_A-2.zip
[12_256]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-256_A-4.zip
[12_512]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-512_A-8.zip
[12_768]: https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
[all]: https://storage.googleapis.com/bert_models/2020_02_20/all_bert_models.zip

**\*\*\*\*\* New May 31st, 2019: Whole Word Masking Models \*\*\*\*\***

This is a release of several new models which are the result of an improvement
in the pre-processing code.

In the original pre-processing code, we randomly select WordPiece tokens to
mask. For example:

`Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head`
`Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil
[MASK] ##mon ' s head`

The new technique is called Whole Word Masking. In this case, we always mask
*all* of the tokens corresponding to a word at once. The overall masking
rate remains the same.

`Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK]
[MASK] ' s head`

The training is identical -- we still predict each masked WordPiece token
independently. The improvement comes from the fact that the original prediction
task was too 'easy' for words that had been split into multiple WordPieces.

This can be enabled during data generation by passing the flag
`--do_whole_word_mask=True` to `create_pretraining_data.py`.
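
To see the grouping step in isolation, here is a small self-contained sketch of how continuation WordPieces (the `##`-prefixed tokens) are attached to the preceding word before masking candidates are chosen. It mirrors the logic this change adds to `create_pretraining_data.py`, simplified for illustration.

```python
def group_whole_words(tokens):
  """Groups WordPiece tokens into whole-word candidate index sets.

  Continuation pieces (prefixed with "##") are appended to the preceding
  word's index set so that masking later covers entire words at once.
  Simplified sketch; the real logic lives in create_pretraining_data.py.
  """
  cand_indexes = []
  for i, token in enumerate(tokens):
    if token in ("[CLS]", "[SEP]"):
      continue
    if cand_indexes and token.startswith("##"):
      cand_indexes[-1].append(i)
    else:
      cand_indexes.append([i])
  return cand_indexes

tokens = ["[CLS]", "the", "man", "jumped", "up", ",", "put", "his", "basket",
          "on", "phil", "##am", "##mon", "'", "s", "head", "[SEP]"]
print(group_whole_words(tokens))
# -> [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10, 11, 12], [13], [14], [15]]
# "phil ##am ##mon" forms a single candidate, so it is masked as a whole word.
```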

Pre-trained models with Whole Word Masking are linked below. The data and
training were otherwise identical, and the models have identical structure and
vocab to the original models. We only include BERT-Large models. When using
these models, please make it clear in the paper that you are using the Whole
Word Masking variant of BERT-Large.

* **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)**:
24-layer, 1024-hidden, 16-heads, 340M parameters

* **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip)**:
24-layer, 1024-hidden, 16-heads, 340M parameters

Model | SQuAD 1.1 F1/EM | MultiNLI Accuracy
---------------------------------------- | :-------------: | :----------------:
BERT-Large, Uncased (Original) | 91.0/84.3 | 86.05
BERT-Large, Uncased (Whole Word Masking) | 92.8/86.7 | 87.07
BERT-Large, Cased (Original) | 91.5/84.8 | 86.09
BERT-Large, Cased (Whole Word Masking) | 92.9/86.7 | 86.46

**\*\*\*\*\* New February 7th, 2019: TfHub Module \*\*\*\*\***

BERT has been uploaded to [TensorFlow Hub](https://tfhub.dev). See
`run_classifier_with_tfhub.py` for an example of how to use the TF Hub module,
or run an example in the browser on
[Colab](https://colab.sandbox.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb).
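
For a rough idea of what using the module looks like with the TF1 `hub.Module` API, here is a minimal sketch. The module URL and the `tokens` signature below reflect the standard uncased BERT-Base TF Hub module, but treat the details as assumptions and defer to `run_classifier_with_tfhub.py` and the Colab above for the tested usage.

```python
import tensorflow as tf
import tensorflow_hub as hub

BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

# Placeholders for already-tokenized inputs; see tokenization.py and the Colab
# for how to produce input_ids, input_mask and segment_ids.
input_ids = tf.placeholder(tf.int32, [None, 128])
input_mask = tf.placeholder(tf.int32, [None, 128])
segment_ids = tf.placeholder(tf.int32, [None, 128])

bert_module = hub.Module(BERT_MODEL_HUB, trainable=True)
bert_outputs = bert_module(
    inputs=dict(input_ids=input_ids, input_mask=input_mask,
                segment_ids=segment_ids),
    signature="tokens", as_dict=True)

pooled_output = bert_outputs["pooled_output"]      # [batch, hidden], for classification
sequence_output = bert_outputs["sequence_output"]  # [batch, seq_len, hidden], per token
```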

**\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai +
Mongolian \*\*\*\*\***

@@ -219,6 +342,10 @@ using your own script.)**

The links to the models are here (right-click, 'Save link as...' on the name):

* **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip)**:
24-layer, 1024-hidden, 16-heads, 340M parameters
* **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip)**:
24-layer, 1024-hidden, 16-heads, 340M parameters
* **[`BERT-Base, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip)**:
12-layer, 768-hidden, 12-heads, 110M parameters
* **[`BERT-Large, Uncased`](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip)**:
63 changes: 45 additions & 18 deletions create_pretraining_data.py
@@ -20,8 +20,8 @@

import collections
import random
import tensorflow as tf
import tokenization
import tensorflow as tf

flags = tf.flags

@@ -42,6 +42,10 @@
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")

flags.DEFINE_bool(
"do_whole_word_mask", False,
"Whether to use whole word masking rather than per-WordPiece masking.")

flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.")

flags.DEFINE_integer("max_predictions_per_seq", 20,
@@ -343,7 +347,20 @@ def create_masked_lm_predictions(tokens, masked_lm_prob,
for (i, token) in enumerate(tokens):
if token == "[CLS]" or token == "[SEP]":
continue
cand_indexes.append(i)
# Whole Word Masking means that we mask all of the wordpieces
# corresponding to an original word at once. When a word has been split into
# WordPieces, the first token does not have any marker and any subsequent
# tokens are prefixed with ##. So whenever we see a ## token, we
# append its index to the previous set of word indexes.
#
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
token.startswith("##")):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])

rng.shuffle(cand_indexes)

@@ -354,29 +371,39 @@

masked_lms = []
covered_indexes = set()
for index in cand_indexes:
for index_set in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
if index in covered_indexes:
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict:
continue
covered_indexes.add(index)
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)

masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

output_tokens[index] = masked_token
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
output_tokens[index] = masked_token

masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
assert len(masked_lms) <= num_to_predict
masked_lms = sorted(masked_lms, key=lambda x: x.index)

masked_lm_positions = []
26 changes: 12 additions & 14 deletions modeling.py
@@ -23,6 +23,7 @@
import json
import math
import re
import numpy as np
import six
import tensorflow as tf

@@ -133,7 +134,7 @@ def __init__(self,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=True,
use_one_hot_embeddings=False,
scope=None):
"""Constructor for BertModel.
@@ -145,9 +146,7 @@ def __init__(self,
input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
embeddings or tf.embedding_lookup() for the word embeddings. On the TPU,
it is much faster if this is True, on the CPU or GPU, it is faster if
this is False.
embeddings or tf.embedding_lookup() for the word embeddings.
scope: (optional) variable scope. Defaults to "bert".
Raises:
@@ -262,20 +261,20 @@ def get_embedding_table(self):
return self.embedding_table


def gelu(input_tensor):
def gelu(x):
"""Gaussian Error Linear Unit.
This is a smoother version of the RELU.
Original paper: https://arxiv.org/abs/1606.08415
Args:
input_tensor: float Tensor to perform activation.
x: float Tensor to perform activation.
Returns:
`input_tensor` with the GELU activation applied.
`x` with the GELU activation applied.
"""
cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
return input_tensor * cdf
cdf = 0.5 * (1.0 + tf.tanh(
(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
return x * cdf
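# Note: the original erf form and the tanh approximation above compute nearly
# the same function. A quick standalone check (illustrative, not part of this
# file; requires numpy and scipy):
#
#   import numpy as np
#   from scipy.special import erf
#   x = np.linspace(-5.0, 5.0, 101)
#   exact = x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
#   approx = x * 0.5 * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
#   print(np.max(np.abs(exact - approx)))  # tiny; well under 1e-2 everywhere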


def get_activation(activation_string):
@@ -394,8 +393,7 @@ def embedding_lookup(input_ids,
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
for TPUs.
embeddings. If False, use `tf.gather()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
@@ -413,12 +411,12 @@ def embedding_lookup(input_ids,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))

flat_input_ids = tf.reshape(input_ids, [-1])
if use_one_hot_embeddings:
flat_input_ids = tf.reshape(input_ids, [-1])
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
output = tf.matmul(one_hot_input_ids, embedding_table)
else:
output = tf.nn.embedding_lookup(embedding_table, input_ids)
output = tf.gather(embedding_table, flat_input_ids)
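# Both branches select the same rows of embedding_table: multiplying a one-hot
# matrix by the table is mathematically equivalent to tf.gather. The matmul
# form tends to be faster on TPUs, while gather is faster on CPUs/GPUs, which
# is why use_one_hot_embeddings now defaults to False.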

input_shape = get_shape_list(input_ids)

14 changes: 6 additions & 8 deletions multilingual.md
@@ -69,7 +69,7 @@ Note that the English result is worse than the 84.2 MultiNLI baseline because
this training used Multilingual BERT rather than English-only BERT. This implies
that for high-resource languages, the Multilingual model is somewhat worse than
a single-language model. However, it is not feasible for us to train and
maintain dozens of single-language model. Therefore, if your goal is to maximize
maintain dozens of single-language models. Therefore, if your goal is to maximize
performance with a language other than English or Chinese, you might find it
beneficial to run pre-training for additional steps starting from our
Multilingual model on data from your language of interest.
@@ -99,8 +99,8 @@ version of MultiNLI where the dev/test sets have been human-translated, and the
training set has been machine-translated.

To run the fine-tuning code, please download the
[XNLI dev/test set](https://s3.amazonaws.com/xnli/XNLI-1.0.zip) and the
[XNLI machine-translated training set](https://s3.amazonaws.com/xnli/XNLI-MT-1.0.zip)
[XNLI dev/test set](https://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip) and the
[XNLI machine-translated training set](https://www.nyu.edu/projects/bowman/xnli/XNLI-MT-1.0.zip)
and then unpack both .zip files into some directory `$XNLI_DIR`.

To run fine-tuning on XNLI: the language is hard-coded into `run_classifier.py`
@@ -152,11 +152,9 @@ taken as the training data for each language
However, the size of the Wikipedia for a given language varies greatly, and
therefore low-resource languages may be "under-represented" in terms of the
neural network model (under the assumption that languages are "competing" for
limited model capacity to some extent).

However, the size of a Wikipedia also correlates with the number of speakers of
a language, and we also don't want to overfit the model by performing thousands
of epochs over a tiny Wikipedia for a particular language.
limited model capacity to some extent). At the same time, we also don't want
to overfit the model by performing thousands of epochs over a tiny Wikipedia
for a particular language.

To balance these two factors, we performed exponentially smoothed weighting of
the data during pre-training data creation (and WordPiece vocab creation). In