Showing 2 changed files with 71 additions and 0 deletions.
7 changes: 7 additions & 0 deletions
tfhub_dev/assets/digitalepidemiologylab/digitalepidemiologylab.md
@@ -0,0 +1,7 @@
# Publisher digitalepidemiologylab
Digital Epidemiology Lab at EPFL, Switzerland

## Digital Epidemiology Lab
Research lab at EPFL working on Applied Machine Learning for Epidemiology.

https://www.digitalepidemiologylab.org/
64 changes: 64 additions & 0 deletions
tfhub_dev/assets/digitalepidemiologylab/models/covid-twitter-bert/1.md
@@ -0,0 +1,64 @@
# Module digitalepidemiologylab/covid-twitter-bert/1
BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19

<!-- asset-path: https://crowdbreaks-public.s3.eu-central-1.amazonaws.com/models/covid-twitter-bert/v1/tfhub/covid-twitter-bert-v1.tar.gz -->
<!-- module-type: text-embedding -->
<!-- network-architecture: Transformer -->
<!-- dataset: Twitter -->
<!-- language: en -->
<!-- fine-tunable: true -->
<!-- format: saved_model_2 -->

## Overview

This model was trained on 160M tweets collected between January 12 and April 16, 2020, containing at least one of the keywords "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2". These tweets were filtered and preprocessed to reach a final sample of 22.5M tweets (containing 40.7M sentences and 633M tokens), which was used for training.

This model was evaluated on downstream classification tasks, but it can be used for any other NLP task that can leverage contextual embeddings.

To achieve the best results, make sure to use the same text preprocessing as we did for pretraining. This involves replacing user mentions, URLs, and emojis; a preprocessing script is available in our project's [GitHub repo](https://github.com/digitalepidemiologylab/covid-twitter-bert). A rough sketch of the idea follows.

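As a minimal illustration (the exact placeholder tokens and the emoji handling are assumptions here; use the repo's script for faithful results):

```python
import re

def preprocess_tweet(text):
    # Replace user mentions and URLs with placeholder tokens.
    # NOTE: the tokens below are illustrative assumptions; the exact
    # tokens are defined by the script in the covid-twitter-bert repo.
    text = re.sub(r"@\w+", "@user", text)
    text = re.sub(r"https?://\S+", "http://url", text)
    # Emoji replacement is omitted in this sketch.
    return text.strip()

print(preprocess_tweet("Stay safe @alice! Info: https://www.who.int"))
# -> "Stay safe @user! Info: http://url"
```
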
## Example use
The saved model can be loaded directly:

```python
import tensorflow as tf
import tensorflow_hub as hub

max_seq_length = 96  # Your choice here.
input_word_ids = tf.keras.layers.Input(
    shape=(max_seq_length,),
    dtype=tf.int32,
    name="input_word_ids")
input_mask = tf.keras.layers.Input(
    shape=(max_seq_length,),
    dtype=tf.int32,
    name="input_mask")
input_type_ids = tf.keras.layers.Input(
    shape=(max_seq_length,),
    dtype=tf.int32,
    name="input_type_ids")
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/digitalepidemiologylab/covid-twitter-bert/1",
    trainable=True)
pooled_output, sequence_output = bert_layer(
    [input_word_ids, input_mask, input_type_ids])
```
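
To sanity-check shapes, you can wrap the outputs in a Keras model and run a dummy batch; the zero-filled inputs below are purely illustrative:

```python
import numpy as np

embedding_model = tf.keras.Model(
    inputs=[input_word_ids, input_mask, input_type_ids],
    outputs=[pooled_output, sequence_output])

# Dummy batch of two all-padding sequences, just to inspect output shapes.
dummy = np.zeros((2, max_seq_length), dtype=np.int32)
pooled, sequence = embedding_model.predict([dummy, dummy, dummy])
print(pooled.shape)    # (2, 1024) for a BERT-large encoder
print(sequence.shape)  # (2, 96, 1024)
```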

From there, you can create a classifier model as follows:
```python
num_labels = 3
initializer = tf.keras.initializers.TruncatedNormal(stddev=0.2)
output = tf.keras.layers.Dropout(rate=0.1)(pooled_output)
# Produces logits; pair with a loss that sets from_logits=True.
output = tf.keras.layers.Dense(
    num_labels, kernel_initializer=initializer, name='output')(output)
classifier_model = tf.keras.Model(
    inputs={
        'input_word_ids': input_word_ids,
        'input_mask': input_mask,
        'input_type_ids': input_type_ids},
    outputs=output)
```
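
A minimal way to compile and fine-tune this classifier (the optimizer settings and the `train_*` arrays are assumptions for illustration, not values from this model card):

```python
classifier_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# Hypothetical training data: int32 arrays of shape
# (num_examples, max_seq_length) plus integer labels in [0, num_labels).
classifier_model.fit(
    {'input_word_ids': train_ids,
     'input_mask': train_mask,
     'input_type_ids': train_type_ids},
    train_labels,
    batch_size=32,
    epochs=3)
```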

You can load the tokenizer as follows. The vocab is equivalent to the official bert-large-uncased vocab:
```python
# `tokenization` is the BERT reference tokenizer; one source is the
# TensorFlow Model Garden (tf-models-official) package.
from official.nlp.bert import tokenization

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
```
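
For example, to turn a (preprocessed) tweet into the three model inputs (the padding scheme below follows standard BERT conventions and is a sketch, not the authors' exact pipeline):

```python
text = "covid vaccines are rolling out"
tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
ids = tokenizer.convert_tokens_to_ids(tokens)[:max_seq_length]

# Mask is 1 for real tokens, 0 for padding; single segment -> type ids 0.
mask = [1] * len(ids) + [0] * (max_seq_length - len(ids))
ids = ids + [0] * (max_seq_length - len(ids))
type_ids = [0] * max_seq_length
```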

## References

[1] Martin Müller, Marcel Salathé, Per E Kummervold. "COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter." arXiv preprint arXiv:2005.07503 (2020).