Text Classification with FNet [KerasNLP] #898
Conversation
Thanks! Looks good. Left some initial comments.
def preprocess_dataset(dataset):
    dataset = dataset.map(
        lambda x: {
            "sentence": tf.strings.lower(x["sentence"]),
Can't we lowercase inside the tokenizer? Why do that here?
It lowercases the special tokens as well, which I wanted to avoid.
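For context, a minimal sketch of the approach being discussed: lowercasing in the `tf.data` pipeline before tokenization, so that special tokens added later keep their casing (the feature names follow the snippet above).

```python
import tensorflow as tf


def preprocess_dataset(dataset):
    # Lowercase only the raw text; special tokens such as "[START]" are
    # added after this step and therefore keep their original casing.
    return dataset.map(
        lambda x: {
            "sentence": tf.strings.lower(x["sentence"]),
            "label": x["label"],
        },
        num_parallel_calls=tf.data.AUTOTUNE,
    )
```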
we have. WordPiece Tokenizer is a subword tokenizer; training it on a corpus gives
us a vocabulary of subwords. A subword tokenizer is a compromise between word tokenizers
(word tokenizers have the issue of many OOV tokens), and character tokenizers
(characters don't really encode meaning like words do). Luckily, TensorFlow Text makes it very
TensorFlow Text makes it very simple to train WordPiece on a corpus, as described in this [guide](link location)
    vocab_size=vocab_size,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
Same comments as the other guide... use `bert_tokenizer_params={"lower_case": True}`, remove the "Arguments for `text.BertTokenizer`" comment, and remove `learn_params`.
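A hedged sketch of the suggested call, using the `bert_vocab_from_dataset` helper the example already imports; the dataset variable and batch sizes here are placeholders, not the exact values in the PR.

```python
from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab,
)

# `train_sentences` is assumed to be a `tf.data.Dataset` of raw text strings.
vocab = bert_vocab.bert_vocab_from_dataset(
    train_sentences.batch(1000).prefetch(2),
    vocab_size=vocab_size,
    reserved_tokens=reserved_tokens,
    # Lowercase while learning the vocabulary; no extra comment or
    # `learn_params` argument needed.
    bert_tokenizer_params={"lower_case": True},
)
```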
## Formatting the Dataset

Next, we'll format our datasets in the form that will be fed to the models.
We need to add [START] and [END] tokens to the input sentences. We also need
I don't believe you need [START] and [END], and you aren't using them. Please remove.
original text.
"""

for element in train_ds.take(1):
element = train_ds.take(1).get_single_element()
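For illustration, the suggestion in place (a minimal sketch; `train_ds` is the batched dataset from the example).

```python
# Pull a single element out of the dataset without writing a loop.
element = train_ds.take(1).get_single_element()
print(element)
```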
We first need an Embedding layer, i.e., a vector for every token in our input sequence.
This Embedding layer can be initialised randomly. We also need a Positional
Embedding layer which encodes the word order in the sequence. The convention is
to add these two embeddings. KerasNLP has a `TokenAndPositionEmbedding` layer
Use full paths to these symbols generally, for hyperlinking.
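As an example of the full-path convention, a hedged sketch referring to the layer as `keras_nlp.layers.TokenAndPositionEmbedding` so keras.io can hyperlink it; the hyperparameter names are placeholders.

```python
import keras_nlp

# Referencing the fully-qualified symbol lets keras.io link to the API docs.
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_sequence_length,
    embedding_dim=embed_dim,
)
```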
with datasets that are stored in the TensorFlow format. We will use TFDS to load
the SST-2 dataset.
"""
train_ds, val_ds, test_ds = tfds.load(
We generally try to show downloading and using the source files directly. It is more flexible when copying and updating a guide to a new dataset. You can see how to download SST directly in our current guide for KerasNLP.
I feel it's okay to use tfds here, which already does the splitting. The hassle I see from data loading is usually not how to switch between tfds and other sources, but how to find sources.
One more thing - let's be more concise in the description: if we choose to use tfds, just say we load SST-2 from TensorFlow Datasets (TFDS).
This was a request from @fchollet when I was doing my guide, so maybe we should discuss with him? Personally not particularly opinionated.
Ah okay, let's bring this up in the team chat. I feel since tfds is part of the TF ecosystem and still being maintained, we should try using their product.
Removed TFDS. We don't need it for IMDb :)
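A hedged sketch of loading IMDb straight from the source archive instead of TFDS, along the lines of the existing KerasNLP guide; the URL and directory layout are those of the standard aclImdb release.

```python
from tensorflow import keras

# Download and extract the raw IMDb archive.
keras.utils.get_file(
    "aclImdb_v1.tar.gz",
    "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    untar=True,
    cache_dir=".",
    cache_subdir="",
)

# Build labelled `tf.data.Dataset`s directly from the extracted text files.
# Note: `aclImdb/train` also ships an `unsup` folder, which you would
# typically delete before calling this.
train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=32)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=32)
```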
### Model

In 2017, a paper titled [Attention is All You Need](https://arxiv.org/abs/1706.03762)
I would generally tighten up this section. Examples shouldn't have a ton of offhand comments; we should focus on what is shown in this guide. This reads a little too much like a blog currently.
Roughly, we should just say...
- BERT, RoBERTa, etc. have shown the effectiveness of using transformers to compute a rich embedding for input text.
- However, transformers are expensive; an ongoing question is how to lower the compute requirements.
- In this guide we will focus on FNet, which replaces the expensive attention mechanism with a Fourier transform (sketched below).
- We will show how this can speed up training without significantly degrading performance.
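To make the FNet point concrete, a minimal sketch of a classifier built around `keras_nlp.layers.FNetEncoder`; the layer count and dimensions are illustrative, not the configuration in this example.

```python
import keras_nlp
from tensorflow import keras

inputs = keras.Input(shape=(max_sequence_length,), dtype="int64")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_sequence_length,
    embedding_dim=embed_dim,
)(inputs)
# FNetEncoder mixes tokens with a parameter-free Fourier transform
# instead of self-attention.
for _ in range(3):
    x = keras_nlp.layers.FNetEncoder(intermediate_dim=intermediate_dim)(x)
x = keras.layers.GlobalAveragePooling1D()(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
```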
Right. Changes made 👍🏼 . Sorry for the rather verbose introduction! :P
layer and get comparable results?

A couple of points from the paper stood out:
1. The authors claim that FNet is 80% faster than BERT on GPUs and 70% faster on TPUs.
Why is our speedup so much less pronounced? Are we including compilation time in the total time? If we grow the model would the speedup become clearer?
Hey, the SST-2 dataset has very short sequences. I tried it with the IMDb dataset (which has longer sequences) and I'm getting a noticeable speed-up :D
""" | ||
|
||
""" | ||
Let's make a table and compare the two models. |
I would state this point a little more clearly: "We can see that FNet significantly speeds up our run time, with only a small sacrifice in overall accuracy."
Thanks for the PR! Left some comments.
"""shell
pip install -q keras-nlp
pip install -q tfds-nightly
Why are we using nightly?
I used nightly because it has the `huggingface:sst` dataset. I've removed it now because I am using the IMDb dataset.
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

"""
Let's also define our parameters/hyperparameters.
All of these are hypers?
"""
Now, let's define the tokenizer. We will use the vocabulary obtained above as
input to the tokenizers. We will define a maximum sequence length so that
The vocabulary is not the "input" to the tokenizer. We could say "configure the tokenizer with the vocabulary trained above."
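A hedged sketch of the suggested wording in code form, configuring `keras_nlp.tokenizers.WordPieceTokenizer` with the vocabulary trained above; the `lowercase` and `sequence_length` values are assumptions.

```python
import keras_nlp

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,  # the WordPiece vocabulary trained above
    lowercase=False,   # the text was already lowercased in the data pipeline
    sequence_length=max_sequence_length,
)
```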
""" | ||
Title: Text Classification using FNet | ||
Author: [Abheesht Sharma](https://github.com/abheesht17/) | ||
Date created: 2021/06/01 |
2022...?
Ah, man. Looks like I'm still mentally stuck in 2021 😆 . Changed!
""" | ||
|
||
"""shell | ||
ls aclImdb |
One question - here you are chaining three `ls` commands, do all of these prints get shown?
Hmmm, I'll have to check this by generating the .ipynb file. Ideally, it should print all. I'll generate the .ipynb file and let you know.
None of this will get output in the rendered example, the way things work on keras.io. If you want to do that, you would need to actually call `os.listdir` or something.
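A minimal sketch of the `os.listdir` alternative, so the directory contents actually appear in the rendered example (paths assume the extracted aclImdb archive).

```python
import os

# Print directory contents so they show up in the generated notebook/website.
print(os.listdir("./aclImdb"))
print(os.listdir("./aclImdb/train"))
print(os.listdir("./aclImdb/test"))
```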
Ah, I see. Will change.
## Building the Model

Now, let's move on to the exciting part - defining our model!
We first need an embedding layer, i.e., a vector for every token in our input sequence.
A layer that maps every token in the input sequence to a vector.
Thanks for the review, @chenmoneygithub!
""" | ||
Title: Text Classification using FNet | ||
Author: [Abheesht Sharma](https://github.com/abheesht17/) | ||
Date created: 2021/06/01 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, man. Looks like I'm still mentally stuck in 2021 😆 . Changed!
""" | ||
|
||
"""shell | ||
ls aclImdb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I'll have to check this by generating the .ipynb
file. Ideally, it should print all. I'll generate the .iypnb
file and let you know.
Thanks!
from tensorflow import keras
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

random.seed(42)
Don't do this; instead use `keras.utils.set_random_seed()`.
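For reference, a sketch of the suggested replacement; `keras.utils.set_random_seed` seeds Python's `random`, NumPy, and TensorFlow in one call.

```python
from tensorflow import keras

# Single call replaces separate random.seed(...) / np.random.seed(...) /
# tf.random.set_seed(...) lines.
keras.utils.set_random_seed(42)
```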
import os

from tensorflow import keras
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab
Same as in the other example -- this should be hidden away
our labelled `tf.data.Dataset` dataset from text files.
"""

train_ds = tf.keras.utils.text_dataset_from_directory(
Use `from tensorflow import keras` so you don't have to use `tf.keras` everywhere.
"""
### Tokenizing the Data
We'll be using the `keras_nlp.tokenizers.WordPieceTokenizer` layer to tokenize
Add line break above
"""
Every vocabulary has a few special, reserved tokens. We have four such tokens:
- `"[PAD]"` - Padding token. Padding tokens are appended to the input sequence length
Add line break before list
"""
## Formatting the Dataset
Only capitalize the first word in a section title
Thanks for the review, @fchollet! Addressed your comments :)
LGTM, thanks for the updates!
1. The authors claim that FNet is 80% faster than BERT on GPUs and 70% faster
on TPUs.
The reason for this speed-up is two-fold:
a. The Fourier Transform layer is unparametrized, it does not have any parameters!
Does this get rendered as a nested list when you generate the website?
Thank you for the great contribution! 👍
Thank you, @mattdangerw, @fchollet, @chenmoneygithub for the review comments and the approval! :) Have to make a teensy change, don't merge it just yet.
Done making the final changes!
Please fix the code formatting.
On running
The CI error is unrelated. I'll merge. Thank you!
Resolves keras-team/keras-hub#213
Dataset: IMDb
Compared results with Transformer model