
Text Classification with FNet [KerasNLP] #898

Merged: 26 commits into keras-team:master on Jun 29, 2022

Conversation

@abheesht17 (Contributor) commented on Jun 1, 2022:

Resolves keras-team/keras-hub#213

Dataset: IMDb
Compared results with Transformer model

@mattdangerw (Member) left a comment:

Thanks! Looks good. Left some initial comments.

def preprocess_dataset(dataset):
dataset = dataset.map(
lambda x: {
"sentence": tf.strings.lower(x["sentence"]),
Member:

Can't we lowercase inside the tokenizer? Why do that here?

Contributor Author (@abheesht17):

It lowercases the special tokens as well, which I wanted to avoid.
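For context, a minimal sketch of what that looks like in the `tf.data` pipeline (the dict keys and the toy dataset below are illustrative, not the guide's exact code):

```python
import tensorflow as tf

def preprocess_dataset(dataset):
    # Lowercase the raw text here rather than inside the tokenizer, so that
    # reserved tokens such as "[PAD]" keep their original casing later on.
    return dataset.map(
        lambda x: {
            "sentence": tf.strings.lower(x["sentence"]),
            "label": x["label"],
        },
        num_parallel_calls=tf.data.AUTOTUNE,
    )

# Illustrative usage with a tiny in-memory dataset.
toy_ds = tf.data.Dataset.from_tensor_slices(
    {"sentence": ["An AMAZING film!", "Truly AWFUL."], "label": [1, 0]}
)
toy_ds = preprocess_dataset(toy_ds)
```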

we have. WordPiece Tokenizer is a subword tokenizer; training it on a corpus gives
us a vocabulary of subwords. A subword tokenizer is a compromise between word tokenizers
(word tokenizers have the issue of many OOV tokens), and character tokenizers
(characters don't really encode meaning like words do). Luckily, TensorFlow Text makes it very
Member:

TensorFlow Text makes it very simple to train WordPiece on a corpus, as described in this [guide](link location)

vocab_size=vocab_size,
# Reserved tokens that must be included in the vocabulary
reserved_tokens=reserved_tokens,
# Arguments for `text.BertTokenizer`
Member:

Same comments as the other guide: pass `bert_tokenizer_params={"lower_case": True}`, remove the "Arguments for `text.BertTokenizer`" comment, and remove `learn_params`.
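For reference, a sketch of the vocabulary-training call with those suggestions applied. The tiny stand-in dataset and values below are assumptions for illustration, not the guide's actual data:

```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab,
)

# Stand-ins for the guide's dataset and hyperparameters.
train_ds = tf.data.Dataset.from_tensor_slices(
    {"sentence": ["this film was excellent", "truly awful acting"]}
)
vocab_size = 100
reserved_tokens = ["[PAD]", "[UNK]"]

vocab = bert_vocab.bert_vocab_from_dataset(
    # Feed batches of raw text to the vocabulary learner.
    train_ds.map(lambda x: x["sentence"]).batch(1000).prefetch(2),
    vocab_size=vocab_size,
    reserved_tokens=reserved_tokens,
    # Lowercase inside the underlying BERT tokenizer, as suggested above.
    bert_tokenizer_params={"lower_case": True},
)
```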

## Formatting the Dataset

Next, we'll format our datasets in the form that will be fed to the models.
We need to add [START] and [END] tokens to the input sentences. We also need
Member:

I don't believe you need [START] and [END], and you aren't using them. Please remove.

original text.
"""

for element in train_ds.take(1):
Member:

`element = train_ds.take(1).get_single_element()`

We first need an Embedding layer, i.e., a vector for every token in our input sequence.
This Embedding layer can be initialised randomly. We also need a Positional
Embedding layer which encodes the word order in the sequence. The convention is
to add these two embeddings. KerasNLP has a `TokenAndPositionEmbedding ` layer
Member:

Use the full paths to these symbols generally, for hyperlinking.
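For reference, the full symbol path in question, shown with placeholder hyperparameters (illustrative values, not the guide's final settings):

```python
import keras_nlp
from tensorflow import keras

vocab_size = 15000          # placeholder
max_sequence_length = 512   # placeholder
embedding_dim = 128         # placeholder

token_ids = keras.Input(shape=(max_sequence_length,), dtype="int64")
# keras_nlp.layers.TokenAndPositionEmbedding sums a learned token embedding
# with a learned position embedding, as described in the quoted text above.
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_sequence_length,
    embedding_dim=embedding_dim,
)(token_ids)
```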

with datasets that are stored in the TensorFlow format. We will use TFDS to load
the SST-2 dataset.
"""
train_ds, val_ds, test_ds = tfds.load(
Member:

We generally try to show downloading and using source files directly. It is more flexible when copying and updating a guide to a new dataset. You can see how to download SST directly in our current guide for KerasNLP.

Contributor:

I feel it's okay to use tfds here, which already does the splitting. The hassle I see from data loading is usually not how to switch between tfds and other sources, but how to find sources.

Contributor:

One more thing - let's be more concise in the description. If we choose to use tfds, just say we load SST-2 from TensorFlow Datasets (TFDS).

Member:

This was a request from @fchollet when I was doing my guide, so maybe we should discuss with him? Personally not particularly opinionated.

Contributor:

Ah okay, let's bring this up in the team chat. I feel that since tfds is part of the TF ecosystem and is still being maintained, we should try using their product.

Contributor Author (@abheesht17):

Removed TFDS. We don't need it for IMDb :)
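A sketch of loading IMDb directly rather than via TFDS, roughly along the lines the thread settles on; the URL is the standard IMDb source, but the exact commands in the final guide may differ:

```python
import os
from tensorflow import keras

# Download and extract the raw IMDb archive. get_file caches it under
# ~/.keras/datasets and extracts an `aclImdb/` directory next to it.
archive_path = keras.utils.get_file(
    "aclImdb_v1.tar.gz",
    origin="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    untar=True,
)
data_dir = os.path.join(os.path.dirname(archive_path), "aclImdb")
```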


### Model

In 2017, a paper titled [Attention is All You Need](https://arxiv.org/abs/1706.03762)
Member:

I would generally tighten up this section. Examples shouldn't have a ton of offhand comments; we should focus on what is shown in this guide. This reads a little too much like a blog currently.

Roughly, we should just say:

- BERT, RoBERTa, etc. have shown the effectiveness of using transformers to compute a rich embedding for input text.
- However, transformers are expensive; an ongoing question is how to lower the compute requirements.
- In this guide we will focus on FNet, which replaces the expensive attention mechanism with a Fourier transform.
- We will show how this can speed up training without significantly degrading performance.

Contributor Author (@abheesht17):

Right. Changes made 👍🏼. Sorry for the rather verbose introduction! :P
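As a rough sketch of that focus in code, KerasNLP's `keras_nlp.layers.FNetEncoder` replaces self-attention with a Fourier transform; the hyperparameters and layer count below are placeholders, not the guide's exact architecture:

```python
import keras_nlp
from tensorflow import keras

vocab_size, max_length, embed_dim, ff_dim = 15000, 512, 128, 512  # placeholders

inputs = keras.Input(shape=(max_length,), dtype="int64", name="token_ids")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_length,
    embedding_dim=embed_dim,
)(inputs)
# Each FNetEncoder block mixes tokens with a Fourier transform (no attention
# weights to learn), then applies a position-wise feed-forward network.
for _ in range(3):
    x = keras_nlp.layers.FNetEncoder(intermediate_dim=ff_dim)(x)
x = keras.layers.GlobalAveragePooling1D()(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
fnet_classifier = keras.Model(inputs, outputs, name="fnet_classifier")
fnet_classifier.compile("adam", "binary_crossentropy", metrics=["accuracy"])
```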

layer and get comparable results?

A couple of points from the paper stood out:
1. The authors claim that FNet is 80% faster than BERT on GPUs and 70% faster on TPUs.
Member:

Why is our speedup so much less pronounced? Are we including compilation time in the total time? If we grow the model, would the speedup become clearer?

Contributor Author (@abheesht17):

Hey, the SST-2 dataset has very short sequences. I tried it with the IMDb dataset (which has longer sequences) and I'm getting a noticeable speed-up :D

"""

"""
Let's make a table and compare the two models.
Member:

I would state this point a little more clearly: "We can see that FNet significantly speeds up our run time, with only a small sacrifice in overall accuracy."

@chenmoneygithub (Contributor) left a comment:

Thanks for the PR! Left some comments.


"""shell
pip install -q keras-nlp
pip install -q tfds-nightly
Contributor:

Why are we using nightly?

Contributor Author (@abheesht17):

I used nightly because it has the huggingface:sst dataset. I've removed it now because I am using the IMDb dataset.

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

"""
Let's also define our parameters/hyperparameters.
Contributor:

All of these are hypers?



"""
Now, let's define the tokenizer. We will use the vocabulary obtained above as
input to the tokenizers. We will define a maximum sequence length so that
Contributor:

The vocabulary is not the "input" to the tokenizer. We could say "configure the tokenizer with the vocabulary trained above."
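In code, configuring the tokenizer with the trained vocabulary would look roughly like this (the toy vocabulary and sequence length below are illustrative):

```python
import keras_nlp

# Toy vocabulary standing in for the WordPiece vocabulary trained earlier.
vocab = ["[PAD]", "[UNK]", "the", "film", "was", "great", "##ly"]

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=False,       # the text is already lowercased upstream
    sequence_length=64,    # pad/truncate every example to a fixed length
)
token_ids = tokenizer("the film was great")
```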

"""
Title: Text Classification using FNet
Author: [Abheesht Sharma](https://github.com/abheesht17/)
Date created: 2021/06/01
Contributor:

2022...?

Contributor Author (@abheesht17):

Ah, man. Looks like I'm still mentally stuck in 2021 😆. Changed!

"""

"""shell
ls aclImdb
Contributor:

One question - here you are chaining three ls commands; do all of these prints get shown?

Contributor Author (@abheesht17):

Hmmm, I'll have to check this by generating the .ipynb file. Ideally, it should print all. I'll generate the .ipynb file and let you know.

Member:

None of this will get output in the rendered example, the way things work on keras.io. If you want to do that, you would need to actually call os.listdir or something.

Contributor Author (@abheesht17):

Ah, I see. Will change.
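A minimal sketch of the Python replacement for the chained shell `ls` calls, assuming the extracted `aclImdb/` directory from the download step:

```python
import os

# Unlike the shell `ls` lines, these prints do appear in the rendered
# keras.io example.
print(os.listdir("./aclImdb"))
print(os.listdir("./aclImdb/train"))
print(os.listdir("./aclImdb/test"))
```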

## Building the Model

Now, let's move on to the exciting part - defining our model!
We first need an embedding layer, i.e., a vector for every token in our input sequence.
Contributor:

A layer that maps every token in the input sequence to a vector.

Contributor Author (@abheesht17) left a comment:

Thanks for the review, @chenmoneygithub!

"""
Title: Text Classification using FNet
Author: [Abheesht Sharma](https://github.com/abheesht17/)
Date created: 2021/06/01
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, man. Looks like I'm still mentally stuck in 2021 😆 . Changed!

"""

"""shell
ls aclImdb
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I'll have to check this by generating the .ipynb file. Ideally, it should print all. I'll generate the .iypnb file and let you know.

@fchollet (Contributor) left a comment:

Thanks!

from tensorflow import keras
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

random.seed(42)
Contributor:

Don't do this; instead use `keras.utils.set_random_seed()`.
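For reference, the suggested call; `keras.utils.set_random_seed` seeds Python's `random`, NumPy, and TensorFlow in one go:

```python
from tensorflow import keras

# Replaces separate random.seed / np.random.seed / tf.random.set_seed calls.
keras.utils.set_random_seed(42)
```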

import os

from tensorflow import keras
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab
Contributor:

Same as in the other example -- this should be hidden away.

our labelled `tf.data.Dataset` dataset from text files.
"""

train_ds = tf.keras.utils.text_dataset_from_directory(
Contributor:

Use `from tensorflow import keras` so you don't have to use `tf.keras` everywhere.
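With `keras` imported directly, the loading call reads roughly like this; the batch size is a placeholder, and the unlabeled `unsup` folder shipped inside `aclImdb/train` would need to be deleted first so it is not picked up as a third class:

```python
from tensorflow import keras

train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=64)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=64)
```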


"""
### Tokenizing the Data
We'll be using the `keras_nlp.tokenizers.WordPieceTokenizer` layer to tokenize
Contributor:

Add line break above.


"""
Every vocabulary has a few special, reserved tokens. We have four such tokens:
- `"[PAD]"` - Padding token. Padding tokens are appended to the input sequence length
Contributor:

Add line break before list.



"""
## Formatting the Dataset
Contributor:

Only capitalize the first word in a section title.

Contributor Author (@abheesht17) left a comment:

Thanks for the review, @fchollet! Addressed your comments :)

@fchollet (Contributor) left a comment:

LGTM, thanks for the updates!

1. The authors claim that FNet is 80% faster than BERT on GPUs and 70% faster
on TPUs.
The reason for this speed-up is two-fold:
a. The Fourier Transform layer is unparametrized, it does not have any parameters!
Contributor:

Does this get rendered as a nested list when you generate the website?

@fchollet (Contributor) left a comment:

Thank you for the great contribution! 👍

@abheesht17 (Contributor Author) commented on Jun 29, 2022:

Thank you, @mattdangerw, @fchollet, @chenmoneygithub for the review comments and the approval! :) Have to make a teensy change, don't merge it just yet.

@abheesht17 (Contributor Author) commented on Jun 29, 2022:

Done making the final changes!

@fchollet (Contributor) commented:

> Continuous integration / black (pull_request) Failing after 23s — black

Please fix the code formatting.

@abheesht17 (Contributor Author) commented:

> Continuous integration / black (pull_request) Failing after 23s — black
>
> Please fix the code formatting.

(keras_io) abheesht@LAPTOP-M2NKFTLU:~/repos/keras-io$ black examples/nlp/fnet_classification_with_keras_nlp.py
All done! ✨ 🍰 ✨
1 file left unchanged.
(keras_io) abheesht@LAPTOP-M2NKFTLU:~/repos/keras-io$ python3
Python 3.8.10 (default, Mar 15 2022, 12:22:08) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import black
>>> black.__version__
'19.10b0'

On running `pip install -r requirements.txt`, black 19.10b0 gets installed. But looking here, it seems like we use black 22.1.0. Anyway, I formatted the file with the latest version of black!

@fchollet (Contributor) commented:

The CI error is unrelated. I'll merge. Thank you!

@fchollet merged commit 0d933fc into keras-team:master on Jun 29, 2022.
Successfully merging this pull request may close these issues:

- Add a "Text Classification with FNet" Example on keras.io

4 participants