
Example on multimodal entailment #581

Merged
merged 6 commits into from
Aug 15, 2021

Conversation


@sayakpaul sayakpaul commented Aug 9, 2021

Textual entailment is a well-studied problem and is also a part of the GLUE benchmark. It is also highly useful for curating and moderating content on social media platforms. As we know, social media content spans different data modalities: text, images, videos, audio, etc. Because of the nature of the interactions that take place on these platforms, it may be difficult to learn the entailment task from a single modality alone. This is beautifully presented in this ACL tutorial: https://multimodal-entailment.github.io/.

The tutorial mentioned above introduces a multimodal entailment dataset consisting of tweets and corresponding images labeled as entailment, contradictory, and no_entailment.

This example presents a baseline implementation of a multimodal model on this dataset to encourage further research in the field. Here is a Colab Notebook version.

@google-cla google-cla bot added the cla: yes label Aug 9, 2021
Comment on lines +505 to +507
# Fetch the embedding projections.
vision_projections = vision_encoder([image_1, image_2])
text_projections = text_encoder(text_inputs)
Contributor Author

@fchollet I want to incorporate a modality dropout trick here to account for the following situation: what if one of the inputs, or an entire input pair, is missing during inference? How do we make our model perform well in those situations?

Modality dropout can be a useful recipe here. I am thinking of doing something like the following:

# Fetch the embedding projections.
vision_projections = vision_encoder([image_1, image_2])
vision_projections = keras.layers.Dropout(0.2)(vision_projections)
text_projections = text_encoder(text_inputs)
text_projections = keras.layers.Dropout(0.2)(text_projections)

Does this make sense? Or should we introduce Dropout at the inputs themselves? I had tried that, but it fails at the text inputs.
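A stricter variant of the same idea can be sketched in plain NumPy (this is a hypothetical illustration of modality dropout, not the PR's implementation): instead of dropping random units, occasionally zero out one entire modality's projection during training so the classifier learns to cope with a missing input at inference.

```python
import numpy as np

rng = np.random.default_rng(42)

def modality_dropout(vision_proj, text_proj, drop_prob=0.2, training=True):
    # Hypothetical helper: with probability `drop_prob`, zero out one entire
    # modality's projection so the downstream classifier cannot rely on
    # always seeing both modalities at once.
    if training and rng.random() < drop_prob:
        # Pick which modality to drop with equal probability.
        if rng.random() < 0.5:
            vision_proj = np.zeros_like(vision_proj)
        else:
            text_proj = np.zeros_like(text_proj)
    return vision_proj, text_proj
```

At inference (`training=False`) the projections pass through unchanged, mirroring how `keras.layers.Dropout` behaves outside of training.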

Contributor Author

I had another idea on incorporating cross-attention here:

vision_projections = vision_encoder([image_1, image_2])
text_projections = text_encoder(text_inputs)

# Cross-attention (Luong-style).
query_value_attention_seq = keras.layers.Attention(use_scale=True, dropout=0.2)(
    [vision_projections, text_projections]
)

# Concatenate the projections and pass through the classification layer.
concatenated = keras.layers.Concatenate()([vision_projections, text_projections])
contextual = keras.layers.Concatenate()([concatenated, query_value_attention_seq])
outputs = keras.layers.Dense(3, activation="softmax")(contextual)

Would love to get your thoughts on both of the above ideas.
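For reference, the Luong-style dot-product attention that the `keras.layers.Attention([query, value])` call computes can be sketched in plain NumPy (a minimal illustration, assuming vision projections as queries attending over text projections as values; dropout omitted):

```python
import numpy as np

def luong_attention(query, value, scale=1.0):
    # Dot-product scores between each query and each value vector.
    scores = scale * query @ value.T                  # shape (Tq, Tv)
    # Softmax over the value axis, computed stably.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mixture of the value vectors.
    return weights @ value                            # shape (Tq, dim)
```

With `use_scale=True`, the Keras layer learns the `scale` factor instead of keeping it fixed.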

Contributor

Both of these sound fine to me as a way to add regularization. Any particular concern you had?

Contributor Author

Nothing in particular. I just wanted to verify that the implementations are correct with respect to the ideas.

@fchollet fchollet requested a review from rchao August 10, 2021 19:57
@sayakpaul
Contributor Author

@rchao @fchollet just wanted to gently ping about this PR.

@rchao
Contributor

rchao commented Aug 13, 2021

@sayakpaul sorry for the delay. It's been quite busy and we're a bit short-staffed, so thanks for your patience. We'll try to get to this ASAP.

Contributor

@fchollet fchollet left a comment

Thanks for the PR! Super cool example as usual :)


The original dataset is available
[here](https://github.com/google-research-datasets/recognizing-multimodal-entailment).
However, we will be using a better prepared version of the dataset. Thanks to
Contributor

"we will be using a better prepared version" -> please detail how it was modified from the original and why.

examples/nlp/multimodal_entailment.py (outdated; resolved)
"""

# 10% for test
train_df, test_df = train_test_split(
Contributor

Don't use sklearn for this, just do it by hand, it will be more explicit

Contributor Author

I wanted to, but since we are doing a stratified split, I went with sklearn.
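For comparison, a stratified split could also be done by hand with pandas (a hypothetical sketch, assuming the DataFrame has a `label` column; the PR uses sklearn's `train_test_split` with `stratify` instead):

```python
import pandas as pd

def stratified_split(df, label_col="label", test_frac=0.1, seed=42):
    # Sample `test_frac` of each label group for the test set, so every
    # class keeps roughly the same proportion in both splits.
    test_df = df.groupby(label_col, group_keys=False).sample(
        frac=test_frac, random_state=seed
    )
    # Everything not sampled into the test set becomes the train set.
    train_df = df.drop(test_df.index)
    return train_df, test_df
```

This is more explicit, though sklearn's version also handles shuffling and edge cases (e.g. very small classes) for free.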

examples/nlp/multimodal_entailment.py (outdated; resolved)
text_1 = tf.convert_to_tensor([text_1])
text_2 = tf.convert_to_tensor([text_2])
output = bert_preprocess_model([text_1, text_2])
output = {feature: tf.squeeze(output[feature]) for feature in bert_input_features}
Contributor

Why the squeeze?

Contributor Author

It is because the preprocessor adds a batch dimension, which wouldn't have been a problem if we had applied the preprocessor on batches. For a single sample, the expected shape is (128,) with the squeeze included.
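The squeeze behavior can be illustrated with plain NumPy (a minimal sketch, assuming the sequence length of 128 used in the example; the feature name is one of the BERT preprocessor's outputs):

```python
import numpy as np

# The preprocessor returns each feature with an extra leading batch
# dimension, e.g. shape (1, 128) for a single sample.
features = {"input_word_ids": np.zeros((1, 128), dtype=np.int32)}

# Squeezing drops the singleton batch axis so a single sample has
# shape (128,), which is what downstream batching expects.
squeezed = {name: np.squeeze(arr) for name, arr in features.items()}
```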

Since we are processing the images and the texts simultaneously, I tried to keep their preprocessing steps as coherent as possible. However, I couldn't figure out how to apply the following logic to batches:

def read_resize(image_path):
    # Split on "." so the last component is the file extension.
    extension = tf.strings.split(image_path, ".")[-1]

    image = tf.io.read_file(image_path)
    if extension == b"jpg":
        image = tf.image.decode_jpeg(image, 3)
    else:
        image = tf.image.decode_png(image, 3)
    # `resize` is a (height, width) tuple defined earlier in the example.
    image = tf.image.resize(image, resize)
    return image

Contributor

Do you still get this with TF nightly?

examples/nlp/multimodal_entailment.py (outdated; resolved)
examples/nlp/multimodal_entailment.py (outdated; resolved)
examples/nlp/multimodal_entailment.py (outdated; resolved)
@fchollet fchollet removed the request for review from rchao August 14, 2021 13:49
@sayakpaul
Contributor Author

@fchollet addressed most of your comments and also added detailed notes about modality dropout and cross-attention.

Contributor

@fchollet fchollet left a comment

Awesome! I applied copyedit changes. Please pull them and add the generated files.

@sayakpaul
Contributor Author

@fchollet done. Thank you.

@fchollet
Contributor

Thank you for the great contribution 👍

@fchollet fchollet merged commit e37d0eb into keras-team:master Aug 15, 2021
@fchollet
Contributor

Also, fun fact: this is the 100th code example on keras.io!

@sayakpaul
Contributor Author

Glad to have hit that century. Here's to 500 more.

Immensely thankful for all that you do to uplift the community.
