Example on multimodal entailment #581
Conversation
# Fetch the embedding projections.
vision_projections = vision_encoder([image_1, image_2])
text_projections = text_encoder(text_inputs)
@fchollet I want to incorporate a modality dropout trick here to account for the following situation: what if one of the inputs (or an entire input pair) is not present during inference? How do we make our model perform well in that situation?
Modality dropout is apparently a useful recipe for this. I am thinking of doing something like the following:
# Fetch the embedding projections.
vision_projections = vision_encoder([image_1, image_2])
vision_projections = keras.layers.Dropout(0.2)(vision_projections)
text_projections = text_encoder(text_inputs)
text_projections = keras.layers.Dropout(0.2)(text_projections)
Does this make sense? Or should we introduce Dropout at the inputs themselves? I tried that, but it fails at the text inputs.
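For completeness, here is a rough sketch of what I mean by dropping an entire modality at once (rather than individual units); the layer name and rate are placeholders, not something already in the example:
import tensorflow as tf
from tensorflow import keras

class ModalityDropout(keras.layers.Layer):
    """Zeroes an entire modality's projection for a random subset of samples."""

    def __init__(self, rate=0.2, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate

    def call(self, inputs, training=None):
        if not training:
            return inputs
        # One Bernoulli draw per sample: keep or drop the whole projection.
        batch_size = tf.shape(inputs)[0]
        keep = tf.cast(tf.random.uniform((batch_size, 1)) >= self.rate, inputs.dtype)
        # Rescale (inverted dropout) so the expected magnitude stays the same.
        return inputs * keep / (1.0 - self.rate)

vision_projections = ModalityDropout(0.2)(vision_projections)
text_projections = ModalityDropout(0.2)(text_projections)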
I had another idea, which incorporates cross-attention:
vision_projections = vision_encoder([image_1, image_2])
text_projections = text_encoder(text_inputs)
# Cross-attention (Luong-style).
query_value_attention_seq = keras.layers.Attention(use_scale=True, dropout=0.2)(
[vision_projections, text_projections]
)
# Concatenate the projections and pass through the classification layer.
concatenated = keras.layers.Concatenate()([vision_projections, text_projections])
contextual = keras.layers.Concatenate()([concatenated, query_value_attention_seq])
outputs = keras.layers.Dense(3, activation="softmax")(contextual)
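For context, a hedged sketch of how this head could be wired into a model -- image_1, image_2, and text_inputs are assumed to be the keras.Input placeholders from the example, and the loss assumes integer-encoded labels:
multimodal_model = keras.Model(
    inputs=[image_1, image_2, text_inputs], outputs=outputs
)
multimodal_model.compile(
    optimizer=keras.optimizers.Adam(),
    # Sparse categorical crossentropy assumes the three labels are integer-encoded.
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)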
Would love to get your thoughts on both of the above ideas.
Both of these sound fine to me as a way to add regularization. Any particular concern you had?
Nothing in particular. I just wanted to verify that the implementations are correct with respect to the ideas.
@sayakpaul sorry for the delay. It's been quite busy and we're a bit short-staffed, so thanks for your patience. We'll try to get to this asap.
Thanks for the PR! Super cool example as usual :)
The original dataset is available
[here](https://github.com/google-research-datasets/recognizing-multimodal-entailment).
However, we will be using a better prepared version of the dataset. Thanks to
"we will be using a better prepared version" -> please detail how it was modified from the original and why.
""" | ||
|
||
# 10% for test | ||
train_df, test_df = train_test_split( |
Don't use sklearn for this, just do it by hand, it will be more explicit
I wanted to, but since we are doing a stratified split, I went with sklearn.
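If avoiding the sklearn dependency is preferred, a rough by-hand version with pandas could look like this (the helper name is made up, and the label column is assumed to be called "label"):
import pandas as pd

def stratified_split(df, label_col="label", test_frac=0.1, seed=42):
    train_parts, test_parts = [], []
    # Sample the same fraction from every label group to preserve class balance.
    for _, group in df.groupby(label_col):
        test_group = group.sample(frac=test_frac, random_state=seed)
        test_parts.append(test_group)
        train_parts.append(group.drop(test_group.index))
    return pd.concat(train_parts), pd.concat(test_parts)

train_df, test_df = stratified_split(df)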
text_1 = tf.convert_to_tensor([text_1])
text_2 = tf.convert_to_tensor([text_2])
output = bert_preprocess_model([text_1, text_2])
output = {feature: tf.squeeze(output[feature]) for feature in bert_input_features}
Why the squeeze?
It's because the preprocessor adds a batch dimension, which wouldn't have been a problem if we had applied the preprocessor to batches. For a single sample, the expected shape is (128,), which the squeeze gives us.
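To illustrate with shapes (the strings below are placeholders; input_word_ids is one of the standard BERT preprocessor output features):
text_1 = tf.convert_to_tensor(["The first tweet's text"])
text_2 = tf.convert_to_tensor(["The second tweet's text"])
output = bert_preprocess_model([text_1, text_2])
print(output["input_word_ids"].shape)  # (1, 128): the preprocessor adds a batch dimension.
output = {feature: tf.squeeze(output[feature]) for feature in bert_input_features}
print(output["input_word_ids"].shape)  # (128,): the shape we want for a single sample.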
Since we are processing the images and the texts simultaneously, I tried to keep their preprocessing steps as coherent as possible. To that end, I got confused about how to apply the following logic to batches:
def read_resize(image_path):
extension = tf.strings.split(image_path)[-1]
image = tf.io.read_file(image_path)
if extension == b"jpg":
image = tf.image.decode_jpeg(image, 3)
else:
image = tf.image.decode_png(image, 3)
image = tf.image.resize(image, resize)
return image
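One alternative I am considering, which sidesteps the branch on the extension entirely (and therefore the batching question), is tf.image.decode_image; resize here is assumed to be the (height, width) tuple defined earlier in the example:
def read_resize(image_path):
    image = tf.io.read_file(image_path)
    # decode_image handles both JPEG and PNG; expand_animations=False keeps the output 3-D.
    image = tf.image.decode_image(image, channels=3, expand_animations=False)
    image = tf.image.resize(image, resize)
    return image
Applied per element via dataset.map(read_resize), batching never enters the picture.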
Do you still get this with TF nightly?
@fchollet addressed most of your comments and also added detailed notes about modality dropout and cross-attention.
Awesome! I applied copyedit changes. Please pull them and add the generated files.
@fchollet done. Thank you.
Thank you for the great contribution 👍
Also, fun fact: this is the 100th code example on keras.io!
Glad to have hit that century. Here's to 500 more. Immensely thankful for all that you do to uplift the community.
Textual entailment is a well-studied problem and is also part of the GLUE benchmark. It's also highly useful for curating and moderating content on social media platforms. As we know, social media content spans different data modalities -- text, images, videos, audio, etc. Because of the nature of interaction that takes place on these platforms, it may be difficult to learn the entailment task from a single modality alone. This is beautifully presented in this ACL tutorial: https://multimodal-entailment.github.io/.
The tutorial mentioned above introduces a multimodal entailment dataset consisting of tweets and corresponding images labeled as entailment, contradictory, and no_entailment.
This example presents a baseline implementation of a multimodal model on the said dataset for encouraging further research in the field. Here is a Colab Notebook version.