Example on multimodal entailment #581
Conversation
# Fetch the embedding projections.
vision_projections = vision_encoder([image_1, image_2])
text_projections = text_encoder(text_inputs)
@fchollet I want to incorporate a modality dropout trick here to account for the following situation: what if one of the inputs (or an entire input pair) is not present during inference? How do we make our model perform well in that situation?
Modality dropout is apparently a useful recipe for this. I am thinking of doing something like the following:
# Fetch the embedding projections.
vision_projections = vision_encoder([image_1, image_2])
vision_projections = keras.layers.Dropout(0.2)(vision_projections)
text_projections = text_encoder(text_inputs)
text_projections = keras.layers.Dropout(0.2)(text_projections)
Does this make sense? Or should we introduce Dropout at the inputs themselves? I tried that, but it fails at the text inputs.
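For completeness, here is a rough sketch of what I mean by dropping an entire modality at once (rather than individual units); the layer name and rate are placeholders, not something already in the example:
import tensorflow as tf
from tensorflow import keras

class ModalityDropout(keras.layers.Layer):
    """Zeroes an entire modality's projection for a random subset of samples."""

    def __init__(self, rate=0.2, **kwargs):
        super().__init__(**kwargs)
        self.rate = rate

    def call(self, inputs, training=None):
        if not training:
            return inputs
        # One Bernoulli draw per sample: keep or drop the whole projection.
        batch_size = tf.shape(inputs)[0]
        keep = tf.cast(tf.random.uniform((batch_size, 1)) >= self.rate, inputs.dtype)
        # Rescale (inverted dropout) so the expected magnitude stays the same.
        return inputs * keep / (1.0 - self.rate)

vision_projections = ModalityDropout(0.2)(vision_projections)
text_projections = ModalityDropout(0.2)(text_projections)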
I had another idea, which incorporates cross-attention:
vision_projections = vision_encoder([image_1, image_2])
text_projections = text_encoder(text_inputs)
# Cross-attention (Luong-style).
query_value_attention_seq = keras.layers.Attention(use_scale=True, dropout=0.2)(
[vision_projections, text_projections]
)
# Concatenate the projections and pass through the classification layer.
concatenated = keras.layers.Concatenate()([vision_projections, text_projections])
contextual = keras.layers.Concatenate()([concatenated, query_value_attention_seq])
outputs = keras.layers.Dense(3, activation="softmax")(contextual)
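For context, a hedged sketch of how this head could be wired into a model -- image_1, image_2, and text_inputs are assumed to be the keras.Input placeholders from the example, and the loss assumes integer-encoded labels:
multimodal_model = keras.Model(
    inputs=[image_1, image_2, text_inputs], outputs=outputs
)
multimodal_model.compile(
    optimizer=keras.optimizers.Adam(),
    # Sparse categorical crossentropy assumes the three labels are integer-encoded.
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)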
Would love to get your thoughts on both of the above ideas.
Both of these sound fine to me as a way to add regularization. Any particular concern you had?
Nothing in particular. I just wanted to verify that the implementations are correct with respect to the ideas.
@sayakpaul sorry for the delay. It's been quite busy and we're a bit short-staffed, so thanks for your patience. We'll try to get to this asap.
Thanks for the PR! Super cool example as usual :)
The original dataset is available
[here](https://github.com/google-research-datasets/recognizing-multimodal-entailment).
However, we will be using a better prepared version of the dataset. Thanks to
"we will be using a better prepared version" -> please detail how it was modified from the original and why.
""" | ||
|
||
# 10% for test | ||
train_df, test_df = train_test_split( |
Don't use sklearn for this, just do it by hand, it will be more explicit
I wanted to, but since we are doing a stratified split, I went with sklearn.
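If avoiding the sklearn dependency is preferred, a rough by-hand version with pandas could look like this (the helper name is made up, and the label column is assumed to be called "label"):
import pandas as pd

def stratified_split(df, label_col="label", test_frac=0.1, seed=42):
    train_parts, test_parts = [], []
    # Sample the same fraction from every label group to preserve class balance.
    for _, group in df.groupby(label_col):
        test_group = group.sample(frac=test_frac, random_state=seed)
        test_parts.append(test_group)
        train_parts.append(group.drop(test_group.index))
    return pd.concat(train_parts), pd.concat(test_parts)

train_df, test_df = stratified_split(df)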
text_1 = tf.convert_to_tensor([text_1])
text_2 = tf.convert_to_tensor([text_2])
output = bert_preprocess_model([text_1, text_2])
output = {feature: tf.squeeze(output[feature]) for feature in bert_input_features}
Why the squeeze?
It's because the preprocessor adds a batch dimension, which wouldn't have been a problem if we had applied the preprocessor to batches. For a single sample, the expected shape is (128,), which the squeeze gives us.
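To illustrate with shapes (the strings below are placeholders; input_word_ids is one of the standard BERT preprocessor output features):
text_1 = tf.convert_to_tensor(["The first tweet's text"])
text_2 = tf.convert_to_tensor(["The second tweet's text"])
output = bert_preprocess_model([text_1, text_2])
print(output["input_word_ids"].shape)  # (1, 128): the preprocessor adds a batch dimension.
output = {feature: tf.squeeze(output[feature]) for feature in bert_input_features}
print(output["input_word_ids"].shape)  # (128,): the shape we want for a single sample.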
Since we are processing the images and the texts simultaneously, I tried to keep their preprocessing steps as coherent as possible. To that end, I got confused about how to apply the following logic to batches:
def read_resize(image_path):
extension = tf.strings.split(image_path)[-1]
image = tf.io.read_file(image_path)
if extension == b"jpg":
image = tf.image.decode_jpeg(image, 3)
else:
image = tf.image.decode_png(image, 3)
image = tf.image.resize(image, resize)
return image
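One alternative I am considering, which sidesteps the branch on the extension entirely (and therefore the batching question), is tf.image.decode_image; resize here is assumed to be the (height, width) tuple defined earlier in the example:
def read_resize(image_path):
    image = tf.io.read_file(image_path)
    # decode_image handles both JPEG and PNG; expand_animations=False keeps the output 3-D.
    image = tf.image.decode_image(image, channels=3, expand_animations=False)
    image = tf.image.resize(image, resize)
    return image
Applied per element via dataset.map(read_resize), batching never enters the picture.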
Do you still get this with TF nightly?
@fchollet addressed most of your comments and also added detailed notes about modality dropout and cross-attention.
Awesome! I applied copyedit changes. Please pull them and add the generated files.
@fchollet done. Thank you.
Thank you for the great contribution 👍
Also, fun fact: this is the 100th code example on keras.io!
Glad to have hit that century. Here's to 500 more. Immensely thankful for all that you do to uplift the community.
Textual entailment is a well-studied problem and is also part of the GLUE benchmark. It's also highly useful for curating and moderating content on social media platforms. As we know, social media content spans different data modalities -- text, images, videos, audio, etc. Because of the nature of interaction that takes place on these platforms, it may be difficult to learn the entailment task from a single modality alone. This is beautifully presented in this ACL tutorial: https://multimodal-entailment.github.io/.
The tutorial mentioned above introduces a multimodal entailment dataset consisting of tweets and corresponding images labeled as entailment, contradictory, and no_entailment.
This example presents a baseline implementation of a multimodal model on the said dataset for encouraging further research in the field. Here is a Colab Notebook version.