forked from tensorflow/hub
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Colab demonstrating basic usage of multilingual universal encoder (un…
…iversal-encoder-xling-many) module. PiperOrigin-RevId: 233990556
- Loading branch information
1 parent
d9e6efb
commit 61113ef
Showing
1 changed file
with
335 additions
and
0 deletions.
There are no files selected for viewing
335 changes: 335 additions & 0 deletions
335
examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,335 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "RUymE2l9GZfO" | ||
}, | ||
"source": [ | ||
"##### Copyright 2019 The TensorFlow Hub Authors.\n", | ||
"\n", | ||
"Licensed under the Apache License, Version 2.0 (the \"License\");" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"cellView": "code", | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "JMyTNwSJGGWg" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Copyright 2019 The TensorFlow Hub Authors. All Rights Reserved.\n", | ||
"#\n", | ||
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n", | ||
"# you may not use this file except in compliance with the License.\n", | ||
"# You may obtain a copy of the License at\n", | ||
"#\n", | ||
"# http://www.apache.org/licenses/LICENSE-2.0\n", | ||
"#\n", | ||
"# Unless required by applicable law or agreed to in writing, software\n", | ||
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n", | ||
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", | ||
"# See the License for the specific language governing permissions and\n", | ||
"# limitations under the License.\n", | ||
"# ==============================================================================" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "co7MV6sX7Xto" | ||
}, | ||
"source": [ | ||
"# Multilingual Universal Sentence Encoder\n", | ||
"\n", | ||
"\n", | ||
"\u003ctable align=\"left\"\u003e\u003ctd\u003e\n", | ||
" \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb\"\u003e\n", | ||
" \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\n", | ||
" \u003c/a\u003e\n", | ||
"\u003c/td\u003e\u003ctd\u003e\n", | ||
" \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb\"\u003e\n", | ||
" \u003cimg width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n", | ||
"\u003c/td\u003e\u003c/table\u003e\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "eAVQGidpL8v5" | ||
}, | ||
"source": [ | ||
"This notebook illustrates how to access the Multilingual Universal Sentence Encoder module and use it for sentence similarity across multiple languages. This module is an extension of the [original Universal Encoder module](https://tfhub.dev/google/universal-sentence-encoder/2)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "pOTzp8O36CyQ" | ||
}, | ||
"source": [ | ||
"# Getting Started\n", | ||
"\n", | ||
"This section sets up the environment for access to the Multilingual Universal Sentence Encoder Module and also prepares a set of English sentences and their translations. In the following sections, the multilingual module will be used to compute similarity *across languages*." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "lVjNK8shFKOC" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Install the latest TensorFlow version compatible with tf-sentencepiece.\n", | ||
"!pip3 install --quiet tensorflow==1.12.0\n", | ||
"# Install TF-Hub.\n", | ||
"!pip3 install --quiet tensorflow-hub\n", | ||
"!pip3 install --quiet seaborn\n", | ||
"# Install Sentencepiece.\n", | ||
"!pip3 install --quiet tf-sentencepiece" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "63Pd3nJnTl-i" | ||
}, | ||
"source": [ | ||
"More detailed information about installing Tensorflow can be found at [https://www.tensorflow.org/install/](https://www.tensorflow.org/install/)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "MSeY-MUQo2Ha" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import tensorflow as tf\n", | ||
"import tensorflow_hub as hub\n", | ||
"import numpy as np\n", | ||
"import seaborn as sns\n", | ||
"import tf_sentencepiece" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "Q8F4LNGFqOiq" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Some texts of different lengths in different languages.\n", | ||
"english_sentences = [\"dog\", \"Puppies are nice.\", \"I enjoy taking long walks along the beach with my dog.\"]\n", | ||
"spanish_sentences = [\"perro\", \"Los cachorros son agradables.\", \"Disfruto de dar largos paseos por la playa con mi perro.\"]\n", | ||
"german_sentences = [\"Hund\", \"Welpen sind nett.\", \"Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.\"]\n", | ||
"french_sentences = [\"chien\", \"Les chiots sont gentils.\", \"J'aime faire de longues promenades sur la plage avec mon chien.\"]\n", | ||
"italian_sentences = [\"cane\", \"I cuccioli sono carini.\", \"Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.\"]\n", | ||
"chinese_sentences = [\"狗\", \"小狗很好。\", \"我喜欢和我的狗一起沿着海滩散步。\"]\n", | ||
"korean_sentences = [\"개\", \"강아지가 좋다.\", \"나는 나의 산책을 해변을 따라 길게 산책하는 것을 즐긴다.\"]\n", | ||
"japanese_sentences = [\"犬\", \"子犬はいいです\", \"私は犬と一緒にビーチを散歩するのが好きです\"]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "8xdAogbxJDTD" | ||
}, | ||
"source": [ | ||
"## Computing Text Embeddings\n", | ||
"We first precompute the embeddings for all of our sentences." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "weXZqLtTJY9b" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# The 8-language multilingual module. There are also en-es, en-de, and en-fr bilingual modules.\n", | ||
"module_url = \"https://tfhub.dev/google/universal-sentence-encoder-xling-many/1\"\n", | ||
"\n", | ||
"# Set up graph.\n", | ||
"g = tf.Graph()\n", | ||
"with g.as_default():\n", | ||
" text_input = tf.placeholder(dtype=tf.string, shape=[None])\n", | ||
" xling_8_embed = hub.Module(module_url)\n", | ||
" embedded_text = xling_8_embed(text_input)\n", | ||
" init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])\n", | ||
"g.finalize()\n", | ||
"\n", | ||
"# Initialize session.\n", | ||
"session = tf.Session(graph=g)\n", | ||
"session.run(init_op)\n", | ||
"\n", | ||
"# Compute embeddings.\n", | ||
"en_result = session.run(embedded_text, feed_dict={text_input: english_sentences})\n", | ||
"es_result = session.run(embedded_text, feed_dict={text_input: spanish_sentences})\n", | ||
"de_result = session.run(embedded_text, feed_dict={text_input: german_sentences})\n", | ||
"fr_result = session.run(embedded_text, feed_dict={text_input: french_sentences})\n", | ||
"it_result = session.run(embedded_text, feed_dict={text_input: italian_sentences})\n", | ||
"zh_result = session.run(embedded_text, feed_dict={text_input: chinese_sentences})\n", | ||
"ko_result = session.run(embedded_text, feed_dict={text_input: korean_sentences})\n", | ||
"ja_result = session.run(embedded_text, feed_dict={text_input: japanese_sentences})" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "jhLPq6AROyFk" | ||
}, | ||
"source": [ | ||
"## Visualize Embedding Similarity\n", | ||
"With the sentence embeddings now in hand, we can visualize semantic similarity across different languages." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "3Peiq_iSNJNX" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2, plot_title):\n", | ||
" corr = np.inner(embeddings_1, embeddings_2)\n", | ||
" g = sns.heatmap(corr,\n", | ||
" xticklabels=labels_1,\n", | ||
" yticklabels=labels_2,\n", | ||
" vmin=0,\n", | ||
" vmax=1,\n", | ||
" cmap=\"YlOrRd\")\n", | ||
" g.set_yticklabels(g.get_yticklabels(), rotation=0)\n", | ||
" g.set_title(plot_title)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "imn28LCiQO7d" | ||
}, | ||
"source": [ | ||
"### English-Italian Similarity" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "X9uD3DirPIGd" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"visualize_similarity(en_result, it_result, english_sentences, italian_sentences, \"English-Italian Similarity\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "BJkL6Az0QXNN" | ||
}, | ||
"source": [ | ||
"### English-Japanese Similarity" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "CH_BXVGhQ0GL" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"visualize_similarity(en_result, ja_result, english_sentences, japanese_sentences, \"English-Japanese Similarity\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "m6ySvEGbQaTM" | ||
}, | ||
"source": [ | ||
"### Italian-Japanese Similarity" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 0, | ||
"metadata": { | ||
"colab": {}, | ||
"colab_type": "code", | ||
"id": "irfwIeitQ7V6" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"visualize_similarity(it_result, ja_result, italian_sentences, japanese_sentences, \"Italian-Japanese Similarity\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"colab_type": "text", | ||
"id": "rRabHHQYQfLr" | ||
}, | ||
"source": [ | ||
"### And more...\n", | ||
"\n", | ||
"The above examples can be extended to any language pair from **English, Spanish, German, French, Italian, Chinese, Korean, and Japanese**. Happy coding!" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"collapsed_sections": [ | ||
"RUymE2l9GZfO" | ||
], | ||
"last_runtime": { | ||
"build_target": "", | ||
"kind": "local" | ||
}, | ||
"name": "Cross-Lingual Similarity with TF-Hub Multilingual Universal Encoder", | ||
"provenance": [], | ||
"version": "0.3.2" | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |