Switches example similarity plots to Bokeh from Seaborn for better support of multilingual plotting and adds a new large multilingual example aligning text from many different languages to English.

New small example plots are added for English-Arabic, English-Russian, English-Chinese, English-Korean, and Chinese-Korean.

PiperOrigin-RevId: 259373475
TensorFlow Hub Authors authored and vbardiovskyg committed Jul 23, 2019
1 parent 413f103 commit 5e4f840
Showing 1 changed file with 228 additions and 20 deletions.
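The change described above replaces Seaborn's raw dot-product heatmap with a Bokeh heatmap driven by an arccos-based angular similarity, `1 - arccos(cosine_similarity)/π`, which maps cosine similarity from [-1, 1] into [0, 1]. A minimal NumPy-only sketch of that metric (the helper name `angular_similarity` is illustrative, not from the commit):

```python
import numpy as np

def angular_similarity(embeddings_1, embeddings_2):
    # Normalize rows so the matrix product equals cosine similarity.
    a = embeddings_1 / np.linalg.norm(embeddings_1, axis=1, keepdims=True)
    b = embeddings_2 / np.linalg.norm(embeddings_2, axis=1, keepdims=True)
    cos = np.clip(a @ b.T, -1.0, 1.0)  # clip guards arccos against rounding error
    # Map cosine similarity in [-1, 1] to an angular score in [0, 1]:
    # identical direction -> 1.0, orthogonal -> 0.5, opposite -> 0.0.
    return 1 - np.arccos(cos) / np.pi

x = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(angular_similarity(x, x))
```

Compared with a raw dot-product, this score is bounded and spreads out the high-similarity range, which makes heatmap colors easier to compare across language pairs.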
@@ -74,6 +74,22 @@
"* In the second section, we show how to build a semantic search engine from a sample of a Wikipedia corpus in multiple languages."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "UvNRbHGarYeR"
},
"source": [
"## Citation\n",
"\n",
"*Research papers that make use of the models explored in this colab should cite:*\n",
"\n",
"### [Multilingual universal sentence encoder for semantic retrieval](https://arxiv.org/abs/1907.04307)\n",
"Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019.\n",
" arXiv preprint arXiv:1907.04307"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -100,7 +116,7 @@
"cell_type": "code",
"execution_count": 0,
"metadata": {
"cellView": "form",
"cellView": "both",
"colab": {},
"colab_type": "code",
"id": "lVjNK8shFKOC"
@@ -112,7 +128,7 @@
"!pip uninstall --quiet --yes tensorflow\n",
"!pip install --quiet tensorflow-gpu==1.13.1\n",
"!pip install --quiet tensorflow-hub\n",
"!pip install --quiet seaborn\n",
"!pip install --quiet bokeh\n",
"!pip install --quiet tf-sentencepiece\n",
"!pip install --quiet simpleneighbors\n",
"!pip install --quiet tqdm"
@@ -122,36 +138,76 @@
"cell_type": "code",
"execution_count": 0,
"metadata": {
"cellView": "form",
"cellView": "both",
"colab": {},
"colab_type": "code",
"id": "MSeY-MUQo2Ha"
},
"outputs": [],
"source": [
"#@title Setup common imports and functions\n",
"import bokeh\n",
"import bokeh.io\n",
"import bokeh.models\n",
"import bokeh.palettes\n",
"import bokeh.plotting\n",
"import numpy as np\n",
"import os\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import tensorflow as tf\n",
"import tensorflow_hub as hub\n",
"import tf_sentencepiece # Not used directly but needed to import TF ops.\n",
"import sklearn.metrics.pairwise\n",
"\n",
"from simpleneighbors import SimpleNeighbors\n",
"from tqdm import tqdm\n",
"from tqdm import trange\n",
"\n",
"def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2, plot_title):\n",
" corr = np.inner(embeddings_1, embeddings_2)\n",
" chart = sns.heatmap(corr,\n",
" xticklabels=labels_1,\n",
" yticklabels=labels_2,\n",
" vmin=0,\n",
" vmax=1,\n",
" cmap='YlOrRd')\n",
" chart.set_yticklabels(chart.get_yticklabels(), rotation=0)\n",
" chart.set_title(plot_title)"
"def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2,\n",
" plot_title,\n",
" plot_width=1200, plot_height=600,\n",
" xaxis_font_size='11pt', yaxis_font_size='11pt'):\n",
"\n",
" assert len(embeddings_1) == len(labels_1)\n",
" assert len(embeddings_2) == len(labels_2)\n",
"\n",
" # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)\n",
" sim = 1 - np.arccos(\n",
" sklearn.metrics.pairwise.cosine_similarity(embeddings_1,\n",
" embeddings_2))/np.pi\n",
"\n",
" embeddings_1_col, embeddings_2_col, sim_col = [], [], []\n",
" for i in range(len(embeddings_1)):\n",
" for j in range(len(embeddings_2)):\n",
" embeddings_1_col.append(labels_1[i])\n",
" embeddings_2_col.append(labels_2[j])\n",
" sim_col.append(sim[i][j])\n",
" df = pd.DataFrame(zip(embeddings_1_col, embeddings_2_col, sim_col),\n",
" columns=['embeddings_1', 'embeddings_2', 'sim'])\n",
"\n",
" mapper = bokeh.models.LinearColorMapper(\n",
" palette=[*reversed(bokeh.palettes.YlOrRd[9])], low=df.sim.min(),\n",
" high=df.sim.max())\n",
"\n",
" p = bokeh.plotting.figure(title=plot_title, x_range=labels_1,\n",
" x_axis_location=\"above\",\n",
" y_range=[*reversed(labels_2)],\n",
" plot_width=plot_width, plot_height=plot_height,\n",
" tools=\"save\",toolbar_location='below', tooltips=[\n",
" ('pair', '@embeddings_1 ||| @embeddings_2'),\n",
" ('sim', '@sim')])\n",
" p.rect(x=\"embeddings_1\", y=\"embeddings_2\", width=1, height=1, source=df,\n",
" fill_color={'field': 'sim', 'transform': mapper}, line_color=None)\n",
"\n",
" p.title.text_font_size = '12pt'\n",
" p.axis.axis_line_color = None\n",
" p.axis.major_tick_line_color = None\n",
" p.axis.major_label_standoff = 16\n",
" p.xaxis.major_label_text_font_size = xaxis_font_size\n",
" p.xaxis.major_label_orientation = 0.25 * np.pi\n",
" p.yaxis.major_label_text_font_size = yaxis_font_size\n",
" p.min_border_right = 300\n",
"\n",
" bokeh.io.output_notebook()\n",
" bokeh.io.show(p)\n"
]
},
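The double loop in `visualize_similarity` above flattens the (m × n) similarity matrix into the long-format table that Bokeh's `rect` glyph consumes: one row per (label_1, label_2, sim) cell. A standalone sketch of just that reshaping step, assuming only NumPy and pandas (the helper name `to_long_format` is illustrative):

```python
import numpy as np
import pandas as pd

def to_long_format(sim, labels_1, labels_2):
    # One output row per heatmap cell: (x label, y label, similarity value).
    rows = [(labels_1[i], labels_2[j], sim[i][j])
            for i in range(len(labels_1))
            for j in range(len(labels_2))]
    return pd.DataFrame(rows, columns=['embeddings_1', 'embeddings_2', 'sim'])

sim = np.array([[1.0, 0.5], [0.4, 1.0]])
print(to_long_format(sim, ['dog', 'puppy'], ['狗', '小狗']))
```

Bokeh then maps the `sim` column through the `LinearColorMapper` to color each rect, whereas Seaborn computed the grid layout internally from the matrix.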
{
@@ -226,14 +282,20 @@
"outputs": [],
"source": [
"# Some texts of different lengths in different languages.\n",
"arabic_sentences = ['كلب', 'الجراء لطيفة.', 'أستمتع بالمشي لمسافات طويلة على طول الشاطئ مع كلبي.']\n",
"chinese_sentences = ['狗', '小狗很好。', '我喜欢和我的狗一起沿着海滩散步。']\n",
"english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']\n",
"spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']\n",
"german_sentences = ['Hund', 'Welpen sind nett.', 'Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.']\n",
"french_sentences = ['chien', 'Les chiots sont gentils.', 'J\\'aime faire de longues promenades sur la plage avec mon chien.']\n",
"german_sentences = ['Hund', 'Welpen sind nett.', 'Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.']\n",
"italian_sentences = ['cane', 'I cuccioli sono carini.', 'Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.']\n",
"chinese_sentences = ['狗', '小狗很好。', '我喜欢和我的狗一起沿着海滩散步。']\n",
"japanese_sentences = ['犬', '子犬はいいです', '私は犬と一緒にビーチを散歩するのが好きです']\n",
"korean_sentences = ['개', '강아지가 좋다.', '나는 나의 개와 해변을 따라 길게 산책하는 것을 즐긴다.']\n",
"japanese_sentences = ['犬', '子犬はいいです', '私は犬と一緒にビーチを散歩するのが好きです']"
"russian_sentences = ['собака', 'Милые щенки.', 'Мне нравится подолгу гулять по пляжу со своей собакой.']\n",
"spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']\n",
"\n",
"# Multilingual example\n",
"multilingual_example = [\"Willkommen zu einfachen, aber\", \"verrassend krachtige\", \"multilingüe\", \"compréhension du langage naturel\", \"модели.\", \"人们的意思是什么\" , \"보다 중요한\", \".اللغة التي يتحدثونها\"]\n",
"multilingual_example_in_en = [\"Welcome to simple yet\", \"surprisingly powerful\", \"multilingual\", \"natural language understanding\", \"models.\", \"What people mean\", \"matters more than\", \"the language they speak.\"]\n"
]
},
{
@@ -247,14 +309,19 @@
"outputs": [],
"source": [
"# Compute embeddings.\n",
"ar_result = session.run(embedded_text, feed_dict={text_input: arabic_sentences})\n",
"en_result = session.run(embedded_text, feed_dict={text_input: english_sentences})\n",
"es_result = session.run(embedded_text, feed_dict={text_input: spanish_sentences})\n",
"de_result = session.run(embedded_text, feed_dict={text_input: german_sentences})\n",
"fr_result = session.run(embedded_text, feed_dict={text_input: french_sentences})\n",
"it_result = session.run(embedded_text, feed_dict={text_input: italian_sentences})\n",
"zh_result = session.run(embedded_text, feed_dict={text_input: chinese_sentences})\n",
"ja_result = session.run(embedded_text, feed_dict={text_input: japanese_sentences})\n",
"ko_result = session.run(embedded_text, feed_dict={text_input: korean_sentences})\n",
"ja_result = session.run(embedded_text, feed_dict={text_input: japanese_sentences})"
"ru_result = session.run(embedded_text, feed_dict={text_input: russian_sentences})\n",
"zh_result = session.run(embedded_text, feed_dict={text_input: chinese_sentences})\n",
"\n",
"multilingual_result = session.run(embedded_text, feed_dict={text_input: multilingual_example})\n",
"multilingual_in_en_result = session.run(embedded_text, feed_dict={text_input: multilingual_example_in_en})"
]
},
{
@@ -269,6 +336,77 @@
"With the text embeddings in hand, we can visualize how similar sentences are across languages: each cell shows the arccos-based angular similarity between a pair of sentence embeddings. A darker color indicates the embeddings are semantically similar."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "WOEIJA0mh70g"
},
"source": [
"### Multilingual Similarity"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "R2hbCMhmiDWR"
},
"outputs": [],
"source": [
"visualize_similarity(multilingual_in_en_result, multilingual_result,\n",
" multilingual_example_in_en, multilingual_example, \"Multilingual Universal Sentence Encoder for Semantic Retrieval (Yang et al., 2019)\",\n",
" plot_width=1800, plot_height=800, xaxis_font_size=\"17pt\", yaxis_font_size=\"24pt\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "h3TEhllsq3ax"
},
"source": [
"### English-Arabic Similarity"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Q9UDpStmq7Ii"
},
"outputs": [],
"source": [
"visualize_similarity(en_result, ar_result, english_sentences, arabic_sentences, 'English-Arabic Similarity')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "QF9z48HMp4WL"
},
"source": [
"### English-Russian Similarity"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "QE68UejYp86z"
},
"outputs": [],
"source": [
"visualize_similarity(en_result, ru_result, english_sentences, russian_sentences, 'English-Russian Similarity')"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -338,6 +476,75 @@
"visualize_similarity(it_result, es_result, italian_sentences, spanish_sentences, 'Italian-Spanish Similarity')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ueoRO8balwwr"
},
"source": [
"### English-Chinese Similarity"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xA7anofVlxL7"
},
"outputs": [],
"source": [
"visualize_similarity(en_result, zh_result, english_sentences, chinese_sentences, 'English-Chinese Similarity')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "8zV1BJc3mL3W"
},
"source": [
"### English-Korean Similarity"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "iqWy1e1UmQeX"
},
"outputs": [],
"source": [
"visualize_similarity(en_result, ko_result, english_sentences, korean_sentences, 'English-Korean Similarity')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dfTj-JaunFTv"
},
"source": [
"### Chinese-Korean Similarity"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "MndSgKGPnJuF"
},
"outputs": [],
"source": [
"visualize_similarity(zh_result, ko_result, chinese_sentences, korean_sentences, 'Chinese-Korean Similarity')"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -673,6 +880,7 @@
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"machine_shape": "hm",
"name": "Cross-Lingual Similarity and Semantic Search Engine with TF-Hub Multilingual Universal Encoder",
"provenance": [],
"version": "0.3.2"
