diff --git a/README.md b/README.md
index 9fd208e..1272667 100644
--- a/README.md
+++ b/README.md
@@ -49,6 +49,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Text classification](text-classification)
+Trained on [English sentiment dataset](https://github.com/huseinzol05/NLP-Models-Tensorflow/tree/master/text-classification/data)
+
1. Basic cell RNN
2. Bidirectional RNN
3. LSTM cell RNN
@@ -141,6 +143,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Chatbot](chatbot)
+Trained on [Cornell Movie-Dialogs Corpus](https://github.com/huseinzol05/NLP-Models-Tensorflow/blob/master/chatbot/dataset.tar.gz)
+
1. Seq2Seq-manual
2. Seq2Seq-API Greedy
3. Bidirectional Seq2Seq-manual
@@ -214,7 +218,9 @@ I will attached github repositories for models that I not implemented from scrat
-### [Neural Machine Translation (English to Vietnam)](neural-machine-translation)
+### [Neural Machine Translation](neural-machine-translation)
+
+Trained on [500 English-Vietnamese sentence pairs](https://github.com/huseinzol05/NLP-Models-Tensorflow/blob/master/neural-machine-translation/vietnam-train)
+
1. Seq2Seq-manual
2. Seq2Seq-API Greedy
@@ -287,6 +293,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Embedded](embedded)
+Trained on [English sentiment dataset](https://github.com/huseinzol05/NLP-Models-Tensorflow/tree/master/text-classification/data)
+
1. Word Vector using CBOW sample softmax
2. Word Vector using CBOW noise contrastive estimation
3. Word Vector using skipgram sample softmax
@@ -301,6 +309,8 @@ I will attached github repositories for models that I not implemented from scrat
### [POS-Tagging](pos-tagging)
+Trained on [CONLL POS](https://cogcomp.org/page/resource_view/81)
+
1. Bidirectional RNN + CRF, test accuracy 92%
2. Bidirectional RNN + Luong Attention + CRF, test accuracy 91%
3. Bidirectional RNN + Bahdanau Attention + CRF, test accuracy 91%
@@ -312,6 +322,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Entity-Tagging](entity-tagging)
+Trained on [CONLL NER](https://cogcomp.org/page/resource_view/81)
+
1. Bidirectional RNN + CRF, test accuracy 96%
2. Bidirectional RNN + Luong Attention + CRF, test accuracy 93%
3. Bidirectional RNN + Bahdanau Attention + CRF, test accuracy 95%
@@ -323,6 +335,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Dependency-Parser](dependency-parser)
+Trained on [CONLL English Dependency](https://github.com/huseinzol05/NLP-Models-Tensorflow/blob/master/dependency-parser/dev.conll.txt)
+
1. Bidirectional RNN + Bahdanau Attention + CRF
2. Bidirectional RNN + Luong Attention + CRF
3. Residual Network + Bahdanau Attention + CRF
@@ -331,6 +345,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Question-Answers](question-answer)
+Trained on [bAbI Dataset](https://research.fb.com/downloads/babi/)
+
1. End-to-End Memory Network + Basic cell
2. End-to-End Memory Network + GRU cell
3. End-to-End Memory Network + LSTM cell
@@ -338,6 +354,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Stemming](stemming)
+Trained on [English Lemmatization](https://github.com/huseinzol05/NLP-Models-Tensorflow/blob/master/stemming/lemmatization-en.txt)
+
1. LSTM + Seq2Seq + Beam
2. GRU + Seq2Seq + Beam
3. LSTM + BiRNN + Seq2Seq + Beam
@@ -347,6 +365,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Abstractive Summarization](abstractive-summarization)
+Trained on [India news](https://github.com/huseinzol05/NLP-Models-Tensorflow/tree/master/abstractive-summarization/dataset)
+
1. LSTM Seq2Seq using topic modelling
2. LSTM Seq2Seq + Luong Attention using topic modelling
3. LSTM Seq2Seq + Beam Decoder using topic modelling
@@ -361,6 +381,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Extractive Summarization](extractive-summarization)
+Trained on [random books](https://github.com/huseinzol05/NLP-Models-Tensorflow/tree/master/extractive-summarization/books)
+
1. Skip-thought Vector
2. Residual Network using Atrous CNN
3. Residual Network using Atrous CNN + Bahdanau Attention
@@ -371,6 +393,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Speech to Text](speech-to-text)
+Trained on [Toronto speech dataset](https://tspace.library.utoronto.ca/handle/1807/24487)
+
1. Tacotron, https://github.com/Kyubyong/tacotron_asr
2. Bidirectional RNN + Greedy CTC
3. Bidirectional RNN + Beam CTC
@@ -386,6 +410,8 @@ I will attached github repositories for models that I not implemented from scrat
### [Text to Speech](text-to-speech)
+Trained on [Toronto speech dataset](https://tspace.library.utoronto.ca/handle/1807/24487)
+
1. Tacotron, https://github.com/Kyubyong/tacotron
2. Fairseq + Dilated CNN vocoder
3. Seq2Seq + Bahdanau Attention
@@ -397,10 +423,14 @@ I will attached github repositories for models that I not implemented from scrat
### [Old-to-Young Vocoder](vocoder)
+Trained on [Toronto speech dataset](https://tspace.library.utoronto.ca/handle/1807/24487)
+
1. Dilated CNN
### [Generator](generator)
+Trained on [Shakespeare dataset](https://github.com/huseinzol05/NLP-Models-Tensorflow/blob/master/generator/shakespeare.txt)
+
1. Character-wise RNN + LSTM
2. Character-wise RNN + Beam search
3. Character-wise RNN + LSTM + Embedding
@@ -419,20 +449,28 @@ I will attached github repositories for models that I not implemented from scrat
### [Topic Generator](topic-generator)
+Trained on [Malaysia news](https://github.com/huseinzol05/Malaya-Dataset/raw/master/news/news.zip)
+
1. TAT-LSTM
2. TAV-LSTM
3. MTA-LSTM
### [Language-detection](language-detection)
+Trained on [Tatoeba dataset](http://downloads.tatoeba.org/exports/sentences.tar.bz2)
+
1. Fast-text Char N-Grams
### [Text Similarity](text-similarity)
-1. Character wise similarity + LSTM + Bidirectional
-2. Word wise similarity + LSTM + Bidirectional
-3. Character wise similarity Triplet loss + LSTM
-4. Word wise similarity Triplet loss + LSTM
+Trained on [First Quora Dataset Release: Question Pairs](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs)
+
+1. BiRNN + Contrastive loss, test accuracy 76.50%
+2. Dilated CNN + Contrastive loss, test accuracy 72.98%
+3. Transformer + Contrastive loss, test accuracy 73.48%
+4. Dilated CNN + Cross entropy, test accuracy 72.27%
+5. Transformer + Cross entropy, test accuracy 71.1%
+6. Transfer learning BERT base + Cross entropy, test accuracy 90%
### [Text Augmentation](text-augmentation)
diff --git a/text-similarity/1.birnn-contrastive.ipynb b/text-similarity/1.birnn-contrastive.ipynb
new file mode 100644
index 0000000..588cbac
--- /dev/null
+++ b/text-similarity/1.birnn-contrastive.ipynb
@@ -0,0 +1,763 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/jupyter/.local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
+ " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import tensorflow as tf\n",
+ "import re\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from tqdm import tqdm\n",
+ "import collections\n",
+ "from unidecode import unidecode\n",
+ "from sklearn.cross_validation import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_dataset(words, n_words):\n",
+ " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3]]\n",
+ " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
+ " dictionary = dict()\n",
+ " for word, _ in count:\n",
+ " dictionary[word] = len(dictionary)\n",
+ " data = list()\n",
+ " unk_count = 0\n",
+ " for word in words:\n",
+ " index = dictionary.get(word, 0)\n",
+ " if index == 0:\n",
+ " unk_count += 1\n",
+ " data.append(index)\n",
+ " count[0][1] = unk_count\n",
+ " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
+ " return data, count, dictionary, reversed_dictionary\n",
+ "\n",
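+    "# str_idx: right-align each sequence into a fixed-length row of ids; unknown tokens map to UNK, unused positions stay 0 (PAD)\n",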
+ "def str_idx(corpus, dic, maxlen, UNK=3):\n",
+ " X = np.zeros((len(corpus),maxlen))\n",
+ " for i in range(len(corpus)):\n",
+ " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
+ " val = dic[k] if k in dic else UNK\n",
+ " X[i,-1 - no]= val\n",
+ " return X\n",
+ "\n",
+ "def cleaning(string):\n",
+ " string = unidecode(string).replace('.', ' . ').replace(',', ' , ')\n",
+ " string = re.sub('[^A-Za-z\\- ]+', ' ', string)\n",
+ " string = re.sub(r'[ ]+', ' ', string).strip()\n",
+ " return string.lower()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " id qid1 qid2 question1 \\\n",
+ "0 0 1 2 What is the step by step guide to invest in sh... \n",
+ "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
+ "2 2 5 6 How can I increase the speed of my internet co... \n",
+ "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
+ "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
+ "\n",
+ " question2 is_duplicate \n",
+ "0 What is the step by step guide to invest in sh... 0 \n",
+ "1 What would happen if the Indian government sto... 0 \n",
+ "2 How can Internet speed be increased by hacking... 0 \n",
+ "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
+ "4 Which fish would survive in salt water? 0 "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('quora_duplicate_questions.tsv', delimiter='\\t').dropna()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "left, right, label = df['question1'].tolist(), df['question2'].tolist(), df['is_duplicate'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(array([0, 1]), array([255024, 149263]))"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.unique(label, return_counts = True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 404287/404287 [00:07<00:00, 54874.65it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "for i in tqdm(range(len(left))):\n",
+ " left[i] = cleaning(left[i])\n",
+ " right[i] = cleaning(right[i])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "vocab from size: 87661\n",
+ "Most common words [('the', 377593), ('what', 324635), ('is', 269934), ('i', 223893), ('how', 220876), ('a', 212757)]\n",
+ "Sample data [5, 6, 4, 1285, 62, 1285, 2501, 10, 564, 11] ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in']\n"
+ ]
+ }
+ ],
+ "source": [
+ "concat = ' '.join(left + right).split()\n",
+ "vocabulary_size = len(list(set(concat)))\n",
+ "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
+ "print('vocab from size: %d'%(vocabulary_size))\n",
+ "print('Most common words', count[4:10])\n",
+ "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class Model:\n",
+ " def __init__(self, size_layer, num_layers, embedded_size,\n",
+ " dict_size, learning_rate, dropout):\n",
+ " \n",
+ " def cells(size, reuse=False):\n",
+ " cell = tf.nn.rnn_cell.LSTMCell(size,initializer=tf.orthogonal_initializer(),reuse=reuse)\n",
+ " return tf.contrib.rnn.DropoutWrapper(cell,output_keep_prob=dropout)\n",
+ " \n",
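+    "        # encoder: stacked bidirectional LSTMs; the last timestep of the top layer is used as the sequence representation\n",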
+ " def birnn(inputs, scope):\n",
+ " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n",
+ " for n in range(num_layers):\n",
+ " (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(\n",
+ " cell_fw = cells(size_layer // 2),\n",
+ " cell_bw = cells(size_layer // 2),\n",
+ " inputs = inputs,\n",
+ " dtype = tf.float32,\n",
+ " scope = 'bidirectional_rnn_%d'%(n))\n",
+ " inputs = tf.concat((out_fw, out_bw), 2)\n",
+ " return inputs[:,-1]\n",
+ " \n",
+ " self.X_left = tf.placeholder(tf.int32, [None, None])\n",
+ " self.X_right = tf.placeholder(tf.int32, [None, None])\n",
+ " self.Y = tf.placeholder(tf.float32, [None])\n",
+ " self.batch_size = tf.shape(self.X_left)[0]\n",
+ " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
+ " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n",
+ " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n",
+ " \n",
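+    "        # contrastive loss: y * d^2 pulls duplicate pairs together, (1 - y) * max(1 - d, 0)^2 pushes non-duplicates apart, averaged over the batch\n",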
+ " def contrastive_loss(y,d):\n",
+ " tmp= y * tf.square(d)\n",
+ " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n",
+ " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n",
+ " \n",
+ " self.output_left = birnn(embedded_left, 'left')\n",
+ " self.output_right = birnn(embedded_right, 'right')\n",
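+    "        # normalized Euclidean distance: ||left - right|| / (||left|| + ||right||), bounded between 0 and 1\n",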
+ " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),\n",
+ " 1,keep_dims=True))\n",
+ " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),\n",
+ " 1,keep_dims=True)),\n",
+ " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),\n",
+ " 1,keep_dims=True))))\n",
+ " self.distance = tf.reshape(self.distance, [-1])\n",
+ " self.cost = contrastive_loss(self.Y,self.distance)\n",
+ " \n",
+ " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n",
+ " tf.rint(self.distance))\n",
+ " correct_predictions = tf.equal(self.temp_sim, self.Y)\n",
+ " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n",
+ " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "size_layer = 256\n",
+ "num_layers = 2\n",
+ "embedded_size = 128\n",
+ "learning_rate = 1e-3\n",
+ "maxlen = 50\n",
+ "batch_size = 128\n",
+ "dropout = 0.8"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.cross_validation import train_test_split\n",
+ "\n",
+ "vectors_left = str_idx(left, dictionary, maxlen)\n",
+ "vectors_right = str_idx(right, dictionary, maxlen)\n",
+ "train_X_left, test_X_left, train_X_right, test_X_right, train_Y, test_Y = train_test_split(vectors_left,\n",
+ " vectors_right,\n",
+ " label,\n",
+ " test_size = 0.2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Colocations handled automatically by placer.\n",
+ "WARNING:tensorflow:From :6: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.\n",
+ "\n",
+ "WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.\n",
+ "For more information, please see:\n",
+ " * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md\n",
+ " * https://github.com/tensorflow/addons\n",
+ "If you depend on functionality not listed there, please file an issue.\n",
+ "\n",
+ "WARNING:tensorflow:From :17: bidirectional_dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py:443: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Please use `keras.layers.RNN(cell)`, which is equivalent to this API\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn_cell_impl.py:1259: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n",
+ "WARNING:tensorflow:From :37: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "keep_dims is deprecated, use keepdims instead\n",
+ "WARNING:tensorflow:From :41: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Deprecated in favor of operator or tf.math.divide.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n"
+ ]
+ }
+ ],
+ "source": [
+ "tf.reset_default_graph()\n",
+ "sess = tf.InteractiveSession()\n",
+ "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n",
+ "sess.run(tf.global_variables_initializer())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:54<00:00, 3.32it/s, accuracy=0.762, cost=0.0892]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:30<00:00, 7.02it/s, accuracy=0.611, cost=0.114] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.000000, current acc: 0.721205\n",
+ "time taken: 865.0403523445129\n",
+ "epoch: 0, training loss: 0.102127, training acc: 0.692444, valid loss: 0.095351, valid acc: 0.721205\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:50<00:00, 3.28it/s, accuracy=0.733, cost=0.0808]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:31<00:00, 6.93it/s, accuracy=0.644, cost=0.106] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.721205, current acc: 0.743804\n",
+ "time taken: 861.1270344257355\n",
+ "epoch: 0, training loss: 0.092396, training acc: 0.733870, valid loss: 0.089960, valid acc: 0.743804\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:51<00:00, 3.32it/s, accuracy=0.802, cost=0.0735]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:30<00:00, 6.93it/s, accuracy=0.644, cost=0.105] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.743804, current acc: 0.754440\n",
+ "time taken: 861.8881492614746\n",
+ "epoch: 0, training loss: 0.088065, training acc: 0.751199, valid loss: 0.087837, valid acc: 0.754440\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:50<00:00, 3.27it/s, accuracy=0.842, cost=0.0697]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:30<00:00, 7.03it/s, accuracy=0.667, cost=0.104] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.754440, current acc: 0.757023\n",
+ "time taken: 861.7144618034363\n",
+ "epoch: 0, training loss: 0.085004, training acc: 0.764099, valid loss: 0.086727, valid acc: 0.757023\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:51<00:00, 3.22it/s, accuracy=0.812, cost=0.0724]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:31<00:00, 6.93it/s, accuracy=0.633, cost=0.11] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.757023, current acc: 0.760754\n",
+ "time taken: 862.4683222770691\n",
+ "epoch: 0, training loss: 0.082544, training acc: 0.773236, valid loss: 0.085892, valid acc: 0.760754\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:50<00:00, 3.32it/s, accuracy=0.782, cost=0.0759]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:30<00:00, 6.95it/s, accuracy=0.656, cost=0.108] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 861.5845947265625\n",
+ "epoch: 0, training loss: 0.080261, training acc: 0.781377, valid loss: 0.086369, valid acc: 0.757438\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:50<00:00, 3.31it/s, accuracy=0.832, cost=0.0661]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:31<00:00, 6.97it/s, accuracy=0.656, cost=0.102] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.760754, current acc: 0.763077\n",
+ "time taken: 861.8206684589386\n",
+ "epoch: 0, training loss: 0.078398, training acc: 0.788314, valid loss: 0.084990, valid acc: 0.763077\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:50<00:00, 3.29it/s, accuracy=0.842, cost=0.0661]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:30<00:00, 6.96it/s, accuracy=0.656, cost=0.103] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 861.3552474975586\n",
+ "epoch: 0, training loss: 0.076674, training acc: 0.795231, valid loss: 0.085479, valid acc: 0.759256\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:50<00:00, 3.23it/s, accuracy=0.851, cost=0.0647]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:31<00:00, 6.93it/s, accuracy=0.656, cost=0.1] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.763077, current acc: 0.763510\n",
+ "time taken: 861.3164525032043\n",
+ "epoch: 0, training loss: 0.075192, training acc: 0.800024, valid loss: 0.084781, valid acc: 0.763510\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:49<00:00, 3.28it/s, accuracy=0.822, cost=0.0684]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:30<00:00, 6.92it/s, accuracy=0.667, cost=0.107] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.763510, current acc: 0.765012\n",
+ "time taken: 860.8371865749359\n",
+ "epoch: 0, training loss: 0.073777, training acc: 0.805469, valid loss: 0.084846, valid acc: 0.765012\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:51<00:00, 3.27it/s, accuracy=0.842, cost=0.0651]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:31<00:00, 6.90it/s, accuracy=0.644, cost=0.104] \n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 862.1494925022125\n",
+ "epoch: 0, training loss: 0.072904, training acc: 0.808442, valid loss: 0.084983, valid acc: 0.762664\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:50<00:00, 3.29it/s, accuracy=0.802, cost=0.0664]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:31<00:00, 6.96it/s, accuracy=0.678, cost=0.0966]\n",
+ "train minibatch loop: 0%| | 0/2527 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 861.4347906112671\n",
+ "epoch: 0, training loss: 0.072015, training acc: 0.811395, valid loss: 0.084607, valid acc: 0.763842\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [12:48<00:00, 3.33it/s, accuracy=0.851, cost=0.0605]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [01:31<00:00, 6.98it/s, accuracy=0.667, cost=0.0982]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 859.5523777008057\n",
+ "epoch: 0, training loss: 0.070824, training acc: 0.816009, valid loss: 0.085312, valid acc: 0.761277\n",
+ "\n",
+ "break epoch:0\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "\n",
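+    "# early stopping: quit once validation accuracy has not improved for EARLY_STOPPING consecutive evaluations\n",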
+ "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 3, 0, 0, 0\n",
+ "\n",
+ "while True:\n",
+ " lasttime = time.time()\n",
+ " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n",
+ " print('break epoch:%d\\n' % (EPOCH))\n",
+ " break\n",
+ "\n",
+ " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
+ " pbar = tqdm(range(0, len(train_X_left), batch_size), desc='train minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x_left = train_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_x_right = train_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_y = train_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
+ " feed_dict = {model.X_left : batch_x_left, \n",
+ " model.X_right: batch_x_right,\n",
+ " model.Y : batch_y})\n",
+ " assert not np.isnan(loss)\n",
+ " train_loss += loss\n",
+ " train_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " pbar = tqdm(range(0, len(test_X_left), batch_size), desc='test minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x_left = test_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_x_right = test_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_y = test_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " acc, loss = sess.run([model.accuracy, model.cost], \n",
+ " feed_dict = {model.X_left : batch_x_left, \n",
+ " model.X_right: batch_x_right,\n",
+ " model.Y : batch_y})\n",
+ " test_loss += loss\n",
+ " test_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " train_loss /= (len(train_X_left) / batch_size)\n",
+ " train_acc /= (len(train_X_left) / batch_size)\n",
+ " test_loss /= (len(test_X_left) / batch_size)\n",
+ " test_acc /= (len(test_X_left) / batch_size)\n",
+ " \n",
+ " if test_acc > CURRENT_ACC:\n",
+ " print(\n",
+ " 'epoch: %d, pass acc: %f, current acc: %f'\n",
+ " % (EPOCH, CURRENT_ACC, test_acc)\n",
+ " )\n",
+ " CURRENT_ACC = test_acc\n",
+ " CURRENT_CHECKPOINT = 0\n",
+ " else:\n",
+ " CURRENT_CHECKPOINT += 1\n",
+ " \n",
+ " print('time taken:', time.time()-lasttime)\n",
+ " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
+ " train_acc,test_loss,\n",
+ " test_acc))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[array([0.], dtype=float32), array([0.13218915], dtype=float32)]"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n",
+ "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n",
+ "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
+ " model.X_right: right})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/text-similarity/1.char-similarity-siamese-birnn.ipynb b/text-similarity/1.char-similarity-siamese-birnn.ipynb
deleted file mode 100644
index 9c6db2f..0000000
--- a/text-similarity/1.char-similarity-siamese-birnn.ipynb
+++ /dev/null
@@ -1,443 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import collections\n",
- "import random\n",
- "import tensorflow as tf"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "def build_dataset(words, n_words):\n",
- " count = [['GO', 0], ['PAD', 1], ['EOS', 2], ['UNK', 3]]\n",
- " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
- " dictionary = dict()\n",
- " for word, _ in count:\n",
- " dictionary[word] = len(dictionary)\n",
- " data = list()\n",
- " unk_count = 0\n",
- " for word in words:\n",
- " index = dictionary.get(word, 0)\n",
- " if index == 0:\n",
- " unk_count += 1\n",
- " data.append(index)\n",
- " count[0][1] = unk_count\n",
- " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
- " return data, count, dictionary, reversed_dictionary\n",
- "\n",
- "def str_idx(corpus, dic, maxlen, UNK=3):\n",
- " X = np.zeros((len(corpus),maxlen))\n",
- " for i in range(len(corpus)):\n",
- " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
- " val = dic[k] if k in dic else UNK\n",
- " X[i,-1 - no]= val\n",
- " return X\n",
- "\n",
- "def load_data(filepath):\n",
- " x1=[]\n",
- " x2=[]\n",
- " y=[]\n",
- " for line in open(filepath):\n",
- " l=line.strip().split(\"\\t\")\n",
- " if len(l)<2:\n",
- " continue\n",
- " if random.random() > 0.5:\n",
- " x1.append(l[0].lower())\n",
- " x2.append(l[1].lower())\n",
- " else:\n",
- " x1.append(l[1].lower())\n",
- " x2.append(l[0].lower())\n",
- " y.append(1)\n",
- " combined = np.asarray(x1+x2)\n",
- " shuffle_indices = np.random.permutation(np.arange(len(combined)))\n",
- " combined_shuff = combined[shuffle_indices]\n",
- " for i in range(len(combined)):\n",
- " x1.append(combined[i])\n",
- " x2.append(combined_shuff[i])\n",
- " y.append(0)\n",
- " return np.array(x1),np.array(x2),np.array(y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "X1_text, X2_text, Y = load_data('person_match.train')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "vocab from size: 101\n",
- "Most common words [(' ', 2076683), ('a', 1345908), ('e', 1246119), ('r', 1019184), ('n', 940224), ('i', 880143)]\n",
- "Sample data [5, 16, 7, 9, 5, 8, 5, 4, 6, 26] ['a', 'd', 'r', 'i', 'a', 'n', 'a', ' ', 'e', 'v']\n"
- ]
- }
- ],
- "source": [
- "concat = ' '.join(X1_text.tolist() + X2_text.tolist())\n",
- "vocabulary_size = len(list(set(concat)))\n",
- "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
- "print('vocab from size: %d'%(vocabulary_size))\n",
- "print('Most common words', count[4:10])\n",
- "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "class Model:\n",
- " def __init__(self, size_layer, num_layers, embedded_size,\n",
- " dict_size, learning_rate, dropout):\n",
- " \n",
- " def cells(size, reuse=False):\n",
- " cell = tf.nn.rnn_cell.LSTMCell(size,initializer=tf.orthogonal_initializer(),reuse=reuse)\n",
- " return tf.contrib.rnn.DropoutWrapper(cell,output_keep_prob=dropout)\n",
- " \n",
- " def birnn(inputs, scope):\n",
- " with tf.variable_scope(scope):\n",
- " for n in range(num_layers):\n",
- " (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(\n",
- " cell_fw = cells(size_layer // 2),\n",
- " cell_bw = cells(size_layer // 2),\n",
- " inputs = inputs,\n",
- " dtype = tf.float32,\n",
- " scope = 'bidirectional_rnn_%d'%(n))\n",
- " inputs = tf.concat((out_fw, out_bw), 2)\n",
- " return inputs[:,-1]\n",
- " \n",
- " self.X_left = tf.placeholder(tf.int32, [None, None])\n",
- " self.X_right = tf.placeholder(tf.int32, [None, None])\n",
- " self.Y = tf.placeholder(tf.float32, [None])\n",
- " self.batch_size = tf.shape(self.X_left)[0]\n",
- " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
- " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n",
- " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n",
- " \n",
- " def contrastive_loss(y,d):\n",
- " tmp= y * tf.square(d)\n",
- " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n",
- " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n",
- " \n",
- " self.output_left = birnn(embedded_left, 'left')\n",
- " self.output_right = birnn(embedded_right, 'right')\n",
- " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),1,keep_dims=True))\n",
- " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),1,keep_dims=True)),\n",
- " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),1,keep_dims=True))))\n",
- " self.distance = tf.reshape(self.distance, [-1])\n",
- " self.cost = contrastive_loss(self.Y,self.distance)\n",
- " \n",
- " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n",
- " tf.rint(self.distance))\n",
- " correct_predictions = tf.equal(self.temp_sim, self.Y)\n",
- " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n",
- " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "size_layer = 256\n",
- "num_layers = 2\n",
- "embedded_size = 128\n",
- "learning_rate = 1e-3\n",
- "maxlen = 30\n",
- "batch_size = 128\n",
- "dropout = 0.8"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From :36: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "keep_dims is deprecated, use keepdims instead\n"
- ]
- }
- ],
- "source": [
- "tf.reset_default_graph()\n",
- "sess = tf.InteractiveSession()\n",
- "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n",
- "sess.run(tf.global_variables_initializer())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
- " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
- ]
- }
- ],
- "source": [
- "from sklearn.cross_validation import train_test_split\n",
- "\n",
- "vectors_left = str_idx(X1_text, dictionary, maxlen)\n",
- "vectors_right = str_idx(X2_text, dictionary, maxlen)\n",
- "train_X_left, test_X_left, train_X_right, test_X_right, train_Y, test_Y = train_test_split(vectors_left,\n",
- " vectors_right,\n",
- " Y,\n",
- " test_size = 0.2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [06:42<00:00, 8.42it/s, accuracy=1, cost=0.0317] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:43<00:00, 19.32it/s, accuracy=1, cost=0.0234] \n",
- "train minibatch loop: 0%| | 1/3337 [00:00<06:24, 8.67it/s, accuracy=0.945, cost=0.0468]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 446.1067855358124\n",
- "epoch: 0, training loss: 0.047054, training acc: 0.936334, valid loss: 0.044065, valid acc: 0.943738\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [06:40<00:00, 8.67it/s, accuracy=1, cost=0.0305] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:42<00:00, 19.71it/s, accuracy=1, cost=0.0329] \n",
- "train minibatch loop: 0%| | 1/3337 [00:00<06:31, 8.51it/s, accuracy=0.961, cost=0.0434]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 442.806120634079\n",
- "epoch: 1, training loss: 0.043424, training acc: 0.943691, valid loss: 0.043407, valid acc: 0.943963\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [06:41<00:00, 8.32it/s, accuracy=1, cost=0.0299] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:42<00:00, 19.59it/s, accuracy=1, cost=0.0296] \n",
- "train minibatch loop: 0%| | 1/3337 [00:00<06:22, 8.72it/s, accuracy=0.93, cost=0.0451]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 443.7129006385803\n",
- "epoch: 2, training loss: 0.042537, training acc: 0.945597, valid loss: 0.042411, valid acc: 0.946839\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [06:36<00:00, 8.84it/s, accuracy=1, cost=0.0261] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:42<00:00, 19.37it/s, accuracy=1, cost=0.0269] \n",
- "train minibatch loop: 0%| | 1/3337 [00:00<06:26, 8.63it/s, accuracy=0.953, cost=0.0426]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 438.77990889549255\n",
- "epoch: 3, training loss: 0.041973, training acc: 0.946717, valid loss: 0.041931, valid acc: 0.947616\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [06:36<00:00, 8.33it/s, accuracy=1, cost=0.0281] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:42<00:00, 19.76it/s, accuracy=1, cost=0.0286] "
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 438.77926087379456\n",
- "epoch: 4, training loss: 0.041583, training acc: 0.947766, valid loss: 0.041881, valid acc: 0.948243\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\n"
- ]
- }
- ],
- "source": [
- "from tqdm import tqdm\n",
- "import time\n",
- "\n",
- "for EPOCH in range(5):\n",
- " lasttime = time.time()\n",
- " \n",
- " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
- " pbar = tqdm(range(0, len(train_X_left), batch_size), desc='train minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = train_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = train_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = train_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " assert not np.isnan(loss)\n",
- " train_loss += loss\n",
- " train_acc += acc\n",
- " pbar.set_postfix(cost=loss, accuracy = acc)\n",
- " \n",
- " pbar = tqdm(range(0, len(test_X_left), batch_size), desc='test minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = test_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = test_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = test_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss = sess.run([model.accuracy, model.cost], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " test_loss += loss\n",
- " test_acc += acc\n",
- " pbar.set_postfix(cost=loss, accuracy = acc)\n",
- " \n",
- " train_loss /= (len(train_X_left) / batch_size)\n",
- " train_acc /= (len(train_X_left) / batch_size)\n",
- " test_loss /= (len(test_X_left) / batch_size)\n",
- " test_acc /= (len(test_X_left) / batch_size)\n",
- " \n",
- " print('time taken:', time.time()-lasttime)\n",
- " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
- " train_acc,test_loss,\n",
- " test_acc))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([0.], dtype=float32), array([0.31725764], dtype=float32)]"
- ]
- },
- "execution_count": 39,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['adriana evans'], dictionary, maxlen)\n",
- "right = str_idx(['adriana'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([1.], dtype=float32), array([0.631173], dtype=float32)]"
- ]
- },
- "execution_count": 41,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['husein zolkepli'], dictionary, maxlen)\n",
- "right = str_idx(['zolkepli'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.5.2"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/text-similarity/2.dilated-cnn-contrastive.ipynb b/text-similarity/2.dilated-cnn-contrastive.ipynb
new file mode 100644
index 0000000..d7276c3
--- /dev/null
+++ b/text-similarity/2.dilated-cnn-contrastive.ipynb
@@ -0,0 +1,648 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/jupyter/.local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
+ " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import tensorflow as tf\n",
+ "import re\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from tqdm import tqdm\n",
+ "import collections\n",
+ "from unidecode import unidecode\n",
+ "from sklearn.cross_validation import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_dataset(words, n_words):\n",
+ " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3]]\n",
+ " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
+ " dictionary = dict()\n",
+ " for word, _ in count:\n",
+ " dictionary[word] = len(dictionary)\n",
+ " data = list()\n",
+ " unk_count = 0\n",
+ " for word in words:\n",
+ " index = dictionary.get(word, 0)\n",
+ " if index == 0:\n",
+ " unk_count += 1\n",
+ " data.append(index)\n",
+ " count[0][1] = unk_count\n",
+ " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
+ " return data, count, dictionary, reversed_dictionary\n",
+ "\n",
+ "def str_idx(corpus, dic, maxlen, UNK=3):\n",
+ " X = np.zeros((len(corpus),maxlen))\n",
+ " for i in range(len(corpus)):\n",
+ " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
+ " val = dic[k] if k in dic else UNK\n",
+ " X[i,-1 - no]= val\n",
+ " return X\n",
+ "\n",
+ "def cleaning(string):\n",
+ " string = unidecode(string).replace('.', ' . ').replace(',', ' , ')\n",
+ " string = re.sub('[^A-Za-z\\- ]+', ' ', string)\n",
+ " string = re.sub(r'[ ]+', ' ', string).strip()\n",
+ " return string.lower()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " id qid1 qid2 question1 \\\n",
+ "0 0 1 2 What is the step by step guide to invest in sh... \n",
+ "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
+ "2 2 5 6 How can I increase the speed of my internet co... \n",
+ "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
+ "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
+ "\n",
+ " question2 is_duplicate \n",
+ "0 What is the step by step guide to invest in sh... 0 \n",
+ "1 What would happen if the Indian government sto... 0 \n",
+ "2 How can Internet speed be increased by hacking... 0 \n",
+ "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
+ "4 Which fish would survive in salt water? 0 "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('quora_duplicate_questions.tsv', delimiter='\\t').dropna()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "left, right, label = df['question1'].tolist(), df['question2'].tolist(), df['is_duplicate'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(array([0, 1]), array([255024, 149263]))"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.unique(label, return_counts = True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 404287/404287 [00:07<00:00, 54845.16it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "for i in tqdm(range(len(left))):\n",
+ " left[i] = cleaning(left[i])\n",
+ " right[i] = cleaning(right[i])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "vocab from size: 87661\n",
+ "Most common words [('the', 377593), ('what', 324635), ('is', 269934), ('i', 223893), ('how', 220876), ('a', 212757)]\n",
+ "Sample data [5, 6, 4, 1285, 62, 1285, 2501, 10, 564, 11] ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in']\n"
+ ]
+ }
+ ],
+ "source": [
+ "concat = ' '.join(left + right).split()\n",
+ "vocabulary_size = len(list(set(concat)))\n",
+ "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
+ "print('vocab from size: %d'%(vocabulary_size))\n",
+ "print('Most common words', count[4:10])\n",
+ "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
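+    "# sinusoidal position encoding, computed for the current sequence length and tiled across the batch\n",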
+ "def position_encoding(inputs):\n",
+ " T = tf.shape(inputs)[1]\n",
+ " repr_dim = inputs.get_shape()[-1].value\n",
+ " pos = tf.reshape(tf.range(0.0, tf.to_float(T), dtype=tf.float32), [-1, 1])\n",
+ " i = np.arange(0, repr_dim, 2, np.float32)\n",
+ " denom = np.reshape(np.power(10000.0, i / repr_dim), [1, -1])\n",
+ " enc = tf.expand_dims(tf.concat([tf.sin(pos / denom), tf.cos(pos / denom)], 1), 0)\n",
+ " return tf.tile(enc, [tf.shape(inputs)[0], 1, 1])\n",
+ "\n",
+ "def layer_norm(inputs, epsilon=1e-8):\n",
+ " mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)\n",
+ " normalized = (inputs - mean) / (tf.sqrt(variance + epsilon))\n",
+ " params_shape = inputs.get_shape()[-1:]\n",
+ " gamma = tf.get_variable('gamma', params_shape, tf.float32, tf.ones_initializer())\n",
+ " beta = tf.get_variable('beta', params_shape, tf.float32, tf.zeros_initializer())\n",
+ " return gamma * normalized + beta\n",
+ "\n",
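+    "# dilated 1D convolution block: layer norm, zero-pad both sides, dilated conv1d, trim back to the input length, ReLU\n",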
+ "def cnn_block(x, dilation_rate, pad_sz, hidden_dim, kernel_size):\n",
+ " x = layer_norm(x)\n",
+ " pad = tf.zeros([tf.shape(x)[0], pad_sz, hidden_dim])\n",
+ " x = tf.layers.conv1d(inputs = tf.concat([pad, x, pad], 1),\n",
+ " filters = hidden_dim,\n",
+ " kernel_size = kernel_size,\n",
+ " dilation_rate = dilation_rate)\n",
+ " x = x[:, :-pad_sz, :]\n",
+ " x = tf.nn.relu(x)\n",
+ " return x\n",
+ "\n",
+ "class Model:\n",
+ " def __init__(self, size_layer, num_layers, embedded_size,\n",
+ " dict_size, learning_rate, dropout, kernel_size = 5):\n",
+ " \n",
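+    "        # encoder: add position encoding, apply num_layers residual dilated CNN blocks (dilation 1, 2, 4, ...), then a dense layer, keeping the last timestep\n",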
+ " def cnn(x, scope):\n",
+ " x += position_encoding(x)\n",
+ " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n",
+ " for n in range(num_layers):\n",
+ " dilation_rate = 2 ** n\n",
+ " pad_sz = (kernel_size - 1) * dilation_rate \n",
+    "                    with tf.variable_scope('block_%d'%n,reuse=tf.AUTO_REUSE):\n",
+ " x += cnn_block(x, dilation_rate, pad_sz, size_layer, kernel_size)\n",
+ " \n",
+ " with tf.variable_scope('logits', reuse=tf.AUTO_REUSE):\n",
+ " return tf.layers.dense(x, size_layer)[:, -1]\n",
+ " \n",
+ " self.X_left = tf.placeholder(tf.int32, [None, None])\n",
+ " self.X_right = tf.placeholder(tf.int32, [None, None])\n",
+ " self.Y = tf.placeholder(tf.float32, [None])\n",
+ " self.batch_size = tf.shape(self.X_left)[0]\n",
+ " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
+ " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n",
+ " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n",
+ " \n",
+ " def contrastive_loss(y,d):\n",
+ " tmp= y * tf.square(d)\n",
+ " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n",
+ " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n",
+ " \n",
+ " self.output_left = cnn(embedded_left, 'left')\n",
+ " self.output_right = cnn(embedded_right, 'right')\n",
+ " print(self.output_left, self.output_right)\n",
+ " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),\n",
+ " 1,keep_dims=True))\n",
+ " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),\n",
+ " 1,keep_dims=True)),\n",
+ " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),\n",
+ " 1,keep_dims=True))))\n",
+ " self.distance = tf.reshape(self.distance, [-1])\n",
+ " self.cost = contrastive_loss(self.Y,self.distance)\n",
+ " \n",
+ " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n",
+ " tf.rint(self.distance))\n",
+ " correct_predictions = tf.equal(self.temp_sim, self.Y)\n",
+ " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n",
+ " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "size_layer = 128\n",
+ "num_layers = 4\n",
+ "embedded_size = 128\n",
+ "learning_rate = 1e-3\n",
+ "maxlen = 50\n",
+ "batch_size = 128\n",
+ "dropout = 0.8"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.cross_validation import train_test_split\n",
+ "\n",
+ "vectors_left = str_idx(left, dictionary, maxlen)\n",
+ "vectors_right = str_idx(right, dictionary, maxlen)\n",
+ "train_X_left, test_X_left, train_X_right, test_X_right, train_Y, test_Y = train_test_split(vectors_left,\n",
+ " vectors_right,\n",
+ " label,\n",
+ " test_size = 0.2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Colocations handled automatically by placer.\n",
+ "WARNING:tensorflow:From :4: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n",
+ "WARNING:tensorflow:From :24: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.conv1d instead.\n",
+ "WARNING:tensorflow:From :43: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.dense instead.\n",
+ "Tensor(\"left/logits/strided_slice:0\", shape=(?, 128), dtype=float32) Tensor(\"right/logits/strided_slice:0\", shape=(?, 128), dtype=float32)\n",
+ "WARNING:tensorflow:From :62: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "keep_dims is deprecated, use keepdims instead\n",
+ "WARNING:tensorflow:From :66: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Deprecated in favor of operator or tf.math.divide.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n"
+ ]
+ }
+ ],
+ "source": [
+ "tf.reset_default_graph()\n",
+ "sess = tf.InteractiveSession()\n",
+ "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n",
+ "sess.run(tf.global_variables_initializer())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:56<00:00, 44.82it/s, accuracy=0.713, cost=0.0944]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:03<00:00, 164.55it/s, accuracy=0.7, cost=0.0912] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.44it/s, accuracy=0.719, cost=0.0951]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.000000, current acc: 0.718526\n",
+ "time taken: 60.22631072998047\n",
+ "epoch: 0, training loss: 0.102462, training acc: 0.686018, valid loss: 0.094624, valid acc: 0.718526\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.66it/s, accuracy=0.733, cost=0.0908]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:03<00:00, 170.70it/s, accuracy=0.722, cost=0.0877]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.23it/s, accuracy=0.75, cost=0.0887] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.718526, current acc: 0.726600\n",
+ "time taken: 57.863216400146484\n",
+ "epoch: 0, training loss: 0.090969, training acc: 0.733650, valid loss: 0.091963, valid acc: 0.726600\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.46it/s, accuracy=0.812, cost=0.0809]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:03<00:00, 170.07it/s, accuracy=0.733, cost=0.087] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.45it/s, accuracy=0.742, cost=0.0818]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.726600, current acc: 0.729846\n",
+ "time taken: 58.1152925491333\n",
+ "epoch: 0, training loss: 0.084663, training acc: 0.758519, valid loss: 0.090660, valid acc: 0.729846\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.46it/s, accuracy=0.812, cost=0.0746]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:03<00:00, 170.52it/s, accuracy=0.756, cost=0.0854]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.34it/s, accuracy=0.773, cost=0.077] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 58.09940052032471\n",
+ "epoch: 0, training loss: 0.079745, training acc: 0.776804, valid loss: 0.091354, valid acc: 0.726319\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.48it/s, accuracy=0.812, cost=0.0669]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:03<00:00, 170.46it/s, accuracy=0.744, cost=0.0865]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 46.80it/s, accuracy=0.812, cost=0.075] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 58.082401752471924\n",
+ "epoch: 0, training loss: 0.075484, training acc: 0.792712, valid loss: 0.092073, valid acc: 0.720588\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.52it/s, accuracy=0.861, cost=0.0598]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:03<00:00, 170.85it/s, accuracy=0.767, cost=0.0831]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 58.026214838027954\n",
+ "epoch: 0, training loss: 0.071810, training acc: 0.806165, valid loss: 0.091487, valid acc: 0.724296\n",
+ "\n",
+ "break epoch:0\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "\n",
+ "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 3, 0, 0, 0\n",
+ "\n",
+ "while True:\n",
+ " lasttime = time.time()\n",
+ " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n",
+ " print('break epoch:%d\\n' % (EPOCH))\n",
+ " break\n",
+ "\n",
+ " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
+ " pbar = tqdm(range(0, len(train_X_left), batch_size), desc='train minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x_left = train_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_x_right = train_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_y = train_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
+ " feed_dict = {model.X_left : batch_x_left, \n",
+ " model.X_right: batch_x_right,\n",
+ " model.Y : batch_y})\n",
+ " assert not np.isnan(loss)\n",
+ " train_loss += loss\n",
+ " train_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " pbar = tqdm(range(0, len(test_X_left), batch_size), desc='test minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x_left = test_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_x_right = test_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_y = test_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " acc, loss = sess.run([model.accuracy, model.cost], \n",
+ " feed_dict = {model.X_left : batch_x_left, \n",
+ " model.X_right: batch_x_right,\n",
+ " model.Y : batch_y})\n",
+ " test_loss += loss\n",
+ " test_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " train_loss /= (len(train_X_left) / batch_size)\n",
+ " train_acc /= (len(train_X_left) / batch_size)\n",
+ " test_loss /= (len(test_X_left) / batch_size)\n",
+ " test_acc /= (len(test_X_left) / batch_size)\n",
+ " \n",
+ " if test_acc > CURRENT_ACC:\n",
+ " print(\n",
+ " 'epoch: %d, pass acc: %f, current acc: %f'\n",
+ " % (EPOCH, CURRENT_ACC, test_acc)\n",
+ " )\n",
+ " CURRENT_ACC = test_acc\n",
+ " CURRENT_CHECKPOINT = 0\n",
+ " else:\n",
+ " CURRENT_CHECKPOINT += 1\n",
+ " \n",
+ " print('time taken:', time.time()-lasttime)\n",
+ " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
+ " train_acc,test_loss,\n",
+ " test_acc))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[array([0.], dtype=float32), array([0.05150324], dtype=float32)]"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n",
+ "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n",
+ "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
+ " model.X_right: right})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/text-similarity/2.sentence-similarity-birnn.ipynb b/text-similarity/2.sentence-similarity-birnn.ipynb
deleted file mode 100644
index f2ff7cf..0000000
--- a/text-similarity/2.sentence-similarity-birnn.ipynb
+++ /dev/null
@@ -1,456 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import collections\n",
- "import random\n",
- "import tensorflow as tf"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "def build_dataset(words, n_words):\n",
- " count = [['GO', 0], ['PAD', 1], ['EOS', 2], ['UNK', 3]]\n",
- " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
- " dictionary = dict()\n",
- " for word, _ in count:\n",
- " dictionary[word] = len(dictionary)\n",
- " data = list()\n",
- " unk_count = 0\n",
- " for word in words:\n",
- " index = dictionary.get(word, 0)\n",
- " if index == 0:\n",
- " unk_count += 1\n",
- " data.append(index)\n",
- " count[0][1] = unk_count\n",
- " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
- " return data, count, dictionary, reversed_dictionary\n",
- "\n",
- "def str_idx(corpus, dic, maxlen, UNK=3):\n",
- " X = np.zeros((len(corpus),maxlen))\n",
- " for i in range(len(corpus)):\n",
- " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
- " val = dic[k] if k in dic else UNK\n",
- " X[i,-1 - no]= val\n",
- " return X\n",
- "\n",
- "def load_data(filepath):\n",
- " x1=[]\n",
- " x2=[]\n",
- " y=[]\n",
- " for line in open(filepath):\n",
- " l=line.strip().split(\"\\t\")\n",
- " if len(l)<2:\n",
- " continue\n",
- " if random.random() > 0.5:\n",
- " x1.append(l[0].lower())\n",
- " x2.append(l[1].lower())\n",
- " else:\n",
- " x1.append(l[1].lower())\n",
- " x2.append(l[0].lower())\n",
- " y.append(int(l[2]))\n",
- " return np.array(x1),np.array(x2),np.array(y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "X1_text, X2_text, Y = load_data('train_snli.txt')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(array([0, 1]), array([183966, 183407]))"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "np.unique(Y,return_counts=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "vocab from size: 47170\n",
- "Most common words [('a', 959179), ('the', 341846), ('in', 273772), ('is', 248868), ('man', 173742), ('on', 154293)]\n",
- "Sample data [4, 38, 7, 17, 4, 16662, 2698, 20, 27512, 4] ['a', 'person', 'is', 'at', 'a', 'diner,', 'ordering', 'an', 'omelette.', 'a']\n"
- ]
- }
- ],
- "source": [
- "concat = (' '.join(X1_text.tolist() + X2_text.tolist())).split()\n",
- "vocabulary_size = len(list(set(concat)))\n",
- "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
- "print('vocab from size: %d'%(vocabulary_size))\n",
- "print('Most common words', count[4:10])\n",
- "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "class Model:\n",
- " def __init__(self, size_layer, num_layers, embedded_size,\n",
- " dict_size, learning_rate, dropout):\n",
- " \n",
- " def cells(size, reuse=False):\n",
- " cell = tf.nn.rnn_cell.LSTMCell(size,initializer=tf.orthogonal_initializer(),reuse=reuse)\n",
- " return tf.contrib.rnn.DropoutWrapper(cell,output_keep_prob=dropout)\n",
- " \n",
- " def birnn(inputs, scope):\n",
- " with tf.variable_scope(scope):\n",
- " for n in range(num_layers):\n",
- " (out_fw, out_bw), (state_fw, state_bw) = tf.nn.bidirectional_dynamic_rnn(\n",
- " cell_fw = cells(size_layer // 2),\n",
- " cell_bw = cells(size_layer // 2),\n",
- " inputs = inputs,\n",
- " dtype = tf.float32,\n",
- " scope = 'bidirectional_rnn_%d'%(n))\n",
- " inputs = tf.concat((out_fw, out_bw), 2)\n",
- " return inputs[:,-1]\n",
- " \n",
- " self.X_left = tf.placeholder(tf.int32, [None, None])\n",
- " self.X_right = tf.placeholder(tf.int32, [None, None])\n",
- " self.Y = tf.placeholder(tf.float32, [None])\n",
- " self.batch_size = tf.shape(self.X_left)[0]\n",
- " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
- " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n",
- " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n",
- " \n",
- " def contrastive_loss(y,d):\n",
- " tmp= y * tf.square(d)\n",
- " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n",
- " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n",
- " \n",
- " self.output_left = birnn(embedded_left, 'left')\n",
- " self.output_right = birnn(embedded_right, 'right')\n",
- " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),1,keep_dims=True))\n",
- " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),1,keep_dims=True)),\n",
- " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),1,keep_dims=True))))\n",
- " self.distance = tf.reshape(self.distance, [-1])\n",
- " self.cost = contrastive_loss(self.Y,self.distance)\n",
- " \n",
- " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n",
- " tf.rint(self.distance))\n",
- " correct_predictions = tf.equal(self.temp_sim, self.Y)\n",
- " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n",
- " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "size_layer = 256\n",
- "num_layers = 2\n",
- "embedded_size = 128\n",
- "learning_rate = 1e-3\n",
- "maxlen = 50\n",
- "batch_size = 128\n",
- "dropout = 0.8"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From :36: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "keep_dims is deprecated, use keepdims instead\n"
- ]
- }
- ],
- "source": [
- "tf.reset_default_graph()\n",
- "sess = tf.InteractiveSession()\n",
- "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n",
- "sess.run(tf.global_variables_initializer())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
- " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
- ]
- }
- ],
- "source": [
- "from sklearn.cross_validation import train_test_split\n",
- "\n",
- "vectors_left = str_idx(X1_text, dictionary, maxlen)\n",
- "vectors_right = str_idx(X2_text, dictionary, maxlen)\n",
- "train_X_left, test_X_left, train_X_right, test_X_right, train_Y, test_Y = train_test_split(vectors_left,\n",
- " vectors_right,\n",
- " Y,\n",
- " test_size = 0.2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [07:41<00:00, 5.12it/s, accuracy=0.4, cost=0.121] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:46<00:00, 12.98it/s, accuracy=0.667, cost=0.11] \n",
- "train minibatch loop: 0%| | 0/2297 [00:00, ?it/s]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 507.81807231903076\n",
- "epoch: 0, training loss: 0.111454, training acc: 0.635320, valid loss: 0.102844, valid acc: 0.684101\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [07:41<00:00, 5.09it/s, accuracy=0.7, cost=0.0912] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:46<00:00, 12.28it/s, accuracy=0.667, cost=0.111] \n",
- "train minibatch loop: 0%| | 0/2297 [00:00, ?it/s]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 508.3038227558136\n",
- "epoch: 1, training loss: 0.099547, training acc: 0.699636, valid loss: 0.097585, valid acc: 0.710069\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [07:41<00:00, 5.19it/s, accuracy=0.7, cost=0.089] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:46<00:00, 12.31it/s, accuracy=0.333, cost=0.121] \n",
- "train minibatch loop: 0%| | 0/2297 [00:00, ?it/s]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 508.48236751556396\n",
- "epoch: 2, training loss: 0.095110, training acc: 0.722028, valid loss: 0.095254, valid acc: 0.720867\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [07:41<00:00, 5.22it/s, accuracy=0.9, cost=0.0622] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:46<00:00, 12.26it/s, accuracy=0.667, cost=0.103] \n",
- "train minibatch loop: 0%| | 0/2297 [00:00, ?it/s, accuracy=0.695, cost=0.0968]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 508.11622738838196\n",
- "epoch: 3, training loss: 0.092058, training acc: 0.735736, valid loss: 0.093679, valid acc: 0.728484\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [07:41<00:00, 5.07it/s, accuracy=0.9, cost=0.0628] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:46<00:00, 12.33it/s, accuracy=0.667, cost=0.12] "
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 507.789755821228\n",
- "epoch: 4, training loss: 0.089936, training acc: 0.745521, valid loss: 0.093175, valid acc: 0.730239\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\n"
- ]
- }
- ],
- "source": [
- "from tqdm import tqdm\n",
- "import time\n",
- "\n",
- "for EPOCH in range(5):\n",
- " lasttime = time.time()\n",
- " \n",
- " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
- " pbar = tqdm(range(0, len(train_X_left), batch_size), desc='train minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = train_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = train_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = train_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " assert not np.isnan(loss)\n",
- " train_loss += loss\n",
- " train_acc += acc\n",
- " pbar.set_postfix(cost=loss, accuracy = acc)\n",
- " \n",
- " pbar = tqdm(range(0, len(test_X_left), batch_size), desc='test minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = test_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = test_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = test_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss = sess.run([model.accuracy, model.cost], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " test_loss += loss\n",
- " test_acc += acc\n",
- " pbar.set_postfix(cost=loss, accuracy = acc)\n",
- " \n",
- " train_loss /= (len(train_X_left) / batch_size)\n",
- " train_acc /= (len(train_X_left) / batch_size)\n",
- " test_loss /= (len(test_X_left) / batch_size)\n",
- " test_acc /= (len(test_X_left) / batch_size)\n",
- " \n",
- " print('time taken:', time.time()-lasttime)\n",
- " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
- " train_acc,test_loss,\n",
- " test_acc))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([1.], dtype=float32), array([0.69642884], dtype=float32)]"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n",
- "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([0.], dtype=float32), array([0.37782538], dtype=float32)]"
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['i love you'], dictionary, maxlen)\n",
- "right = str_idx(['you love i'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.5.2"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/text-similarity/3.char-similarity-batchall-tripletloss.ipynb b/text-similarity/3.char-similarity-batchall-tripletloss.ipynb
deleted file mode 100644
index d699730..0000000
--- a/text-similarity/3.char-similarity-batchall-tripletloss.ipynb
+++ /dev/null
@@ -1,562 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import collections\n",
- "import random\n",
- "import tensorflow as tf"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "def build_dataset(words, n_words):\n",
- " count = [['GO', 0], ['PAD', 1], ['EOS', 2], ['UNK', 3]]\n",
- " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
- " dictionary = dict()\n",
- " for word, _ in count:\n",
- " dictionary[word] = len(dictionary)\n",
- " data = list()\n",
- " unk_count = 0\n",
- " for word in words:\n",
- " index = dictionary.get(word, 0)\n",
- " if index == 0:\n",
- " unk_count += 1\n",
- " data.append(index)\n",
- " count[0][1] = unk_count\n",
- " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
- " return data, count, dictionary, reversed_dictionary\n",
- "\n",
- "def str_idx(corpus, dic, maxlen, UNK=3):\n",
- " X = np.zeros((len(corpus),maxlen))\n",
- " for i in range(len(corpus)):\n",
- " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
- " val = dic[k] if k in dic else UNK\n",
- " X[i,-1 - no]= val\n",
- " return X\n",
- "\n",
- "def load_data(filepath):\n",
- " x1=[]\n",
- " x2=[]\n",
- " y=[]\n",
- " for line in open(filepath):\n",
- " l=line.strip().split(\"\\t\")\n",
- " if len(l)<2:\n",
- " continue\n",
- " if random.random() > 0.5:\n",
- " x1.append(l[0].lower())\n",
- " x2.append(l[1].lower())\n",
- " else:\n",
- " x1.append(l[1].lower())\n",
- " x2.append(l[0].lower())\n",
- " y.append(1)\n",
- " combined = np.asarray(x1+x2)\n",
- " shuffle_indices = np.random.permutation(np.arange(len(combined)))\n",
- " combined_shuff = combined[shuffle_indices]\n",
- " for i in range(len(combined)):\n",
- " x1.append(combined[i])\n",
- " x2.append(combined_shuff[i])\n",
- " y.append(0)\n",
- " return np.array(x1),np.array(x2),np.array(y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "X1_text, X2_text, Y = load_data('person_match.train')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "vocab from size: 101\n",
- "Most common words [(' ', 2076683), ('a', 1345908), ('e', 1246119), ('r', 1019184), ('n', 940224), ('i', 880143)]\n",
- "Sample data [5, 16, 7, 9, 5, 8, 5, 4, 6, 26] ['a', 'd', 'r', 'i', 'a', 'n', 'a', ' ', 'e', 'v']\n"
- ]
- }
- ],
- "source": [
- "concat = ' '.join(X1_text.tolist() + X2_text.tolist())\n",
- "vocabulary_size = len(list(set(concat)))\n",
- "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
- "print('vocab from size: %d'%(vocabulary_size))\n",
- "print('Most common words', count[4:10])\n",
- "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "def _pairwise_distances(embeddings_left, embeddings_right, squared=False):\n",
- " dot_product = tf.matmul(embeddings_left, \n",
- " tf.transpose(embeddings_right))\n",
- " square_norm = tf.diag_part(dot_product)\n",
- " distances = tf.expand_dims(square_norm, 1) - 2.0 * dot_product + tf.expand_dims(square_norm, 0)\n",
- " distances = tf.maximum(distances, 0.0)\n",
- "\n",
- " if not squared:\n",
- " mask = tf.to_float(tf.equal(distances, 0.0))\n",
- " distances = distances + mask * 1e-16\n",
- " distances = tf.sqrt(distances)\n",
- " distances = distances * (1.0 - mask)\n",
- "\n",
- " return distances\n",
- "\n",
- "\n",
- "def _get_anchor_positive_triplet_mask(labels):\n",
- " indices_equal = tf.cast(tf.eye(tf.shape(labels)[0]), tf.bool)\n",
- " indices_not_equal = tf.logical_not(indices_equal)\n",
- " labels_equal = tf.equal(tf.expand_dims(labels, 0), tf.expand_dims(labels, 1))\n",
- " mask = tf.logical_and(indices_not_equal, labels_equal)\n",
- "\n",
- " return mask\n",
- "\n",
- "\n",
- "def _get_anchor_negative_triplet_mask(labels):\n",
- " labels_equal = tf.equal(tf.expand_dims(labels, 0), tf.expand_dims(labels, 1))\n",
- " mask = tf.logical_not(labels_equal)\n",
- "\n",
- " return mask\n",
- "\n",
- "def _get_triplet_mask(labels):\n",
- " indices_equal = tf.cast(tf.eye(tf.shape(labels)[0]), tf.bool)\n",
- " indices_not_equal = tf.logical_not(indices_equal)\n",
- " i_not_equal_j = tf.expand_dims(indices_not_equal, 2)\n",
- " i_not_equal_k = tf.expand_dims(indices_not_equal, 1)\n",
- " j_not_equal_k = tf.expand_dims(indices_not_equal, 0)\n",
- "\n",
- " distinct_indices = tf.logical_and(tf.logical_and(i_not_equal_j, i_not_equal_k), j_not_equal_k)\n",
- "\n",
- " label_equal = tf.equal(tf.expand_dims(labels, 0), tf.expand_dims(labels, 1))\n",
- " i_equal_j = tf.expand_dims(label_equal, 2)\n",
- " i_equal_k = tf.expand_dims(label_equal, 1)\n",
- "\n",
- " valid_labels = tf.logical_and(i_equal_j, tf.logical_not(i_equal_k))\n",
- " mask = tf.logical_and(distinct_indices, valid_labels)\n",
- "\n",
- " return mask\n",
- "def batch_all_triplet_loss(labels, embeddings_left, embeddings_right, margin, squared=False):\n",
- " pairwise_dist = _pairwise_distances(embeddings_left, embeddings_right, squared=squared)\n",
- "\n",
- " anchor_positive_dist = tf.expand_dims(pairwise_dist, 2)\n",
- " assert anchor_positive_dist.shape[2] == 1, \"{}\".format(anchor_positive_dist.shape)\n",
- " anchor_negative_dist = tf.expand_dims(pairwise_dist, 1)\n",
- " assert anchor_negative_dist.shape[1] == 1, \"{}\".format(anchor_negative_dist.shape)\n",
- "\n",
- " triplet_loss = anchor_positive_dist - anchor_negative_dist + margin\n",
- "\n",
- " mask = _get_triplet_mask(labels)\n",
- " mask = tf.to_float(mask)\n",
- " triplet_loss = tf.multiply(mask, triplet_loss)\n",
- "\n",
- " triplet_loss = tf.maximum(triplet_loss, 0.0)\n",
- "\n",
- " valid_triplets = tf.to_float(tf.greater(triplet_loss, 1e-16))\n",
- " num_positive_triplets = tf.reduce_sum(valid_triplets)\n",
- " num_valid_triplets = tf.reduce_sum(mask)\n",
- " fraction_positive_triplets = num_positive_triplets / (num_valid_triplets + 1e-16)\n",
- "\n",
- " triplet_loss = tf.reduce_sum(triplet_loss) / (num_positive_triplets + 1e-16)\n",
- "\n",
- " return triplet_loss, fraction_positive_triplets"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "class Model:\n",
- " def __init__(self, size_layer, num_layers, embedded_size,\n",
- " dict_size, learning_rate, dimension_output):\n",
- " \n",
- " def cells(reuse=False):\n",
- " return tf.nn.rnn_cell.LSTMCell(size_layer,\n",
- " initializer=tf.orthogonal_initializer(),reuse=reuse)\n",
- " \n",
- " def rnn(inputs, reuse=False):\n",
- " with tf.variable_scope('model', reuse = reuse):\n",
- " rnn_cells = tf.nn.rnn_cell.MultiRNNCell([cells() for _ in range(num_layers)])\n",
- " outputs, _ = tf.nn.dynamic_rnn(rnn_cells, inputs, dtype = tf.float32)\n",
- " return tf.layers.dense(outputs[:,-1], dimension_output)\n",
- " \n",
- " self.X_left = tf.placeholder(tf.int32, [None, None])\n",
- " self.X_right = tf.placeholder(tf.int32, [None, None])\n",
- " self.Y = tf.placeholder(tf.float32, [None])\n",
- " self.batch_size = tf.shape(self.X_left)[0]\n",
- " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
- " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n",
- " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n",
- " \n",
- " self.output_left = rnn(embedded_left, False)\n",
- " self.output_right = rnn(embedded_right, True)\n",
- " \n",
- " self.cost, fraction = batch_all_triplet_loss(self.Y, self.output_left, \n",
- " self.output_right, margin=0.5, squared=False)\n",
- " \n",
- " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),1,keep_dims=True))\n",
- " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),1,keep_dims=True)),\n",
- " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),1,keep_dims=True))))\n",
- " self.distance = tf.reshape(self.distance, [-1])\n",
- " \n",
- " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n",
- " tf.rint(self.distance))\n",
- " correct_predictions = tf.equal(self.temp_sim, self.Y)\n",
- " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n",
- " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "size_layer = 256\n",
- "num_layers = 2\n",
- "embedded_size = 128\n",
- "learning_rate = 1e-3\n",
- "dimension_output = 300\n",
- "maxlen = 30\n",
- "batch_size = 128"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From :34: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "keep_dims is deprecated, use keepdims instead\n"
- ]
- }
- ],
- "source": [
- "tf.reset_default_graph()\n",
- "sess = tf.InteractiveSession()\n",
- "model = Model(size_layer,num_layers,embedded_size,len(dictionary),\n",
- " learning_rate,dimension_output)\n",
- "sess.run(tf.global_variables_initializer())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
- " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
- ]
- }
- ],
- "source": [
- "from sklearn.cross_validation import train_test_split\n",
- "\n",
- "vectors_left = str_idx(X1_text, dictionary, maxlen)\n",
- "vectors_right = str_idx(X2_text, dictionary, maxlen)\n",
- "train_X_left, test_X_left, train_X_right, test_X_right, train_Y, test_Y = train_test_split(vectors_left,\n",
- " vectors_right,\n",
- " Y,\n",
- " test_size = 0.2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [04:00<00:00, 14.45it/s, accuracy=0.985, cost=0.51] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:22<00:00, 36.49it/s, accuracy=1, cost=0.72] \n",
- "train minibatch loop: 0%| | 2/3337 [00:00<03:56, 14.09it/s, accuracy=0.961, cost=0.517]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 263.62842535972595\n",
- "epoch: 0, training loss: 0.506187, training acc: 0.947672, valid loss: 0.499385, valid acc: 0.949227\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [04:01<00:00, 13.84it/s, accuracy=0.97, cost=0.506] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:22<00:00, 36.36it/s, accuracy=1, cost=0.694] \n",
- "train minibatch loop: 0%| | 2/3337 [00:00<04:03, 13.71it/s, accuracy=0.953, cost=0.494]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 264.0480725765228\n",
- "epoch: 1, training loss: 0.505925, training acc: 0.948232, valid loss: 0.488787, valid acc: 0.946576\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [04:01<00:00, 13.84it/s, accuracy=0.97, cost=0.528] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:23<00:00, 36.30it/s, accuracy=1, cost=0.663] \n",
- "train minibatch loop: 0%| | 2/3337 [00:00<03:54, 14.23it/s, accuracy=0.945, cost=0.471]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 264.1305401325226\n",
- "epoch: 2, training loss: 0.505620, training acc: 0.947038, valid loss: 0.488307, valid acc: 0.945199\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [04:01<00:00, 13.84it/s, accuracy=0.97, cost=0.535] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:23<00:00, 36.06it/s, accuracy=1, cost=0.627] \n",
- "train minibatch loop: 0%| | 2/3337 [00:00<03:58, 14.00it/s, accuracy=0.953, cost=0.458]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 264.2930498123169\n",
- "epoch: 3, training loss: 0.505297, training acc: 0.946200, valid loss: 0.475937, valid acc: 0.943635\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 3337/3337 [04:00<00:00, 13.85it/s, accuracy=0.97, cost=0.52] \n",
- "test minibatch loop: 100%|██████████| 835/835 [00:22<00:00, 36.33it/s, accuracy=1, cost=0.611] "
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 263.87016439437866\n",
- "epoch: 4, training loss: 0.505250, training acc: 0.946427, valid loss: 0.468737, valid acc: 0.944113\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\n"
- ]
- }
- ],
- "source": [
- "from tqdm import tqdm\n",
- "import time\n",
- "\n",
- "for EPOCH in range(5):\n",
- " lasttime = time.time()\n",
- " \n",
- " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
- " pbar = tqdm(range(0, len(train_X_left), batch_size), desc='train minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = train_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = train_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = train_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " assert not np.isnan(loss)\n",
- " train_loss += loss\n",
- " train_acc += acc\n",
- " pbar.set_postfix(cost = loss, accuracy = acc)\n",
- " \n",
- " pbar = tqdm(range(0, len(test_X_left), batch_size), desc='test minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = test_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = test_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = test_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss = sess.run([model.accuracy, model.cost], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " test_loss += loss\n",
- " test_acc += acc\n",
- " pbar.set_postfix(cost = loss, accuracy = acc)\n",
- " \n",
- " train_loss /= (len(train_X_left) / batch_size)\n",
- " train_acc /= (len(train_X_left) / batch_size)\n",
- " test_loss /= (len(test_X_left) / batch_size)\n",
- " test_acc /= (len(test_X_left) / batch_size)\n",
- " \n",
- " print('time taken:', time.time()-lasttime)\n",
- " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
- " train_acc,test_loss,\n",
- " test_acc))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([1.], dtype=float32), array([0.5210439], dtype=float32)]"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['adriana evans'], dictionary, maxlen)\n",
- "right = str_idx(['adriana'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([1.], dtype=float32), array([0.7454066], dtype=float32)]"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['husein zolkepli'], dictionary, maxlen)\n",
- "right = str_idx(['zolkepli'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([0.], dtype=float32), array([0.31712526], dtype=float32)]"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['adriana evans'], dictionary, maxlen)\n",
- "right = str_idx(['evans adriana'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([0.], dtype=float32), array([0.26328784], dtype=float32)]"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['synergy telecom'], dictionary, maxlen)\n",
- "right = str_idx(['syntel'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.5.2"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/text-similarity/3.transformer-contrastive.ipynb b/text-similarity/3.transformer-contrastive.ipynb
new file mode 100644
index 0000000..ab0c837
--- /dev/null
+++ b/text-similarity/3.transformer-contrastive.ipynb
@@ -0,0 +1,974 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/jupyter/.local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
+ " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import tensorflow as tf\n",
+ "import re\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from tqdm import tqdm\n",
+ "import collections\n",
+ "from unidecode import unidecode\n",
+ "from sklearn.cross_validation import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_dataset(words, n_words):\n",
+ " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3]]\n",
+ " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
+ " dictionary = dict()\n",
+ " for word, _ in count:\n",
+ " dictionary[word] = len(dictionary)\n",
+ " data = list()\n",
+ " unk_count = 0\n",
+ " for word in words:\n",
+ " index = dictionary.get(word, 0)\n",
+ " if index == 0:\n",
+ " unk_count += 1\n",
+ " data.append(index)\n",
+ " count[0][1] = unk_count\n",
+ " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
+ " return data, count, dictionary, reversed_dictionary\n",
+ "\n",
+ "def str_idx(corpus, dic, maxlen, UNK=3):\n",
+ " X = np.zeros((len(corpus),maxlen))\n",
+ " for i in range(len(corpus)):\n",
+ " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
+ " val = dic[k] if k in dic else UNK\n",
+ " X[i,-1 - no]= val\n",
+ " return X\n",
+ "\n",
+ "def cleaning(string):\n",
+ " string = unidecode(string).replace('.', ' . ').replace(',', ' , ')\n",
+ " string = re.sub('[^A-Za-z\\- ]+', ' ', string)\n",
+ " string = re.sub(r'[ ]+', ' ', string).strip()\n",
+ " return string.lower()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " qid1 | \n",
+ " qid2 | \n",
+ " question1 | \n",
+ " question2 | \n",
+ " is_duplicate | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 0 | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 1 | \n",
+ " 3 | \n",
+ " 4 | \n",
+ " What is the story of Kohinoor (Koh-i-Noor) Dia... | \n",
+ " What would happen if the Indian government sto... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 5 | \n",
+ " 6 | \n",
+ " How can I increase the speed of my internet co... | \n",
+ " How can Internet speed be increased by hacking... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 3 | \n",
+ " 7 | \n",
+ " 8 | \n",
+ " Why am I mentally very lonely? How can I solve... | \n",
+ " Find the remainder when [math]23^{24}[/math] i... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 4 | \n",
+ " 9 | \n",
+ " 10 | \n",
+ " Which one dissolve in water quikly sugar, salt... | \n",
+ " Which fish would survive in salt water? | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id qid1 qid2 question1 \\\n",
+ "0 0 1 2 What is the step by step guide to invest in sh... \n",
+ "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
+ "2 2 5 6 How can I increase the speed of my internet co... \n",
+ "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
+ "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
+ "\n",
+ " question2 is_duplicate \n",
+ "0 What is the step by step guide to invest in sh... 0 \n",
+ "1 What would happen if the Indian government sto... 0 \n",
+ "2 How can Internet speed be increased by hacking... 0 \n",
+ "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
+ "4 Which fish would survive in salt water? 0 "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('quora_duplicate_questions.tsv', delimiter='\\t').dropna()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "left, right, label = df['question1'].tolist(), df['question2'].tolist(), df['is_duplicate'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(array([0, 1]), array([255024, 149263]))"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.unique(label, return_counts = True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 404287/404287 [00:07<00:00, 53664.30it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "for i in tqdm(range(len(left))):\n",
+ " left[i] = cleaning(left[i])\n",
+ " right[i] = cleaning(right[i])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "vocab from size: 87661\n",
+ "Most common words [('the', 377593), ('what', 324635), ('is', 269934), ('i', 223893), ('how', 220876), ('a', 212757)]\n",
+ "Sample data [5, 6, 4, 1285, 62, 1285, 2501, 10, 564, 11] ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in']\n"
+ ]
+ }
+ ],
+ "source": [
+ "concat = ' '.join(left + right).split()\n",
+ "vocabulary_size = len(list(set(concat)))\n",
+ "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
+ "print('vocab from size: %d'%(vocabulary_size))\n",
+ "print('Most common words', count[4:10])\n",
+ "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def position_encoding(inputs):\n",
+ " T = tf.shape(inputs)[1]\n",
+ " repr_dim = inputs.get_shape()[-1].value\n",
+ " pos = tf.reshape(tf.range(0.0, tf.to_float(T), dtype=tf.float32), [-1, 1])\n",
+ " i = np.arange(0, repr_dim, 2, np.float32)\n",
+ " denom = np.reshape(np.power(10000.0, i / repr_dim), [1, -1])\n",
+ " enc = tf.expand_dims(tf.concat([tf.sin(pos / denom), tf.cos(pos / denom)], 1), 0)\n",
+ " return tf.tile(enc, [tf.shape(inputs)[0], 1, 1])\n",
+ "\n",
+ "def layer_norm(inputs, epsilon=1e-8):\n",
+ " mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)\n",
+ " normalized = (inputs - mean) / (tf.sqrt(variance + epsilon))\n",
+ " params_shape = inputs.get_shape()[-1:]\n",
+ " gamma = tf.get_variable('gamma', params_shape, tf.float32, tf.ones_initializer())\n",
+ " beta = tf.get_variable('beta', params_shape, tf.float32, tf.zeros_initializer())\n",
+ " return gamma * normalized + beta\n",
+ "\n",
+ "def self_attention(inputs, is_training, num_units, num_heads = 8, activation=None):\n",
+ " T_q = T_k = tf.shape(inputs)[1]\n",
+ " Q_K_V = tf.layers.dense(inputs, 3*num_units, activation)\n",
+ " Q, K, V = tf.split(Q_K_V, 3, -1)\n",
+ " Q_ = tf.concat(tf.split(Q, num_heads, axis=2), 0)\n",
+ " K_ = tf.concat(tf.split(K, num_heads, axis=2), 0)\n",
+ " V_ = tf.concat(tf.split(V, num_heads, axis=2), 0)\n",
+ " align = tf.matmul(Q_, K_, transpose_b=True)\n",
+ " align *= tf.rsqrt(tf.to_float(K_.get_shape()[-1].value))\n",
+ " paddings = tf.fill(tf.shape(align), float('-inf'))\n",
+ " lower_tri = tf.ones([T_q, T_k])\n",
+ " lower_tri = tf.linalg.LinearOperatorLowerTriangular(lower_tri).to_dense()\n",
+ " masks = tf.tile(tf.expand_dims(lower_tri,0), [tf.shape(align)[0],1,1])\n",
+ " align = tf.where(tf.equal(masks, 0), paddings, align)\n",
+ " align = tf.nn.softmax(align)\n",
+ " align = tf.layers.dropout(align, 0.1, training=is_training) \n",
+ " x = tf.matmul(align, V_)\n",
+ " x = tf.concat(tf.split(x, num_heads, axis=0), 2)\n",
+ " x += inputs\n",
+ " x = layer_norm(x)\n",
+ " return x\n",
+ "\n",
+ "def ffn(inputs, hidden_dim, activation=tf.nn.relu):\n",
+ " x = tf.layers.conv1d(inputs, 4* hidden_dim, 1, activation=activation) \n",
+ " x = tf.layers.conv1d(x, hidden_dim, 1, activation=None)\n",
+ " x += inputs\n",
+ " x = layer_norm(x)\n",
+ " return x\n",
+ "\n",
+ "class Model:\n",
+ " def __init__(self, size_layer, num_layers, embedded_size,\n",
+ " dict_size, learning_rate, dropout, kernel_size = 5):\n",
+ " \n",
+ " def cnn(x, scope):\n",
+ " x += position_encoding(x)\n",
+ " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n",
+ " for n in range(num_layers):\n",
+ " with tf.variable_scope('attn_%d'%i,reuse=tf.AUTO_REUSE):\n",
+ " x = self_attention(x, True, size_layer)\n",
+ " with tf.variable_scope('ffn_%d'%i, reuse=tf.AUTO_REUSE):\n",
+ " x = ffn(x, size_layer)\n",
+ " \n",
+ " with tf.variable_scope('logits', reuse=tf.AUTO_REUSE):\n",
+ " return tf.layers.dense(x, size_layer)[:, -1]\n",
+ " \n",
+ " self.X_left = tf.placeholder(tf.int32, [None, None])\n",
+ " self.X_right = tf.placeholder(tf.int32, [None, None])\n",
+ " self.Y = tf.placeholder(tf.float32, [None])\n",
+ " self.batch_size = tf.shape(self.X_left)[0]\n",
+ " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
+ " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n",
+ " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n",
+ " \n",
+ " def contrastive_loss(y,d):\n",
+ " tmp= y * tf.square(d)\n",
+ " tmp2 = (1-y) * tf.square(tf.maximum((1 - d),0))\n",
+ " return tf.reduce_sum(tmp +tmp2)/tf.cast(self.batch_size,tf.float32)/2\n",
+ " \n",
+ " self.output_left = cnn(embedded_left, 'left')\n",
+ " self.output_right = cnn(embedded_right, 'right')\n",
+ " print(self.output_left, self.output_right)\n",
+ " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),\n",
+ " 1,keep_dims=True))\n",
+ " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),\n",
+ " 1,keep_dims=True)),\n",
+ " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),\n",
+ " 1,keep_dims=True))))\n",
+ " self.distance = tf.reshape(self.distance, [-1])\n",
+ " self.cost = contrastive_loss(self.Y,self.distance)\n",
+ " \n",
+ " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n",
+ " tf.rint(self.distance))\n",
+ " correct_predictions = tf.equal(self.temp_sim, self.Y)\n",
+ " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n",
+ " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "size_layer = 128\n",
+ "num_layers = 4\n",
+ "embedded_size = 128\n",
+ "learning_rate = 1e-4\n",
+ "maxlen = 50\n",
+ "batch_size = 128\n",
+ "dropout = 0.8"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.cross_validation import train_test_split\n",
+ "\n",
+ "vectors_left = str_idx(left, dictionary, maxlen)\n",
+ "vectors_right = str_idx(right, dictionary, maxlen)\n",
+ "train_X_left, test_X_left, train_X_right, test_X_right, train_Y, test_Y = train_test_split(vectors_left,\n",
+ " vectors_right,\n",
+ " label,\n",
+ " test_size = 0.2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Colocations handled automatically by placer.\n",
+ "WARNING:tensorflow:From :4: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n",
+ "WARNING:tensorflow:From :20: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.dense instead.\n",
+ "WARNING:tensorflow:From :33: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.dropout instead.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n",
+ "WARNING:tensorflow:From :41: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.conv1d instead.\n",
+ "Tensor(\"left/logits/strided_slice:0\", shape=(?, 128), dtype=float32) Tensor(\"right/logits/strided_slice:0\", shape=(?, 128), dtype=float32)\n",
+ "WARNING:tensorflow:From :80: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "keep_dims is deprecated, use keepdims instead\n",
+ "WARNING:tensorflow:From :84: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Deprecated in favor of operator or tf.math.divide.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n"
+ ]
+ }
+ ],
+ "source": [
+ "tf.reset_default_graph()\n",
+ "sess = tf.InteractiveSession()\n",
+ "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n",
+ "sess.run(tf.global_variables_initializer())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:41<00:00, 25.12it/s, accuracy=0.693, cost=0.1] \n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 65.48it/s, accuracy=0.711, cost=0.096] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:40, 25.16it/s, accuracy=0.703, cost=0.101] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.000000, current acc: 0.685201\n",
+ "time taken: 111.32214426994324\n",
+ "epoch: 0, training loss: 0.106726, training acc: 0.669383, valid loss: 0.103184, valid acc: 0.685201\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.08it/s, accuracy=0.733, cost=0.0915]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.03it/s, accuracy=0.722, cost=0.0919]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:41, 24.90it/s, accuracy=0.688, cost=0.104] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.685201, current acc: 0.701866\n",
+ "time taken: 110.18735837936401\n",
+ "epoch: 0, training loss: 0.100379, training acc: 0.691623, valid loss: 0.098808, valid acc: 0.701866\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.11it/s, accuracy=0.733, cost=0.0892]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.03it/s, accuracy=0.678, cost=0.095] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:39, 25.28it/s, accuracy=0.711, cost=0.0951]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.701866, current acc: 0.712456\n",
+ "time taken: 110.06335616111755\n",
+ "epoch: 0, training loss: 0.096448, training acc: 0.707221, valid loss: 0.096495, valid acc: 0.712456\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.09it/s, accuracy=0.743, cost=0.0927]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 66.97it/s, accuracy=0.644, cost=0.0971]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:39, 25.36it/s, accuracy=0.719, cost=0.0931]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.712456, current acc: 0.715025\n",
+ "time taken: 110.16492295265198\n",
+ "epoch: 0, training loss: 0.093926, training acc: 0.717781, valid loss: 0.095615, valid acc: 0.715025\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.03it/s, accuracy=0.752, cost=0.0877]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 66.98it/s, accuracy=0.678, cost=0.097] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:41, 24.84it/s, accuracy=0.688, cost=0.0955]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.715025, current acc: 0.721843\n",
+ "time taken: 110.38844656944275\n",
+ "epoch: 0, training loss: 0.092020, training acc: 0.726040, valid loss: 0.094243, valid acc: 0.721843\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.11it/s, accuracy=0.723, cost=0.0882]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.09it/s, accuracy=0.667, cost=0.0952]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:41, 24.93it/s, accuracy=0.75, cost=0.0906] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.721843, current acc: 0.722270\n",
+ "time taken: 110.06278610229492\n",
+ "epoch: 0, training loss: 0.090355, training acc: 0.733065, valid loss: 0.093710, valid acc: 0.722270\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.03it/s, accuracy=0.752, cost=0.086] \n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 66.96it/s, accuracy=0.7, cost=0.0953] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:41, 24.94it/s, accuracy=0.742, cost=0.0918]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.722270, current acc: 0.725934\n",
+ "time taken: 110.40167164802551\n",
+ "epoch: 0, training loss: 0.088796, training acc: 0.739814, valid loss: 0.092955, valid acc: 0.725934\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.15it/s, accuracy=0.762, cost=0.0806]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.26it/s, accuracy=0.689, cost=0.096] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:41, 24.84it/s, accuracy=0.781, cost=0.0892]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 109.86811327934265\n",
+ "epoch: 0, training loss: 0.087358, training acc: 0.746224, valid loss: 0.092556, valid acc: 0.725335\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.04it/s, accuracy=0.762, cost=0.0808]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.34it/s, accuracy=0.7, cost=0.0938] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:38, 25.63it/s, accuracy=0.805, cost=0.0879]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.725934, current acc: 0.729039\n",
+ "time taken: 110.31477642059326\n",
+ "epoch: 0, training loss: 0.085995, training acc: 0.751777, valid loss: 0.091761, valid acc: 0.729039\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.26it/s, accuracy=0.743, cost=0.0775]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.20it/s, accuracy=0.722, cost=0.0949]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:40, 25.17it/s, accuracy=0.727, cost=0.0899]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.729039, current acc: 0.730447\n",
+ "time taken: 109.4636116027832\n",
+ "epoch: 0, training loss: 0.084593, training acc: 0.756880, valid loss: 0.091620, valid acc: 0.730447\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.15it/s, accuracy=0.792, cost=0.0763]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 66.96it/s, accuracy=0.711, cost=0.0971]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:39, 25.33it/s, accuracy=0.781, cost=0.0882]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.730447, current acc: 0.732334\n",
+ "time taken: 109.93308997154236\n",
+ "epoch: 0, training loss: 0.083287, training acc: 0.762669, valid loss: 0.091151, valid acc: 0.732334\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.26it/s, accuracy=0.772, cost=0.0729]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.32it/s, accuracy=0.678, cost=0.098] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:39, 25.40it/s, accuracy=0.781, cost=0.0819]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.732334, current acc: 0.732491\n",
+ "time taken: 109.41248917579651\n",
+ "epoch: 0, training loss: 0.082038, training acc: 0.767324, valid loss: 0.090638, valid acc: 0.732491\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.21it/s, accuracy=0.772, cost=0.0769]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.24it/s, accuracy=0.711, cost=0.0949]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:38, 25.54it/s, accuracy=0.781, cost=0.0809]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.732491, current acc: 0.734844\n",
+ "time taken: 109.63890266418457\n",
+ "epoch: 0, training loss: 0.080769, training acc: 0.772957, valid loss: 0.090315, valid acc: 0.734844\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.16it/s, accuracy=0.822, cost=0.0687]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.61it/s, accuracy=0.744, cost=0.0907]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:39, 25.38it/s, accuracy=0.781, cost=0.0854]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 109.79329133033752\n",
+ "epoch: 0, training loss: 0.079631, training acc: 0.777117, valid loss: 0.090068, valid acc: 0.734180\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.25it/s, accuracy=0.822, cost=0.0702]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.38it/s, accuracy=0.722, cost=0.091] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:40, 25.05it/s, accuracy=0.781, cost=0.0819]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.734844, current acc: 0.735022\n",
+ "time taken: 109.46223187446594\n",
+ "epoch: 0, training loss: 0.078417, training acc: 0.781514, valid loss: 0.089608, valid acc: 0.735022\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.14it/s, accuracy=0.782, cost=0.0686]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 66.88it/s, accuracy=0.711, cost=0.0945]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:40, 25.15it/s, accuracy=0.75, cost=0.0856] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.735022, current acc: 0.737936\n",
+ "time taken: 109.98049426078796\n",
+ "epoch: 0, training loss: 0.077204, training acc: 0.786631, valid loss: 0.089129, valid acc: 0.737936\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:39<00:00, 25.27it/s, accuracy=0.792, cost=0.0682]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.62it/s, accuracy=0.722, cost=0.0938]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:38, 25.51it/s, accuracy=0.836, cost=0.0775]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.737936, current acc: 0.739277\n",
+ "time taken: 109.33117318153381\n",
+ "epoch: 0, training loss: 0.076121, training acc: 0.790172, valid loss: 0.089027, valid acc: 0.739277\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.23it/s, accuracy=0.832, cost=0.067] \n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.35it/s, accuracy=0.7, cost=0.0949] \n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:38, 25.66it/s, accuracy=0.82, cost=0.0774] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.739277, current acc: 0.739749\n",
+ "time taken: 109.55670094490051\n",
+ "epoch: 0, training loss: 0.074985, training acc: 0.794015, valid loss: 0.088705, valid acc: 0.739749\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:39<00:00, 25.35it/s, accuracy=0.812, cost=0.0635]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.36it/s, accuracy=0.711, cost=0.0848]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:40, 25.13it/s, accuracy=0.797, cost=0.0756]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.739749, current acc: 0.740187\n",
+ "time taken: 109.05358052253723\n",
+ "epoch: 0, training loss: 0.074041, training acc: 0.797890, valid loss: 0.088700, valid acc: 0.740187\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.26it/s, accuracy=0.842, cost=0.0616]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.31it/s, accuracy=0.689, cost=0.0933]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:39, 25.36it/s, accuracy=0.773, cost=0.0746]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 109.43666005134583\n",
+ "epoch: 0, training loss: 0.072876, training acc: 0.801452, valid loss: 0.088649, valid acc: 0.739768\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.21it/s, accuracy=0.871, cost=0.0602]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.55it/s, accuracy=0.689, cost=0.0911]\n",
+ "train minibatch loop: 0%| | 3/2527 [00:00<01:37, 25.84it/s, accuracy=0.812, cost=0.0774]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 109.58015727996826\n",
+ "epoch: 0, training loss: 0.071968, training acc: 0.804654, valid loss: 0.088769, valid acc: 0.738841\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [01:40<00:00, 25.22it/s, accuracy=0.822, cost=0.0614]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:09<00:00, 67.43it/s, accuracy=0.689, cost=0.0998]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 109.57378196716309\n",
+ "epoch: 0, training loss: 0.070959, training acc: 0.809158, valid loss: 0.088572, valid acc: 0.739855\n",
+ "\n",
+ "break epoch:0\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "\n",
+ "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 3, 0, 0, 0\n",
+ "\n",
+ "while True:\n",
+ " lasttime = time.time()\n",
+ " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n",
+ " print('break epoch:%d\\n' % (EPOCH))\n",
+ " break\n",
+ "\n",
+ " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
+ " pbar = tqdm(range(0, len(train_X_left), batch_size), desc='train minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x_left = train_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_x_right = train_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " batch_y = train_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
+ " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
+ " feed_dict = {model.X_left : batch_x_left, \n",
+ " model.X_right: batch_x_right,\n",
+ " model.Y : batch_y})\n",
+ " assert not np.isnan(loss)\n",
+ " train_loss += loss\n",
+ " train_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " pbar = tqdm(range(0, len(test_X_left), batch_size), desc='test minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x_left = test_X_left[i:min(i+batch_size,test_X_left.shape[0])]\n",
+ " batch_x_right = test_X_right[i:min(i+batch_size,test_X_left.shape[0])]\n",
+ " batch_y = test_Y[i:min(i+batch_size,test_X_left.shape[0])]\n",
+ " acc, loss = sess.run([model.accuracy, model.cost], \n",
+ " feed_dict = {model.X_left : batch_x_left, \n",
+ " model.X_right: batch_x_right,\n",
+ " model.Y : batch_y})\n",
+ " test_loss += loss\n",
+ " test_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " train_loss /= (len(train_X_left) / batch_size)\n",
+ " train_acc /= (len(train_X_left) / batch_size)\n",
+ " test_loss /= (len(test_X_left) / batch_size)\n",
+ " test_acc /= (len(test_X_left) / batch_size)\n",
+ " \n",
+ " if test_acc > CURRENT_ACC:\n",
+ " print(\n",
+ " 'epoch: %d, pass acc: %f, current acc: %f'\n",
+ " % (EPOCH, CURRENT_ACC, test_acc)\n",
+ " )\n",
+ " CURRENT_ACC = test_acc\n",
+ " CURRENT_CHECKPOINT = 0\n",
+ " else:\n",
+ " CURRENT_CHECKPOINT += 1\n",
+ " \n",
+ " print('time taken:', time.time()-lasttime)\n",
+ " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
+ " train_acc,test_loss,\n",
+ " test_acc))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[array([0.], dtype=float32), array([0.13981318], dtype=float32)]"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n",
+ "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n",
+ "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
+ " model.X_right: right})"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/text-similarity/4.dilated-cnn-crossentropy.ipynb b/text-similarity/4.dilated-cnn-crossentropy.ipynb
new file mode 100644
index 0000000..40bc3dd
--- /dev/null
+++ b/text-similarity/4.dilated-cnn-crossentropy.ipynb
@@ -0,0 +1,746 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/jupyter/.local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
+ " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import tensorflow as tf\n",
+ "import re\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from tqdm import tqdm\n",
+ "import collections\n",
+ "from unidecode import unidecode\n",
+ "from sklearn.cross_validation import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_dataset(words, n_words):\n",
+ " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3], ['SEPARATOR', 4]]\n",
+ " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
+ " dictionary = dict()\n",
+ " for word, _ in count:\n",
+ " dictionary[word] = len(dictionary)\n",
+ " data = list()\n",
+ " unk_count = 0\n",
+ " for word in words:\n",
+ " index = dictionary.get(word, 0)\n",
+ " if index == 0:\n",
+ " unk_count += 1\n",
+ " data.append(index)\n",
+ " count[0][1] = unk_count\n",
+ " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
+ " return data, count, dictionary, reversed_dictionary\n",
+ "\n",
+ "def str_idx(corpus, dic, maxlen, UNK=3):\n",
+ " X = np.zeros((len(corpus),maxlen))\n",
+ " for i in range(len(corpus)):\n",
+ " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
+ " val = dic[k] if k in dic else UNK\n",
+ " X[i,-1 - no]= val\n",
+ " return X\n",
+ "\n",
+ "def cleaning(string):\n",
+ " string = unidecode(string).replace('.', ' . ').replace(',', ' , ')\n",
+ " string = re.sub('[^A-Za-z\\- ]+', ' ', string)\n",
+ " string = re.sub(r'[ ]+', ' ', string).strip()\n",
+ " return string.lower()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " qid1 | \n",
+ " qid2 | \n",
+ " question1 | \n",
+ " question2 | \n",
+ " is_duplicate | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 0 | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 1 | \n",
+ " 3 | \n",
+ " 4 | \n",
+ " What is the story of Kohinoor (Koh-i-Noor) Dia... | \n",
+ " What would happen if the Indian government sto... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 5 | \n",
+ " 6 | \n",
+ " How can I increase the speed of my internet co... | \n",
+ " How can Internet speed be increased by hacking... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 3 | \n",
+ " 7 | \n",
+ " 8 | \n",
+ " Why am I mentally very lonely? How can I solve... | \n",
+ " Find the remainder when [math]23^{24}[/math] i... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 4 | \n",
+ " 9 | \n",
+ " 10 | \n",
+ " Which one dissolve in water quikly sugar, salt... | \n",
+ " Which fish would survive in salt water? | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id qid1 qid2 question1 \\\n",
+ "0 0 1 2 What is the step by step guide to invest in sh... \n",
+ "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
+ "2 2 5 6 How can I increase the speed of my internet co... \n",
+ "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
+ "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
+ "\n",
+ " question2 is_duplicate \n",
+ "0 What is the step by step guide to invest in sh... 0 \n",
+ "1 What would happen if the Indian government sto... 0 \n",
+ "2 How can Internet speed be increased by hacking... 0 \n",
+ "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
+ "4 Which fish would survive in salt water? 0 "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('quora_duplicate_questions.tsv', delimiter='\\t').dropna()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "left, right, label = df['question1'].tolist(), df['question2'].tolist(), df['is_duplicate'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(array([0, 1]), array([255024, 149263]))"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.unique(label, return_counts = True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 404287/404287 [00:07<00:00, 51783.93it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "for i in tqdm(range(len(left))):\n",
+ " left[i] = cleaning(left[i])\n",
+ " right[i] = cleaning(right[i])\n",
+ " left[i] = left[i] + ' SEPARATOR ' + right[i]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "vocab from size: 87662\n",
+ "Most common words [['SEPARATOR', 4], ('SEPARATOR', 404287), ('the', 377593), ('what', 324635), ('is', 269934), ('i', 223893)]\n",
+ "Sample data [6, 7, 5, 1286, 63, 1286, 2502, 11, 565, 12] ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in']\n"
+ ]
+ }
+ ],
+ "source": [
+ "concat = ' '.join(left).split()\n",
+ "vocabulary_size = len(list(set(concat)))\n",
+ "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
+ "print('vocab from size: %d'%(vocabulary_size))\n",
+ "print('Most common words', count[4:10])\n",
+ "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def position_encoding(inputs):\n",
+ " T = tf.shape(inputs)[1]\n",
+ " repr_dim = inputs.get_shape()[-1].value\n",
+ " pos = tf.reshape(tf.range(0.0, tf.to_float(T), dtype=tf.float32), [-1, 1])\n",
+ " i = np.arange(0, repr_dim, 2, np.float32)\n",
+ " denom = np.reshape(np.power(10000.0, i / repr_dim), [1, -1])\n",
+ " enc = tf.expand_dims(tf.concat([tf.sin(pos / denom), tf.cos(pos / denom)], 1), 0)\n",
+ " return tf.tile(enc, [tf.shape(inputs)[0], 1, 1])\n",
+ "\n",
+ "def layer_norm(inputs, epsilon=1e-8):\n",
+ " mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)\n",
+ " normalized = (inputs - mean) / (tf.sqrt(variance + epsilon))\n",
+ " params_shape = inputs.get_shape()[-1:]\n",
+ " gamma = tf.get_variable('gamma', params_shape, tf.float32, tf.ones_initializer())\n",
+ " beta = tf.get_variable('beta', params_shape, tf.float32, tf.zeros_initializer())\n",
+ " return gamma * normalized + beta\n",
+ "\n",
+ "def cnn_block(x, dilation_rate, pad_sz, hidden_dim, kernel_size):\n",
+ " x = layer_norm(x)\n",
+ " pad = tf.zeros([tf.shape(x)[0], pad_sz, hidden_dim])\n",
+ " x = tf.layers.conv1d(inputs = tf.concat([pad, x, pad], 1),\n",
+ " filters = hidden_dim,\n",
+ " kernel_size = kernel_size,\n",
+ " dilation_rate = dilation_rate)\n",
+ " x = x[:, :-pad_sz, :]\n",
+ " x = tf.nn.relu(x)\n",
+ " return x\n",
+ "\n",
+ "class Model:\n",
+ " def __init__(self, size_layer, num_layers, embedded_size,\n",
+ " dict_size, learning_rate, dropout, kernel_size = 5):\n",
+ " \n",
+ " def cnn(x, scope):\n",
+ " x += position_encoding(x)\n",
+ " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n",
+ " for n in range(num_layers):\n",
+ " dilation_rate = 2 ** n\n",
+ " pad_sz = (kernel_size - 1) * dilation_rate \n",
+ " with tf.variable_scope('block_%d'%i,reuse=tf.AUTO_REUSE):\n",
+ " x += cnn_block(x, dilation_rate, pad_sz, size_layer, kernel_size)\n",
+ " \n",
+ " with tf.variable_scope('logits', reuse=tf.AUTO_REUSE):\n",
+ " return tf.layers.dense(x, size_layer)[:, -1]\n",
+ " \n",
+ " self.X = tf.placeholder(tf.int32, [None, None])\n",
+ " self.Y = tf.placeholder(tf.int32, [None])\n",
+ " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
+ " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X)\n",
+ " \n",
+ " self.logits = cnn(embedded_left, 'left')\n",
+ " self.cost = tf.reduce_mean(\n",
+ " tf.nn.sparse_softmax_cross_entropy_with_logits(\n",
+ " logits = self.logits, labels = self.Y\n",
+ " )\n",
+ " )\n",
+ " \n",
+ " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)\n",
+ " correct_pred = tf.equal(\n",
+ " tf.argmax(self.logits, 1, output_type = tf.int32), self.Y\n",
+ " )\n",
+ " self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "size_layer = 128\n",
+ "num_layers = 4\n",
+ "embedded_size = 128\n",
+ "learning_rate = 1e-3\n",
+ "maxlen = 50\n",
+ "batch_size = 128\n",
+ "dropout = 0.8"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.cross_validation import train_test_split\n",
+ "\n",
+ "vectors = str_idx(left, dictionary, maxlen)\n",
+ "train_X, test_X, train_Y, test_Y = train_test_split(vectors, label, test_size = 0.2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py:1702: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).\n",
+ " warnings.warn('An interactive session is already active. This can '\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n"
+ ]
+ }
+ ],
+ "source": [
+ "tf.reset_default_graph()\n",
+ "sess = tf.InteractiveSession()\n",
+ "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n",
+ "sess.run(tf.global_variables_initializer())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:32<00:00, 77.42it/s, accuracy=0.584, cost=0.645]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 241.53it/s, accuracy=0.678, cost=0.624]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.72it/s, accuracy=0.664, cost=0.638]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.000000, current acc: 0.649024\n",
+ "time taken: 35.25988554954529\n",
+ "epoch: 0, training loss: 0.639172, training acc: 0.645532, valid loss: 0.625583, valid acc: 0.649024\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.78it/s, accuracy=0.653, cost=0.605]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 249.91it/s, accuracy=0.756, cost=0.568]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.57it/s, accuracy=0.648, cost=0.625]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.649024, current acc: 0.686088\n",
+ "time taken: 33.05776762962341\n",
+ "epoch: 0, training loss: 0.599088, training acc: 0.681601, valid loss: 0.593228, valid acc: 0.686088\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.65it/s, accuracy=0.703, cost=0.568]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 250.40it/s, accuracy=0.7, cost=0.548] \n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 83.52it/s, accuracy=0.672, cost=0.615]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.686088, current acc: 0.700928\n",
+ "time taken: 33.1018283367157\n",
+ "epoch: 0, training loss: 0.572584, training acc: 0.705614, valid loss: 0.578908, valid acc: 0.700928\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.66it/s, accuracy=0.723, cost=0.556]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 251.01it/s, accuracy=0.778, cost=0.521]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 83.45it/s, accuracy=0.703, cost=0.604]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.700928, current acc: 0.705392\n",
+ "time taken: 33.0923171043396\n",
+ "epoch: 0, training loss: 0.550349, training acc: 0.723289, valid loss: 0.573883, valid acc: 0.705392\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.63it/s, accuracy=0.733, cost=0.545]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 249.63it/s, accuracy=0.767, cost=0.526]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 83.85it/s, accuracy=0.727, cost=0.582]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.705392, current acc: 0.706215\n",
+ "time taken: 33.11521649360657\n",
+ "epoch: 0, training loss: 0.530263, training acc: 0.737710, valid loss: 0.574223, valid acc: 0.706215\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.78it/s, accuracy=0.703, cost=0.507]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 249.87it/s, accuracy=0.722, cost=0.548]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.88it/s, accuracy=0.68, cost=0.566] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.706215, current acc: 0.712823\n",
+ "time taken: 33.06076192855835\n",
+ "epoch: 0, training loss: 0.512012, training acc: 0.749806, valid loss: 0.572262, valid acc: 0.712823\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.65it/s, accuracy=0.713, cost=0.539]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 249.05it/s, accuracy=0.711, cost=0.54] \n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.51it/s, accuracy=0.672, cost=0.576]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.712823, current acc: 0.715365\n",
+ "time taken: 33.11697006225586\n",
+ "epoch: 0, training loss: 0.495308, training acc: 0.760959, valid loss: 0.575378, valid acc: 0.715365\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.82it/s, accuracy=0.713, cost=0.55] \n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 250.17it/s, accuracy=0.689, cost=0.578]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.39it/s, accuracy=0.719, cost=0.558]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.715365, current acc: 0.718076\n",
+ "time taken: 33.03995633125305\n",
+ "epoch: 0, training loss: 0.480132, training acc: 0.770668, valid loss: 0.576161, valid acc: 0.718076\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.96it/s, accuracy=0.723, cost=0.532]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 250.37it/s, accuracy=0.689, cost=0.571]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 83.53it/s, accuracy=0.734, cost=0.556]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.718076, current acc: 0.718397\n",
+ "time taken: 32.98867201805115\n",
+ "epoch: 0, training loss: 0.466953, training acc: 0.778197, valid loss: 0.585377, valid acc: 0.718397\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.74it/s, accuracy=0.693, cost=0.579]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 250.47it/s, accuracy=0.722, cost=0.579]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 83.07it/s, accuracy=0.727, cost=0.532]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.718397, current acc: 0.719860\n",
+ "time taken: 33.069570541381836\n",
+ "epoch: 0, training loss: 0.454996, training acc: 0.786085, valid loss: 0.589913, valid acc: 0.719860\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.83it/s, accuracy=0.703, cost=0.545]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 250.12it/s, accuracy=0.744, cost=0.56] \n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.83it/s, accuracy=0.711, cost=0.518]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.719860, current acc: 0.722752\n",
+ "time taken: 33.03630042076111\n",
+ "epoch: 0, training loss: 0.443845, training acc: 0.792981, valid loss: 0.597150, valid acc: 0.722752\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.84it/s, accuracy=0.743, cost=0.536]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 249.00it/s, accuracy=0.744, cost=0.57] \n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.01it/s, accuracy=0.75, cost=0.504] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 33.04733848571777\n",
+ "epoch: 0, training loss: 0.433595, training acc: 0.798370, valid loss: 0.605825, valid acc: 0.720378\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.71it/s, accuracy=0.762, cost=0.505]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 250.76it/s, accuracy=0.756, cost=0.567]\n",
+ "train minibatch loop: 0%| | 9/2527 [00:00<00:30, 82.74it/s, accuracy=0.75, cost=0.51] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 33.075902462005615\n",
+ "epoch: 0, training loss: 0.423926, training acc: 0.803343, valid loss: 0.617053, valid acc: 0.721669\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:30<00:00, 82.80it/s, accuracy=0.723, cost=0.501]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:02<00:00, 251.00it/s, accuracy=0.778, cost=0.559]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 33.04087018966675\n",
+ "epoch: 0, training loss: 0.415806, training acc: 0.808235, valid loss: 0.627675, valid acc: 0.719070\n",
+ "\n",
+ "break epoch:0\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "\n",
+ "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 3, 0, 0, 0\n",
+ "\n",
+ "while True:\n",
+ " lasttime = time.time()\n",
+ " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n",
+ " print('break epoch:%d\\n' % (EPOCH))\n",
+ " break\n",
+ "\n",
+ " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
+ " pbar = tqdm(range(0, len(train_X), batch_size), desc='train minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x = train_X[i:min(i+batch_size,train_X.shape[0])]\n",
+ " batch_y = train_Y[i:min(i+batch_size,train_X.shape[0])]\n",
+ " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
+ " feed_dict = {model.X : batch_x,\n",
+ " model.Y : batch_y})\n",
+ " assert not np.isnan(loss)\n",
+ " train_loss += loss\n",
+ " train_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " pbar = tqdm(range(0, len(test_X), batch_size), desc='test minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x = test_X[i:min(i+batch_size,test_X.shape[0])]\n",
+ " batch_y = test_Y[i:min(i+batch_size,test_X.shape[0])]\n",
+ " acc, loss = sess.run([model.accuracy, model.cost], \n",
+ " feed_dict = {model.X : batch_x,\n",
+ " model.Y : batch_y})\n",
+ " test_loss += loss\n",
+ " test_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " train_loss /= (len(train_X) / batch_size)\n",
+ " train_acc /= (len(train_X) / batch_size)\n",
+ " test_loss /= (len(test_X) / batch_size)\n",
+ " test_acc /= (len(test_X) / batch_size)\n",
+ " \n",
+ " if test_acc > CURRENT_ACC:\n",
+ " print(\n",
+ " 'epoch: %d, pass acc: %f, current acc: %f'\n",
+ " % (EPOCH, CURRENT_ACC, test_acc)\n",
+ " )\n",
+ " CURRENT_ACC = test_acc\n",
+ " CURRENT_CHECKPOINT = 0\n",
+ " else:\n",
+ " CURRENT_CHECKPOINT += 1\n",
+ " \n",
+ " print('time taken:', time.time()-lasttime)\n",
+ " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
+ " train_acc,test_loss,\n",
+ " test_acc))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/text-similarity/4.sentence-similarity-batchall-tripletloss.ipynb b/text-similarity/4.sentence-similarity-batchall-tripletloss.ipynb
deleted file mode 100644
index d0f9f8f..0000000
--- a/text-similarity/4.sentence-similarity-batchall-tripletloss.ipynb
+++ /dev/null
@@ -1,502 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import collections\n",
- "import random\n",
- "import tensorflow as tf"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "def build_dataset(words, n_words):\n",
- " count = [['GO', 0], ['PAD', 1], ['EOS', 2], ['UNK', 3]]\n",
- " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
- " dictionary = dict()\n",
- " for word, _ in count:\n",
- " dictionary[word] = len(dictionary)\n",
- " data = list()\n",
- " unk_count = 0\n",
- " for word in words:\n",
- " index = dictionary.get(word, 0)\n",
- " if index == 0:\n",
- " unk_count += 1\n",
- " data.append(index)\n",
- " count[0][1] = unk_count\n",
- " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
- " return data, count, dictionary, reversed_dictionary\n",
- "\n",
- "def str_idx(corpus, dic, maxlen, UNK=3):\n",
- " X = np.zeros((len(corpus),maxlen))\n",
- " for i in range(len(corpus)):\n",
- " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
- " val = dic[k] if k in dic else UNK\n",
- " X[i,-1 - no]= val\n",
- " return X\n",
- "\n",
- "def load_data(filepath):\n",
- " x1=[]\n",
- " x2=[]\n",
- " y=[]\n",
- " for line in open(filepath):\n",
- " l=line.strip().split(\"\\t\")\n",
- " if len(l)<2:\n",
- " continue\n",
- " if random.random() > 0.5:\n",
- " x1.append(l[0].lower())\n",
- " x2.append(l[1].lower())\n",
- " else:\n",
- " x1.append(l[1].lower())\n",
- " x2.append(l[0].lower())\n",
- " y.append(int(l[2]))\n",
- " return np.array(x1),np.array(x2),np.array(y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "X1_text, X2_text, Y = load_data('train_snli.txt')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "vocab from size: 47170\n",
- "Most common words [('a', 959179), ('the', 341846), ('in', 273772), ('is', 248868), ('man', 173742), ('on', 154293)]\n",
- "Sample data [4, 38, 7, 17, 4, 16491, 2691, 20, 29356, 4] ['a', 'person', 'is', 'at', 'a', 'diner,', 'ordering', 'an', 'omelette.', 'a']\n"
- ]
- }
- ],
- "source": [
- "concat = (' '.join(X1_text.tolist() + X2_text.tolist())).split()\n",
- "vocabulary_size = len(list(set(concat)))\n",
- "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
- "print('vocab from size: %d'%(vocabulary_size))\n",
- "print('Most common words', count[4:10])\n",
- "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "def _pairwise_distances(embeddings_left, embeddings_right, squared=False):\n",
- " dot_product = tf.matmul(embeddings_left, \n",
- " tf.transpose(embeddings_right))\n",
- " square_norm = tf.diag_part(dot_product)\n",
- " distances = tf.expand_dims(square_norm, 1) - 2.0 * dot_product + tf.expand_dims(square_norm, 0)\n",
- " distances = tf.maximum(distances, 0.0)\n",
- "\n",
- " if not squared:\n",
- " mask = tf.to_float(tf.equal(distances, 0.0))\n",
- " distances = distances + mask * 1e-16\n",
- " distances = tf.sqrt(distances)\n",
- " distances = distances * (1.0 - mask)\n",
- "\n",
- " return distances\n",
- "\n",
- "\n",
- "def _get_anchor_positive_triplet_mask(labels):\n",
- " indices_equal = tf.cast(tf.eye(tf.shape(labels)[0]), tf.bool)\n",
- " indices_not_equal = tf.logical_not(indices_equal)\n",
- " labels_equal = tf.equal(tf.expand_dims(labels, 0), tf.expand_dims(labels, 1))\n",
- " mask = tf.logical_and(indices_not_equal, labels_equal)\n",
- "\n",
- " return mask\n",
- "\n",
- "\n",
- "def _get_anchor_negative_triplet_mask(labels):\n",
- " labels_equal = tf.equal(tf.expand_dims(labels, 0), tf.expand_dims(labels, 1))\n",
- " mask = tf.logical_not(labels_equal)\n",
- "\n",
- " return mask\n",
- "\n",
- "def _get_triplet_mask(labels):\n",
- " indices_equal = tf.cast(tf.eye(tf.shape(labels)[0]), tf.bool)\n",
- " indices_not_equal = tf.logical_not(indices_equal)\n",
- " i_not_equal_j = tf.expand_dims(indices_not_equal, 2)\n",
- " i_not_equal_k = tf.expand_dims(indices_not_equal, 1)\n",
- " j_not_equal_k = tf.expand_dims(indices_not_equal, 0)\n",
- "\n",
- " distinct_indices = tf.logical_and(tf.logical_and(i_not_equal_j, i_not_equal_k), j_not_equal_k)\n",
- "\n",
- " label_equal = tf.equal(tf.expand_dims(labels, 0), tf.expand_dims(labels, 1))\n",
- " i_equal_j = tf.expand_dims(label_equal, 2)\n",
- " i_equal_k = tf.expand_dims(label_equal, 1)\n",
- "\n",
- " valid_labels = tf.logical_and(i_equal_j, tf.logical_not(i_equal_k))\n",
- " mask = tf.logical_and(distinct_indices, valid_labels)\n",
- "\n",
- " return mask\n",
- "def batch_all_triplet_loss(labels, embeddings_left, embeddings_right, margin, squared=False):\n",
- " pairwise_dist = _pairwise_distances(embeddings_left, embeddings_right, squared=squared)\n",
- "\n",
- " anchor_positive_dist = tf.expand_dims(pairwise_dist, 2)\n",
- " assert anchor_positive_dist.shape[2] == 1, \"{}\".format(anchor_positive_dist.shape)\n",
- " anchor_negative_dist = tf.expand_dims(pairwise_dist, 1)\n",
- " assert anchor_negative_dist.shape[1] == 1, \"{}\".format(anchor_negative_dist.shape)\n",
- "\n",
- " triplet_loss = anchor_positive_dist - anchor_negative_dist + margin\n",
- "\n",
- " mask = _get_triplet_mask(labels)\n",
- " mask = tf.to_float(mask)\n",
- " triplet_loss = tf.multiply(mask, triplet_loss)\n",
- "\n",
- " triplet_loss = tf.maximum(triplet_loss, 0.0)\n",
- "\n",
- " valid_triplets = tf.to_float(tf.greater(triplet_loss, 1e-16))\n",
- " num_positive_triplets = tf.reduce_sum(valid_triplets)\n",
- " num_valid_triplets = tf.reduce_sum(mask)\n",
- " fraction_positive_triplets = num_positive_triplets / (num_valid_triplets + 1e-16)\n",
- "\n",
- " triplet_loss = tf.reduce_sum(triplet_loss) / (num_positive_triplets + 1e-16)\n",
- "\n",
- " return triplet_loss, fraction_positive_triplets"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [],
- "source": [
- "class Model:\n",
- " def __init__(self, size_layer, num_layers, embedded_size,\n",
- " dict_size, learning_rate, dimension_output):\n",
- " \n",
- " def cells(reuse=False):\n",
- " return tf.nn.rnn_cell.LSTMCell(size_layer,\n",
- " initializer=tf.orthogonal_initializer(),reuse=reuse)\n",
- " \n",
- " def rnn(inputs, reuse=False):\n",
- " with tf.variable_scope('model', reuse = reuse):\n",
- " rnn_cells = tf.nn.rnn_cell.MultiRNNCell([cells() for _ in range(num_layers)])\n",
- " outputs, _ = tf.nn.dynamic_rnn(rnn_cells, inputs, dtype = tf.float32)\n",
- " return tf.layers.dense(outputs[:,-1], dimension_output)\n",
- " \n",
- " self.X_left = tf.placeholder(tf.int32, [None, None])\n",
- " self.X_right = tf.placeholder(tf.int32, [None, None])\n",
- " self.Y = tf.placeholder(tf.float32, [None])\n",
- " self.batch_size = tf.shape(self.X_left)[0]\n",
- " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
- " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X_left)\n",
- " embedded_right = tf.nn.embedding_lookup(encoder_embeddings, self.X_right)\n",
- " \n",
- " self.output_left = rnn(embedded_left, False)\n",
- " self.output_right = rnn(embedded_right, True)\n",
- " \n",
- " self.cost, fraction = batch_all_triplet_loss(self.Y, self.output_left, \n",
- " self.output_right, margin=0.5, squared=False)\n",
- " \n",
- " self.distance = tf.sqrt(tf.reduce_sum(tf.square(tf.subtract(self.output_left,self.output_right)),1,keep_dims=True))\n",
- " self.distance = tf.div(self.distance, tf.add(tf.sqrt(tf.reduce_sum(tf.square(self.output_left),1,keep_dims=True)),\n",
- " tf.sqrt(tf.reduce_sum(tf.square(self.output_right),1,keep_dims=True))))\n",
- " self.distance = tf.reshape(self.distance, [-1])\n",
- " \n",
- " self.temp_sim = tf.subtract(tf.ones_like(self.distance),\n",
- " tf.rint(self.distance))\n",
- " correct_predictions = tf.equal(self.temp_sim, self.Y)\n",
- " self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, \"float\"))\n",
- " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "size_layer = 256\n",
- "num_layers = 2\n",
- "embedded_size = 128\n",
- "learning_rate = 1e-3\n",
- "dimension_output = 300\n",
- "maxlen = 50\n",
- "batch_size = 128"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From :29: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "keep_dims is deprecated, use keepdims instead\n"
- ]
- }
- ],
- "source": [
- "tf.reset_default_graph()\n",
- "sess = tf.InteractiveSession()\n",
- "model = Model(size_layer,num_layers,embedded_size,len(dictionary),\n",
- " learning_rate,dimension_output)\n",
- "sess.run(tf.global_variables_initializer())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
- " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
- ]
- }
- ],
- "source": [
- "from sklearn.cross_validation import train_test_split\n",
- "\n",
- "vectors_left = str_idx(X1_text, dictionary, maxlen)\n",
- "vectors_right = str_idx(X2_text, dictionary, maxlen)\n",
- "train_X_left, test_X_left, train_X_right, test_X_right, train_Y, test_Y = train_test_split(vectors_left,\n",
- " vectors_right,\n",
- " Y,\n",
- " test_size = 0.2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [04:34<00:00, 8.38it/s, accuracy=0.4, cost=0.488] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:24<00:00, 24.62it/s, accuracy=0, cost=0] \n",
- "train minibatch loop: 0%| | 1/2297 [00:00<04:32, 8.44it/s, accuracy=0.469, cost=0.5]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 298.59579825401306\n",
- "epoch: 0, training loss: 0.500898, training acc: 0.499684, valid loss: 0.499920, valid acc: 0.498360\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [04:34<00:00, 8.38it/s, accuracy=0.4, cost=0.431] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:24<00:00, 23.60it/s, accuracy=0, cost=0] \n",
- "train minibatch loop: 0%| | 1/2297 [00:00<04:31, 8.47it/s, accuracy=0.469, cost=0.499]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 298.59440088272095\n",
- "epoch: 1, training loss: 0.500313, training acc: 0.499691, valid loss: 0.499942, valid acc: 0.498360\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [04:34<00:00, 8.37it/s, accuracy=0.4, cost=0.497] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:24<00:00, 23.57it/s, accuracy=0, cost=0] \n",
- "train minibatch loop: 0%| | 1/2297 [00:00<04:27, 8.58it/s, accuracy=0.469, cost=0.501]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 298.7852301597595\n",
- "epoch: 2, training loss: 0.500308, training acc: 0.499620, valid loss: 0.499977, valid acc: 0.498360\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [04:34<00:00, 8.38it/s, accuracy=0.4, cost=0.469] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:24<00:00, 23.67it/s, accuracy=0, cost=0] \n",
- "train minibatch loop: 0%| | 1/2297 [00:00<04:42, 8.13it/s, accuracy=0.469, cost=0.5]"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 298.5429759025574\n",
- "epoch: 3, training loss: 0.500356, training acc: 0.499620, valid loss: 0.499817, valid acc: 0.498360\n",
- "\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "train minibatch loop: 100%|██████████| 2297/2297 [04:34<00:00, 8.38it/s, accuracy=0.4, cost=0.476] \n",
- "test minibatch loop: 100%|██████████| 575/575 [00:24<00:00, 23.52it/s, accuracy=0, cost=0] \n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "time taken: 298.66775345802307\n",
- "epoch: 4, training loss: 0.500305, training acc: 0.499620, valid loss: 0.500060, valid acc: 0.498360\n",
- "\n"
- ]
- }
- ],
- "source": [
- "from tqdm import tqdm\n",
- "import time\n",
- "\n",
- "for EPOCH in range(5):\n",
- " lasttime = time.time()\n",
- " \n",
- " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
- " pbar = tqdm(range(0, len(train_X_left), batch_size), desc='train minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = train_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = train_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = train_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " assert not np.isnan(loss)\n",
- " train_loss += loss\n",
- " train_acc += acc\n",
- " pbar.set_postfix(cost = loss, accuracy = acc)\n",
- " \n",
- " pbar = tqdm(range(0, len(test_X_left), batch_size), desc='test minibatch loop')\n",
- " for i in pbar:\n",
- " batch_x_left = test_X_left[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_x_right = test_X_right[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " batch_y = test_Y[i:min(i+batch_size,train_X_left.shape[0])]\n",
- " acc, loss = sess.run([model.accuracy, model.cost], \n",
- " feed_dict = {model.X_left : batch_x_left, \n",
- " model.X_right: batch_x_right,\n",
- " model.Y : batch_y})\n",
- " test_loss += loss\n",
- " test_acc += acc\n",
- " pbar.set_postfix(cost = loss, accuracy = acc)\n",
- " \n",
- " train_loss /= (len(train_X_left) / batch_size)\n",
- " train_acc /= (len(train_X_left) / batch_size)\n",
- " test_loss /= (len(test_X_left) / batch_size)\n",
- " test_acc /= (len(test_X_left) / batch_size)\n",
- " \n",
- " print('time taken:', time.time()-lasttime)\n",
- " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
- " train_acc,test_loss,\n",
- " test_acc))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([1.], dtype=float32), array([0.9535016], dtype=float32)]"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['a person is outdoors, on a horse.'], dictionary, maxlen)\n",
- "right = str_idx(['a person on a horse jumps over a broken down airplane.'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[array([1.], dtype=float32), array([0.9941587], dtype=float32)]"
- ]
- },
- "execution_count": 12,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "left = str_idx(['i love you'], dictionary, maxlen)\n",
- "right = str_idx(['you love i'], dictionary, maxlen)\n",
- "sess.run([model.temp_sim,1-model.distance], feed_dict = {model.X_left : left, \n",
- " model.X_right: right})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.5.2"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/text-similarity/5.transformer-crossentropy.ipynb b/text-similarity/5.transformer-crossentropy.ipynb
new file mode 100644
index 0000000..4a093e1
--- /dev/null
+++ b/text-similarity/5.transformer-crossentropy.ipynb
@@ -0,0 +1,898 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/home/jupyter/.local/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
+ " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import tensorflow as tf\n",
+ "import re\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from tqdm import tqdm\n",
+ "import collections\n",
+ "from unidecode import unidecode\n",
+ "from sklearn.cross_validation import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_dataset(words, n_words):\n",
+ " count = [['PAD', 0], ['GO', 1], ['EOS', 2], ['UNK', 3], ['SEPARATOR', 4]]\n",
+ " count.extend(collections.Counter(words).most_common(n_words - 1))\n",
+ " dictionary = dict()\n",
+ " for word, _ in count:\n",
+ " dictionary[word] = len(dictionary)\n",
+ " data = list()\n",
+ " unk_count = 0\n",
+ " for word in words:\n",
+ " index = dictionary.get(word, 0)\n",
+ " if index == 0:\n",
+ " unk_count += 1\n",
+ " data.append(index)\n",
+ " count[0][1] = unk_count\n",
+ " reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n",
+ " return data, count, dictionary, reversed_dictionary\n",
+ "\n",
+ "def str_idx(corpus, dic, maxlen, UNK=3):\n",
+ " X = np.zeros((len(corpus),maxlen))\n",
+ " for i in range(len(corpus)):\n",
+ " for no, k in enumerate(corpus[i][:maxlen][::-1]):\n",
+ " val = dic[k] if k in dic else UNK\n",
+ " X[i,-1 - no]= val\n",
+ " return X\n",
+ "\n",
+ "def cleaning(string):\n",
+ " string = unidecode(string).replace('.', ' . ').replace(',', ' , ')\n",
+ " string = re.sub('[^A-Za-z\\- ]+', ' ', string)\n",
+ " string = re.sub(r'[ ]+', ' ', string).strip()\n",
+ " return string.lower()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " qid1 | \n",
+ " qid2 | \n",
+ " question1 | \n",
+ " question2 | \n",
+ " is_duplicate | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 0 | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 1 | \n",
+ " 3 | \n",
+ " 4 | \n",
+ " What is the story of Kohinoor (Koh-i-Noor) Dia... | \n",
+ " What would happen if the Indian government sto... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 5 | \n",
+ " 6 | \n",
+ " How can I increase the speed of my internet co... | \n",
+ " How can Internet speed be increased by hacking... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 3 | \n",
+ " 7 | \n",
+ " 8 | \n",
+ " Why am I mentally very lonely? How can I solve... | \n",
+ " Find the remainder when [math]23^{24}[/math] i... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 4 | \n",
+ " 9 | \n",
+ " 10 | \n",
+ " Which one dissolve in water quikly sugar, salt... | \n",
+ " Which fish would survive in salt water? | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id qid1 qid2 question1 \\\n",
+ "0 0 1 2 What is the step by step guide to invest in sh... \n",
+ "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
+ "2 2 5 6 How can I increase the speed of my internet co... \n",
+ "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
+ "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
+ "\n",
+ " question2 is_duplicate \n",
+ "0 What is the step by step guide to invest in sh... 0 \n",
+ "1 What would happen if the Indian government sto... 0 \n",
+ "2 How can Internet speed be increased by hacking... 0 \n",
+ "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
+ "4 Which fish would survive in salt water? 0 "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('quora_duplicate_questions.tsv', delimiter='\\t').dropna()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "left, right, label = df['question1'].tolist(), df['question2'].tolist(), df['is_duplicate'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(array([0, 1]), array([255024, 149263]))"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "np.unique(label, return_counts = True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 404287/404287 [00:07<00:00, 52786.23it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "for i in tqdm(range(len(left))):\n",
+ " left[i] = cleaning(left[i])\n",
+ " right[i] = cleaning(right[i])\n",
+ " left[i] = left[i] + ' SEPARATOR ' + right[i]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "vocab from size: 87662\n",
+ "Most common words [['SEPARATOR', 4], ('SEPARATOR', 404287), ('the', 377593), ('what', 324635), ('is', 269934), ('i', 223893)]\n",
+ "Sample data [6, 7, 5, 1286, 63, 1286, 2502, 11, 565, 12] ['what', 'is', 'the', 'step', 'by', 'step', 'guide', 'to', 'invest', 'in']\n"
+ ]
+ }
+ ],
+ "source": [
+ "concat = ' '.join(left).split()\n",
+ "vocabulary_size = len(list(set(concat)))\n",
+ "data, count, dictionary, rev_dictionary = build_dataset(concat, vocabulary_size)\n",
+ "print('vocab from size: %d'%(vocabulary_size))\n",
+ "print('Most common words', count[4:10])\n",
+ "print('Sample data', data[:10], [rev_dictionary[i] for i in data[:10]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def position_encoding(inputs):\n",
+ " T = tf.shape(inputs)[1]\n",
+ " repr_dim = inputs.get_shape()[-1].value\n",
+ " pos = tf.reshape(tf.range(0.0, tf.to_float(T), dtype=tf.float32), [-1, 1])\n",
+ " i = np.arange(0, repr_dim, 2, np.float32)\n",
+ " denom = np.reshape(np.power(10000.0, i / repr_dim), [1, -1])\n",
+ " enc = tf.expand_dims(tf.concat([tf.sin(pos / denom), tf.cos(pos / denom)], 1), 0)\n",
+ " return tf.tile(enc, [tf.shape(inputs)[0], 1, 1])\n",
+ "\n",
+ "def layer_norm(inputs, epsilon=1e-8):\n",
+ " mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)\n",
+ " normalized = (inputs - mean) / (tf.sqrt(variance + epsilon))\n",
+ " params_shape = inputs.get_shape()[-1:]\n",
+ " gamma = tf.get_variable('gamma', params_shape, tf.float32, tf.ones_initializer())\n",
+ " beta = tf.get_variable('beta', params_shape, tf.float32, tf.zeros_initializer())\n",
+ " return gamma * normalized + beta\n",
+ "\n",
+ "def self_attention(inputs, is_training, num_units, num_heads = 8, activation=None):\n",
+ " T_q = T_k = tf.shape(inputs)[1]\n",
+ " Q_K_V = tf.layers.dense(inputs, 3*num_units, activation)\n",
+ " Q, K, V = tf.split(Q_K_V, 3, -1)\n",
+ " Q_ = tf.concat(tf.split(Q, num_heads, axis=2), 0)\n",
+ " K_ = tf.concat(tf.split(K, num_heads, axis=2), 0)\n",
+ " V_ = tf.concat(tf.split(V, num_heads, axis=2), 0)\n",
+ " align = tf.matmul(Q_, K_, transpose_b=True)\n",
+ " align *= tf.rsqrt(tf.to_float(K_.get_shape()[-1].value))\n",
+ " paddings = tf.fill(tf.shape(align), float('-inf'))\n",
+ " lower_tri = tf.ones([T_q, T_k])\n",
+ " lower_tri = tf.linalg.LinearOperatorLowerTriangular(lower_tri).to_dense()\n",
+ " masks = tf.tile(tf.expand_dims(lower_tri,0), [tf.shape(align)[0],1,1])\n",
+ " align = tf.where(tf.equal(masks, 0), paddings, align)\n",
+ " align = tf.nn.softmax(align)\n",
+ " align = tf.layers.dropout(align, 0.1, training=is_training) \n",
+ " x = tf.matmul(align, V_)\n",
+ " x = tf.concat(tf.split(x, num_heads, axis=0), 2)\n",
+ " x += inputs\n",
+ " x = layer_norm(x)\n",
+ " return x\n",
+ "\n",
+ "def ffn(inputs, hidden_dim, activation=tf.nn.relu):\n",
+ " x = tf.layers.conv1d(inputs, 4* hidden_dim, 1, activation=activation) \n",
+ " x = tf.layers.conv1d(x, hidden_dim, 1, activation=None)\n",
+ " x += inputs\n",
+ " x = layer_norm(x)\n",
+ " return x\n",
+ "\n",
+ "class Model:\n",
+ " def __init__(self, size_layer, num_layers, embedded_size,\n",
+ " dict_size, learning_rate, dropout, kernel_size = 5):\n",
+ " \n",
+ " def cnn(x, scope):\n",
+ " x += position_encoding(x)\n",
+ " with tf.variable_scope(scope, reuse = tf.AUTO_REUSE):\n",
+ " for n in range(num_layers):\n",
+ " with tf.variable_scope('attn_%d'%i,reuse=tf.AUTO_REUSE):\n",
+ " x = self_attention(x, True, size_layer)\n",
+ " with tf.variable_scope('ffn_%d'%i, reuse=tf.AUTO_REUSE):\n",
+ " x = ffn(x, size_layer)\n",
+ " \n",
+ " with tf.variable_scope('logits', reuse=tf.AUTO_REUSE):\n",
+ " return tf.layers.dense(x, 2)[:, -1]\n",
+ " \n",
+ " self.X = tf.placeholder(tf.int32, [None, None])\n",
+ " self.Y = tf.placeholder(tf.int32, [None])\n",
+ " encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1))\n",
+ " embedded_left = tf.nn.embedding_lookup(encoder_embeddings, self.X)\n",
+ " \n",
+ " self.logits = cnn(embedded_left, 'left')\n",
+ " self.cost = tf.reduce_mean(\n",
+ " tf.nn.sparse_softmax_cross_entropy_with_logits(\n",
+ " logits = self.logits, labels = self.Y\n",
+ " )\n",
+ " )\n",
+ " \n",
+ " self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)\n",
+ " correct_pred = tf.equal(\n",
+ " tf.argmax(self.logits, 1, output_type = tf.int32), self.Y\n",
+ " )\n",
+ " self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "size_layer = 128\n",
+ "num_layers = 4\n",
+ "embedded_size = 128\n",
+ "learning_rate = 1e-4\n",
+ "maxlen = 50\n",
+ "batch_size = 128\n",
+ "dropout = 0.8"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.cross_validation import train_test_split\n",
+ "\n",
+ "vectors = str_idx(left, dictionary, maxlen)\n",
+ "train_X, test_X, train_Y, test_Y = train_test_split(vectors, label, test_size = 0.2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Colocations handled automatically by placer.\n",
+ "WARNING:tensorflow:From :4: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n",
+ "WARNING:tensorflow:From :20: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.dense instead.\n",
+ "WARNING:tensorflow:From :33: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.dropout instead.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n",
+ "WARNING:tensorflow:From :41: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.conv1d instead.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n"
+ ]
+ }
+ ],
+ "source": [
+ "tf.reset_default_graph()\n",
+ "sess = tf.InteractiveSession()\n",
+ "model = Model(size_layer,num_layers,embedded_size,len(dictionary),learning_rate,dropout)\n",
+ "sess.run(tf.global_variables_initializer())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.20it/s, accuracy=0.663, cost=0.652]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 110.07it/s, accuracy=0.644, cost=0.674]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.61it/s, accuracy=0.648, cost=0.617]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.000000, current acc: 0.654326\n",
+ "time taken: 60.44020199775696\n",
+ "epoch: 0, training loss: 0.639404, training acc: 0.640978, valid loss: 0.628099, valid acc: 0.654326\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.62it/s, accuracy=0.663, cost=0.619]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.44it/s, accuracy=0.622, cost=0.669]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 47.00it/s, accuracy=0.68, cost=0.62] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.654326, current acc: 0.667128\n",
+ "time taken: 59.827545404434204\n",
+ "epoch: 0, training loss: 0.621935, training acc: 0.659585, valid loss: 0.614735, valid acc: 0.667128\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.69it/s, accuracy=0.683, cost=0.577]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.01it/s, accuracy=0.6, cost=0.683] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.61it/s, accuracy=0.68, cost=0.621] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.667128, current acc: 0.672164\n",
+ "time taken: 59.77066659927368\n",
+ "epoch: 0, training loss: 0.610259, training acc: 0.670584, valid loss: 0.608394, valid acc: 0.672164\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.65it/s, accuracy=0.713, cost=0.564]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 111.70it/s, accuracy=0.656, cost=0.666]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 46.84it/s, accuracy=0.711, cost=0.604]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.672164, current acc: 0.679227\n",
+ "time taken: 59.83059549331665\n",
+ "epoch: 0, training loss: 0.601291, training acc: 0.679090, valid loss: 0.602495, valid acc: 0.679227\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.56it/s, accuracy=0.703, cost=0.556]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.42it/s, accuracy=0.6, cost=0.659] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 46.75it/s, accuracy=0.695, cost=0.601]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.679227, current acc: 0.685867\n",
+ "time taken: 59.903602838516235\n",
+ "epoch: 0, training loss: 0.592938, training acc: 0.687245, valid loss: 0.597082, valid acc: 0.685867\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 46.87it/s, accuracy=0.743, cost=0.548]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 111.96it/s, accuracy=0.633, cost=0.672]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 46.98it/s, accuracy=0.695, cost=0.585]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.685867, current acc: 0.688751\n",
+ "time taken: 59.562599897384644\n",
+ "epoch: 0, training loss: 0.585165, training acc: 0.693349, valid loss: 0.592944, valid acc: 0.688751\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 46.82it/s, accuracy=0.752, cost=0.529]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.44it/s, accuracy=0.622, cost=0.704]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 46.92it/s, accuracy=0.719, cost=0.585]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.688751, current acc: 0.692926\n",
+ "time taken: 59.60137748718262\n",
+ "epoch: 0, training loss: 0.577756, training acc: 0.700359, valid loss: 0.590633, valid acc: 0.692926\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:54<00:00, 46.72it/s, accuracy=0.733, cost=0.524]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.66it/s, accuracy=0.622, cost=0.695]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 46.71it/s, accuracy=0.719, cost=0.597]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.692926, current acc: 0.694126\n",
+ "time taken: 59.701225996017456\n",
+ "epoch: 0, training loss: 0.570621, training acc: 0.705953, valid loss: 0.587987, valid acc: 0.694126\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.07it/s, accuracy=0.743, cost=0.517]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.60it/s, accuracy=0.667, cost=0.664]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 47.16it/s, accuracy=0.75, cost=0.59] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.694126, current acc: 0.697845\n",
+ "time taken: 59.29985284805298\n",
+ "epoch: 0, training loss: 0.563849, training acc: 0.711581, valid loss: 0.585073, valid acc: 0.697845\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 46.92it/s, accuracy=0.752, cost=0.49] \n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.64it/s, accuracy=0.689, cost=0.684]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 47.25it/s, accuracy=0.734, cost=0.591]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.697845, current acc: 0.699698\n",
+ "time taken: 59.466017723083496\n",
+ "epoch: 0, training loss: 0.557104, training acc: 0.716393, valid loss: 0.583814, valid acc: 0.699698\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 46.91it/s, accuracy=0.733, cost=0.527]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.03it/s, accuracy=0.644, cost=0.68] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.28it/s, accuracy=0.75, cost=0.56] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.699698, current acc: 0.700679\n",
+ "time taken: 59.46453809738159\n",
+ "epoch: 0, training loss: 0.551015, training acc: 0.721082, valid loss: 0.580544, valid acc: 0.700679\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.04it/s, accuracy=0.762, cost=0.522]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.48it/s, accuracy=0.678, cost=0.651]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 47.44it/s, accuracy=0.758, cost=0.556]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.700679, current acc: 0.702092\n",
+ "time taken: 59.29327607154846\n",
+ "epoch: 0, training loss: 0.545043, training acc: 0.725462, valid loss: 0.581033, valid acc: 0.702092\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.21it/s, accuracy=0.762, cost=0.516]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.11it/s, accuracy=0.7, cost=0.654] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.67it/s, accuracy=0.727, cost=0.55] "
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.702092, current acc: 0.702943\n",
+ "time taken: 59.11387062072754\n",
+ "epoch: 0, training loss: 0.539628, training acc: 0.729723, valid loss: 0.581183, valid acc: 0.702943\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.15it/s, accuracy=0.762, cost=0.502]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.97it/s, accuracy=0.633, cost=0.693]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:52, 47.68it/s, accuracy=0.758, cost=0.545]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.702943, current acc: 0.705497\n",
+ "time taken: 59.19653916358948\n",
+ "epoch: 0, training loss: 0.533567, training acc: 0.734188, valid loss: 0.578577, valid acc: 0.705497\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.06it/s, accuracy=0.743, cost=0.483]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.83it/s, accuracy=0.644, cost=0.721]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 47.13it/s, accuracy=0.727, cost=0.544]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.705497, current acc: 0.709658\n",
+ "time taken: 59.30323553085327\n",
+ "epoch: 0, training loss: 0.528961, training acc: 0.737278, valid loss: 0.575870, valid acc: 0.709658\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.01it/s, accuracy=0.782, cost=0.481]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.54it/s, accuracy=0.7, cost=0.699] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:52, 47.92it/s, accuracy=0.805, cost=0.487]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 59.32865643501282\n",
+ "epoch: 0, training loss: 0.522808, training acc: 0.741622, valid loss: 0.579368, valid acc: 0.706827\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.29it/s, accuracy=0.733, cost=0.481]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.10it/s, accuracy=0.622, cost=0.675]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 47.33it/s, accuracy=0.789, cost=0.505]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 59.023605823516846\n",
+ "epoch: 0, training loss: 0.517364, training acc: 0.744728, valid loss: 0.578737, valid acc: 0.709103\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.16it/s, accuracy=0.792, cost=0.454]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.06it/s, accuracy=0.567, cost=0.64] \n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:52, 47.79it/s, accuracy=0.789, cost=0.486]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.709658, current acc: 0.711080\n",
+ "time taken: 59.17823839187622\n",
+ "epoch: 0, training loss: 0.512706, training acc: 0.748938, valid loss: 0.575415, valid acc: 0.711080\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.10it/s, accuracy=0.782, cost=0.43] \n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 112.75it/s, accuracy=0.656, cost=0.655]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:54, 46.70it/s, accuracy=0.766, cost=0.531]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 59.26551961898804\n",
+ "epoch: 0, training loss: 0.507218, training acc: 0.751649, valid loss: 0.579230, valid acc: 0.709997\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.01it/s, accuracy=0.832, cost=0.41] \n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.18it/s, accuracy=0.622, cost=0.669]\n",
+ "train minibatch loop: 0%| | 5/2527 [00:00<00:53, 47.25it/s, accuracy=0.734, cost=0.526]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 59.346855878829956\n",
+ "epoch: 0, training loss: 0.502882, training acc: 0.755138, valid loss: 0.583503, valid acc: 0.707928\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 2527/2527 [00:53<00:00, 47.27it/s, accuracy=0.802, cost=0.441]\n",
+ "test minibatch loop: 100%|██████████| 632/632 [00:05<00:00, 113.33it/s, accuracy=0.622, cost=0.659]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 59.0352988243103\n",
+ "epoch: 0, training loss: 0.498010, training acc: 0.757788, valid loss: 0.579649, valid acc: 0.709758\n",
+ "\n",
+ "break epoch:0\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "\n",
+ "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 3, 0, 0, 0\n",
+ "\n",
+ "while True:\n",
+ " lasttime = time.time()\n",
+ " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n",
+ " print('break epoch:%d\\n' % (EPOCH))\n",
+ " break\n",
+ "\n",
+ " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
+ " pbar = tqdm(range(0, len(train_X), batch_size), desc='train minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x = train_X[i:min(i+batch_size,train_X.shape[0])]\n",
+ " batch_y = train_Y[i:min(i+batch_size,train_X.shape[0])]\n",
+ " acc, loss, _ = sess.run([model.accuracy, model.cost, model.optimizer], \n",
+ " feed_dict = {model.X : batch_x,\n",
+ " model.Y : batch_y})\n",
+ " assert not np.isnan(loss)\n",
+ " train_loss += loss\n",
+ " train_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " pbar = tqdm(range(0, len(test_X), batch_size), desc='test minibatch loop')\n",
+ " for i in pbar:\n",
+ " batch_x = test_X[i:min(i+batch_size,test_X.shape[0])]\n",
+ " batch_y = test_Y[i:min(i+batch_size,test_X.shape[0])]\n",
+ " acc, loss = sess.run([model.accuracy, model.cost], \n",
+ " feed_dict = {model.X : batch_x,\n",
+ " model.Y : batch_y})\n",
+ " test_loss += loss\n",
+ " test_acc += acc\n",
+ " pbar.set_postfix(cost=loss, accuracy = acc)\n",
+ " \n",
+ " train_loss /= (len(train_X) / batch_size)\n",
+ " train_acc /= (len(train_X) / batch_size)\n",
+ " test_loss /= (len(test_X) / batch_size)\n",
+ " test_acc /= (len(test_X) / batch_size)\n",
+ " \n",
+ " if test_acc > CURRENT_ACC:\n",
+ " print(\n",
+ " 'epoch: %d, pass acc: %f, current acc: %f'\n",
+ " % (EPOCH, CURRENT_ACC, test_acc)\n",
+ " )\n",
+ " CURRENT_ACC = test_acc\n",
+ " CURRENT_CHECKPOINT = 0\n",
+ " else:\n",
+ " CURRENT_CHECKPOINT += 1\n",
+ " \n",
+ " print('time taken:', time.time()-lasttime)\n",
+ " print('epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'%(EPOCH,train_loss,\n",
+ " train_acc,test_loss,\n",
+ " test_acc))"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/text-similarity/6.bert.ipynb b/text-similarity/6.bert.ipynb
new file mode 100644
index 0000000..7032275
--- /dev/null
+++ b/text-similarity/6.bert.ipynb
@@ -0,0 +1,621 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv\n",
+ "# !pip3 install bert-tensorflow --user\n",
+ "# !wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip\n",
+ "# !unzip uncased_L-12_H-768_A-12.zip"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import bert\n",
+ "from bert import run_classifier\n",
+ "from bert import optimization\n",
+ "from bert import tokenization\n",
+ "from bert import modeling\n",
+ "import numpy as np\n",
+ "import tensorflow as tf\n",
+ "import pandas as pd\n",
+ "from tqdm import tqdm"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "BERT_VOCAB = 'uncased_L-12_H-768_A-12/vocab.txt'\n",
+ "BERT_INIT_CHKPNT = 'uncased_L-12_H-768_A-12/bert_model.ckpt'\n",
+ "BERT_CONFIG = 'uncased_L-12_H-768_A-12/bert_config.json'\n",
+ "\n",
+ "tokenization.validate_case_matches_checkpoint(True, '')\n",
+ "tokenizer = tokenization.FullTokenizer(\n",
+ " vocab_file=BERT_VOCAB, do_lower_case=True)\n",
+ "MAX_SEQ_LENGTH = 100"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " id | \n",
+ " qid1 | \n",
+ " qid2 | \n",
+ " question1 | \n",
+ " question2 | \n",
+ " is_duplicate | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 0 | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " What is the step by step guide to invest in sh... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 1 | \n",
+ " 3 | \n",
+ " 4 | \n",
+ " What is the story of Kohinoor (Koh-i-Noor) Dia... | \n",
+ " What would happen if the Indian government sto... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 5 | \n",
+ " 6 | \n",
+ " How can I increase the speed of my internet co... | \n",
+ " How can Internet speed be increased by hacking... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 3 | \n",
+ " 7 | \n",
+ " 8 | \n",
+ " Why am I mentally very lonely? How can I solve... | \n",
+ " Find the remainder when [math]23^{24}[/math] i... | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 4 | \n",
+ " 9 | \n",
+ " 10 | \n",
+ " Which one dissolve in water quikly sugar, salt... | \n",
+ " Which fish would survive in salt water? | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " id qid1 qid2 question1 \\\n",
+ "0 0 1 2 What is the step by step guide to invest in sh... \n",
+ "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n",
+ "2 2 5 6 How can I increase the speed of my internet co... \n",
+ "3 3 7 8 Why am I mentally very lonely? How can I solve... \n",
+ "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n",
+ "\n",
+ " question2 is_duplicate \n",
+ "0 What is the step by step guide to invest in sh... 0 \n",
+ "1 What would happen if the Indian government sto... 0 \n",
+ "2 How can Internet speed be increased by hacking... 0 \n",
+ "3 Find the remainder when [math]23^{24}[/math] i... 0 \n",
+ "4 Which fish would survive in salt water? 0 "
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('quora_duplicate_questions.tsv', delimiter='\\t').dropna()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "left, right, label = df['question1'].tolist(), df['question2'].tolist(), df['is_duplicate'].tolist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 404287/404287 [02:58<00:00, 2262.11it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "def _truncate_seq_pair(tokens_a, tokens_b, max_length):\n",
+ " while True:\n",
+ " total_length = len(tokens_a) + len(tokens_b)\n",
+ " if total_length <= max_length:\n",
+ " break\n",
+ " if len(tokens_a) > len(tokens_b):\n",
+ " tokens_a.pop()\n",
+ " else:\n",
+ " tokens_b.pop()\n",
+ "\n",
+ "input_ids, input_masks, segment_ids = [], [], []\n",
+ "\n",
+ "for i in tqdm(range(len(left))):\n",
+ " tokens_a = tokenizer.tokenize(left[i])\n",
+ " tokens_b = tokenizer.tokenize(right[i])\n",
+ " _truncate_seq_pair(tokens_a, tokens_b, MAX_SEQ_LENGTH - 3)\n",
+ " \n",
+ " tokens = []\n",
+ " segment_id = []\n",
+ " tokens.append(\"[CLS]\")\n",
+ " segment_id.append(0)\n",
+ " for token in tokens_a:\n",
+ " tokens.append(token)\n",
+ " segment_id.append(0)\n",
+ " tokens.append(\"[SEP]\")\n",
+ " segment_id.append(0)\n",
+ " for token in tokens_b:\n",
+ " tokens.append(token)\n",
+ " segment_id.append(1)\n",
+ " tokens.append(\"[SEP]\")\n",
+ " segment_id.append(1)\n",
+ " input_id = tokenizer.convert_tokens_to_ids(tokens)\n",
+ " input_mask = [1] * len(input_id)\n",
+ " \n",
+ " while len(input_id) < MAX_SEQ_LENGTH:\n",
+ " input_id.append(0)\n",
+ " input_mask.append(0)\n",
+ " segment_id.append(0)\n",
+ " \n",
+ " input_ids.append(input_id)\n",
+ " input_masks.append(input_mask)\n",
+ " segment_ids.append(segment_id)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bert_config = modeling.BertConfig.from_json_file(BERT_CONFIG)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "epoch = 10\n",
+ "batch_size = 60\n",
+ "warmup_proportion = 0.1\n",
+ "num_train_steps = int(len(left) / batch_size * epoch)\n",
+ "num_warmup_steps = int(num_train_steps * warmup_proportion)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class Model:\n",
+ " def __init__(\n",
+ " self,\n",
+ " dimension_output,\n",
+ " learning_rate = 2e-5,\n",
+ " ):\n",
+ " self.X = tf.placeholder(tf.int32, [None, None])\n",
+ " self.segment_ids = tf.placeholder(tf.int32, [None, None])\n",
+ " self.input_masks = tf.placeholder(tf.int32, [None, None])\n",
+ " self.Y = tf.placeholder(tf.int32, [None])\n",
+ " \n",
+ " model = modeling.BertModel(\n",
+ " config=bert_config,\n",
+ " is_training=True,\n",
+ " input_ids=self.X,\n",
+ " input_mask=self.input_masks,\n",
+ " token_type_ids=self.segment_ids,\n",
+ " use_one_hot_embeddings=False)\n",
+ " \n",
+ " output_layer = model.get_pooled_output()\n",
+ " self.logits = tf.layers.dense(output_layer, dimension_output)\n",
+ " self.logits = tf.identity(self.logits, name = 'logits')\n",
+ " \n",
+ " self.cost = tf.reduce_mean(\n",
+ " tf.nn.sparse_softmax_cross_entropy_with_logits(\n",
+ " logits = self.logits, labels = self.Y\n",
+ " )\n",
+ " )\n",
+ " \n",
+ " self.optimizer = optimization.create_optimizer(self.cost, learning_rate, \n",
+ " num_train_steps, num_warmup_steps, False)\n",
+ " correct_pred = tf.equal(\n",
+ " tf.argmax(self.logits, 1, output_type = tf.int32), self.Y\n",
+ " )\n",
+ " self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Colocations handled automatically by placer.\n",
+ "\n",
+ "WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.\n",
+ "For more information, please see:\n",
+ " * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md\n",
+ " * https://github.com/tensorflow/addons\n",
+ "If you depend on functionality not listed there, please file an issue.\n",
+ "\n",
+ "WARNING:tensorflow:From /home/jupyter/.local/lib/python3.6/site-packages/bert/modeling.py:358: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n",
+ "WARNING:tensorflow:From /home/jupyter/.local/lib/python3.6/site-packages/bert/modeling.py:671: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use keras.layers.dense instead.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Deprecated in favor of operator or tf.math.divide.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.cast instead.\n",
+ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use standard file APIs to check for files with this prefix.\n",
+ "INFO:tensorflow:Restoring parameters from uncased_L-12_H-768_A-12/bert_model.ckpt\n"
+ ]
+ }
+ ],
+ "source": [
+ "dimension_output = 2\n",
+ "learning_rate = 1e-5\n",
+ "\n",
+ "tf.reset_default_graph()\n",
+ "sess = tf.InteractiveSession()\n",
+ "model = Model(\n",
+ " dimension_output,\n",
+ " learning_rate\n",
+ ")\n",
+ "\n",
+ "sess.run(tf.global_variables_initializer())\n",
+ "var_lists = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope = 'bert')\n",
+ "saver = tf.train.Saver(var_list = var_lists)\n",
+ "saver.restore(sess, BERT_INIT_CHKPNT)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "train_input_ids, test_input_ids, train_input_masks, test_input_masks, train_segment_ids, test_segment_ids, train_Y, test_Y = train_test_split(\n",
+ " input_ids, input_masks, segment_ids, label, test_size = 0.2\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 5391/5391 [35:52<00:00, 2.88it/s, accuracy=0.966, cost=0.205]\n",
+ "test minibatch loop: 100%|██████████| 1348/1348 [03:06<00:00, 7.23it/s, accuracy=0.868, cost=0.271]\n",
+ "train minibatch loop: 0%| | 0/5391 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 0, pass acc: 0.000000, current acc: 0.867448\n",
+ "time taken: 2338.765106678009\n",
+ "epoch: 0, training loss: 0.396254, training acc: 0.802392, valid loss: 0.298802, valid acc: 0.867448\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 5391/5391 [35:50<00:00, 2.88it/s, accuracy=0.931, cost=0.159] \n",
+ "test minibatch loop: 100%|██████████| 1348/1348 [03:05<00:00, 7.25it/s, accuracy=0.947, cost=0.193]\n",
+ "train minibatch loop: 0%| | 0/5391 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 1, pass acc: 0.867448, current acc: 0.888630\n",
+ "time taken: 2336.9921276569366\n",
+ "epoch: 1, training loss: 0.267519, training acc: 0.884203, valid loss: 0.259737, valid acc: 0.888630\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 5391/5391 [35:51<00:00, 2.88it/s, accuracy=0.966, cost=0.12] \n",
+ "test minibatch loop: 100%|██████████| 1348/1348 [03:05<00:00, 7.25it/s, accuracy=0.921, cost=0.249] \n",
+ "train minibatch loop: 0%| | 0/5391 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 2, pass acc: 0.888630, current acc: 0.894423\n",
+ "time taken: 2336.949964761734\n",
+ "epoch: 2, training loss: 0.202998, training acc: 0.917379, valid loss: 0.256634, valid acc: 0.894423\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 58%|█████▊ | 3149/5391 [20:56<14:54, 2.51it/s, accuracy=0.95, cost=0.0977] IOPub message rate exceeded.\n",
+ "The notebook server will temporarily stop sending output\n",
+ "to the client in order to avoid crashing it.\n",
+ "To change this limit, set the config variable\n",
+ "`--NotebookApp.iopub_msg_rate_limit`.\n",
+ "\n",
+ "Current values:\n",
+ "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+ "NotebookApp.rate_limit_window=3.0 (secs)\n",
+ "\n",
+ "train minibatch loop: 100%|██████████| 5391/5391 [35:50<00:00, 2.88it/s, accuracy=0.966, cost=0.059] \n",
+ "test minibatch loop: 100%|██████████| 1348/1348 [03:05<00:00, 7.26it/s, accuracy=0.947, cost=0.171] \n",
+ "train minibatch loop: 0%| | 0/5391 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 3, pass acc: 0.894423, current acc: 0.896446\n",
+ "time taken: 2336.6024100780487\n",
+ "epoch: 3, training loss: 0.157257, training acc: 0.938867, valid loss: 0.278056, valid acc: 0.896446\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 1%| | 61/5391 [00:24<35:26, 2.51it/s, accuracy=0.95, cost=0.144] IOPub message rate exceeded.\n",
+ "The notebook server will temporarily stop sending output\n",
+ "to the client in order to avoid crashing it.\n",
+ "To change this limit, set the config variable\n",
+ "`--NotebookApp.iopub_msg_rate_limit`.\n",
+ "\n",
+ "Current values:\n",
+ "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+ "NotebookApp.rate_limit_window=3.0 (secs)\n",
+ "\n",
+ "train minibatch loop: 71%|███████ | 3811/5391 [25:20<10:30, 2.51it/s, accuracy=0.883, cost=0.233] IOPub message rate exceeded.\n",
+ "The notebook server will temporarily stop sending output\n",
+ "to the client in order to avoid crashing it.\n",
+ "To change this limit, set the config variable\n",
+ "`--NotebookApp.iopub_msg_rate_limit`.\n",
+ "\n",
+ "Current values:\n",
+ "NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)\n",
+ "NotebookApp.rate_limit_window=3.0 (secs)\n",
+ "\n",
+ "test minibatch loop: 100%|██████████| 1348/1348 [03:05<00:00, 7.25it/s, accuracy=0.947, cost=0.164] \n",
+ "train minibatch loop: 0%| | 0/5391 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 4, pass acc: 0.896446, current acc: 0.898413\n",
+ "time taken: 2336.2252011299133\n",
+ "epoch: 4, training loss: 0.124672, training acc: 0.953999, valid loss: 0.313267, valid acc: 0.898413\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 100%|██████████| 5391/5391 [35:51<00:00, 2.88it/s, accuracy=1, cost=0.0268] \n",
+ "test minibatch loop: 100%|██████████| 1348/1348 [03:05<00:00, 7.25it/s, accuracy=0.947, cost=0.316] \n",
+ "train minibatch loop: 0%| | 0/5391 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "time taken: 2336.9758813381195\n",
+ "epoch: 5, training loss: 0.101208, training acc: 0.963977, valid loss: 0.333253, valid acc: 0.897881\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "train minibatch loop: 59%|█████▉ | 3189/5391 [21:12<14:38, 2.51it/s, accuracy=0.917, cost=0.217] "
+ ]
+ }
+ ],
+ "source": [
+ "from tqdm import tqdm\n",
+ "import time\n",
+ "\n",
+ "EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 3, 0, 0, 0\n",
+ "\n",
+ "while True:\n",
+ " lasttime = time.time()\n",
+ " if CURRENT_CHECKPOINT == EARLY_STOPPING:\n",
+ " print('break epoch:%d\\n' % (EPOCH))\n",
+ " break\n",
+ "\n",
+ " train_acc, train_loss, test_acc, test_loss = 0, 0, 0, 0\n",
+ " pbar = tqdm(\n",
+ " range(0, len(train_input_ids), batch_size), desc = 'train minibatch loop'\n",
+ " )\n",
+ " for i in pbar:\n",
+ " index = min(i + batch_size, len(train_input_ids))\n",
+ " batch_x = train_input_ids[i: index]\n",
+ " batch_masks = train_input_masks[i: index]\n",
+ " batch_segment = train_segment_ids[i: index]\n",
+ " batch_y = train_Y[i: index]\n",
+ " acc, cost, _ = sess.run(\n",
+ " [model.accuracy, model.cost, model.optimizer],\n",
+ " feed_dict = {\n",
+ " model.Y: batch_y,\n",
+ " model.X: batch_x,\n",
+ " model.segment_ids: batch_segment,\n",
+ " model.input_masks: batch_masks\n",
+ " },\n",
+ " )\n",
+ " assert not np.isnan(cost)\n",
+ " train_loss += cost\n",
+ " train_acc += acc\n",
+ " pbar.set_postfix(cost = cost, accuracy = acc)\n",
+ " \n",
+ " pbar = tqdm(range(0, len(test_input_ids), batch_size), desc = 'test minibatch loop')\n",
+ " for i in pbar:\n",
+ " index = min(i + batch_size, len(test_input_ids))\n",
+ " batch_x = test_input_ids[i: index]\n",
+ " batch_masks = test_input_masks[i: index]\n",
+ " batch_segment = test_segment_ids[i: index]\n",
+ " batch_y = test_Y[i: index]\n",
+ " acc, cost = sess.run(\n",
+ " [model.accuracy, model.cost],\n",
+ " feed_dict = {\n",
+ " model.Y: batch_y,\n",
+ " model.X: batch_x,\n",
+ " model.segment_ids: batch_segment,\n",
+ " model.input_masks: batch_masks\n",
+ " },\n",
+ " )\n",
+ " test_loss += cost\n",
+ " test_acc += acc\n",
+ " pbar.set_postfix(cost = cost, accuracy = acc)\n",
+ "\n",
+ " train_loss /= len(train_input_ids) / batch_size\n",
+ " train_acc /= len(train_input_ids) / batch_size\n",
+ " test_loss /= len(test_input_ids) / batch_size\n",
+ " test_acc /= len(test_input_ids) / batch_size\n",
+ "\n",
+ " if test_acc > CURRENT_ACC:\n",
+ " print(\n",
+ " 'epoch: %d, pass acc: %f, current acc: %f'\n",
+ " % (EPOCH, CURRENT_ACC, test_acc)\n",
+ " )\n",
+ " CURRENT_ACC = test_acc\n",
+ " CURRENT_CHECKPOINT = 0\n",
+ " else:\n",
+ " CURRENT_CHECKPOINT += 1\n",
+ " \n",
+ " print('time taken:', time.time() - lasttime)\n",
+ " print(\n",
+ " 'epoch: %d, training loss: %f, training acc: %f, valid loss: %f, valid acc: %f\\n'\n",
+ " % (EPOCH, train_loss, train_acc, test_loss, test_acc)\n",
+ " )\n",
+ " EPOCH += 1"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/text-similarity/README.md b/text-similarity/README.md
index ccdebb7..bc9cb66 100644
--- a/text-similarity/README.md
+++ b/text-similarity/README.md
@@ -1,13 +1,3 @@
## How-to
-1. Download char based dataset from [https://drive.google.com/open?id=1HnMv7ulfh8yuq9yIrt_IComGEpDrNyo-](https://drive.google.com/open?id=1HnMv7ulfh8yuq9yIrt_IComGEpDrNyo-).
-
-2. Download word based dataset from [https://drive.google.com/open?id=1itu7IreU_SyUSdmTWydniGxW-JEGTjrv](https://drive.google.com/open?id=1itu7IreU_SyUSdmTWydniGxW-JEGTjrv).
-
-3. Unzip in the same notebooks location.
-
-4. Run any notebook using Jupyter Notebook.
-
-## Improvement
-
-It is better if the embedded loaded from Glove, or any pretrained model.
+1. Run any notebook using Jupyter Notebook.