
Commit f66ea29
WIP: updates.
1 parent cb93e4b

main.md

Lines changed: 41 additions & 21 deletions
@@ -36,6 +36,8 @@ from sklearn.metrics.pairwise import pairwise_distances
 from sklearn.manifold.t_sne import (_joint_probabilities,
                                     _kl_divergence)
 from sklearn.utils.extmath import _ravel
+# Random state.
+RS = 20150101

 # We'll use matplotlib for graphics.
 import matplotlib.pyplot as plt
@@ -93,7 +95,7 @@ Now let's run the t-SNE algorithm on the dataset. It just takes one line with sc
 <pre data-code-language="python"
      data-executable="true"
      data-type="programlisting">
-digits_proj = TSNE().fit_transform(digits.data)
+digits_proj = TSNE(random_state=RS).fit_transform(digits.data)
 </pre>

 Here is a utility function used to display the transformed dataset. The color of each point refers to the actual digit (of course, this information was not used by the dimensionality reduction algorithm).
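The function itself lies outside this diff's hunks. For context, here is a minimal sketch of such a display function, assuming numpy (`np`), matplotlib (`plt`), and seaborn (`sns`) are imported as in the article's setup; the name `scatter` and its exact signature are illustrative, not the author's code:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
def scatter(x, colors):
    # One color per digit class (0 through 9).
    palette = np.array(sns.color_palette("hls", 10))
    f = plt.figure(figsize=(8, 8))
    ax = plt.subplot(aspect='equal')
    ax.scatter(x[:, 0], x[:, 1], s=40,
               c=palette[colors.astype(int)])
    ax.axis('off')
    return f, ax
</pre>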
@@ -154,50 +156,56 @@ How do we choose the positions of the map points? We want to conserve the struct

 <span class="math-tex" data-type="tex">\\(p_{j|i} = \frac{\exp\left(-\left| x_i - x_j\right|^2 \big/ 2\sigma_i^2\right)}{\displaystyle\sum_{k \neq i} \exp\left(-\left| x_i - x_k\right|^2 \big/ 2\sigma_i^2\right)}\\)</span>

-This measures how close <span class="math-tex" data-type="tex">\\(x_j\\)</span> is from <span class="math-tex" data-type="tex">\\(x_i\\)</span>, considering a Gaussian distribution around <span class="math-tex" data-type="tex">\\(x_i\\)</span> with a given variance <span class="math-tex" data-type="tex">\\(\sigma_i^2\\)</span>. This variance is different for every point; it is chosen such that points in dense areas are given a smaller variance than points in sparse areas.
+This measures how close <span class="math-tex" data-type="tex">\\(x_j\\)</span> is to <span class="math-tex" data-type="tex">\\(x_i\\)</span>, considering a **Gaussian distribution** around <span class="math-tex" data-type="tex">\\(x_i\\)</span> with a given variance <span class="math-tex" data-type="tex">\\(\sigma_i^2\\)</span>. This variance is different for every point; it is chosen such that points in dense areas are given a smaller variance than points in sparse areas. The original paper details exactly how this variance is computed.

 Now, we define the similarity as a symmetrized version of the conditional similarity:

 <span class="math-tex" data-type="tex">\\(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}\\)</span>

-We obtain a similarity matrix for our original dataset. What does this matrix look like?
+We obtain a **similarity matrix** for our original dataset. What does this matrix look like?

-We first reorder the data points according to the handwritten number.
+## Similarity matrix
+
+Before showing the matrix, let's reorder the data points according to the handwritten number.

 <pre data-code-language="python"
      data-executable="true"
      data-type="programlisting">
+# Dataset.
 X = np.vstack([digits.data[digits.target==i]
                for i in range(10)])
+# Target (digit).
 y = np.hstack([digits.target[digits.target==i]
                for i in range(10)])
 </pre>

-The following function computes the similarity with a constant sigma.
+The following function computes the similarity with a constant <span class="math-tex" data-type="tex">\\(\sigma\\)</span>.

 <pre data-code-language="python"
      data-executable="true"
      data-type="programlisting">
 def _joint_probabilities_constant_sigma(D, sigma):
-    P = np.exp(-D**2/2*sigma**2)
+    P = np.exp(-D**2/2 * sigma**2)
     P /= np.sum(P, axis=1)
     return P
 </pre>

-We now compute the similarity with a sigma depending on the data point (found via a binary search). This algorith is implemented in scikit-learn's `_joint_probabilities` function.
+We now compute the similarity with a <span class="math-tex" data-type="tex">\\(\sigma_i\\)</span> that depends on the data point (found via a binary search, as described in the original t-SNE paper). This algorithm is implemented in the private `_joint_probabilities` function in scikit-learn's code.

 <pre data-code-language="python"
      data-executable="true"
      data-type="programlisting">
 # Pairwise distances between all data points.
 D = pairwise_distances(X, squared=True)
+# Similarity with constant sigma.
 P_constant = _joint_probabilities_constant_sigma(D, .002)
+# Similarity with variable sigma.
 P_binary = _joint_probabilities(D, 30., False)
 # The output of this function needs to be reshaped to a square matrix.
 P_binary_s = squareform(P_binary)
 </pre>

-Let's display the distance matrix of the data points, and the similarity matrix with both a constant and variable sigma.
+We can now display the distance matrix of the data points, and the similarity matrix with both a constant and variable sigma.

 <pre data-code-language="python"
      data-executable="true"
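For reference, here is what that binary search can look like: for every point, search for the sigma_i whose conditional distribution p_{.|i} reaches a target perplexity (30 in the `_joint_probabilities` call above), where the perplexity is 2 raised to the Shannon entropy of that distribution, then apply the symmetrization formula. This is a hypothetical sketch: the helper names are made up, and scikit-learn's actual implementation differs in its details.

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
import numpy as np

def _conditional_p(D_row, sigma, i):
    # Conditional similarities p_{j|i} for one row of squared distances.
    P = np.exp(-D_row / (2 * sigma ** 2))
    P[i] = 0.  # a point is not its own neighbor
    return P / np.maximum(P.sum(), 1e-12)

def _binary_search_perplexity(D, target_perplexity=30., n_steps=50):
    # Hypothetical sketch: find, for every point, the sigma_i whose
    # conditional distribution has the requested perplexity.
    n = D.shape[0]
    P = np.zeros_like(D)
    for i in range(n):
        lo, hi = 1e-10, 1e10
        for _ in range(n_steps):
            sigma = (lo + hi) / 2.
            P_i = _conditional_p(D[i], sigma, i)
            entropy = -np.sum(P_i[P_i > 0] * np.log2(P_i[P_i > 0]))
            if 2 ** entropy > target_perplexity:
                hi = sigma  # distribution too flat: decrease sigma
            else:
                lo = sigma  # distribution too peaked: increase sigma
        P[i] = P_i
    # Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2 * N).
    return (P + P.T) / (2. * n)
</pre>

Larger values of sigma flatten the conditional distribution and increase the perplexity, which is why the search moves the upper bound down when the perplexity overshoots the target.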
@@ -213,15 +221,16 @@ plt.title("Distance matrix", fontdict={'fontsize': 16})
 plt.subplot(132)
 plt.imshow(P_constant[::10, ::10], interpolation='none', cmap=pal)
 plt.axis('off')
-plt.title("$p_{j|i}$ (constant sigma)", fontdict={'fontsize': 16})
+plt.title("$p_{j|i}$ (constant $\sigma$)", fontdict={'fontsize': 16})

 plt.subplot(133)
 plt.imshow(P_binary_s[::10, ::10], interpolation='none', cmap=pal)
 plt.axis('off')
-plt.title("$p_{j|i}$ (binary search sigma)", fontdict={'fontsize': 16});
+plt.title("$p_{j|i}$ (variable $\sigma$)", fontdict={'fontsize': 16})
+plt.savefig('images/similarity.png', dpi=120)
 </pre>

-We already observe the 10 groups in the data, corresponding to the 10 numbers.
+We can already observe the 10 groups in the data, corresponding to the 10 numbers.

 Let's also define a similarity matrix for our map points.

@@ -237,17 +246,22 @@ Let's assume that our map points are all connected with springs. The stiffness o

 The final mapping is obtained when the equilibrium is reached.

+Here is an illustration of a dynamic graph layout based on a similar idea. Nodes are connected via springs and the system evolves according to the laws of physics (example by [Mike Bostock](http://bl.ocks.org/mbostock/4062045)).
+
+<iframe src="http://bl.ocks.org/rossant/raw/f06184034ba66a0bd06a/"
+        style="border: 0; width: 620px; height: 440px; margin: 0; padding: 0;" />
+
 ## Algorithm

-Remarkably, this analogy stems exactly from a natural mathematical algorithm. It corresponds to minimizing the [Kullback-Leiber](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) divergence between the two distributions <span class="math-tex" data-type="tex">\\(\big(p_{ij}\big)\\)</span> and <span class="math-tex" data-type="tex">\\(\big(q_{ij}\big)\\)</span>:
+Remarkably, this physical analogy stems naturally from the mathematical algorithm. It corresponds to minimizing the [Kullback-Leibler](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) divergence between the two distributions <span class="math-tex" data-type="tex">\\(\big(p_{ij}\big)\\)</span> and <span class="math-tex" data-type="tex">\\(\big(q_{ij}\big)\\)</span>:

 <span class="math-tex" data-type="tex">\\(KL(P||Q) = \sum_{i, j} p_{ij} \, \log \frac{p_{ij}}{q_{ij}}.\\)</span>

 This measures the distance between our two similarity matrices.

 To minimize this score, we perform a gradient descent. The gradient can be computed analytically:

-<span class="math-tex" data-type="tex">\\(\frac{\partial \, K!L(P || Q)}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) g\left( \left| x_i - x_j\right| \right) u_{ij} \quad \textrm{where} \, g(z) = \frac{z}{1+z^2}.\\)</span>
+<span class="math-tex" data-type="tex">\\(\frac{\partial \, KL(P || Q)}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) g\left( \left| y_i - y_j\right| \right) u_{ij} \quad \textrm{where} \, g(z) = \frac{z}{1+z^2}.\\)</span>

 Here, <span class="math-tex" data-type="tex">\\(u_{ij}\\)</span> is a unit vector going from <span class="math-tex" data-type="tex">\\(y_j\\)</span> to <span class="math-tex" data-type="tex">\\(y_i\\)</span>. This gradient expresses the sum of all spring forces applied to map point <span class="math-tex" data-type="tex">\\(i\\)</span>.
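To connect these formulas to code, here is a hedged sketch of the objective and its gradient, written directly from the two expressions above; `P` and `Q` are the square similarity matrices and `Y` holds the map point coordinates. This is an illustration only, not scikit-learn's optimized `_kl_divergence`:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
import numpy as np

def kl(P, Q):
    # KL(P||Q) = sum_ij p_ij * log(p_ij / q_ij), skipping zero entries.
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def kl_gradient(P, Q, Y):
    # dKL/dy_i = 4 * sum_j (p_ij - q_ij) * g(|y_i - y_j|) * u_ij,
    # where g(z) = z / (1 + z**2); since u_ij is the unit vector from
    # y_j to y_i, g(z) * u_ij simplifies to (y_i - y_j) / (1 + z**2).
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        diff = Y[i] - Y                    # y_i - y_j for all j
        sq_dist = (diff ** 2).sum(axis=1)  # |y_i - y_j|**2
        grad[i] = 4. * ((P[i] - Q[i]) / (1. + sq_dist)) @ diff
    return grad
</pre>

Each term of the sum behaves as one spring force: the sign of p_ij − q_ij determines whether the force is attractive or repulsive.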

@@ -271,7 +285,7 @@ def _gradient_descent(objective, p0, it, n_iter, n_iter_without_progress=30,
     best_iter = 0

     for i in range(it, n_iter):
-        # We append the current position.
+        # We save the current position.
         positions.append(p.copy())

         new_error, grad = objective(p, *args)
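This instrumented copy only records anything if `TSNE` actually calls it. Presumably the surrounding notebook swaps it in by monkey-patching scikit-learn's private module, along these lines (a private API, so the exact module path may vary across versions):

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
from sklearn.manifold import t_sne

# Monkey-patch the internal optimizer so that TSNE records every
# intermediate position in the `positions` list defined alongside it.
t_sne._gradient_descent = _gradient_descent
</pre>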
@@ -307,7 +321,7 @@ Let's run the algorithm again, but this time saving all intermediate positions.
 <pre data-code-language="python"
      data-executable="true"
      data-type="programlisting">
-X_proj = TSNE().fit_transform(X)
+X_proj = TSNE(random_state=RS).fit_transform(X)
 </pre>

 <pre data-code-language="python"
@@ -336,12 +350,12 @@ def make_frame_mpl(t):

 animation = mpy.VideoClip(make_frame_mpl,
                           duration=X_iter.shape[2]/40.)
-animation.write_gif("tsne.gif", fps=20)
+animation.write_gif("images/animation.gif", fps=20)
 </pre>

-<img src="tsne.gif" />
+<img src="images/animation.gif" />

-We can observe the different phases of the optimization. The details of the algorithm can be found in the original paper.
+We can clearly observe the different phases of the optimization, as described in the original paper.

 Let's also create an animation of the similarity matrix of the map points. We'll observe that it's getting closer and closer to the similarity matrix of the data points.

@@ -368,10 +382,10 @@ def make_frame_mpl(t):

 animation = mpy.VideoClip(make_frame_mpl,
                           duration=X_iter.shape[2]/40.)
-animation.write_gif("tsne_matrix.gif", fps=20)
+animation.write_gif("images/animation_matrix.gif", fps=20)
 </pre>

-<img src="tsne_matrix.gif" />
+<img src="images/animation_matrix.gif" />

 ## The t-Student distribution

@@ -401,8 +415,11 @@ for i, D in enumerate((2, 5, 10)):
     ax.hist(norm(points, axis=1),
             bins=np.linspace(0., 1., 50))
     ax.set_title('D=%d' % D, loc='left')
+plt.savefig('images/spheres.png', dpi=100)
 </pre>

+![Spheres](images/spheres.png)
+
 When reducing the dimensionality of a dataset, if we used the same Gaussian distribution for the data points and the map points, we could get an _imbalance_ among the neighbors of a given point. This imbalance would lead to an excess of attraction forces and a sometimes unappealing mapping. This is actually what happens in the original SNE algorithm, by Hinton and Roweis (2002).

 The t-SNE algorithm works around this problem by using a t-Student distribution with one degree of freedom (also known as the Cauchy distribution) for the map points. This distribution has a much heavier tail than the Gaussian distribution, which _compensates for_ the original imbalance. For a given data similarity between two data points, the two corresponding map points will need to be much further apart in order for their similarity to match the data similarity. This can be seen in the following plot.
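Concretely, the map similarities q_ij follow the Cauchy kernel, normalized over all pairs of map points. This short sketch (an illustration, not scikit-learn's internal implementation) makes the heavier tail explicit:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
import numpy as np

def q_matrix(Y):
    # Map-point similarities with the t-Student (Cauchy) kernel:
    # q_ij is proportional to 1 / (1 + |y_i - y_j|**2).
    sq_dist = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    Q = 1. / (1. + sq_dist)
    np.fill_diagonal(Q, 0.)  # q_ii is defined as zero
    return Q / Q.sum()
</pre>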
@@ -415,9 +432,12 @@ gauss = np.exp(-z**2)
 cauchy = 1/(1+z**2)
 plt.plot(z, gauss, label='Gaussian distribution')
 plt.plot(z, cauchy, label='Cauchy distribution')
-plt.legend();
+plt.legend()
+plt.savefig('images/distributions.png', dpi=120)
 </pre>

+![Gaussian and Cauchy distributions](images/distributions.png)
+
 ## Conclusion

 The t-SNE algorithm provides an effective method to visualize a complex dataset. It successfully uncovers hidden structures in the data, exposing natural clusters and smooth nonlinear variations along the dimensions. It has been implemented in many languages, including Python, and it can be easily used thanks to the scikit-learn library.
