Commit 6dadb19

WIP: use seaborn.

1 parent 63c6a41

1 file changed: main.md (40 additions, 77 deletions)
@@ -6,27 +6,15 @@ Computers have no problem processing that many dimensions. However, we humans ar

How can we possibly reduce the dimensionality of a dataset from an arbitrary number to two or three, which is what we're doing when we visualize data on a screen?

-The answer lies in the observation that many real-world datasets have a low intrinsic dimensionality, even though they're embedded in a high-dimensional space. Imagine that you're shooting a panoramic landscape with your camera, while rotating around yourself. We can consider every picture as a point in a 16,000,000-dimensional space (assuming a 16 megapixels camera). Yet, the set of pictures approximately lie on a three-dimensional space (yaw, pitch, roll). This low-dimensional space is embedded in the high-dimensional space in a complex, nonlinear way. Hidden in the data, this structure can only be recovered with specific mathematical methods.
+The answer lies in the observation that many real-world datasets have a low intrinsic dimensionality, even though they're embedded in a high-dimensional space. Imagine that you're shooting a panoramic landscape with your camera while rotating around yourself. We can consider every picture as a point in a 16,000,000-dimensional space (assuming a 16-megapixel camera). Yet, the set of pictures lies approximately in a three-dimensional space (yaw, pitch, roll). This low-dimensional space is embedded within the high-dimensional space in a complex, nonlinear way. Hidden in the data, this structure can only be recovered via specific mathematical methods.

-This is the topic of manifold learning, also called nonlinear dimensionality reduction, a branch of machine learning (more specifically, _unsupervised learning_). It is still an active area of research today to develop algorithms that can automatically recover a hidden structure in a high-dimensional dataset.
+This is the topic of [**manifold learning**](http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction), also called **nonlinear dimensionality reduction**, a branch of machine learning (more specifically, _unsupervised learning_). It is still an active area of research to develop algorithms that can automatically recover the hidden structure of a high-dimensional dataset.

-This post is an introduction to a popular dimensonality reduction algorithm: **t-distributed stochastic neighbor embedding (t-SNE)**. Developed by Laurens van der Maaten and Geoffrey Hinton (now working at Google), this algorithm has been successfully applied to many real-world datasets. Here, we'll see the key concepts of the method, when applied to a toy dataset (handwritten digits). We'll use Python and the scikit-learn library.
+This post is an introduction to a popular dimensionality reduction algorithm: [**t-distributed stochastic neighbor embedding (t-SNE)**](http://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). Developed by [Laurens van der Maaten](http://lvdmaaten.github.io/) and [Geoffrey Hinton](http://www.cs.toronto.edu/~hinton/), this algorithm has been successfully applied to many real-world datasets. Here, we'll see the key concepts of the method, applied to a toy dataset (handwritten digits). We'll use Python and the [scikit-learn](http://scikit-learn.org/stable/index.html) library.

## Visualizing handwritten digits.

-TODO
-(detail the dataset, nsamples, ndimensions)
-(final output of tSNE)
-
-<pre data-code-language="python"
-     data-executable="true"
-     data-type="programlisting">
-#TODO
-</pre>
-
-
-
-
+Let's first import a handful of libraries.

<pre data-code-language="python"
     data-executable="true"
@@ -49,28 +37,39 @@ from sklearn.manifold.t_sne import (_joint_probabilities,
                                    _kl_divergence)
from sklearn.utils.extmath import _ravel

-# matplotlib for graphics.
+# We'll use matplotlib for graphics.
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects
import matplotlib
%matplotlib inline

-# We import seaborn for improve aesthetics.
+# We import seaborn to make nice plots.
import seaborn as sns
sns.set_style('darkgrid')
-sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
+sns.set_palette('muted')
+sns.set_context("notebook", font_scale=1.5,
+                rc={"lines.linewidth": 2.5})

# We'll generate an animation with matplotlib and moviepy.
from moviepy.video.io.bindings import mplfig_to_npimage
import moviepy.editor as mpy
</pre>

-Illustration on digit dataset.
+

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
digits = load_digits()
+</pre>
+
+TODO
+(detail the dataset, nsamples, ndimensions)
+(final output of tSNE)
+
+<pre data-code-language="python"
+     data-executable="true"
+     data-type="programlisting">
tsne = TSNE()
digits_proj = tsne.fit_transform(digits.data)
</pre>
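An editorial aside on the `TODO` above: the digits dataset bundled with scikit-learn contains 1,797 samples, each an 8×8 grayscale image flattened into a 64-dimensional vector. A short check (not part of this commit) confirms the numbers the TODO asks for:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
from sklearn.datasets import load_digits

digits = load_digits()
# One row per sample, one column per pixel: 1,797 samples in 64 dimensions.
print(digits.data.shape)    # (1797, 64)
# The same samples as 8x8 images, plus the labels 0-9.
print(digits.images.shape)  # (1797, 8, 8)
print(digits.target.shape)  # (1797,)
</pre>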
@@ -79,10 +78,12 @@ digits_proj = tsne.fit_transform(digits.data)
     data-executable="true"
     data-type="programlisting">
def scatter(x, colors):
+    palette = np.array(sns.color_palette("hls", 10))
+
    f = plt.figure(figsize=(8, 8))
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(x[:,0], x[:,1], lw=0, s=40,
-                    c=colors.astype(np.int));
+                    c=palette[colors.astype(np.int)]);
    plt.xlim(-25, 25);
    plt.ylim(-25, 25);
    ax.axis('off');
@@ -105,9 +106,6 @@ def scatter(x, colors):
scatter(digits_proj, digits.target);
</pre>

-
-
-
## Mathematical framework

Let's explain how the algorithm works. First, a few definitions.
@@ -120,18 +118,14 @@ How do we choose the positions of the map points? We want to conserve the struct

<span class="math-tex" data-type="tex">\\(p_{j|i} = \frac{\exp\left(-\left| x_i - x_j\right|^2 \big/ 2\sigma_i^2\right)}{\displaystyle\sum_{k \neq i} \exp\left(-\left| x_i - x_k\right|^2 \big/ 2\sigma_i^2\right)}\\)</span>

-This measures how close $x_j$ is from $x_i$, considering a Gaussian distribution around $x_i$ with a given variance $\sigma_i^2$. This variance is different for every point; it is chosen such that points in dense areas are given a smaller variance than points in sparse areas.
+This measures how close <span class="math-tex" data-type="tex">\\(x_j\\)</span> is to <span class="math-tex" data-type="tex">\\(x_i\\)</span>, considering a Gaussian distribution around <span class="math-tex" data-type="tex">\\(x_i\\)</span> with a given variance <span class="math-tex" data-type="tex">\\(\sigma_i^2\\)</span>. This variance is different for every point; it is chosen such that points in dense areas are given a smaller variance than points in sparse areas.

Now, we define the similarity as a symmetrized version of the conditional similarity:

-$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$
+<span class="math-tex" data-type="tex">\\(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}\\)</span>

We obtain a similarity matrix for our original dataset.

-
-
-
-
<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
@@ -141,12 +135,6 @@ y = np.hstack([digits.target[digits.target==i]
               for i in range(10)])
</pre>
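To make the conditional similarity formula concrete, here is a minimal, illustrative sketch (an editorial addition, not the code used in the post, which relies on scikit-learn's internal `_joint_probabilities` helper imported earlier). It computes the conditional similarities with a single, constant σ for every point; the `conditional_similarities` name and the `sigma=30.` value are arbitrary choices for illustration, and `X` is the reordered digits array built just above:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
import numpy as np
from scipy.spatial.distance import pdist, squareform

def conditional_similarities(X, sigma=1.):
    """p_{j|i} with the same sigma for every point (illustration only)."""
    # Pairwise squared Euclidean distances |x_i - x_j|^2.
    D = squareform(pdist(X, 'sqeuclidean'))
    # Unnormalized Gaussian similarities; a point is not its own neighbor.
    P = np.exp(-D / (2. * sigma ** 2))
    np.fill_diagonal(P, 0.)
    # Normalize every row i so that the p_{j|i} sum to one.
    return P / P.sum(axis=1, keepdims=True)

P_cond = conditional_similarities(X, sigma=30.)
# Symmetrized similarities p_ij = (p_{j|i} + p_{i|j}) / (2N).
P_sym = (P_cond + P_cond.T) / (2. * len(X))
</pre>

In the real algorithm, each σ_i is chosen per point (by a binary search on the perplexity), which is exactly the difference between the "constant sigma" and "binary search sigma" matrices shown in the figure below.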

-<pre data-code-language="python"
-     data-executable="true"
-     data-type="programlisting">
-#X = scale(X)
-</pre>
-
<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
@@ -175,64 +163,54 @@ P_binary_s = squareform(P_binary)
     data-executable="true"
     data-type="programlisting">
plt.figure(figsize=(12, 4))
+pal = sns.light_palette("blue", as_cmap=True)

plt.subplot(131)
-plt.imshow(D, interpolation='none')
+plt.imshow(D[::10, ::10], interpolation='none', cmap=pal)
plt.axis('off')
plt.title("Distance matrix", fontdict={'fontsize': 16})

plt.subplot(132)
-plt.imshow(P_constant, interpolation='none')
+plt.imshow(P_constant[::10, ::10], interpolation='none', cmap=pal)
plt.axis('off')
plt.title("$p_{j|i}$ (constant sigma)", fontdict={'fontsize': 16})

plt.subplot(133)
-plt.imshow(P_binary_s, interpolation='none')
+plt.imshow(P_binary_s[::10, ::10], interpolation='none', cmap=pal)
plt.axis('off')
plt.title("$p_{j|i}$ (binary search sigma)", fontdict={'fontsize': 16});
</pre>

-
-
-
-
-
-
Let's also define a similarity matrix for our map points.

-$$q_{ij} = \frac{f(\left| x_i - x_j\right|)}{\displaystyle\sum_{k \neq i} f(\left| x_i - x_k\right|)} \quad \textrm{with} \, f(z) = \frac{1}{1+z^2}.$$
+<span class="math-tex" data-type="tex">\\(q_{ij} = \frac{f(\left| x_i - x_j\right|)}{\displaystyle\sum_{k \neq i} f(\left| x_i - x_k\right|)} \quad \textrm{with} \, f(z) = \frac{1}{1+z^2}.\\)</span>

This is the same idea as for the data points, but with a different distribution (t-Student, or Cauchy distribution, instead of a Gaussian distribution). We'll elaborate on this choice later.

-Whereas the data similarity matrix $\big(p_{ij}\big)$ is fixed, the map similarity matrix $\big(q_{ij}\big)$ depends on the map points. What we want is for these two matrices to be as close as possible. This would mean that similar data points yield similar map points.
+Whereas the data similarity matrix <span class="math-tex" data-type="tex">\\(\big(p_{ij}\big)\\)</span> is fixed, the map similarity matrix <span class="math-tex" data-type="tex">\\(\big(q_{ij}\big)\\)</span> depends on the map points. What we want is for these two matrices to be as close as possible. This would mean that similar data points yield similar map points.

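Mirroring the earlier sketch, the map similarities defined by the formula above could be computed from a candidate embedding as follows (again an illustrative editorial sketch; `map_similarities` is a made-up name, and `digits_proj` is the embedding computed at the beginning of the post):

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
import numpy as np
from scipy.spatial.distance import pdist, squareform

def map_similarities(Y):
    """q_ij following the formula above, with f(z) = 1 / (1 + z^2)."""
    # Pairwise squared Euclidean distances between map points.
    D = squareform(pdist(Y, 'sqeuclidean'))
    # Heavy-tailed Cauchy kernel; a point is not its own neighbor.
    F = 1. / (1. + D)
    np.fill_diagonal(F, 0.)
    # Normalize every row, as in the denominator of the formula above.
    return F / F.sum(axis=1, keepdims=True)

Q = map_similarities(digits_proj)
</pre>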
## A physical analogy

-Let's assume that our map points are all connected with springs. The stiffness of a spring connecting points $i$ and $j$ depends on the mismatch between the similarity of the two data points and the similarity of the two map points, that is, $p_{ij} - q_{ij}$. Now, we let the system evolve according to the law of physics. If two map points are far apart while the data points are close, they are attracted together. If they are close while the data points are dissimilar, they are repelled.
+Let's assume that our map points are all connected by springs. The stiffness of a spring connecting points <span class="math-tex" data-type="tex">\\(i\\)</span> and <span class="math-tex" data-type="tex">\\(j\\)</span> depends on the mismatch between the similarity of the two data points and the similarity of the two map points, that is, <span class="math-tex" data-type="tex">\\(p_{ij} - q_{ij}\\)</span>. Now, we let the system evolve according to the laws of physics. If two map points are far apart while the data points are close, they are attracted together. If they are close while the data points are dissimilar, they are repelled.

The final mapping is obtained when the equilibrium is reached.

## Algorithm

-Remarkably, this analogy stems exactly from a natural mathematical algorithm. It corresponds to minimizing the Kullback-Leiber divergence between the two distributions $\big(p_{ij}\big)$ and $\big(q_{ij}\big)$:
+Remarkably, this analogy corresponds exactly to a natural mathematical algorithm. It amounts to minimizing the Kullback-Leibler divergence between the two distributions <span class="math-tex" data-type="tex">\\(\big(p_{ij}\big)\\)</span> and <span class="math-tex" data-type="tex">\\(\big(q_{ij}\big)\\)</span>:

-$$KL(P||Q) = \sum_{i, j} p_{ij} \, \log \frac{p_{ij}}{q_{ij}}.$$
+<span class="math-tex" data-type="tex">\\(KL(P||Q) = \sum_{i, j} p_{ij} \, \log \frac{p_{ij}}{q_{ij}}.\\)</span>

This measures the distance between our two similarity matrices.
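As a small sanity check (an editorial sketch that reuses the `P_sym` and `Q` matrices from the illustrative snippets above), the formula can be evaluated directly with NumPy:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij)."""
    # Clip to avoid log(0) on the diagonal and on empty cells.
    P = np.maximum(P, eps)
    Q = np.maximum(Q, eps)
    return np.sum(P * np.log(P / Q))

print(kl_divergence(P_sym, Q))
</pre>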

To minimize this score, we perform a gradient descent. The gradient can be computed analytically:

-$$\frac{\partial KL(P || Q)}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) g\left( \left| x_i - x_j\right| \right) \quad \textrm{where} \, g(z) = \frac{z}{1+z^2}.$$
+<span class="math-tex" data-type="tex">\\(\frac{\partial \, K\!L(P || Q)}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) g\left( \left| x_i - x_j\right| \right) \quad \textrm{where} \, g(z) = \frac{z}{1+z^2}.\\)</span>

-This gradient expresses the sum of all spring forces applied to map point $i$.
+This gradient expresses the sum of all spring forces applied to map point <span class="math-tex" data-type="tex">\\(i\\)</span>.

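Before the animation, here is a rough editorial sketch of what a single plain gradient step looks like under this spring picture. It assumes a symmetric similarity matrix `P` (for instance the `P_sym` from the earlier sketch) and a current embedding `Y` of shape `(N, 2)`; the `learning_rate` value is arbitrary, and scikit-learn's actual optimizer adds momentum, early exaggeration, and other refinements on top of this:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gradient_step(Y, P, learning_rate=100.):
    """One plain gradient-descent step on KL(P || Q) for the map points Y."""
    # Heavy-tailed similarities of the current map points (see above).
    D = squareform(pdist(Y, 'sqeuclidean'))
    F = 1. / (1. + D)
    np.fill_diagonal(F, 0.)
    Q = F / F.sum(axis=1, keepdims=True)
    # The spring between i and j acts along (y_i - y_j), with a strength
    # proportional to the mismatch (p_ij - q_ij), damped by 1 / (1 + |y_i - y_j|^2).
    strength = (P - Q) * F
    diff = Y[:, None, :] - Y[None, :, :]          # all pairwise y_i - y_j
    grad = 4. * np.sum(strength[:, :, None] * diff, axis=1)
    # Move every map point a little bit against its gradient.
    return Y - learning_rate * grad
</pre>

Iterating this step from a small random initial `Y` is, in essence, the optimization whose convergence the animation below records.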
Now, let's illustrate this process by creating an animation of the convergence.

-
-
-
-
-
<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
@@ -323,7 +301,7 @@ Q = squareform(Q)

f = plt.figure(figsize=(6, 6))
ax = plt.subplot(aspect='equal')
-im = ax.imshow(Q)
+im = ax.imshow(Q, interpolation='none', cmap=pal)
plt.axis('tight');
plt.axis('off');
</pre>
@@ -346,17 +324,9 @@ animation.write_gif("anim2.gif", fps=20)

<img src="anim2.gif" />

-
-
-
-
-
-
-
## The t-Student distribution

-Let's now explain the choice of the t-Student distribution for the map points, while a normal distribution is used for the data points. It is well known that the volume of the $N$-dimensional ball of radius $r$ scales as $r^N$. When $N$ is large, if we pick random points uniformly in the ball, most points will be close to the surface, and very few will be near the center.
-
+Let's now explain the choice of the t-Student distribution for the map points, while a normal distribution is used for the data points. It is well known that the volume of the <span class="math-tex" data-type="tex">\\(N\\)</span>-dimensional ball of radius <span class="math-tex" data-type="tex">\\(r\\)</span> scales as <span class="math-tex" data-type="tex">\\(r^N\\)</span>. When <span class="math-tex" data-type="tex">\\(N\\)</span> is large, if we pick random points uniformly in the ball, most points will be close to the surface, and very few will be near the center.

<pre data-code-language="python"
     data-executable="true"
@@ -382,13 +352,9 @@ for i, D in enumerate((2, 5, 10)):
    ax.set_title('D=%d' % D, loc='left')
</pre>

+When reducing the dimensionality of a dataset, if we used the same Gaussian distribution for the data points and the map points, this mathematical fact would result in an _imbalance_ among the neighbors of a given point. This imbalance would lead to an excess of attraction forces and a sometimes unappealing mapping. This is actually what happens in the original SNE algorithm, by Hinton and Roweis (2002).

-When reducing the dimensionality of a dataset, if we used the same Gaussian distribution for the data points and the map points, this mathematical fact would result in an *imbalance* among the neighbors of a given point. This imbalance would lead to an excess of attraction forces and a sometimes unappealing mapping. This is actually what happens in the original SNE algorithm, by Hinton and Roweis (2002).
-
-The t-SNE algorithm works around this problem by using a t-Student with one degree of freedom (or Cauchy) distribution for the map points. This distribution has a much heavier tail than the Gaussian distribution, which *compensates* the original imbalance. For a given data similarity between two data points, the two corresponding map points will need to be much further apart in order for their similarity to match the data similarity.
-
-
-
+The t-SNE algorithm works around this problem by using a t-Student distribution with one degree of freedom (or Cauchy distribution) for the map points. This distribution has a much heavier tail than the Gaussian distribution, which _compensates for_ the original imbalance. For a given similarity between two data points, the two corresponding map points will need to be much further apart in order for their similarity to match the data similarity.

<pre data-code-language="python"
     data-executable="true"
@@ -401,14 +367,11 @@ plt.plot(z, cauchy, label='Cauchy distribution');
plt.legend();
</pre>

-
-
-
## Conclusion

The t-SNE algorithm provides an effective method to visualize a complex dataset. It successfully uncovers hidden structures in the data, exposing natural clusters or smooth nonlinear variations along the dimensions. It has been implemented in many languages, including Python, and it can be easily used thanks to the scikit-learn library.

-The references below link to some optimizations and improvements that can be made to the algorithm and implementations. In particular, the algorithm described here is quadratic in the number of samples, which makes it unscalable to large datasets. One could for example obtain an $O(N \log N)$ complexity by using the Barnes-Hut algorithm to accelerate the N-body simulation via a quadtree or an octree.
+The references below link to some optimizations and improvements that can be made to the algorithm and its implementations. In particular, the algorithm described here is quadratic in the number of samples, which makes it unscalable to large datasets. One could, for example, obtain an <span class="math-tex" data-type="tex">\\(O(N \log N)\\)</span> complexity by using the Barnes-Hut algorithm to accelerate the N-body simulation via a quadtree or an octree.
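An editorial aside (not part of this commit): recent versions of scikit-learn expose the Barnes-Hut approximation directly through the `method` parameter of `TSNE`, so the scalable variant can be tried with a one-line change, assuming a version of the library that supports it:

<pre data-code-language="python"
     data-executable="true"
     data-type="programlisting">
from sklearn.manifold import TSNE

# 'barnes_hut' approximates the N-body forces with a tree, bringing the
# cost down to roughly O(N log N); 'exact' is the quadratic algorithm
# described in this post. The angle parameter trades speed for accuracy.
tsne = TSNE(method='barnes_hut', angle=0.5)
digits_proj = tsne.fit_transform(digits.data)
</pre>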

## References