
Daniel-Sinkin/DeepLearning


Introduction

Implementations of various deep learning methods and related visualisations.

Implementations

  • Encoder-decoder Transformer in PyTorch, built to align with Vaswani et al. (2017)
  • UNet-based diffusion model
  • LeNet

Visualisations

These visualisations are created for this blog post: https://daniel-sinkin.github.io/2025/07/31/Bishop-Visualisation.html

Denoising Autoencoder (DAE)

Low dimensional example

Suppose we have natural data lying in $\{(x, 0, 0, 0, 0) : x \in \mathbb{R}\} \subseteq \mathbb{R}^5$. This forms a 1-dimensional manifold (in fact a linear subspace). If noise is applied to, for example, the natural data point $(1, 0, 0, 0, 0)$, we might get something like $(1.1, -0.1, 0.2, 0.3, -0.2) \in \mathbb{R}^5$, and a DAE would try to learn to squish all coordinates other than the first one back to 0.

In this example I trained a simple 5 -> 16 -> 16 -> 5 MLP with ReLU activation for the denoising.
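The repository contains the actual training code; a minimal sketch of the setup described above might look like the following (batch size, noise scale, learning rate, and step count are assumptions, not the values used in the experiment):

```python
import torch
from torch import nn

# Natural data lives on the 1-D manifold {(x, 0, 0, 0, 0)}; we corrupt it with
# Gaussian noise and train the MLP to map noisy points back onto the manifold.
model = nn.Sequential(
    nn.Linear(5, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 5),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x_clean = torch.zeros(256, 5)
    x_clean[:, 0] = torch.randn(256)                       # points on the manifold
    x_noisy = x_clean + 0.2 * torch.randn_like(x_clean)    # corrupted inputs

    loss = nn.functional.mse_loss(model(x_noisy), x_clean)  # denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```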

Manifold projection

Inspired by https://www.deeplearningbook.org, Chapter 14 (Autoencoders), p. 512.

EM Algorithm

Figure: EM Algo

Markov Chain Monte Carlo

Suppose we have a sample point (in this example $z_t = (x_t, y_t) \in \mathbb{R}^2$). We draw a step from a pre-defined proposal distribution $q$. For the basic Metropolis algorithm, it needs to be symmetric, in the sense that $q(z'|z) = q(z|z')$ holds. The isotropic multivariate Gaussian $\mathcal{N}(0, \sigma^2 I)$ and the uniform distribution $U((-a, a) \times (-a, a))$ both satisfy this condition. In this example, we use a Gaussian as the proposal distribution.

We sample a step $\delta \sim \mathcal{N}(0, \sigma^2 I)$, and for our target function (an unnormalised distribution) $\tilde{p}$, we compute the (possibly larger than 1) acceptance value of this step as follows: $\tilde{A}(z_t, z_t + \delta) := \frac{\tilde{p}(z_t + \delta)}{\tilde{p}(z_t)}$.

As we want an acceptance probability, we clip it to be at most 1: $A(z_t, z_t + \delta) = \min(1, \tilde{A}(z_t, z_t + \delta)) = \min\left(1, \frac{\tilde{p}(z_t + \delta)}{\tilde{p}(z_t)}\right)$.

We then accept this step with probability $A(z_t, z_t + \delta)$, which is done by sampling $u \sim U(0, 1)$ and checking if $u \leq A(z_t, z_t + \delta)$ (which is always true if $\tilde{A}(z_t, z_t + \delta) \geq 1.0$).

If we accept, we define $z_{t + 1} = z_t + \delta$; if we reject, we simply discard the step and remain at the same position, i.e., $z_{t + 1} = z_t$.
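A compact sketch of this loop (the target $\tilde{p}$, step size $\sigma$, and chain length are placeholders, not the settings used for the animation):

```python
import numpy as np

def metropolis(p_tilde, z0, n_steps=10_000, sigma=0.5, rng=None):
    """Basic Metropolis sampler with an isotropic Gaussian proposal."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z0, dtype=float)
    samples = [z.copy()]
    for _ in range(n_steps):
        delta = rng.normal(0.0, sigma, size=z.shape)      # symmetric proposal step
        a = min(1.0, p_tilde(z + delta) / p_tilde(z))     # acceptance probability
        if rng.uniform() <= a:                            # accept the step ...
            z = z + delta
        samples.append(z.copy())                          # ... or stay at z_t
    return np.array(samples)

# Example: unnormalised 2-D standard Gaussian target
chain = metropolis(lambda z: np.exp(-0.5 * z @ z), z0=np.zeros(2))
```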

Rejection Sampling

Suppose we want to sample from a complex distribution $p(z)$ that is easy to evaluate if we ignore normalisation, i.e., such that $\tilde{p}(z) = C p(z)$, $C > 0$, is easy to evaluate. Take some easy-to-sample distribution $q(z)$ (for example a Gaussian) and find $k > 0$ such that $kq(z) \geq \tilde{p}(z)$ for all $z$. Then draw $z_0 \sim q$, sample $\xi \sim U(0, kq(z_0))$, and accept $z_0$ if $\xi \leq \tilde{p}(z_0)$, rejecting it otherwise. The acceptance rate is proportional to $\frac{1}{k}$, so we want $k$ to be as small as possible.
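A minimal sketch assuming a standard normal envelope $q$ (the target $\tilde{p}$ and the constant $k$ are illustrative choices, not from the repository):

```python
import numpy as np

def rejection_sample(p_tilde, n, k=3.0, rng=None):
    """Rejection sampling with a standard normal envelope q(z) = N(0, 1).

    Requires k * q(z) >= p_tilde(z) for all z; k is chosen by hand here.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = []
    while len(out) < n:
        z0 = rng.standard_normal()                         # draw z0 ~ q
        q_z0 = np.exp(-0.5 * z0**2) / np.sqrt(2 * np.pi)   # q(z0)
        xi = rng.uniform(0.0, k * q_z0)                     # uniform height under the envelope
        if xi <= p_tilde(z0):                               # keep points under p_tilde
            out.append(z0)
    return np.array(out)

# Example: unnormalised N(0, 1/2) target; k = 3 >= sqrt(2*pi) suffices here.
samples = rejection_sample(lambda z: np.exp(-z**2), n=1000)
```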

Ridge Regression

Here I implemented the discussion in this Stack Exchange answer, https://stats.stackexchange.com/a/151351, which shows explicitly the "ridge" that we get with overfitting.

Figure: Ridge Regression Plot
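This is not the exact script behind the plot; a small closed-form ridge regression sketch, with assumed polynomial features and regularisation strengths, might look like this:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)

# High-degree polynomial features overfit without regularisation.
X = np.vander(x, N=12, increasing=True)
w_unreg = ridge_fit(X, y, lam=0.0)     # prone to overfitting
w_ridge = ridge_fit(X, y, lam=1e-3)    # weights shrunk towards zero, smoother fit
```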

Curve Fitting

Assuming we are fitting a curve with Gaussian noise, then for any fixed $x$ value the distribution of the target value is a Gaussian centred on the curve (based on Figure 1.28 in Bishop, Pattern Recognition and Machine Learning, 2006).

Figure: Curve Fitting Gaussian
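A small sketch of the generative assumption behind that figure (the underlying sine curve and the noise level $\sigma$ are placeholders): for each fixed $x$, the target is drawn as $t \sim \mathcal{N}(f(x), \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)        # the underlying (unknown) curve

sigma = 0.2                              # noise standard deviation
x = rng.uniform(0, 1, size=50)
t = f(x) + sigma * rng.standard_normal(x.shape)   # t | x ~ N(f(x), sigma^2)

# Density of the target at a fixed x0, as in the Bishop-style figure:
x0 = 0.3
t_grid = np.linspace(f(x0) - 1, f(x0) + 1, 200)
p_t_given_x0 = np.exp(-0.5 * ((t_grid - f(x0)) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
```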

PCA

We take our samples $X = \{x_1, \dots, x_N\} \subseteq \mathbb{R}^D$ and construct the covariance matrix $\Sigma_X = \sum_{i = 1}^N x_i x_i^\top$ (assuming the data has been centred). This is a symmetric, positive semi-definite $D \times D$ matrix (positive definite if the samples span $\mathbb{R}^D$); as such it has $D$ orthonormal eigenvectors $v_1, \dots, v_D$ with non-negative eigenvalues $\lambda_1, \dots, \lambda_D$, and without loss of generality we can assume that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_D$.

The first $k$ eigenvectors then span the directions of greatest variance; projecting onto them extracts the $k$ most important components.

Figure: PCA
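A minimal sketch of this procedure via an eigendecomposition of the sample covariance (the toy data is assumed, not from the repository):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the sample covariance matrix.

    X has shape (N, D); returns the top-k components and the projected data.
    """
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / len(X)        # (D, D) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # sort descending by variance
    components = eigvecs[:, order[:k]]              # first k eigenvectors
    return components, X_centered @ components      # projection onto them

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)
components, X_proj = pca(X, k=1)
```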

Effect of regularisation

Suppose we have a covariance matrix $\Sigma \in \mathbb{R}^{N \times N}$. We can regularise it by making the diagonal more dominant, applying so-called Tikhonov (or Tychonoff) regularisation: $\Sigma_\alpha = \alpha I_{N \times N} + \Sigma$. The effect is to make the underlying equidensity contours more spherical, in the sense that the eigenvalues move closer together ($\lambda_i / \lambda_{i + 1}$ moves closer to 1).

Figure: Regularisation
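A tiny sketch illustrating this effect on an assumed 2×2 covariance matrix: as $\alpha$ grows, the eigenvalue ratio approaches 1.

```python
import numpy as np

sigma = np.array([[4.0, 1.0],
                  [1.0, 0.5]])                  # example covariance matrix

for alpha in [0.0, 1.0, 10.0]:
    sigma_alpha = alpha * np.eye(2) + sigma     # Tikhonov-regularised covariance
    lam = np.sort(np.linalg.eigvalsh(sigma_alpha))[::-1]
    print(alpha, lam[0] / lam[1])               # ratio shrinks towards 1 as alpha grows
```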

Box Muller

A reasonably efficient algorithm for sampling standard normal random variables using a transform of uniformly distributed variables. Suppose we want to sample $2N$ standard normal variables.

First start with $X^{(0)} := \{(x_i, y_i) : 1 \leq i \leq N\}$ where $x_i, y_i \sim U(-1, 1)$, and reject all samples where $r_i^2 := x_i^2 + y_i^2 > 1$, giving us $X^{(1)} = \{(x_i, y_i) \in X^{(0)} : r_i^2 \leq 1\}$.

Figure: BM Square Disk

We then apply the transform $s_i^{(1)} = x_i \sqrt{\frac{-2\log(r_i^2)}{r_i^2}}$ and $s_i^{(2)} = y_i \sqrt{\frac{-2\log(r_i^2)}{r_i^2}}$; it can be shown that $s_i^{(1)}$ and $s_i^{(2)}$ are iid with $s_i^{(j)} \sim \mathcal{N}(0, 1)$. This uses the rotational symmetry of the 2-dimensional normal distribution.

Figure: BM Square Disk

We can see more accurately what the transformation does by highlighting a particular slice.

Figure: BM Slice
Figure: BM Slice
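A minimal sketch of this rejection-based (polar) variant, following the transform described above:

```python
import numpy as np

def box_muller_polar(n, rng=None):
    """Generate 2*n standard normal samples via the polar Box-Muller method."""
    rng = np.random.default_rng() if rng is None else rng
    out = []
    while len(out) < 2 * n:
        x, y = rng.uniform(-1.0, 1.0, size=2)
        r2 = x * x + y * y
        if r2 >= 1.0 or r2 == 0.0:             # reject points outside the unit disk
            continue
        factor = np.sqrt(-2.0 * np.log(r2) / r2)
        out.extend([x * factor, y * factor])   # two independent N(0, 1) samples
    return np.array(out[: 2 * n])

samples = box_muller_polar(10_000)
```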

Integral Density

When doing Bayesian inference, if the posterior $p(\theta|X)$ is sharply peaked around a particular $\hat{\theta}$ (very close to 1 there and close to $0$ everywhere else), then $p(x|X) \approx p(x|\hat{\theta})$. This visualises this phenomenon.

Figure: Integral Density 0020 2600
Figure: Integral Density 0330 2000
Figure: Integral Density 1000 2650
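The reasoning step behind this approximation is the posterior predictive integral: when the posterior is (approximately) a point mass at $\hat{\theta}$, the integral collapses onto the likelihood evaluated at $\hat{\theta}$.

```latex
p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta
            \approx \int p(x \mid \theta)\, \delta(\theta - \hat{\theta})\, d\theta
            = p(x \mid \hat{\theta})
```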

Gaussian Process

Normal:

Figure: Gaussian Process

With added noise:

Figure: Gaussian Process Noise
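A minimal sketch of drawing such functions from a GP prior, assuming an RBF kernel and an illustrative noise variance (not necessarily the kernel or settings used for the plots):

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=0.3, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between two sets of 1-D inputs."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 100)

K = rbf_kernel(x, x)
noise_var = 0.05

# Draw functions from the GP prior, noise-free and with added observation noise.
f_prior = rng.multivariate_normal(np.zeros(len(x)), K + 1e-6 * np.eye(len(x)), size=3)
f_noisy = rng.multivariate_normal(np.zeros(len(x)), K + noise_var * np.eye(len(x)), size=3)
```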

Contribution Histograms

Figures: Gaussian Process Hist 50, Gaussian Process Hist 65, Gaussian Process Hist 105

Heatmap of different Kernels

Figure: Gaussian Process Kernel Heatmap

References

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
