diff --git a/README.md b/README.md index 8225183..c060760 100644 --- a/README.md +++ b/README.md @@ -5,3 +5,9 @@ These are a portion of the notes I kept for the lectures in my Master's in ETH Z Notes are by no means intended to be complete or comprehensive. If you see some gaps or omitted details it is possibly because either I find the topic very general or I did not really understand it at all. Me being lazy to format it could be another possible reason. That said, I welcome any suggestions on additional content. Similarly, there could be mistakes in the notes either because I copied and pasted parts from various sources or that I misunderstood the content. Please send me a pull request or an e-mail if I have a typo or any kind of misinformation in the notes. + +Cheatsheets that were based on someone else's original work are as follows: + +- AML cheatsheet adapted from [here](https://github.com/plokchen/eth-ml-exam-summary). +- CIL cheatsheet adapted from [here](https://github.com/tyxeron/eth-cil-exam-summary), which in turn is a fork of [this](https://github.com/groggi/eth-cil-exam-summary). +- PAI cheatsheet adapted from [this](https://legacy.amiv.ethz.ch/system/files/studiumsunterlagen/pai_zfg_final.docx) in [here](https://legacy.amiv.ethz.ch/studium/unterlagen/132). \ No newline at end of file diff --git a/cheatsheets/aml-cheatsheet.pdf b/cheatsheets/aml-cheatsheet.pdf new file mode 100644 index 0000000..ad045db Binary files /dev/null and b/cheatsheets/aml-cheatsheet.pdf differ diff --git a/cheatsheets/cil-cheatsheet.pdf b/cheatsheets/cil-cheatsheet.pdf new file mode 100644 index 0000000..c27fba9 Binary files /dev/null and b/cheatsheets/cil-cheatsheet.pdf differ diff --git a/cheatsheets/mlhc-cheatsheet.pdf b/cheatsheets/mlhc-cheatsheet.pdf new file mode 100644 index 0000000..1998c12 Binary files /dev/null and b/cheatsheets/mlhc-cheatsheet.pdf differ diff --git a/cheatsheets/pai-cheatsheet.pdf b/cheatsheets/pai-cheatsheet.pdf new file mode 100644 index 0000000..20fa762 Binary files /dev/null and b/cheatsheets/pai-cheatsheet.pdf differ diff --git a/Advanced Machine Learning.pdf b/notes/Advanced Machine Learning.pdf similarity index 100% rename from Advanced Machine Learning.pdf rename to notes/Advanced Machine Learning.pdf diff --git a/Computational Intelligence Lab.pdf b/notes/Computational Intelligence Lab.pdf similarity index 100% rename from Computational Intelligence Lab.pdf rename to notes/Computational Intelligence Lab.pdf diff --git a/Computer Vision.pdf b/notes/Computer Vision.pdf similarity index 100% rename from Computer Vision.pdf rename to notes/Computer Vision.pdf diff --git a/Machine Learning for Health Care.pdf b/notes/Machine Learning for Health Care.pdf similarity index 100% rename from Machine Learning for Health Care.pdf rename to notes/Machine Learning for Health Care.pdf diff --git a/Machine Perception.pdf b/notes/Machine Perception.pdf similarity index 100% rename from Machine Perception.pdf rename to notes/Machine Perception.pdf diff --git a/Natural Language Understanding.pdf b/notes/Natural Language Understanding.pdf similarity index 100% rename from Natural Language Understanding.pdf rename to notes/Natural Language Understanding.pdf diff --git a/Probabilistic Artificial Intelligence.pdf b/notes/Probabilistic Artificial Intelligence.pdf similarity index 100% rename from Probabilistic Artificial Intelligence.pdf rename to notes/Probabilistic Artificial Intelligence.pdf diff --git a/Statistical Learning Theory.pdf b/notes/Statistical Learning Theory.pdf 
similarity index 100% rename from Statistical Learning Theory.pdf rename to notes/Statistical Learning Theory.pdf diff --git a/word-embeddings.pdf b/notes/word-embeddings.pdf similarity index 100% rename from word-embeddings.pdf rename to notes/word-embeddings.pdf diff --git a/src/aml-cheatsheet/0Basics.tex b/src/aml-cheatsheet/0Basics.tex new file mode 100644 index 0000000..053598b --- /dev/null +++ b/src/aml-cheatsheet/0Basics.tex @@ -0,0 +1,51 @@ +\section{Basics} +$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{- \frac{1}{2} \frac{(x-\mu)^2}{\sigma^2}},\quad \mathcal{N}(x|\mu, \sigma)$\\ +$f(x) = \frac{1}{\sqrt{(2\pi)^d\det\Sigma}} e^{- \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)},\quad \mathcal{N}(x|\mu, \Sigma)$\\ +Condition number: $\kappa(A)=\frac{\sigma_{max}(A)}{\sigma_{min}(A)}$ \\ +f(x) on a: $f(a)+\tfrac{f'(a)}{1!}(x-a) + \tfrac{f''(a)}{2!}(x-a)^2 + ...$ \\ +Binomial: $f(k,n,p) {=} Pr(X=k) {=} \binom nk p^k (1{-}p)^{n{-}k}$ \\ +$\ln(p(x|\mu, \Sigma)) {=} {-}\tfrac{d}{2}\ln(2\pi) {-} \tfrac{\ln|\Sigma|}{2} {-} \tfrac{1}{2}(x{-}\mu)^T\Sigma(x{-}\mu)$ \\ +$X {\sim} \mathcal{N}(\mu,\Sigma)$, $Y{=}A{+}BX \Rightarrow Y{\sim}\mathcal{N}(A{+}B\mu,B\Sigma B^T)$ // +General p-norm: $\norm{ x }_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$ + +\subsection*{Moments} +\begin{inparaitem}[\color{red}\textbullet] +% Variance +\item $Var[X]=\int_x(x-\mu)^2p(x) dx$ \\ +\item $Var[X]=E[(X-E[X])^2]=E[X^2]-E[X]^2$ \\ +\item $Var[X{+}Y]=Var[X]{+}Var[Y]{+}2Cov[X,Y]$ \\ +% Covariance +\item $Cov[X,Y] = E[(X - E[X])(Y - E[Y])]$ \\ +\item $Cov[aX,bY]{=}abCov[X,Y]$ \\ +\item $K_{\bm{XY}} = cov(X,Y) = E[XY^T] - E[X]E[Y^T]$ +\end{inparaitem} +\subsection*{Calculus} +\begin{inparaitem}[\color{red}\textbullet] + \item Part.: $\int u(x)v'(x) dx = u(x)v(x) - \int v(x)u'(x) dx$\\ + \item Chain r.: $\frac{f(y)}{g(x)} = \frac{dz}{dx} \Big|_{x=x_0}= \frac{dz}{dy}\Big|_{z=g(x_0)}\cdot \frac{dy}{dx} \Big|_{x=x_0}$ \\ + %\item $g_x(1) = g_x(0) + g'_x(0) + \int_{0}^{1} g_x''(s)(1-s) ds$ \\ + %\item $g(\mathbf{w}+\delta) - g(\mathbf{w}) = %\int_{\mathbf{w}}^{\mathbf{w+\delta}} \nabla g(\mathbf{u}) du = (\int_{0}^{1} \nabla g(\mathbf{w}+t\delta)dt) \cdot \delta$\\ + \item $\frac{\partial}{\partial \mathbf{x}}(\mathbf{b}^\top \mathbf{x}) = \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^\top \mathbf{b}) = \mathbf{b}$ + \item $\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^\top \mathbf{x}) = 2\mathbf{x}$ \\ + \item $\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^\top \mathbf{A}\mathbf{x}) = (\mathbf{A}^\top + \mathbf{A})\mathbf{x} \stackrel{\text{\tiny A sym.}}{=} 2\mathbf{A}\mathbf{x}$ \\ + \item $\frac{\partial}{\partial \mathbf{x}}(\mathbf{b}^\top \mathbf{A}\mathbf{x}) = \mathbf{A}^\top \mathbf{b}$ + \item $\frac{\partial}{\partial \mathbf{X}}(\mathbf{c}^\top \mathbf{X} \mathbf{b}) = \mathbf{c}\mathbf{b}^\top$ \\ + \item $\frac{\partial}{\partial \mathbf{X}}(\mathbf{c}^\top \mathbf{X}^\top \mathbf{b}) = \mathbf{b}\mathbf{c}^\top$ + \item $\frac{\partial}{\partial \mathbf{x}}(\| \mathbf{x}-\mathbf{b} \|_2) = \frac{\mathbf{x}-\mathbf{b}}{\|\mathbf{x}-\mathbf{b}\|_2}$ \\ + \item $\frac{\partial}{\partial \mathbf{x}}(\|\mathbf{x}\|^2_2) = \frac{\partial}{\partial \mathbf{x}} (\|\mathbf{x}^\top \mathbf{x}\|_2) = 2\mathbf{x}$ + \item $\frac{\partial}{\partial \mathbf{X}}(\|\mathbf{X}\|_F^2) = 2\mathbf{X}$ \\ + \item $x^T A x = Tr(x^T A x) = Tr(x x^T A) = Tr(A x x^T)$ \\ + \item $\tfrac{\partial}{\partial A} Tr(AB) {=} B^T$ + \item $\frac{\partial}{\partial A} log|A| {=} A^{-T}$ \\ + \item $\text{sigmoid}(x) = \sigma(x) = \frac{1}{1+\exp(-x)}$ 
\\ + \item $\nabla \text{sigmoid}(x) = \text{sigmoid}(x)(1-\text{sigmoid}(x))$ \\ + \item $\nabla \text{tanh}(x) = 1-\text{tanh}^2(x)$ + \item $\tanh x {=} \frac{\sinh x}{\cosh x} {=} \frac{e^{x}-e^{-x}}{e^{x} + e^{-x}}$ +\end{inparaitem} +\subsection*{Probability / Statistics} +\begin{compactdesc} + \item[Bayes' Rule]$ P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)} = \frac{P(X|Y)P(Y)}{\sum\limits^k_{i=1}P(X|Y_i)P(Y_i)}$\\ + \item[MGF] $\mathbf{M}_X(t)=\mathbb{E}[e^{\mathbf{t}^T \mathbf{X}}]$, $\mathbf{X}=(X_1,.., X_n) $ +\end{compactdesc} +\subsection*{Jensen's inequality} + X: random variable \& $\varphi$: convex function $\rightarrow$ $\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)]$ diff --git a/src/aml-cheatsheet/10NeuralNet.tex b/src/aml-cheatsheet/10NeuralNet.tex new file mode 100644 index 0000000..e8f3064 --- /dev/null +++ b/src/aml-cheatsheet/10NeuralNet.tex @@ -0,0 +1,9 @@ +\section{Neural Network} +\subsection*{Backpropagation} +For each unit $j$ on the output layer:\\ +- Compute error signal: $\delta_j = \ell_j'(f_j)$\\ +- For each unit $i$ on layer $L$: $\frac{\partial}{\partial w_{j,i}} = \delta_j v_i$ + +For each unit $j$ on hidden layer $l=\{L-1,..,1\}$:\\ +- Error signal: $\delta_j = \phi'(z_j) \sum_{i\in Layer_{l+1}} w_{i,j}\delta_i$\\ +- For each unit $i$ on layer $l-1$: $\frac{\partial}{\partial w_{j,i}} = \delta_j v_i$ diff --git a/src/aml-cheatsheet/10TimeSeries.tex b/src/aml-cheatsheet/10TimeSeries.tex new file mode 100644 index 0000000..d2f3d7b --- /dev/null +++ b/src/aml-cheatsheet/10TimeSeries.tex @@ -0,0 +1,53 @@ +% -*- root: Main.tex -*- +\section{Time series} +\subsection*{Markov Model} +Markov assumption: $P(Y_t|Y_{1:t-1}) = P(Y_t|Y_{t-1})$\\ +Stationarity assumption:\\ +$P(Y_{t+1}=y_1|Y_t=y_2) = P(Y_t=y_1|Y_{t-1}=y_2)$\\ +Product rule:\\ +$P(Y_t,...,Y_1) = P(Y_t|Y_{t-1},...,Y_1)\cdot ... \cdot P(Y_1)$\\ +Sum rule:\\ +$P(Y_{t+2}|Y_{1:t}) = \sum_{Y_{t+1}} P(Y_{t+2}, Y_{t+1}|Y_{1:t})$ +\subsection*{Hidden Markov Model} +triplet $M = (\Sigma, Q, \Theta)$\\ +$\Sigma$ symbols, $Q$ states, $\Theta=(A,E)$ transition and emission, $e_k(b)$ emission prob. $x_k \in Q, b \in \Sigma$ +\subsection*{Forward/Backward - Alternative} +Goal: $P(x_t|s) \propto P(x_t,s) = P(s_{t+1:n}|x_t)P(x_t,s_{1:t})$ +\subsection*{Evaluation (Forward/Backward)} +Transition A and emission E known. Sequence s given.\\ +Wanted: probability that s is generated by the HMM.\\ +\textbf{Forward:}\\ +Wanted: $f_l(s_t) = P(x_t = l, s_{1:t})$\\ +$f_l(s_{t+1}) = e_l(s_{t+1})\sum_k f_k(s_t) a_{k,l}$,\\ +$f_l(s_1) = \pi_l e_l(s_1) \forall l \in Q$\\ +\textbf{Backward:}\\ +Wanted: $b_l(s_t) = P(s_{t+1:n}|x_t = l)$\\ +$b_l(s_t) = \sum_k e_k(s_{t+1}) b_k(s_{t+1}) a_{l,k}$,\\ +$b_l(s_n) = 1 \forall l \in Q$\\ +Complexity in time: $\mathcal{O}(|\Sigma|^2 \cdot T)$ + +\subsection*{Decoding (Viterbi)} +Given: Observation sequence $O= \{O_1 O_2 \dots O_T \}$, $a_{ij} = P(q_{t+1} = S_j | q_t = S_i)$, $b_j(k)=P(v_k \text{ at } t |q_t = S_j)$ \\ +Wanted: most likely path $Q = \{q_1,q_2,\ldots q_T\}$\\ +$\delta_t (i)$: best score along a single path that accounts for the first $t$ observations and ends in $S_i$\\ +$\delta_t (j) = \max_{1 \leq i \leq N}[\delta_{t-1} (i)a_{ij}]b_j(O_t) $\\ +$\phi_t(j)=\argmax_{1\leq i \leq N} [\delta_{t-1}(i)a_{ij}]$\\ +Time: $\mathcal{O}(|S|^2 \cdot T)$ +Space: $\mathcal{O}(|S| \cdot T)$ + +\subsection*{Decoding (Viterbi) - Alternative} +Transition $a_{i,j} = P(x_{t+1} = j |x_t = i)$ and emission $e_l(s_t) = P(s_t|x_t=l)$ known.
Sequence s given.\\ +Wanted: Most likely path x responsible for the sequence.\\ +$v_l(s_{t+1}) = e_l(s_{t+1}) \max_k(v_k(s_t) a_{k,l})$\\ +$v_l(s_1) = \pi_l e_l(s_1) \forall l \in Q$\\ +Time: $\mathcal{O}(|\Sigma|^2 \cdot T)$, Space: $\mathcal{O}(|\Sigma| \cdot T)$ +\subsection*{Learning (Baum-Welch)} +Know: Set of sequences $s^1,...,s^m$\\ +Wanted: max transition A and emission E\\ +\textbf{E-step I:} Compute all $f_k(s_t^j)$ (forward-algo.) \& $b_k(s_t^j)$ (backward algo.)\\ +\textbf{E-step II:} Compute $A_{kl}$, $E_k(b)$ for all states and symbols\\ +$A_{kl} = \sum_{j=1}^{m} \frac{1}{P(\textbf{s}^j)} \sum_{t=1}^{n}f_k^j (s_t^j)a_{kl}e_l(s_{t+1}^j)b_l^j(s_{t+1}^j)$\\ +$E_k(b)=\sum_{j=1}^{m}\frac{1}{P(\textbf{s}^j)}\sum_{t|S_t^j=b}^{n}f_k^j(s_t^j)b_k^j(s_t^j)$\\ +\textbf{M-step:} Compute param. estimates $a_{kl}$, $e_k(b)$\\ +$a_{kl}=\frac{A_{kl}}{\sum_{i=1}^{n}A_{ki}}$, $e_k(b)=\frac{E_k(b)}{\sum_{b'}E_k(b')}$\\ +Complexity: $\mathcal{O}(|\Sigma|^2)$ in storage (space). \ No newline at end of file diff --git a/src/aml-cheatsheet/1Regression.tex b/src/aml-cheatsheet/1Regression.tex new file mode 100644 index 0000000..9b89ad9 --- /dev/null +++ b/src/aml-cheatsheet/1Regression.tex @@ -0,0 +1,92 @@ +% -*- root: Main.tex -*- +\section{Regression} +%\subsection*{Linear Regression} +%Error: $\hat{R}(w) = \sum_{i=1}^n (y_i - w^Tx_i)^2 = ||Xw-y||^2_2$\\ +%Closed form: $w^*=(X^T X)^{-1} X^T y$\\ +%Gradient: $\nabla_w \hat{R}(w) = 2X^T (Xw-y)$ +\subsection*{Estimation} +Consistency: $\hat{\theta_n} \stackrel{\text{\tiny P}}{\rightarrow} \theta$, +i.e. $\forall\epsilon P \{|\hat{\theta_n}-\theta| \geq\epsilon\} \stackrel{\tiny n \to\infty}{\longrightarrow} 0 $\\ +Asymptotic normality: $\sqrt{N}(\theta - \hat{\theta_n}) \to \mathcal{N}(0, J^{-1}IJ^{-1})$ \\ +Asymptotic efficiency: $\hat{\theta_n}$ has the smallest variance among all possible consistent estimators (for large enough N), i.e. 
$\lim_{n\to\infty} (V[\hat{\theta_n}]I(\theta))^{-1} = 1$ + $\hat{\theta}_{MAP} := \argmax_\theta \left \{ \sum_{i=1}^n \log p(x_i | \theta) + \log p(\theta) \right\}$ +\subsection*{Rao-Cramer} +$\Lambda = \frac{\partial \log \mathbb{P}(x|\theta )}{\partial \theta}$ (score function), $E[\Lambda ]=0$\\ +Fisher information: $I= \mathbb{V}[\Lambda]$ \\ +$J= E[\Lambda^{2}]= -E[\frac{\partial^2 \log \mathbb{P}(x|\theta ) }{\partial \theta \partial \theta ^{T}}]= -E[\frac{\partial \Lambda}{\partial \theta}]$ \\ +Variance of an estimator is bounded from below by the inverse of the Fisher information \\ +MSE bound: $E[(\hat \theta -\theta )^{2}] \geq \frac{[1 + b^{\prime} (\theta)]^{2}}{n E[\Lambda ^{2}]} + b_{\hat \theta}^{2}$ \\ +Biased estimators: $var(\hat{\theta}) \geq \frac{[1 + b^{\prime}(\theta)]^2}{I(\theta)}$ \\ +Efficiency: $e(\hat{\theta}) = \frac{I(\theta)^{-1}}{var(\hat{\theta})} \leq 1$ \\ +Cauchy-Schwarz: $|E[XY]|^2 \leq E[X^2] E[Y^2]$ + +\subsection*{Regularized regression} +Error: $\hat{R}(w) = \sum \limits_{i=1}^n (y_i - w^Tx_i)^2 + \lambda ||w||_2^2$ (Ridge) \\ +Closed form: $w^*=(X^T X + \lambda I)^{-1} X^T y$ (Ridge)\\ +%Grad: $\nabla_w \hat{R}(w) = -2 \sum_{i=1}^n (y_i-w^T x_i) \cdot x_i + 2 \lambda w$\\ +{\small{} Shrinkage:} $Xw^*{=}\sum_{j=1}^{d} u_j\frac{\sigma_j^2}{\sigma_j^2+\lambda}u_j^Ty$, $X{=}U\Sigma V^T$ +LASSO: $w^* = \underset{w}{\operatorname{argmin}} \sum \limits_{i=1}^n (y_i - w^Tx_i)^2 + \lambda ||w||_1$ + +\subsection*{Bayesian linear regression} + Model: $y = X^T \beta + \epsilon$, with $\epsilon \sim + \mathcal{N}(\epsilon | 0, \sigma^2 I)$ or + $P(y | X, \beta, \sigma) = \mathcal{N}(y | X^T \beta , \sigma^2 I)$ + $P(\beta | \Lambda) = \mathcal{N} (\beta | 0, \Lambda^{-1})$, Post: $P(\beta | X, y, \Lambda) = \mathcal{N}(\beta | \mu_\beta, \Sigma_\beta)$ + $\mu_\beta = (X^T X + \sigma^2 \Lambda)^{-1} X^T y$, $\Sigma_\beta = \sigma^2(X^T X + \sigma^2 \Lambda)^{-1}$ + Prediction: $y_{new} = \hat{\beta}_{\scaleto{MAP}{4pt}}^T x_{new} = \mu_\beta ^T x_{new}$ + $P(y_{new} | x_{new}, X, y) + = \mathcal{N}(\mu_\beta ^T x_{new}, \sigma^2 + x_{new}^T \Sigma_\beta x_{new})$ + +\subsection*{Combination of Regression Models:} +$\text{bias}[\hat{f}(x)] = \frac{1}{B} \sum_{i=1}^{B} \text{bias}[\hat{f}_i(x)]$\\ +Var$[\hat{f}(x)] = \frac{1}{B^2}\sum_i$ Var$[\hat{f}_i(x)] ++ \frac{1}{B^2}\sum_{i,j:i\neq j}$ Cov$[\hat{f}_i(x), \hat{f}_j(x)] \approx \frac{\sigma^2}{B}$ +% \subsection*{Smoothing Splines} +% $RSS(f,\lambda) = \sum\limits_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f''(x))^2dx$\\ + +\subsection*{RSS Estimator} +$\hat{\beta} \sim \mathcal{N}(\beta,(X^TX)^{-1}\sigma^2)$.
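A minimal numpy sketch (an illustrative addition, not from the notes) of the ridge closed form above, $w^*=(X^T X+\lambda I)^{-1}X^T y$, on synthetic data; it checks that $w^*$ zeroes the gradient of the regularized squared error. With $\Lambda = (\lambda/\sigma^2) I$ this same $w^*$ coincides with the Bayesian posterior mean $\mu_\beta$.

```python
# Illustrative sketch: ridge closed form on synthetic data (names are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# w* = (X^T X + lambda*I)^{-1} X^T y
w_star = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of sum_i (y_i - w^T x_i)^2 + lam*||w||_2^2 vanishes at w*.
grad = -2 * X.T @ (y - X @ w_star) + 2 * lam * w_star
assert np.allclose(grad, 0, atol=1e-8)
```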
+%\textbf{Unbiasedness}: $\mathbb{E}[\hat{\beta}] = \mathbb{E}[(X^TX)^{-1}X^Ty] = (X^TX)^{-1}X^T\mathbb{E}[X\beta+\epsilon] = (X^TX)^{-1}(X^TX)\beta+X^T\mathbb{E}[\epsilon] = \beta + 0$ +%\textbf{Variance of} $a^T\hat{\beta}$: $\mathbb{V}(a^T(X^TX)^{-1}X^T(X\beta + \epsilon)) = \mathbb{V}(a^T\beta) + \mathbb{E}(a^T(X^TX)^{-1}X^T\epsilon\epsilon^TX(X^TX)^{-1}a) = \sigma^2 a^T(X^TX)^{-1}a$ + +%\subsection*{Gauss-Markov Theorem} +%For any linear estimator $\widetilde{\theta}=c^T\mathbf{y}$ that is unbiased for $a^T\beta$ it holds: $\mathbb{V}(a^T\hat{\beta}) \leq \mathbb{V}(c^T\mathbf{y})$\\ +%Proof: Let $c^T \mathbf{y} = a^T\hat{\beta} + a^T\mathbf{D}\mathbf{y} = a^T((\mathbf{X^TX})^{-1}\mathbf{X}^T + \mathbf{D})\mathbf{y}$ be an unbiased estimator of $a^T \beta$; then it follow $a^T \mathbf{DX}\beta = 0$ which implies $\mathbf{DX} = 0$.\\ +%$\mathbb{V}(c^T \mathbf{y}) = \mathbb{E}[(c^T \mathbf{y})^2]-\mathbb{E}(c^T \mathbf{y})^2 = c^T(\mathbb{E}\mathbf{y}\mathbf{y}^T - \mathbb{E}\mathbf{y}\mathbb{E}\mathbf{y}^T)c = \sigma^2 c^T c $ +%= $\sigma^2 \big( a^T ((\mathbf{X^T X})^{-1}\mathbf{X}^T + \mathbf{D}) (\mathbf{X}(\mathbf{X^T X})^{-1}+\mathbf{D}^T)a \big )$\\ +%= $\sigma^2 \big( a^T (\mathbf{X^T X})^{-1}a +\mathbf{DD^T}a \big )$ +%= $\mathbb{V}(a^T\hat{\beta}) + a^T \mathbf{DD^T}a \geq \mathbb{V}(a^T\hat{\beta})$ (note: $\mathbf{DD^T}$ is PSD) + +\subsection*{Bias vs. Variance} +\setlength{\mathindent}{0cm} +$ +\E_D\E_{X,Y}\left(\hat{f}(X)-Y\right)^2 = \\ +\E_D\E_X\left(\hat{f}(X) - \E(Y|X)\right)^2 + \E_{X,Y}\left(Y - \E(Y|X)\right)^2\\ += \E_X \E_D\left(\hat{f}(X) - \E_D(\hat{f}(X))\right)^2 \text{(variance)}\\ ++ \E_X\left(\E_D(\hat{f}(X)) - \E(Y|X)\right)^2 \text{(bias}^2)\\ ++ \E_{X,Y}\left(Y - \E(Y|X)\right)^2 \text{(noise)} +$\\ +%High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).\\ +%High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs. + +% \subsection*{Gradient Descent} +% 1. Start arbitrary $w_o \in \mathbb{R}$\\ +% 2. For $i$ do $w_{t+1} = w_t - \eta_t \nabla \hat{R}(w_t)$ + +%\subsection*{Curse of Dimensionality} +%To obtain a reliable estimate at a given regularity, the required number of samples grows exponentially with the dimension of the sample space. 
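A small Monte Carlo sketch (illustrative, not from the notes) of the bias--variance decomposition above: $\hat{f}_D$ is refit on many fresh training sets $D$ and evaluated at a fixed test point $x_0$; the target function, noise level, and polynomial degree are arbitrary choices.

```python
# Monte Carlo estimate of bias^2 / variance / noise at one test point x0,
# for least-squares polynomial fits of a chosen degree (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)          # E[Y|X=x]
sigma, n, degree, x0, runs = 0.3, 30, 3, 0.5, 2000

preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, 1, n)                 # fresh training set D
    y = f(x) + sigma * rng.normal(size=n)
    preds[r] = np.polyval(np.polyfit(x, y, deg=degree), x0)  # \hat f_D(x0)

bias2, variance = (preds.mean() - f(x0)) ** 2, preds.var()
print(f"bias^2={bias2:.4f} variance={variance:.4f} noise={sigma**2:.4f}")
# Expected squared error at x0 is approximately bias^2 + variance + noise.
```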
+ +% \subsection*{Expected Error} +% For generalization, minimize the expected error +% $R(w) = \int P(x,y) (y-w^Tx)^2 \partial x \partial y$\\ +% $= \mathbb{E}_{x,y}[(y-w^Tx)^2]$ + + +\subsection*{Ridge Parametric to nonparametric} +Ansatz: $w=\sum_i \alpha_i x$\\ +$w^* = \underset{w}{\operatorname{argmin}} \sum_i (w^Tx_i-y_i)^2 + \lambda ||w||_2^2$ = \\ +${\operatorname{argmin}}_{\alpha_{1:n}} \sum_{i=1}^n (\sum_{j=1}^n \alpha_j x_j^T x_i - y_i)^2 + \lambda \sum_i \sum_j \alpha_i \alpha_j (x_i^T x_j)$\\ +$= {\operatorname{argmin}}_{\alpha_{1:n}} \sum_{i=1}^n (\alpha^T K_i - y_i)^2 + \lambda \alpha^T K \alpha$\\ +$= {\operatorname{argmin}}_{\alpha} ||\alpha^T K -y||_2^2 + \lambda \alpha^T K \alpha$\\ +Closed form: $\alpha^* = (K+\lambda I)^{-1} y$\\ +Prediction: $y^*= w^{*T} x = \sum_{i=1}^n \alpha_i^* k(x_i,x)$ \ No newline at end of file diff --git a/src/aml-cheatsheet/2Bayes.tex b/src/aml-cheatsheet/2Bayes.tex new file mode 100644 index 0000000..f35608f --- /dev/null +++ b/src/aml-cheatsheet/2Bayes.tex @@ -0,0 +1,13 @@ +% -*- root: Main.tex -*- +\section{Bayesian Methods} +\subsection*{MLE} +$\theta^* = \operatorname{argmax}_\theta P(y|x,\theta) $\\ +$= {\operatorname{argmax}}_\theta \prod_{i=1}^n P(y_i|x_i, \theta) \text{\quad (iid)}$\\ +$= {\operatorname{argmax}}_\theta \sum_{i=1}^n log P(y_i|x_i,\theta)$ + +\subsection*{MAP} +$w^* = \underset{w}{\operatorname{argmax}} P(w|x,y) = \underset{w}{\operatorname{argmax}} \frac{P(w|x) P(y|x,w)}{P(y|x)}$\\ +$=\underset{w}{\operatorname{argmax}} log P(w) + \sum_i log P(y_i|x_i,w) + const.$ + +\subsection*{MLE = MAP} +$n \rightarrow \infty$ or prior is uniformly distr. \ No newline at end of file diff --git a/src/aml-cheatsheet/2GaussianProcess.tex b/src/aml-cheatsheet/2GaussianProcess.tex new file mode 100644 index 0000000..2d595d1 --- /dev/null +++ b/src/aml-cheatsheet/2GaussianProcess.tex @@ -0,0 +1,102 @@ +% -*- root: Main.tex -*- +\section{Gaussian Processes} +% A GP $\{X_t\}_t$ is a collection of random variables from which any finite sample has a joint Gaussian distribution.\\ +% For any finite set of points $T=\{t_1, \dots, t_n\}$ from a GP, it hold that $(X_{t_1}, \dots,X_{t_n})\sim \mathcal{N}(\pmb{\mu_T},\pmb{\Sigma_T})$ with $\pmb{\mu_T} = (\mu(t_1),\dots,\mu(t_n))$, $\pmb{\Sigma_T}(i,j)=k(X_{t_i},X_{t_j})$ +\subsection*{Gaussian Process} +%$p\big(\begin{bmatrix} +%\mathbf{y} \\ +%y^*\\ +%\end{bmatrix}|x^*,\mathbf{X}, \sigma \big) = \mathcal{N}\big(\begin{bmatrix} +%\mathbf{y} \\ +%y^*\\ +%\end{bmatrix} | \mathbf{0},\begin{bmatrix} +%\mathbf{C_n} & \mathbf{k} \\ +%\mathbf{k}^T & c +%\end{bmatrix} \big)$\\ +%with $\mathbf{C_n} = \mathbf{K} + \sigma^2 \mathbf{I}, c = k(x_{n+1},x_{n+1}) + \sigma^2,\\ +%\mathbf{k}=k(x_{n+1}, \mathbf{X}), \mathbf{K}=k(\mathbf{X}, \mathbf{X})$\\ +%$p(y^*|x^*, X, y) = \mathcal{N}(y^*|\mu, \sigma^2)$\\ +%with $\mu = k^T C_n^{-1}y, \sigma^2 = c-k^TC_n^{-1}k$\\ + +$[y_1, y_2, ...]^T\!=\!X\beta\!+\!\epsilon \sim \mathcal{N}(y|0,X \Lambda^{-1} X^T+ \sigma^2 I) $ +$y \sim \mathcal{N}(y | m(X), K(X,X) + \sigma^2 I) = P(y|X,\Theta)$ + +%Joint dist.: {\footnotesize $p([y, y_{n+1}] | x_{n+1}, X, \sigma) +%\sim \mathcal{N}([y, y_{n+1}]|, K_{n+1} + \sigma^2I)$} + +$\left[\begin{smallmatrix} y \\y_{n+1}\end{smallmatrix}\right] \sim \mathcal{N}\left(\left[\begin{smallmatrix} y \\y_{n+1}\end{smallmatrix}\right]|[\begin{smallmatrix} m(X) \\m(x_{n+1})\end{smallmatrix}], [\begin{smallmatrix} C_n & k \\ k^T & c\end{smallmatrix}]\right)$ + +$p(y_{n+1}|x_{n+1}, X, y)) = \mathcal{N}(y_{n+1} | \mu_{n+1}, \sigma^2_{n+1})$ \\ 
+$\mu_{n+1} = m(x_{n+1})+k^T C^{-1}_n (y\!-\!m(X))$ \\ +$\sigma^2_{n+1} = c - k^T C^{-1}_n k$,$k = k(x_{n+1}, X)$ \\ +$c = k(x_{n+1},x_{n+1})\!+\!\sigma^2$,$C_n = K_n + \sigma^2 I$ + +\subsection*{GP Hyperparameter Optimization} +Log-likelihood:\\ +$l(Y|\theta) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log |C_n| - \frac{1}{2} Y^T C_n^{-1}Y$\\ +Set of hyperparameters $\theta$ determine parameters $C_n$. Gradient descent: $\nabla_{\theta_i}l(Y|\theta) = -\frac{1}{2}tr(C_n^{-1} \frac{\partial C_n}{\partial \theta_i}) + \frac{1}{2} Y^T C_n^{-1} \frac{\partial C_n}{\partial \theta_i} C_n^{-1} Y$ + +\subsection*{Kernels} + $K(x, y) = <\phi(x), \phi(y)>$ for some feature mapping $\phi(x)$\\ + Psd Gram Matrix: $c^TKc \geq 0, \sum_i\sum_jc_ic_jk(x_i,x_j)\geq 0$\\ + All principal minors of K need $det \geq 0$;\newline + $k(x, y) = k(y, x) ;\hspace{2mm} k(x,x) \geq 0;\hspace{2mm} k(x,x)k(v,v) \geq k(x,y)^2$ + Closure Properties: {\tiny psd prop. closed under pointwise limits (since each $K_n$ is a kernel)} \\ + $k(x,y) = k_1(x,y) + k_2(x,y)$, $k(x,y) = k_1(x,y)k_2(x,y)$\\ + $k(x,y) = f(x)f(y)$, $k(x,y) = k_3(\phi(x),\phi(y))$\\ + $k(x,y) = \exp(\alpha k_1(x,y)), \alpha > 0$, $|X \cap Y| = kernel$\\ + $k(x,y) = p(k_1(x,y)), \, p(\cdot)$ {\tiny\text{polynomial with pos. coeff.}}\\ + $k(x,y)=k_1(x,y)/ \sqrt{(k_1(x,x) k_1(y,y)}$\\ + Gaussian (rbf): $k(x,y) = \exp( -\tfrac{||x-y||^2}{2\sigma^2})$ {\tiny inf.dim.}\\ + Sigmoid: $k(x,y) = \tanh(k\cdot x^Ty - b)$ {\tiny\text{not valid for $\forall k,b$}} \\ + Polynomial: $k(x,y) {=} (x^Ty {+} c)^d$,$d\in N$,$c\geq0$ \\ + Periodic: $k(x,y) = \sigma ^2 exp(\frac{2\sin ^2 (\pi |x-y|/p)}{\ell ^2})$ + +% \subsubsection*{Polynomial kernel} +% $k_1 = (x^Ty)^m$ represents monomial of deg m \\ +% $k_2 = (1+x^Ty)^m$ represents monomials up to deg m + +%\subsubsection*{Properties of kernel} +%\begin{inparaitem} +% \item k must be symmetric\\ +% \item the kernel matrix must be SPD +%\end{inparaitem} + +%\subsubsection*{Kernel matrix} +%The kernel matrix $K$ is SPD \\ +%$K = +%\begin{bmatrix} +%k(x_1,x_1) & \dots & k(x_1,x_n) \\ +%\vdots & \ddots & \vdots \\ +%k(x_n, x_1) & \dots & k(x_n,x_n) +%\end{bmatrix}$\\ +%$\left ( XX^T \right )$ for inner product as kernel. + +% \subsubsection*{semi-positive-definite matrices} +% $M \in \mathbb{R}^{n\times n}$ is SPD $\Leftrightarrow$\\ +% $\forall x \in \mathbb{R}^n: x^TMx \geq 0 \Leftrightarrow$\\ +% all eigenvalues of $M$ are positive $\geq 0$ + +%\subsection*{Nearest Neighbor k-NN} +%$y=sign(\sum \limits_{i=1}^n y_i [x_i \text{ among k nearest neighbors of } x])$ + +%$k_1(x,y) + k_2(x,y)$ , +%$k_1(x,y) \cdot k_2(x,y)$, +%$c \cdot k_1(x,y)$ for $c>0$ , +%$f(k_1(x,y))$, where $f$ is exponential/polynomial with positive coefficents\\ +%$k(x,y) = \phi(x)^T \phi(y)$, for some $\phi$ ass. with k + +%\subsubsection*{Parametric vs. Nonparametric} +%\emph{Parametric}: have finite set of parameters\\ +%E.g. linear regression, perceptron,...\\ +%$f(x) = w^Tx, w\in \mathbb{R}^d$ (d is independent of #data) +% \begin{itemize} +% \item[+] computationally not complex +% \end{itemize} + +%\emph{Nonparametric}: grows in complexity with the size of the data\\ +%E.g. kernelized Perceptron, k-NN,...\\ +%$f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i,x_n)$ (depends on #data)\\ +% \begin{itemize} +% \item[+] potentially much more expressive. 
+% \end{itemize} \ No newline at end of file diff --git a/src/aml-cheatsheet/3NumericalEstimatesMethods.tex b/src/aml-cheatsheet/3NumericalEstimatesMethods.tex new file mode 100644 index 0000000..6b3569e --- /dev/null +++ b/src/aml-cheatsheet/3NumericalEstimatesMethods.tex @@ -0,0 +1,23 @@ +% -*- root: Main.tex -*- +\section{Numerical Estimating Methods} +Actual Risk: $\mathcal{R}(f) := \E_{x,y}[(y-f(x))^2]$ \\ +Empiricial Risk: $\hat{\mathcal{R}}(f) = \frac{1}{n}\sum_i (y_i - f(x_i))^2$\\ +Generalization Error: $G(f) = |\hat{\mathcal{R}}(f) - \mathcal{R}(f)|$ +\subsection*{$K$-fold cross validation} +$\hat{f}^{-\nu} \in \argmin_f \frac{1}{|Z^{-\nu}|} \sum_{i \in Z^{-\nu}} (y_i - f(x_i))^2$\\ +$\hat{\mathcal{R}}^{cv} = \frac{1}{n} \sum_i(y_i - \hat{f}^{-\kappa(i)}(x_i))^2$, $k(i)$ is fold $i^{th}$ fold \\ +Problem: systematic tendency to underfit. +\subsection*{Leave-one-out} +unbiased, high variance \\ +$\hat{f}^{-i} \in \argmin_f \frac{1}{n-1} \sum_{j:j \neq i} L(y_i,f(x_i))$ \\ +$\hat{\mathcal{R}}^{LOOCV} = \frac{1}{n} \sum_i L(y_i, \hat{f}^{-i}(x_i))$ +\subsection*{Bootstrapping} +Resampling with replacement from data $D$ to produce $B$ boostrap datasets $D^{*b}$. $S(D)$ is expected generalization error of prediction model trained on $D$. Var: $\sigma ^2(S) = \frac{1}{B-1}\sum_{b=1}^B(S(D^{*b})-\bar{S})^2$ with mean: $\hat{R}_{boot}(f)=\bar{S}=\frac{1}{B}\sum_{b=1}^B(\frac{1}{N}\sum_{i=1}^NL(y_i,\hat{f}_{D^{*b}}(x_i)))$ with $\hat{f}_{D^{*b}}(x_i)$ being the prediction model. $\hat{R}_{boot}^{LOO}(f) = \frac{1}{N}\sum_{i=1}^N\frac{1}{|C^{-i}|}\sum_{b\in C^{-i}}L(y_i,\hat{f}_{D^{*b}}(x_i))$ where $C^{-i}$ denotes the set of bootstrap sets not containing data point $i$. Note: $L$ can be $I_{\{c(x_i)\not =y_i\}}$. +$\hat{R}_{boot}$ is optimistic. Hence use: $\hat{R}^{.0632}=0.368\hat{R}_{boot}+0.632\hat{R}_{boot}^{(LOO)}$. \\ +Prob. not to appear in set: $(1-\frac{1}{n})^n = \frac{1}{e}$ for $n \rightarrow \infty$ +\subsection*{Jackknife} +Goal: Numerical estimate of bias of an estimator $\hat{S}_n$. Jackknife estimator: $\hat{S}^{JK}=\hat{S}_n-bias^{JK}$ with $bias^{JK}=(n-1)(\tilde{S}_n-\hat{S}_n)$ with $\tilde{S}_n=\frac{1}{n}\sum_{i=1}^n\hat{S}^{(-i)}_{n-1}$ with $\hat{S}^{(-i)}_{n-1}$ being the leave-1-out estimator. +\subsection*{Information Criteria} +$BIC = ln(n)k - 2ln(\hat{L})$, $AIC = 2k - 2ln(\hat{L})$\\ +$TIC = 2trace[I_1(\theta_k)J_1^{-1}(\theta_k)] - 2ln(\hat{L})$, +where k: num. params, n: num. data points, likelihood: $\hat{L}=p(X|\theta_k,M)$ \ No newline at end of file diff --git a/src/aml-cheatsheet/4Classification.tex b/src/aml-cheatsheet/4Classification.tex new file mode 100644 index 0000000..904a503 --- /dev/null +++ b/src/aml-cheatsheet/4Classification.tex @@ -0,0 +1,36 @@ +% -*- root: Main.tex -*- +\section{Classification} +\subsection*{Loss-Functions} +True class: $y \in \{-1,1\}$, pred. 
$z \in [-1,1]$\\ +Cross-entropy (log loss): ($y'=\tfrac{(1+y)}{2}$ and $z'=\tfrac{(1+z)}{2}$) $L(y',z') {=} -[y'log(z') {+} (1-y')log(1-z')]$ \\ +Hinge Loss: $L(y,z) = max(0, 1-yz)$ \\ +Perceptron Loss: $L(y,z) = max(0, -yz)$ \\ +Logistic loss: $L(y,z) = log(1 + exp(-yz))$ \\ +Square loss: $L(y,z) = \tfrac{1}{2}(1-yz)^2$ \\ +Exponential loss: $L(y,z) = exp(-yz)$ \\ +Binomial deviance: $L(y,z) = 1 + exp(-2yz)$ \\ +0/1 Loss: $L(y,z) = \mathbb{I}\{sign(z)\neq y\}$ \\ + +\subsection*{Perceptron} +Gradient descent: $a(k+1) = a(k) - \eta(k)\nabla J(a(k))$ \\ +$J(a)\approx J(a(k))+\nabla J^T(a-a(k)) + \tfrac{1}{2}(a-a(k))^T H (a-a(k))$, $H{=}\frac{\partial^2 J}{\partial a_i \partial a_j}$ \\ +$2^{nd}$ order algorithm: $\eta_{opt} = \frac{\norm{\nabla J}^2}{\nabla J^T H \nabla J}$ \\ +Newton's rule: $a(k+1){=}a(k){-}H^{{-}1}\nabla J$\\ +Perceptron criteria: $J_p(a)=\sum_{\widetilde{x}\in\widetilde{\mathcal{X}}^{mc}} (-a^T \widetilde{x})$ \\ +Perceptron rule: $a(k+1)=a(k)+\eta(k)\sum_{\widetilde{x}\in\widetilde{\mathcal{X}}^{mc}} \widetilde{x}$ \\ +Perceptron convergence:$\left \| a(k+1)- \alpha \hat a \right \|^{2} = \left \| a(k)- \alpha \hat a \right \|^{2} + 2(a(k)- \alpha \hat a)^{T} \tilde x^{k} + \left \| \tilde x^{k} \right \|^{2} \leq \left \| a(k)- \alpha \hat a \right \|^{2} -2\alpha \gamma + \beta ^{2}$ +where $\beta^{2} = max_{i} \left \| \tilde x_{i \in \tilde X^{mc} } \right \| ^{2}$ and $\gamma = min_{i \in \tilde X^{mc} } (\hat a^{T} \tilde x_{i}) > 0 $ for $\alpha= \beta^{2} / \gamma$ then $k_{0}= \alpha^{2}\left \|\hat a \right \|^{2} / \beta^{2}= \beta^{2}\left \|\hat a \right \|^{2} / \gamma^{2}$ +%\subsection*{Bayesian Decision Theory} +%Est. cond. dist: $P(y|x,w) = Ber(\sigma(w^Tx))$\\ +%Action set: $\mathcal{A} = \{ +1, -1\}$\\ +%Cost fn: $C(y,a) = \{ +%\begin{array}{lr} +% c_{FP} \text{ , if $y=-1$ and $a=+1$}\\ +% c_{FN} \text{ , if $y=+1$ and $a=-1$}\\ +% 0 \text{ , otherwise} +%\end{array} +%$ +%The action that minimizes the expected cost is:\\ +%$C_+ = \mathbb{E}_y[C(y,+1)|x] = P(y=+1|x) \cdot 0 + (P(y=-1)|x) \cdot %c_{FP}$\\ +%$C_- = \mathbb{E}_y[C(y,-1)|x] = P(y=+1|x) \cdot c_{FN} + P(y=-1|x) \cdot %0$\\ +%Predict +1 if $C_+ \leq C_-$ diff --git a/src/aml-cheatsheet/5DesignLinearDiscriminant.tex b/src/aml-cheatsheet/5DesignLinearDiscriminant.tex new file mode 100644 index 0000000..cc6920a --- /dev/null +++ b/src/aml-cheatsheet/5DesignLinearDiscriminant.tex @@ -0,0 +1,16 @@ +% -*- root: Main.tex -*- +\section{Design of Discriminant} + + +\subsection*{Fisher's Linear Discriminant:} +$\mathbb{R}^d \rightarrow \mathbb{R}^{(k-1)}$: +$\vec{y}_i = \vec{w}_i^T\vec{x}, 1 \leq i \leq k - 1, \vec{y} = W^T\vec{x}$ + +{\footnotesize Criterion:} $J(W) {=} \frac{|W^T\Sigma_B W|}{|W^T\Sigma_w W|} {\stackrel{\text{\tiny 2 classes}}{=}} \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} +\rightarrow \stackrel{\text{\tiny maximize}}{\text{\tiny $d/dW = 0$}}$ \\ +$\Sigma_B = \sum_i n_i (m_i-m)(m_i-m)^T$ {\tiny(Between class variance)} \\ +$\Sigma_W = \sum_i \sum_{x \in X_i} (x - m_i)(x - m_i)^T$ +{\tiny(Within class variance)} \\ +$m_i = \frac{1}{n_i} \sum_{x \in X_i} x$, $m = \frac{1}{n}\sum_x x$ + +solution: $\hat{w} \stackrel{\text{\tiny 2 classes}} {=} \Sigma_W^{-1} (m_1 - m_2)$ \ No newline at end of file diff --git a/src/aml-cheatsheet/6SupportVectorMachine.tex b/src/aml-cheatsheet/6SupportVectorMachine.tex new file mode 100644 index 0000000..4419bf2 --- /dev/null +++ b/src/aml-cheatsheet/6SupportVectorMachine.tex @@ -0,0 +1,61 @@ +% -*- root: Main.tex -*- + +\section{SVM} +Primal Problem: 
{\scriptsize ($C \rightarrow \infty$: Hard Margin)}\\ + {\footnotesize $\min_w \frac{1}{2} w^Tw + C \sum_{i=1}^n \xi_i \hspace{3mm} + s.t. \hspace{2mm} z_i(w^T \phi(y_i) +w_0) \geq 1-\xi_i, \, \xi_i \geq 0$ }\\ +Dual Problem:\hspace{1mm}: + $L(w,w_0,\xi,\alpha,\beta)=\frac{1}{2}w^Tw + C\sum_{i=1}^n\xi_i - \sum_{i=1}^{n}\beta_i\xi_i \\ + \tab\tab\tab-\sum_{i=1}^{n} \alpha_i(z_i(w^T\phi(y_i) + w_0) -1+\xi_i)\\ + \max_\alpha L(a) = \sum_{i=1}^n\alpha_i - \frac{1}{2} \sum_{i,j=1}^n z_i z_j + \alpha_i \alpha_j \phi(y_i, y_j)\\ + s.t. \, \sum_{j=1}^n z_j \alpha_j = 0 \, \wedge C \geq \alpha_i \geq 0 $\\ +optimal hyperplane:\mbox{} $w^* = \sum_{i=1}^n \alpha_i^* z_i \phi(y_i)$ \\ + $w_0^* = \frac{1}{n_s} \sum_{i \in S}(z_n - \sum_{j \in S} \alpha_j z_j \phi(y_i,y_j))\\ + \stackrel{\text{\tiny linear}}{=} -\frac{1}{2}(min_{i:z_i=1} w^{*T}y_i + + max_{i:z_i=-1} w^{*T}y_i)$\\ +Only for support vectors:\mbox{} $\alpha_i^* > 0$\\ +Prediction:\mbox{} + $z(y)=sign(\sum_{i=1}^n \alpha_i z_i \phi(y,y_i) + w_0) \\ + \stackrel{\text{\tiny linear}}{=} sign(w^{*T}x+w_0)$\\ +Homog. Coordinates:\mbox{} condition $\sum_{j=1}^n z_j \alpha_j = 0$ falls away. +% \subsection*{Kernelized SVM} +% $ +% \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j), \text{ s.t. } 0 \geq \alpha_i \geq C +% $\\ +% Classify: $y = sign(\sum_{i=1}^{n} \alpha_i y_i k(x_i, x))$ + +% \subsection*{How to find $a^T$?} +% $a = \{w_0,w\}$ used along $\widetilde{x} = \{1,x\}$ + +% Gradient Descent: $a(k+1) = a(k) - \eta(k) \nabla J(a(k))$ + +% Newton method: 2nd order Taylor to get $\eta_{opt} = H^{-1}$ with $H=\frac{\partial^2 J}{\partial a_i \partial a_j}$ + +% $J$ is the cost matrix, popular choice is + + +% \subsection*{Perceptron Algorithm} +% Stochastic Gradient + Perceptron loss\\ + +% \emph{Theorem:} If $D$ is linearly seperable $\Rightarrow$ Perceptron will obtain a linear seperator. + +% \subsection*{Support Vector Machine} +% Try to maximize a 'band' around the seperator.\\ + +% \subsection*{Matrix-Vector Gradient} +% %multiply transposed matrix to the same side as its occurance w.r.t. derivate variable: $\beta \in \mathbb{R}^d$ +% $\nabla_\beta ( ||y-X\beta||_2^2 + \lambda ||\beta||_2^2 ) = 2X^T (y-X\beta) + 2\lambda \beta$\\ + +% \subsection*{Hinge loss} +% loss for support vector machine.\\ +% $l_{SVM}(w,x_i,y_i) = \max \{0,1-y_iw^Tx_i\} + \lambda ||w||_2^2$\\ +% derivation:\\ +% $\frac{\partial}{\partial w_k} l_{SVM}(w,y_i,x_i) = \left \{ +% \begin{array}{lr} +% 0 \text{ , if } 1-y_iw^Tx_i < 0 \\ +% -y_ix_{i,k} + 2\lambda w_k \text{ , otherwise} +% \end{array} \right. $ + +% \subsection*{Sparse L1-SVM} +% $\underset{w}{\operatorname{argmin}} \sum \limits_{i=1}^n \max (0, 1-y_i w^T x_i) + \lambda ||w||_1$ diff --git a/src/aml-cheatsheet/7NonLinearSVM.tex b/src/aml-cheatsheet/7NonLinearSVM.tex new file mode 100644 index 0000000..bc21512 --- /dev/null +++ b/src/aml-cheatsheet/7NonLinearSVM.tex @@ -0,0 +1,10 @@ +% -*- root: Main.tex -*- +\section{Non-linear SVM} +% \subsection*{Kernel SVM} +% \textcolor{red}{TODO: add how to kernelize} +\subsection*{Multiclass SVM} +$\min_{w, \eta\geq 0} \frac{1}{2} w^T w + C \sum_i \xi_i$\\ +s.t. 
$\forall y_i \in Y: (w_{z_i}^T y_i) - \max_{z \not = z_i} (w_z^T y_i) \geq 1-\xi_i$ +\subsection*{Structured SVM} +$\min_{w, \eta} \frac{\lambda}{2} ||w||^2 + \frac{1}{n} \sum_{i=1}^n \eta_i$, $\eta_i \geq H_i(w) \ \forall i$, where\\ +$H_i(w)=\max_{y \in Y(x_i)} L(y_i, y) - w^T(\phi(x_i,y_i) - \phi(x_i,y))$ \ No newline at end of file diff --git a/src/aml-cheatsheet/8Ensemble.tex b/src/aml-cheatsheet/8Ensemble.tex new file mode 100644 index 0000000..bddacb0 --- /dev/null +++ b/src/aml-cheatsheet/8Ensemble.tex @@ -0,0 +1,36 @@ +% -*- root: Main.tex -*- +\section{Ensemble method} +\subsection*{Random Forest} +for b=1:B do:\\ +draw a bootstrap sample $D_b$\\ +repeat until node size $< n_{min}$:\\ +1. select $m$ features from $p$ features\\ +2. pick the best variable and split-point\\ +3. Split the node accordingly\\ +return the forest $\{\hat{c}_b(x)\}_{b=1}^B$ + +Boosting: Train weak learners sequentially on all data, but reweight misclassified samples higher, Bias $\downarrow$ +\subsection*{Adaboost} +Initialize weights $w_i = 1/n$, for b=1:B do:\\ +1. Fit classifier $c_b(x)$ with weights $w_i$\\ +2. Compute error $\epsilon_b = \sum_i w_i^{(b)} \mathbbm{1}_{[c_b(x_i) \not = y_i]} / \sum_i w_i^{(b)}$\\ +3. Compute coeff. $\alpha_b = \log(\frac{1-\epsilon_b}{\epsilon_b})$\\ +4. Update weights $w_i = w_i \exp(\alpha_b \mathbbm{1}_{[y_i \not = c_b(x_i)]})$ +Return $\hat{c}_B(x) = \text{sign} \left ( \sum_{b=1}^B \alpha_b c_b(x) \right )$\\ +Loss: Exponential loss function\\ +Model: Additive logistic regression\\ +Bayesian approach (assumes posteriors)\\ +Newton-like updates (Gradient Descent) + +%\newpage + +\subsection*{Bagging} +%\textbf{for} $b=1$ to $B$ \textbf{do}:\\ +%1. $Z^{*b}=$ b-th bootstrap sample from Z\\ +%2. Construct classifier $c_b$ based on $Z^{*b}$\\ +\textbf{return} ensemble class. $\hat{c}_B(x)=sgn(\sum_{i=1}^{B} c_i(x))$\\ +\textbf{Works}: Covariance small (different subset for training), Variance small (similar behaviour of weak learners), biases weakly affected.\\ +%\textbf{Bag. aggr. pred.}: $h_B(x)=E_{D'\sim D}[h_{D'}(x)]$\\ +%\textbf{Ideal aggr. pred.}: $h_A(x)=E_{D\sim P(x,y)}[h_D(x)]$\\ +%$E_D[L(y,h_D(x))]=E_D[(y-h_D(x))^2]=E_D[y^2]-2E_D[y\cdot h_D(x)]+E_D[h_D(x)^2]=y^2-2y\cdot E_D[h_D(x)]+E_D[h_D(x)^2]\geq y^2-2y\cdot E_D[h_D(x)]+E_D[h_D(x)]^2=y^2-2y\cdot h_A(x)+h_A(x)^2=(y-h_A(x))^2=L(y,h_A(x))$\\ +\textbf{Bias$\downarrow$\&Var.$\downarrow$}: Use complex decision tree (bias$\downarrow$), ensemble mult. decision trees (var$\downarrow$) diff --git a/src/aml-cheatsheet/8Unsupervised.tex b/src/aml-cheatsheet/8Unsupervised.tex new file mode 100644 index 0000000..65f4e78 --- /dev/null +++ b/src/aml-cheatsheet/8Unsupervised.tex @@ -0,0 +1,19 @@ +\section{Unsupervised Learning} +\subsection*{Histogram} +$H = (H_{1},..,H_{k})$ with $H_{j} = \#\{x \in S|x \in I_{j}\}$, where $I_{1},..,I_{k}$ are $k$ pairwise disjoint subintervals.
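A runnable sketch of the AdaBoost loop from 8Ensemble.tex above, with depth-1 trees as weak learners; it assumes scikit-learn is available and labels in $\{-1,+1\}$, and all names and defaults are illustrative rather than part of the original material.

```python
# Sketch of the AdaBoost loop above (weights w_i, weighted error eps_b, coefficient alpha_b).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                         # w_i = 1/n
    learners, alphas = [], []
    for _ in range(B):
        c = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (c.predict(X) != y).astype(float)    # 1[c_b(x_i) != y_i]
        eps = w @ miss / w.sum()                    # weighted error
        alpha = np.log((1 - eps + 1e-12) / (eps + 1e-12))
        w = w * np.exp(alpha * miss)                # upweight mistakes
        learners.append(c); alphas.append(alpha)
    # Ensemble prediction: sign of the alpha-weighted vote.
    return lambda Xn: np.sign(sum(a * c.predict(Xn) for a, c in zip(alphas, learners)))
```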
+Histogram as density estimation: $\widetilde{H} = \frac{1}{n}( H_{1},..H_{k})$ +\subsection*{Parzen} +$ +\hat{p}_n = \frac{1}{n} \sum\limits_{i=1}^n \frac{1}{V_n} \phi(\frac{x-x_i}{h_n}) +$ +where $\int \phi(x)dx = 1$ +Problems: 1) $V_0$ too small - noisy, $V_0$ too big: oversmoothed 2) Different behavior of the data distribution may +require different strategies in different parts of +the feature space.\\ +$\int \frac{1}{N} \sum_{i=1}^N\phi(\frac{|x-x_i|}{h}) dx_i = \frac{1}{N}\frac{1}{V} \sum_{i=1}^N \int \phi(\frac{|x-x_i|}{h}) dx_i = \frac{1}{VN} \cdot VN = 1$ + +\subsection*{K-NN} +$\hat{p}_n = \frac{1}{V_k} \text{ volume with } k \text{ neighbours}$\\ +error rate of 1-NN classifier is bounded by twice the Bayes error rate +\subsection*{K-means} +$L(\mu) = \sum_{i=1}^{n} \min_{j\in\{1...k\}} \|x_i - \mu_y \|_2^2$ \ No newline at end of file diff --git a/src/aml-cheatsheet/9MixtureModel.tex b/src/aml-cheatsheet/9MixtureModel.tex new file mode 100644 index 0000000..b7734ee --- /dev/null +++ b/src/aml-cheatsheet/9MixtureModel.tex @@ -0,0 +1,58 @@ +% -*- root: Main.tex -*- +% \subsection*{EM for GMM} +% Compute cluster membership weight for each point $x_i$ in cluster k, given $\theta_k=(\mu_k,\Sigma_k)$. $\mathbb{E}[z_k|x_i]= P(z_k=1|x_i; \theta)$ \textbf{E}: $\gamma_k(\mathbf{x}_i) = \frac{P(z_k=1;\theta_k) P(x_i|z_k=1;\theta_k)}{P(x_i;\theta_k)} = +% \frac{\boldsymbol{\pi}_k \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \boldsymbol{\pi}_j \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$ +% % \item[M:] $\sum_{i=1}^{n} \log P(x_i,z_i;\theta)=\\ +% % \sum_{i=1}^{n} \log[\sum_{k=1}^{K} \pi_k P(x_i|z_i;\theta_k)] = +% % \sum_{i=1}^{n} \log[\sum_{k=1}^{K} \gamma_k(x_i)\frac{\pi_k P(x_i|z_i;\theta_k)}{\gamma_k(x_i)}] \geq +% % \sum\limits_{i=1}^{n} \sum\limits_{k=1}^{K}\gamma_k(x_i)[\log P(x_i|z_i;\theta_k) + \log \pi_k - \log \gamma_k(x_i)]$\\ +% % $\frac{\partial}{\partial \pi_k} \sum\limits_{i=1}^{n} \sum\limits_{k=1}^{K}\gamma_k(x_i)[\log P(x_i|z_i;\theta_k) + \log \pi_k - \log \gamma_k(x_i)] + \lambda (\sum\limits_{j=1}^{K} \pi_j -1) \stackrel{\text{!}}{=} 0 \Leftrightarrow \pi_k = \sum\limits_{i=1}^{N} \frac{ \gamma_k(x_i)}{-\lambda}$;$ \sum\limits_{k=1}^{K} \pi_k = 1 =\sum_{k,n=1}^{K,N} \gamma_k(\mathbf{x}_i)\frac{1}{-\lambda} \Leftrightarrow \lambda = N$ +% % $\boldsymbol{\mu}_k := \frac{\sum_{n=1}^N \gamma_k(x_i) \mathbf{x}_n}{\sum_{n=1}^N \gamma_k(x_i)}$, and $\Sigma_k = \frac{\sum_{n=1}^N \gamma_k(x_i) (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_k - \boldsymbol{\mu}_k)^T}{\sum_{n=1}^N \gamma_k(x_i)}$ +% \textbf{M}: $(\mu^*,\Sigma^*) = \argmax_\theta \mathbb{E}_{\gamma}(\log[p(x|\theta)]) = \argmax_\theta \sum_{i=1}^{n}{\gamma_i(\log[p(x_i|\theta)])}$. $\frac{\partial}{\partial\mu},\frac{\partial}{\partial\Sigma}=0\rightarrow \boldsymbol{\mu}_k:=\frac{\sum_{n=1}^N \gamma_k(x_i) \mathbf{x}_n}{\sum_{n=1}^N \gamma_k(x_i)}$, $\Sigma_k = \frac{\sum_{n=1}^N \gamma_k(x_i) (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_k - \boldsymbol{\mu}_k)^T}{\sum_{n=1}^N \gamma_k(x_i)}$ + + +% \begin{compactdesc} +% \item[Assignment variable:] $\mathbf{z}_k \in \{0, 1\}$, $\sum_{k=1}^K \mathbf{z}_k = 1$, $\operatorname{Pr}(\mathbf{z}_k = 1) = \boldsymbol{\pi}_k \Leftrightarrow p(\mathbf{z}) = \prod_{k=1}^K \boldsymbol{\pi}_k^{\mathbf{z}_k}$, $\pi_k=$mixing prop. 
of cluster k +% \item[Complete data distribution:] $p(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^K \left( \boldsymbol{\pi}_k \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)^{\mathbf{z}_k}$ +% \item[Likelihood of observed data (iid) \\ $\mathbf{X=[x_1,..,x_N]}$:] $p(\mathbf{X} | \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^N p(\mathbf{x}_n) = \prod_{n=1}^N \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ +% \item[Log-likelihood:] $\ln p(\mathbf{X} | \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) =\break \sum_{i=1}^N \ln \left( \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$ +% \end{compactdesc} + +\subsection*{Gaussian Mixtures} + Estimate $\hat{\theta} = \{\mu_1,...,\mu_k, \Sigma_1,...,\Sigma_k\}$ that maximize the likelihood of sample feature vectors $\mathcal{X} = \{x_1,..., x_n \}$: \\ + $p(\mathcal{X} | \pi_i, ..., \pi_k, \theta_1, ..., \theta_k) = \prod_{x \in \mathcal{X}} \sum_{c\leq k} \pi_c p(x|\theta_c)$\\ + Log-Likelihood: $L(\mathcal{X} | \pi, \theta) = \sum_{x\in \mathcal{X}} + \log \sum_{c \leq k} \pi_c p(x|\theta_c)$ + +\subsection*{Expectation Maximization} +$L(\X,M|\theta) = \sum_{x \in \X} \sum_{c=1}^k M_{xc} \log(\pi_c P(x|\theta_c)$ \\ +$Q(\theta; \theta^{(j)}) = \E_{M}[L(\X,M|\theta)| \X, \theta^{(j)}]$, M latent variable + + +$M_{xc} = 1$ if cluster $c$ has generated $x$, else $M_{xc} = 0$ +$\\ \E_M[M_{xc}|\X, \theta^{(j)}] = P(M_{xc}=1) = P(c | x, \theta^{(j)}) = +\frac{P(x|c,\theta^{(j)}) P(c|\theta^{(j)})}{P(x|\theta^{(j)})} += \frac{\pi_c P(x|c,\theta^{(j)})}{\sum_{c=1}^K \pi_c P(x|c,\theta^{(j)})} +=: \gamma_{xc}$ +\begin{algorithmic}[1] + \While{not converged} + \State E-Step: + \tab Compute $\gamma_{xc}$ for all $x, c$ + \tab Compute $m_c := \sum_x \gamma_{xc}$ for all $c$ + \State M-Step: max $Q(\theta; \theta^{(j)}) \hspace{5mm} s.t. 
\sum_c \pi_c = 1$ + \vspace{-3mm} + \addtolength{\jot}{-3mm} + \begin{align*} + \mu_c^{(j+1)} &= \tfrac{\sum_{x \in \mathcal{X}} \gamma_{xc} x}{m_c} \hskip 15pt + \pi_c^{(j+1)} = \tfrac{1}{|\mathcal{X}|} m_c \\ + \Sigma_c^{(j+1)} &= \tfrac{\sum_{x \in \mathcal{X}} \gamma_{xc}(x-\mu_c)(x-\mu_c)^T} + {m_c} + \end{align*} + \vspace{-6mm} + \EndWhile +\end{algorithmic} +\vspace{-6mm} +\paragraph{Lagrangian with fixed $\gamma_{xc}$} \mbox{}\\ +$L = \sum_x \sum_c \gamma_{xc} \log(\pi_c P(x|c,\theta_c)) - +\lambda(\sum_c \pi_c - 1)$\\ +For GMM: $P(x|c,\theta^{(j)}) = \mathcal{N}(x | \mu_c, \Sigma_c)$ \ No newline at end of file diff --git a/src/aml-cheatsheet/Main.tex b/src/aml-cheatsheet/Main.tex new file mode 100644 index 0000000..701f27d --- /dev/null +++ b/src/aml-cheatsheet/Main.tex @@ -0,0 +1,325 @@ +\documentclass[11pt,landscape,a4paper,fleqn]{article} +\usepackage[utf8]{inputenc} +\usepackage[ngerman]{babel} +\usepackage{tikz} +\usepackage{bbm} +\usetikzlibrary{shapes,positioning,arrows,fit,calc,graphs,graphs.standard} +\usepackage[nosf]{kpfonts} +\usepackage[t1]{sourcesanspro} +%\usepackage[lf]{MyriadPro} +%\usepackage[lf,minionint]{MinionPro} +\usepackage{multicol} +\usepackage{wrapfig} +\usepackage[top=5mm,bottom=5mm,left=5mm,right=5mm]{geometry} +\usepackage[framemethod=tikz]{mdframed} +\usepackage{microtype} +\usepackage{paralist} % for compacter lists +\usepackage{bm} +\usepackage{algorithm} +\usepackage{algpseudocode} + +\makeatletter +\def\BState{\State\hskip-\ALG@thistlm} +\makeatother + + +\let\bar\overline + +\definecolor{myblue}{cmyk}{1,.72,0,.38} +\definecolor{myorange}{cmyk}{0.9,0,1,0.2} +\definecolor{myred}{cmyk}{0.7,0,0.7,0.6} + +\pgfdeclarelayer{background} +\pgfsetlayers{background,main} + +\everymath\expandafter{\the\everymath \color{myblue}} +%\everydisplay\expandafter{\the\everydisplay \color{myblue}} + +\renewcommand{\baselinestretch}{.8} +\pagestyle{empty} + +\global\mdfdefinestyle{header}{% +linecolor=gray,linewidth=1pt,% +leftmargin=0mm,rightmargin=0mm,skipbelow=0mm,skipabove=0mm, +} + +\makeatletter +\renewcommand{\section}{\@startsection{section}{1}{0mm}% + {.2ex}% + {.2ex}%x + {\color{myred}\sffamily\small\bfseries}} +\renewcommand{\subsection}{\@startsection{subsection}{1}{0mm}% + {.2ex}% + {.2ex}%x + {\color{myorange}\sffamily\bfseries}} +\renewcommand{\subsubsection}{\@startsection{subsubsection}{1}{0mm}% + {.2ex}% + {.2ex}%x + {\sffamily\bfseries}} + + +% math helpers +\DeclareMathOperator*{\argmin}{arg\,min} +\DeclareMathOperator*{\argmax}{arg\,max} +\newcommand{\E}{\mathbb{E}} + +\makeatother +\setlength{\parindent}{0pt} + +\newcommand{\imp}[1]{\boxed{\boldsymbol{#1}}} % Einrahmung und Fett +\newcommand{\w}{\omega} +\newcommand{\ud}{\,\mathrm{d}}% Differential +\newcommand{\norm}[1]{\left\lVert#1\right\rVert} +\newcommand{\X}{\mathcal{X}} + +% compress equations +%\medmuskip=0mu +%\thinmuskip=0mu +%\thickmuskip=0mu + +\begin{document} +\small +\begin{multicols*}{4} + \input{0Basics} + \input{1Regression.tex} + \input{2GaussianProcess.tex} +% \input{2Bayes.tex} + \input{3NumericalEstimatesMethods.tex} + \input{4Classification.tex} + \input{5DesignLinearDiscriminant.tex} + \input{6SupportVectorMachine.tex} + \input{7NonLinearSVM.tex} + \input{8Ensemble.tex} +% \input{8Unsupervised.tex} + \input{9MixtureModel.tex} +% \input{10TimeSeries.tex} + \input{10NeuralNet.tex} + +% -*- root: Main.tex -*- + +% -- PAC LEARNING +\section{PAC Learning} +Empirical error: $\hat{\mathcal{R}}_n(c) = \tfrac{1}{n}\sum_{i=1}^n \mathbb{I}_{\{c(x_i)\neq y\}}$ \\ +Expected error: 
$\mathcal{R}(c) = P\{c(x)\neq y\}$ \\ +ERM: $\hat{c}_n^* = \argmin_{c\in\mathcal{C}} \hat{\mathcal{R}}_n(c)$ \\ +opt: $c^* \in \min_{c\in\mathcal{C}} \mathcal{R}(c)$, $|\mathcal{C}|$ finite \\ +Generalization error: $\mathcal{R}(\hat{c}_n^*) = P\{ \hat{c}_n^*(x)\neq y \}$ \\ +VC ineq.: $\mathcal{R}(\hat{c}_n^*) - \inf\limits_{c\in\mathcal{C}}\mathcal{R}(c) \leq 2\sup\limits_{c\in\mathcal{C}}|\hat{\mathcal{R}}_n(c) - \mathcal{R}(c)|$ \\ +$P\{ \mathcal{R}(\hat{c}_n^*) - \mathcal{R}(c^*) > \epsilon \} \leq P\{ \sup\limits_{c\in\mathcal{C}}|\hat{\mathcal{R}}_n(c) - \mathcal{R}(c)| > \frac{\epsilon}{2} \} \\ +\leq 2|\mathcal{C}| exp(-2n\epsilon ^2 /4) \leq 8s(\mathcal{A},n)exp(-n\epsilon ^{2} /32)$ and $s(\mathcal{A},n) \leq n^{\mathcal{V_{\mathcal{A}}}}$ \\ +Markov ineq: $P\{X\geq\epsilon\} \leq \tfrac{\mathbb{E}[X]}{\epsilon}$ (for nonneg. X) \\ +Boole's inequality: $P(\bigcup_i A_i) \leq \sum_i P(A_i)$ \\ +Hoeffding's lemma: $\mathbb{E}[e^{sX}] \leq exp(\tfrac{1}{8}s^2(b-a)^2)$ where $\mathbb{E}[X]=0$, $P(X\in[a,b])=1$ \\ +Hoeffding's: $P\{S_n {-} \mathbb{E}[S_n] {\geq} t\} {\leq} exp({-} \frac{2t^2}{\sum_i (b_i - a_i)^2})$ \\ +Normalized: $P\{\widetilde{S}_n {-} \mathbb{E}[\widetilde{S}_n] {\geq} \epsilon\} {\leq} exp({-} \frac{2n^2 \epsilon ^2}{\sum_i (b_i {-} a_i)^2})$ \\ +{\small Error bound: $P\{ \sup\limits_{c\in\mathcal{C}}|\hat{\mathcal{R}}_n(c) - \mathcal{R}(c)| > \epsilon \} \leq 2|\mathcal{C}| exp(-2n\epsilon ^2)$} \\ +The $\mathcal{VC}$ dimension of a model $f$ is the maximum number of points that can be arranged so that $f$ shatters them. + +% -- DIRICHLET PROCESS +\section{Nonparametric Bayesian methods} +$Dir(x|\alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^n x_k^{a_k - 1}$, $B(\alpha) = \frac{\prod_{k=1}^n \Gamma(\alpha_k)}{\Gamma(\sum_{k=1}^n \alpha_k)}$ \\ +$\mathbb{E}[1] = \sum_{i=1}^N \frac{\alpha}{\alpha + i} \sim(\alpha log(N))$ \\ +de Finetti: $p(X_1, ..., X_n) {=} \int (\prod_{i=1}^n p(x_i|G))dP(G)$ \\ +\[ p(z_i=k|\bm{z}_{-i},\bm{x},\alpha,\bm{\mu}) = \begin{cases} + \frac{N_{k,-i}}{\alpha + N - 1} p(x_i|\bm{x}_{-i,k},\bm{\mu}) \;\exists k \\ + \frac{\alpha}{\alpha + N - 1} p(x_i|\bm{\mu}) \;\text{otherwise} + \end{cases} +\] +DP generative model: \\ +\begin{inparaitem}[\color{red}\textbullet] +\item Centers of the clusters: $\mu_k \sim \mathcal{N}(\mu_0, \sigma_0)$ \\ +\item Prob.s of clusters: $\rho = (\rho_1, \rho_2) \sim GEM(\alpha)$ \\ +\item Assignments to clusters: $z_i \sim Categorical(\rho)$ \\ +\item Coordinates of data points: $\mathcal{N}(\mu_{z_i}, \sigma)$ +\end{inparaitem} + +%\newpage + +%\subsection{Misc} +%\textbf{Lagrangian:} $f(x,y) s.t. g(x,y) = c$\\ +%$ +%\mathcal{L}(x, y, \gamma) = f(x,y) - \gamma ( g(x,y)-c) +%$\\ +%\textbf{Parametric learning}: model is parametrized with a finite set of parameters, like linear regression, linear SVM, etc. \\ +%\textbf{Nonparametric learning}: models grow in complexity with quantity of data: kernel SVM, k-NN, etc.\\ +%\textbf{Empirical variance}: Look for dense and sparse regions. Regularize so that sparse regions are not contained (decr. variance). Measure by Variance CV of some classifiers. + +% -*- root: Main.tex -*- +%\section{Ensemble Methods} +%Use combination of simple hypotheses (weak learners) to create one strong learner. +% +%strong learners: minimum error is below some $\delta < 0.5$ +% +%weak learner: maximum error is below $0.5$ +%\begin{equation} +%f(x) = \sum_{i=1}^{n} \beta_i h_i(x) +%\end{equation} +%\textbf{Boosting}: train on all data, but reweigh misclassified samples higher. 
+% +%\subsubsection*{Decision Trees} +%\textbf{Stumps}: partition linearly along 1 axis\\ +%$h(x) = sign(a x_i - t)$\\ +%\textbf{Decision Tree}: recursive tree of stumps, leaves have labels. To train, either label if leaf's data is pure enough, or split data based on score. +% +% +%\subsubsection*{Ada Boost} +%Effectively minimize exponential loss.\\ +%$f^*(x) = \argmin_{f\in F} \sum_{i=1}^{n} \exp(-y_i f(x_i))$\\ +%Train $m$ weak learners, greedily selecting each one +%\begin{equation*} +%(\beta_i, h_i) = \argmin_{\beta,h} \sum_{i=1}^{n} \exp(-y_i (f_{i-1} (x_j) + \beta h(x_j))) +%\end{equation*} +%\begin{compactdesc} +% \item $c_b(x) \text { trained with } w_i$ \\ +% \item $\epsilon_b = \sum\limits_i^n \frac{w_i^b}{\sum\limits_i^n w_i^b} I_{c(x_i) \neq y_i} $\\ +% \item $\alpha_b = log \frac{1-\epsilon_b}{\epsilon_b} $\\ +% \item $w^{b+1}_i = w^b_i \cdot exp(\alpha_b I_{y_i \neq c_b(x_i)})$ +%\end{compactdesc} +% +%Exponential loss function +% +%Additive logistic regression +% +%Bayesian approached (assumes posteriors) +% +%Newtonlike updates (Gradient Descent) +% +%If previous classifier bad, next has heigh weight + +\section{Generative Methods} +%\textbf{Discriminative} - estimate $P(y|x)$ - conditional. \\ +%\textbf{Generative} - estimate $P(y, x)$ - joint, model data generation. + +\subsubsection*{Naive Bayes} +All features independent.\\ +$ +P(y|x) = \frac{1}{Z} P(y) P(x|y), Z = \sum_{y} P(y) P(x|y) \\ +y = \argmax_{y'} P(y'|x) = \argmax_{y'} \hat{P}(y') \prod_{i=1}^{d} \hat{P}(x_i|y') +$ \\ +\textbf{Discriminant Function}\\ +$ +f(x) = \log(\frac{P(y=1|x)}{P(y==1|x)}), y=sign(f(x)) +$ + +%\subsubsection*{Fischer's Linear Discriminant Analysis (LDA)} +%Idea: project high dimensional data on one axis. +% +%Complexity: $\mathcal{O}(d^2n$ with $d$ number of classifiers\\ +%$c=2, p=0.5, \hat{\Sigma}_- = \hat{\Sigma}_+ = \hat{\Sigma} \\ +%y = sign(w^\top x + w_0) \\ +%w = \hat{\Sigma}^{-1}(\hat{\mu}_+ - \hat{\mu}_-) \\ +%w_0 = \frac{1}{2}(\hat{\mu}_-^\top \Sigma^{-1} \hat{\mu}_- - \hat{\mu}_+^\top \Sigma^{-1} \hat{\mu}_+) +%$ + +% -*- root: Main.tex -*- +%\section{Unsupervised Learning} +%\subsection*{Parzen} +%$ +%\hat{p}_n = \frac{1}{n} \sum\limits_{i=1}^n \frac{1}{V_n} \phi(\frac{x-x_i}{h_n}) +%$ +%where $\int \phi(x)dx = 1$ +%\subsection*{K-NN} +%$ +%\hat{p}_n = \frac{1}{V_k} \text{ volume with } k \text{ neighbours} +%$ +%\subsection*{K-means} +%$ +%L(\mu) = \sum_{i=1}^{n} \min_{j\in\{1...k\}} \|x_i - \mu_y \|_2^2 +%$\\ +% +%\textbf{Lloyd's Heuristic}:\\ (1) assign each $x_i$ to closest cluster \\ +%(2) recalculate means of clusters. +% +%Iteration over (repeated till stable): +%\begin{compactdesc} +% \item[Step 1:]$ \text{argmin}_c ||x-\mu_c||^2$ \\ +% \item[Step 2:]$ \mu_\alpha = \frac{1}{n_\alpha} \sum \vec{x}$ +%\end{compactdesc} + +% -*- root: Main.tex -*- +\section{Neural Networks} +\subsection*{Learning features} +Parameterize the feature maps and optimize over the parameters:\\ +$w^* = \underset{w, \Theta}{\operatorname{argmin}} \sum_{i=1}^n l(y_i, \sum_{j=1}^m w_j \Phi(x_i, \Theta_j))$ + + +%\section{Hidden-Markov model} +%State only depends on previous state. 
+% +%Always given: sequence of symbols $\vec{s} = \{s_1,s_2, \ldots s_n\}$ +%\subsection*{Evaluation (Forward \& Backward)} +%Known: $a_{ij}, e_k(s_t)$ +% +%Wanted: $P(X = x_i | S = s_t)$ +%\begin{eqnarray} +%f_l (s_{t+1}) = e_l(s_{t+1}) \sum f_k(s_t) a_{kl} \\ +%b_l(s_t) = e_l(s_t) \sum b_k(s_{t+1}) a_{lk} \\ +%P(\vec{s}) = \sum_k f_k(s_n) a_k \cdot \text{ end} \\ +%P(x_{l,t} | \vec{s}) = \frac{f_l(s_t) b_l(s_t)}{P(\vec{s})} +%\end{eqnarray} +%Complexity in time: $\mathcal{O}(|S|^2 \cdot T)$ + +%\subsection{Learning (Baum-Welch)} +%Known: only sequence and sequence space $\Theta$ +% +%Wanted: $a_{ij}, e_k(s_t)$ \& most likely path $\vec{x} = \{x_1,x_2,\ldots x_n\}$ +% +%\textbf{E-step I:} $f_k(s_t), b_k(s_t)$ by forward \& backward algorithm +% +%\textbf{E-step II:} +%\begin{eqnarray} +%P(X_t = x_k, X_{t+1} = x_l | \vec{s}, \Theta) = \\ +%\frac{1}{P(\vec{s})} f_k(s_t) a_{kl} e_l(s_{t+1}) b_l(s_{t+1}) \\ +%A_{kl} = \sum\limits_{j=1}^m \sum\limits_{t=1}^n P(X_t = x_k, X_{t+1} = x_l | \vec{s}, \Theta) +%\end{eqnarray} +%\textbf{M-step :} +%\begin{eqnarray} +%a_{kl} = \frac{A_{kl}}{\sum\limits_i^n A_{ki}} \text{ and } e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} +%\end{eqnarray} +%Complexity: $\mathcal{O}(|S|^2)$ in storage (space) +% -*- root: Main.tex -*- + +%\subsection{Norms} +%\begin{inparadesc} +% \item[\color{red}$l_0$:] $\|\mathbf{x}\|_0 := |\{i | x_i \neq 0\}|$ +% \item[\color{red}Nuclear:] $\|\mathbf{X}\|_\star = \sum_{i=1}^{\min(m, n)} \sigma_i$ + % \item[\color{red}Euclidean:] $\|\mathbf{x}\|_2 := \sqrt{\sum_{i=1}^{N} \mathbf{x}_i^2} = \sqrt{\mathbf{x}^T \mathbf{x}} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$ + % \item[\color{red}$p$-norm:] $\|\mathbf{x}\|_p := \left( \sum_{i=1}^{N} |x_i|^p \right)^{\frac{1}{p}}$ + %\item[\color{red}Frobenius:] $\|\mathbf{A}\|_F :=\allowbreak %\sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} |\mathbf{A}_{i, j}|^2} =\allowbreak \sqrt{\operatorname{trace}(\mathbf{A}^T \mathbf{A})} =\allowbreak \sqrt{\sum_{i=1}^{\min\{m, n\}} \sigma_i^2}$ ($\sigma_i$ is the $i$-th singularvalue), $\mathbf{A} \in \mathbb{R}^{M \times N}$ +%\end{inparadesc} + + \subsubsection*{Reformulating the perceptron} + Ansatz: $w=\sum_{j=1}^n \alpha_j y_j x_j$\\ + $\min \limits_{w\in\mathbb{R}^d} \sum_{i=1}^n \max [0, -y_i w^T x_i]$\\ + $= \min \limits_{\alpha_{1:n}} \sum_{i=1}^n \max [0,-y_i ( \sum_{j=1}^n \alpha_j y_j x_j )^T x_i ]$\\ + $= \min \limits_{\alpha_{1:n}} \sum_{i=1}^n \max [0,- \sum_{j=1}^n \alpha_j y_i y_j x_i^T x_j ]$ + + \subsubsection*{Kernelized Perceptron} + 1. Initialize $\alpha_1 = ... = \alpha_n = 0$\\ + 2. 
For t do \\ + Pick data $(x_i,y_i) \in_{u.a.r} D$\\ + Predict $\hat{y} = sign(\sum_{j=1}^n \alpha_j y_j k(x_j,x_i))$\\ + If $\hat{y} \not = y_i$ set $\alpha_i = \alpha_i + \eta_t$ + +% \subsection*{Regularization} +% The error term $L$ and the regularization $C$ with regularization parameter $\lambda$: $\min \limits_w L(w) + \lambda C(w)$\\ +% L1-regularization for number of features \\ +% L2-regularization for the length of $w$ + +% \subsection*{Convex} +% $\text{g(x) is convex}$\\ +% $\Leftrightarrow x_1,x_2 \in \mathbb{R}, \lambda \in [0,1]:$\\ +% $g(\lambda x_1) + (1-\lambda x_2) \leq \lambda g(x_1) + (1-\lambda) g(x_2)$ +% $ \Leftrightarrow g''(x) > 0$ + +% \subsection*{Parametric to nonparametric linear regression} +% Ansatz: $w=\sum_i \alpha_i x$\\ +% Parametric: $w^* = \underset{w}{\operatorname{argmin}} \sum_i (*Tx_i-y_i)^2 + \lambda ||w||_2^2$\\ +% $= \underset{\alpha_{1:n}}{\operatorname{argmin}} \sum \limits_{i=1}^n (\sum \limits_{j=1}^n \alpha_j x_j^T x_i - y_i)^2 + \lambda \sum \limits_i \sum \limits_j \alpha_i \alpha_j (x_i^T x_j)$\\ +% $= \underset{\alpha_{1:n}}{\operatorname{argmin}} \sum \limits_{i=1}^n (\alpha^T K_i - y_i)^2 + \lambda \alpha^T K \alpha$\\ +% $= \underset{\alpha}{\operatorname{argmin}} ||\alpha^T K -y||_2^2 + \lambda \alpha^T K \alpha$\\ +% Closed form: $\alpha^* = (K+\lambda I)^{-1} y$\\ +% Prediction: $y^*= w^{*^T} x = \sum \limits_{i=1}^n \alpha_i ^* k(x_i,x)$ + +\end{multicols*} +\end{document} \ No newline at end of file diff --git a/src/cil-cheatsheet/DataClusteringMixture.tex b/src/cil-cheatsheet/DataClusteringMixture.tex new file mode 100644 index 0000000..14b620b --- /dev/null +++ b/src/cil-cheatsheet/DataClusteringMixture.tex @@ -0,0 +1,51 @@ +\section{Data Clustering \& Mixture Models} +\textbf{K-means} \textbf{Target:} $\min_{\mathbf{U}, \mathbf{Z}} J(\mathbf{U}, \mathbf{Z}) = \|\mathbf{X} - \mathbf{U} \mathbf{Z}\|_F^2$\\ +$= \sum_{n=1}^N \sum_{k=1}^K \mathbf{z}_{k,n} \|\mathbf{x}_n - \mathbf{u}_k\|_2^2$\\ +1. \textbf{Initiate:} choose $K$ centroids $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_K]$\\ +2. \textbf{Cluster Assign:} data points to clusters. $k^\star(\mathbf{x}_n) = \argmin_k \{ \|\mathbf{x}_n - \mathbf{u}_k\|_2 \}$ returns cluster $k^\star$, whose centroid $\mathbf{u}_{k^\star}$ is closest to data point $\mathbf{x}_n$. Set $\mathbf{z}_{k^\star,n} = 1$, and for $ l \neq k^\star~ \mathbf{z}_{l,n}=0$.\\ +3. \textbf{Update centroids}: $\mathbf{u}_k = \frac{\sum_{n=1}^N z_{k,n} \mathbf{x}_n}{\sum_{n=1}^N z_{k,n}}$.\\ +4. Repeat until $\|\mathbf{Z} - \mathbf{Z}^\text{new}\|_0 = \|\mathbf{Z} - \mathbf{Z}^\text{new}\|^2_F = 0$.\\ +Computational cost: $O(k\cdot n \cdot d)$ +Prior: $p(z) = 1/K$ +\par \textbf{K-Means++:} +1. Choose centroid $\mathbf{u}_1$ randomly from datapoints $S$ +2. For $x \in S$, calculate min. squared distance $d_m(x)$ to existing centroids $c_1,...,c_m$ +3. Add new centroid $c_{m+1}$, choosen randomly from $S$ with prob. $p(x) = d_m(x) / \sum_{z \in S} d_m(z)$ +4. 
Repeat until $K$ centroids chosen $\rightarrow$ proceed with K-means +\subsection*{Gaussian Mixture Models (GMM)} +Gaussian $p(x)=\frac{1}{\sqrt{2\pi}\sigma}\mathit{exp}(-\frac{(x-\mu)^2}{2\sigma^2})$ +Multivariate $p(x;\mu;\Sigma)=\frac{1}{|\Sigma|^{\frac{1}{2}}(2\pi)^{\frac{D}{2}}}\mathit{exp}[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)]$ \\ +For GMM let $\boldsymbol{\theta}_k = (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$; $p_{\theta_k}(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \Sigma_k)$\\ +\textbf{Mixture Models:} $p_\theta(\mathbf{x}) = \sum_{k=1}^K \pi_k p_{\theta_k}(\mathbf{x})$\\ +\textbf{Assignment variable (generative model):} $z_{ij} \in \{0, 1\}$, $\sum_{j=1}^k z_{ij} = 1$\\ +Prior: $\operatorname{Pr}(z_k = 1) = \pi_k \Leftrightarrow p(\mathbf{z}) = \prod_{k=1}^K \pi_k^{z_k}$\\ +\textbf{Complete data distribution:}\\ +$p_\theta(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^K \left( \boldsymbol{\pi}_k p_{\theta_k}(\mathbf{x})\right)^{z_k}$\\ +\textbf{Posterior Probabilities:} $\\\operatorname{Pr}(z_k = 1 | \mathbf{x}) = \frac{\operatorname{Pr}(z_k = 1) p(\mathbf{x} | z_k = 1)}{\sum_{l=1}^K \operatorname{Pr}(z_l = 1) p(\mathbf{x} | z_l = 1)} = \frac{\boldsymbol{\pi}_k p_{\theta_k}(\mathbf{x})}{\sum_{l=1}^K \boldsymbol{\pi}_l p_{\theta_l}(\mathbf{x})}$\\ +$\text{posterior } P(A|B)=\frac{\text{prior } P(A)\times\text{likelihood } P(B|A)}{\text{evidence } P(B)}$\\ +\textbf{Likelihood of observed data $\mathbf{X}$:}\\ +$p_\theta(\mathbf{X}) = \prod_{n=1}^N p_\theta(\mathbf{x}_n) = \prod_{n=1}^N \left(\sum_{k=1}^K \pi_k p_{\theta_k}(\mathbf{x}_n)\right)$ +\textbf{Max. Likelihood Estimation (MLE):}\\ +$\argmax_\theta\sum_{n=1}^N \log \left( \sum_{k=1}^K \pi_k p_{\theta_k}(\mathbf{x}_n)\right)\\ +\ge \sum_{n=1}^N \sum_{k=1}^K{q_k[\log p_{\theta_k}(\mathbf{x}_n) + \log \pi_k - \log q_k]}$\\ +with $\sum_{k=1}^K{q_k} = 1$ by Jensen's inequality. + +\textbf{Generative Model}\\ +1. sample cluster index $j \sim Categorical(\pi)$\\ +2. given $j$, sample data $x \sim \text{Normal}(\mu_j, \Sigma_j)$ + +\textbf{Expectation-Maximization (EM) for GMM}\\ +E-Step: +Pr$[z_{k,n} = 1 | \mathbf{x}_n] = q_{k, n} = \frac{\boldsymbol{\pi}_k^{(t-1)} \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)})}{\sum_{j=1}^K \boldsymbol{\pi}_j^{(t-1)} \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_j^{(t-1)}, \boldsymbol{\Sigma}_j^{(t-1)})}$\\ +M-Step: +$\boldsymbol{\mu}_k^{(t)} := \frac{\sum_{n=1}^N q_{k,n} \mathbf{x}_n}{\sum_{n=1}^N q_{k,n}}$ +$, \boldsymbol{\pi}_k^{(t)} := \frac{1}{N} \sum_{n=1}^N q_{k,n}$\\ +$\Sigma_k^{(t)} = \frac{\sum_{n=1}^N q_{k, n} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t)})^\top}{\sum_{n=1}^N q_{k,n}}$ + +\textbf{K-means vs. EM} hard vs soft assignments; spherical clusters vs full covariance matrices; fast and cheap vs slow, needing more iterations; K-means can be used as init. for EM. K-means is a special case of GMM with covariances $\Sigma_j = \sigma^2 I$; the limit $\sigma \rightarrow 0$ recovers K-means (hard assignments). + +\textbf{Model Order Selection (AIC / BIC for GMM)}\\ +Trade-off between data fit (i.e. likelihood $p(\mathbf{X} | \theta)$) and complexity (i.e. \# of free parameters $\kappa(\cdot)$).
For choosing $K$: $\operatorname{AIC}(\theta | \mathbf{X}) = -\log p_\theta(\mathbf{X}) + \kappa(\theta)$\\ +$\operatorname{BIC}(\theta | \mathbf{X}) = -\log p_\theta(\mathbf{X}) + \frac{1}{2} \kappa(\theta) \log N$\\ +\# of free params, fixed covariance matrix: $\kappa(\theta) = K \cdot D + (K - 1)$ ($K$: \# clusters, $D$: $\mathsf{dim}\text{(data)}=\mathsf{dim}(\mu_i)$, $K-1$: $\pi$ of \# free clusters), full covariance matrix: $\kappa(\theta) = K(D + \frac{D(D+1)}{2}) + (K - 1)$.\\ +Compare AIC/BIC for different $K$ -- the smaller the better. BIC penalizes complexity more. diff --git a/src/cil-cheatsheet/DeepUnsuperviseLearning.tex b/src/cil-cheatsheet/DeepUnsuperviseLearning.tex new file mode 100644 index 0000000..ee6648a --- /dev/null +++ b/src/cil-cheatsheet/DeepUnsuperviseLearning.tex @@ -0,0 +1,8 @@ +\section{Deep Unsupervised Learning} +\textbf{AR:} Image $p(\mathbf{x})=\Pi_i^{n^2}p(x_i|x_1,\cdots,x_{i-1})$ \textbf{ELBO:} \\ +$\mathbb{E}_{x\sim P_{\mathbb{X}}}[\mathbb{E}_{z\sim Q}{\log P_g(x|z)}-D_{KL}(Q(z|x)\|P(z))]$\\ +$Q$ enc. posterior distr., $P(z)$ prior distr. on latent var $z$, $P_g$ likelihood of dec. generated $\mathbf{x}$. +Jointly trained: enc. optimize regularizer term, sample $\mathbf{z}\sim Q$, feed to dec., produce $\hat{x}$ to max. reconstruction quality. Both terms diff'able, can use SGD to train end-to-end. +\textbf{Reparam. trick:} use variational distr.s s.t. $q_\phi(\mathbf{z};\mathbf{x}) = g_\phi(\zeta;\mathbf{x})$ with eg. $\zeta \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ +Example: +$\zeta \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\mathbf{z} = \mu + U\zeta$ then $z \sim \mathcal{N}(\mu, \mathbf{U^{\top}U})$ \ No newline at end of file diff --git a/src/cil-cheatsheet/DictionaryLearning.tex b/src/cil-cheatsheet/DictionaryLearning.tex new file mode 100644 index 0000000..08507bb --- /dev/null +++ b/src/cil-cheatsheet/DictionaryLearning.tex @@ -0,0 +1,8 @@ +\section{Dictionary Learning} + +Adapt dict. to signal characteristics. Obj: $(\mathbf{U}^\star, \mathbf{Z}^\star) \in \argmin_\mathbf{U,Z} \| \mathbf{X} - \mathbf{U} \cdot \mathbf{Z} \|_F^2$ not jointly convex but convex in either. +\textbf{Matrix Fact. by Iter Greedy Min.} +\begin{inparaenum}[\color{red} 1.] + \item Coding step: $\mathbf{Z}^{t+1} \in \argmin_\mathbf{Z} \| \mathbf{X} - \mathbf{U}^t \mathbf{Z} \|_F^2$ subject to $\mathbf{Z}$ being sparse ($\mathbf{z}_n^{t+1}\in \argmin_\mathbf{z}\|\mathbf{z}\|_0$ s.t.$\|\mathbf{x}_n - \mathbf{U}^t\mathbf{z}\|_2 \le \sigma \|\mathbf{x}_n\|_2$) + \item Dict update step: $\mathbf{U}^{t+1} \in \argmin_\mathbf{U} \| \mathbf{X} - \mathbf{UZ}^{t+1} \|_F^2$, subj to $\forall l\in [L]:\|\mathbf{u}_l\|_2 = 1$. 
(set $\mathbf{U} = [\mathbf{u}_1^t\cdots \mathbf{u}_l\cdots \mathbf{u}_L^t],~ \min_{u_l}\|\mathbf{X} - \mathbf{U}\mathbf{Z}^{t+1}\|_F^2 = \min_{u_l}\|\mathbf{R}_l^t - \mathbf{u}_l(\mathbf{z}_l^{t+1})^\top\|_F^2$ with $\mathbf{R}_l^t = \tilde{\mathbf{U}}\Sigma\tilde{\mathbf{V}}^\top$ by $\mathbf{u}^*_l=\tilde{\mathbf{u}}_1$) +\end{inparaenum} diff --git a/src/cil-cheatsheet/Essentials.tex b/src/cil-cheatsheet/Essentials.tex new file mode 100644 index 0000000..2741c24 --- /dev/null +++ b/src/cil-cheatsheet/Essentials.tex @@ -0,0 +1,88 @@ +% -*- root: Main.tex -*-¨ +\renewcommand{\baselinestretch}{0.1} +\section{Essentials} +\subsection*{Matrix/Vector} + \textbf{Vectors:} + Unit vector: $u^{\top}u=1$ + Orthogonal vectors: $u^{\top}v=0$ + \textbf{Range, Kernel, Nullity:} + % range + $\mathit{range}(\mathbf{A}) = \{\textbf{z} | \exists \textbf{x}: \textbf{z}=\textbf{Ax}\} = \mathit{span}(\text{columns of A})$ \\ + % rank + $\mathit{rank}(\mathbf{A}) = \mathit{dim}(\mathit{range}(\mathbf{A}))$ + % kernel + $\mathit{kernel}(A) = \{\mathbf{x}: \mathbf{Ax}=\mathbf{0}\}$ (spans nullspace) + % nullity + $\mathit{nullity}(\mathbf{A}) = \mathit{dim}(\mathit{kernel}(\mathbf{A}))$ + \textbf{Ranks:} $rank(XY) \leq rank(X) \forall X \in R^{mxn}, Y \in R^{nxk}$ eq. if $Y \in R^{nxn}, rank(Y) = n$ + \textbf{Rank-nullity Theorem:} + $\textit{dim}(\textit{kernel}(\textbf{A}))+\textit{dim}(\textit{range}(\textbf{A})) = n$\\ + \textbf{Orthogonal mat.} $\mathbf{A}^{-1} = \mathbf{A}^\top$, $\mathbf{A} \mathbf{A}^\top = \mathbf{A}^\top \mathbf{A} = \mathbf{I}$, $\operatorname{det}(\mathbf{A}) \in \{+1, -1\}$, $\operatorname{det}(\mathbf{A}^\top \mathbf{A}) = 1$, + preserves inner product, norm, distance, angle, rank, matrix orthogonality \textbf{Outer Product:} $\mathbf{u} \mathbf{v}^\top$, $(\mathbf{u} \mathbf{v}^\top)_{i, j} = \mathbf{u}_i \mathbf{v}_j$\\ + \textbf{Inner Product:} $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{y}_i$. + $\langle \mathbf{x} \pm \mathbf{y}, \mathbf{x} \pm \mathbf{y} \rangle = \langle \mathbf{x}, \mathbf{x} \rangle \pm 2 \langle \mathbf{x}, \mathbf{y} \rangle + \langle \mathbf{y}, \mathbf{y} \rangle$\\ + $(\mathbf{u}_i^T\mathbf{v}_j)\mathbf{v}_j = (\mathbf{v}_j\mathbf{v}_j^T)\mathbf{u}_i$ + \textbf{Cross product:} $\vec{a}\times\vec{b}=(a_2b_3-a_3b_2, a_3b_1-a_1b_3, a_1b_2-a_2b_1)^\top$\\ + \textbf{Trace:} $\mathit{trace}(\mathbf{XYZ})=\mathit{trace}(\mathbf{ZXY})$\\ + \textbf{Transpose:} $(\mathbf{A}^\top)^{-1} = (\mathbf{A}^{-1})^\top$, $(\mathbf{A}\mathbf{B})^\top= \mathbf{B}^\top\mathbf{A}^\top$, $(\mathbf{A}+\mathbf{B})^\top= \mathbf{A}^\top + \mathbf{B}^\top$\\ + \textbf{Cauchy-Schwarz inequality:} $|\langle\mathbf{u}, \mathbf{v}\rangle| \leq \|\mathbf{u}\|\|\mathbf{v}\|$ + \textbf{Jensen inequality:} for convex function $f$, non negative $\lambda_i$ st. 
$\sum_{i=1}^{n} \lambda_i = 1$: $f(\sum_{i=1}^{n} \lambda_ix_i) \leq \sum_{i=1}^{n}\lambda_if(x_i)$ Note: for concave, inequality sign switches + \textbf{Convexity:}$f(\theta x+(1-\theta)y) \le \theta f(x)+(1-\theta)f(y), \forall\theta \in [0,1]$ + \textbf{Least Squares equations:} $\argmin_{\beta \in \mathbb{R}^{p + 1}}\|\mathbf{y-X\beta}\|^2$, $\hat{\beta}=(X^{\top}X)^{-1}X^{\top}y$\\ + \textbf{Einstein matrix notation:}$(A \cdot B)_{ij} = \sum_{k=1}^n A_{ik} \cdot B_{kj}$ + \textbf{Kullback-Leibler:} $KL(\boldsymbol{P}||\boldsymbol{Q}) = \sum_{x \in \boldsymbol{X}} P(x)\log{\frac{P(x)}{Q(x)}}$ +\subsection*{Norms} +\begin{inparaitem} + \item $\|\mathbf{x}\|_0 = |\{i | x_i \neq 0\}|$ \\ + \item $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{N} \mathbf{x}_i^2} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$ \\ + \item $\|\mathbf{u}-\mathbf{v}\|_2 = \sqrt{(\mathbf{u}-\mathbf{v})^\top(\mathbf{u}-\mathbf{v})}$ \\ + \item $\|\mathbf{x}\|_p = \left( \sum_{i=1}^{N} |x_i|^p \right)^{\frac{1}{p}}$ ; + $\|\mathbf{x}\|_\infty = \max_{i=1, \ldots , n} |x_i|$\\ + \item + $\|\mathbf{M}\|_F =\allowbreak \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n}\mathbf{m}_{i,j}^2} = \sqrt{\sum_{i=1}^{\min\{m, n\}} \sigma_i^2} \allowbreak = \|\sigma(\mathbf{A})\|_2 = \sqrt{trace(\mathbf{M}^T\mathbf{M})} $\\ + \item $\|\mathbf{M}\|_G=\sqrt{\sum_{ij}{g_{ij}x^2_{ij}}}$ (weighted Frobenius) \\ + \item + $\|\mathbf{M}\|_1 = \sum_{i,j} | m_{i,j}|$ \\ + \item $\|\mathbf{M}\|_2 = \sigma_{\text{max}}(\mathbf{M}) = \|\sigma(\mathbf(M))\|_\infty$ (spectral)\\ + \item $\|\mathbf{M}\|_p = \max_{\mathbf{v} \neq 0} \frac{\|\mathbf{M}\mathbf{v}\|_p}{\|\mathbf{v}\|_p}$ \\ + \item $\|\mathbf{M}\|_\star = \sum_{i=1}^{\min(m, n)} \sigma_i = \|\sigma(\mathbf{A})\|_1$ (nuclear) +\end{inparaitem} + +\subsection*{Derivatives} +$\frac{\partial}{\partial \mathbf{x}}(\mathbf{b}^\top \mathbf{x}) = \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^\top \mathbf{b}) = \mathbf{b}$ \quad +$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^\top \mathbf{x}) = 2\mathbf{x}$\\ +$\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^\top \mathbf{A}\mathbf{x}) = (\mathbf{A}^\top + \mathbf{A})\mathbf{x}$ \quad +$\frac{\partial}{\partial \mathbf{x}}(\mathbf{b}^\top \mathbf{A}\mathbf{x}) = \mathbf{A}^\top \mathbf{b}$\\ +$\frac{\partial}{\partial \mathbf{X}}(\mathbf{c}^\top \mathbf{X} \mathbf{b}) = \mathbf{c}\mathbf{b}^\top$ \quad +$\frac{\partial}{\partial \mathbf{X}}(\mathbf{c}^\top \mathbf{X}^\top \mathbf{b}) = \mathbf{b}\mathbf{c}^\top$\\ +$\frac{\partial}{\partial \mathbf{x}}(\| \mathbf{x}-\mathbf{b} \|_2) = \frac{\mathbf{x}-\mathbf{b}}{\|\mathbf{x}-\mathbf{b}\|_2}$ \quad +$\frac{\partial}{\partial \mathbf{x}}(\|\mathbf{x}\|^2_2) = \frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^\top \mathbf{x}) = 2\mathbf{x}$\\ +$\frac{\partial}{\partial \mathbf{X}}(\|\mathbf{X}\|_F^2) = 2\mathbf{X}$ \quad +$\frac{\partial}{\partial \mathbf{x}}\log(x) = \frac{1}{x}$ + +\subsection*{Eigendecomposition} +$\mathbf{A} \in \mathbb{R}^{N \times N}$ then $\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{-1}$ with $\mathbf{Q} \in \mathbb{R}^{N \times N}$.\\ +if fullrank: $\mathbf{A}^{-1} = \mathbf{Q} \boldsymbol{\Lambda}^{-1} \mathbf{Q}^{-1}$ and $(\boldsymbol{\Lambda}^{-1})_{i,i} = \frac{1}{\lambda_i}$.\\ +if $\mathbf{A}$ symmetric: $A = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q^\top}$ ($\mathbf{Q}$ orthogonal). 
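Minimal numerical sanity check of the facts above (a sketch assuming NumPy is available; the symmetric matrix is an arbitrary toy example, not course material):
\begin{verbatim}
# Sketch: check A = Q diag(lambda) Q^T and the inverse rule for a symmetric A.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])             # toy symmetric matrix (assumption)
lam, Q = np.linalg.eigh(A)             # eigenvalues and orthonormal eigenvectors
assert np.allclose(A, Q @ np.diag(lam) @ Q.T)
# full rank: A^{-1} = Q diag(1/lambda_i) Q^T
assert np.allclose(np.linalg.inv(A), Q @ np.diag(1.0 / lam) @ Q.T)
\end{verbatim}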
+Eigenvalue $\lambda$: solve $det(A - \lambda I) = 0$ +Eigenvector $v$: solve $(A - \lambda I)*v = \overrightarrow{0}$ + +\subsection*{Probability / Statistics} +\begin{inparaitem} + \item $P(x) := Pr[X = x] := \sum_{y \in Y} P(x, y)$ + \item $P(x|y) := Pr[X = x | Y = y] := \frac{P(x,y)}{P(y)},\quad \text{if } P(y) > 0$ + \item $\forall y \in Y: \sum_{x \in X} P(x|y) = 1$ (property for any fixed $y$) + \item $P(x, y) = P(x|y) P(y)$ + \item $\text{posterior} P(A|B)=\frac{\text{prior} P(A)\times\text{likelihood} P(B|A)}{\text{evidence} P(B)}$ (Bayes' rule) + \item $P(x|y) = P(x) \Leftrightarrow P(y|x) = P(y)$ (iff $X$, $Y$ independent) + \item $P(x_1, \ldots, x_n) = \prod_{i=1}^n P(x_i)$ (iff IID) + \item Variance $Var[X]:= E[(X-\mu_x)^2]:=\sum_{x \in X}(x-\mu_x)^2P(x)= E(X^2) - E(X)^2$ $\operatorname{Var}(aX)=a^2\operatorname{Var}(X)$ + \item expectation $\mu_x := E[X]:=\sum_{x \in X}xP(x)$ + \item $E[X+Y] = E[X] + E[Y]$ + \item standard deviation $\sigma_x := \sqrt{Var[X]}$ +\end{inparaitem} + + +\subsection*{Lagrangian Multipliers} +Minimize $f(\mathbf{x})$ s.t. $g_i(\mathbf{x}) \leq 0,\ i = 1, .., m$ (\textbf{inequality constr.}) and $h_i(\mathbf{x}) = \mathbf{a}_i^\top \mathbf{x} - b_i = 0$ or $h_i(\mathbf{x}) = \sum_{w} x_{w,i} - b_i = 0,\ i = 1, .., p$ (\textbf{equality constraint}) \\ +$L(\mathbf{x}, \boldsymbol{\alpha}, \boldsymbol{\beta}) := f(\mathbf{x}) + \sum_{i=1}^m \alpha_i g_i(\mathbf{x}) + \sum_{i=1}^p \beta_i h_i(\mathbf{x})$ diff --git a/src/cil-cheatsheet/GaussianMixtureModel.tex b/src/cil-cheatsheet/GaussianMixtureModel.tex new file mode 100644 index 0000000..0465fb7 --- /dev/null +++ b/src/cil-cheatsheet/GaussianMixtureModel.tex @@ -0,0 +1,38 @@ +% !TEX root = Main.tex +\section{Gaussian Mixture Models (GMM)} +For GMM let $\boldsymbol{\theta}_k = (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$; $p_{\theta_k}(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \Sigma_k)$\\ +\textbf{Mixture Models:} $p_\theta(\mathbf{x}) = \sum_{k=1}^K \pi_k p_{\theta_k}(\mathbf{x})$\\ +\textbf{Assignment variable (generative model):} $z_{ij} \in \{0, 1\}$, $\sum_{j=1}^k z_{ij} = 1$\\ +$\operatorname{Pr}(z_k = 1) = \pi_k \Leftrightarrow p(\mathbf{z}) = \prod_{k=1}^K \pi_k^{z_k}$\\ +\textbf{Complete data distribution:}\\ +$p_\theta(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^K \left( \boldsymbol{\pi}_k p_{\theta_k}(\mathbf{x})\right)^{z_k}$\\ +\textbf{Posterior Probabilities:} $\\\operatorname{Pr}(z_k = 1 | \mathbf{x}) = \frac{\operatorname{Pr}(z_k = 1) p(\mathbf{x} | z_k = 1)}{\sum_{l=1}^K \operatorname{Pr}(z_l = 1) p(\mathbf{x} | z_l = 1)} = \frac{\boldsymbol{\pi}_k p_{\theta_k}(\mathbf{x})}{\sum_{l=1}^K \boldsymbol{\pi}_l p_{\theta_l}(\mathbf{x})}$ +\textbf{Likelihood of observed data $\mathbf{X}$:}\\ +$p_\theta(\mathbf{X}) = \prod_{n=1}^N p_\theta(\mathbf{x}_n) = \prod_{n=1}^N \left(\sum_{k=1}^K \pi_k p_{\theta_k}(\mathbf{x}_n)\right)$ +\textbf{Max. Likelihood Estimation (MLE):}\\ +$\argmax_\theta\sum_{n=1}^N \log \left( \sum_{k=1}^K \pi_k p_{\theta_k}(\mathbf{x}_n)\right)\\ +\ge \sum_{n=1^N} \sum_{k=1}^K{q_k[\log p_{\theta_k}(\mathbf{x}_n) + \log \pi_k - \log q_k]}$\\ +with $\sum_{k=1}^K{q_k} = 1$ by Jensen Inequality. + +\subsection*{Generative Model} +1. sample cluster index $j \sim Categorical(\pi)$\\ +2. 
given $j$, sample data $x \sim \text{Normal}(\mu_j, \Sigma_j)$ + + +\subsection*{Expectation-Maximization (EM) for GMM} +\textbf{E-Step: }\\ +Pr$[z_{k,n} = 1 | \mathbf{x}_n] = q_{k, n} = \frac{\boldsymbol{\pi}_k^{(t-1)} \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_k^{(t-1)}, \boldsymbol{\Sigma}_k^{(t-1)})}{\sum_{j=1}^K \boldsymbol{\pi}_j^{(t-1)} \mathcal{N}(\mathbf{x}_n | \boldsymbol{\mu}_j^{(t-1)}, \boldsymbol{\Sigma}_j^{(t-1)})}$\\ +\textbf{M-Step: } $\boldsymbol{\mu}_k^{(t)} := \frac{\sum_{n=1}^N q_{k,n} \mathbf{x}_n}{\sum_{n=1}^N q_{k,n}}$ +$, \boldsymbol{\pi}_k^{(t)} := \frac{1}{N} \sum_{n=1}^N q_{k,n}$\\ +$\Sigma_k^{(t)} = \frac{\sum_{n=1}^N q_{k, n} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t)})^\top}{\sum_{n=1}^N q_{k,n}}$ + +\subsection*{Discussion K-means vs. EM} +hard assignment vs soft. spherical cluster shapes vs full covariance matrix. fast vs. slow (needs more iterations). K-means can be used as initialization for EM.\\ +K-means is a special case of GMM with covariances $\Sigma_j = \sigma^2 I$. In the limit $\sigma \rightarrow 0$, we recover K-means. + +\subsection*{Model Order Selection (AIC / BIC for GMM)} +Trade-off between data fit (i.e. likelihood $p(\mathbf{X} | \theta)$) and complexity (i.e. \# of free parameters $\kappa(\cdot)$). For choosing $K$:\\ +\textbf{Akaike Information Criterion}: $\operatorname{AIC}(\theta | \mathbf{X}) = -\log p_\theta(\mathbf{X}) + \kappa(\theta)$\\ +\textbf{Bayesian Information Criterion}: $\operatorname{BIC}(\theta | \mathbf{X}) = -\log p_\theta(\mathbf{X}) + \frac{1}{2} \kappa(\theta) \log N$\\ +\# of free params, fixed covariance matrix: $\kappa(\theta) = K \cdot D + (K - 1)$ ($K$: \# clusters, $D$: $\mathsf{dim}\text{(data)}=\mathsf{dim}(\mu_i)$, $K-1$: \# free mixture weights $\pi_k$), full covariance matrix: $\kappa(\theta) = K(D + \frac{D(D+1)}{2}) + (K - 1)$.\\ +Compare AIC/BIC for different $K$ -- the smaller the better. BIC penalizes complexity more. diff --git a/src/cil-cheatsheet/KMeans.tex b/src/cil-cheatsheet/KMeans.tex new file mode 100644 index 0000000..a042a1f --- /dev/null +++ b/src/cil-cheatsheet/KMeans.tex @@ -0,0 +1,9 @@ +% !TEX root = Main.tex +\section{K-means Algorithm} +\textbf{Target:} $\min_{\mathbf{U}, \mathbf{Z}} J(\mathbf{U}, \mathbf{Z}) = \|\mathbf{X} - \mathbf{U} \mathbf{Z}\|_F^2$\\ +$= \sum_{n=1}^N \sum_{k=1}^K \mathbf{z}_{k,n} \|\mathbf{x}_n - \mathbf{u}_k\|_2^2$\\ +1. \textbf{Initiate:} choose $K$ centroids $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_K]$\\ +2. \textbf{Assign:} data points to clusters. $k^\star(\mathbf{x}_n) = \argmin_k \{ \|\mathbf{x}_n - \mathbf{u}_k\|_2 \}$ returns cluster $k^\star$, whose centroid $\mathbf{u}_{k^\star}$ is closest to data point $\mathbf{x}_n$. Set $\mathbf{z}_{k^\star,n} = 1$, and for $ l \neq k^\star~ \mathbf{z}_{l,n}=0$.\\ +3. \textbf{Update} centroids: $\mathbf{u}_k = \frac{\sum_{n=1}^N z_{k,n} \mathbf{x}_n}{\sum_{n=1}^N z_{k,n}}$.\\ +4.
Repeat from step 2, stops if $\|\mathbf{Z} - \mathbf{Z}^\text{new}\|_0 = \|\mathbf{Z} - \mathbf{Z}^\text{new}\|^2_F = 0$.\\ +Computational cost: $O(k\cdot n \cdot d)$ diff --git a/src/cil-cheatsheet/NMF.tex b/src/cil-cheatsheet/NMF.tex new file mode 100644 index 0000000..9e911d1 --- /dev/null +++ b/src/cil-cheatsheet/NMF.tex @@ -0,0 +1,39 @@ +% !TEX root = Main.tex +\section{Non-Negative Matrix Factorization} +$\mathbf{X} \in \mathbb{Z}^{N \times M}_{\geq 0}$, NMF: $\mathbf{X} \approx \mathbf{U^\top V}, x_{ij}=\sum_z{u_{zi}v_{zj}}=\langle\mathbf{u}_i \mathbf{v}_j\rangle$ +Decompose object into features: topics, face parts, etc.. $\mathbf{u}$ weights on parts, $\mathbf{v}$ parts (bases). More interpretable (PCA: holistic repre.). + +\subsection*{EM for MLE for pLSA (No global opt. guarantee)} +\textbf{Context Model:} $p(w | d) = \sum_{z=1}^K p(w | z) p(z | d)$\\ +\textbf{Conditional independence assumption ($*$):}\\ +$p(w|d) = \sum_z p(w,z|d) = \sum_z p(w|d,z)p(z|d) \stackrel{*}{=} \sum_z p(w|z)p(z|d)$ or $p(w|d,z) = p(w|z)$\\ +\textbf{Symmetric parameterization:}\\ +$p(w, d) = \sum_z p(z)p(w | z) p(d | z)$ \\ +Log-Likelihood: $L(\mathbf{U}, \mathbf{V}) = \sum_{i,j} x_{i,j}\log p(w_j|d_i) \\ += \sum_{(i,j) \in X} \log \sum_{z=1}^K p(w_j|z)p(z|d_i)$ \\ +$ p(w_j|z) = v_{zj}$, $p(z|d_i) = u_{zi}$, $\sum_j^N v_{zj} = \sum_z^K u_{zi} = 1$\\ +E-Step (optimal q: posterior of z over $(d_i, w_j)$):\\ +$q_{zij} = \frac{p(w_j|z)p(z|d_i)}{\sum_{k=1}^K p(w_j|k)p(k|d_i)} := \frac{v_{zj}u_{zi}}{\sum_{k=1}^K v_{kj}u_{ki}}$, $\sum_z q_{zij}=1$\\ +M-Steps:\\ +$p(z|d_i) = \frac{\sum_j x_{ij}q_{zij}}{\sum_j x_{ij}}, p(w_j|z) = \frac{\sum_i x_{ij}q_{zij}}{\sum_{i,l}x_{il}q_{zil}}$\\ +Lower Bound of $L(\mathbf{U}, \mathbf{V})$ Jensen ineq. : $\sum_{i,j \in X} \sum_{z=1}^K q_{zij}( log(v_{zj})+ log(u_{zi}) - log(q_{zij}))$ + +\subsection*{Latent Dirichlet Allocation} +To sample a new document, we need to extend $X$ and $U^T$ with a new row, s.t. $X=U^T V$. (While pLSA fixes both dimensions)\\ +For each $d_i$ sample topic weights $\mathbf{u}_i$\textasciitilde Dirichlet($\alpha$): $p(u_i|\alpha) = \prod_{z=1}^K u_{zi}^{\alpha_k-1}$, then topic $z^t$\textasciitilde Multi($u_i$), word $w^t$\textasciitilde Multi($v_{z^t}$)\\ +Multinom. obsv. model on wc vec: $p(\mathbf{x}|V,u) = \frac{l!}{\prod_j \mathbf{x}_j!}\prod_j \pi_j^{\mathbf{x}_j}$ +where $\pi_j=\sum_z v_{zj} u_z$, $l=\sum_j x_j$ \\ +Bayesian averaging over $\mathbf{u}$: $p(\mathbf{x}|\mathbf{V},\alpha)=\int p(\mathbf{x}|\mathbf{V},\mathbf{u})p(\mathbf{u}|\alpha)d\mathbf{u}$ + +\subsection*{NMF Algorithm for quadratic cost function} + +$\min_{\mathbf{U}, \mathbf{V}} J(\mathbf{U}, \mathbf{V}) = \frac{1}{2} \|\mathbf{X} - \mathbf{U}^\top\mathbf{V}\|_F^2$ (non-negativity) +s.t. $\forall i,j,z:u_{zi},v_{zj} \geq 0 $ \\ +Comparison with pLSA:\\ +1. sampling model: Gaussian vs multinomial +2. objective: quadratic vs KL divergence +3. constraints: not normalized \\ +Alternating least squares:\\ +1. init: $\mathbf{U}, \mathbf{V} = rand()$\\ 2. repeat 3\textasciitilde4 for $\mathit{maxIters}$:\\ +3. upd. $(\mathbf{VV}^\top)\mathbf{U} = \mathbf{VX}^\top$, proj. $u_{zi} = \max \{ 0, u_{zi} \}$\\ +4. update $(\mathbf{UU}^\top)\mathbf{V} = \mathbf{UX}$, proj. $v_{zj} = \max \{ 0, v_{zj} \}$ diff --git a/src/cil-cheatsheet/NeuralNetworks.tex b/src/cil-cheatsheet/NeuralNetworks.tex new file mode 100644 index 0000000..6783653 --- /dev/null +++ b/src/cil-cheatsheet/NeuralNetworks.tex @@ -0,0 +1,22 @@ +\section{Neural Networks} +% only CNN? 
+\textbf{Activation:} scalar, non-linear $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$\\ sigmoid: $\sigma(x)= \frac{1}{1+e^{-x}},\ \sigma'(x)=\sigma(x)(1-\sigma(x))$\\ +\textbf{Neurons}: $F_\sigma(\mathbf{x};\mathbf{w}) = \sigma(w_0 + \sum_{i=1}^M{x_iw_i}) = \sigma(w^{\top}x)$\\ \textbf{Output}: linear regression $\mathbf{y} = \mathbf{W}^L\mathbf{x}^{L-1}$, binary (logistic) $y_1 = \text{P}[Y=1|\mathbf{x}] = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x}^{L-1})}$, multiclass (soft-max) $y_k = \text{P}[Y=k|\mathbf{x}]= \frac{\exp( \mathbf{w}_k^T\mathbf{x}^{L-1})}{\sum_{m=1}^{K}{\exp(\mathbf{w}_m^T\mathbf{x}^{L-1})}}$. +\textbf{Loss} $l(y, \hat{y})$: squared loss $\frac{1}{2}(y - \hat{y})^2$, cross-entropy loss $-y \log \hat{y} - (1-y)\log(1-\hat{y})$ $0\leq\hat{y}\leq1, y \in \{0,1\}$ or $y \in [0,1]$. \textbf{layer-wise:} $\mathbf{x}^{l} = \sigma^{l}\left(\mathbb{W}^{\left(l\right)}\mathbf{x}^{\left(l-1\right)}\right)$. + +\subsection*{Backpropagation} +Layer-to-layer Jacobian: $\mathbf{x}$ = prev. layer activation, $\mathbf{x^+}$ = next layer activation. Jacobian matrix $\mathbf{J}$ = $J_{ij}$ of mapping $\mathbf{x}\rightarrow\mathbf{x^+}$, $\mathbf{x_i^+} = \sigma(\mathbf{w}_i^\top\mathbf{x})$, $J_{ij} = \frac{\partial \mathbf{x_i^+}}{\partial \mathbf{x}_j} = w_{ij}\cdot\sigma'(\mathbf{w}_i^\top\mathbf{x})$. Across multiple layers: $\frac{\partial\mathbf{x}^{(l)}}{\partial\mathbf{x}^{(l-n)}} = \mathbf{J}^{(l)}\cdot\frac{\partial\mathbf{x}^{(l-1)}}{\partial\mathbf{x}^{(l-n)}}=\mathbf{J}^{(l)}\cdot\mathbf{J}^{(l-1)}\cdots\mathbf{J}^{(l-n+1)}$ and then back prop. $ \nabla_{\mathbf{x}^{(l)}}^\top\ell=\nabla_{\mathbf{y}}^\top\ell\cdot\mathbf{J}^{(L)}\cdots\mathbf{J}^{(l+1)}$\\ +Weights: $\frac{\partial l}{\partial w_{ij}^{(l)}} = \frac{\partial l}{\partial x_i^{(l)}}\frac{\partial x_i^{(l)}}{\partial w_{ij}^{(l)}}$, $\frac{\partial x_i^{l}}{\partial w_{ij}^{l}} = \sigma'([\mathbf{w}_i^{(l)}]^T \mathbf{x}^{(l-1)})\cdot x_j^{(l-1)}$ (sensitivity of down-stream unit $\cdot$ activation of up-stream unit) + +\subsection*{Gradient Descent (or Steepest Descent)} +\textbf{Gradient}: $\nabla f(\mathbf{x}) := \left( \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}_1}, \ldots, \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}_D} \right)^\top$ + +$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \gamma \nabla f(\mathbf{x}^{(t)})$, usually $\gamma \approx \frac{1}{t}$ + +\textbf{SGD} Assume additive obj. $f(x) = \frac{1}{N}\sum_{n=1}^{N}f_n(x)$\\ +sample $n \in_{u.a.r.} \{1, \ldots, N\}$, then\\ +$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \gamma \nabla f_n(\mathbf{x}^{(t)})$, typically $\gamma \approx \frac{1}{t}$. + +\subsection*{Neural Networks for Images (CNN)} +$F_{n,m}(\mathbf{x};\mathbf{w}) = \sigma(b + \sum_{k=-2}^2\sum_{l=-2}^{2}{w_{k,l}x_{n+k,m+l}})$. +% Unsure $(I \star K)_{i',j'} = \sum_{-k \leq i}\sum_{j \leq k} (I)_{i'+i,j'+j}(K)_{-i,-j}$ with odd numbered kernel K with dim $(k-1)/2 x (k-1)/2$ diff --git a/src/cil-cheatsheet/PCA.tex b/src/cil-cheatsheet/PCA.tex new file mode 100644 index 0000000..4bfbcb5 --- /dev/null +++ b/src/cil-cheatsheet/PCA.tex @@ -0,0 +1,31 @@ +% -*- root: Main.tex -*- +\section{Principal Component Analysis} +$\mathbf{X} \in \mathbb{R}^{D \times N}$. $N$ observations, $K$ rank.\\ +1. Empirical Mean: $\overline{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n$.\\ +2. Center Data: $\overline{\mathbf{X}} = \mathbf{X} - [\overline{\mathbf{x}}, \ldots, \overline{\mathbf{x}}] = \mathbf{X} - \mathbf{M}$.\\ +3.
Cov.: $\boldsymbol{\Sigma} = \frac{1}{N } \sum_{n=1}^N (\mathbf{x}_n - \overline{\mathbf{x}}) (\mathbf{x}_n - \overline{\mathbf{x}})^\top = \frac{1}{N} \overline{\mathbf{X}}\overline{\mathbf{X}}^\top$.\\ +4. Eigenvalue Decomposition: $\boldsymbol{\Sigma} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$.\\ +5. Select $K < D$, only keep $\mathbf{U}_K, \boldsymbol{\lambda}_K$.\\ +6. Transform data onto new Basis: $\overline{\mathbf{Z}}_K = \mathbf{U}_K^\top \overline{\mathbf{X}}$.\\ +7. Reconstruct to original Basis: $\tilde{\overline{\mathbf{X}}} = \mathbf{U}_k \overline{\mathbf{Z}}_K$.\\ +8. Reverse centering: $\tilde{\mathbf{X}} = \tilde{\overline{\mathbf{X}}} + \mathbf{M}$.\\ +For compression save $\mathbf{U}_k, \overline{\mathbf{Z}}_K, \overline{\mathbf{x}}$.\\ +$\mathbf{U}_k \in \mathbb{R}^{D \times K}, \boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}, \overline{\mathbf{Z}}_K \in \mathbb{R}^{K \times N}, \overline{\mathbf{X}} \in \mathbb{R}^{D \times N}$ + +\textbf{Sample variance}: $var(X) = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})^2$ + +\subsection*{Iterative View} +Residual $r_i$: $x_i - \tilde{x}_i = (I - uu^T) x_i$\\ +Cov of $r$: $\frac{1}{n} \sum_{i=1}^n (I-uu^T)x_i x_i^T (I-uu^T)^T =$ \\ +$(I-uu^T) \Sigma (I-uu^T)^T = \Sigma - 2\Sigma u u^T + u u^T \Sigma u u ^T = \Sigma - \lambda uu^T$ \\ +1. Find principal eigenvector of $(\Sigma - \lambda u u^T)$\\ +2. which is the second eigenvector of $\Sigma$\\ +3. iterate to get the $d$ principal eigenvectors of $\Sigma$ + +\subsection*{Power Method} +Power iteration: $v_{t+1} = \frac{Av_t}{||Av_t||}$, $\lim_{t \rightarrow \infty} v_t = u_1$\\ +Assuming $\langle u_1, v_0 \rangle \not = 0$ and $|\lambda_1| > |\lambda_j| (\forall j \geq 2)$ +\subsection*{Reconstruction Proof Sketch} +Given: $\tilde{X} = U_KU_K^{\top}\bar{X}$ +To prove: squared reconstruction error is the sum of the lowest $D - K$ eigenvalues of $\Sigma$. +$err = 1/N\sum_{i=1}^{N}\|\tilde{x_i} - \bar{x_i}\|_2^2 = 1/N\|\tilde{X} - \bar{X}\|_F^2 = 1/N\|(U_KU_K^{\top} - I_d)\bar{X}\|_F^2 = 1/N \cdot trace((U_KU_K^{\top} - I_d)\bar{X} \bar{X}^{\top}(U_KU_K^{\top} - I_d)^{\top}) = 1/N \cdot trace(([U_K;0] - U)\Lambda([U_K;0] - U)^{\top})$ diff --git a/src/cil-cheatsheet/Reconstruction.tex b/src/cil-cheatsheet/Reconstruction.tex new file mode 100644 index 0000000..ed41992 --- /dev/null +++ b/src/cil-cheatsheet/Reconstruction.tex @@ -0,0 +1,40 @@ +% !TEX root = Main.tex +\section{Matrix Approximation \& Reconstruction} + +$\min_{rank(B)=k}[\sum_{(i,j)\in I}{(a_{ij}-b_{ij})^2}], I=\{(i,j): \mathit{ob.}\}$ +\subsection*{Alternating Least Squares} +$f(U,v_i) = \sum_{(i,j)\in I} (a_{i,j} - \langle u_j, v_i \rangle)^2$\\ +$f(u_i,V) = \sum_{(i,j)\in I} (a_{i,j} - \langle u_j, v_i \rangle)^2$\\ +Convex in one factor when the other is fixed. + +% YZF: delete? +% \subsection*{Coordinate Descent} +% 1. init: $\mathbf{x}^{(0)} \in \mathbb{R}^D$\\ +% 2. for $t = 0 \ \text{to} \ \mathit{maxIter}$:\\ +% 3. sample $d \in_{u.a.r.} \{1, \ldots, D\}$\\ +% 4. $u^\star = \argmin_{u \in \mathbb{R}} f(x_1^{(t)}, .., x_{d-1}^{(t)}, u, x_{d+1}^{(t)}, .., x_D^{(t)})$\\ +% 5. $\mathbf{x}_d^{(t+1)} = u^\star$ and $\mathbf{x}_i^{(t+1)} = \mathbf{x}_i^{(t)}$ for $i \neq d$ + +% \subsection*{Projected Gradient Descent (Constrained Opt.)} +% minimize $f(x)$, $x \in Q$ (constraint).\\ +% \textbf{Project} $x$ onto $Q$: $P_Q(\mathbf{x}) = \argmin_{y \in Q} \|\mathbf{y} - \mathbf{x}\|$,\\ +% \textbf{Update}: $\mathbf{x}^{(t+1)} = P_Q[\mathbf{x}^{(t)} - \gamma \nabla f(\mathbf{x}^{(t)})]$,\\ +% $\mathbf{x}^{(t+1)}$ is unique if $Q$ convex.
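A minimal sketch of the alternating least-squares idea above for completing a partially observed matrix (assumes NumPy; the rank $k$, ridge term \texttt{lam} and random initialization are illustrative choices, not prescribed by the lecture):
\begin{verbatim}
# Sketch: ALS on observed entries; with one factor fixed, each update is a least-squares solve.
import numpy as np

def als(A, mask, k=2, lam=0.1, iters=50):
    n, m = A.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((n, k))    # row factors u_i
    V = rng.standard_normal((m, k))    # column factors v_j
    for _ in range(iters):
        for i in range(n):             # V fixed: solve for u_i over observed entries of row i
            J = mask[i]
            U[i] = np.linalg.solve(V[J].T @ V[J] + lam * np.eye(k), V[J].T @ A[i, J])
        for j in range(m):             # U fixed: solve for v_j over observed entries of column j
            I = mask[:, j]
            V[j] = np.linalg.solve(U[I].T @ U[I] + lam * np.eye(k), U[I].T @ A[I, j])
    return U, V                        # reconstruction: A_hat = U @ V.T

# usage: mask is a boolean matrix marking the observed entries (i,j) of A
\end{verbatim}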
+ + +% YZF: need revising +\subsection*{Convex Optimization} +Def.: $\{(x,t)|x \in dom f, f(x) \leq t\}$, $f : \mathbb{R}^D \rightarrow \mathbb{R}$ is convex, if $dom\ f$ is a convex set, and if $\forall \mathbf{x}, \mathbf{y} \in dom\ f$, and $\forall \alpha\in[0,1]$: $f(\alpha \mathbf{x} + (1 - \alpha)\mathbf{y}) \leq \alpha f(\mathbf{x}) + (1-\alpha)f(\mathbf{y})$. +Convex $\iff$ Hessian p.s.d $\iff$ local=global \\ +Positive semi-definite: all principal minors (same-indexed rows and columns) $\geq$ 0\\ +Positive definite: leading principal minors $>$ 0 + +\subsection*{Convex Relaxation} +Replace non-convex rank constraints by convex norm constraints (superset). Then project optimum back (hopefully still optimal).\\ +$\min_{\mathbf{B}\in P_k}{\|\mathbf{A-B}\|^2_G}, P_k=\{\mathbf{B}:\|\mathbf{B}\|_{*}\leq k\}\supseteq Q_k=\{\mathbf{B}:\mathit{rank}(\mathbf{B})\leq k\}$ (in fact tightest convex lowerbound $\mathit{rank}(\mathbf{B})\geq \|\mathbf{B}\|_{*}, for \|\mathbf{B}\|_2 \leq 1$) + +\subsection*{SVD Thresholding} +% TBA +$\mathbf{B}^{*}=\mathit{shrink}_\tau(\mathbf{A})=\argmin_{\mathbf{B}}{\{\|\mathbf{A-B}\|^2_F + \tau\|\mathbf{B}\|_{*}\}}$\\ +Then with SVD $\mathbf{A=UDV_T}, \mathbf{D}=\mathit{diag}(\sigma_i)$, holds $\mathbf{B^*=UD_\tau V^T, D_\tau} = \mathit{diag}(\max\{0,\sigma_i - \tau\})$ \\ +Iteration: $\mathbf{B}_{t+1}=\mathbf{B}_t + \eta_t \Pi(\mathbf{A} - \mathit{shrink}_\tau(\mathbf{B}_t))$ \ No newline at end of file diff --git a/src/cil-cheatsheet/RobustPCA.tex b/src/cil-cheatsheet/RobustPCA.tex new file mode 100644 index 0000000..b4e122e --- /dev/null +++ b/src/cil-cheatsheet/RobustPCA.tex @@ -0,0 +1,22 @@ +\section{Robust PCA} +\begin{compactitem} + \item Idea: Approximate $\mathbf{X}$ with $\mathbf{L} + \mathbf{S}$, $\mathbf{L}$ is low-rank, $\mathbf{S}$ is sparse. + \item $\min_{\mathbf{L},\mathbf{S}}\mathsf{rank}(\mathbf{L}) + \mu \lVert \mathbf{S}\rVert_0$, s. t. $\mathbf{L} + \mathbf{S} = \mathbf{X}$. As non-convex, change to $\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_\star + \lambda \lVert\mathbf{S}\rVert_1$ (\emph{not} the same in general) Deal w/ missing values: subj. to $L_{ij} + S_{ij} = X_{ij}, \forall(i,j) \in \Omega_{observed}$ + \item Perfect reconstruction is \emph{not} possible if $\mathbf{S}$ is low-rank, $\mathbf{L}$ is sparse, or $\mathbf{X}$ is low-rank \textit{and} sparse. Formally coherence: $\|\mathbf{U}^\top \mathbf{e}_i\|^2 \leq \frac{\nu r}{n}$, $\|\mathbf{V}^\top \mathbf{e}_i\|^2 \leq \frac{\nu r}{n}$, $\|\mathbf{UV}^\top\|^2_{ij} \leq \frac{\nu r}{n^2}$ : $\mathbf{L}=\mathbf{U}\mathbf{D}\mathbf{V}^\top$ +\end{compactitem} + +\subsection*{Dual Ascent (Gradient Method for Dual Problem)} +$\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^{t} + \eta \nabla D(\boldsymbol{\lambda}^t)$, +$ \nabla D (\boldsymbol{\lambda}) = \mathbf{A}\mathbf{x}^*-\mathbf{b}$ for $\mathbf{x}^* \in \arg\min_\mathbf{x} \mathcal{L}(\mathbf{x},\boldsymbol{\lambda})$ +\textbf{Dual Decomposition for Dual Ascent}: +\\$\mathbf{x}_i^{t+1} := \arg\min_{\mathbf{x}_i} \mathcal{L}_i(\mathbf{x}_i,\lambda^t)$; +$\boldsymbol{\lambda}^{t+1} := \boldsymbol{\lambda}^t + \eta^t \left( \sum_{i=1}^{N} \mathbf{A}_i \mathbf{x}_i^{t+1} -\mathbf{b} \right) $ + +\subsection*{Alternating Direction Method of Multipliers (ADMM)} +$\min_{\mathbf{x}_1, \mathbf{x}_2} f_1(\mathbf{x}_1) + f_2(\mathbf{x}_2)$ s. t. 
$\mathbf{A}_1 \mathbf{x}_1 + \mathbf{A}_2 \mathbf{x}_2 = \mathbf{b}$, $f_1, f_2$ convex +\begin{inparaitem}[\color{red}\textbullet] + \item Augmented Lagrangian: $L_p(\mathbf{x}_1, \mathbf{x}_2, \boldsymbol{\nu}) = f_1(\mathbf{x}_1) + f_2(\mathbf{x}_2) + \boldsymbol{\nu}^\top (\mathbf{A}_1 \mathbf{x}_1 + \mathbf{A}_2 \mathbf{x}_2 - \mathbf{b}) + \frac{p}{2}\| \mathbf{A}_1 \mathbf{x}_1 + \mathbf{A}_2 \mathbf{x}_2 - \mathbf{b} \|_2^2$ + \item ADMM: $\mathbf{x}_1^{(t+1)} := \argmin_{\mathbf{x}_1} L_p(\mathbf{x}_1, \mathbf{x}_2^{(t)}, \boldsymbol{\nu}^{(t)})$, $\mathbf{x}_2^{(t+1)} := \argmin_{\mathbf{x}_2} L_p(\mathbf{x}_1^{(t+1)}, \mathbf{x}_2, \boldsymbol{\nu}^{(t)})$, $\boldsymbol{\nu}^{(t+1)} := \boldsymbol{\nu}^{(t)} + p(\mathbf{A}_1 \mathbf{x}_1^{(t+1)} + \mathbf{A}_2 \mathbf{x}_2^{(t+1)} - \mathbf{b})$ + \item ADMM for RPCA: $f_1(\mathbf{L}) = \|\mathbf{L}\|_\star$, $f_2(\mathbf{S}) = \lambda \| \mathbf{S} \|_1$, $\mathbf{A}_1 \mathbf{x}_1 + \mathbf{A}_2 \mathbf{x}_2 = \mathbf{b} \text{ becomes } \mathbf{L} + \mathbf{S} = \mathbf{X}$, therefore $L_p(\mathbf{L}, \mathbf{S}, \boldsymbol{\nu}) = \|\mathbf{L}\|_* + \nu \|\mathbf{S}\|_1 + \left< \nu, \mathrm{vec}(\mathbf{L}+\mathbf{S}-\mathbf{X}) \right> + \frac{P}{2} \| \mathbf{L}+ \mathbf{S} - \mathbf{X} \|_F^2$ + %updates: $\mathbf{L}^{t+1} = \mathcal{D}_{\rho^{-1}}(X-S-\rho^{-1}\text{mat}(\lambda))$ +\end{inparaitem} \ No newline at end of file diff --git a/src/cil-cheatsheet/SVD.tex b/src/cil-cheatsheet/SVD.tex new file mode 100644 index 0000000..bd2fa82 --- /dev/null +++ b/src/cil-cheatsheet/SVD.tex @@ -0,0 +1,21 @@ +% -*- root: Main.tex -*- +\section{Singular Value Decomposition} +$\mathbf{A} = \mathbf{U} \mathbf{D} \mathbf{V}^\top = \sum_{k=1}^{\operatorname{rank}(\mathbf{A})} d_{k,k} u_k (v_k)^\top$\\ +$\mathbf{A} \in \mathbb{R}^{N \times P}, \mathbf{U} \in \mathbb{R}^{N \times N}, \mathbf{D} \in \mathbb{R}^{N \times P}, \mathbf{V} \in \mathbb{R}^{P \times P}$\\ +$\mathbf{U}^\top \mathbf{U} = I = \mathbf{V}^\top \mathbf{V}$ ($\mathbf{U}, \mathbf{V}$ orthonormal)\\ +$\mathbf{U}$ columns are eigvecs of $\mathbf{A} \mathbf{A}^\top$, $\mathbf{V}$ columns are eigvecs of $\mathbf{A}^\top \mathbf{A}$, $\mathbf{D}$ diag. elements are singular values.\\ +$(\mathbf{D}^{-1})_{i,i} = \frac{1}{\mathbf{D}_{i, i}}$ (don't forget to transpose) + +1. calculate $\mathbf{A}^\top \mathbf{A}$.\\ +2. calculate eigvals of $\mathbf{A}^\top \mathbf{A}$, the square root of them, in descending order, are the diagonal elements of $\mathbf{D}$.\\ +3. calc. eigvecs of $\mathbf{A}^\top \mathbf{A}$ using eigvals resulting in the columns of $\mathbf{V}$.\\ +4. calculate the missing matrix: $\mathbf{U} = \mathbf{A} \mathbf{V} \mathbf{D}^{-1}$.\\ +5. normalize each column of $\mathbf{U}$ and $\mathbf{V}$. + +\subsection*{Low-Rank approximation} +Use only $K$ largest eigvals (and corresp. eigvecs). $\tilde{\mathbf{A}}_{i, j} = \sum_{k=1}^K \mathbf{U}_{i, k} \mathbf{D}_{k,k} \mathbf{V}_{j, k} = \sum_{k=1}^K \mathbf{U}_{i, k} \mathbf{D}_{k,k} (\mathbf{V}^\top)_{k, j}$. 
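Minimal NumPy sketch of the rank-$K$ truncation above (the matrix and $K$ are arbitrary toy choices, not course material):
\begin{verbatim}
# Sketch: truncated SVD; keep only the K largest singular values/vectors.
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])                  # toy matrix (assumption)
U, d, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(d) V^T
K = 1
A_K = U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]       # rank-K approximation
# squared Frobenius error = sum of the discarded singular values squared
assert np.isclose(np.linalg.norm(A - A_K, 'fro')**2, np.sum(d[K:]**2))
\end{verbatim}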
+ +\subsection*{Echart-Young Theorem} +$\mathbf{A}_k=\argmin_{rank(B)=k}\|\mathbf{A-B}\|_F^2$ (not convex) +$\min_{rank(B)=K} ||A-B||_F^2 = ||A-A_k||_F^2 = \sum_{r=k+1}^{rank(A)} \sigma_r^2$ +$\min_{rank(B)=K} ||A-B||_2 = ||A-A_k||_2 = \sigma_{k+1}$ diff --git a/src/cil-cheatsheet/SparseCoding.tex b/src/cil-cheatsheet/SparseCoding.tex new file mode 100644 index 0000000..9d727d1 --- /dev/null +++ b/src/cil-cheatsheet/SparseCoding.tex @@ -0,0 +1,43 @@ +\section{Sparse Coding} + +\subsection*{Orthogonal Basis} +Pros: fast inverse; preserves energy. +For $\mathbf{x}$ and orthog. mat. $\mathbf{U}$ compute $\mathbf{z} = \mathbf{U}^\top \mathbf{x} $. Approx $ \mathbf{\hat{x}} = \mathbf{U\hat{z}}$, $\hat{z}_i = z_i$ if $ \lvert z_i \rvert > \epsilon$ else 0. +Reconstruction Error $\|\mathbf{x}-\mathbf{\hat{x}}\|^2 = \sum_{d\notin\sigma}\langle\mathbf{x},\mathbf{u}_d\rangle ^2$. +Choice of base depends on signal. Fourier: global support, good for sine like waves; wavelet: local support, poor for non-vanishing signal; PCA basis optimal for given $\Sigma$. Stripes \& check patterns: hi-freq in Fourier. Fourier: $O(D\cdot logD)$, Wavelet: $O(D)$ or $O(D\cdot logD)$ + +\subsection*{Haar Wavelets (form orthogonal basis)} +scaling fnc. $\phi(x)=[1,1,1,1]$, mother $W(x)=[1,1,-1,-1]$, dilated $W(2x)=[1,-1,0,0]$, translated $W(2x-1)=[0,0,1,-1]$ +Must be normalized +\subsection*{Overcomplete Basis} +$\mathbf{U} \in \mathbb{R}^{D \times L}$ for \# atoms $ = L > D = \mathsf{dim}\text{(data)}$. Decoding involved $\rightarrow$ add constraint $\mathbf{z}^\star \in \argmin_\mathbf{z} \lVert \mathbf{z} \rVert_0$ s.t. $\mathbf{x} = \mathbf{Uz}$. NP-hard $\rightarrow$ approximate with 1-norm (convex) or with MP. + +\textbf{Coherence} +\begin{inparaitem}[\color{red}\textbullet] + \item $m(\mathbf{U}) = \max_{i,j:\, i \neq j} | \mathbf{u}_i^\top \mathbf{u}_j |$ + \item $m(\mathbf{B}) = 0$ if $\mathbf{B}$ orthog. matrix + \item $m([\mathbf{B}, \mathbf{u}]) \geq \frac{1}{\sqrt{D}}$ if atom $\mathbf{u}$ is added to orthog. basis $\mathbf{B}$ (o.n.b. = orthonormal base) +\end{inparaitem} + +\textbf{Matching Pursuit (MP)} +approximation of $\mathbf{x}$ onto $\mathbf{U}$, using $K$ entries. +Objective: $\mathbf{z}^\star \in \argmin_{\mathbf{z}} \|\mathbf{x} - \mathbf{Uz} \|_2$, s.t. $\|\mathbf{z}\|_0 \leq K$ +\begin{inparaenum}[\color{red}1.] + \item init: $z \leftarrow 0, r \leftarrow x$ + \item while $\|\mathbf{z}\|_0 < K$ do + \item select atom \textit{index} with smallest angle $i^\star = \argmax_i |\langle \mathbf{u}_i, \mathbf{r} \rangle|$ + \item update coefficients: $z_{i^\star} \leftarrow z_{i^\star} + \langle \mathbf{u}_{i^\star}, \mathbf{r} \rangle$ + \item update residual: $\mathbf{r} \leftarrow \mathbf{r} - \langle \mathbf{u}_{i^\star}, \mathbf{r} \rangle \mathbf{u}_{i^\star}$. +\end{inparaenum} +\\\textbf{Exact recovery} when: $K<1/2( 1+1/m(\mathbf{U}))$ + +\textbf{Compressive Sensing}: Compress data while gathering: +\begin{inparaitem}[\color{red}\textbullet] + \item $\mathbf{x} \in \mathbb{R}^D$, $K$-sparse in o.n.b. $\mathbf{U}$. $\mathbf{y} \in \mathbb{R}^M$ with $y_i = \langle \mathbf{w}_i, \mathbf{x}\rangle $: $M$ lin. combinations of signal; $\mathbf{y} = \mathbf{Wx} = \mathbf{WUz} = \Theta\textbf{z}$, $\Theta \in \mathbb{R}^{M \times D}$ + \item Reconstruct $\mathbf{x} \in \mathbb{R}^D$ from $\mathbf{y}$; find $\mathbf{z}^\star \in \argmin_{\mathbf{z}}\|\mathbf{z}\|_0$, s.t. $\mathbf{y} = \Theta\mathbf{z}$ (e.g. with MP, or convex it with 1-norm: can be eq.!). 
Given $\mathbf{z}$, reconstruct $\mathbf{x} = \mathbf{Uz}$ +\end{inparaitem} +\\Any orthogonal $\mathbf{U}$ sufficient if: +\begin{inparaitem}[\color{red}\textbullet] + \item $\mathbf{W} = $ Gaussian random projection, i.e. $w_{ij}\sim\mathcal{N}(0, \frac{1}{D})$ + \item M $\geq cK log(\frac{D}{K})$, where $c$ is some constant +\end{inparaitem} diff --git a/src/cil-cheatsheet/WordEmbedding.tex b/src/cil-cheatsheet/WordEmbedding.tex new file mode 100644 index 0000000..c8720ab --- /dev/null +++ b/src/cil-cheatsheet/WordEmbedding.tex @@ -0,0 +1,17 @@ +% !TEX root = Main.tex +\section{Word Embeddings} +\textbf{Distr. Model:} $p_\theta(w|w')$ = Pr[$w$ in context of $w'$]\\ +\textbf{Log-likelihood:}\\ +$L(\theta; \mathbf{w}) = \sum_{t=1}^T\sum_{\Delta \in I}{\log p_\theta(w^{(t+\Delta)}|w^{(t)})}$\\ +\textbf{Latent Vector Model:} $w \rightarrow (\mathbf{x}_w, b_w) \in \mathbb{R}^{D+1} \\p_{\theta}(w|w') = \frac{\exp[\langle \mathbf{x}_w,\mathbf{x}_{w'}\rangle + b_w]}{\sum_{v\in V}{\exp[\langle \mathbf{x}_v,\mathbf{x}_{w'}\rangle + b_v ]}}$ (soft-max).\\ +\textbf{Modifications:}\\ +$\log p_{\theta}(w|w') = \langle y_{w} , x_{w'} \rangle + b_w$, word $y_w$, c'txt $x_{w'}$\\ +use GloVe obj., negative sampling (logistic class.) + +\subsection*{GloVe (Weighted Square Loss)} +\textbf{Co-occ.:} $\mathbf{N} = (n_{ij}) \in \mathbb{R}^{|V|\times|C|} = \# w_i$ in context $w_j$\\ \textbf{Objective:} $H(\theta;\mathbf{N})= \sum_{n_{ij} > 0} f(n_{ij})(\log n_{ij} - \log \exp[\langle \mathbf{x}_i, \mathbf{y}_j \rangle + b_i + d_j])^2$, $f(n) = \min\{1, (\frac{n}{n_{max}})^\alpha\}$, $\alpha \in (0;1]$ ($ = 3/4$) unnorm. distr. $\rightarrow$ 2-sided loss. +cutoff $n_{max}$: limit influence of high freq. $f(n)\stackrel{n\rightarrow0}{\rightarrow}0$: as small counts very noisy\\ +1. sample $(i,j) u.a.r, s.t. n_{ij}>0$\\ +2. $\mathbf{x}_i^{new} \leftarrow \mathbf{x}_i + 2\eta f(n_{ij})(\log n_{ij} - \langle \mathbf{x}_i, \mathbf{y}_j \rangle)\mathbf{y}_j$\\ +3. $\mathbf{y}_j^{new} \leftarrow \mathbf{y}_j + 2\eta f(n_{ij})(\log n_{ij} - \langle \mathbf{x}_i, \mathbf{y}_j \rangle)\mathbf{x}_i$ +\par embeds can model analogies and relatedness, but antonyms are usually not well captured. 
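A small sketch of the SGD updates above (assumes NumPy; vocabulary size, embedding dimension, step size and the toy counts are illustrative; the bias terms $b_i, d_j$ are omitted, as in the update steps above):
\begin{verbatim}
# Sketch: GloVe-style SGD on a toy co-occurrence matrix N (biases omitted).
import numpy as np

rng = np.random.default_rng(0)
V, C, D = 50, 50, 10                          # vocab size, #contexts, embedding dim
N = rng.poisson(1.0, size=(V, C))             # toy counts n_ij (assumption)
X = 0.1 * rng.standard_normal((V, D))         # word vectors x_i
Y = 0.1 * rng.standard_normal((C, D))         # context vectors y_j
f = lambda n, n_max=10.0, alpha=0.75: min(1.0, (n / n_max) ** alpha)
eta = 0.05
obs = [(i, j) for i in range(V) for j in range(C) if N[i, j] > 0]
for idx in rng.permutation(len(obs)):         # sample observed pairs u.a.r.
    i, j = obs[idx]
    r = np.log(N[i, j]) - X[i] @ Y[j]         # residual: log n_ij - <x_i, y_j>
    xi = X[i].copy()
    X[i] += 2 * eta * f(N[i, j]) * r * Y[j]
    Y[j] += 2 * eta * f(N[i, j]) * r * xi
\end{verbatim}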
diff --git a/src/cil-cheatsheet/main.tex b/src/cil-cheatsheet/main.tex new file mode 100644 index 0000000..f7b4e5b --- /dev/null +++ b/src/cil-cheatsheet/main.tex @@ -0,0 +1,87 @@ +\documentclass[11pt,landscape,a4paper,standalone,fleqn]{article} +\usepackage[utf8]{inputenc} +\usepackage[ngerman]{babel} +\usepackage{tikz} +\usetikzlibrary{shapes,positioning,arrows,fit,calc,graphs,graphs.standard} +\usepackage[nosf]{kpfonts} +% \usepackage{setspace} +\usepackage[t1]{sourcesanspro} +%\usepackage[letterspace=-100]{microtype} +%\usepackage[lf]{MyriadPro} +%\usepackage[lf,minionint]{MinionPro} +\usepackage{multicol} +\usepackage{parskip} +\setlength{\parskip}{0pt} +\setlength{\parindent}{0pt} +\usepackage{wrapfig} +\usepackage{paralist} +\usepackage[top=0mm,bottom=1mm,left=1mm,right=1mm]{geometry} +\usepackage[framemethod=tikz]{mdframed} +\usepackage{microtype} +\usepackage{mathptmx} + +% math helpers +\DeclareMathOperator*{\argmin}{arg\,min} +\DeclareMathOperator*{\argmax}{arg\,max} + +\let\bar\overline + +\definecolor{myblue}{cmyk}{1,0.72,0,0.38} +\definecolor{myorange}{cmyk}{0,0.5,1,0} + +\pgfdeclarelayer{background} +\pgfsetlayers{background,main} + +\everymath\expandafter{\the\everymath \color{myblue}} +\everydisplay\expandafter{\the\everydisplay \color{myblue}} + +\renewcommand{\baselinestretch}{.8} +\pagestyle{empty} + +\global\mdfdefinestyle{header}{% +linecolor=gray,linewidth=1pt,% +leftmargin=0mm,rightmargin=0mm,skipbelow=0mm,skipabove=0mm, +} + +\newcommand{\header}{ +\begin{mdframed}[style=header] +\footnotesize +\sffamily +Oliver Blaser and Joël Schneider\\~page~\thepage~of~2 +\end{mdframed} +} + +\makeatletter +\renewcommand{\section}{\@startsection{section}{1}{0mm}% + {.2ex}% + {.2ex}%x + {\color{myorange}\sffamily\small\bfseries}} +\renewcommand{\subsection}{\@startsection{subsection}{1}{0mm}% + {.2ex}% + {.2ex}%x + {\sffamily\bfseries}} + + +\makeatother +\setlength{\parindent}{0pt} + +\begin{document} +\small +\begin{multicols*}{4} +\setcounter{section}{-1} +\input{Essentials} +\input{PCA} +\input{SVD} +\input{Reconstruction} +\input{NMF} +\input{WordEmbedding} +\input{DataClusteringMixture} +% \input{KMeans} +% \input{GaussianMixtureModel} +\input{SparseCoding} +\input{DictionaryLearning} +\input{NeuralNetworks} +\input{DeepUnsuperviseLearning} +% \input{RobustPCA} +\end{multicols*} +\end{document} diff --git a/src/mlhc-cheatsheet.tex b/src/mlhc-cheatsheet.tex new file mode 100644 index 0000000..4f5529d --- /dev/null +++ b/src/mlhc-cheatsheet.tex @@ -0,0 +1,209 @@ +\documentclass[11pt]{article} + +% Packages +\usepackage{bookmark} +%\hypersetup{bookmarks,bookmarksopen,bookmarksdepth=2} +\hypersetup{hidelinks} +\usepackage[margin=0.5in]{geometry} +\usepackage{amsmath,amsthm,amssymb} +%\usepackage{scrextend} +\usepackage{fancyhdr} +\pagestyle{fancy} +\usepackage{lipsum} +\usepackage{xargs} +\usepackage[pdftex,dvipsnames]{xcolor} +\usepackage[colorinlistoftodos,prependcaption,textsize=tiny]{todonotes} +\usepackage{enumitem} +\usepackage{multicol} +\usepackage{mathtools} +\usepackage{algorithm} +\usepackage[noend]{algpseudocode} + +\usepackage{titlesec} +\titlespacing*{\section} +{0pt}{0.5em}{0.2em} +%\titleformat{\section} +% {\normalfont\scshape}{\thesection}{1em}{} + +% Todo notes +\newcommandx{\must}[2][1=]{\todo[linecolor=red,backgroundcolor=red!25,bordercolor=red,#1]{#2}} +\newcommandx{\maybe}[2][1=]{\todo[linecolor=blue,backgroundcolor=blue!25,bordercolor=blue,#1]{#2}} + +% List items +\newcommand{\pro}{\item[$+$]} +\newcommand{\con}{\item[$-$]} + +% Unordered list 
+\newcommand{\ulb}{\begin{enumerate}[label=\textbullet,topsep=0pt,itemsep=-0.5ex,partopsep=1ex,parsep=1ex]} +\newcommand{\ule}{\end{enumerate}\vspace{1mm}} + +% Ordered list +\newcommand{\olb}{\begin{enumerate}[topsep=0pt,itemsep=-1ex,partopsep=1ex,parsep=1ex]} +\newcommand{\ole}{\end{enumerate}\vspace{1mm}} + +% Quotes +\usepackage [english]{babel} +\usepackage [autostyle, english = american]{csquotes} +\MakeOuterQuote{"} + +% Multicolumn +\newcommand{\mcb}[1]{\begin{multicols}{#1}} +\newcommand{\mce}{\end{multicols}\noindent} + +% Minipage +\newcommand{\mpb}[1]{\begin{minipage}[t]{#1}} +\newcommand{\mpe}{\end{minipage} +} +% Norm +\newcommand{\snorm}[1]{||#1||} +\newcommand{\bnorm}[1]{\left\lVert#1\right\rVert} + +% Kullback-Leibler divergence +\DeclarePairedDelimiterX{\infdivx}[2]{(}{)}{% + #1\;\delimsize\|\;#2% +} +\newcommand{\dkl}{D_{KL}\infdivx} +\newcommand{\djs}{D_{JS}\infdivx} + +% Shorthand +\newcommand{\R}{\mathbb{R}} +\newcommand{\loss}{\mathcal{L}} +\newcommand{\risk}{\mathcal{L}} +\newcommand{\ra}{\rightarrow} +\newcommand{\goes}{\rightarrow} +\newcommand{\given}{\mid} +\newcommand{\normal}{\mathcal{N}} +\DeclareMathOperator*{\argmax}{arg\,max} +\DeclareMathOperator*{\argmin}{arg\,min} +\newcommand{\Eu}{\mathop{\mathbb{E}}} +\DeclareMathOperator{\E}{\mathbb{E}} +\DeclareMathOperator{\rank}{\text{rank}} +\newcommand{\sigm}{\mathop{\text{sigm}}} +\def\BState{\State\hskip-\ALG@thistlm} + +\begin{document} + +\par In biology, a \textbf{gene} is a sequence of nucleotides in DNA or RNA that codes for a molecule that has a function. The \textbf{genotype} is the part of the genetic makeup of a cell, and therefore of any individual, which determines one of its characteristics (phenotype). The \textbf{genotype–phenotype distinction} is drawn in genetics. "Genotype" is an organism's full hereditary information. "Phenotype" is an organism's actual observed properties, such as morphology, development, or behavior. An \textbf{allele} is a variant form of a given gene. A \textbf{heterozygous} individual is someone who has two different alleles at a locus. A \textbf{homozygous} individual has two identical alleles at a locus. + +\section{Support Vector Machines and Kernels for Computational Biology} +\par \textbf{RNA splicing}, in molecular biology, is a form of RNA processing in which a newly made precursor messenger RNA (pre-mRNA) transcript is transformed into a mature messenger RNA (mRNA). +\par \textbf{Kernel trick}: Scalar product in feature space can be computed in input space. Common kernels: polynomial, sigmoid, RBF, normalization... Kernels allow to encode application-specific knowledge. Many kernels for different applications available. +\par String kernel SVMs capable of efficiently dealing with large k-mers k > 10. +\\ +\textbf{Spectrum kernel}: position-independent motifs. \textbf{Spectrum Kernel with Mismatches}: Do not enforce strictly exact matches. \textbf{Weighted-degree kernel}: position-dependent motifs. As weighting use $\beta_k = 2\frac{d-k+1}{d(d+1)}$, where d is the maximal match length taken into account. This way the longer matches are weighted less, but they imply many shorter matches. \textbf{Weighted Degree Kernel with Shifts}: partially position-dependent motifs +\par \textbf{SVM scoring function:} SVM decision function is $\alpha$-weighting of training points, but we are interested in weights of features. We can explicitly compute w and use it to rank importance. Explicit representation of w allows (some) interpretation. 
SVM-w does not reflect the score for a motif as substrings and overlapping strings contribute, too! +\par \textbf{Positional Oligomer Importance Matrices (POIMs)} +\ulb\item Given k-mer z at position j in the sequence, compute expected score $\mathbb{E} [ s(x) \given x[j] = z ]$ (for small k) +\item Normalize with expected score over all sequences +\item For large k use differential POIM +\ule +A \textbf{sequence logo} consists of a stack of letters at each position. The relative sizes of the letters indicate their frequency in the sequences. The total height of the letters depicts the information content of the position, in bits. The lowest order POIM (k=1) essentially conveys the same information as is represented in a sequence logo. However, unlike sequence logos, POIMs naturally generalize to higher order nucleotide patterns. +\par A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or \textbf{position-specific scoring matrix (PSSM)}, is a commonly used representation of motifs (patterns) in biological sequences. PWMs are often derived from a set of aligned sequences that are thought to be functionally related and have become an important part of many software tools for computational motif discovery. +\par A PWM has one row for each symbol of the alphabet: 4 rows for nucleotides in DNA sequences or 20 rows for amino acids in protein sequences. It also has one column for each position in the pattern. In the first step in constructing a PWM, a basic position frequency matrix (PFM) is created by counting the occurrences of each nucleotide at each position. From the PFM, a position probability matrix (PPM) can now be created by dividing that former nucleotide count at each position by the number of sequences, thereby normalising the values. Most often the elements in PWMs are calculated as log likelihoods. + +\section{Biomedical Natural Language Processing} +\par \textbf{Term frequency (TF)}: the raw count of a term in a document: the number of times that term t occurs in document d. TF suffers from a critical problem: all terms are considered equally important. In fact certain terms have little or no discriminating power in determining relevance. Basic formulation is $f_{t,d} / \sum_{t'} f_{t',d}$ +\par \textbf{Document Frequency (DF)}: the number of documents in the collection that contain a term t. Basic formulation of IDF is $\log N / n_t$ +\par In information retrieval, tf–idf or TFIDF, short for \textbf{term frequency–inverse document frequency}, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. +\par \textbf{Brill tagger} (Transformation-Based Learning): a transformation-based process, in the sense that a tag is assigned to each word and changed using a set of predefined rules. In the transformation process, if the word is known, it first assigns the most frequent tag, or if the word is unknown, it naively assigns the tag "noun" to it. Applying over and over these rules, changing the incorrect tags, a quite high accuracy is achieved. +\par \textbf{Why use embeddings?} Reduce dimensionality of representation. Encodes similarity information, useful for other tasks. Learn representations of entities (words) as well as relationships between them. 
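\par A minimal sketch of how embeddings encode similarity, via the cosine similarity (normalized dot product) used below (assumes NumPy; the three vectors are invented toy values, not trained embeddings):
\begin{verbatim}
# Sketch: cosine similarity between embedding vectors.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

jumps = np.array([0.9, 0.1, 0.4])   # toy embeddings (assumption)
leaps = np.array([0.8, 0.2, 0.5])
gene  = np.array([-0.3, 0.9, 0.1])
print(cosine(jumps, leaps))         # high: words appearing in similar contexts
print(cosine(jumps, gene))          # low: unrelated contexts
\end{verbatim}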
+\par \textbf{Word2Vec:} Train a classifier on a binary prediction task of words occurring in the neighbourhoods of other words, take the learned classifier weights as the word embeddings. Faster and can easily incorporate a new sentence/document or add a word to the vocabulary. \textbf{Continuous Bag-of-Words (CBoW) model}: predict center word from sum of surrounding word vectors. \textbf{Skip-gram model}: predicting surrounding single words from center word (see NLU notes). Take the target word and a neighboring context word as positive examples, randomly sample other words in the lexicon to get negative samples. Normalized dot-product gives \textbf{cosine similarity}. This means we maximise the overlap (via dot product) between a word and the context it appeared in. By transitivity, any other word with a similar context will have a large overlap with the original word. For example, jumps $\sim$ leaps because their context vectors are similar. +\par Combining embeddings with prior knowledge: from analogical reasoning, abstract relationships were translations in the embedded space. Take this idea and extend the concept of context to include "appears in a relationship with" alongside "appears in a sentence with" and represent these new context-relationships as arbitrary affine transformations (basically, matrices). +\par Enforcing similarity: define an energy function $\mathcal{E} (S,R,T)$, energy is low if S is related to T through R is true (R is often non-symmetric). An example energy function is $\mathcal{E} (S,R,T \given \theta) = -\frac{\mathbf{v}_T \cdot G_R \mathbf{c}_S}{\snorm{\mathbf{v}_T}\snorm{G_R \mathbf{c}_S}}$ +\par "Off-task" data helps due to shared semantic information. + +\section{Time Series Analysis} +\par Auto-regressive model are suited for stationary time series. A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time. Most statistical forecasting methods are based on the assumption that the time series can be rendered approximately stationary (i.e., "stationarized") through the use of mathematical transformations. +\par \textbf{Classification of AR models} $AR(p)$: observed, continuous, RNN: hidden, continuous, Markov chain: observed, discrete, HMM: hidden, discrete + +\section{Survival Analysis} +\par The \textbf{log-rank test} compares the survival times of two or more groups. The null hypothesis for a log-rank test is that the groups have the same survival. +\\ +\textbf{Censoring}: +\ulb +\item Left censoring – a data point is below a certain value but it is unknown by how much. +\item Interval censoring – a data point is somewhere on an interval between two values. +\item Right censoring – a data point is above a certain value but it is unknown by how much. +\item Type I censoring occurs if an experiment has a set number of subjects or items and stops the experiment at a predetermined time, at which point any subjects remaining are right-censored. +\item Type II censoring occurs if an experiment has a set number of subjects or items and stops the experiment when a predetermined number are observed to have failed; the remaining subjects are then right-censored. +\item Random (or non-informative) censoring is when each subject has a censoring time that is statistically independent of their failure time. The observed value is the minimum of the censoring and failure times; subjects whose failure time is greater than their censoring time are right-censored. 
+\ule +\ulb +\item The lifetime distribution function, conventionally denoted F, is defined as the complement of the survival function: $F(t) = P(T \leq t) = 1 - S(t)$ +\item If F is differentiable then the derivative, which is the density function of the lifetime distribution, is conventionally denoted f: $f(t) = F'(t) = \frac{d}{dt}F(t)$ +\item The function f is sometimes called the event density; it is the rate of death or failure events per unit time. +\ule +\par The hazard function, $h(t)$, is the instantaneous rate at which events occur, given no previous events. (The hazard function, conventionally denoted $\lambda$ , is defined as the event rate at time t conditional on survival until time t or later (that is, $T \geq t$).) +\par Proportional hazards models are a class of survival models in statistics. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For example, taking a drug may halve one's hazard rate for a stroke occurring, or, changing the material from which a manufactured component is constructed may double its hazard rate for failure. +\par Kaplan-Meier curves and log-rank tests are most useful when the predictor variable is categorical (e.g., drug vs. placebo), or takes a small number of values (e.g., drug doses 0, 20, 50, and 100 mg/day) that can be treated as categorical. The log-rank test and KM curves don't work easily with quantitative predictors such as gene expression, white blood count, or age. For quantitative predictor variables, an alternative method is \textbf{Cox proportional hazards regression analysis}. + +\section{Privacy Preserving Methods for ML in Healthcare} +\par Anonymization refers to irreversibly severing a data set from the identity of the data contributor in a study to prevent any future re-identification, even by the study organizers under any condition. There's no re-identification of anonymized records, because the links back to the subjects are irreversibly broken. De-identification is also a severing of a data set from the identity of the data contributor, but may include preserving identifying information which can only be re-linked by a trusted party in certain situations. +\par Why generate synthetic data? +\ulb +\item Data could be shared and published without privacy concerns (e.g. scientific reproducibility) +\item Data can be used to augment or enrich similar datasets +\item Represents an alternative approach to build predictive systems +\item Can benefit medical community for use in medical training simulator +\ule +TRTS is not as interesting as the TSTR case as it cannot diagnose mode collapse. Evaluating synthetic datasets: Mechanical Turks when no domain knowledge is needed, Inception score for images +\par \textbf{Differential privacy} addresses the case when a trusted data curator wants to release some statistic over its data without revealing information about a particular value itself. It is a constraint on the algorithms used to publish aggregate information about a statistical database which limits the privacy impact on individuals whose information is in the database. Roughly, an algorithm is differentially private if an observer seeing its output cannot tell if a particular individual's information was used in the computation. 
+
+\section{Interpretability of ML Models}
+\par Random forests try to improve on bagging by "de-correlating" the trees.
+\par \textbf{Sensitivity Analysis of Individual Variables}: a global explanation method. We examine what impact each feature has on the model's prediction. Possible transformations during the analysis are sampling uniformly from the feature distribution, permuting the feature values, or replacing the values by the mean or zero.
+\par \textbf{Mean Decrease in Impurity (MDI)}, also called Gini importance, is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching it) averaged over all trees of the ensemble.
+\par Standard Encoder-Decoder framework: $q,f,g$ are nonlinear functions and $s_t$ is the hidden state of the decoder RNN. Here, the context vector $c$ is the same for all $y_t$.
+\par \textbf{Encoder-Decoder with Bahdanau attention:} $a$ is the so-called alignment model, jointly trained with all other components. Unlike the standard encoder-decoder, the probability in $g$ is conditioned on a distinct context vector $c_i$ for each target word $y_i$. The weight $\alpha_{ij}$ (computed from the energy $e_{ij}$) reflects the importance of the annotation $h_{j}$ w.r.t. the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$.
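+\par Concretely (standard Bahdanau formulation, in the notation above): $e_{ij} = a(s_{i-1}, h_j)$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$, and the context vector is $c_i = \sum_{j} \alpha_{ij} h_j$.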
+\par Show, attend and tell (two mechanisms for obtaining context vectors from annotation vectors):
+\ulb
+\item Hard (stochastic) attention: samples one annotation location at each point in time from a categorical distribution (over locations) parametrized by $\alpha$
+\item Soft attention: takes the expectation of the context vector directly (annotations weighted by $\alpha$)
+\ule
+
+\section{Appendix}
+\vspace{-0.5cm}
+\begin{small}
+\mcb{2}
+\ulb
+\item sensitivity, recall, hit rate, or true positive rate (\textbf{TPR})
+$$ \text{TPR} = \frac{\text{TP}}{P} = \frac{\text{TP}}{\text{TP + FN}} = 1 - \text{FNR} $$
+
+\item precision, positive predictive value (\textbf{PPV})
+$$ \text{PPV} = \frac{\text{TP}}{\text{TP + FP}} = 1 - \text{FDR} $$
+
+\item miss rate, false negative rate (\textbf{FNR})
+$$ \text{FNR} = \frac{\text{FN}}{P} = \frac{\text{FN}}{\text{FN + TP}} = 1 - \text{TPR} $$
+
+\item false discovery rate (\textbf{FDR})
+$$ \text{FDR} = \frac{\text{FP}}{\text{FP + TP}} = 1 - \text{PPV} $$
+
+\item accuracy (\textbf{ACC})
+$$ \text{ACC} = \frac{\text{TP + TN}}{P + N} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}} $$
+
+\item specificity, selectivity or true negative rate (\textbf{TNR})
+$$ \text{TNR} = \frac{\text{TN}}{N} = \frac{\text{TN}}{\text{TN + FP}} = 1 - \text{FPR} $$
+
+\item negative predictive value (\textbf{NPV})
+$$ \text{NPV} = \frac{\text{TN}}{\text{TN + FN}} = 1 - \text{FOR} $$
+
+\item fall-out, false positive rate (\textbf{FPR})
+$$ \text{FPR} = \frac{\text{FP}}{N} = \frac{\text{FP}}{\text{FP + TN}} = 1 - \text{TNR} $$
+
+\item false omission rate (\textbf{FOR})
+$$ \text{FOR} = \frac{\text{FN}}{\text{FN + TN}} = 1 - \text{NPV} $$
+
+\item \textbf{F1 score}
+$$ F_1 = 2 \cdot \frac{\text{PPV} \cdot \text{TPR}}{\text{PPV + TPR}} = \frac{\text{2TP}}{\text{2TP + FP + FN}} $$
+
+\ule
+\mce
+\end{small}
+\vspace{-0.5cm}
+\par It can be more flexible to predict the probability of an observation belonging to each class in a classification problem rather than predicting classes directly. This is needed when the cost of one type of error outweighs the cost of the other. For example, in a smog prediction system, we may be far more concerned with having low false negatives than low false positives. A false negative would mean not warning about a high-smog day when in fact it is one, leading to health issues among the public, who are unable to take precautions. A false positive means the public would take precautionary measures when they didn't need to.
+\par The \textbf{ROC curve} is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. ROC is a probability curve and AUC represents the degree of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s; by analogy, the better it is at distinguishing patients with disease from those without. Area under the ROC Curve (AUROC) is robust to imbalanced classes (for example, mortality has 2\% positive examples).
+\par Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. ROC curves are appropriate when the observations are balanced between the classes, whereas \textbf{precision-recall curves} are appropriate for imbalanced datasets. Key to the calculation of precision and recall is that it does not make use of the true negatives; it is only concerned with the correct prediction of the minority class, class 1. Area under the Precision-Recall Curve (AUPRC) quantifies the tradeoff between sensitivity and false discovery, which is relevant in a clinical setting.
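+\par Worked example (illustrative numbers, not from the lecture): with $\text{TP}{=}8$, $\text{FP}{=}2$, $\text{FN}{=}4$, $\text{TN}{=}86$: $\text{PPV} = 8/10 = 0.8$, $\text{TPR} = 8/12 \approx 0.67$, $F_1 = 16/22 \approx 0.73$, while $\text{ACC} = 94/100 = 0.94$ is inflated by the many true negatives, which is exactly why precision-recall is preferred for imbalanced data.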
+\par One reason to use the \textbf{logarithmic scale} is to respond to skewness towards large values, i.e., cases in which one or a few points are much larger than the bulk of the data. Log scales allow a large range to be displayed without small values being compressed into the bottom of the graph. Another reason is to show percent change or multiplicative factors. On a linear scale, even if growth in percentage terms has been fairly constant, the curve will appear to grow most rapidly at the right-hand end. With a logarithmic scale a constant percentage change is seen as a constant vertical distance, so a constant growth rate is seen as a straight line. That is often a substantial advantage. In short, a logarithmic axis linearizes compound interest and exponential growth. A logarithmic axis is also useful for plotting ratios: ratios are intrinsically asymmetric on a linear scale, but symmetric on a log scale.
+\par \textbf{Digital phenotyping} is a multidisciplinary field of science, defined as the "moment-by-moment quantification of the individual-level human phenotype in situ using data from personal digital devices", in particular smartphones.
+
+\end{document} \ No newline at end of file diff --git a/src/pai-cheatsheet.docx b/src/pai-cheatsheet.docx new file mode 100644 index 0000000..340c8a1 Binary files /dev/null and b/src/pai-cheatsheet.docx differ