One promising way to describe a subject more precisely is to describe it with more variables, in other words, in higher dimensions. The variables or dimensions that characterize a subject are referred to as features.
On the other hand, data represented in higher dimensions may not carry essential features in every dimension. When the data is redundant in the defined dimensions and can actually be represented in fewer dimensions, the operation of obtaining such an essential representation of the subject is called dimensionality reduction.
One of the basic methods for such dimensionality reduction is a statistical method introduced by Karl Pearson in 1901 [1]. This method, now called Principal Component Analysis (PCA), later became known as one of the most fundamental methods in machine learning as computers became practical. This article describes the computation of PCA.
First, we introduce a dataset. A dataset is a set of some or all of the data sampled for given subjects. For later calculations, we will formulate the dataset.
Let $N$ be the number of samples in the dataset and $n$ the number of dimensions (features), and write the $k$-th sample as the vector

$$\mathbf{x}_k = (x_{k1}, x_{k2}, \ldots, x_{kn}).$$

Denote the whole dataset by the $N \times n$ matrix whose $k$-th row is $\mathbf{x}_k$:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nn} \end{pmatrix}.$$
The vector and matrix representations defined here are used in the following calculations. As an example, take the two-dimensional dataset used by Pearson [1], which is an example with $n = 2$.
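Since Pearson's original numerical values are not reproduced in this article, the code examples below use a small synthetic two-dimensional array as a stand-in dataset (an arbitrary choice for illustration only, not Pearson's data):

import numpy as np

# Stand-in two-dimensional dataset (not Pearson's actual values):
# N samples in the rows, n dimensions in the columns.
X = np.array([[0.0, 1.1], [1.0, 1.9], [2.0, 3.2], [3.0, 3.8],
              [4.0, 5.1], [5.0, 5.9], [6.0, 7.2], [7.0, 7.8]])
N, n = X.shape   # N = 8, n = 2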
In preparation for the computation of PCA, we introduce the basic statistics that evaluate the features contained in the dataset, namely variance and covariance.
The variance is the sum of the squared distances from the mean, averaged over the samples (here with the unbiased $N-1$ denominator, to match the computation later). Writing the mean of the $i$-th dimension as

$$\bar{x}_i = \frac{1}{N} \sum_{k=1}^{N} x_{ki},$$

the variance of the $i$-th dimension $X_i$ is

$$\mathrm{Var}(X_i) = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \bar{x}_i)^2. \tag{1}$$
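As a quick check of this formula (a sketch using an arbitrary one-dimensional array, unrelated to the dataset above), numpy's unbiased sample variance reproduces equation (1):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])    # arbitrary stand-in values
# Equation (1): squared deviations from the mean, divided by N - 1.
var_manual = np.sum((x - np.mean(x)) ** 2) / (len(x) - 1)
assert np.isclose(var_manual, np.var(x, ddof=1))           # ddof=1 uses the same N - 1 denominator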
In general, a variable with a small variance carries little characteristic information. In particular, if the variance is zero, the variable can be ignored as a feature. However, since random noise also has a certain variance, a large variance does not necessarily mean that the variable represents the features well.
Next we introduce the covariance, which evaluates the association between two dimensions of the dataset. The covariance of the $i$-th and $j$-th dimensions $X_i$ and $X_j$ is

$$\mathrm{Cov}(X_i, X_j) = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j). \tag{2}$$

By definition, $\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$ and $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$.
Unlike the variance, the covariance can also take negative values; it is
- zero if $X_i$ and $X_j$ are independent,
- positive if $X_i$ and $X_j$ tend to increase or decrease together,
- negative when one tends to increase as the other decreases.
Therefore, if the absolute value of the covariance of two dimensions is large, we can say that the two dimensions share certain characteristics.
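The same kind of check works for the covariance (again a sketch with arbitrary stand-in values); np.cov returns the full 2 × 2 covariance matrix of the two variables, whose off-diagonal entry is the covariance of equation (2):

import numpy as np

xi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # arbitrary stand-in values
xj = np.array([2.1, 1.9, 3.2, 4.0, 4.8])
# Equation (2): products of deviations from the means, divided by N - 1.
cov_manual = np.sum((xi - xi.mean()) * (xj - xj.mean())) / (len(xi) - 1)
assert np.isclose(cov_manual, np.cov(xi, xj)[0, 1])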
So far we have considered the relationships between individual features; we now turn to the dataset as a whole.
That is, using the variances and covariances defined in (1) and (2), we construct the matrix

$$C = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Var}(X_n) \end{pmatrix},$$

which is called the covariance matrix of the dataset. Since $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$, $C$ is a real symmetric matrix.
In the previous section, we obtained a covariance matrix $C$ whose entries are the variances and covariances of the dataset.
To simplify the later calculation of the covariances, subtract the mean of each dimension from the data (mean centering), $x_{ki} \leftarrow x_{ki} - \bar{x}_i$, so that every dimension has mean zero.
Therefore, since $\bar{x}_i = 0$ after centering, the covariance (2) reduces to

$$\mathrm{Cov}(X_i, X_j) = \frac{1}{N-1} \sum_{k=1}^{N} x_{ki} x_{kj},$$

and the covariance matrix of the centered dataset can be written compactly as

$$C = \frac{1}{N-1} X^{\top} X. \tag{3}$$
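Equation (3) can be checked numerically against numpy's built-in covariance (a sketch with an arbitrary random stand-in matrix A, so as not to overwrite the dataset X):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))              # arbitrary stand-in: 10 samples, 3 dimensions
Ac = A - A.mean(axis=0)                   # mean centering
C_manual = Ac.T @ Ac / (len(Ac) - 1)      # equation (3)
assert np.allclose(C_manual, np.cov(A, rowvar=False))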
The essential computation of PCA is the eigendecomposition of the covariance matrix $C$.
We mentioned above that the covariance matrix $C$ is a real symmetric matrix; in particular, it is a Hermitian matrix.
The eigenvectors of a Hermitian matrix are known to be orthogonal to each other, and since eigenvectors are determined only up to scalar multiples, we can take them to be orthonormal. If we define a matrix $W$ whose columns are these orthonormal eigenvectors, $W$ is an orthogonal matrix, i.e. $W^{\top} W = I$.
Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ be the eigenvalues of $C$ in descending order and $\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_n$ the corresponding orthonormal eigenvectors, so that $W = (\mathbf{w}_1 \; \mathbf{w}_2 \; \cdots \; \mathbf{w}_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Then the eigendecomposition of the covariance matrix is written as

$$C = W \Lambda W^{\top}.$$
An intuitive understanding of the eigendecomposition of the covariance matrix is that, by mapping the data onto the orthonormal basis of eigenvectors, dimensions that shared features are recombined so that each new dimension represents a mutually uncorrelated feature.
The projection of the dataset onto this basis is obtained by

$$T = X W,$$
and the inverse transformation of the projection is given by

$$X = T W^{\top},$$

since $W$ is orthogonal.
Here we show an example of the computation using the Python numerical package numpy. As mentioned in the previous section, the PCA computation is based on the eigendecomposition of the covariance matrix.
import numpy as np

# X is the (N, n) data matrix: N samples in rows, n dimensions in columns.
N, n = X.shape
# Mean centering.
mu = np.mean(X, axis=0)
X = X - mu
# Covariance matrix, equation (3).
C = X.T @ X / (N - 1)
# Eigendecomposition of the covariance matrix.
L, W = np.linalg.eig(C)
# Sort eigenvectors in the descending order of eigenvalues.
W = W[:, np.flip(np.argsort(L))]
# Projection onto the principal components.
T = X @ W
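A few consistency checks on the result (assuming the variables X, C, L, W, and T from the block above):

# The projection is invertible because W is orthogonal: X = T W^T.
assert np.allclose(X, T @ W.T)
# The eigenvalues are the variances of the projected dimensions.
assert np.allclose(np.flip(np.sort(L)), np.var(T, axis=0, ddof=1))
# The eigendecomposition reconstructs the covariance matrix: C = W Λ W^T.
assert np.allclose(C, W @ np.diag(np.flip(np.sort(L))) @ W.T)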
By the way, the computational complexity of the eigendecomposition of a square matrix of order $n$ is $O(n^3)$, so working with the covariance matrix directly becomes costly as the number of dimensions grows. In practice, the same projection is therefore often computed through the Singular Value Decomposition (SVD) of the data matrix $X$.
The SVD of the (centered) data matrix $X$ is given by

$$X = U \Sigma V^{\top}.$$

Here, $U$ is an $N \times N$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, and $\Sigma$ is an $N \times n$ rectangular diagonal matrix whose diagonal entries are the singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0$.
Since the diagonal entries of $\Sigma$ are already sorted in descending order, and since

$$C = \frac{1}{N-1} X^{\top} X = V \, \frac{\Sigma^{\top} \Sigma}{N-1} \, V^{\top},$$

the columns of $V$ are the orthonormal eigenvectors of the covariance matrix, and the projection is obtained directly as

$$T = X V = U \Sigma V^{\top} V = U \Sigma,$$

without computing the covariance matrix explicitly.
# Singular value decomposition of the centered data matrix.
U, S, Vt = np.linalg.svd(X)
# Singular values are already sorted in descending order.
assert np.allclose(S, np.flip(np.sort(S)))
# Projected coordinates T = U * S.
T = U[:, :n] @ np.diag(S)
# Equivalently, U * S = X * V.
T = X @ Vt.T
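The two routes give the same projected coordinates up to the sign of each column, since eigenvectors and singular vectors are only determined up to sign (a quick check assuming W from the eigendecomposition above and distinct eigenvalues):

# Compare the eigendecomposition route and the SVD route column by column.
T_eig = X @ W
T_svd = X @ Vt.T
assert np.allclose(np.abs(T_eig), np.abs(T_svd))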
The eigenvalue $\lambda_i$ of the covariance matrix is related to the singular value $\sigma_i$ by $\lambda_i = \sigma_i^2 / (N-1)$, and it equals the variance of the $i$-th projected dimension. Thus, the proportion of the variance carried by each projected dimension can also be calculated from the singular values.
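For example, the proportion of variance explained by each projected dimension follows directly from the singular values, because the common factor $1/(N-1)$ cancels in the ratio (a sketch assuming S and N from the code above):

# Eigenvalues of the covariance matrix recovered from the singular values.
lam = S ** 2 / (N - 1)
# Proportion of the total variance carried by each projected dimension.
explained_ratio = lam / np.sum(lam)     # equal to S**2 / np.sum(S**2)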
Now let us actually carry out the calculation for the Pearson dataset. Computing the covariance matrix of the centered dataset with equation (3), we obtain
Since
Next, computing the eigenvectors, we obtain
Since
Since
Let us plot the dataset projected onto these eigenvectors.
Figure 1. The Pearson dataset projected onto its principal components.
It can be seen that the variance of the projected data is concentrated in the first dimension.
Applications of PCA include quantitative structure-activity relationship (QSAR) analysis of small molecules, the classification of diseases based on gene expression levels, and so on.
In 1999, Golub et al. classified two leukemia phenotypes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), using clustering based on gene expression levels [2]. The dataset used for training consisted of about 7,000 gene expression levels in 38 patients. The result of unsupervised learning on this dataset with PCA is shown in Figure 2.
Figure 2. Two-dimensional projection of gene expression dataset for ALL and AML patients using PCA.
It can be seen that the ALL and AML samples of the training dataset are projected in a way that makes them almost linearly separable, even though no class labels were used in the computation; in this sense, PCA is an example of unsupervised learning.
While a gene expression dataset is high-dimensional in the number of genes, it is not uncommon for the sample size to be an order of magnitude smaller because of the constraints of clinical research. In such cases, PCA, which extracts significant features by dimensionality reduction, is one of the effective methods.
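As an illustration of this regime (a minimal sketch using a random matrix of roughly the same shape as the Golub training set as a stand-in, not the actual expression data), the SVD route handles the case of far fewer samples than dimensions without ever forming the $n \times n$ covariance matrix:

import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(38, 7129))                   # stand-in: 38 samples, ~7,000 genes
Gc = G - G.mean(axis=0)                           # mean centering
# full_matrices=False keeps U at shape (38, 38) and Vt at (38, 7129).
U, S, Vt = np.linalg.svd(Gc, full_matrices=False)
T2 = (U * S)[:, :2]                               # two-dimensional projection as in Figure 2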
Later, this kind of feature extraction came to be performed by neural networks, as part of deep learning. In particular, we note that PCA is essentially equivalent to a linear autoencoder: an autoencoder with linear layers and squared reconstruction error learns the same subspace as the leading principal components.
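As a sketch of that relationship (assuming PyTorch is available; this illustration is not part of the original article), a two-layer linear autoencoder trained with squared reconstruction error approximately learns the top-k principal subspace, although its weights are generally not the orthonormal principal directions themselves:

import numpy as np
import torch

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))    # arbitrary stand-in data
Yc = torch.tensor(Y - Y.mean(axis=0), dtype=torch.float32) # mean-centered samples

k = 2                                                      # number of components to keep
encoder = torch.nn.Linear(5, k, bias=False)
decoder = torch.nn.Linear(k, 5, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    loss = ((decoder(encoder(Yc)) - Yc) ** 2).mean()       # squared reconstruction error
    loss.backward()
    opt.step()
# decoder.weight (shape (5, k)) spans approximately the same subspace
# as the top-k eigenvectors of the covariance matrix of Y.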