Principal Components Analysis Explained

$$%---- MACROS FOR SETS ----% \newcommand{\znz}{\mathbb{Z} / #1 \mathbb{Z}} \newcommand{\twoheadrightarrowtail}{\mapsto\mathrel{\mspace{-15mu}}\rightarrow} % popular set names \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\I}{\mathbb{I}} % popular vector space notation \newcommand{\V}{\mathbb{V}} \newcommand{\W}{\mathbb{W}} \newcommand{\B}{\mathbb{B}} \newcommand{\D}{\mathbb{D}} %---- MACROS FOR FUNCTIONS ----% % linear algebra \newcommand{\T}{\mathrm{T}} \renewcommand{\ker}{\mathrm{ker}} \newcommand{\range}{\mathrm{range}} \renewcommand{\span}{\mathrm{span}} \newcommand{\rref}{\mathrm{rref}} \renewcommand{\dim}{\mathrm{dim}} \newcommand{\col}{\mathrm{col}} \newcommand{\nullspace}{\mathrm{null}} \newcommand{\row}{\mathrm{row}} \newcommand{\rank}{\mathrm{rank}} \newcommand{\nullity}{\mathrm{nullity}} \renewcommand{\det}{\mathrm{det}} \newcommand{\proj}{\mathrm{proj}} \renewcommand{\H}{\mathrm{H}} \newcommand{\trace}{\mathrm{trace}} \newcommand{\diag}{\mathrm{diag}} \newcommand{\card}{\mathrm{card}} \newcommand\norm{\left\lVert#1\right\rVert} % differential equations \newcommand{\laplace}{\mathcal{L}\{#1\}} \newcommand{\F}{\mathrm{F}} % misc \newcommand{\sign}{\mathrm{sign}} \newcommand{\softmax}{\mathrm{softmax}} \renewcommand{\th}{\mathrm{th}} \newcommand{\adj}{\mathrm{adj}} \newcommand{\hyp}{\mathrm{hyp}} \renewcommand{\max}{\mathrm{max}} \renewcommand{\min}{\mathrm{min}} \newcommand{\where}{\mathrm{\ where\ }} \newcommand{\abs}{\vert #1 \vert} \newcommand{\bigabs}{\big\vert #1 \big\vert} \newcommand{\biggerabs}{\Bigg\vert #1 \Bigg\vert} \newcommand{\equivalent}{\equiv} \newcommand{\cross}{\times} % statistics \newcommand{\cov}{\mathrm{cov}} \newcommand{\var}{\mathrm{var}} \newcommand{\bias}{\mathrm{bias}} \newcommand{\E}{\mathrm{E}} \newcommand{\prob}{\mathrm{prob}} \newcommand{\unif}{\mathrm{unif}} \newcommand{\invNorm}{\mathrm{invNorm}} \newcommand{\invT}{\mathrm{invT}} % real analysis \renewcommand{\sup}{\mathrm{sup}} \renewcommand{\inf}{\mathrm{inf}} %---- MACROS FOR ALIASES AND REFORMATTING ----% % logic \newcommand{\forevery}{\ \forall\ } \newcommand{\OR}{\lor} \newcommand{\AND}{\land} \newcommand{\then}{\implies} % set theory \newcommand{\impropersubset}{\subseteq} \newcommand{\notimpropersubset}{\nsubseteq} \newcommand{\propersubset}{\subset} \newcommand{\notpropersubset}{\not\subset} \newcommand{\union}{\cup} \newcommand{\Union}{\bigcup\limits_{#1}^{#2}} \newcommand{\intersect}{\cap} \newcommand{\Intersect}{\bigcap\limits_{#1}^{#2}} \newcommand{\intersection}{\bigcap\limits_{#1}^{#2}} \newcommand{\Intersection}{\bigcap\limits_{#1}^{#2}} \newcommand{\closure}{\overline} \newcommand{\compose}{\circ} % linear algebra \newcommand{\subspace}{\le} \newcommand{\angles}{\langle #1 \rangle} \newcommand{\identity}{\mathbb{1}} \newcommand{\orthogonal}{\perp} \renewcommand{\parallel}{#1^{||}} % calculus \newcommand{\integral}{\int\limits_{#1}^{#2}} \newcommand{\limit}{\lim\limits_{#1}} \newcommand{\approaches}{\rightarrow} \renewcommand{\to}{\rightarrow} \newcommand{\convergesto}{\rightarrow} % algebra \newcommand{\summation}{\sum\limits_{#1}^{#2}} \newcommand{\product}{\prod\limits_{#1}^{#2}} \newcommand{\by}{\times} \newcommand{\integral}{\int_{#1}^{#2}} % exists commands \newcommand{\notexist}{\nexists\ } \newcommand{\existsatleastone}{\exists\ } \newcommand{\existsonlyone}{\exists!} \newcommand{\existsunique}{\exists!} \let\oldexists\exists \renewcommand{\exists}{\ \oldexists\ } % statistics \newcommand{\distributed}{\sim} \newcommand{\onetoonecorresp}{\sim} \newcommand{\independent}{\perp\!\!\!\perp} \newcommand{\conditionedon}{\ |\ } \newcommand{\given}{\ |\ } \newcommand{\notg}{\ngtr} \newcommand{\yhat}{\hat{y}} \newcommand{\betahat}{\hat{\beta}} \newcommand{\sigmahat}{\hat{\sigma}} \newcommand{\muhat}{\hat{\mu}} \newcommand{\transmatrix}{\mathrm{P}} \renewcommand{\choose}{\binom} % misc \newcommand{\infinity}{\infty} \renewcommand{\bold}{\textbf} \newcommand{\italics}{\textit}$$

Principal Components Analysis (PCA for short) is a technique used to reduce the dimensions of a data set. There are a few helpful ways (at least for the math literate) to explain what PCA does:

• PCA computes the most meaningful basis to re-express a noisy, garbled data set.
• PCA yields a feature subspace that maximizes the variance along the axes.
• PCA reduces the dimensionality of the original feature space by projecting it onto a smaller subspace, where the eigenvectors (which all have the same unit length of 1) will form the axes.

PCA is an extremely useful technique in data science. Hi-resolution images can have thousands of dimensions (MRI scans are huge) and those familiar with “the curse of dimensionality” know that certain algorithms are only approachable when one’s data set is of a manageable size. PCA employs a combination of topics like correlation, eigenvalues, and eigenvectors to determine which variables account for most of the variability in the data set.

Eigenvalues and Eigenvectors Introduction

Let’s say you have a transformation

$$\mathrm{T} = \begin{bmatrix} 7 & -1 \\ -1 & 7 \\ \end{bmatrix}$$.

To find the eigenvalues, $\lambda$s, of the transformation:

$$\mathrm{T}x = \lambda x$$ $$\mathrm{T}x = \lambda \mathrm{I} x$$ $$\mathrm{T}x - \lambda \mathrm{I} x = 0$$ $$(\mathrm{T} - \lambda \mathrm{I})x = 0 : x \ne 0$$ $$x \in \mathrm{ker}(\mathrm{T} - \lambda \mathrm{I})$$ $$\mathrm{det}(\mathrm{T} - \lambda \mathrm{I}) = 0$$ $$\mathrm{det}(\lambda \mathrm{I} - \mathrm{T}) = 0$$ $$\mathrm{det}\big(\lambda \big(\begin{bmatrix} 1 & 0\\\ 0 & 1 \end{bmatrix}\big) - \begin{bmatrix} 7 & -1\\\ -1 & 7 \end{bmatrix}\big) = 0$$ $$\mathrm{det}\big(\begin{bmatrix} \lambda & 0\\\ 0 & \lambda \end{bmatrix} - \begin{bmatrix} 7 & -1\\\ -1 & 7 \end{bmatrix}\big) = 0$$ $$\mathrm{det}\big(\begin{bmatrix} \lambda-7 & 0\\\ 0 & \lambda-7 \end{bmatrix}\big) = 0$$ $$(\lambda - 7)^2 - 1 = 0$$ $$\lambda^2 - 14 \lambda + 49 - 1 = 0$$ $$\lambda^2 - 14 \lambda + 48 = 0$$ $$(\lambda - 6)(\lambda - 8) = 0$$ $$\lambda = 6, \lambda = 8$$

To find the eigenvectors associated with the eigenvalues, we need only plug each eigenvalue into the below equation and solve for the vector that provides the solution:

$$\lambda = 6$$ $$(6 \mathrm{I} - \mathrm{T})x = 0$$

$$\begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} 0 \\\ 0 \end{bmatrix}$$

rref(above) and solve for pivot variables to get the following set of vectors as a solution:

$$\begin{bmatrix} a \\\ a \end{bmatrix} : a \in \mathbb{R} = a\begin{bmatrix} 1 \\\ 1 \end{bmatrix} : a \in \mathbb{R} = \mathrm{span}(\begin{bmatrix} 1 \\\ 1\end{bmatrix})$$

We can repeat the above steps with $\lambda = 8$ and get the set of vectors: $$\mathrm{span}\big(\begin{bmatrix} 1 \\\ -1\end{bmatrix}\big)$$

Now let’s verify this in Python.

from IPython.core.display import display
import numpy as np
from numpy import linalg as LA
import math
# create a numpy matrix for our transformation
T = np.array([[7, -1], [-1, 7]])
# get eigenvalues and eigenvectors
eigenvalues, eigenvectors = LA.eig(T)
print("eigenvalues: %s\n" %eigenvalues)
# eigenvectors will be normalized
print("normalized eigenvectors:\n %s\n" %eigenvectors)
#print(eigenvectors)
# un-normalized eigenvectors
normalized_eigenvectors = eigenvectors*math.sqrt(2)
print("un-normalized eigenvectors:\n %s" %normalized_eigenvectors)
eigenvalues: [ 8. 6.]

normalized eigenvectors:
[[ 0.70710678 0.70710678]
[-0.70710678 0.70710678]]

un-normalized eigenvectors:
[[ 1. 1.]
[-1. 1.]]

PCA the Hard Way

1. Normalize data set (always a good idea to do so that certain algorithms aren’t affected by the scale of different variables)
2. Calculate the correlation matrix for your dataset. This matrix will be your transformation and will provide the correlation between each pair of variables. (We use correlation instead of covariance since correlation is a normalized version of the covariance matrix; although we’re not too concerned about normalization since we already normalized the data set in step 1).
3. Find the eigenvectors and eigenvalues of the correlation matrix using the steps above.
4. For each eigenvalue, take the absolute value and divide it by the sum of all the eigenvalues which will provide the proportion of variance that its associated eigenvector contributes to the data.
5. Sort the eigenvectors (i.e. principal components) by their associated eigenvalues (highest to lowest).
6. Eigenvalues tell you which principal components to keep. The number of principal components to keep will be based on the cumulative explained variance that the eigenvalues account for. The cumulative explained variance to stop at is a threshold that is decided by how much variability you want to explain.
7. Stick each eigenvector you want to keep in a matrix that we call a “feature vector”.
8. Now project the data set onto the new feature space (i.e. the feature vector) by multiplying your data set by the feature vector.
import pandas as pd
filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
sep=',')
print("Original data set:")
display(df.tail())
# normalize data set
data_set = df.ix[:,0:3].values
data_set = StandardScaler().fit_transform(data_set)
# calculate correlation matrix
correlation_matrix = np.corrcoef(data_set.T)
print("Correlation matrix:\n%s\n" %correlation_matrix)
# find eigenvalues and eigenvectors
eigenvalues, eigenvectors = LA.eig(correlation_matrix)
print("Eigenvectors:\n%s\n" %eigenvectors)
# make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:,i]) for i in range(len(eigenvalues))]
# sort the (eigenvalue, eigenvector) tuples from highest to lowest
eigen_pairs.sort()
eigen_pairs.reverse()
print('Eigenvalues in descending order:')
for pair in eigen_pairs:
print(pair)

Original data set:

0 1 2 3 4
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
Correlation matrix:
[[ 1. -0.10936925 0.87175416 0.81795363]
[-0.10936925 1. -0.4205161 -0.35654409]
[ 0.87175416 -0.4205161 1. 0.9627571 ]
[ 0.81795363 -0.35654409 0.9627571 1. ]]

Eigenvectors:
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]

Eigenvalues in descending order:
2.91081808375
0.921220930707
0.147353278305
0.0206077072356
# find explained variance of eigenvalues
eigenvalues_sum = sum(eigenvalues)
explained_variance = [(eigenvalue / eigenvalues_sum) for eigenvalue in sorted(eigenvalues, reverse=True)]
for i in explained_variance:
print(i)
0.727704520938
0.230305232677
0.0368383195763
0.00515192680891

It appears the first two eigenvectors account for the majority of the variance in the data set. Let’s just use the two eigenvectors associated with those first two eigenvalues, stick them in a matrix, and apply said matrix to the data set.

# put eigenvectors in matrix
feature_vector = np.hstack((eigen_pairs.reshape(4,1),
eigen_pairs.reshape(4,1)))
# project data set onto feature_vector
pca_data_set = pd.DataFrame(data_set.dot(feature_vector))
print("Dimensionally-reduced data set (tail):")
display(pca_data_set.tail())

Dimensionally-reduced data set (tail):

0 1
145 1.870522 -0.382822
146 1.558492 0.905314
147 1.520845 -0.266795
148 1.376391 -1.016362
149 0.959299 0.022284

PCA the Easy Way

Now that we understand the mechanics of Principal Components Analysis, we can use convenience functions in sklearn to do all the heavy lifing for us.

# scikit-learn convenience command
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.preprocessing import StandardScaler
# standardize data set (sklearn uses covariance matrix instead of correlation matrix)
data_set = df.ix[:,0:3].values
data_set_std = StandardScaler().fit_transform(data_set)
sklearn_pca = sklearnPCA(n_components=4)
sklearn_pca.fit(data_set_std)
# covariance matrix
covariance_matrix = sklearn_pca.get_covariance()
print("Covariance matrix:\n%s\n" %covariance_matrix)
# eigenvectors
eigenvectors = sklearn_pca.components_.T
print("Eigenvectors:\n%s\n" %eigenvectors)
# eigenvalues
explained_variance = sklearn_pca.explained_variance_
print("Explained Variance:\n%s\n" %explained_variance)

# explained variance shows we really only need the eigenvectors associated with the first two eigenvalues
sklearn_pca = sklearnPCA(n_components=2)
sklearn_pca.fit(data_set_std)
# apply the feature vector to the data set
pca_data_set = pd.DataFrame(sklearn_pca.transform(data_set_std))
print("Dimensionally-reduced data set (tail):")
display(pca_data_set.tail())
Covariance matrix:
[[ 1.         -0.10936925  0.87175416  0.81795363]
[-0.10936925  1.         -0.4205161  -0.35654409]
[ 0.87175416 -0.4205161   1.          0.9627571 ]
[ 0.81795363 -0.35654409  0.9627571   1.        ]]

Eigenvectors:
[[ 0.52237162  0.37231836 -0.72101681 -0.26199559]
[-0.26335492  0.92555649  0.24203288  0.12413481]
[ 0.58125401  0.02109478  0.14089226  0.80115427]
[ 0.56561105  0.06541577  0.6338014  -0.52354627]]

Explained Variance:
[ 2.91081808  0.92122093  0.14735328  0.02060771]

Dimensionally-reduced data set (tail):

0 1
145 1.870522 -0.382822
146 1.558492 0.905314
147 1.520845 -0.266795
148 1.376391 -1.016362
149 0.959299 -0.022284