Principal Component Analysis


Typical data science problems have a huge number of input features; this number is often a deciding factor for model performance and makes it difficult to visualize and structure the data. Reducing the number of input features is therefore often helpful or even necessary. The central questions are: Which are the important features? Which ones can be dropped without losing too much information?
There are quite a few techniques to address these questions.

Classical techniques address the problem in two ways:
Backward elimination builds a model on all features and then repeatedly removes the feature whose removal increases the error the least. Forward selection goes the other way round: it adds features stepwise, always choosing the feature that improves the model the most.
Stepwise regression combines the two techniques above: after adding a feature, a check is run to decide which features can be removed without increasing the error.
The main problem in all of these approaches is deciding when the algorithm should stop. Another problem is that all variables are treated as independent, which in general is not the case.


More modern techniques select subsets of features via filters or wrappers. These techniques give insight into the relations between variables, but they are computationally expensive (note that the number of subsets grows far faster than the number of features!).


I want to present another approach, using the singular value decomposition of the covariance matrix:
Assume an $n$-dimensional training data set of size $m$ that is mean-normalized and feature-scaled; every data point $x = (x_1, ..., x_n)$ has $n$ components.
In the first step, calculate the covariance matrix $$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T,$$ which is an $(n \times n)$ matrix. Then determine the decomposition $$\Sigma = U S V^*$$ with unitary matrices $U$ and $V$ and a diagonal matrix $S$ (use a library routine for this).
From the resulting matrix $U$, take only the first $k < n$ columns to form a matrix $U_k$.
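
As a minimal sketch of these two steps in Python with NumPy (the example data `X`, its shape and the value of `k` are assumptions for illustration; the data is taken as already mean-normalized and feature-scaled):

```python
import numpy as np

# Assumed example data: m samples with n features each,
# already mean-normalized and feature-scaled.
m, n = 100, 10
X = np.random.randn(m, n)

# Covariance matrix Sigma = (1/m) * sum_i x_i x_i^T, an (n x n) matrix.
Sigma = (X.T @ X) / m

# Singular value decomposition Sigma = U S V* via a library routine;
# np.linalg.svd returns the diagonal of S as a 1-D array.
U, S, Vt = np.linalg.svd(Sigma)

# Keep only the first k < n columns of U.
k = 3
U_k = U[:, :k]   # shape (n, k)
```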

Then use $$z = U_{k}^T x$$ with $k < n$ new features $z = (z_1, ..., z_k)$ instead of the existing $n$ ones.

To get back from the new features $z$ to (an approximation of) the original $x$, use
$$x \approx x_{approx} = U_{k} z,$$ which only works well if $k$ is properly chosen.
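
Continuing the NumPy sketch above, projection and reconstruction of a single data point look like this:

```python
# Project one data point x (n components) down to k new features.
x = X[0]              # shape (n,)
z = U_k.T @ x         # shape (k,): the new features z_1, ..., z_k

# Approximate reconstruction of the original point.
x_approx = U_k @ z    # shape (n,), close to x if k is chosen well
```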

How should $k$ be chosen?
Choose $k$ as the smallest value such that the average squared projection error divided by the total variation in the data is below a bound $\epsilon$, e.g. $\epsilon = 0.01$ to say that "99% of the variance is retained":

$$\frac{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2} \le \epsilon$$
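
For a given $k$, this criterion can be checked directly by projecting and reconstructing the whole training set at once (continuing the sketch above; the threshold 0.01 corresponds to retaining 99% of the variance):

```python
# Project and reconstruct all m data points for the candidate k.
Z = X @ U_k              # shape (m, k)
X_approx = Z @ U_k.T     # shape (m, n)

# Average squared projection error divided by the total variation;
# the factors 1/m cancel, so plain sums are enough.
ratio = np.sum((X - X_approx) ** 2) / np.sum(X ** 2)
k_is_good_enough = ratio <= 0.01
```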
There is an even faster, direct way to determine $k$, using the diagonal matrix $S$:
Choose the smallest value of $k$ such that
$$1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le \epsilon$$ or equivalently $$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 1 - \epsilon$$
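
A sketch of this shortcut, assuming the 1-D array `S` of diagonal entries returned by `np.linalg.svd` above:

```python
# Fraction of retained variance for k = 1, ..., n.
retained = np.cumsum(S) / np.sum(S)

# Smallest k whose retained variance is at least 1 - epsilon.
epsilon = 0.01
k = int(np.argmax(retained >= 1 - epsilon)) + 1   # +1: k is 1-based
```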


Notes: 
- PCA should be applied to the training set to determine the parameters of the mapping from $x$ to $z$. The reduced matrix $U_k$ can then still be used to map the data in the cross-validation and test sets (see the sketch after these notes).
- PCA should not be used to address overfitting (information is discarded without making use of the existing labels); the better solution is regularization with a parameter $\lambda$.
- PCA is often not needed, so the recommended approach is to first run the algorithm without PCA and only apply it if it is really needed (for visualization or performance).
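
A minimal sketch of the first note, with made-up training and test matrices: the mean and $U_k$ are computed on the training set only and then reused unchanged for the other sets.

```python
import numpy as np

# Assumed example data (for illustration only).
X_train = np.random.randn(100, 10)
X_test = np.random.randn(20, 10)
k = 3

# Determine the mapping on the training set: mean, covariance, SVD.
mu = X_train.mean(axis=0)
X_centered = X_train - mu
Sigma = (X_centered.T @ X_centered) / X_train.shape[0]
U, S, Vt = np.linalg.svd(Sigma)
U_k = U[:, :k]

# Reuse the same mu and U_k for the cross-validation / test data.
Z_train = (X_train - mu) @ U_k
Z_test = (X_test - mu) @ U_k
```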