Principal Component Analysis


Typical data science problems have a huge number of input features; they are often deciding factors for model performance and make it difficult to visualize and structure the data. Reducing the number of input features is therefore often helpful or even necessary. The central questions are: Which features are important? Which ones can be dropped without losing too much information?
There are quite a few techniques to address these questions.

Classical techniques addressed the problem in two ways:
Backward elimination starts with a model built on all features and then repeatedly deletes the feature whose removal increases the error the least. Forward selection goes the other way round: it adds features stepwise, always choosing the feature that improves the model the most.
Stepwise regression combines the two approaches: after adding a feature, a check is executed to decide which features can be deleted without increasing the error.
The main problem with all of these approaches is deciding when the algorithm should stop. Another problem is that all variables are treated as independent, which in general is not the case.


More modern techniques collect subsets of features via filters or wrappers. These techniques allow insights into the relations between variables, but they are quite computationally expensive (note that the number of subsets grows far faster than the number of features!).


I want to present another algorithm, which uses the singular value decomposition (SVD) of the covariance matrix:
Assume that you have an $n$-dimensional training data set of size $m$ that is mean normalized and feature scaled; every data point $x = (x_1, \dots, x_n)$ has $n$ components.
In the first step you calculate the covariance matrix $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} (x^{(i)})^T$, which is an $(n \times n)$ matrix. Then determine the decomposition $\Sigma = U S V^T$ with unitary matrices $U$ and $V$ and a diagonal matrix $S$ (use an existing library routine for this).
From the returned matrix $U$, take only the first $k < n$ columns to form a matrix $U_k$.
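As a minimal sketch of these steps in NumPy (the data, the latent-factor construction and names such as X_raw, mu, std and U_k are made up for illustration):

```python
import numpy as np

# Toy data: m examples with n features driven by a few latent factors (all made up).
rng = np.random.default_rng(0)
m, n, k = 200, 10, 3
W = rng.standard_normal((k, n))                                  # latent-to-feature mixing
X_raw = rng.standard_normal((m, k)) @ W + 0.05 * rng.standard_normal((m, n))

# Mean normalization and feature scaling; keep the statistics for later reuse.
mu, std = X_raw.mean(axis=0), X_raw.std(axis=0)
X = (X_raw - mu) / std

# Covariance matrix (n x n) and its singular value decomposition.
Sigma = (X.T @ X) / m
U, S, Vt = np.linalg.svd(Sigma)          # Sigma = U @ np.diag(S) @ Vt

# Keep only the first k < n columns of U.
U_k = U[:, :k]                           # shape (n, k)
```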

Then use $z = U_k^T x$ with $k < n$ new features $z = (z_1, \dots, z_k)$ instead of the existing $n$ ones.
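Continuing the sketch above, the projection of all training examples at once:

```python
# Each row of Z is z = U_k^T x for the corresponding (scaled) example x.
Z = X @ U_k      # shape (m, k)
```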

To get back from $z$ to the original parameters $x$, use the following approximation:
$x \approx x_{\text{approx}} = U_k z$, which only works well if $k$ is chosen properly.
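In the sketch, the reconstruction becomes:

```python
# Each row is x_approx = U_k z; it is only close to x if k retains enough variance.
X_approx = Z @ U_k.T     # shape (m, n)
```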

How should $k$ be chosen?
You can choose $k$ as the smallest value such that the average squared projection error divided by the total variation in the data stays below a bound $\epsilon$, e.g. $\epsilon = 0.01$ to say that "99% of the variance is retained":

$$\frac{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} - x^{(i)}_{\text{approx}} \rVert^2}{\frac{1}{m}\sum_{i=1}^{m} \lVert x^{(i)} \rVert^2} \le \epsilon$$
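With the arrays from the sketch above, this ratio can be checked directly (0.01 plays the role of $\epsilon$):

```python
proj_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))   # average squared projection error
total_var  = np.mean(np.sum(X ** 2, axis=1))                # total variation in the data
print(proj_error / total_var <= 0.01)                       # does k retain 99% of the variance?
```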
There is an even faster, direct way to determine $k$, using the diagonal matrix $S$:
Choose the smallest value for $k$ such that
$$1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le \epsilon \qquad\text{or, equivalently,}\qquad \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 1 - \epsilon$$
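With the singular values S from the sketch, the smallest such $k$ can be read off a cumulative sum (again with $\epsilon = 0.01$):

```python
# Fraction of variance retained for k = 1, ..., n.
retained = np.cumsum(S) / np.sum(S)
# Smallest k that keeps at least 1 - epsilon of the variance.
k_min = int(np.argmax(retained >= 1 - 0.01)) + 1
```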


Notes: 
- PCA should be applied to the training set to determine the parameters of the mapping from $x$ to $z$. However, the reduced matrix $U_{\text{reduce}}$ (the $U_k$ from above) can later still be used for the data in the cross-validation and test sets, as in the sketch after these notes.
- PCA should not be used to address overfitting: information is thrown away without taking the labels into account, so the better solution is a regularization parameter $\lambda$.
- PCA is often not needed, so the recommended approach is to first run the algorithm without PCA and only apply it if it is really needed (visualization, performance).
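As mentioned in the first note, the mapping learned on the training set (mu, std and U_k in the sketch) is simply reused for new data; the test data below is again made up:

```python
# Test data is scaled with the *training* statistics and projected with the
# training U_k; nothing is re-fitted here.
X_test_raw = rng.standard_normal((50, k)) @ W + 0.05 * rng.standard_normal((50, n))
Z_test = ((X_test_raw - mu) / std) @ U_k     # shape (50, k)
```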
