Principal Component Analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original set. In other words, it reduces the number of variables in a data set while preserving as much of the information (variance) as possible.
1. STANDARDIZATION
- The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.
- If there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate those with smaller ranges, biasing the results. Standardization prevents this.
- Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.
z = (value - mean) / standard deviation
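A minimal sketch of this step with NumPy; the small matrix X below is made-up illustration data (rows are observations, columns are variables):

```python
import numpy as np

# Made-up illustration data: 4 observations of 3 variables.
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.9]])

# Standardize each variable: subtract its mean and divide by its
# standard deviation, column by column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # ~0 for every variable
print(Z.std(axis=0))   # 1 for every variable
```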
2. COVARIANCE MATRIX COMPUTATION
- The aim of this step is to understand how the variables of the input data set vary from the mean with respect to one another, or in other words, to see whether there is any relationship between them.
- The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables.
- Sign of covariance:
If positive: the two variables increase or decrease together.
If negative: one increases while the other decreases.
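A short NumPy sketch of this step, reusing the made-up matrix from the standardization example:

```python
import numpy as np

# Same made-up data as in step 1, standardized.
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.9]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data. rowvar=False tells np.cov
# that variables are in columns, so the result is p x p (here 3 x 3).
cov = np.cov(Z, rowvar=False)
print(cov.shape)                # (3, 3)
print(np.allclose(cov, cov.T))  # True: the matrix is symmetric
```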
3. COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS
- Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables.
- Principal components represent the directions of the data that explain a maximal amount of variance.
- Eigenvectors and eigenvalues always come in pairs: every eigenvector has a corresponding eigenvalue. The eigenvectors of the covariance matrix are the directions of the new axes (the principal components), while the eigenvalues give the amount of variance carried along each of those directions.
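A sketch of the eigendecomposition, continuing the same example; np.linalg.eigh is the appropriate routine here because the covariance matrix is symmetric:

```python
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.9]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)

# eigh returns eigenvalues in ascending order; sort descending so the
# first component is the direction of maximal variance.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each eigenvalue is the variance along its eigenvector; normalizing
# gives the proportion of total variance each component explains.
print(eigenvalues / eigenvalues.sum())
```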
4. FEATURE VECTOR
- In this step, we choose whether to keep all of these components or to discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors called the feature vector.
- The feature vector is simply a matrix that has as columns the eigenvectors of the components we decide to keep. This is the first step toward dimensionality reduction: if we keep only k eigenvectors (components) out of p, the final data set will have only k dimensions.
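Continuing the sketch, forming the feature vector just means keeping the first k sorted eigenvectors as columns (k = 2 is an arbitrary illustrative choice here):

```python
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.9]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]

# Keep the k components with the largest eigenvalues; the feature
# vector is a p x k matrix with those eigenvectors as columns.
k = 2
feature_vector = eigenvectors[:, order[:k]]
print(feature_vector.shape)  # (3, 2)
```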
5. RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES
- The aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components (hence the name Principal Component Analysis).
- This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set: FinalDataSet = FeatureVector^T x StandardizedDataSet^T. With observations as rows, this is equivalent to FinalDataSet = StandardizedDataSet x FeatureVector.
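A final sketch tying the steps together; with observations as rows, the projection is a single matrix product, and the transposed formulation above gives the same result:

```python
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.9]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order[:2]]

# Recast the data: each row of final_data is an observation expressed
# in the coordinates of the principal components.
final_data = Z @ feature_vector              # shape (4, 2)

# The transposed formulation from the text yields the same values.
final_data_t = feature_vector.T @ Z.T        # shape (2, 4)
print(np.allclose(final_data, final_data_t.T))  # True
```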
Source: https://builtin.com/data-science/step-step-explanation-principal-component-analysis