Definition of PCA
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a high-dimensional dataset into a lower-dimensional subspace while preserving as much variance (information) as possible. PCA identifies new orthogonal axes called principal components.
Objectives of PCA
- Reduce dimensionality of data.
- Identify directions (principal components) that capture maximum variance.
- Remove redundancy and noise from data.
Step-by-Step Procedure to Perform PCA
Step 0: Mean Centering the Data
Given a dataset with $n$ observations and $p$ features, represented as a matrix $X \in \mathbb{R}^{n \times p}$:

Compute the mean vector $\bar{x} \in \mathbb{R}^{p}$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Center the data by subtracting each feature's mean from that feature:

$$X_c = X - \mathbf{1}_n \bar{x}^{\top}$$
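The mean-centering step can be sketched in NumPy; the toy dataset below is hypothetical, chosen only to illustrate the operation:

```python
import numpy as np

# Toy dataset: n = 5 observations, p = 3 features (hypothetical values).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.9],
              [1.9, 2.2, 1.4],
              [3.1, 3.0, 0.8]])

# Mean vector: the average of each feature (column) across observations.
mean = X.mean(axis=0)

# Centered data: subtract the feature means from every row (broadcasting).
X_c = X - mean

# Each column of the centered data now has mean (approximately) zero.
print(X_c.mean(axis=0))
```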
Step-by-Step PCA Calculation
Step 1: Compute Covariance Matrix
The covariance matrix $C$ measures how variables vary together:

$$C = \frac{1}{n-1} X_c^{\top} X_c$$

$C$ is a symmetric matrix of dimension $p \times p$.
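A minimal NumPy sketch of the covariance computation, checked against `np.cov` (the random data is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # n = 100 observations, p = 3 features
X_c = X - X.mean(axis=0)             # mean-centered data

# Sample covariance matrix: (1 / (n - 1)) * X_c^T X_c, with shape p x p.
C = X_c.T @ X_c / (X_c.shape[0] - 1)

# np.cov treats rows as variables by default, hence rowvar=False here.
assert np.allclose(C, np.cov(X, rowvar=False))
assert np.allclose(C, C.T)           # covariance matrices are symmetric
```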
Step 2: Eigenvalue Decomposition
Solve the eigenvalue problem for the covariance matrix $C$:

$$C v_i = \lambda_i v_i$$

- $v_i$: eigenvector (principal component direction)
- $\lambda_i$: eigenvalue (variance along that direction)

Eigenvectors are orthogonal and normalized:

$$v_i^{\top} v_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$
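The eigenvalue step can be sketched with `np.linalg.eigh`, which is the appropriate routine for symmetric matrices such as a covariance matrix (the data here is again a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X_c = X - X.mean(axis=0)
C = X_c.T @ X_c / (X_c.shape[0] - 1)

# eigh handles symmetric matrices: real eigenvalues, returned in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

# Each column v of eigvecs satisfies C v = lambda v ...
v, lam = eigvecs[:, 0], eigvals[0]
assert np.allclose(C @ v, lam * v)

# ... and the eigenvectors are orthonormal: V^T V = I.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(4))
```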
Step 3: Sort Eigenvalues and Eigenvectors
Sort eigenvalues in descending order:

$$\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$$

Select the top $k$ eigenvectors corresponding to the largest eigenvalues. These columns form the projection matrix $W \in \mathbb{R}^{p \times k}$.
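A sketch of the sorting and selection step; the eigenvalues and stand-in eigenvectors below are hypothetical:

```python
import numpy as np

# Hypothetical eigenvalues and stand-in eigenvectors (columns of the identity).
eigvals = np.array([0.5, 2.0, 1.2])
eigvecs = np.eye(3)

# argsort gives ascending order; reverse it to put the largest eigenvalue first.
order = np.argsort(eigvals)[::-1]
eigvals_sorted = eigvals[order]          # [2.0, 1.2, 0.5]
eigvecs_sorted = eigvecs[:, order]       # columns permuted to match

# Keep the top k eigenvectors as the projection matrix W (shape p x k).
k = 2
W = eigvecs_sorted[:, :k]
```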
Step 4: Project Data onto Principal Components
Transform the centered data into the new subspace using the projection matrix $W$:

$$Z = X_c W$$

- $Z \in \mathbb{R}^{n \times k}$: projected data in the lower-dimensional space ($k < p$).
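The projection step, continuing the same NumPy sketch end to end on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))           # n = 30 observations, p = 5 features
X_c = X - X.mean(axis=0)
C = X_c.T @ X_c / (X_c.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]              # top k = 2 principal components

# Project the centered data onto the principal subspace: Z = X_c W.
Z = X_c @ W
assert Z.shape == (30, 2)
```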
Explained Variance:
The proportion of variance explained by the $i$-th principal component is given by:

$$\mathrm{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

Choosing the number of principal components $k$ involves selecting the first few components whose cumulative explained variance exceeds a threshold (e.g., 95%).
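The threshold rule can be sketched as follows; the sorted eigenvalues are hypothetical:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigvals = np.array([4.0, 2.5, 1.0, 0.4, 0.1])

evr = eigvals / eigvals.sum()   # explained variance ratio per component
cum = np.cumsum(evr)            # cumulative explained variance

# Smallest k whose cumulative explained variance reaches 95%:
# searchsorted finds the first index where cum >= 0.95.
k = int(np.searchsorted(cum, 0.95)) + 1
print(k)  # 4 components are needed here
```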
Reconstruction and Reconstruction Error:
Reconstruct an approximation of the original data from the reduced data $Z$.

Reconstructed data:

$$\hat{X} = Z W^{\top} + \mathbf{1}_n \bar{x}^{\top}$$

Reconstruction error (minimized by PCA among all rank-$k$ linear projections):

$$E = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 = (n-1) \sum_{j=k+1}^{p} \lambda_j$$

where $\hat{x}_i$ is the reconstruction of observation $x_i$ and $\lambda_{k+1}, \dots, \lambda_p$ are the discarded eigenvalues.
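A sketch of reconstruction and its error, verifying numerically that the squared error equals $(n-1)$ times the sum of discarded eigenvalues (with the covariance normalized by $n-1$ as above; the data is a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))             # n = 40 observations, p = 4 features
mean = X.mean(axis=0)
X_c = X - mean
C = X_c.T @ X_c / (X_c.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                       # projection matrix, p x k
Z = X_c @ W                              # reduced representation, n x k

# Map back to the original space and undo the centering.
X_hat = Z @ W.T + mean

# Total squared reconstruction error = (n - 1) * sum of discarded eigenvalues.
err = np.sum((X - X_hat) ** 2)
assert np.isclose(err, (X.shape[0] - 1) * eigvals[k:].sum())
```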