Definition of PCA

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a high-dimensional dataset into a lower-dimensional subspace while preserving as much variance (information) as possible. PCA identifies new orthogonal axes called principal components.


Objectives of PCA

  • Reduce dimensionality of data.
  • Identify directions (principal components) that capture maximum variance.
  • Remove redundancy and noise from data.

Step-by-Step Procedure to Perform PCA

Step 0: Mean Centering the Data

Given a dataset with $n$ observations and $p$ features, represented as matrix $X \in \mathbb{R}^{n \times p}$:

Compute the mean vector $\mu$:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Center the data by subtracting the mean from each feature:

$$X_c = X - \mathbf{1}\mu^\top$$
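As a minimal NumPy sketch of the centering step (the small 5×3 dataset below is made up for illustration):

```python
import numpy as np

# Hypothetical small dataset: 5 observations, 3 features
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
])

mu = X.mean(axis=0)    # mean vector, one entry per feature
X_c = X - mu           # centered data: each column now has zero mean

print(X_c.mean(axis=0))  # close to [0, 0, 0] up to floating-point error
```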


Step-by-Step PCA Calculation

Step 1: Compute Covariance Matrix

The covariance matrix $C$ measures how variables vary together:

$$C = \frac{1}{n-1} X_c^\top X_c$$

$C$ is a symmetric matrix of dimension $p \times p$.
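A short sketch of this step, using random data for illustration; the manual formula is checked against NumPy's built-in `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # 100 observations, 4 features
X_c = X - X.mean(axis=0)        # mean-center first

# Sample covariance matrix: (1/(n-1)) * X_c^T X_c
C = X_c.T @ X_c / (X_c.shape[0] - 1)

assert np.allclose(C, C.T)                       # symmetric
assert np.allclose(C, np.cov(X, rowvar=False))   # matches NumPy's built-in
```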


Step 2: Eigenvalue Decomposition

Solve the eigenvalue problem for covariance matrix $C$:

$$C v_i = \lambda_i v_i$$

  • $v_i$: eigenvector (principal component direction)
  • $\lambda_i$: eigenvalue (variance along that direction)

Eigenvectors are orthogonal and normalized:

$$v_i^\top v_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$
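The eigenvalue problem can be sketched in NumPy as follows (random data for illustration; `np.linalg.eigh` is the routine for symmetric matrices and returns real eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X_c = X - X.mean(axis=0)
C = np.cov(X_c, rowvar=False)

# eigh handles symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(C)

# Each column of eigvecs is one eigenvector; verify C v = lambda v
for i in range(3):
    assert np.allclose(C @ eigvecs[:, i], eigvals[i] * eigvecs[:, i])

# Orthonormal: V^T V = I
assert np.allclose(eigvecs.T @ eigvecs, np.eye(3))
```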


Step 3: Sort Eigenvalues and Eigenvectors

Sort eigenvalues in descending order:

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$$

Select the top $k$ eigenvectors corresponding to the largest eigenvalues. These form the projection matrix $W = [v_1, v_2, \ldots, v_k] \in \mathbb{R}^{p \times k}$.
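A sketch of the sorting and selection step; the eigenvalues and identity-matrix "eigenvectors" below are stand-ins to keep the example self-contained:

```python
import numpy as np

# Stand-in values; in practice these come from np.linalg.eigh (ascending order)
eigvals = np.array([0.2, 1.5, 0.7])
eigvecs = np.eye(3)                  # one stand-in eigenvector per column

order = np.argsort(eigvals)[::-1]    # indices for descending eigenvalues
eigvals_sorted = eigvals[order]
eigvecs_sorted = eigvecs[:, order]

k = 2
W = eigvecs_sorted[:, :k]            # projection matrix, shape (p, k)
```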


Step 4: Project Data onto Principal Components

Transform the centered data into the new subspace using projection matrix $W$:

$$Z = X_c W$$

  • $Z$: projected data in lower-dimensional space ($Z \in \mathbb{R}^{n \times k}$).
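Putting Steps 0–4 together as one small NumPy sketch (random data for illustration); note that the variance of each projected column equals the corresponding eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
mu = X.mean(axis=0)
X_c = X - mu                                  # Step 0: center

eigvals, eigvecs = np.linalg.eigh(np.cov(X_c, rowvar=False))  # Steps 1-2
order = np.argsort(eigvals)[::-1]             # Step 3: sort descending
W = eigvecs[:, order[:2]]                     # top-2 projection matrix (4, 2)

Z = X_c @ W                                   # Step 4: project, shape (50, 2)

# Variance along each component equals its eigenvalue
assert np.allclose(np.var(Z, axis=0, ddof=1), eigvals[order[:2]])
```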

Explained Variance:

The proportion of variance explained by the $i$-th principal component is given by:

$$\text{Explained Variance Ratio}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

Choosing $k$ involves selecting the first few components whose cumulative explained variance exceeds a threshold (e.g., 95%).
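A sketch of this selection rule with made-up eigenvalues (assumed already sorted in descending order):

```python
import numpy as np

eigvals = np.array([4.0, 2.5, 1.0, 0.4, 0.1])   # assumed sorted descending
ratio = eigvals / eigvals.sum()                  # per-component proportion
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95) + 1)

print(ratio)       # [0.5, 0.3125, 0.125, 0.05, 0.0125]
print(cumulative)  # [0.5, 0.8125, 0.9375, 0.9875, 1.0]
print(k)           # 4
```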


Reconstruction and Reconstruction Error:

Reconstruct an approximation of the original data from the reduced representation $Z$:

Reconstructed data:

$$\hat{X} = Z W^\top + \mathbf{1}\mu^\top$$

Reconstruction error (minimized by PCA among all rank-$k$ linear projections):

$$E = \frac{1}{n} \| X - \hat{X} \|_F^2$$

where $\| \cdot \|_F$ denotes the Frobenius norm.
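The reconstruction step can be sketched as follows (random data for illustration). A useful sanity check: the mean squared reconstruction error equals the sum of the discarded eigenvalues, up to the $1/(n-1)$ covariance convention:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
mu = X.mean(axis=0)
X_c = X - mu

eigvals, eigvecs = np.linalg.eigh(np.cov(X_c, rowvar=False))
order = np.argsort(eigvals)[::-1]

k = 3
W = eigvecs[:, order[:k]]
Z = X_c @ W                 # reduced representation, shape (60, 3)
X_hat = Z @ W.T + mu        # reconstruction in the original space

# Mean squared reconstruction error over all observations
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Equals ((n-1)/n) * sum of discarded eigenvalues
assert np.allclose(err, (60 - 1) / 60 * eigvals[order][k:].sum())
```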