Definition of PCA
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It transforms a high-dimensional dataset into a lower-dimensional subspace while preserving as much variance (information) as possible. PCA identifies new orthogonal axes called principal components.
Objectives of PCA
- Reduce dimensionality of data.
- Identify directions (principal components) that capture maximum variance.
- Remove redundancy and noise from data.
Step-by-Step Procedure to Perform PCA
Step 0: Mean Centering the Data
Given a dataset with $n$ observations and $p$ features, represented as a matrix $X \in \mathbb{R}^{n \times p}$:

Compute the mean vector $\bar{x} \in \mathbb{R}^{p}$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Center the data by subtracting each feature's mean from that feature:

$$X_c = X - \mathbf{1}_n \bar{x}^{\top}$$
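The mean-centering step can be sketched in NumPy; the toy dataset below is hypothetical, chosen only to illustrate the operation:

```python
import numpy as np

# Toy dataset: n = 5 observations, p = 3 features (hypothetical values).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.9],
              [1.9, 2.2, 1.4],
              [3.1, 3.0, 0.8]])

# Mean vector: the average of each feature (column) across observations.
mean = X.mean(axis=0)

# Centered data: subtract the feature means from every row (broadcasting).
X_c = X - mean

# Each column of the centered data now has mean (approximately) zero.
print(X_c.mean(axis=0))
```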
Step-by-Step PCA Calculation
Step 1: Compute Covariance Matrix
The covariance matrix $C$ measures how variables vary together:

$$C = \frac{1}{n-1} X_c^{\top} X_c$$

$C$ is a symmetric matrix of dimension $p \times p$.
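A minimal NumPy sketch of the covariance computation, checked against `np.cov` (the random data is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # n = 100 observations, p = 3 features
X_c = X - X.mean(axis=0)             # mean-centered data

# Sample covariance matrix: (1 / (n - 1)) * X_c^T X_c, with shape p x p.
C = X_c.T @ X_c / (X_c.shape[0] - 1)

# np.cov treats rows as variables by default, hence rowvar=False here.
assert np.allclose(C, np.cov(X, rowvar=False))
assert np.allclose(C, C.T)           # covariance matrices are symmetric
```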
Step 2: Eigenvalue Decomposition
Solve the eigenvalue problem for the covariance matrix $C$:

$$C v_i = \lambda_i v_i$$

- $v_i$: eigenvector (principal component direction)
- $\lambda_i$: eigenvalue (variance along that direction)

Eigenvectors are orthogonal and normalized:

$$v_i^{\top} v_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$
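The eigenvalue step can be sketched with `np.linalg.eigh`, which is the appropriate routine for symmetric matrices such as a covariance matrix (the data here is again a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X_c = X - X.mean(axis=0)
C = X_c.T @ X_c / (X_c.shape[0] - 1)

# eigh handles symmetric matrices: real eigenvalues, returned in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

# Each column v of eigvecs satisfies C v = lambda v ...
v, lam = eigvecs[:, 0], eigvals[0]
assert np.allclose(C @ v, lam * v)

# ... and the eigenvectors are orthonormal: V^T V = I.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(4))
```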
Step 3: Sort Eigenvalues and Eigenvectors
Sort eigenvalues in descending order:

$$\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$$

Select the top $k$ eigenvectors corresponding to the largest eigenvalues. These columns form the projection matrix $W \in \mathbb{R}^{p \times k}$.
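A sketch of the sorting and selection step; the eigenvalues and stand-in eigenvectors below are hypothetical:

```python
import numpy as np

# Hypothetical eigenvalues and stand-in eigenvectors (columns of the identity).
eigvals = np.array([0.5, 2.0, 1.2])
eigvecs = np.eye(3)

# argsort gives ascending order; reverse it to put the largest eigenvalue first.
order = np.argsort(eigvals)[::-1]
eigvals_sorted = eigvals[order]          # [2.0, 1.2, 0.5]
eigvecs_sorted = eigvecs[:, order]       # columns permuted to match

# Keep the top k eigenvectors as the projection matrix W (shape p x k).
k = 2
W = eigvecs_sorted[:, :k]
```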
Step 4: Project Data onto Principal Components
Transform the centered data into the new subspace using the projection matrix $W$:

$$Z = X_c W$$

- $Z \in \mathbb{R}^{n \times k}$: projected data in the lower-dimensional space ($k < p$).
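The projection step, continuing the same NumPy sketch end to end on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))           # n = 30 observations, p = 5 features
X_c = X - X.mean(axis=0)
C = X_c.T @ X_c / (X_c.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]              # top k = 2 principal components

# Project the centered data onto the principal subspace: Z = X_c W.
Z = X_c @ W
assert Z.shape == (30, 2)
```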
Explained Variance:
The proportion of variance explained by the $i$-th principal component is given by:

$$\mathrm{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

Choosing the number of principal components $k$ involves selecting the first few components whose cumulative explained variance exceeds a threshold (e.g., 95%).
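The threshold rule can be sketched as follows; the sorted eigenvalues are hypothetical:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigvals = np.array([4.0, 2.5, 1.0, 0.4, 0.1])

evr = eigvals / eigvals.sum()   # explained variance ratio per component
cum = np.cumsum(evr)            # cumulative explained variance

# Smallest k whose cumulative explained variance reaches 95%:
# searchsorted finds the first index where cum >= 0.95.
k = int(np.searchsorted(cum, 0.95)) + 1
print(k)  # 4 components are needed here
```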
Reconstruction and Reconstruction Error:
Reconstruct an approximation of the original data from the reduced data $Z$.

Reconstructed data:

$$\hat{X} = Z W^{\top} + \mathbf{1}_n \bar{x}^{\top}$$

Reconstruction error (minimized by PCA among all rank-$k$ linear projections):

$$E = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 = (n-1) \sum_{j=k+1}^{p} \lambda_j$$

where $\hat{x}_i$ is the reconstruction of observation $x_i$ and $\lambda_{k+1}, \dots, \lambda_p$ are the discarded eigenvalues.
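A sketch of reconstruction and its error, verifying numerically that the squared error equals $(n-1)$ times the sum of discarded eigenvalues (with the covariance normalized by $n-1$ as above; the data is a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))             # n = 40 observations, p = 4 features
mean = X.mean(axis=0)
X_c = X - mean
C = X_c.T @ X_c / (X_c.shape[0] - 1)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                       # projection matrix, p x k
Z = X_c @ W                              # reduced representation, n x k

# Map back to the original space and undo the centering.
X_hat = Z @ W.T + mean

# Total squared reconstruction error = (n - 1) * sum of discarded eigenvalues.
err = np.sum((X - X_hat) ** 2)
assert np.isclose(err, (X.shape[0] - 1) * eigvals[k:].sum())
```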