This lecture covers essential data preprocessing techniques using the Scikit-Learn library:
- Handling Missing Data: Strategies to deal with absent values in datasets.
- Feature Scaling: Methods to standardize or normalize the range of feature values.
- Outlier Detection: Techniques to identify data points that deviate significantly from the rest.
- Dimensionality Reduction: Approaches to reduce the number of features while preserving important information.
- Encoding Categorical Variables: Converting categorical data into numerical format for machine learning models.
- Creating Preprocessing Pipelines: Combining multiple preprocessing steps into a single workflow.
Generating Sample Dataset
To illustrate the concepts, we first generate a sample dataset using numpy and pandas.
import numpy as np
import pandas as pd
# Set random seed for reproducibility
np.random.seed(42) # Using seed value 42
# Generate random data: 500 samples, 4 features
# Data follows a normal distribution (randn)
# Scaled by [10, 5, 1, 0.5] and shifted by [50, 30, 10, 5] per feature
data = np.random.randn(500, 4) * [10, 5, 1, 0.5] + [50, 30, 10, 5]
# Create a pandas DataFrame
df = pd.DataFrame(data,
                  columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'])
Viewing the Data Head
Let’s look at the first 5 rows using df.head():
df.head()
Output:
(Output: a table with the first five rows of Feature1–Feature4; the numeric values are not reproduced here.)
Handling Missing Data
Missing data is a common problem in real-world datasets. Imputation involves filling in these missing values.
Imputation Techniques
- Mean Imputation:
- Concept: For each feature (column), calculate the average (mean) of all non-missing values. Replace missing entries in that column with this calculated mean.
- Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of non-missing values for the feature and $x_i$ are the non-missing values.
- Median Imputation:
- Concept: Replace missing values in a feature with the median value of the non-missing entries in that feature. Median is often preferred over mean when outliers are present, as it is less sensitive to extreme values.
- Mode Imputation (Most Frequent):
- Concept: Replace missing values with the most frequently occurring value (mode) in the feature. This is typically used for categorical or discrete numerical features.
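Median and most-frequent imputation use the same SimpleImputer API as the mean example that follows. Below is a minimal sketch on a toy frame; the column names 'income' and 'city' are illustrative and not part of the lecture dataset.

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative only)
toy = pd.DataFrame({'income': [30.0, 45.0, np.nan, 200.0, 38.0],
                    'city':   ['A', 'B', 'B', np.nan, 'B']})

# Median imputation for the numeric column (robust to the extreme value 200)
toy[['income']] = SimpleImputer(strategy='median').fit_transform(toy[['income']])

# Mode (most frequent) imputation for the categorical column
toy[['city']] = SimpleImputer(strategy='most_frequent').fit_transform(toy[['city']])
print(toy)
```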
Handling Missing Data in Scikit-Learn (Example)
Using SimpleImputer to perform mean imputation on ‘Feature3’.
from sklearn.impute import SimpleImputer
# 1. Introduce some missing values (NaN) into 'Feature3' for demonstration
# Rows 10 to 14 (exclusive of 15), column index 2 ('Feature3')
df.iloc[10:15, 2] = np.nan # Introduce missing values in 'Feature3'
# 2. Create an imputer instance using the 'mean' strategy
imputer = SimpleImputer(strategy='mean')
# 3. Fit the imputer to the 'Feature3' column and transform it
# The imputer learns the mean from the non-missing values
# Then it replaces the NaN values with the learned mean
# Note: fit_transform expects a 2D array, hence df[['Feature3']]
df[['Feature3']] = imputer.fit_transform(df[['Feature3']])
# Now df['Feature3'] has no missing values (rows 10-14 are filled with the mean)
Feature Scaling
Scaling features ensures that all features contribute more equally to model training, preventing features with larger values from dominating distance-based algorithms or gradient descent updates.
StandardScaler (Mean 0, Std 1)
- Concept: Transforms each feature so that it has a mean of $0$ and a standard deviation of $1$. This is also known as Z-score normalization.
- Formula: Each value $x$ is transformed into $z$ using:
  $$z = \frac{x - \mu}{\sigma}$$
  where:
  - $\mu$ is the mean of the feature.
  - $\sigma$ is the standard deviation of the feature.
- Effect:
- Features are centered around zero.
- Useful for models that are sensitive to feature scale or that work best with roughly centered, comparable-scale inputs (e.g., logistic regression, SVM, PCA).
- Prevents dominance of large-scale features in algorithms sensitive to feature magnitude.
StandardScaler in Scikit-Learn (Example)
from sklearn.preprocessing import StandardScaler
# 1. Create a StandardScaler instance
scaler = StandardScaler()
# 2. Fit the scaler to the data (calculates mean and std dev for each feature)
# and transform the data (applies the z-score formula)
scaled_data = scaler.fit_transform(df)
# 3. Create a new DataFrame with the scaled data
df_scaled = pd.DataFrame(scaled_data, columns=df.columns)
# df_scaled now contains the standardized data
MinMaxScaler (Range 0 to 1)
- Concept: Transforms each feature so that its values fall into a specific range, typically $[0, 1]$ or $[-1, 1]$.
- Formula (for range $[0, 1]$):
  $$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
  where:
  - $x$ is the original value.
  - $x_{\min}$ is the minimum value of the feature.
  - $x_{\max}$ is the maximum value of the feature.
- Effect:
- Scales data to a fixed range.
- Useful when algorithms require data in a specific bounded interval (e.g., some neural networks).
- Can be sensitive to outliers, as they determine the min/max values.
MinMaxScaler in Scikit-Learn (Example)
from sklearn.preprocessing import MinMaxScaler
# 1. Create a MinMaxScaler instance, specifying the desired range (0, 1)
minmax_scaler = MinMaxScaler(feature_range=(0, 1)) # Default range is (0, 1)
# 2. Fit the scaler (finds min and max for each feature)
# and transform the data (applies the scaling formula)
scaled_data_mm = minmax_scaler.fit_transform(df)
# 3. Create a new DataFrame with the min-max scaled data
df_minmax = pd.DataFrame(scaled_data_mm, columns=df.columns)
# df_minmax now contains the data scaled to the [0, 1] range
Viewing df_minmax.head() Output
df_minmax.head()
Output:
(Output: a table with the first five rows of the min-max scaled Feature1–Feature4; the numeric values are not reproduced here.)
(Note: The output values shown in the slide for df_minmax.head() do not appear to lie in the $0$ to $1$ range. This might be an error in the slide, potentially showing output from a different transformation such as StandardScaler again. The code itself correctly implements MinMaxScaler, scaling to the $[0, 1]$ range.)
Outlier Detection
Outliers are data points that are significantly different from other observations. They can skew results and negatively impact model performance.
Adding Outliers to the Data (Example Setup)
Let’s add some artificial outliers to our dataset.
# Define outlier data points
outliers = np.array([
[200, 300, 20, 10], # Outlier 1
[180, 280, 25, 7], # Outlier 2
[160, -100, -5, 2] # Outlier 3
])
# Create a DataFrame for the outliers
outliers_df = pd.DataFrame(outliers, columns=df.columns)
# Concatenate the original DataFrame with the outliers DataFrame
# ignore_index=True resets the index for the new combined DataFrame
df = pd.concat([df, outliers_df], ignore_index=True)
# df now contains the original 500 points plus 3 outlier points
Isolation Forest
- Concept: Anomaly detection (or outlier detection) identifies data points deviating significantly from the norm. Isolation Forest is an efficient algorithm specifically designed for this purpose.
- Principle: It works by randomly partitioning the data until each data point is isolated. Anomalies are typically easier to isolate (require fewer partitions) than normal points.
Steps in Isolation Forest
- Step 1: Random Partitioning
  - The algorithm builds multiple isolation trees (iTrees). For each iTree:
    - A random subsample of the data is selected (without replacement).
    - The tree is built by recursively:
      - Selecting a random feature.
      - Selecting a random split value between the minimum and maximum values of that feature in the subsample.
    - This continues until each point in the subsample is in its own leaf node, or a predefined maximum tree depth is reached.
- Step 2: Path Length
  - Key Idea: Anomalies, being different and fewer, are expected to be isolated in fewer steps (shorter paths from the root node) than normal instances, which are often clustered.
  - Path Length Definition: The path length $h(x)$ of a data point $x$ in an iTree is the number of edges traversed from the root node to the terminal (leaf) node containing that point.
  - Interpretation: Shorter path lengths suggest a higher likelihood of being an anomaly.
- Step 3: Averaging Over Multiple Trees
  - Forest of iTrees: Because partitioning is random, a single tree might not be reliable. The algorithm builds a forest of iTrees (100 by default in Scikit-Learn).
  - Robust Isolation Measure: The average path length of a data point across all trees in the forest provides a more stable indicator of how easily it can be isolated.
  - Benefit: Averaging reduces the variance associated with any single random partition, improving the accuracy and reliability of anomaly detection.
Anomaly Score
- Definition: The anomaly score $s(x, n)$ quantifies how anomalous a data point $x$ is, given the subsample size $n$ used to build the trees.
- Formula:
  $$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$
  Where:
  - $h(x)$: Path length of point $x$ in a single iTree.
  - $E[h(x)]$: Average path length of point $x$ across all iTrees in the forest.
  - $c(n)$: Normalization factor, approximating the average path length of an unsuccessful search in a Binary Search Tree (BST) with $n$ nodes. It is used to normalize the average path length $E[h(x)]$.
  - Calculation of $c(n)$: $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$, where $H(i)$ is the harmonic number, which can be approximated as $H(i) \approx \ln(i) + 0.5772156649$ (the Euler–Mascheroni constant).
- Interpretation:
  - $s \to 1$: Strong anomaly (average path length is very small).
  - $s \ll 0.5$: Likely normal instance (average path length is larger).
  - $s \approx 0.5$ for all points: No clear anomaly within the dataset (average path lengths are close to the average expected for the sample size).
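As a rough illustration of the formula above (not Scikit-Learn code; the helper names c_factor and anomaly_score are made up for this sketch):

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def c_factor(n):
    # c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated as ln(i) + EULER_GAMMA
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s(x, n) = 2 ** (-E[h(x)] / c(n))
    return 2.0 ** (-avg_path_length / c_factor(n))

# With a subsample of n = 256, c(n) is roughly 10.2: a point isolated after ~4 splits
# on average scores about 0.76 (anomalous), while one needing ~14 splits scores about 0.39.
print(anomaly_score(4.0, 256), anomaly_score(14.0, 256))
```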
Hyperparameters
Key hyperparameters for IsolationForest in Scikit-Learn:
- `n_estimators`: The number of iTrees to build in the forest. More trees generally lead to more reliable results but increase computational cost. (Default: 100)
- `max_samples`: The number (or proportion) of data points used to build each iTree. Smaller subsamples can speed up training and sometimes improve performance by reducing “swamping” (where normal instances obscure anomalies) and “masking” (where multiple anomalies obscure each other) effects. (Default: 'auto', meaning min(256, n_samples))
- `contamination`: The expected proportion of outliers in the dataset. It defines the threshold on the anomaly scores used by the `predict` method; it does not affect the scores themselves, only their conversion to labels (-1 for outlier, 1 for inlier). (Default: 'auto', which sets the threshold as in the original Isolation Forest paper)
- `max_features`: The number (or proportion) of features drawn from the data to train each iTree. (Default: 1.0, i.e., use all features)
- `random_state`: Controls the pseudo-randomness for selecting samples and features, ensuring reproducibility.
Parameters in IsolationForest Class (Summary Table)
| Parameter | Description | Effect |
|---|---|---|
contamination | Proportion of expected outliers | Controls the threshold for classifying anomalies via predict(). |
n_estimators | Number of trees in the ensemble | More trees improve stability but increase computation time. |
max_samples | Number of samples per tree | Affects tree depth and anomaly detection sensitivity. |
max_features | Number of features drawn per tree | Limits the features each tree is trained on. |
random_state | Controls randomness of splits/samples | Ensures reproducibility. |
Isolation Forest Example (Conceptual 2D)
- Data Generation: Imagine a 2D dataset with a dense cluster of normal points and a few outlier points lying far away from the cluster.
- (See plot on slide 19: blue ‘X’s form a cluster, red ‘X’s are scattered outliers).
- Tree Construction: The algorithm randomly selects a feature (x or y) and a split value.
- Isolation: Outliers far from the cluster are likely isolated quickly; a single random split can often separate such a point, whereas points within the dense cluster require many more splits.
- Anomaly Score Calculation: After building multiple trees and averaging path lengths, the outliers will have significantly shorter average path lengths, resulting in anomaly scores close to $1$. Inlier points in the cluster will have longer average paths and scores closer to $0.5$ or lower.
Isolation Tree Visualization
(Slide 21 shows a visualization of a single complex Isolation Tree. Nodes contain split conditions (e.g., feature_i <= value), sample counts, and potentially error or value metrics depending on the visualization tool. It illustrates the recursive partitioning process.)
Using IsolationForest (Code Example)
from sklearn.ensemble import IsolationForest
# 1. Create an IsolationForest instance with specific hyperparameters
iso_forest = IsolationForest(
contamination=0.01, # Expect ~1% of the data to be outliers
n_estimators=100, # Use 100 trees for stability
max_samples=256, # Use 256 random samples per tree (if dataset size allows)
    max_features=2,      # Draw 2 of the 4 features to train each tree
random_state=42 # Ensure reproducibility
)
# 2. Fit the model to the data and predict outliers
# fit_predict fits the model and returns labels: 1 for inlier, -1 for outlier
outlier_labels = iso_forest.fit_predict(df) # Assuming df has the outliers added
# 3. Add the labels to the DataFrame
df['Outlier'] = outlier_labels
# 4. Check the counts of inliers and outliers
print(df['Outlier'].value_counts())
Output of value_counts()
Outlier
1 497
-1 6
Name: count, dtype: int64
(This output shows that the model labelled 6 points as outliers (-1) and 497 as inliers (1), based on the contamination=0.01 threshold and the data structure. Note: the number of outliers found can differ from the 3 that were added, since contamination=0.01 flags roughly 1% of the 503 points.)
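Besides the ±1 labels, the fitted model also exposes continuous scores through decision_function (negative values fall below the outlier threshold) and score_samples. A short sketch, assuming the df and iso_forest objects from above:

```python
# Continuous anomaly scores for the same rows (drop the label column added above)
features = df.drop(columns='Outlier')
scores = iso_forest.decision_function(features)  # negative values correspond to predicted outliers
# The most anomalous rows have the smallest (most negative) scores
print(pd.Series(scores, index=df.index).nsmallest(5))
```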
Removing Outliers (Code)
Once outliers are identified, they can be removed from the dataset if desired.
# Filter the DataFrame to keep only rows where 'Outlier' is 1 (inliers)
# Drop the 'Outlier' column itself afterwards
df_no_outliers = df[df['Outlier'] == 1].drop(columns='Outlier')
# Check the shape of the DataFrame without outliers
print(df_no_outliers.shape)
Two Cluster Example (Isolation Forest)
This example demonstrates Isolation Forest on data with two distinct clusters (normal data) and some scattered outliers.
Generating Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
# Generate data
rng = np.random.RandomState(42)
# Generate train data - normal data consisting of two clusters
# Cluster 1: 100 points centered around (2, 2) with some noise
# Cluster 2: 100 points centered around (-2, -2) with some noise
X_train = 0.3 * rng.randn(100, 2)
X_train = np.r_[X_train + 2, X_train - 2] # Combine clusters (total 200 normal points)
# Generate some outliers scattered in the range [-4, 4] x [-4, 4]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2)) # 20 outlier points
# Combine normal data and outliers
X = np.r_[X_train, X_outliers] # Total 220 points
Fitting the Model and Predicting
# Fit the model
# Note: Fit ONLY on the training data (normal data) if possible,
# as contamination is often defined relative to normal behavior.
# Here, we fit on X_train to learn the structure of normal points.
clf = IsolationForest(n_estimators=100,
max_samples='auto',
contamination='auto', # Let the algorithm estimate contamination
random_state=42)
clf.fit(X_train) # Fit on the normal data
# Predict - 1 for inliers, -1 for outliers on the combined dataset
y_pred = clf.predict(X)
Plotting the Results
Plotting Code 1 (Original Data)
# Plot the data points before classification
plt.figure(figsize=(8, 6))
# Plot inliers (first 200 points)
plt.scatter(X[:200, 0], X[:200, 1], c='white', edgecolors='k', s=20, label='Inliers')
# Plot outliers (last 20 points)
plt.scatter(X[200:, 0], X[200:, 1], c='red', edgecolors='k', s=20, label='Outliers')
plt.title("Isolation Forest Example Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
(See slide 29 for the plot. Shows two clusters (white circles) and scattered outliers (red circles).)
Plotting Code 2 (Predicted Labels)
# Plot the data points colored by predicted labels
plt.figure(figsize=(8, 6))
# Color points by predicted label: 'white' for inliers (1), 'red' for outliers (-1)
for color in ['white', 'red']:
label_val = 1 if color == 'white' else -1
idx = (y_pred == label_val) # Find indices where prediction matches the label
plt.scatter(X[idx, 0], X[idx, 1], c=color, edgecolors='k', s=50,
label=('Inliers' if color == 'white' else 'Outliers'))
plt.title("Isolation Forest Example Prediction")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
(See slide 31 for the plot. Shows points colored by prediction. Most cluster points are white (inliers), scattered points are red (outliers). Some points near cluster edges might also be marked as outliers.)
Dimensionality Reduction: Principal Component Analysis (PCA)
Motivation
Having too many features (high dimensionality) can lead to:
- Overfitting: Model learns training data too well, including noise, and performs poorly on new data.
- Slower Computation: More features mean more calculations.
- Lower Accuracy: Irrelevant features (noise) can degrade model performance.
- Curse of Dimensionality: Data becomes sparse in high dimensions, making distance measures less meaningful.
PCA Concept
- Definition: PCA is a widely used linear dimensionality reduction technique.
- Goal: To find a new set of uncorrelated variables, called principal components (PCs), that capture the maximum possible variance from the original data.
- Prioritization: PCA prioritizes the directions (axes) in the feature space where the data varies the most. The assumption is that directions with more variance contain more information.
Key Mathematical Concepts
- Covariance Matrix:
  - Describes the linear relationships (covariances) between pairs of variables in a dataset.
  - For a dataset with $d$ variables, the covariance matrix $\Sigma$ is a $d \times d$ matrix.
  - Element $\Sigma_{ij}$ represents the covariance between variable $i$ and variable $j$.
  - Diagonal elements $\Sigma_{ii}$ represent the variance of variable $i$.
- Eigenvectors and Eigenvalues:
  - Fundamental concepts in linear algebra.
  - For a square matrix (like the covariance matrix $\Sigma$), an eigenvector $v$ is a non-zero vector that, when multiplied by the matrix, results in a scaled version of itself. The scaling factor is the corresponding eigenvalue $\lambda$.
  - Equation: $\Sigma v = \lambda v$
  - In PCA Context:
    - Eigenvectors of the covariance matrix represent the directions of the principal components (the new axes).
    - Eigenvalues represent the amount of variance explained by each corresponding principal component (eigenvector). Larger eigenvalues correspond to principal components that capture more variance.
- Projection onto Principal Components:
  - Once the principal components (eigenvectors) are found, the original data can be projected onto these new axes.
  - This projection transforms the data from the original coordinate system to the new coordinate system defined by the principal components.
  - Mathematically: If $X$ is the standardized data matrix ($n$ samples $\times$ $d$ features) and $W$ is the matrix whose columns are the selected top $k$ eigenvectors ($d \times k$), the projected data $Z$ ($n \times k$) is calculated as:
    $$Z = XW$$
PCA Algorithm Steps
- Data Standardization: Standardize the data (typically using StandardScaler) so that each feature has zero mean ($\mu = 0$) and unit variance ($\sigma = 1$). This is crucial because PCA is sensitive to the scale of variables; features with larger variances would otherwise dominate the principal components.
- Covariance Matrix Calculation: Calculate the covariance matrix $\Sigma$ of the standardized data.
- Eigenvalue Decomposition: Perform eigenvalue decomposition (also known as eigendecomposition) on the covariance matrix $\Sigma$. This yields the eigenvectors $v_i$ and their corresponding eigenvalues $\lambda_i$.
- Selection of Principal Components:
  - Sort the eigenvectors by their corresponding eigenvalues in descending order ($\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$).
  - Select the top $k$ eigenvectors (those corresponding to the largest eigenvalues), where $k$ is the desired number of dimensions for the reduced data. The choice of $k$ often depends on the amount of variance to retain (e.g., choose $k$ so that the top $k$ eigenvalues account for 95% of the sum of all eigenvalues).
- Projection of Data: Project the standardized data onto the selected principal components (eigenvectors) by taking the dot product of the standardized data matrix and the matrix of selected eigenvectors, $Z = XW$. The result is the lower-dimensional representation of the data.
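These steps map directly onto a few NumPy calls. Below is a hedged sketch of a manual PCA (the function name pca_manual is made up here; Scikit-Learn's PCA uses an SVD internally rather than an explicit eigendecomposition):

```python
import numpy as np

def pca_manual(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # 1. Standardize each feature (population std, as StandardScaler does)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition (eigh is appropriate for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue in descending order and keep the top k eigenvectors
    order = np.argsort(eigenvalues)[::-1]
    W = eigenvectors[:, order[:k]]
    # 5. Project: Z = X_std W
    return X_std @ W

# Example: four points lying close to a line, reduced to one dimension
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1]])
print(pca_manual(X, k=1))
```

In Scikit-Learn, passing a float such as PCA(n_components=0.95) instead selects the number of components automatically so that 95% of the variance is retained.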
Mathematical Example
Let’s consider a simple 2D dataset with two data points: $x_1 = (2, 4)$ and $x_2 = (4, 2)$.
- Step 1: Data Standardization
  - Calculate the means: $\mu_1 = 3$, $\mu_2 = 3$, so the mean vector is $(3, 3)$.
  - Center the data: $x_1 - \mu = (-1, 1)$ and $x_2 - \mu = (1, -1)$.
  - Calculate the standard deviations (using $n-1$ in the denominator for the sample standard deviation): $s_1 = s_2 = \sqrt{2}$.
  - Standardize: the standardized data matrix is $X = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$.
- Step 2: Covariance Matrix Calculation
  - $\Sigma = \frac{1}{n-1} X^T X = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$.
- Step 3: Eigenvalue Decomposition
  - Solve the characteristic equation $\det(\Sigma - \lambda I) = 0$: $(1 - \lambda)^2 - 1 = 0$, i.e., $\lambda(\lambda - 2) = 0$.
  - The eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 0$.
  - For $\lambda_1 = 2$: $(\Sigma - 2I)v = 0$ with $v = (a, b)^T$ gives $-a - b = 0$, or $b = -a$. Normalized eigenvector $\frac{1}{\sqrt{2}}(1, -1)^T$ (or its negative). To match slide 41 we take $v_1 = \frac{1}{\sqrt{2}}(-1, 1)^T$.
  - For $\lambda_2 = 0$: $\Sigma v = 0$ gives $a - b = 0$, or $a = b$. Normalized eigenvector $v_2 = \frac{1}{\sqrt{2}}(1, 1)^T$.
- Step 4: Selection of Principal Components
  - The largest eigenvalue is $\lambda_1 = 2$. The corresponding eigenvector (principal component) is $v_1 = \frac{1}{\sqrt{2}}(-1, 1)^T$.
  - We choose to keep only this component (reduce the dimensionality to $k = 1$).
- Step 5: Projection of Data
  - Project the standardized data onto $v_1$: $Z = X v_1$.
  - $z_1 = \left(-\tfrac{1}{\sqrt{2}}\right)\left(-\tfrac{1}{\sqrt{2}}\right) + \left(\tfrac{1}{\sqrt{2}}\right)\left(\tfrac{1}{\sqrt{2}}\right) = 1$ and $z_2 = \left(\tfrac{1}{\sqrt{2}}\right)\left(-\tfrac{1}{\sqrt{2}}\right) + \left(-\tfrac{1}{\sqrt{2}}\right)\left(\tfrac{1}{\sqrt{2}}\right) = -1$.
  - The reduced data consists of the scalar values $1$ and $-1$.
Implementation using Sklearn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data (replace with your data or use the example data)
# Example data from slide 44: X = np.array([[1, 2], [2, 4], [3, 6], [4, 8]])
# Let's use the mathematical example data:
X = np.array([[2., 4.], [4., 2.]]) # Using floats
# 1. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Note: StandardScaler divides by the population std (denominator n), so here
# X_scaled = [[-1., 1.], [1., -1.]]. The manual example above used the sample std
# (denominator n-1), so its standardized values are smaller by a factor of sqrt(2).
# 2. Create a PCA object and specify the number of components (k=1)
pca = PCA(n_components=1) # Reduce to 1 dimension
# 3. Fit the PCA model to the scaled data and transform the data
X_reduced = pca.fit_transform(X_scaled)
# Print the reduced data
print(X_reduced)
# Output is approximately [[1.41421356], [-1.41421356]] or [[-1.41421356], [1.41421356]],
# depending on the eigenvector sign chosen by the solver.
# Note: the manual calculation yielded [1, -1]. The factor of sqrt(2) comes from the
# standardization step: StandardScaler divides by the population std (n), while the manual
# example used the sample std (n-1), so sklearn's scaled data and hence its projections
# are larger by sqrt(2). Both represent the same one-dimensional projection.
Encoding Categorical Variables
Machine learning algorithms typically require numerical input. Categorical variables (text labels) need to be converted into numbers.
Common Encoders
- `LabelEncoder`: Assigns a unique integer to each category, e.g., ['red', 'green', 'blue'] → [2, 1, 0] (categories are sorted, so 'blue' → 0, 'green' → 1, 'red' → 2). Suitable for encoding target variables, but generally not for features, as it implies an arbitrary order (a minimal sketch follows below).
- `OneHotEncoder`: Creates a new binary (0 or 1) column for each category, e.g., with categories ['blue', 'green', 'red'], 'red' → [0, 0, 1] and 'green' → [0, 1, 0]. Prevents the model from assuming an order. Suitable for nominal features, but can lead to high dimensionality if there are many categories.
- `OrdinalEncoder`: Similar to `LabelEncoder` but designed for features. Assigns integers based on a specified order (if known) or automatically, e.g., ['Low', 'Medium', 'High'] → [0, 1, 2]. Suitable for ordinal features where the order matters.
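A minimal LabelEncoder sketch on an illustrative target vector (the 'spam'/'ham' labels are made up for this example):

```python
from sklearn.preprocessing import LabelEncoder

y = ['spam', 'ham', 'spam', 'ham', 'spam']          # illustrative target labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)          # categories are sorted: 'ham' -> 0, 'spam' -> 1
print(label_encoder.classes_)                       # ['ham' 'spam']
print(y_encoded)                                    # [1 0 1 0 1]
print(label_encoder.inverse_transform(y_encoded))   # back to the original string labels
```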
Example Setup
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
# Generate sample classification data
X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
df = pd.DataFrame(X, columns=['num_feat1', 'num_feat2', 'num_feat3', 'num_feat4'])
# Add categorical features
# Nominal feature: 'cat_feat1'
df['cat_feat1'] = np.random.choice(['A', 'B', 'C'], size=1000)
# Ordinal feature: 'cat_feat2'
df['cat_feat2'] = np.random.choice(['Low', 'Medium', 'High'], size=1000)
One-Hot Encoding Example
# One-Hot Encoding for nominal 'cat_feat1'
onehot_encoder = OneHotEncoder(sparse_output=False, # Get dense array output
handle_unknown='ignore') # How to handle categories seen in transform but not fit
# Fit and transform the column
cat_feat1_encoded_array = onehot_encoder.fit_transform(df[['cat_feat1']])
# Create a DataFrame with meaningful column names
cat_feat1_encoded = pd.DataFrame(
cat_feat1_encoded_array,
columns=onehot_encoder.get_feature_names_out(['cat_feat1']) # e.g., ['cat_feat1_A', 'cat_feat1_B', 'cat_feat1_C']
)
# This cat_feat1_encoded DataFrame can be concatenated back to the main df
# (after potentially dropping the original 'cat_feat1' column)
(Note: sparse=False was deprecated and replaced by sparse_output=False in newer Scikit-learn versions.)
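The ordinal feature 'cat_feat2' was not encoded above. A hedged sketch using OrdinalEncoder with an explicit category order, assuming the df from the example setup (OrdinalEncoder was already imported there):

```python
# Ordinal encoding for 'cat_feat2', preserving the Low < Medium < High order
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['cat_feat2_encoded'] = ordinal_encoder.fit_transform(df[['cat_feat2']]).ravel()
# 'Low' -> 0.0, 'Medium' -> 1.0, 'High' -> 2.0
```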
Creating a Preprocessing Pipeline
Scikit-Learn’s Pipeline allows chaining multiple preprocessing steps and a final estimator (model). This simplifies workflows, prevents data leakage during cross-validation, and makes code cleaner.
Example Pipeline
Combining imputation and scaling for numeric features, and imputation and one-hot encoding for categorical features.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer # Needed to apply different steps to different columns
# Define numeric and categorical feature names
numeric_features = ['num_feat1', 'num_feat2', 'num_feat3', 'num_feat4']
categorical_features = ['cat_feat1', 'cat_feat2']
# Create pipeline for numeric features: Impute with mean, then scale
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
# Create pipeline for categorical features: Impute with most frequent, then one-hot encode
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore',
drop='first', # Drop first category to avoid multicollinearity
sparse_output=False))
])
# Use ColumnTransformer to apply different transformers to different columns
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Now 'preprocessor' can be used as a single step to preprocess the DataFrame 'df'
# Example:
# X_processed = preprocessor.fit_transform(df)
# This 'preprocessor' can also be the first step in a larger pipeline including a model
# Example:
# from sklearn.linear_model import LogisticRegression
# full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
# ('classifier', LogisticRegression())])
# full_pipeline.fit(X_train, y_train) # Assuming X_train, y_train are defined
# predictions = full_pipeline.predict(X_test)
Conclusion
Thank You for Listening!
Data preprocessing is a critical step in the machine learning workflow. Techniques like imputation, feature scaling, outlier detection, dimensionality reduction, and encoding categorical variables, often implemented using Scikit-Learn tools like SimpleImputer, StandardScaler, MinMaxScaler, IsolationForest, PCA, OneHotEncoder, and Pipeline, are essential for preparing data and improving model performance and reliability.