This lecture covers essential data preprocessing techniques using the Scikit-Learn library:
- Handling Missing Data: Strategies to deal with absent values in datasets.
- Feature Scaling: Methods to standardize or normalize the range of feature values.
- Outlier Detection: Techniques to identify data points that deviate significantly from the rest.
- Dimensionality Reduction: Approaches to reduce the number of features while preserving important information.
- Encoding Categorical Variables: Converting categorical data into numerical format for machine learning models.
- Creating Preprocessing Pipelines: Combining multiple preprocessing steps into a single workflow.
Generating Sample Dataset
To illustrate the concepts, we first generate a sample dataset using numpy and pandas.
import numpy as np
import pandas as pd
# Set random seed for reproducibility
np.random.seed(42) # Using seed value 42
# Generate random data: 500 samples, 4 features
# Data follows a normal distribution (randn)
# Scaled by [10, 5, 1, 0.5] and shifted by [50, 30, 10, 5] per feature
data = np.random.randn(500, 4) * [10, 5, 1, 0.5] + [50, 30, 10, 5]
# Create a pandas DataFrame
df = pd.DataFrame(data,
                  columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'])
Viewing the Data Head
Let’s look at the first 5 rows using df.head():
df.head()
Output:
(Output: a table with the first five rows of Feature1–Feature4; the numeric values are not reproduced here.)
Handling Missing Data
Missing data is a common problem in real-world datasets. Imputation involves filling in these missing values.
Imputation Techniques
- Mean Imputation:
- Concept: For each feature (column), calculate the average (mean) of all non-missing values. Replace missing entries in that column with this calculated mean.
- Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of non-missing values for the feature and $x_i$ are the non-missing values.
- Median Imputation:
- Concept: Replace missing values in a feature with the median value of the non-missing entries in that feature. Median is often preferred over mean when outliers are present, as it is less sensitive to extreme values.
- Mode Imputation (Most Frequent):
- Concept: Replace missing values with the most frequently occurring value (mode) in the feature. This is typically used for categorical or discrete numerical features.
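Median and most-frequent imputation use the same SimpleImputer API as the mean example that follows. Below is a minimal sketch on a toy frame; the column names 'income' and 'city' are illustrative and not part of the lecture dataset.

```python
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative only)
toy = pd.DataFrame({'income': [30.0, 45.0, np.nan, 200.0, 38.0],
                    'city':   ['A', 'B', 'B', np.nan, 'B']})

# Median imputation for the numeric column (robust to the extreme value 200)
toy[['income']] = SimpleImputer(strategy='median').fit_transform(toy[['income']])

# Mode (most frequent) imputation for the categorical column
toy[['city']] = SimpleImputer(strategy='most_frequent').fit_transform(toy[['city']])
print(toy)
```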
Handling Missing Data in Scikit-Learn (Example)
Using SimpleImputer to perform mean imputation on ‘Feature3’.
from sklearn.impute import SimpleImputer
# 1. Introduce some missing values (NaN) into 'Feature3' for demonstration
# Rows 10 to 14 (exclusive of 15), column index 2 ('Feature3')
df.iloc[10:15, 2] = np.nan # Introduce missing values in 'Feature3'
# 2. Create an imputer instance using the 'mean' strategy
imputer = SimpleImputer(strategy='mean')
# 3. Fit the imputer to the 'Feature3' column and transform it
# The imputer learns the mean from the non-missing values
# Then it replaces the NaN values with the learned mean
# Note: fit_transform expects a 2D array, hence df[['Feature3']]
df[['Feature3']] = imputer.fit_transform(df[['Feature3']])
# Now df['Feature3'] has no missing values (rows 10-14 are filled with the mean)
Feature Scaling
Scaling features ensures that all features contribute more equally to model training, preventing features with larger values from dominating distance-based algorithms or gradient descent updates.
StandardScaler (Mean 0, Std 1)
- Concept: Transforms each feature so that it has a mean of $0$ and a standard deviation of $1$. This is also known as Z-score normalization.
- Formula: Each value $x$ is transformed into $z$ using:
  $$z = \frac{x - \mu}{\sigma}$$
  where:
  - $\mu$ is the mean of the feature.
  - $\sigma$ is the standard deviation of the feature.
- Effect:
- Features are centered around zero.
- Useful for models that are sensitive to feature scale or that work best with roughly centered, comparable-scale inputs (e.g., logistic regression, SVM, PCA).
- Prevents dominance of large-scale features in algorithms sensitive to feature magnitude.
StandardScaler in Scikit-Learn (Example)
from sklearn.preprocessing import StandardScaler
# 1. Create a StandardScaler instance
scaler = StandardScaler()
# 2. Fit the scaler to the data (calculates mean and std dev for each feature)
# and transform the data (applies the z-score formula)
scaled_data = scaler.fit_transform(df)
# 3. Create a new DataFrame with the scaled data
df_scaled = pd.DataFrame(scaled_data, columns=df.columns)
# df_scaled now contains the standardized data
MinMaxScaler (Range 0 to 1)
- Concept: Transforms each feature so that its values fall into a specific range, typically $[0, 1]$ or $[-1, 1]$.
- Formula (for range $[0, 1]$):
  $$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
  where:
  - $x$ is the original value.
  - $x_{\min}$ is the minimum value of the feature.
  - $x_{\max}$ is the maximum value of the feature.
- Effect:
- Scales data to a fixed range.
- Useful when algorithms require data in a specific bounded interval (e.g., some neural networks).
- Can be sensitive to outliers, as they determine the min/max values.
MinMaxScaler in Scikit-Learn (Example)
from sklearn.preprocessing import MinMaxScaler
# 1. Create a MinMaxScaler instance, specifying the desired range (0, 1)
minmax_scaler = MinMaxScaler(feature_range=(0, 1)) # Default range is (0, 1)
# 2. Fit the scaler (finds min and max for each feature)
# and transform the data (applies the scaling formula)
scaled_data_mm = minmax_scaler.fit_transform(df)
# 3. Create a new DataFrame with the min-max scaled data
df_minmax = pd.DataFrame(scaled_data_mm, columns=df.columns)
# df_minmax now contains the data scaled to the [0, 1] range
Viewing df_minmax.head() Output
df_minmax.head()
Output:
(Output: a table with the first five rows of the min-max scaled Feature1–Feature4; the numeric values are not reproduced here.)
(Note: The output values shown in the slide for df_minmax.head() do not appear to lie in the $0$ to $1$ range. This might be an error in the slide, potentially showing output from a different transformation such as StandardScaler again. The code itself correctly implements MinMaxScaler, scaling to the $[0, 1]$ range.)
Outlier Detection
Outliers are data points that are significantly different from other observations. They can skew results and negatively impact model performance.
Adding Outliers to the Data (Example Setup)
Let’s add some artificial outliers to our dataset.
# Define outlier data points
outliers = np.array([
[200, 300, 20, 10], # Outlier 1
[180, 280, 25, 7], # Outlier 2
[160, -100, -5, 2] # Outlier 3
])
# Create a DataFrame for the outliers
outliers_df = pd.DataFrame(outliers, columns=df.columns)
# Concatenate the original DataFrame with the outliers DataFrame
# ignore_index=True resets the index for the new combined DataFrame
df = pd.concat([df, outliers_df], ignore_index=True)
# df now contains the original 500 points plus 3 outlier points
Isolation Forest
- Concept: Anomaly detection (or outlier detection) identifies data points deviating significantly from the norm. Isolation Forest is an efficient algorithm specifically designed for this purpose.
- Principle: It works by randomly partitioning the data until each data point is isolated. Anomalies are typically easier to isolate (require fewer partitions) than normal points.
Steps in Isolation Forest
- Step 1: Random Partitioning
  - The algorithm builds multiple isolation trees (iTrees). For each iTree:
    - A random subsample of the data is selected (without replacement).
    - The tree is built by recursively:
      - Selecting a random feature.
      - Selecting a random split value between the minimum and maximum values of that feature in the subsample.
    - This continues until each point in the subsample is in its own leaf node, or a predefined maximum tree depth is reached.
- Step 2: Path Length
  - Key Idea: Anomalies, being different and fewer, are expected to be isolated in fewer steps (shorter paths from the root node) than normal instances, which are often clustered.
  - Path Length Definition: The path length $h(x)$ of a data point $x$ in an iTree is the number of edges traversed from the root node to the terminal (leaf) node containing that point.
  - Interpretation: Shorter path lengths suggest a higher likelihood of being an anomaly.
- Step 3: Averaging Over Multiple Trees
  - Forest of iTrees: Because partitioning is random, a single tree might not be reliable. The algorithm builds a forest of iTrees (100 by default in Scikit-Learn).
  - Robust Isolation Measure: The average path length of a data point across all trees in the forest provides a more stable indicator of how easily it can be isolated.
  - Benefit: Averaging reduces the variance associated with any single random partition, improving the accuracy and reliability of anomaly detection.
Anomaly Score
- Definition: The anomaly score $s(x, n)$ quantifies how anomalous a data point $x$ is, given the subsample size $n$ used to build the trees.
- Formula:
  $$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$
  Where:
  - $h(x)$: Path length of point $x$ in a single iTree.
  - $E[h(x)]$: Average path length of point $x$ across all iTrees in the forest.
  - $c(n)$: Normalization factor, approximating the average path length of an unsuccessful search in a Binary Search Tree (BST) with $n$ nodes. It is used to normalize the average path length $E[h(x)]$.
  - Calculation of $c(n)$: $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$, where $H(i)$ is the harmonic number, which can be approximated as $H(i) \approx \ln(i) + 0.5772156649$ (the Euler–Mascheroni constant).
- Interpretation:
  - $s \to 1$: Strong anomaly (average path length is very small).
  - $s \ll 0.5$: Likely normal instance (average path length is larger).
  - $s \approx 0.5$ for all points: No clear anomaly within the dataset (average path lengths are close to the average expected for the sample size).
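As a rough illustration of the formula above (not Scikit-Learn code; the helper names c_factor and anomaly_score are made up for this sketch):

```python
import numpy as np

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def c_factor(n):
    # c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated as ln(i) + EULER_GAMMA
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s(x, n) = 2 ** (-E[h(x)] / c(n))
    return 2.0 ** (-avg_path_length / c_factor(n))

# With a subsample of n = 256, c(n) is roughly 10.2: a point isolated after ~4 splits
# on average scores about 0.76 (anomalous), while one needing ~14 splits scores about 0.39.
print(anomaly_score(4.0, 256), anomaly_score(14.0, 256))
```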
Hyperparameters
Key hyperparameters for IsolationForest in Scikit-Learn:
- `n_estimators`: The number of iTrees to build in the forest. More trees generally lead to more reliable results but increase computational cost. (Default: 100)
- `max_samples`: The number (or proportion) of data points used to build each iTree. Smaller subsamples can speed up training and sometimes improve performance by reducing “swamping” (where normal instances obscure anomalies) and “masking” (where multiple anomalies obscure each other) effects. (Default: 'auto', meaning min(256, n_samples))
- `contamination`: The expected proportion of outliers in the dataset. It defines the threshold on the anomaly scores used by the `predict` method; it does not affect the scores themselves, only their conversion to labels (-1 for outlier, 1 for inlier). (Default: 'auto', which sets the threshold as in the original Isolation Forest paper)
- `max_features`: The number (or proportion) of features drawn from the data to train each iTree. (Default: 1.0, i.e., use all features)
- `random_state`: Controls the pseudo-randomness for selecting samples and features, ensuring reproducibility.
Parameters in IsolationForest Class (Summary Table)
| Parameter | Description | Effect |
|---|---|---|
contamination | Proportion of expected outliers | Controls the threshold for classifying anomalies via predict(). |
n_estimators | Number of trees in the ensemble | More trees improve stability but increase computation time. |
max_samples | Number of samples per tree | Affects tree depth and anomaly detection sensitivity. |
max_features | Number of features drawn per tree | Limits the features each tree is trained on. |
random_state | Controls randomness of splits/samples | Ensures reproducibility. |
Isolation Forest Example (Conceptual 2D)
- Data Generation: Imagine a 2D dataset with a dense cluster of normal points and a few outlier points lying far away from the cluster.
- (See plot on slide 19: blue ‘X’s form a cluster, red ‘X’s are scattered outliers).
- Tree Construction: The algorithm randomly selects a feature (x or y) and a split value.
- Isolation: Outliers far from the cluster are likely isolated quickly; a single random split can often separate such a point, whereas points within the dense cluster require many more splits.
- Anomaly Score Calculation: After building multiple trees and averaging path lengths, the outliers will have significantly shorter average path lengths, resulting in anomaly scores close to $1$. Inlier points in the cluster will have longer average paths and scores closer to $0.5$ or lower.
Isolation Tree Visualization
(Slide 21 shows a visualization of a single complex Isolation Tree. Nodes contain split conditions (e.g., feature_i <= value), sample counts, and potentially error or value metrics depending on the visualization tool. It illustrates the recursive partitioning process.)
Using IsolationForest (Code Example)
from sklearn.ensemble import IsolationForest
# 1. Create an IsolationForest instance with specific hyperparameters
iso_forest = IsolationForest(
contamination=0.01, # Expect ~1% of the data to be outliers
n_estimators=100, # Use 100 trees for stability
max_samples=256, # Use 256 random samples per tree (if dataset size allows)
    max_features=2,      # Draw 2 of the 4 features to train each tree
random_state=42 # Ensure reproducibility
)
# 2. Fit the model to the data and predict outliers
# fit_predict fits the model and returns labels: 1 for inlier, -1 for outlier
outlier_labels = iso_forest.fit_predict(df) # Assuming df has the outliers added
# 3. Add the labels to the DataFrame
df['Outlier'] = outlier_labels
# 4. Check the counts of inliers and outliers
print(df['Outlier'].value_counts())
Output of value_counts()
Outlier
1 497
-1 6
Name: count, dtype: int64
(This output shows that the model labelled 6 points as outliers (-1) and 497 as inliers (1), based on the contamination=0.01 threshold and the data structure. Note: the number of outliers found can differ from the 3 that were added, since contamination=0.01 flags roughly 1% of the 503 points.)
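Besides the ±1 labels, the fitted model also exposes continuous scores through decision_function (negative values fall below the outlier threshold) and score_samples. A short sketch, assuming the df and iso_forest objects from above:

```python
# Continuous anomaly scores for the same rows (drop the label column added above)
features = df.drop(columns='Outlier')
scores = iso_forest.decision_function(features)  # negative values correspond to predicted outliers
# The most anomalous rows have the smallest (most negative) scores
print(pd.Series(scores, index=df.index).nsmallest(5))
```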
Removing Outliers (Code)
Once outliers are identified, they can be removed from the dataset if desired.
# Filter the DataFrame to keep only rows where 'Outlier' is 1 (inliers)
# Drop the 'Outlier' column itself afterwards
df_no_outliers = df[df['Outlier'] == 1].drop(columns='Outlier')
# Check the shape of the DataFrame without outliers
print(df_no_outliers.shape)
Two Cluster Example (Isolation Forest)
This example demonstrates Isolation Forest on data with two distinct clusters (normal data) and some scattered outliers.
Generating Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
# Generate data
rng = np.random.RandomState(42)
# Generate train data - normal data consisting of two clusters
# Cluster 1: 100 points centered around (2, 2) with some noise
# Cluster 2: 100 points centered around (-2, -2) with some noise
X_train = 0.3 * rng.randn(100, 2)
X_train = np.r_[X_train + 2, X_train - 2] # Combine clusters (total 200 normal points)
# Generate some outliers scattered in the range [-4, 4] x [-4, 4]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2)) # 20 outlier points
# Combine normal data and outliers
X = np.r_[X_train, X_outliers] # Total 220 points
Fitting the Model and Predicting
# Fit the model
# Note: Fit ONLY on the training data (normal data) if possible,
# as contamination is often defined relative to normal behavior.
# Here, we fit on X_train to learn the structure of normal points.
clf = IsolationForest(n_estimators=100,
max_samples='auto',
contamination='auto', # Let the algorithm estimate contamination
random_state=42)
clf.fit(X_train) # Fit on the normal data
# Predict - 1 for inliers, -1 for outliers on the combined dataset
y_pred = clf.predict(X)
Plotting the Results
Plotting Code 1 (Original Data)
# Plot the data points before classification
plt.figure(figsize=(8, 6))
# Plot inliers (first 200 points)
plt.scatter(X[:200, 0], X[:200, 1], c='white', edgecolors='k', s=20, label='Inliers')
# Plot outliers (last 20 points)
plt.scatter(X[200:, 0], X[200:, 1], c='red', edgecolors='k', s=20, label='Outliers')
plt.title("Isolation Forest Example Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
(See slide 29 for the plot. Shows two clusters (white circles) and scattered outliers (red circles).)
Plotting Code 2 (Predicted Labels)
# Plot the data points colored by predicted labels
plt.figure(figsize=(8, 6))
# Color points by predicted label: 'white' for inliers (1), 'red' for outliers (-1)
for color in ['white', 'red']:
label_val = 1 if color == 'white' else -1
idx = (y_pred == label_val) # Find indices where prediction matches the label
plt.scatter(X[idx, 0], X[idx, 1], c=color, edgecolors='k', s=50,
label=('Inliers' if color == 'white' else 'Outliers'))
plt.title("Isolation Forest Example Prediction")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
(See slide 31 for the plot. Shows points colored by prediction. Most cluster points are white (inliers), scattered points are red (outliers). Some points near cluster edges might also be marked as outliers.)
Dimensionality Reduction: Principal Component Analysis (PCA)
Motivation
Having too many features (high dimensionality) can lead to:
- Overfitting: Model learns training data too well, including noise, and performs poorly on new data.
- Slower Computation: More features mean more calculations.
- Lower Accuracy: Irrelevant features (noise) can degrade model performance.
- Curse of Dimensionality: Data becomes sparse in high dimensions, making distance measures less meaningful.
PCA Concept
- Definition: PCA is a widely used linear dimensionality reduction technique.
- Goal: To find a new set of uncorrelated variables, called principal components (PCs), that capture the maximum possible variance from the original data.
- Prioritization: PCA prioritizes the directions (axes) in the feature space where the data varies the most. The assumption is that directions with more variance contain more information.
Key Mathematical Concepts
- Covariance Matrix:
  - Describes the linear relationships (covariances) between pairs of variables in a dataset.
  - For a dataset with $d$ variables, the covariance matrix $\Sigma$ is a $d \times d$ matrix.
  - Element $\Sigma_{ij}$ represents the covariance between variable $i$ and variable $j$.
  - Diagonal elements $\Sigma_{ii}$ represent the variance of variable $i$.
- Eigenvectors and Eigenvalues:
  - Fundamental concepts in linear algebra.
  - For a square matrix (like the covariance matrix $\Sigma$), an eigenvector $v$ is a non-zero vector that, when multiplied by the matrix, results in a scaled version of itself. The scaling factor is the corresponding eigenvalue $\lambda$.
  - Equation: $\Sigma v = \lambda v$
  - In PCA Context:
    - Eigenvectors of the covariance matrix represent the directions of the principal components (the new axes).
    - Eigenvalues represent the amount of variance explained by each corresponding principal component (eigenvector). Larger eigenvalues correspond to principal components that capture more variance.
- Projection onto Principal Components:
  - Once the principal components (eigenvectors) are found, the original data can be projected onto these new axes.
  - This projection transforms the data from the original coordinate system to the new coordinate system defined by the principal components.
  - Mathematically: If $X$ is the standardized data matrix ($n$ samples $\times$ $d$ features) and $W$ is the matrix whose columns are the selected top $k$ eigenvectors ($d \times k$), the projected data $Z$ ($n \times k$) is calculated as:
    $$Z = XW$$
PCA Algorithm Steps
- Data Standardization: Standardize the data (typically using StandardScaler) so that each feature has zero mean ($\mu = 0$) and unit variance ($\sigma = 1$). This is crucial because PCA is sensitive to the scale of variables; features with larger variances would otherwise dominate the principal components.
- Covariance Matrix Calculation: Calculate the covariance matrix $\Sigma$ of the standardized data.
- Eigenvalue Decomposition: Perform eigenvalue decomposition (also known as eigendecomposition) on the covariance matrix $\Sigma$. This yields the eigenvectors $v_i$ and their corresponding eigenvalues $\lambda_i$.
- Selection of Principal Components:
  - Sort the eigenvectors by their corresponding eigenvalues in descending order ($\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$).
  - Select the top $k$ eigenvectors (those corresponding to the largest eigenvalues), where $k$ is the desired number of dimensions for the reduced data. The choice of $k$ often depends on the amount of variance to retain (e.g., choose $k$ so that the top $k$ eigenvalues account for 95% of the sum of all eigenvalues).
- Projection of Data: Project the standardized data onto the selected principal components (eigenvectors) by taking the dot product of the standardized data matrix and the matrix of selected eigenvectors, $Z = XW$. The result is the lower-dimensional representation of the data.
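These steps map directly onto a few NumPy calls. Below is a hedged sketch of a manual PCA (the function name pca_manual is made up here; Scikit-Learn's PCA uses an SVD internally rather than an explicit eigendecomposition):

```python
import numpy as np

def pca_manual(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # 1. Standardize each feature (population std, as StandardScaler does)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition (eigh is appropriate for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue in descending order and keep the top k eigenvectors
    order = np.argsort(eigenvalues)[::-1]
    W = eigenvectors[:, order[:k]]
    # 5. Project: Z = X_std W
    return X_std @ W

# Example: four points lying close to a line, reduced to one dimension
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1]])
print(pca_manual(X, k=1))
```

In Scikit-Learn, passing a float such as PCA(n_components=0.95) instead selects the number of components automatically so that 95% of the variance is retained.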
Mathematical Example
Let’s consider a simple 2D dataset with two data points: $x_1 = (2, 4)$ and $x_2 = (4, 2)$.
- Step 1: Data Standardization
  - Calculate the means: $\mu_1 = 3$, $\mu_2 = 3$, so the mean vector is $(3, 3)$.
  - Center the data: $x_1 - \mu = (-1, 1)$ and $x_2 - \mu = (1, -1)$.
  - Calculate the standard deviations (using $n-1$ in the denominator for the sample standard deviation): $s_1 = s_2 = \sqrt{2}$.
  - Standardize: the standardized data matrix is $X = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$.
- Step 2: Covariance Matrix Calculation
  - $\Sigma = \frac{1}{n-1} X^T X = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$.
- Step 3: Eigenvalue Decomposition
  - Solve the characteristic equation $\det(\Sigma - \lambda I) = 0$: $(1 - \lambda)^2 - 1 = 0$, i.e., $\lambda(\lambda - 2) = 0$.
  - The eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 0$.
  - For $\lambda_1 = 2$: $(\Sigma - 2I)v = 0$ with $v = (a, b)^T$ gives $-a - b = 0$, or $b = -a$. Normalized eigenvector $\frac{1}{\sqrt{2}}(1, -1)^T$ (or its negative). To match slide 41 we take $v_1 = \frac{1}{\sqrt{2}}(-1, 1)^T$.
  - For $\lambda_2 = 0$: $\Sigma v = 0$ gives $a - b = 0$, or $a = b$. Normalized eigenvector $v_2 = \frac{1}{\sqrt{2}}(1, 1)^T$.
- Step 4: Selection of Principal Components
  - The largest eigenvalue is $\lambda_1 = 2$. The corresponding eigenvector (principal component) is $v_1 = \frac{1}{\sqrt{2}}(-1, 1)^T$.
  - We choose to keep only this component (reduce the dimensionality to $k = 1$).
- Step 5: Projection of Data
  - Project the standardized data onto $v_1$: $Z = X v_1$.
  - $z_1 = \left(-\tfrac{1}{\sqrt{2}}\right)\left(-\tfrac{1}{\sqrt{2}}\right) + \left(\tfrac{1}{\sqrt{2}}\right)\left(\tfrac{1}{\sqrt{2}}\right) = 1$ and $z_2 = \left(\tfrac{1}{\sqrt{2}}\right)\left(-\tfrac{1}{\sqrt{2}}\right) + \left(-\tfrac{1}{\sqrt{2}}\right)\left(\tfrac{1}{\sqrt{2}}\right) = -1$.
  - The reduced data consists of the scalar values $1$ and $-1$.
Implementation using Sklearn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data (replace with your data or use the example data)
# Example data from slide 44: X = np.array([[1, 2], [2, 4], [3, 6], [4, 8]])
# Let's use the mathematical example data:
X = np.array([[2., 4.], [4., 2.]]) # Using floats
# 1. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Note: StandardScaler divides by the population std (denominator n), so here
# X_scaled = [[-1., 1.], [1., -1.]]. The manual example above used the sample std
# (denominator n-1), so its standardized values are smaller by a factor of sqrt(2).
# 2. Create a PCA object and specify the number of components (k=1)
pca = PCA(n_components=1) # Reduce to 1 dimension
# 3. Fit the PCA model to the scaled data and transform the data
X_reduced = pca.fit_transform(X_scaled)
# Print the reduced data
print(X_reduced)
# Output is approximately [[1.41421356], [-1.41421356]] or [[-1.41421356], [1.41421356]],
# depending on the eigenvector sign chosen by the solver.
# Note: the manual calculation yielded [1, -1]. The factor of sqrt(2) comes from the
# standardization step: StandardScaler divides by the population std (n), while the manual
# example used the sample std (n-1), so sklearn's scaled data and hence its projections
# are larger by sqrt(2). Both represent the same one-dimensional projection.
Encoding Categorical Variables
Machine learning algorithms typically require numerical input. Categorical variables (text labels) need to be converted into numbers.
Common Encoders
- `LabelEncoder`: Assigns a unique integer to each category, e.g., ['red', 'green', 'blue'] → [2, 1, 0] (categories are sorted, so 'blue' → 0, 'green' → 1, 'red' → 2). Suitable for encoding target variables, but generally not for features, as it implies an arbitrary order (a minimal sketch follows below).
- `OneHotEncoder`: Creates a new binary (0 or 1) column for each category, e.g., with categories ['blue', 'green', 'red'], 'red' → [0, 0, 1] and 'green' → [0, 1, 0]. Prevents the model from assuming an order. Suitable for nominal features, but can lead to high dimensionality if there are many categories.
- `OrdinalEncoder`: Similar to `LabelEncoder` but designed for features. Assigns integers based on a specified order (if known) or automatically, e.g., ['Low', 'Medium', 'High'] → [0, 1, 2]. Suitable for ordinal features where the order matters.
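A minimal LabelEncoder sketch on an illustrative target vector (the 'spam'/'ham' labels are made up for this example):

```python
from sklearn.preprocessing import LabelEncoder

y = ['spam', 'ham', 'spam', 'ham', 'spam']          # illustrative target labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)          # categories are sorted: 'ham' -> 0, 'spam' -> 1
print(label_encoder.classes_)                       # ['ham' 'spam']
print(y_encoded)                                    # [1 0 1 0 1]
print(label_encoder.inverse_transform(y_encoded))   # back to the original string labels
```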
Example Setup
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
# Generate sample classification data
X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
df = pd.DataFrame(X, columns=['num_feat1', 'num_feat2', 'num_feat3', 'num_feat4'])
# Add categorical features
# Nominal feature: 'cat_feat1'
df['cat_feat1'] = np.random.choice(['A', 'B', 'C'], size=1000)
# Ordinal feature: 'cat_feat2'
df['cat_feat2'] = np.random.choice(['Low', 'Medium', 'High'], size=1000)
One-Hot Encoding Example
# One-Hot Encoding for nominal 'cat_feat1'
onehot_encoder = OneHotEncoder(sparse_output=False, # Get dense array output
handle_unknown='ignore') # How to handle categories seen in transform but not fit
# Fit and transform the column
cat_feat1_encoded_array = onehot_encoder.fit_transform(df[['cat_feat1']])
# Create a DataFrame with meaningful column names
cat_feat1_encoded = pd.DataFrame(
cat_feat1_encoded_array,
columns=onehot_encoder.get_feature_names_out(['cat_feat1']) # e.g., ['cat_feat1_A', 'cat_feat1_B', 'cat_feat1_C']
)
# This cat_feat1_encoded DataFrame can be concatenated back to the main df
# (after potentially dropping the original 'cat_feat1' column)
(Note: sparse=False was deprecated and replaced by sparse_output=False in newer Scikit-learn versions.)
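The ordinal feature 'cat_feat2' was not encoded above. A hedged sketch using OrdinalEncoder with an explicit category order, assuming the df from the example setup (OrdinalEncoder was already imported there):

```python
# Ordinal encoding for 'cat_feat2', preserving the Low < Medium < High order
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['cat_feat2_encoded'] = ordinal_encoder.fit_transform(df[['cat_feat2']]).ravel()
# 'Low' -> 0.0, 'Medium' -> 1.0, 'High' -> 2.0
```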
Creating a Preprocessing Pipeline
Scikit-Learn’s Pipeline allows chaining multiple preprocessing steps and a final estimator (model). This simplifies workflows, prevents data leakage during cross-validation, and makes code cleaner.
Example Pipeline
Combining imputation and scaling for numeric features, and imputation and one-hot encoding for categorical features.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer # Needed to apply different steps to different columns
# Define numeric and categorical feature names
numeric_features = ['num_feat1', 'num_feat2', 'num_feat3', 'num_feat4']
categorical_features = ['cat_feat1', 'cat_feat2']
# Create pipeline for numeric features: Impute with mean, then scale
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
# Create pipeline for categorical features: Impute with most frequent, then one-hot encode
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore',
drop='first', # Drop first category to avoid multicollinearity
sparse_output=False))
])
# Use ColumnTransformer to apply different transformers to different columns
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Now 'preprocessor' can be used as a single step to preprocess the DataFrame 'df'
# Example:
# X_processed = preprocessor.fit_transform(df)
# This 'preprocessor' can also be the first step in a larger pipeline including a model
# Example:
# from sklearn.linear_model import LogisticRegression
# full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
# ('classifier', LogisticRegression())])
# full_pipeline.fit(X_train, y_train) # Assuming X_train, y_train are defined
# predictions = full_pipeline.predict(X_test)
Conclusion
Thank You for Listening!
Data preprocessing is a critical step in the machine learning workflow. Techniques like imputation, feature scaling, outlier detection, dimensionality reduction, and encoding categorical variables, often implemented using Scikit-Learn tools like SimpleImputer, StandardScaler, MinMaxScaler, IsolationForest, PCA, OneHotEncoder, and Pipeline, are essential for preparing data and improving model performance and reliability.