What: The process of partitioning a digital image into multiple segments (sets of pixels, also known as image regions).
Goal: To simplify or change the representation of an image into something more meaningful and easier to analyze. Specifically, to cluster pixels into regions corresponding to individual surfaces, objects, or natural parts of objects.
Approach Discussed: Primarily bottom-up segmentation, meaning it relies on local pixel information (brightness, color, texture) without prior knowledge or recognition of specific objects (top-down information).
Input: Often image brightness (grayscale), but techniques can extend to colour, motion, or stereo disparity.
Example Segmentations
Simple scenes show clear delineation of surfaces.
Complex scenes (horses, tiger) show varying results from different algorithms.
Key Property: Segmentation boundaries are closed, defining distinct regions, unlike edge detection results which can be fragmented.
Key Questions
How well can bottom-up segmentation work without object recognition?
How to mathematically define a “good” segment?
How to solve the defined segmentation problem(s)?
How to evaluate the results?
Observations on Example Algorithms
Algorithms mentioned: Local Variation (LV), Spectral Min-Cut (SMC), Human (H), Edge-augmented Mean-Shift (ED), Normalized Cut (NC).
Quality: Highly image-dependent. Works best on smoothly shaded surfaces with clear intensity steps between them.
Human vs. Machine: Humans likely use object recognition; these bottom-up methods don’t.
Complexity: Machine results are useful for simple images but less reliable for complex/cluttered ones.
Evaluation: Need standardized datasets because algorithms can fail on challenging images (clutter, camouflage).
Approaches to Segmentation
Clustering in Feature Space
What: Represent each pixel as a point in a multi-dimensional feature space. Segmentation becomes finding clusters in this space.
How: Define a feature vector F(x) for each pixel x. This can include:
Pixel coordinates: x
Intensity/Color: I(x)
Local features (e.g., texture, filter responses): L(x)
$F(x) = \big(x,\; I(x),\; L(x)\big)$ (the features stacked into a single vector)
Why: Pixels belonging to the same segment (e.g., same object part) are expected to have similar feature vectors and thus form a dense cluster in the feature space.
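As a minimal NumPy sketch (the `coord_weight` knob is an illustrative choice, and the texture features $L(x)$ are omitted for brevity), the per-pixel feature vectors can be built as:

```python
import numpy as np

def pixel_features(image, coord_weight=1.0):
    """Stack each pixel's coordinates and intensity into F(x) = (x, I(x)).

    Texture / filter-response features L(x) could be appended as extra
    columns in the same way. `coord_weight` (an illustrative parameter)
    trades spatial proximity against intensity similarity.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]                  # pixel coordinates
    F = np.stack([coord_weight * xs.ravel(),
                  coord_weight * ys.ravel(),
                  image.ravel().astype(float)],  # intensity I(x)
                 axis=1)
    return F                                     # shape (h*w, 3)

img = np.arange(6, dtype=float).reshape(2, 3)
F = pixel_features(img)                          # one row per pixel
```

Clustering then operates on the rows of `F`, so "nearby and similar-looking" pixels are close in this joint space.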
Mixture of Gaussians (MoG) Model
What: A specific method for feature space clustering.
How: Model the probability distribution of feature vectors F as a sum of K Gaussian distributions:
$p(F \mid M) = \sum_{k=1}^{K} \pi_k \, g(F \mid m_k, \Sigma_k)$
$\pi_k$: mixing coefficient for cluster k ($\sum_k \pi_k = 1$)
$m_k, \Sigma_k$: mean and covariance of the k-th Gaussian component.
Fit parameters (πk,mk,Σk) using Maximum Likelihood on the image’s feature vectors.
Choose K using criteria like Minimum Description Length (MDL).
Segmentation: Assign pixel x to the cluster k that maximizes its “ownership” probability:
$c(x) = \arg\max_k \; \dfrac{\pi_k \, g(F(x) \mid m_k, \Sigma_k)}{p(F(x) \mid M)}$
Limitations: Assumes segments correspond to well-separated, Gaussian-shaped clusters in feature space. Can fail when this assumption is violated (e.g., complex textures, gradual changes). (See Blobworld examples).
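A sketch of the fit-then-assign pipeline, using scikit-learn's `GaussianMixture` as a stand-in for the maximum-likelihood fit (the 1-D synthetic intensities are illustrative; real use would cluster the full feature vectors $F(x)$, and K could be chosen by minimizing `mog.bic(F)`, a criterion closely related to MDL):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic "regions": dark pixels around 30, bright pixels around 200.
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(30.0, 2.0, size=(200, 1)),
               rng.normal(200.0, 2.0, size=(200, 1))])

# Fit pi_k, m_k, Sigma_k by maximum likelihood (EM), then label each
# pixel by the component with the largest ownership probability
# pi_k g(F | m_k, Sigma_k) / p(F | M) -- this is what .predict computes.
mog = GaussianMixture(n_components=2, random_state=0).fit(F)
labels = mog.predict(F)
```

With two such well-separated intensity populations, each synthetic region ends up in its own component.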
Mean-Shift Segmentation
What: Another feature space clustering method, but uses a non-parametric density estimate.
How (Density Estimation): Estimate the density pK(F) using kernel density estimation (e.g., Epanechnikov or Gaussian kernel K):
$p_K(F) = \frac{1}{|X|} \sum_{x \in X} K\big(F - F(x)\big)$
The kernel’s shape/bandwidth ($\Sigma$ in $K(e) = k(e^{\mathsf T} \Sigma^{-1} e)$) controls smoothness.
How (Clustering): Find the modes (peaks) of the estimated density pK(F).
For each feature point F(x), perform mean-shift iterations:
$F_{j+1} = \dfrac{\sum_{y \in X} w\big(F(y) - F_j\big)\, F(y)}{\sum_{y \in X} w\big(F(y) - F_j\big)}$
(This iteratively shifts the point towards the local mean, weighted by the kernel derivative, converging to a mode).
Segmentation: Assign pixel x to the segment corresponding to the mode its feature vector F(x) converges to. Segments are the domains of convergence (watersheds) of the mean-shift procedure.
Properties: Converges; trajectories tend not to cross strong edges (an “anti-edge detection” effect); can fragment regions with a constant intensity gradient (requires post-processing).
Enhancement (EDISON): Incorporate edge information to modify the weights in the mean-shift iteration, improving results by preserving weak boundaries without excessive oversegmentation.
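The mean-shift iteration above can be sketched directly in NumPy (a toy version with a Gaussian kernel; bandwidth and iteration count are illustrative choices, and the EDISON edge-weighting is not included):

```python
import numpy as np

def mean_shift_modes(F, bandwidth=1.0, n_iter=50):
    """Shift every feature point to the kernel-weighted mean of all
    points, repeatedly; each point settles at a mode of the density."""
    modes = F.astype(float).copy()
    for _ in range(n_iter):
        # w(F(y) - F_j): Gaussian weights around each current estimate
        d2 = ((modes[:, None, :] - F[None, :, :]) ** 2).sum(-1)
        w = np.exp(-0.5 * d2 / bandwidth**2)
        modes = (w @ F) / w.sum(axis=1, keepdims=True)
    return modes

# Two well-separated 1-D clusters converge to two distinct modes;
# pixels are then labelled by the mode they converge to.
F = np.array([[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]])
modes = mean_shift_modes(F, bandwidth=0.5)
```

This brute-force version is $O(n^2)$ per iteration; practical implementations restrict the sums to points within the kernel support.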
Similarity Graph Based Methods
What: Represent the image as a graph where nodes are pixels and edge weights represent similarity (or dissimilarity) between connected pixels. Segmentation becomes partitioning the graph.
How (Graph Construction):
Vertices V: Image pixels x.
Edges E: Connect nearby pixels $(x_i, x_j)$.
Weights $w(x_i, x_j)$ or affinities $a(x_i, x_j)$: based on similarity of the features $F(x_i)$ and $F(x_j)$. Example affinity: $a(x_i, x_j) = e^{-\|F(x_i) - F(x_j)\|^2 / (2\sigma^2)}$, or with $\Sigma^{-1}$ in the exponent for a Mahalanobis distance.
Goal: Find a partition of the graph nodes (pixels) into segments.
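A sketch of the construction (the dense matrix, `sigma`, and the spatial `radius` cutoff are illustrative choices — real implementations use sparse matrices and instantiate only the nearby-pixel edges):

```python
import numpy as np

def affinity_matrix(F, coords, sigma=10.0, radius=1.5):
    """Gaussian affinities a_ij = exp(-||F_i - F_j||^2 / (2 sigma^2)),
    kept only for pixel pairs within `radius` of each other."""
    d_feat = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    d_space = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d_feat / (2.0 * sigma**2))
    A[d_space > radius**2] = 0.0      # connect nearby pixels only
    np.fill_diagonal(A, 0.0)          # no self-edges
    return A

# 1-D "image": two flat regions with a large intensity step in the middle.
F = np.array([[0.0], [0.0], [100.0], [100.0]])
coords = np.array([[0.0], [1.0], [2.0], [3.0]])
A = affinity_matrix(F, coords)
```

Similar neighbours get affinity near 1, the edge across the step gets affinity near 0, and distant pixels are simply not connected.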
Connected Components (CC)
How: Simple approach: remove edges whose dissimilarity $w(x_i, x_j)$ exceeds a threshold $\tau$. Find the connected components in the remaining graph.
Limitation: Not robust. A single “weak link” (low weight edge) can incorrectly merge two distinct regions. Finding a good τ is hard.
Kruskal’s Algorithm: Can find CCs efficiently by building a Minimum Spanning Tree (MST) first, then removing edges above τ.
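A sketch with SciPy's `connected_components` (the 1-D chain, the intensity-difference dissimilarity, and $\tau = 5$ are all illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# 1-D "image" with a clear intensity step between pixels 2 and 3.
I = np.array([10.0, 11.0, 10.0, 50.0, 51.0, 50.0])
n, tau = len(I), 5.0

# Keep only edges whose dissimilarity w_ij <= tau ...
rows, cols = [], []
for i in range(n - 1):                # neighbours only (4-connectivity in 1-D)
    if abs(I[i] - I[i + 1]) <= tau:   # drop weak-similarity edges
        rows.append(i)
        cols.append(i + 1)
G = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# ... then label the connected components of what remains.
n_seg, labels = connected_components(G, directed=False)
```

The fragility noted above is visible here: a single spurious low-dissimilarity edge across the step would fuse both regions into one component.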
Local Variation Method (Felzenszwalb & Huttenlocher)
What: An efficient graph-based method using a modified Kruskal’s algorithm.
How: Builds segments (trees in a forest) by adding edges in increasing order of weight. Merges two components $C_i, C_j$ via edge $(x_k, x_l)$ only if the edge weight is small compared to the internal variation within both components:
$w(x_k, x_l) \le \min\big(T(C_i), T(C_j)\big)$
where $T(C_i) = w(C_i) + k / |C_i|$.
$w(C_i)$: maximum edge weight within component $C_i$ (its internal variation).
$|C_i|$: number of pixels in $C_i$.
k: merging threshold parameter.
Why: Adapts the merge criterion based on the existing variation within components. Large components require much smaller external variation (edge weight) to merge compared to their internal variation. Encourages growth of small regions.
Pros: Fast ($O(e \log e)$ in the number of edges $e$); tends to follow ‘true’ boundaries well initially.
Cons: Parameter k controls region size; can create narrow regions along boundaries.
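scikit-image ships an implementation of this algorithm; a sketch on a synthetic two-region image (its `scale` parameter plays the role of the merging threshold k above, and all parameter values here are illustrative):

```python
import numpy as np
from skimage.segmentation import felzenszwalb

# Two flat regions separated by a sharp vertical step.
img = np.zeros((20, 20))
img[:, 10:] = 1.0

# scale ~ k: larger values let components with more internal
# variation keep growing before the merge criterion stops them.
seg = felzenszwalb(img, scale=50.0, sigma=0.0, min_size=5)
```

On this clean input the step edge survives as the only segment boundary; on textured images the choice of `scale` controls how coarse the result is.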
Source-Sink Minimum Cut (S-T Min-Cut)
What: Uses graph cuts (min-cut/max-flow algorithms) for segmentation, often for foreground/background or between specific regions.
How:
Define ‘source’ seeds S and ‘sink’ seeds T (sets of pixels believed to be in different segments).
Construct a directed graph from the similarity graph: include edges $(x_i, x_j)$ and $(x_j, x_i)$, each with capacity $a(x_i, x_j)$.
Add a source node s connected to all pixels in S with infinite capacity edges.
Add a sink node t connected from all pixels in T with infinite capacity edges.
Find the minimum $s$–$t$ cut in this graph. The cut partitions the nodes (pixels) into two sets $F$ and $G$, one containing S and the other containing T. The cut cost is $L(F, G) = \sum_{x_i \in F,\, x_j \in G} a(x_i, x_j)$.
Why: Efficient algorithms exist for S-T min-cut. Finds the globally optimal boundary between the predefined S and T regions.
Use: Need a good method to generate seed sets S and T to find multiple segments (e.g., using spectral properties [6]).
Cons: Computationally intensive if many cuts are needed. Results depend heavily on seed selection.
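A sketch with SciPy's `maximum_flow` (by the max-flow/min-cut theorem, the returned flow value equals the min-cut cost). For simplicity the seed sets here are single pixels used directly as $s$ and $t$, rather than extra terminal nodes wired in with infinite-capacity edges; the chain graph and capacities are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_flow

# 1-D chain of 6 pixels; the middle affinity is weak, so the optimal cut
# separates {0,1,2} from {3,4,5} at cost 1. maximum_flow requires
# integer capacities.
caps = [10, 10, 1, 10, 10]
rows = list(range(5)) + list(range(1, 6))
cols = list(range(1, 6)) + list(range(5))
G = csr_matrix((caps + caps, (rows, cols)), shape=(6, 6), dtype=np.int32)

res = maximum_flow(G, 0, 5)   # source seed = pixel 0, sink seed = pixel 5
```

`res.flow_value` gives the min-cut cost; the cut itself can be recovered from the residual graph (the saturated edges).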
Normalized Cut (Ncut) (Shi & Malik)
What: A graph partitioning criterion that tries to avoid cutting off very small, isolated segments.
How (Criterion): Partition the graph into F and G=V∖F by minimizing the normalized cut:
$N(F, G) = \dfrac{L(F, G)}{L(F, V)} + \dfrac{L(F, G)}{L(G, V)} = L(F, G) \left( \dfrac{1}{\mathrm{assoc}(F, V)} + \dfrac{1}{\mathrm{assoc}(G, V)} \right)$
where $L(A, B) = \sum_{x_i \in A,\, x_j \in B} a(x_i, x_j)$ is the total connection weight between sets A and B, and $\mathrm{assoc}(F, V) = L(F, V)$ measures the total connection from F to the whole graph.
Why: Minimizing just the cut weight L(F,G) tends to produce small outlier segments. Ncut balances the cut cost with the “volume” or total connection of the segments.
Problem: Minimizing Ncut directly is NP-hard (computationally intractable).
How (Approximation - Spectral Method):
The Ncut minimization problem is related to finding an eigenvector.
Relax the discrete problem to a continuous one involving the generalized eigenvector problem related to the graph Laplacian.
Solve for the eigenvector $u$ corresponding to the second smallest eigenvalue of the generalized system $(D - A)\, y = \lambda D y$, or equivalently, the eigenvector for the second smallest eigenvalue of $I - D^{-1/2} A D^{-1/2}$ ($A$ is the affinity matrix; $D$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$).
Threshold this eigenvector (e.g., at 0 or search for the best threshold) to get a discrete partition (F,G).
Use: Can be applied recursively to get multiple segments. Can also use multiple eigenvectors for multi-way Ncut [14].
Limitation: The step of thresholding the eigenvector is an approximation, works best when there’s a clear bipartition in the data.
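A sketch of the spectral approximation on a toy graph of two tightly connected triangles joined by one weak link (affinity values are illustrative): build $I - D^{-1/2} A D^{-1/2}$, take the eigenvector of its second smallest eigenvalue, map it back through $D^{-1/2}$, and threshold at 0.

```python
import numpy as np

# Affinity matrix: two triangles {0,1,2} and {3,4,5}, one weak bridge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1

d = A.sum(axis=1)                          # degrees D_ii = sum_j A_ij
D_isqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(6) - D_isqrt @ A @ D_isqrt  # I - D^{-1/2} A D^{-1/2}

evals, evecs = np.linalg.eigh(L_sym)       # ascending eigenvalues
y = D_isqrt @ evecs[:, 1]                  # solves (D - A) y = lambda D y
partition = y > 0                          # threshold the eigenvector at 0
```

The sign pattern of `y` cleanly separates the two triangles, which is exactly the clear-bipartition case where the relaxation works well.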
Natural Image Boundaries
What: Contours separating distinct objects or surfaces (semantically meaningful).
Difference: Distinct from image edges, which are just rapid changes in brightness/color and might occur within an object (texture).
Relation to Segmentation: Boundary detection is complementary. Can be integrated into segmentation (e.g., EDISON uses edge saliency, Ncut can use intervening contour information in affinities [8]).
Evaluation
Berkeley Segmentation Database: A standard dataset with human-provided segmentations for benchmarking algorithms. Shows human variability.
Benchmarking: Use datasets (e.g., Berkeley, synthetic fractal images) to compare algorithms.
Metrics: Precision-Recall curves are often used for evaluating the quality of detected segment boundaries against ground truth. Human performance serves as an upper bound reference. (SE-MinCut, MeanShift often perform well in examples shown).
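A minimal sketch of boundary precision/recall (real benchmarks such as Berkeley's additionally match predicted to ground-truth boundary pixels within a small localisation tolerance, which this toy version omits):

```python
import numpy as np

gt   = np.array([0, 0, 1, 0, 0, 1, 0, 0])  # ground-truth boundary pixels
pred = np.array([0, 1, 1, 0, 0, 0, 0, 0])  # detector output

tp = int(np.sum((pred == 1) & (gt == 1)))  # true positives
precision = tp / int(pred.sum())  # fraction of detections that are correct
recall    = tp / int(gt.sum())    # fraction of true boundaries found
```

Sweeping the detector's confidence threshold and recomputing these two numbers traces out the precision-recall curve used in the benchmark plots.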