
python sklearn gmm

Of course! Here is a comprehensive guide to using Gaussian Mixture Models (GMM) in Python with scikit-learn.

What is a Gaussian Mixture Model (GMM)?

A GMM is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. It's a powerful clustering algorithm that is more flexible than K-Means.

Key Differences from K-Means:

  • Cluster shape: K-Means assumes spherical clusters of roughly equal size; GMM can model clusters of different shapes, sizes, and orientations (ellipsoids).
  • Assignment: K-Means makes hard assignments (each data point belongs to exactly one cluster); GMM makes soft assignments (each data point has a probability of belonging to each cluster).
  • Algorithm: K-Means minimizes within-cluster variance (sum of squared distances); GMM uses the Expectation-Maximization (EM) algorithm to maximize the likelihood of the data.
  • Output: K-Means returns a single cluster label for each point; GMM returns a probability distribution over clusters for each point.
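The hard-versus-soft assignment difference is easy to see in code. Here is a minimal sketch (assuming scikit-learn is installed; the toy data and parameter values are illustrative) that contrasts K-Means labels with GMM membership probabilities on the same data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means: one hard label per point
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# GMM: a probability distribution over the clusters for each point
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
proba = gmm.predict_proba(X)

print(km_labels[:3])       # single integer labels
print(proba[:3].round(3))  # each row sums to 1
```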

Step-by-Step Guide to GMM in scikit-learn

Let's walk through the process using a synthetic dataset.

Step 1: Installation and Imports

First, make sure you have scikit-learn, numpy, and matplotlib installed.

pip install scikit-learn numpy matplotlib

Now, import the necessary libraries.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

Step 2: Create a Synthetic Dataset

We'll create a dataset with 3 distinct blobs. This is a great way to test our clustering algorithm.

# Generate synthetic data
n_samples = 1500
random_state = 170
# Create 3 blobs with different variances
X, y_true = make_blobs(
    n_samples=n_samples,
    centers=3,
    cluster_std=[1.0, 2.5, 0.5], # Different standard deviations for each blob
    random_state=random_state
)
# Plot the true clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, s=40, cmap='viridis')
plt.title("True Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

This will show you the original data, which is clearly grouped into three clusters.

Step 3: Fit a GMM to the Data

Now, we'll instantiate and fit a GaussianMixture model. The most important parameter is n_components, which is the number of Gaussian distributions (clusters) you want to find.

# Instantiate the Gaussian Mixture Model
# We know there are 3 clusters, so n_components=3
gmm = GaussianMixture(n_components=3, random_state=random_state)
# Fit the model to the data
# The fit method learns the parameters (means, covariances, weights)
gmm.fit(X)

Step 4: Predict Cluster Labels and Probabilities

After fitting the model, we can use it to predict cluster assignments.

  • predict(X): Performs a hard assignment, giving the most likely cluster for each data point.
  • predict_proba(X): Performs a soft assignment, giving the probability of each data point belonging to each cluster.

# Predict the cluster for each data point (hard assignment)
y_gmm_pred = gmm.predict(X)
# Predict the probability for each data point (soft assignment)
proba = gmm.predict_proba(X)
print("Probabilities for the first 5 data points:\n", proba[:5])

Step 5: Visualize the GMM Results

Let's plot the data colored by the GMM's predicted labels.

# Plot the GMM clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_gmm_pred, s=40, cmap='viridis')
plt.title("GMM Predicted Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

As you can see, GMM successfully identified the three clusters, even though they have different sizes and densities.
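Note that GMM's cluster numbering is arbitrary, so comparing y_gmm_pred to y_true element-wise can be misleading. A permutation-invariant metric such as the adjusted Rand index is safer. A small self-contained sketch (it regenerates the same synthetic data rather than reusing the variables above):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=1500, centers=3,
                       cluster_std=[1.0, 2.5, 0.5], random_state=170)
y_pred = GaussianMixture(n_components=3, random_state=170).fit_predict(X)

# 1.0 means perfect agreement up to label permutation; ~0.0 means random
score = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand index: {score:.3f}")
```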

Step 6: Visualizing the Cluster Covariances (Ellipses)

A key strength of GMM is that it models the covariance of each cluster. We can visualize this by drawing nested ellipses that trace the density contours of each Gaussian component.

from matplotlib.patches import Ellipse

# Function to draw ellipses for GMM visualization
def plot_gmm(gmm, X, label=True):
    ax = plt.gca()
    if label:
        labels = gmm.fit(X).predict(X)
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    for pos, covar in zip(gmm.means_, gmm.covariances_):
        # Eigendecomposition gives the ellipse axes and orientation
        v, w = np.linalg.eigh(covar)
        u = w[0] / np.linalg.norm(w[0])
        angle = np.degrees(np.arctan2(u[1], u[0]))
        width, height = 2.0 * np.sqrt(2.0) * np.sqrt(v)
        # Draw nested contours of the Gaussian component
        for scale in np.linspace(0.5, 2.0, 5):
            ell = Ellipse(pos, scale * width, scale * height,
                          angle=angle, facecolor='none',
                          edgecolor='black', linewidth=0.5, zorder=1)
            ax.add_patch(ell)
plt.figure(figsize=(8, 6))
plot_gmm(gmm, X)
plt.title("GMM with Covariance Ellipses")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

This plot beautifully shows how GMM models each cluster with its own ellipse, capturing its shape, size, and orientation.


Choosing the Number of Components (n_components)

In a real-world scenario, you wouldn't know the number of clusters beforehand. You can use information-theoretic criteria like the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) to find the optimal number of components.

Both AIC and BIC penalize the model for having more parameters, helping you avoid overfitting. The model with the lower BIC or AIC score is generally preferred.

Let's plot the BIC and AIC for a range of component numbers.

n_components = np.arange(1, 11)
models = [GaussianMixture(n_components=n, covariance_type='full', random_state=0).fit(X)
          for n in n_components]
plt.figure(figsize=(10, 6))
plt.plot(n_components, [m.bic(X) for m in models], label='BIC')
plt.plot(n_components, [m.aic(X) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('Number of components')
plt.ylabel('Score (BIC/AIC)')
plt.title('BIC and AIC for GMM')
plt.show()

The plot will show a "knee" or an "elbow." In this case, both BIC and AIC are minimized at n_components=3, which is the correct number of clusters we generated.
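If you want to pick the component count programmatically rather than by eye, you can simply take the argmin of the BIC scores. A short self-contained sketch (it regenerates the same synthetic data used above):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1500, centers=3,
                  cluster_std=[1.0, 2.5, 0.5], random_state=170)

n_range = np.arange(1, 11)
# BIC for each candidate number of components (lower is better)
bics = [GaussianMixture(n_components=n, random_state=0).fit(X).bic(X)
        for n in n_range]
best_n = int(n_range[np.argmin(bics)])
print("Best n_components by BIC:", best_n)
```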


Full Code Example

Here is the complete, runnable script from start to finish.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
# --- 1. Generate Data ---
n_samples = 1500
random_state = 170
X, y_true = make_blobs(
    n_samples=n_samples,
    centers=3,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=random_state
)
# --- 2. Fit GMM ---
# We choose 3 components based on our knowledge of the data
gmm = GaussianMixture(n_components=3, random_state=random_state)
gmm.fit(X)
# --- 3. Predict and Visualize ---
y_gmm_pred = gmm.predict(X)
# Plot GMM results
plt.figure(figsize=(12, 5))
# Plot 1: GMM Predicted Clusters
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_gmm_pred, s=40, cmap='viridis')
plt.title("GMM Predicted Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
# Plot 2: GMM with Covariance Ellipses
plt.subplot(1, 2, 2)
# Function to draw ellipses
from matplotlib.patches import Ellipse

def plot_gmm(gmm, X, label=True):
    ax = plt.gca()
    if label:
        labels = gmm.fit(X).predict(X)
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    for pos, covar in zip(gmm.means_, gmm.covariances_):
        # Eigendecomposition gives the ellipse axes and orientation
        v, w = np.linalg.eigh(covar)
        u = w[0] / np.linalg.norm(w[0])
        angle = np.degrees(np.arctan2(u[1], u[0]))
        width, height = 2.0 * np.sqrt(2.0) * np.sqrt(v)
        # Draw nested contours of the Gaussian component
        for scale in np.linspace(0.5, 2.0, 5):
            ell = Ellipse(pos, scale * width, scale * height,
                          angle=angle, facecolor='none',
                          edgecolor='black', linewidth=0.5, zorder=1)
            ax.add_patch(ell)

plot_gmm(gmm, X)
plt.title("GMM with Covariance Ellipses")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.tight_layout()
plt.show()
# --- 4. Use BIC/AIC to Find Optimal Number of Components ---
n_components = np.arange(1, 11)
models = [GaussianMixture(n_components=n, covariance_type='full', random_state=0).fit(X)
          for n in n_components]
plt.figure(figsize=(8, 5))
plt.plot(n_components, [m.bic(X) for m in models], label='BIC')
plt.plot(n_components, [m.aic(X) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('Number of components')
plt.ylabel('Score (BIC/AIC)')
plt.title('BIC and AIC for GMM')
plt.show()

Key Parameters of GaussianMixture

  • n_components: The number of mixture components.
  • covariance_type: The type of covariance parameters to use.
    • 'full' (default): Each component has its own general covariance matrix. Most flexible but requires the most parameters.
    • 'tied': All components share the same general covariance matrix.
    • 'diag': Each component has its own diagonal covariance matrix (features are independent).
    • 'spherical': Each component has its own single variance (features are independent and have the same variance).
  • init_params: Method for initializing the weights, means, and covariances. 'kmeans' is a robust default.
  • max_iter: Maximum number of EM iterations to perform.
  • random_state: For reproducibility of the results.
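The effect of covariance_type on model complexity shows up directly in the shape of the fitted covariances_ array. A quick sketch (the data and component count are arbitrary, chosen just to make the shapes concrete):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # 2 features

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    # 'full': (3, 2, 2), 'tied': (2, 2), 'diag': (3, 2), 'spherical': (3,)
    shapes[cov_type] = gmm.covariances_.shape
    print(f"{cov_type:9s} covariances_.shape = {gmm.covariances_.shape}")
```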