
python sklearn gmm

Of course! Here is a comprehensive guide to using Gaussian Mixture Models (GMM) in Python with scikit-learn.

What is a Gaussian Mixture Model (GMM)?

A GMM is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. It's a powerful clustering algorithm that is more flexible than K-Means.

Key Differences from K-Means:

  • Cluster shape: K-Means assumes spherical clusters of roughly equal size; GMM can model clusters of different shapes, sizes, and orientations (ellipsoids).
  • Assignment: K-Means makes hard assignments (each data point belongs to exactly one cluster); GMM makes soft assignments (each data point has a probability of belonging to each cluster).
  • Algorithm: K-Means minimizes within-cluster variance (sum of squared distances); GMM uses the Expectation-Maximization (EM) algorithm to maximize the likelihood of the data.
  • Output: K-Means returns a single cluster label for each point; GMM returns a probability distribution over clusters for each point.
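The hard-versus-soft assignment difference is easy to see in code. Here is a minimal sketch (assuming scikit-learn is installed; the toy data and parameter values are illustrative) that contrasts K-Means labels with GMM membership probabilities on the same data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means: one hard label per point
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# GMM: a probability distribution over the clusters for each point
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
proba = gmm.predict_proba(X)

print(km_labels[:3])       # single integer labels
print(proba[:3].round(3))  # each row sums to 1
```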

Step-by-Step Guide to GMM in scikit-learn

Let's walk through the process using a synthetic dataset.

Step 1: Installation and Imports

First, make sure you have scikit-learn, numpy, and matplotlib installed.

pip install scikit-learn numpy matplotlib

Now, import the necessary libraries.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

Step 2: Create a Synthetic Dataset

We'll create a dataset with 3 distinct blobs. This is a great way to test our clustering algorithm.

# Generate synthetic data
n_samples = 1500
random_state = 170
# Create 3 blobs with different variances
X, y_true = make_blobs(
    n_samples=n_samples,
    centers=3,
    cluster_std=[1.0, 2.5, 0.5], # Different standard deviations for each blob
    random_state=random_state
)
# Plot the true clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, s=40, cmap='viridis')
plt.title("True Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

This will show you the original data, which is clearly grouped into three clusters.

Step 3: Fit a GMM to the Data

Now, we'll instantiate and fit a GaussianMixture model. The most important parameter is n_components, which is the number of Gaussian distributions (clusters) you want to find.

# Instantiate the Gaussian Mixture Model
# We know there are 3 clusters, so n_components=3
gmm = GaussianMixture(n_components=3, random_state=random_state)
# Fit the model to the data
# The fit method learns the parameters (means, covariances, weights)
gmm.fit(X)

Step 4: Predict Cluster Labels and Probabilities

After fitting the model, we can use it to predict cluster assignments.

  • predict(X): Performs a hard assignment, giving the most likely cluster for each data point.
  • predict_proba(X): Performs a soft assignment, giving the probability of each data point belonging to each cluster.

# Predict the cluster for each data point (hard assignment)
y_gmm_pred = gmm.predict(X)
# Predict the probability for each data point (soft assignment)
proba = gmm.predict_proba(X)
print("Probabilities for the first 5 data points:\n", proba[:5])

Step 5: Visualize the GMM Results

Let's plot the data colored by the GMM's predicted labels.

# Plot the GMM clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_gmm_pred, s=40, cmap='viridis')
plt.title("GMM Predicted Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

As you can see, GMM successfully identified the three clusters, even though they have different sizes and densities.
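Note that GMM's cluster numbering is arbitrary, so comparing y_gmm_pred to y_true element-wise can be misleading. A permutation-invariant metric such as the adjusted Rand index is safer. A small self-contained sketch (it regenerates the same synthetic data rather than reusing the variables above):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=1500, centers=3,
                       cluster_std=[1.0, 2.5, 0.5], random_state=170)
y_pred = GaussianMixture(n_components=3, random_state=170).fit_predict(X)

# 1.0 means perfect agreement up to label permutation; ~0.0 means random
score = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand index: {score:.3f}")
```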

Step 6: Visualizing the Cluster Covariances (Ellipses)

A key strength of GMM is that it models the covariance of each cluster. We can visualize this by drawing nested ellipses that trace the density contours of each Gaussian component.

from matplotlib.patches import Ellipse

# Function to draw ellipses for GMM visualization
def plot_gmm(gmm, X, label=True):
    ax = plt.gca()
    if label:
        labels = gmm.fit(X).predict(X)
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    for pos, covar in zip(gmm.means_, gmm.covariances_):
        # Eigendecomposition gives the ellipse axes and orientation
        v, w = np.linalg.eigh(covar)
        u = w[0] / np.linalg.norm(w[0])
        angle = np.degrees(np.arctan2(u[1], u[0]))
        width, height = 2.0 * np.sqrt(2.0) * np.sqrt(v)
        # Draw nested contours of the Gaussian component
        for scale in np.linspace(0.5, 2.0, 5):
            ell = Ellipse(pos, scale * width, scale * height,
                          angle=angle, facecolor='none',
                          edgecolor='black', linewidth=0.5, zorder=1)
            ax.add_patch(ell)
plt.figure(figsize=(8, 6))
plot_gmm(gmm, X)
plt.title("GMM with Covariance Ellipses")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

This plot beautifully shows how GMM models each cluster with its own ellipse, capturing its shape, size, and orientation.


Choosing the Number of Components (n_components)

In a real-world scenario, you wouldn't know the number of clusters beforehand. You can use information-theoretic criteria like the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) to find the optimal number of components.

Both AIC and BIC penalize the model for having more parameters, helping you avoid overfitting. The model with the lower BIC or AIC score is generally preferred.

Let's plot the BIC and AIC for a range of component numbers.

n_components = np.arange(1, 11)
models = [GaussianMixture(n_components=n, covariance_type='full', random_state=0).fit(X)
          for n in n_components]
plt.figure(figsize=(10, 6))
plt.plot(n_components, [m.bic(X) for m in models], label='BIC')
plt.plot(n_components, [m.aic(X) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('Number of components')
plt.ylabel('Score (BIC/AIC)')
plt.title('BIC and AIC for GMM')
plt.show()

The plot will show a "knee" or an "elbow." In this case, both BIC and AIC are minimized at n_components=3, which is the correct number of clusters we generated.
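If you want to pick the component count programmatically rather than by eye, you can simply take the argmin of the BIC scores. A short self-contained sketch (it regenerates the same synthetic data used above):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1500, centers=3,
                  cluster_std=[1.0, 2.5, 0.5], random_state=170)

n_range = np.arange(1, 11)
# BIC for each candidate number of components (lower is better)
bics = [GaussianMixture(n_components=n, random_state=0).fit(X).bic(X)
        for n in n_range]
best_n = int(n_range[np.argmin(bics)])
print("Best n_components by BIC:", best_n)
```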


Full Code Example

Here is the complete, runnable script from start to finish.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
# --- 1. Generate Data ---
n_samples = 1500
random_state = 170
X, y_true = make_blobs(
    n_samples=n_samples,
    centers=3,
    cluster_std=[1.0, 2.5, 0.5],
    random_state=random_state
)
# --- 2. Fit GMM ---
# We choose 3 components based on our knowledge of the data
gmm = GaussianMixture(n_components=3, random_state=random_state)
gmm.fit(X)
# --- 3. Predict and Visualize ---
y_gmm_pred = gmm.predict(X)
# Plot GMM results
plt.figure(figsize=(12, 5))
# Plot 1: GMM Predicted Clusters
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_gmm_pred, s=40, cmap='viridis')
plt.title("GMM Predicted Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
# Plot 2: GMM with Covariance Ellipses
plt.subplot(1, 2, 2)
# Function to draw ellipses
from matplotlib.patches import Ellipse

def plot_gmm(gmm, X, label=True):
    ax = plt.gca()
    if label:
        labels = gmm.fit(X).predict(X)
        ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
    else:
        ax.scatter(X[:, 0], X[:, 1], s=40, zorder=2)
    ax.axis('equal')
    for pos, covar in zip(gmm.means_, gmm.covariances_):
        # Eigendecomposition gives the ellipse axes and orientation
        v, w = np.linalg.eigh(covar)
        u = w[0] / np.linalg.norm(w[0])
        angle = np.degrees(np.arctan2(u[1], u[0]))
        width, height = 2.0 * np.sqrt(2.0) * np.sqrt(v)
        # Draw nested contours of the Gaussian component
        for scale in np.linspace(0.5, 2.0, 5):
            ell = Ellipse(pos, scale * width, scale * height,
                          angle=angle, facecolor='none',
                          edgecolor='black', linewidth=0.5, zorder=1)
            ax.add_patch(ell)

plot_gmm(gmm, X)
plt.title("GMM with Covariance Ellipses")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.tight_layout()
plt.show()
# --- 4. Use BIC/AIC to Find Optimal Number of Components ---
n_components = np.arange(1, 11)
models = [GaussianMixture(n_components=n, covariance_type='full', random_state=0).fit(X)
          for n in n_components]
plt.figure(figsize=(8, 5))
plt.plot(n_components, [m.bic(X) for m in models], label='BIC')
plt.plot(n_components, [m.aic(X) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('Number of components')
plt.ylabel('Score (BIC/AIC)')
plt.title('BIC and AIC for GMM')
plt.show()

Key Parameters of GaussianMixture

  • n_components: The number of mixture components.
  • covariance_type: The type of covariance parameters to use.
    • 'full' (default): Each component has its own general covariance matrix. Most flexible but requires the most parameters.
    • 'tied': All components share the same general covariance matrix.
    • 'diag': Each component has its own diagonal covariance matrix (features are independent).
    • 'spherical': Each component has its own single variance (features are independent and have the same variance).
  • init_params: Method for initializing the weights, means, and covariances. 'kmeans' is a robust default.
  • max_iter: Maximum number of EM iterations to perform.
  • random_state: For reproducibility of the results.
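The effect of covariance_type on model complexity shows up directly in the shape of the fitted covariances_ array. A quick sketch (the data and component count are arbitrary, chosen just to make the shapes concrete):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # 2 features

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    # 'full': (3, 2, 2), 'tied': (2, 2), 'diag': (3, 2), 'spherical': (3,)
    shapes[cov_type] = gmm.covariances_.shape
    print(f"{cov_type:9s} covariances_.shape = {gmm.covariances_.shape}")
```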