How Does Python's GaussianNB Implement Gaussian Naive Bayes? - 杰瑞科技汇

This guide explains how to use GaussianNB in Python: how the algorithm works, when to use it, and a step-by-step scikit-learn example.

What is Gaussian Naive Bayes?

GaussianNB is a classification algorithm based on Bayes' Theorem with the "naive" assumption that every pair of features is conditionally independent given the class label.

Key Concepts:

Bayes' Theorem: It calculates the probability of a label given some features. With the naive independence assumption applied to the likelihood, the formula is: $P(y | x_1, x_2, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i | y)}{P(x_1, x_2, ..., x_n)}$
- $P(y | x_1, ..., x_n)$: Posterior probability (what we want to find: the probability of a class y given the features).
- $P(y)$: Prior probability (the probability of a class y occurring, regardless of features).
- $P(x_i | y)$: Likelihood (the probability of a feature x_i occurring, given that the data point belongs to class y).
- $P(x_1, ..., x_n)$: Evidence (the probability of the features occurring, which is constant for all classes and can be ignored during comparison).
"Naive" Assumption: The algorithm assumes that all features are independent of each other given the class label. In reality, features are often correlated. This simplification makes the calculation much easier and works surprisingly well in many real-world scenarios.
"Gaussian" Part: This specific version of Naive Bayes assumes that the values of each feature for a given class are drawn from a Gaussian (Normal) distribution. This means it works best with continuous numerical data. For each feature and each class, the algorithm calculates:
- The mean ($\mu$)
- The variance ($\sigma^2$)
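
For reference, the per-feature likelihood under this assumption is the standard Gaussian density. It is written here with $\mu_{y,i}$ and $\sigma^2_{y,i}$ denoting the mean and variance of feature $i$ within class $y$; the subscript notation is an assumption added for illustration and does not come from the original text:

$$P(x_i | y) = \frac{1}{\sqrt{2\pi\sigma^2_{y,i}}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma^2_{y,i}}\right)$$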

When predicting the class for a new data point, it evaluates the Gaussian probability density function of each feature under each class, multiplies these likelihoods together (due to the independence assumption) along with the class prior $P(y)$, and picks the class with the highest result. In practice the products are computed as sums of logarithms to avoid numerical underflow.
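
The prediction rule described above can be sketched from scratch with NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's actual implementation: the helper name `log_joint` and the small synthetic dataset are hypothetical, and the sketch omits the small variance-smoothing term that scikit-learn adds, so agreement with the library is expected to be very close rather than guaranteed bit-for-bit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Small synthetic dataset (illustrative sizes, chosen for this sketch)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

classes = np.unique(y)
# Per-class prior, and per-class mean and variance of every feature
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])


def log_joint(X_new):
    """Return log P(y) + sum_i log P(x_i | y), shape (n_samples, n_classes)."""
    columns = []
    for k in range(len(classes)):
        # Log of the Gaussian density for each feature under class k
        log_pdf = -0.5 * (
            np.log(2 * np.pi * variances[k])
            + (X_new - means[k]) ** 2 / variances[k]
        )
        # Independence assumption: a product of densities is a sum of logs
        columns.append(np.log(priors[k]) + log_pdf.sum(axis=1))
    return np.column_stack(columns)


# Pick the class with the highest log joint probability
manual_pred = classes[np.argmax(log_joint(X), axis=1)]
sk_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sk_pred)
print("Fraction of predictions matching scikit-learn:", agreement)
```

The evidence term $P(x_1, ..., x_n)$ never appears in the code because it is the same for every class and does not affect which class wins the comparison.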

When to Use GaussianNB?

It's an excellent choice when:

You have a medium to large-sized dataset.
Your features are continuous numerical data.
You need a fast, simple, and interpretable model.
The "naive" independence assumption is reasonably valid for your problem.

Implementation with `scikit-learn`

scikit-learn provides a simple and efficient implementation of GaussianNB in the naive_bayes module.

Step 1: Import Necessary Libraries

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Create or Load Data

Let's create a synthetic dataset for a clear example. We'll use make_classification to generate a dataset with 2 classes and several informative features.

# Generate a synthetic dataset
# n_samples: number of data points
# n_features: number of features
# n_informative: number of features that are actually useful for classification
# n_redundant: number of features that are linear combinations of informative features
# n_classes: number of classes
# n_clusters_per_class: number of clusters for each class (not set below, so the default of 2 applies)
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    random_state=42
)
print("Shape of features (X):", X.shape)
print("Shape of labels (y):", y.shape)
print("\nFirst 5 rows of features:\n", X[:5])
print("\nFirst 5 labels:", y[:5])

Step 3: Split Data into Training and Testing Sets

This is a crucial step to evaluate the model's performance on unseen data.

# Split the data into training and testing sets
# test_size=0.2 means 20% of the data will be used for testing
# random_state ensures that the split is the same every time we run the code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Step 4: Initialize and Train the GaussianNB Model

This is where the magic happens. The fit() method estimates the prior probability of each class and calculates the mean and variance of each feature for each class from the training data.

# Initialize the Gaussian Naive Bayes model
gnb = GaussianNB()
# Train the model on the training data
gnb.fit(X_train, y_train)
print("\nModel training complete.")

Step 5: Make Predictions on the Test Set

Now, we use the trained model to predict the class labels for the test data.

# Make predictions on the test data
y_pred = gnb.predict(X_test)
print("\nFirst 10 predictions:", y_pred[:10])
print("First 10 actual labels:", y_test[:10])
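
Beyond hard labels, the model can also return per-class probabilities through `predict_proba`. The sketch below rebuilds the same dataset and split so that it runs on its own; the variable names `proba` and `from_proba` are illustrative. Treat the probabilities with some caution: Naive Bayes is generally considered a decent classifier but a poor probability estimator, so values close to 0 or 1 may be overconfident.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same synthetic data and split as in the walkthrough
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    n_redundant=2, n_classes=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gnb = GaussianNB().fit(X_train, y_train)

# One row per test sample, one column per class (ordered as in gnb.classes_)
proba = gnb.predict_proba(X_test)
print("Class order:", gnb.classes_)
print("Probabilities for the first 5 samples:\n", proba[:5].round(3))

# The predicted label is the class with the largest posterior probability
from_proba = gnb.classes_[np.argmax(proba, axis=1)]
match_rate = np.mean(from_proba == gnb.predict(X_test))
print("Share of argmax labels equal to predict():", match_rate)
```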

Step 6: Evaluate the Model's Performance

How well did our model do? Let's calculate some common metrics.

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")
# Display the confusion matrix
# A confusion matrix shows the number of correct and incorrect predictions for each class.
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
# For a more visual representation of the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
# Display a detailed classification report
# Report includes precision, recall, and F1-score for each class.
report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)

Step 7: Understand the Model's Learned Parameters

One of the great things about GaussianNB is its interpretability. You can inspect the parameters it learned during training.

# The model stores the class priors with shape (n_classes,) and the per-class
# feature means and variances with shape (n_classes, n_features).
print("\nClass Priors (P(y)): The probability of each class in the training set.")
print(gnb.class_prior_)
print("\nFeature Means (mu) for each class:")
print(gnb.theta_) # theta_ is the attribute for means
print("\nFeature Variances (sigma^2) for each class:")
print(gnb.var_) # var_ is the attribute for variances

Complete Runnable Code

Here is the full script from start to finish for easy copy-pasting.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# 1. Create a synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    random_state=42
)
# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Initialize and train the model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print("Model training complete.")
# 4. Make predictions
y_pred = gnb.predict(X_test)
# 5. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
# Visualize Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
# Classification Report
report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)
# 6. Inspect learned parameters
print("\n--- Learned Parameters ---")
print("Class Priors (P(y)):", gnb.class_prior_)
print("\nFeature Means (theta_):\n", gnb.theta_)
print("\nFeature Variances (var_):\n", gnb.var_)

Other Types of Naive Bayes in `scikit-learn`

GaussianNB is just one flavor. The choice depends on your data type:

| Model | When to Use | Assumption |
| --- | --- | --- |
| `GaussianNB` | Continuous numerical data. | Features follow a normal distribution within each class. |
| `MultinomialNB` | Discrete counts (e.g., word counts in text). | Features represent counts or frequencies. |
| `BernoulliNB` | Binary/Boolean features (0s and 1s). | Features are binary (e.g., presence/absence of a word). |
| `ComplementNB` | Imbalanced datasets. A variation of `MultinomialNB` that often performs better on them. | Features are counts. Each class's parameters are estimated from the samples of all the other classes (its complement). |
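
As a rough illustration of matching the variant to the data type, the sketch below fits three of these models on toy data generated to suit each one. Everything about the data (the random seed, the Poisson rates, the threshold used to binarize the counts) is an assumption made up for this example, and the scores are training accuracies meant only to show the API, not a benchmark.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)

# Continuous features: the class shifts the mean of each feature
X_cont = 2.0 * y[:, None] + rng.normal(size=(400, 3))

# Count features: each class favors a different column
rates = np.where(y[:, None] == 0, [5.0, 1.0, 1.0], [1.0, 1.0, 5.0])
X_counts = rng.poisson(lam=rates)

# Binary features: whether each count exceeds a (hypothetical) threshold
X_bin = (X_counts > 2).astype(int)

scores = {
    "GaussianNB on continuous data": GaussianNB().fit(X_cont, y).score(X_cont, y),
    "MultinomialNB on counts": MultinomialNB().fit(X_counts, y).score(X_counts, y),
    "BernoulliNB on binary data": BernoulliNB().fit(X_bin, y).score(X_bin, y),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```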
