How to Use XGBoost in Python: Worked Examples - 杰瑞科技汇

Installing XGBoost

Make sure the XGBoost library is installed. If it is not, install it with pip:

pip install xgboost

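To run the examples you also need scikit-learn, pandas, and numpy. The plotting calls additionally need matplotlib, and xgb.plot_tree needs the graphviz package (plus the Graphviz system binaries):

pip install scikit-learn pandas numpy matplotlib graphviz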

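Example 1: Binary Classification

This is the classic use case, for example deciding whether an email is spam or whether a customer will churn.

Steps:

Load the data: use the breast cancer dataset bundled with scikit-learn.
Split the data: divide the dataset into a training set and a test set.
Train the model: build and train an XGBoost classifier with the native API.
Predict: use the trained model to predict on the test set.
Evaluate: compute accuracy, precision, recall, and related metrics.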

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
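# 1. Load the data
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# 2. Split the data
# Split into training and test sets at an 80% / 20% ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Train the model
# This example uses XGBoost's native API rather than the XGBClassifier wrapper.
# The native API works on DMatrix, XGBoost's own data structure, which is optimized for memory use and training speed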
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
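# Set the model parameters
params = {
    'objective': 'binary:logistic',  # binary classification, outputs probabilities
    'max_depth': 6,                  # maximum tree depth
    'eta': 0.1,                      # learning rate
    'subsample': 0.8,                # fraction of rows sampled for each tree
    'colsample_bytree': 0.8,         # fraction of features sampled for each tree
    'eval_metric': 'logloss',        # evaluation metric
    'seed': 42                       # random seed, for reproducible results
}
# Train the model; num_boost_round is the number of boosting iterations (one tree per round here)
# verbose_eval=10 prints the evaluation result every 10 rounds
# The test set is passed to evals only to monitor the metric during training
num_rounds = 100
bst = xgb.train(params, dtrain, num_boost_round=num_rounds, evals=[(dtest, 'test')], verbose_eval=10)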
# 4. Predict
# The predictions are probabilities, so convert them to 0 or 1 with a 0.5 threshold
y_pred_proba = bst.predict(dtest)
y_pred = [1 if prob > 0.5 else 0 for prob in y_pred_proba]
# 5. Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel accuracy: {accuracy:.4f}")
# Print a more detailed classification report (precision, recall, F1 per class)
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))
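# Visualize one of the trees (plot_tree draws the first tree, index 0, by default)
# Requires matplotlib and graphviz
import matplotlib.pyplot as plt
# Set the figure size before drawing: changing rcParams after plot_tree
# has no effect on a figure that already exists
fig, ax = plt.subplots(figsize=(50, 10))
xgb.plot_tree(bst, ax=ax)
plt.show()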

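Example 2: Regression

XGBoost is equally good at regression problems, such as predicting house prices, sales figures, and other continuous values.

Steps:

Load the data: the Boston housing dataset was removed from scikit-learn (as of version 1.2) because of ethical concerns, so this example uses the California housing dataset instead. It is downloaded the first time it is fetched.
Split the data: divide the dataset into a training set and a test set.
Train the model: build and train an XGBoost regressor.
Predict and evaluate: predict on the test set and compute the mean squared error and related metrics.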

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# 1. Load the data
housing = fetch_california_housing()
X = housing.data
y = housing.target
# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Train the model
# Use the scikit-learn style API, which is simpler and more familiar
xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror',  # objective function for regression
    n_estimators=100,               # number of trees
    learning_rate=0.1,              # learning rate
    max_depth=6,                    # maximum tree depth
    subsample=0.8,                  # row sampling ratio
    colsample_bytree=0.8,           # column sampling ratio
    random_state=42
)
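# Fit the model
xgb_reg.fit(X_train, y_train)
# 4. Predict and evaluate
y_pred = xgb_reg.predict(X_test)
# Compute the mean squared error (MSE) and its square root (RMSE, in the target's units)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
# Compute the R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.4f}")
# Feature importances, sorted from most to least important
print("\nFeature importances:")
for importance, name in sorted(zip(xgb_reg.feature_importances_, housing.feature_names), reverse=True):
    print(f"{name}: {importance:.4f}")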
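# Visualize the feature importances
# matplotlib was never imported in this example, so import it here
import matplotlib.pyplot as plt
# Set the figure size before drawing so that it actually applies to this plot
fig, ax = plt.subplots(figsize=(10, 8))
xgb.plot_importance(xgb_reg, ax=ax)
plt.show()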

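Example 3: Using the scikit-learn API (More General)

XGBoost ships scikit-learn-compatible estimators such as XGBClassifier and XGBRegressor, so you can use them the same way you would use RandomForestClassifier or GradientBoostingClassifier. This makes XGBoost easy to plug into scikit-learn tools such as GridSearchCV and Pipeline.

This example shows how to use GridSearchCV to search for the best hyperparameters.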

import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# 1. Load the data
iris = load_iris()
X = iris.data
y = iris.target
# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Create the XGBoost model (scikit-learn API)
# use_label_encoder is deprecated in recent XGBoost releases and is not needed here
xgb_clf = xgb.XGBClassifier(eval_metric='mlogloss')
# 4. Define the parameter grid to search
# Note: parameter names follow the scikit-learn API
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0]
}
# 5. Create the GridSearchCV object
# cv=3 means 3-fold cross-validation
# n_jobs=-1 uses all available CPU cores
grid_search = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    verbose=1,
    n_jobs=-1
)
# 6. Run the grid search
grid_search.fit(X_train, y_train)
# 7. Print the best parameters and the best score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
# 8. Predict with the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
# 9. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest set accuracy: {accuracy:.4f}")

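Key Parameters

Much of XGBoost's flexibility comes from its many parameters. The most important ones are listed below:

General parameters

booster: defaults to gbtree (gradient-boosted trees); can also be gblinear (linear model) or dart (Dropouts meet Multiple Additive Regression Trees).
n_jobs: number of parallel threads (nthread in the native API). When it is not set, XGBoost uses all available threads; set it explicitly to limit the number of threads.

Learning task parameters

Classification (objective):
- binary:logistic: binary classification, outputs probabilities.
- multi:softmax: multiclass classification, outputs the class index. The native API also requires num_class.
- multi:softprob: multiclass classification, outputs a probability for every class. The native API also requires num_class.
Regression (objective):
- reg:squarederror: standard regression, minimizes the squared error.
- reg:logistic: logistic regression output, for targets that lie between 0 and 1.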

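Tree parameters

n_estimators: (num_boost_round in the native API) the number of trees, that is, boosting iterations. More trees make the model more complex and more prone to overfitting.
max_depth: maximum tree depth. Larger values make the model more complex.
learning_rate: (eta in the native API) the step size applied at each iteration. It is usually tuned together with n_estimators, because a smaller learning rate needs more trees.
subsample: fraction of samples used to train each tree. Values below 1.0 add randomness and help prevent overfitting.
colsample_bytree: fraction of features used to train each tree. Values below 1.0 add randomness and help prevent overfitting.
min_child_weight: minimum sum of instance weights (hessian) required in a child node. Larger values make the model more conservative.
gamma: (or min_split_loss) minimum loss reduction required to split a node. Larger values make the algorithm more conservative.

Hopefully these examples help you get started with XGBoost quickly. Adjust the data and the parameters to fit your own problem.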
