杰瑞科技汇

How Do You Use the fit Method in sklearn with Python?

Let's break down how to "fit" a model in scikit-learn (often abbreviated as sklearn). Fitting is the training step at the core of almost every machine learning workflow.

The Core Concept: What Does fit() Do?

In machine learning, "fitting" a model means training it. For a supervised model, you give it data (both the features X and the target y) so that it can learn the underlying patterns or relationships in that data.

Think of it like this:

  • You (the teacher): Have a textbook (X and y).
  • The student (the model): Is a blank slate.
  • The fit() method: Is the student studying the textbook. They read the examples, understand the rules, and build a mental model of how to solve problems.

Once the student has "studied" (i.e., the model has been fit), you can give them a new, unseen problem and they can make a prediction.
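To make the "before and after studying" idea concrete, here is a minimal sketch. The three-point dataset is made up for illustration and is not part of the house price example. It shows that an unfitted LinearRegression refuses to predict, and that fit() returns the estimator itself.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset (an assumption for this sketch): y = 2 * x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

model = LinearRegression()

# Before fit(): the model has learned nothing, so predict() raises an error.
try:
    model.predict(X)
    predicted_before_fit = True
except NotFittedError:
    predicted_before_fit = False

# fit() learns from the data and returns the estimator itself.
returned = model.fit(X, y)

# After fit(): the model can predict for an input it has not seen.
prediction = model.predict(np.array([[4.0]]))

print("Could predict before fit:", predicted_before_fit)
print("fit() returned the same object:", returned is model)
print("Prediction for x=4:", prediction)
```

Because fit() returns the estimator, a one-liner such as LinearRegression().fit(X, y) is also a common way to create and train a model in a single step.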


The Standard Workflow in Scikit-Learn

Almost all machine learning tasks in sklearn follow this standard pattern. Let's walk through it with a simple example.

Step 1: Import Necessary Libraries

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

Step 2: Prepare Your Data

Your data needs to be split into two main parts:

  • Features (X): The input variables (also called predictors or independent variables). This is what you use to make a prediction.
  • Target (y): The output variable (also called the label or dependent variable). This is what you are trying to predict.

# Sample data: Let's predict a house price based on its size.
# X = feature (house size in square feet)
# y = target (house price in $1000s)
X = np.array([[1500], [1600], [1700], [1800], [1900], [2000], [2100], [2200]])
y = np.array([300, 320, 340, 360, 380, 400, 420, 440])
# It's crucial to split your data into training and testing sets.
# The model learns from the TRAINING set and is evaluated on the TESTING set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

  • train_test_split: Shuffles and splits the data. test_size=0.3 holds out 30% of the data for testing. With only 8 samples, the test set is rounded up to 3 samples, which leaves 5 for training.
  • random_state: Fixes the seed used for shuffling, so the split is the same every time you run the code and your results are reproducible.
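As a quick check of the two points above, the following sketch reuses the article's data, prints how many samples land in each set, and confirms that a second call with the same random_state gives the same split. The names ending in _again are introduced here only for the comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[1500], [1600], [1700], [1800], [1900], [2000], [2100], [2200]])
y = np.array([300, 320, 340, 360, 380, 400, 420, 440])

# First split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split with the same seed (hypothetical names, for comparison only)
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])
print("Same split on second call:", np.array_equal(X_test, X_test_again))
```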

Step 3: Choose and Instantiate Your Model

You need to choose the type of algorithm you want to use (e.g., Linear Regression, a Support Vector Machine, a Random Forest). Then, you create an instance of that model.

# Create an instance of the Linear Regression model
model = LinearRegression()

At this point, model is unfitted. It holds the configuration for linear regression (its hyperparameters) but hasn't seen any data yet, so it has no learned parameters.
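The difference between configuration and learned state can be inspected directly. This is a small sketch: get_params() lists the settings chosen at construction time, while learned attributes such as coef_ (scikit-learn marks them with a trailing underscore) do not exist until fit() has run.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Hyperparameters are set when the object is created.
params = model.get_params()
print("fit_intercept setting:", params["fit_intercept"])

# Learned attributes are only created by fit().
has_coef_before_fit = hasattr(model, "coef_")
print("Has coef_ before fit:", has_coef_before_fit)
```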

Step 4: Fit the Model (The Main Event!)

This is the step you asked about. You call the .fit() method on your model instance, passing it the training data.

# Train the model using the training data
model.fit(X_train, y_train)

What happens inside fit()? Scikit-learn takes X_train and y_train and performs the calculations specific to LinearRegression. In this case, it finds the best-fit line (the optimal slope and intercept) that minimizes the sum of squared errors between its predictions and the actual prices in y_train. The calculated parameters are then stored on the model object as model.coef_ (slope) and model.intercept_ (intercept).
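To see what "finding the best-fit line" amounts to, the sketch below solves the same ordinary least-squares problem by hand with NumPy and compares the result to LinearRegression. It illustrates the math only and is not a claim about the exact routine scikit-learn runs internally. For simplicity it fits on the full sample data rather than on a training split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1500], [1600], [1700], [1800], [1900], [2000], [2100], [2200]], dtype=float)
y = np.array([300, 320, 340, 360, 380, 400, 420, 440], dtype=float)

# What scikit-learn learns
model = LinearRegression().fit(X, y)

# Least squares by hand: add a column of ones so the solver also finds an intercept
A = np.hstack([X, np.ones((X.shape[0], 1))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = solution

print("sklearn slope / intercept:", model.coef_[0], model.intercept_)
print("NumPy slope / intercept:  ", slope, intercept)
```

Both approaches should agree up to floating-point noise, since the sample prices lie exactly on the line price = 0.2 * size.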

Step 5: Make Predictions

Now that the model is trained, you can use it to predict outcomes for new, unseen data (from your X_test set).

# Make predictions on the test data
y_pred = model.predict(X_test)
# Compare the predictions (y_pred) with the actual values (y_test)
print("Actual Prices:", y_test)
print("Predicted Prices:", y_pred)

Step 6: Evaluate the Model

How good is your model? You compare its predictions (y_pred) with the actual values (y_test).

# Calculate the model's performance
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, y_pred))
print('R-squared Score:', metrics.r2_score(y_test, y_pred))
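For readers who want to know what these two numbers measure, here is a minimal sketch that computes them by hand and checks them against sklearn.metrics. The values in y_true and y_hat are made up for illustration and are not outputs of the house price model.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values (assumption for this sketch)
y_true = np.array([300.0, 340.0, 400.0])
y_hat = np.array([310.0, 330.0, 405.0])

# MAE: the average size of the prediction errors
mae_manual = np.mean(np.abs(y_true - y_hat))

# R-squared: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE (manual vs sklearn):", mae_manual, metrics.mean_absolute_error(y_true, y_hat))
print("R2  (manual vs sklearn):", r2_manual, metrics.r2_score(y_true, y_hat))
```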

Putting It All Together: A Complete, Runnable Example

Here is the full code for the house price example.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# 1. Prepare Data
# X = feature (house size in square feet)
# y = target (house price in $1000s)
X = np.array([[1500], [1600], [1700], [1800], [1900], [2000], [2100], [2200]])
y = np.array([300, 320, 340, 360, 380, 400, 420, 440])
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training data shapes:")
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("-" * 20)
# 2. Instantiate the Model
# Create an instance of the Linear Regression model
model = LinearRegression()
# 3. Fit the Model (The Training Step)
# This is where the model learns from the data
model.fit(X_train, y_train)
print("Model has been fitted!")
print("Learned Coefficient (slope):", model.coef_)
print("Learned Intercept:", model.intercept_)
print("-" * 20)
# 4. Make Predictions
# Use the trained model to make predictions on the test data
y_pred = model.predict(X_test)
print("Predictions on test data:")
print("Actual Prices:", y_test)
print("Predicted Prices:", y_pred)
print("-" * 20)
# 5. Evaluate the Model
print("Model Evaluation:")
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, y_pred))
print('R-squared Score:', metrics.r2_score(y_test, y_pred))
print("-" * 20)
# 6. Visualize the Results (Optional but helpful)
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', linewidth=3, label='Regression Line')
plt.title('House Price vs. Size')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price ($1000s)')
plt.legend()
plt.grid(True)
plt.show()

Key Parameters of fit()

While fit(X, y) is the most common call, some models have additional parameters:

  • sample_weight: Allows you to assign different importance to different samples. For example, if you have more confidence in some data points than others, you can give them a higher weight.
    # Assign higher weight to the first training sample.
    # sample_weight needs one entry per training sample (5 after the split above).
    sample_weights = [10, 1, 1, 1, 1]
    model.fit(X_train, y_train, sample_weight=sample_weights)
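The effect of sample_weight is easiest to see with an outlier. In this sketch the data and the weight values are invented for illustration: the first three points follow y = 2x and the last one does not. Giving that last point a very small weight pulls the fitted slope back toward 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: the first three points follow y = 2x, the last is an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 20.0])

# Every sample counts equally
unweighted = LinearRegression().fit(X, y)

# One weight per sample; the outlier is trusted far less than the rest
weights = np.array([1.0, 1.0, 1.0, 0.01])
weighted = LinearRegression().fit(X, y, sample_weight=weights)

print("Slope without weights:", unweighted.coef_[0])
print("Slope with weights:   ", weighted.coef_[0])
```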

Common Pitfalls & Best Practices

  1. Data Leakage: Never fit your model on your entire dataset and then evaluate it on the same data, because the score will look better than the model's real performance on new data. This is like giving a student an exam with the exact same questions they studied. The train_test_split step is critical to avoid this.

  2. Data Shape: Ensure your X is a 2D array (a matrix of samples and features) and your y is a 1D array (a vector of targets). This is why we use [[1500], [1600], ...] instead of [1500, 1600, ...] for X.

  3. Preprocessing: For many models (like SVMs or Neural Networks), you usually need to scale your features first (e.g., using StandardScaler or MinMaxScaler). Fit the scaler only on the training data, then use that same fitted scaler to transform both the training and testing data.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train) # Fit on training data ONLY
    X_test_scaled = scaler.transform(X_test)       # Transform test data with the SAME scaler
    model.fit(X_train_scaled, y_train)
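One way to make the "fit the scaler on training data only" rule harder to break is to bundle the scaler and the model into a Pipeline. The sketch below assumes the article's house price data and uses make_pipeline. Calling fit() on the pipeline fits the scaler on X_train and then the model on the scaled result, and predict() reuses the already-fitted scaler on X_test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1500], [1600], [1700], [1800], [1900], [2000], [2100], [2200]], dtype=float)
y = np.array([300, 320, 340, 360, 380, 400, 420, 440], dtype=float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaler and model chained together; one fit() call trains both in order
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# predict() scales X_test with the statistics learned from X_train only
y_pred = pipeline.predict(X_test)

# The scaler's learned mean comes from the training data, not the full dataset
scaler_mean = pipeline.named_steps["standardscaler"].mean_
print("Scaler mean:", scaler_mean, "| Training mean:", X_train.mean(axis=0))
print("Predictions:", y_pred, "| Actual:", y_test)
```

Linear regression itself does not need scaling; it is used here only to keep the example consistent with the rest of the article.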