Of course! Performing Ordinary Least Squares (OLS) regression with robust standard errors is a common and important task in statistics and econometrics. It allows you to get valid statistical inferences (p-values, confidence intervals) even when your model's errors are heteroskedastic (i.e., the variance of the errors is not constant).
Here’s a comprehensive guide covering the "why," "how," and practical examples in Python using the most popular libraries.
Why Use Robust Standard Errors?
The classical OLS model assumes:
- Linearity: The relationship is linear.
- Independence: Errors are independent.
- Homoskedasticity: The variance of the errors is constant across all levels of the independent variables.
- No Autocorrelation: Errors are not correlated with each other (especially important in time series).
- Exogeneity: Independent variables are uncorrelated with the error term.
When homoskedasticity (assumption #3) is violated (a common problem called heteroskedasticity), the OLS coefficient estimates themselves are still unbiased and consistent. However, the standard formulas for calculating their standard errors become biased.
This leads to two major problems:
- Inefficient Estimates: OLS is no longer the "Best" Linear Unbiased Estimator (BLUE).
- Invalid Inference: Your t-statistics, F-statistics, p-values, and confidence intervals are unreliable. You might conclude a variable is significant when it's not, or vice-versa.
Robust standard errors (also called Huber-White standard errors or "sandwich" estimators) solve the inference problem. They provide a consistent estimate of the standard errors even in the presence of heteroskedasticity.
How to Do It in Python
We'll look at three primary methods:
- `statsmodels` (recommended): the most straightforward and statistically complete method.
- `linearmodels`: excellent, especially for panel data, and also very easy for cross-sectional data.
- Manual calculation: useful for understanding the underlying mechanics.
Setup and Data
First, let's set up our environment and create some sample data that intentionally has heteroskedasticity. This way, we can see the difference between standard and robust errors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from linearmodels import OLS as linearmodels_OLS
import matplotlib.pyplot as plt
import seaborn as sns
# Set a seed for reproducibility
np.random.seed(42)
# Create a synthetic dataset with heteroskedasticity
n = 500
X = np.random.normal(0, 1, n)
# The error term's variance is related to X. This creates heteroskedasticity.
# As X increases, the variance of the error increases.
errors = np.random.normal(0, np.exp(X/2), n)
# True relationship: y = 2.5 + 1.5*X + error
y = 2.5 + 1.5 * X + errors
# Create a pandas DataFrame
df = pd.DataFrame({'y': y, 'X': X})
df = sm.add_constant(df) # Add a constant (intercept) term
Let's visualize the heteroskedasticity to confirm it's there.
sns.scatterplot(x='X', y='y', data=df, alpha=0.6)
plt.title('Scatter Plot Showing Heteroskedasticity')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
You'll see a pattern where the spread of the data points increases as X gets larger.
Method 1: statsmodels (The Standard Approach)
statsmodels makes this incredibly easy. You can fit a standard OLS model and then call the .get_robustcov_results() method, or equivalently pass cov_type directly to .fit().
A. Standard OLS (for comparison)
First, let's run the standard OLS model and see its results.
# Standard OLS Model
model_ols = sm.OLS(df['y'], df[['const', 'X']])
results_ols = model_ols.fit()
print("--- Standard OLS Results ---")
print(results_ols.summary())
Key things to note in the output:
- The coefficient for `X` is close to the true value of 1.5.
- Look at the `std err` for `X`. Let's say it's 0.089.
- The `Omnibus` and `Jarque-Bera` tests, as well as the `Durbin-Watson` statistic, might hint at problems, but the main issue is hidden.
B. OLS with Robust Standard Errors
Now, let's get the robust version of the exact same model.
# Get robust standard errors.
# HC3 is generally recommended for cross-sectional data.
results_robust = results_ols.get_robustcov_results(cov_type='HC3')
print("\n--- OLS with HC3 Robust Standard Errors ---")
print(results_robust.summary())
Key differences in the output:
- The coefficients (`const`, `X`) are exactly the same. This is expected; robust errors don't change the point estimates.
- The standard errors (`std err`) have changed. Let's say the error for `X` is now 0.112.
- The t-values, p-values (`P>|t|`), and confidence intervals have all been updated based on the new standard errors.
- The model's R-squared and F-statistic remain the same, as they are not affected by the error structure.
Choosing the cov_type:
- `'HC0'`: the original Huber (1967) / White (1980) estimator, with no small-sample correction.
- `'HC1'`: a simple degrees-of-freedom correction of HC0 (the default for `get_robustcov_results`, and what some other software reports by default).
- `'HC2'`: a leverage-based correction that is often better than HC1.
- `'HC3'`: a "jackknife-like" estimator that performs very well in many situations, especially with influential observations. For most cross-sectional cases, HC3 is a great choice.
Method 2: linearmodels (A Powerful Alternative)
The linearmodels library is fantastic, particularly for panel data, but its syntax for simple OLS is very clean.
# linearmodels requires a dependent variable and a DataFrame of exogenous variables.
# Note: linearmodels does NOT add a constant automatically, so we include
# the 'const' column created earlier.
mod = linearmodels_OLS(df['y'], df[['const', 'X']])
results_linearmodels = mod.fit(cov_type='robust')
print("--- linearmodels OLS with Robust Standard Errors ---")
print(results_linearmodels)
This is even more direct: you specify the cov_type directly in the .fit() method. The 'robust' option gives a heteroskedasticity-consistent estimator comparable to HC0/HC1.
Method 3: Manual Calculation (For Understanding)
The robust variance estimator is often called the "sandwich" estimator because its formula looks like a sandwich: (B⁻¹) * M * (B⁻¹), where B is the "bread" and M is the "meat".
- Bread (`B`): the cross-product of the regressors, `X'X`.
- Meat (`M`): a residual-weighted sum of outer products of the regressors, `Σ(eᵢ² · xᵢxᵢ')`.
Let's implement this manually.
# Manual calculation of robust standard errors
# X is our design matrix (including constant)
X = df[['const', 'X']].values
# y is our dependent variable
y = df['y'].values
# 1. Get OLS coefficients
beta_manual = np.linalg.inv(X.T @ X) @ X.T @ y
# 2. Calculate residuals
residuals = y - X @ beta_manual
# 3. Calculate the "meat" of the sandwich (M)
# Create a diagonal matrix of squared residuals
n = len(y)
meat = np.zeros((X.shape[1], X.shape[1]))
for i in range(n):
x_i = X[i, :] # ith row of X
meat += residuals[i]**2 * np.outer(x_i, x_i)
# 4. Calculate the "bread" of the sandwich (B)
bread = X.T @ X
# 5. Calculate the robust covariance matrix
cov_robust_manual = np.linalg.inv(bread) @ meat @ np.linalg.inv(bread)
# 6. Calculate robust standard errors (square root of the diagonal)
robust_se_manual = np.sqrt(np.diag(cov_robust_manual))
print("\n--- Manual Robust Standard Errors ---")
print(f"Robust SE for const: {robust_se_manual[0]:.4f}")
print(f"Robust SE for X: {robust_se_manual[1]:.4f}")
# Compare with statsmodels
print("\n--- Comparison with statsmodels (HC0) ---")
print(f"statsmodels HC0 SE for const: {results_ols.HC0_se[0]:.4f}")
print(f"statsmodels HC0 SE for X: {results_ols.HC0_se[1]:.4f}")
This manual calculation should give you results very close to statsmodels's HC0 estimator. The small differences are due to floating-point arithmetic. This exercise demonstrates that robust standard errors are just a different way of calculating the variance of your coefficient estimates.
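In practice you would not loop over observations: the "meat" can be computed in a single vectorized expression by weighting each row of X by its squared residual. A self-contained sketch of the loop-free equivalent:

```python
import numpy as np

# Same synthetic heteroskedastic data as in the main example
np.random.seed(42)
n = 500
x = np.random.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])      # design matrix with intercept
y = 2.5 + 1.5 * x + np.random.normal(0, np.exp(x / 2), n)

beta = np.linalg.solve(X.T @ X, X.T @ y)  # solve() is more stable than inv()
residuals = y - X @ beta

# Loop-free meat: scale each row of X by its squared residual, then multiply.
# Equivalent to sum_i e_i^2 * outer(x_i, x_i).
meat = (X * residuals[:, None] ** 2).T @ X

bread_inv = np.linalg.inv(X.T @ X)
robust_se = np.sqrt(np.diag(bread_inv @ meat @ bread_inv))
print(robust_se)
```

The broadcasting trick avoids both the Python-level loop and materializing the n-by-n diagonal matrix of squared residuals.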
Summary and Recommendation
| Method | Pros | Cons | Best For |
|---|---|---|---|
| `statsmodels` | Easy to use on an existing model; well-documented; offers multiple types (HC0–HC3). | Two-step process (fit, then get robust). | Most users. The standard, reliable choice for cross-sectional data. |
| `linearmodels` | Clean, one-step syntax; excellent for panel data (fixed/random effects); fast and modern. | Less commonly used for simple OLS than `statsmodels`. | Users working with panel data or those who prefer a more direct API. |
| Manual | Best for understanding the underlying math. | Error-prone; not efficient; reinvents the wheel. | Learning and educational purposes. |
Final Recommendation:
For almost all practical purposes in Python, use statsmodels.
1. Fit your model: `results = sm.OLS(y, X).fit()`
2. Get robust results: `results_robust = results.get_robustcov_results(cov_type='HC3')`
3. Print the summary: `print(results_robust.summary())`
This gives you the best of both worlds: the convenience of standard OLS and the statistical validity of robust inference when you suspect heteroskedasticity.
