Issue
I have the following problem: I have one target vector y, and two sets of features X1 and X2 (both contain about 10 features each). I want to write a sklearn pipeline that achieves the following: First, it trains a regressor (regressor1) to predict y given X1. Then, it takes the residuals from that regression and trains a second regressor (regressor2) that learns to predict the residuals from the first regression with features X2. For what it's worth: regressor 1 and regressor 2 are two different methods (e.g. a linear regression and a random forest).
Solution
Try the custom estimator below, SequentialResidualRegressor. It takes in a list of regressors, and in .fit() it fits them sequentially. The first regressor is fit using the input X1 and the target y. For each subsequent step (i.e. for each subsequent regressor), the input is the data for that step (e.g. X2) and the target is the error (residual) of the previous step. This is repeated for however many regressors are fed into SequentialResidualRegressor. The concepts here are very similar to gradient boosting.
The complete code for this example is at the end. Briefly, usage is like this:
#Create the base regressors, and feed them into the main estimator
regressors = [LinearRegression(), RandomForestRegressor()]
sequential_regressor = SequentialResidualRegressor(regressors)
#Put your X1 and X2 into a list
X = [X1, X2]
#Fit estimator
sequential_regressor.fit(X, y)
#Compute a prediction
y_hat = sequential_regressor.predict(X)
Input data, features, and the prediction:
You can access the data that was used to fit each regressor in the fit_data_ attribute. The residual of one step becomes the target of the next step, and the right column of the per-step plots (produced by the example at the end) shows how the prediction improves at each step (i.e. with each subsequent regressor in the sequence).
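For example, once the usage snippet above has been run, you can verify both of those facts directly (a quick check, not part of the class itself, assuming sequential_regressor was fit on X = [X1, X2] as above):
import numpy as np
fit_data = sequential_regressor.fit_data_
#The residual of step 1 is the target of step 2
print(np.allclose(fit_data['residuals'][0], fit_data['targets'][1]))   #True
#The prediction is the sum of the per-step predictions made during fitting
y_hat = sequential_regressor.predict(X)
print(np.allclose(y_hat, np.sum(fit_data['predictions'], axis=0)))     #True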
The class:
from sklearn.base import clone, BaseEstimator, MetaEstimatorMixin
import numpy as np

class SequentialResidualRegressor(BaseEstimator, MetaEstimatorMixin):
    def __init__(self, estimators):
        self.estimators = estimators

    def fit(self, X, y=None):
        #Record the input feature names & how many in total
        if hasattr(X[0], 'columns'):
            self.feature_names_in_ = np.array([X_step.columns for X_step in X]).ravel()
            self.n_features_in_ = sum([X_step.shape[1] for X_step in X])
        self.estimators_ = [clone(estimator) for estimator in self.estimators]
        #
        #Fit estimators sequentially
        #
        #Keep a record of the fit data for each step, for user's reference
        fit_data = {
            'targets': [],
            'predictions': [],
            'residuals': []
        }
        #Fit first estimator on y, and subsequent estimators on residuals
        target = y
        for estimator, X_step in zip(self.estimators_, X):
            print(f'[SequentialResidualRegressor] fitting: {estimator.__repr__()}')
            estimator.fit(X_step, target)
            #Get residuals in order to compose next target
            prediction = estimator.predict(X_step)
            residual = target - prediction
            #Record data used for fitting at this step, for user's reference
            for name, data in zip(fit_data.keys(), [target, prediction, residual]):
                fit_data[name].append(data)
            #The next target
            target = residual
        self.fit_data_ = fit_data
        return self

    def predict(self, X):
        preds = [estimator.predict(X_step).reshape(-1, 1)
                 for estimator, X_step
                 in zip(self.estimators_, X)]
        #Predicted y is the sum over each estimator's prediction
        y_hat = np.concatenate(preds, axis=1).sum(axis=1)
        return y_hat

    def get_feature_names_out(self, features_in=None):
        return ['predicted_value']
Usage example with synthetic data & plots:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
#
#Synthetic data
#
np.random.seed(0)
t = np.linspace(0, 2 * np.pi)
y = 2 * np.sin(t) ** 5 + t/2 + 0.05 #* np.random.randn(len(t))
X1 = pd.DataFrame({'feat0': y**0.5 + 0.25 * np.random.randn(len(y)),
'feat1': t/3 + 0.1 * np.random.randn(len(y))
})
X2 = np.log(2 + X1)
#View the target y (black line) and the features (coloured)
plt.plot(t, y, 'k', linewidth=5)
plt.plot(t, X1.feat0, t, X1.feat1, t, X2.feat0, t, X2.feat1)
plt.title('Target (black line) and features')
#
#Create the estimators
#
regressors = [LinearRegression(), RandomForestRegressor()]
sequential_regressor = SequentialResidualRegressor(regressors)
#Put X1 and X2 into a list
X = [X1, X2]
#Fit estimator and plot prediction
sequential_regressor.fit(X, y)
y_hat = sequential_regressor.predict(X)
#
#Plot the fit for each regressor
#
#Plot final prediction
plt.plot(t, y_hat, color='gold', linewidth=3, linestyle='--')
plt.title('Target (black line), prediction (dashed), and features')
#Plot fits for each regressor
fit_data = sequential_regressor.fit_data_
f, axs = plt.subplots(len(regressors), 2, figsize=(9, 3 * len(regressors)), sharex=True, layout='tight')
axs = axs.flatten()
axs_left = axs[::2] #left column of axes
axs_right = axs[1::2] #right column of axes
for step, (regressor, ax) in enumerate(zip(regressors, axs_left)):
    target = fit_data['targets'][step]
    prediction = fit_data['predictions'][step]
    residual = fit_data['residuals'][step]
    ax.plot(target, label='target')
    ax.plot(prediction, label='prediction')
    ax_right = ax.twinx()
    ax_right.plot(residual, 'k:', linewidth=1, label='residual')
    ax.set_ylabel('target, prediction')
    ax_right.set_ylabel('residual')
    ax.set_title(f'Fit data | step {step + 1}: {regressor.__repr__()}')
    if ax is axs[0]:
        f.legend(loc='lower left')
    #
    #Right column: cumulative prediction up to this step
    #
    #Per-step predictions are 1D arrays, so sum them element-wise
    axs_right[step].plot(y, label='y')
    axs_right[step].plot(np.sum(fit_data['predictions'][:step + 1], axis=0),
                         label=r'$\hat{y}$' + f' (step {step + 1})')
    axs_right[step].set_ylabel(r'y and $\hat{y}$')
    axs_right[step].legend(loc='lower left')
axs_right[0].set_title(r'Improvement in $\hat{y}$ at each step')
axs_left[-1].set_xlabel('sample index')
This code is untested apart from the quick sketch above, so I'd recommend checking the results carefully if you use it.
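One way to check it is to compare against fitting the two regressors by hand. Here is a rough sketch, reusing the X1, X2, and y from the synthetic example above and fixing random_state so the random forest is reproducible in both fits:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

#Sequential version (random_state fixed so both fits are comparable)
seq = SequentialResidualRegressor([LinearRegression(),
                                   RandomForestRegressor(random_state=0)])
seq.fit([X1, X2], y)
y_hat_seq = seq.predict([X1, X2])

#Manual two-step version of the same procedure
reg1 = LinearRegression().fit(X1, y)
residual = y - reg1.predict(X1)
reg2 = RandomForestRegressor(random_state=0).fit(X2, residual)
y_hat_manual = reg1.predict(X1) + reg2.predict(X2)

print(np.allclose(y_hat_seq, y_hat_manual))   #expect True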
Here is a similar algorithm, but where the models are optimised over jointly (using PyTorch) rather than each model being fit individually: https://stats.stackexchange.com/a/624395/394904
Answered By - some3128