Issue
I have the following problem: I have one target vector y, and two sets of features X1 and X2 (both contain about 10 features each). I want to write a sklearn pipeline that achieves the following: First, it trains a regressor (regressor1) to predict y given X1. Then, it takes the residuals from that regression and trains a second regressor (regressor2) that learns to predict the residuals from the first regression with features X2. For what it's worth: regressor 1 and regressor 2 are two different methods (e.g. a linear regression and a random forest).
Solution
Try the custom estimator below, SequentialResidualRegressor. It takes in a list of regressors, and in .fit() it fits them sequentially. The first regressor is fit using the input X1 and the target y. For each subsequent step (i.e. for each subsequent regressor), the input is the data for that step (e.g. X2) and the target is the error (residual) of the previous step. This is repeated for however many regressors are fed into SequentialResidualRegressor. The concepts here are very similar to gradient boosting.
The complete code for this example is at the end. Briefly, usage is like this:
#Create the base regressors, and feed them into the main estimator
regressors = [LinearRegression(), RandomForestRegressor()]
sequential_regressor = SequentialResidualRegressor(regressors)
#Put your X1 and X2 into a list
X = [X1, X2]
#Fit estimator
sequential_regressor.fit(X, y)
#Compute a prediction
y_hat = sequential_regressor.predict(X)
Input data, features, and the prediction:
You can access the data that was used to fit each regressor in the fit_data_ attribute. The residual of one step becomes the target of the next step, and the right column of the per-step plots (produced by the example at the end) shows how the prediction improves at each step (i.e. with each subsequent regressor in the sequence).
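For example, once the usage snippet above has been run, you can verify both of those facts directly (a quick check, not part of the class itself, assuming sequential_regressor was fit on X = [X1, X2] as above):
import numpy as np
fit_data = sequential_regressor.fit_data_
#The residual of step 1 is the target of step 2
print(np.allclose(fit_data['residuals'][0], fit_data['targets'][1]))   #True
#The prediction is the sum of the per-step predictions made during fitting
y_hat = sequential_regressor.predict(X)
print(np.allclose(y_hat, np.sum(fit_data['predictions'], axis=0)))     #True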
The class:
from sklearn.base import clone, BaseEstimator, MetaEstimatorMixin
import numpy as np

class SequentialResidualRegressor(BaseEstimator, MetaEstimatorMixin):
    def __init__(self, estimators):
        self.estimators = estimators

    def fit(self, X, y=None):
        #Record the input feature names & how many in total
        if hasattr(X[0], 'columns'):
            self.feature_names_in_ = np.array([X_step.columns for X_step in X]).ravel()
            self.n_features_in_ = sum([X_step.shape[1] for X_step in X])
        self.estimators_ = [clone(estimator) for estimator in self.estimators]
        #
        #Fit estimators sequentially
        #
        #Keep a record of the fit data for each step, for user's reference
        fit_data = {
            'targets': [],
            'predictions': [],
            'residuals': []
        }
        #Fit first estimator on y, and subsequent estimators on residuals
        target = y
        for estimator, X_step in zip(self.estimators_, X):
            print(f'[SequentialResidualRegressor] fitting: {estimator.__repr__()}')
            estimator.fit(X_step, target)
            #Get residuals in order to compose next target
            prediction = estimator.predict(X_step)
            residual = target - prediction
            #Record data used for fitting at this step, for user's reference
            for name, data in zip(fit_data.keys(), [target, prediction, residual]):
                fit_data[name].append(data)
            #The next target
            target = residual
        self.fit_data_ = fit_data
        return self

    def predict(self, X):
        preds = [estimator.predict(X_step).reshape(-1, 1)
                 for estimator, X_step
                 in zip(self.estimators_, X)]
        #Predicted y is the sum over each estimator's prediction
        y_hat = np.concatenate(preds, axis=1).sum(axis=1)
        return y_hat

    def get_feature_names_out(self, features_in=None):
        return ['predicted_value']
Usage example with synthetic data & plots:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
#
#Synthetic data
#
np.random.seed(0)
t = np.linspace(0, 2 * np.pi)
y = 2 * np.sin(t) ** 5 + t/2 + 0.05 #* np.random.randn(len(t))
X1 = pd.DataFrame({'feat0': y**0.5 + 0.25 * np.random.randn(len(y)),
'feat1': t/3 + 0.1 * np.random.randn(len(y))
})
X2 = np.log(2 + X1)
#View the target y (black line) and the features (coloured)
plt.plot(t, y, 'k', linewidth=5)
plt.plot(t, X1.feat0, t, X1.feat1, t, X2.feat0, t, X2.feat1)
plt.title('Target (black line) and features')
#
#Create the estimators
#
regressors = [LinearRegression(), RandomForestRegressor()]
sequential_regressor = SequentialResidualRegressor(regressors)
#Put X1 and X2 into a list
X = [X1, X2]
#Fit estimator and plot prediction
sequential_regressor.fit(X, y)
y_hat = sequential_regressor.predict(X)
#
#Plot the fit for each regressor
#
#Plot final prediction
plt.plot(t, y_hat, color='gold', linewidth=3, linestyle='--')
plt.title('Target (black line), prediction (dashed), and features')
#Plot fits for each regressor
fit_data = sequential_regressor.fit_data_
f, axs = plt.subplots(len(regressors), 2, figsize=(9, 3 * len(regressors)), sharex=True, layout='tight')
axs = axs.flatten()
axs_left = axs[::2] #left column of axes
axs_right = axs[1::2] #right column of axes
for step, (regressor, ax) in enumerate(zip(regressors, axs_left)):
    target = fit_data['targets'][step]
    prediction = fit_data['predictions'][step]
    residual = fit_data['residuals'][step]
    ax.plot(target, label='target')
    ax.plot(prediction, label='prediction')
    ax_right = ax.twinx()
    ax_right.plot(residual, 'k:', linewidth=1, label='residual')
    ax.set_ylabel('target, prediction')
    ax_right.set_ylabel('residual')
    ax.set_title(f'Fit data | step {step + 1}: {regressor.__repr__()}')
    if ax is axs[0]:
        f.legend(loc='lower left')
    #
    #Right column: cumulative prediction up to this step
    #
    #Per-step predictions are 1D arrays, so sum them element-wise
    axs_right[step].plot(y, label='y')
    axs_right[step].plot(np.sum(fit_data['predictions'][:step + 1], axis=0),
                         label=r'$\hat{y}$' + f' (step {step + 1})')
    axs_right[step].set_ylabel(r'y and $\hat{y}$')
    axs_right[step].legend(loc='lower left')
axs_right[0].set_title(r'Improvement in $\hat{y}$ at each step')
axs_left[-1].set_xlabel('sample index')
This code is untested apart from the quick sketch above, so I'd recommend checking the results carefully if you use it.
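One way to check it is to compare against fitting the two regressors by hand. Here is a rough sketch, reusing the X1, X2, and y from the synthetic example above and fixing random_state so the random forest is reproducible in both fits:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

#Sequential version (random_state fixed so both fits are comparable)
seq = SequentialResidualRegressor([LinearRegression(),
                                   RandomForestRegressor(random_state=0)])
seq.fit([X1, X2], y)
y_hat_seq = seq.predict([X1, X2])

#Manual two-step version of the same procedure
reg1 = LinearRegression().fit(X1, y)
residual = y - reg1.predict(X1)
reg2 = RandomForestRegressor(random_state=0).fit(X2, residual)
y_hat_manual = reg1.predict(X1) + reg2.predict(X2)

print(np.allclose(y_hat_seq, y_hat_manual))   #expect True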
Here is a similar algorithm, but where the models are optimised over jointly (using PyTorch) rather than each model being fit individually: https://stats.stackexchange.com/a/624395/394904
Answered By - some3128