Issue
I want to use Pipeline and ColumnTransformer from the sklearn library to apply scaling to a numpy array. The scaler is applied to only some of the columns, and I want the output to have the same column order as the input.
Example:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])
column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), [0, 2])],
    remainder='passthrough')
X_scaled = column_trans.fit_transform(X)
The problem is that ColumnTransformer changes the order of the columns: the transformed columns come first and the passthrough columns are appended after them. How can I preserve the original column order?
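To make the reordering concrete, here is a minimal sketch that reuses the example above and checks where the untouched columns end up. The variable name `moved` is only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])

column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), [0, 2])],
    remainder='passthrough')
X_scaled = column_trans.fit_transform(X)

# Scaled columns (0 and 2) come first, passthrough columns (1 and 3) last,
# so original column 1 is now found at output position 2.
moved = np.array_equal(X_scaled[:, 2], X[:, 1])
print(moved)
```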
I am aware of this post, but it is for pandas DataFrames. For some reasons I cannot use a DataFrame and have to use a numpy array in my code.
Thanks.
Solution
Here is a solution that adds a transformer which applies the inverse column permutation after the column transform:
import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ReorderColumnTransformer(BaseEstimator, TransformerMixin):
    # Matches the trailing column number in names such as 'scaler__x0'
    index_pattern = re.compile(r'\d+$')

    def __init__(self, column_transformer):
        self.column_transformer = column_transformer

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Original index of the column found at each output position
        order_after_column_transform = [
            int(self.index_pattern.search(col).group())
            for col in self.column_transformer.get_feature_names_out()]
        # Invert the permutation: order_inverse[original index] = output position
        order_inverse = np.zeros(len(order_after_column_transform), dtype=int)
        order_inverse[order_after_column_transform] = np.arange(len(order_after_column_transform))
        return X[:, order_inverse]
It relies on parsing
column_trans.get_feature_names_out()
# = array(['scaler__x0', 'scaler__x2', 'remainder__x1', 'remainder__x3'],
# dtype=object)
to read the initial column order from the number at the end of each name. It then computes and applies the inverse permutation. This only works when every transformer keeps one output column per input column.
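The inverse permutation step can be checked in isolation with plain numpy. The sketch below uses a hypothetical order (columns 1 and 3 transformed first) rather than the one from the example, because that order is not its own inverse and so shows the effect more clearly. The toy array and variable names are illustrative only.

```python
import numpy as np

# Hypothetical order after a column transform: output position j holds
# original column order_after[j].
order_after = np.array([1, 3, 0, 2])

# Inverse permutation: order_inverse[original column] = output position.
order_inverse = np.zeros(len(order_after), dtype=int)
order_inverse[order_after] = np.arange(len(order_after))

# np.argsort gives the same inverse for any permutation.
same_as_argsort = np.array_equal(order_inverse, np.argsort(order_after))

# Toy data: column k is filled with the value k, then shuffled and restored.
original = np.tile(np.arange(4), (3, 1))
shuffled = original[:, order_after]
restored = shuffled[:, order_inverse]
print(order_inverse, same_as_argsort)
```

Indexing the shuffled array with `order_inverse` puts each column back at its original position.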
To be used as:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])
column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), [0, 2])],
    remainder='passthrough')
pipeline = make_pipeline(column_trans, ReorderColumnTransformer(column_transformer=column_trans))
X_scaled = pipeline.fit_transform(X)
# X_scaled has the same column order as X
Alternative solution not relying on string parsing but reading the input columns of each transformer from the fitted column transformer. It assumes the columns were passed as lists of integer positions and that no transformer adds or drops columns:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ReorderColumnTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_transformer):
        self.column_transformer = column_transformer

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # transformers_ holds (name, transformer, columns) in output order.
        # output_indices_ is not usable here: its slices index the transformed
        # output, not the input columns.
        fitted = self.column_transformer.transformers_
        order_after_column_transform = [col for _, _, columns in fitted for col in columns]
        n_cols = self.column_transformer.n_features_in_
        order_inverse = np.zeros(n_cols, dtype=int)
        order_inverse[order_after_column_transform] = np.arange(n_cols)
        return X[:, order_inverse]
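As a cross-check, the same reordering can be sketched without a custom class. This is a minimal example under the assumptions that the columns are given as lists of integer positions and that every transformer returns one column per input column. It reads the input columns from the fitted `transformers_` attribute and uses `np.argsort` to invert the permutation. The names `X_shuffled` and `order_after` are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

X = np.array([(25, 1, 2, 0),
              (30, 1, 5, 0),
              (25, 10, 2, 1),
              (25, 1, 2, 0),
              (np.nan, 10, 4, 1),
              (40, 1, 2, 1)])

column_trans = ColumnTransformer(
    [('scaler', MinMaxScaler(), [0, 2])],
    remainder='passthrough')
X_shuffled = column_trans.fit_transform(X)

# transformers_ holds (name, fitted transformer, columns) in output order,
# including the remainder entry, so this is the input column at each output position.
order_after = [int(col) for _, _, cols in column_trans.transformers_ for col in cols]

# argsort of a permutation is its inverse permutation.
X_scaled = X_shuffled[:, np.argsort(order_after)]

# The passthrough columns are back at positions 1 and 3.
print(np.array_equal(X_scaled[:, 1], X[:, 1]), np.array_equal(X_scaled[:, 3], X[:, 3]))
```

One caveat with the class-based versions above: they hold a reference to the same `column_trans` object that the pipeline fits. If the pipeline is cloned (for example by cross-validation utilities), the two references no longer point to the same fitted object, so the reordering step would need to be adapted.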
Answered By - Learning is a mess