Issue
I have a set of features that I would like to model, one of which is actually a histogram sampled at 100 different points. Thus this histogram feature is actually 100 different features. I would like to reduce the dimensionality of my modeling problem by performing PCA on the histogram features. However, I do not want to include the other features in the PCA, in order to maintain the interpretability of my model.
Ideally I would like to form a pipeline with PCA to transform the histogram features and SVC to perform the fitting, which I would then feed to GridSearchCV to determine the SVC hyperparameters. Is it somehow possible in this setup to have PCA transform only a subset of my features (the histogram bins)? The easiest way would be to edit the PCA object to accept a feature mask, but I would certainly prefer to use existing functionality.
EDIT
After implementing @eickenberg's answer I realized that I also wanted an inverse_transform method for the new PCA class. This method recreates the initial feature set with columns in their original order. It is provided below for anyone else who is interested:
def inverse_transform(self, X):
    if self.mask is not None:
        # Inverse transform appropriate data
        inv_mask = np.arange(len(X[0])) >= sum(~self.mask)
        inv_transformed = self.pca.inverse_transform(X[:, inv_mask])
        # Place inverse transformed columns back in their original order
        inv_transformed_reorder = np.zeros([len(X), len(self.mask)])
        inv_transformed_reorder[:, self.mask] = inv_transformed
        inv_transformed_reorder[:, ~self.mask] = X[:, ~inv_mask]
        return inv_transformed_reorder
    else:
        return self.pca.inverse_transform(X)
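As a rough sanity check of the method above, the sketch below restates the MaskedPCA class from the answer in compact form, with inverse_transform attached, and runs a round trip on random data. The variable names, array sizes and random seed are made up for illustration. Two details are assumptions rather than part of the original code: fit takes an optional y, and the mixin is listed before BaseEstimator. When n_components equals the number of masked columns, PCA discards nothing, so the round trip should reproduce the input up to floating-point error.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA


# Compact restatement of the answer's class; mixin listed first (an assumption,
# the original lists BaseEstimator first).
class MaskedPCA(TransformerMixin, BaseEstimator):
    def __init__(self, n_components=2, mask=None):
        self.n_components = n_components
        self.mask = mask  # boolean array, True for columns that go through PCA

    def fit(self, X, y=None):
        cols = self.mask if self.mask is not None else slice(None)
        self.pca = PCA(n_components=self.n_components).fit(X[:, cols])
        return self

    def transform(self, X):
        if self.mask is None:
            return self.pca.transform(X)
        # Untouched columns first, PCA components after them
        return np.hstack([X[:, ~self.mask], self.pca.transform(X[:, self.mask])])

    def inverse_transform(self, X):
        if self.mask is None:
            return self.pca.inverse_transform(X)
        n_untouched = (~self.mask).sum()
        restored = np.zeros((X.shape[0], len(self.mask)))
        restored[:, self.mask] = self.pca.inverse_transform(X[:, n_untouched:])
        restored[:, ~self.mask] = X[:, :n_untouched]
        return restored


rng = np.random.RandomState(0)
X = rng.randn(50, 8)
mask = np.arange(8) >= 3  # columns 3..7 go through PCA

# Keeping all 5 components makes the PCA step lossless.
full = MaskedPCA(n_components=5, mask=mask).fit(X)
X_back = full.inverse_transform(full.transform(X))
roundtrip_ok = np.allclose(X_back, X)

# With fewer components, only the unmasked columns come back exactly.
lossy = MaskedPCA(n_components=2, mask=mask).fit(X)
X_lossy = lossy.inverse_transform(lossy.transform(X))
untouched_ok = np.array_equal(X_lossy[:, ~mask], X[:, ~mask])
```

With fewer components than masked columns, the masked columns are only approximated, while the unmasked columns are copied back unchanged.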
Solution
This is not possible straight out of the box with scikit-learn. In order to be able to exploit the full functionality of Pipeline and GridSearchCV, consider creating an object MaskedPCA, inheriting from sklearn.base.BaseEstimator and exposing the methods fit and transform. In it you should use a PCA object on your masked features. The mask should be passed to the constructor.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class MaskedPCA(BaseEstimator, TransformerMixin):

    def __init__(self, n_components=2, mask=None):
        # mask selects the columns to run through PCA. It is assumed to be a boolean array to avoid code overhead
        self.n_components = n_components
        self.mask = mask

    def fit(self, X, y=None):
        # y is unused, but Pipeline passes it to every step, so fit must accept it
        self.pca = PCA(n_components=self.n_components)
        mask = self.mask if self.mask is not None else slice(None)
        self.pca.fit(X[:, mask])
        return self

    def transform(self, X):
        mask = self.mask if self.mask is not None else slice(None)
        pca_transformed = self.pca.transform(X[:, mask])
        if self.mask is not None:
            # Untouched columns come first, followed by the PCA components
            remaining_cols = X[:, ~mask]
            return np.hstack([remaining_cols, pca_transformed])
        else:
            return pca_transformed
You can test it on some generated data:
import numpy as np
X = np.random.randn(100, 20)
mask = np.arange(20) > 4
mpca = MaskedPCA(n_components=2, mask=mask)
transformed = mpca.fit(X).transform(X)
# check whether first five columns are equal
from numpy.testing import assert_array_equal
assert_array_equal(X[:, :5], transformed[:, :5])
Observe that transformed now has (~mask).sum() + mpca.n_components == 7 columns.
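To connect this back to the question, here is a minimal sketch of how such a transformer could be placed in a Pipeline with SVC and tuned with GridSearchCV. The data, the labels, the step names "mpca" and "svc", and the parameter grid are all invented for illustration. The class is restated so that the block runs on its own, and it assumes that fit accepts an optional y, because Pipeline passes the targets to every step.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


# Compact restatement of MaskedPCA; mixin listed first (an assumption).
class MaskedPCA(TransformerMixin, BaseEstimator):
    def __init__(self, n_components=2, mask=None):
        self.n_components = n_components
        self.mask = mask  # boolean array, True for columns that go through PCA

    def fit(self, X, y=None):
        cols = self.mask if self.mask is not None else slice(None)
        self.pca = PCA(n_components=self.n_components).fit(X[:, cols])
        return self

    def transform(self, X):
        if self.mask is None:
            return self.pca.transform(X)
        return np.hstack([X[:, ~self.mask], self.pca.transform(X[:, self.mask])])


# Synthetic stand-in for the question's data: 3 ordinary features + 100 histogram bins.
rng = np.random.RandomState(0)
n_samples, n_other, n_bins = 120, 3, 100
X = rng.randn(n_samples, n_other + n_bins)
y = (X[:, 0] + X[:, n_other:].mean(axis=1) > 0).astype(int)
mask = np.arange(n_other + n_bins) >= n_other  # True for the histogram bins

pipe = Pipeline([
    ("mpca", MaskedPCA(mask=mask)),
    ("svc", SVC()),
])

# Hypothetical grid: the PCA size can be searched alongside the SVC parameters.
param_grid = {
    "mpca__n_components": [2, 5],
    "svc__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

best = search.best_params_
n_out = search.best_estimator_.named_steps["mpca"].transform(X).shape[1]
```

Because n_components and mask are plain constructor arguments stored under the same names, get_params and set_params from BaseEstimator work without extra code, which is what lets GridSearchCV address mpca__n_components.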
Answered By - eickenberg