Issue
I am getting an error when calling inverse_transform after fit_transform. I am trying to convert the encoded float64 values back to their original string values.
getting the data:
df = pd.read_csv("pris.csv", usecols=['judge', 'plea_orcs', 'prior_cases', 'race', 'pris_yrs'])
transforming string columns in csv:
oe = OrdinalEncoder()
df[['plea_orcs']] = oe.fit_transform(df[['plea_orcs']])
df[['judge']] = oe.fit_transform(df[['judge']])
df[['race']] = oe.fit_transform(df[['race']])
X and y for sklearn:
X = df[['plea_orcs', 'judge', 'race', 'prior_cases', 'pris_yrs']]
y = df[['to_prison']]
this is raising the error:
print(oe.inverse_transform(X.plea_orcs[0].reshape(-1,1)))
error:
IndexError Traceback (most recent call last)
<ipython-input-291-11e4763a5a03> in <module>
----> 1 print(oe.inverse_transform(X.plea_orcs[0].reshape(-1,1)))
~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\preprocessing\_encoders.py in inverse_transform(self, X)
733 for i in range(n_features):
734 labels = X[:, i].astype('int64', copy=False)
--> 735 X_tr[:, i] = self.categories_[i][labels]
736
737 return X_tr
IndexError: index 68 is out of bounds for axis 0 with size 5
Should I not be using OrdinalEncoding? I have several different ways but this one seems to be an error in the right direction.
Solution
The problem
oe = OrdinalEncoder()
df[['plea_orcs']] = oe.fit_transform(df[['plea_orcs']])
df[['judge']] = oe.fit_transform(df[['judge']])
df[['race']] = oe.fit_transform(df[['race']])
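In the second line, you fit your ordinal encoder on the column 'plea_orcs'. You can then transform that data (as you do with the convenience fit_transform) and inverse_transform the result.
But then in the third line, you refit the ordinal encoder on the column 'judge'. This loses all information about plea_orcs, and you will no longer be able to transform test data, or inverse-transform. The fourth line refits once more on 'race', so the encoder ends up knowing only the race categories. That is consistent with the traceback: the plea_orcs code 68 is looked up in a category list of size 5.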
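A minimal sketch of the refitting effect, using a hypothetical toy DataFrame in place of pris.csv (the values are invented for illustration). It only inspects the encoder's categories_ attribute after each fit:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for pris.csv
df = pd.DataFrame({
    "plea_orcs": ["A", "B", "C", "A"],
    "race": ["x", "y", "x", "y"],
})

oe = OrdinalEncoder()

# First fit: the encoder learns the categories of plea_orcs
oe.fit_transform(df[["plea_orcs"]])
cats_after_first_fit = [c.tolist() for c in oe.categories_]

# Second fit on the same object: the plea_orcs categories are replaced
oe.fit_transform(df[["race"]])
cats_after_second_fit = [c.tolist() for c in oe.categories_]

print(cats_after_first_fit)   # [['A', 'B', 'C']]
print(cats_after_second_fit)  # [['x', 'y']]
```

After the second fit, any plea_orcs code of 2 or more has nothing to map back to, which is the same kind of out-of-bounds lookup as in the question.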
Some solutions
In increasing order of (IMO) elegance:
- Instantiate separate ordinal encoders for each feature.
- Use just one ordinal encoder, and fit and transform all three columns at once.
- Use just one ordinal encoder together with a ColumnTransformer for selecting the appropriate columns. Use passthrough for other columns, if you don't need to do any preprocessing to them.
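The second and third options might look roughly like this. It is a sketch on a hypothetical toy DataFrame, not the original data. The column names are taken from the question, and the step name "cat" is an arbitrary label chosen here:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for pris.csv
df = pd.DataFrame({
    "plea_orcs": ["A", "B", "C", "A"],
    "judge": ["Smith", "Jones", "Smith", "Lee"],
    "race": ["x", "y", "x", "y"],
    "prior_cases": [0, 2, 1, 5],
})
cat_cols = ["plea_orcs", "judge", "race"]

# Option 2: one encoder, fitted once on all categorical columns.
# It keeps one category list per column, so nothing is overwritten.
oe = OrdinalEncoder()
encoded = oe.fit_transform(df[cat_cols])
decoded = oe.inverse_transform(encoded)

# Option 3: a ColumnTransformer applies the encoder to the chosen columns
# and passes the remaining columns through unchanged.
ct = ColumnTransformer(
    [("cat", OrdinalEncoder(), cat_cols)],
    remainder="passthrough",
)
X = ct.fit_transform(df)

# The fitted encoder is available by its step name; the encoded columns
# come first in the output, followed by the passthrough columns.
fitted_oe = ct.named_transformers_["cat"]
decoded_ct = fitted_oe.inverse_transform(X[:, :len(cat_cols)])

print(decoded[0])
print(decoded_ct[0])
```

Note that inverse_transform expects all encoded columns at once, in the order they were fitted, rather than a single column on its own.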
Off-topic...
...but consider whether ordinal encoding is appropriate: if your data isn't naturally ordered, then you're adding false relationships to your data. See e.g. this DS.SE post.
Answered By - Ben Reiniger