Issue
I am following the sklearn_pandas walk through found on the sklearn_pandas README on github and am trying to modify the DateEncoder() custom transformer example to do 2 additional things:
- Convert string type columns to datetime while taking the date format as a parameter
- Append the original column names when spitting out the new columns. E.g: if Input Column: Date1 then Outputs: Date1_year, Date1_month, Date_1 day.
Here is my attempt (with a rather rudimentary understanding of sklearn pipelines):
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn_pandas import DataFrameMapper
class DateEncoder(TransformerMixin):
'''
Specify date format using python strftime formats
'''
def __init__(self, date_format='%Y-%m-%d'):
self.date_format = date_format
def fit(self, X, y=None):
self.dt = pd.to_datetime(X, format=self.date_format)
return self
def transform(self, X):
dt = X.dt
return pd.concat([dt.year, dt.month, dt.day], axis=1)
data = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23',
'2004-07-20','2007-12-21','2006-12-22','2003-04-23'],
'dates2' : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23',
'2014-07-20','2011-12-21','2014-12-22','2015-04-23']})
DATE_COLS = ['dates1', 'dates2']
Mapper = DataFrameMapper([(i, DateEncoder(date_format='%Y-%m-%d')) for i in DATE_COLS], input_df=True, df_out=True)
test = Mapper.fit_transform(data)
But on runtime, I get the following error:
AttributeError: Can only use .dt accessor with datetimelike values
Why am I getting this error and how to fix it? Also any help with renaming the column names as mentioned above with the original columns (Date1_year, Date1_month, Date_1 day) would be greatly appreciated!
Solution
I was able to break the data format conversion and date splitter into two separate transformers and it worked.
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn_pandas import DataFrameMapper
data2 = pd.DataFrame({'dates1': ['2001-12-20','2002-10-21','2003-08-22','2004-08-23',
'2004-07-20','2007-12-21','2006-12-22','2003-04-23'],
'dates2' : ['2012-12-20','2009-10-21','2016-08-22','2017-08-23',
'2014-07-20','2011-12-21','2014-12-22','2015-04-23']})
class DateFormatter(TransformerMixin):
def fit(self, X, y=None):
# stateless transformer
return self
def transform(self, X):
# assumes X is a DataFrame
Xdate = X.apply(pd.to_datetime)
return Xdate
class DateEncoder(TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X):
dt = X.dt
return pd.concat([dt.year, dt.month, dt.day], axis=1)
DATE_COLS = ['dates1', 'dates2']
datemult = DataFrameMapper(
[ (i,[DateFormatter(),DateEncoder()]) for i in DATE_COLS ]
, input_df=True, df_out=True)
df = datemult.fit_transform(data2)
This code outputs:
Out[4]:
dates1_0 dates1_1 dates1_2 dates2_0 dates2_1 dates2_2
0 2001 12 20 2012 12 20
1 2002 10 21 2009 10 21
2 2003 8 22 2016 8 22
3 2004 8 23 2017 8 23
4 2004 7 20 2014 7 20
5 2007 12 21 2011 12 21
6 2006 12 22 2014 12 22
7 2003 4 23 2015 4 23
However I am still looking for a way to rename the new columns while applying the DateEncoder()
transformer. E.g: dates_1_0
--> dates_1_year
and dates_2_2
--> dates_2_month
. I'd be happy to select that as the solution.
Answered By - Jason
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.