Issue
Basic question here:
I'm trying to implement a simple classification model for credit card default where I just use model.fit, model.predict on my input data. However, that input data contains both categorical data (like demographic information like Age, Married or Not, Education level) and continuous data (like credit balances).
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
LIMIT_BAL    30000 non-null float64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
PAY_1        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null float64
BILL_AMT2    30000 non-null float64
BILL_AMT3    30000 non-null float64
BILL_AMT4    30000 non-null float64
BILL_AMT5    30000 non-null float64
BILL_AMT6    30000 non-null float64
PAY_AMT1     30000 non-null float64
PAY_AMT2     30000 non-null float64
PAY_AMT3     30000 non-null float64
PAY_AMT4     30000 non-null float64
PAY_AMT5     30000 non-null float64
PAY_AMT6     30000 non-null float64
default      30000 non-null int64
dtypes: float64(13), int64(11)
memory usage: 5.7 MB
From my understanding, scikit-learn requires all data to be numerical and continuous or specifically coded as a categorical variable. The numerical part is not a problem since all of my data is coded numerically (like 0 for Married, 1 for not) but 3 of my variables (SEX, EDUCATION, and MARRIAGE) are nominal/ordinal and need to be encoded as categorical variables instead of int64 ones.
How do I encode these 3 variables with scikit-learn's preprocessing module to properly feed these features into a model like Logistic Regression?
Thanks in advance, and please forgive the formatting (feel free to edit or recommend how I can properly include Jupyter Notebook output into a Stack Overflow post).
Solution
Categorical features need extra attention during feature engineering, because features such as age or dates are difficult to encode. There are many ways to encode them, for example based on analysis of the data or on domain knowledge.
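As one concrete option for the scikit-learn case asked about, here is a minimal sketch that one-hot encodes SEX, EDUCATION and MARRIAGE and scales the continuous columns before fitting a logistic regression. The DataFrame below is a small synthetic stand-in with made-up values (only the column names come from the question), and the choice of OneHotEncoder plus StandardScaler is an assumption about what suits this data, not the only valid approach.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the question's data; the values are invented.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "LIMIT_BAL": rng.uniform(10_000, 500_000, n),
    "SEX": rng.integers(1, 3, n),        # codes 1..2
    "EDUCATION": rng.integers(1, 5, n),  # codes 1..4
    "MARRIAGE": rng.integers(1, 4, n),   # codes 1..3
    "AGE": rng.integers(21, 70, n),
    "default": rng.integers(0, 2, n),
})

categorical = ["SEX", "EDUCATION", "MARRIAGE"]
numeric = ["LIMIT_BAL", "AGE"]

# One-hot encode the integer-coded categories; standardize the numeric columns.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Chaining the steps keeps the encoding inside model.fit / model.predict.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = data.drop(columns="default")
y = data["default"]
model.fit(X, y)
predictions = model.predict(X)
```

Because the encoder lives inside the pipeline, the same transformation is applied automatically whenever model.predict is called on new rows.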
The category_encoders library provides many encoders for such features, several of which are based on statistics. You can find more here.
Here is another good resource that shows how to use an encoding method through an example.
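To illustrate what encoding "by the use of statistics" can mean, here is a hand-rolled sketch of target (mean) encoding using only pandas. It is not the category_encoders API, and the tiny DataFrame is invented for illustration; it simply replaces each category code with the average of the target for that category.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
df = pd.DataFrame({
    "EDUCATION": [1, 1, 2, 2, 2, 3],
    "default":   [1, 0, 0, 0, 1, 1],
})

# Mean of the target within each category.
category_means = df.groupby("EDUCATION")["default"].mean()

# Replace each category code with its category's target mean.
df["EDUCATION_encoded"] = df["EDUCATION"].map(category_means)
```

In practice the means should be computed on the training split only and then applied to the test split, otherwise the target leaks into the features.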
Answered By - Ankish Bansal