Issue
I'm at the stage of cleaning the categorical variables in my data. More specifically, I'm now removing quasi-constant categorical variables.
I've searched and found that VarianceThreshold() from sklearn.feature_selection can do the job. However, I've got unexpected results. My piece of code:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import OrdinalEncoder

# Create a temporary dataframe that fills null values with the string
# "null_val", as OrdinalEncoder() doesn't work with null values
temp_df = train_df_cat.fillna("null_val")
# Initialize the ordinal encoder, encode the labels as numbers,
# then convert the result back to a dataframe
ord_enc = OrdinalEncoder()
temp_df = ord_enc.fit_transform(temp_df)
temp_df = pd.DataFrame(temp_df, columns=train_df_cat.columns)
# Get the columns where a single value makes up 90% or more of the data
var_thr = VarianceThreshold(threshold=0.1)
var_thr.fit(temp_df)
quasi_constant_cat = [column for column in temp_df.columns
                      if column not in temp_df.columns[var_thr.get_support()]]
# Display the results
display(quasi_constant_cat)
Returns this:
['Street',
'Utilities',
'LandSlope',
'Condition2',
'Heating',
'CentralAir',
'PoolQC']
Supposedly, those are the features in which a single value is present 90% or more of the time. However:
display(temp_df["Alley"].value_counts(normalize=True))
Returns, as I had seen on a plot above:
2.00 0.94
0.00 0.03
1.00 0.03
Name: Alley, dtype: float64
Therefore, the Alley feature (and maybe others) has the same value, 2.00 (which is actually the number the encoder assigned to null values in this temp_df), in 94% of its rows, but it is not included in the output of VarianceThreshold().
What should I change in my code to make this function work properly?
Solution
The variance in this particular case would be E[X^2] - (E[X])^2 = (2^2 * 0.94 + 1^2 * 0.03 + 0^2 * 0.03) - (2 * 0.94 + 1 * 0.03 + 0 * 0.03)^2 = 3.79 - 1.91^2 = 0.1419 > 0.1.
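As a quick sanity check of that arithmetic, here is a minimal sketch with numpy, assuming 100 rows split 94/3/3 as in the value_counts output above:

import numpy as np

# Reconstruct the encoded Alley column from its value frequencies:
# 94 rows of 2.0, 3 rows of 1.0, 3 rows of 0.0
alley = np.array([2.0] * 94 + [1.0] * 3 + [0.0] * 3)

# VarianceThreshold uses the population variance (ddof=0), same as np.var
print(np.var(alley))  # 0.1419, which is above the 0.1 threshold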
Looks like you'll need a bit higher threshold.
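As a follow-up, here is a minimal sketch assuming temp_df and var_thr from the question: you can read the per-column variances the fitted estimator computed from its variances_ attribute to pick a sensible threshold, or flag quasi-constant columns directly by value frequency, which matches the stated 90% intent:

import pandas as pd

# Inspect the variances sklearn actually computed, per column
variances = pd.Series(var_thr.variances_, index=temp_df.columns)
print(variances.sort_values())

# Alternative: flag columns whose most frequent value covers 90%+ of rows
quasi_constant_cat = [
    col for col in temp_df.columns
    if temp_df[col].value_counts(normalize=True).iloc[0] >= 0.9
]
print(quasi_constant_cat)

The frequency-based check may be worth considering here, because the variance of an ordinally encoded column also depends on which numeric codes the labels happen to receive, not just on how skewed the distribution is.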
Answered By - dx2-66