Issue
By generic, I mean that I do not know the name of the column that needs to be dropped before pulling in the file. The examples I have found assume that you know the name of the column you wish to drop. Those familiar with the PlayTennis data set are probably used to seeing:
my_df = pd.DataFrame({"Outlook": [Sunny,Cloudy,Rainy], "Temp":[Hot,Cold],
"Humidity":[high,low]...})
However, in my class we get a first column 'Days', so something like:
my_df = pd.DataFrame({"Days":[D1,D2,...,D14],"Outlook": [Sunny,Cloudy,Rainy], "Temp":[Hot,Cold],"Humidity":[high,low]...})
Obviously, looking at this I would want to drop the 'Days' column:
my_df.drop(columns=['Days'], inplace=True)
The problem is that PlayTennis is just a sample dataset, and in the actual dataset the column I may need to drop, for the same reason as 'Days', will not be called Days. I need a way to drop the useless column, before I feed the data into my machine learning algorithm, using some method that looks at the number of unique values in a column and recognizes that there are too many for it to be useful. (Edit: Meaning it overfits. If I have 30 instances and 30 days, the model will try to predict a result based on what day it is, which is useless for prediction.)
import pandas as pd
import numpy as np

df_train = pd.read_csv(r"assets\playtennis.csv")  # read in data (raw string keeps the backslash literal)
df_train.head()  # see first 5

# get a list of attributes excluding the class label (e.g., PlayTennis)
def attributes(df, label):
    return df.columns.drop(label).values.tolist()

def trash(df, attr, label):
    # Do something to trash useless columns
    df.drop(columns=[x], inplace=True)  # x: the column to drop, not known in advance

class_label = df_train.columns[-1]  # class label in the last column
attr = attributes(df_train, class_label)
trash(df_train, attr, class_label)
I have only been working with Python for about 6 weeks, so please forgive (and point out) syntax errors.
Solution
First, it was not obvious why you want to drop the Days column from your dataset.
I assume that you want to drop any feature that has a distinct value on each row, or so many unique entries that it has no predictive value for your label.
You can get the unique values of a column (e.g. 'name') by calling df['name'].unique(), and call len() on the result to get the number of unique values.
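As a quick sketch of that idea: Series.nunique() gives the same count as len(df['name'].unique()) as long as the column has no missing values (nunique() skips NaN by default, unique() keeps it), and DataFrame.nunique() reports the count for every column at once. The toy frame below is made up for illustration and is not the asker's actual file.

```python
import pandas as pd

# Hypothetical toy data standing in for the PlayTennis file.
toy = pd.DataFrame({
    "Days": ["D1", "D2", "D3", "D4"],
    "Outlook": ["Sunny", "Sunny", "Rainy", "Cloudy"],
    "PlayTennis": ["No", "No", "Yes", "Yes"],
})

n_unique_days = len(toy["Days"].unique())  # count via unique() + len()
per_column = toy.nunique()                 # unique count for every column
unique_ratio = per_column / len(toy)       # share of rows that are distinct

print(n_unique_days)
print(unique_ratio)
```

A ratio close to 1 means nearly every row has its own value, which is the signature of an ID-like column such as 'Days'.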
I would suggest setting a threshold on the proportion of unique values, and dropping any column that reaches it:
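def trash(df, attr, label, threshold=0.8):
    # Drop every attribute whose share of unique values reaches the threshold
    for col in attr:
        proportion = len(df[col].unique()) / len(df)
        if proportion >= threshold:
            # columns= is required; a bare list would be treated as row labels
            df.drop(columns=[col], inplace=True)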
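If you would rather not modify the frame in place, a variant might look like the sketch below. The function name drop_high_cardinality and the toy data are assumptions made for illustration; the 0.8 threshold is the same default as above and should be tuned for your data.

```python
import pandas as pd

def drop_high_cardinality(df, label, threshold=0.8):
    """Return (new_frame, dropped_names); df itself is left untouched."""
    features = df.columns.drop(label)          # every column except the label
    ratios = df[features].nunique() / len(df)  # unique share per feature
    to_drop = ratios[ratios >= threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy data, not the asker's file.
toy = pd.DataFrame({
    "Days": ["D1", "D2", "D3", "D4", "D5"],
    "Outlook": ["Sunny", "Sunny", "Rainy", "Cloudy", "Rainy"],
    "PlayTennis": ["No", "No", "Yes", "Yes", "Yes"],
})

cleaned, dropped = drop_high_cardinality(toy, label="PlayTennis")
print(dropped)
print(cleaned.columns.tolist())
```

Note that a unique-ratio test will also flag continuous numeric features, since those tend to have a different value on every row, so you may want to apply it only to categorical columns.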
Answered By - Mengxiao Li