Saturday, December 3, 2022

[FIXED] Generic way to drop columns that are not needed for learning (in python using pandas df)

Labels: numpy, pandas, python

Issue

By generic, I mean that I do not know the name of the column that needs to be dropped before reading in the file. The examples I have found assume that you know the name of the column you wish to drop. Those familiar with the PlayTennis data set are probably used to seeing:

my_df = pd.DataFrame({"Outlook": [Sunny,Cloudy,Rainy], "Temp":[Hot,Cold],
"Humidity":[high,low]...})

However, in my class we get a first column 'Days', so something like:

my_df = pd.DataFrame({"Days":[D1,D2,...,D14],"Outlook": [Sunny,Cloudy,Rainy], "Temp":[Hot,Cold],"Humidity":[high,low]...})

Obviously, looking at this I would want to drop the 'Days' column:

my_df.drop(columns=['Days'], inplace=True)

The problem is that PlayTennis is just a sample dataset, and in the actual dataset the column I need to drop for the same reason as 'Days' will not be called 'Days'. I need a way to drop the useless column, before I pass the data to my machine learning algorithm, using some method that looks at the number of unique values in a column and recognizes that there are too many for the column to be useful. (Edit: meaning it overfits. If I have 30 instances and 30 days, the model will try to predict a result based on what day it is, which is useless for prediction.)

import pandas as pd
import numpy as np

df_train = pd.read_csv("assets/playtennis.csv") # read in data (forward slash avoids the invalid "\p" escape)
df_train.head() # see first 5

# get a list of attributes, excluding the class label (e.g., PlayTennis)
def attributes(df, label):
    return df.columns.drop(label).values.tolist()


def trash(df, attr, label):
    # Do something to trash useless columns
    # placeholder: x stands for the column to drop, which is what I need to find
    df.drop(columns=[x], inplace=True)
    
class_label = df_train.columns[-1] # class label in the last column
attr = attributes(df_train,class_label)
trash(df_train,attr,class_label)

I have only been working with Python for about 6 weeks, so please forgive (and point out) syntax errors.

Solution

First, it was not obvious why you want to drop the 'Days' column from your dataset. I assume you want to drop any feature that has a distinct value on every row, or so many unique entries that it has no predictive value for your label. You can get the unique values of a column (e.g. 'name') by calling df['name'].unique(), and call len() on the result to get the number of unique values.
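As a small illustration of that counting step, here is a sketch on a toy frame. The column names and values are made up for the example and are not taken from the asker's file. It also shows Series.nunique(), a pandas shortcut that gives the same count as len(unique()) when the column has no missing values (nunique() skips NaN by default, unique() does not).

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Days": ["D1", "D2", "D3", "D4"],
    "Outlook": ["Sunny", "Sunny", "Rainy", "Cloudy"],
    "PlayTennis": ["No", "No", "Yes", "Yes"],
})

# Count distinct values the way the answer describes.
n_unique_days = len(toy["Days"].unique())

# Same idea with the nunique() shortcut on a single column.
n_unique_outlook = toy["Outlook"].nunique()

# Called on the whole frame, nunique() returns one count per column.
unique_per_column = toy.nunique()
```

Here 'Days' has as many distinct values as there are rows, which is the pattern the question wants to detect.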

I would suggest setting a threshold on the proportion of unique values, and dropping any column whose proportion reaches it.

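def trash(df, attr, label, threshold=0.8):
    # label is unused here; attr already excludes the class label
    for col in attr:
        # share of rows in this column that hold a distinct value
        proportion = len(df[col].unique()) / len(df)
        if proportion >= threshold:
            # columns= is needed: a bare list makes drop() look for row labels
            df.drop(columns=[col], inplace=True)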
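A variant of the same idea, offered as a sketch rather than the answer's own code: it returns a new DataFrame instead of modifying the input in place, and it never drops the label column. The function name drop_high_cardinality and the toy data are hypothetical; the 0.8 default is the threshold used in the answer.

```python
import pandas as pd

def drop_high_cardinality(df, label, threshold=0.8):
    """Return a copy of df without the feature columns whose share of
    distinct values is at or above threshold. The label column is kept."""
    features = df.columns.drop(label)
    # nunique() counts distinct non-null values per column
    proportions = df[features].nunique() / len(df)
    to_drop = proportions[proportions >= threshold].index.tolist()
    return df.drop(columns=to_drop)

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Days": ["D1", "D2", "D3", "D4", "D5"],
    "Outlook": ["Sunny", "Sunny", "Rainy", "Cloudy", "Rainy"],
    "Humidity": ["high", "low", "high", "low", "high"],
    "PlayTennis": ["No", "No", "Yes", "Yes", "Yes"],
})

# 'Days' is distinct on every row (proportion 1.0), so it is removed.
cleaned = drop_high_cardinality(toy, label="PlayTennis")
```

Note that this heuristic suits categorical features. A continuous numeric column naturally has many distinct values and would also be dropped, so the threshold (or the set of columns checked) may need adjusting for such data.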

Answered By - Mengxiao Li

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
