Issue
Memory Speed | Device Weight | Screen Size | GPU Memory Type | GPU Memory Size | GPU Type | Panel Type | Processor Generation | Processor | Operating System | Card Reader | Backlit Keyboard | Max Processor Speed | Max Screen Resolution | Fingerprint Reader | RAM (System Memory) | SSD Capacity | Product Model | Price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2666 MHz | 2 - 4 kg | 15.6 inches | Onboard | Shared | Integrated Graphics | LED | 10th Generation | 1005G1 | Windows 11 Home | None | 0 | 3.4 GHz | 1920 x 1080 | None | 4 GB | 256 GB | Notebook | Very Low |
I have a dataset like the one above, stored as an .xlsx file. I'm doing data preprocessing now and have to find the mean, median, and mode. The exact task description: "Conducting a descriptive data analysis of the dataset involves examining record counts, attribute numbers, attribute types, measures of central tendency, measures of dispersion from the center, and generating five-number summaries."
Now I want to ask: how do I find the mean for object-dtype columns in my dataset? At the moment the Backlit Keyboard column's average is the only one I can compute.
import pandas as panda
dataset = panda.read_excel('data.xlsx')
print(dataset.info())
RangeIndex: 994 entries, 0 to 993
Data columns (total 19 columns):
Column Name | Non-Null Count | Dtype |
---|---|---|
Memory Speed | 888 | object |
Device Weight | 985 | object |
Screen Size | 994 | object |
GPU Memory Type | 884 | object |
GPU Memory Size | 946 | object |
GPU Type | 955 | object |
Panel Type | 994 | object |
Processor Generation | 946 | object |
Processor | 979 | object |
Operating System | 994 | object |
Card Reader | 864 | object |
Backlight | 994 | int64 |
Max Processor Speed | 950 | object |
Max Screen Resolution | 988 | object |
Fingerprint Reader | 886 | object |
RAM (System Memory) | 987 | object |
SSD Capacity | 991 | object |
Product Model | 994 | object |
Price | 994 | object |
The result of dataset.head():
Memory Speed | Device Weight | Screen Size | Graphics Card Memory Type | Graphics Card Memory | Graphics Card Type | Screen Panel Type | Processor Generation | Processor | Operating System | Card Reader | Backlit Keyboard | Max Processor Speed | Max Screen Resolution | Fingerprint Reader | RAM (System Memory) | SSD Capacity | Product Model | Price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1066 MHz | NaN | 10 inches | NaN | 1 GB | NaN | IPS | 1st Generation | 1000M | Android | NaN | 0 | 1.05 GHz | NaN | NaN | NaN | 1 TB | Notebook | High |
1066 MHz | NaN | 10 inches | NaN | 1 GB | NaN | IPS | 1st Generation | 1000M | Android | NaN | 0 | 1.05 GHz | NaN | NaN | NaN | 1 TB | Notebook | High |
1066 MHz | NaN | 10 inches | NaN | 1 GB | NaN | IPS | 1st Generation | 1000M | Android | NaN | 0 | 1.05 GHz | NaN | NaN | NaN | 1 TB | Notebook | Medium |
3200 MHz | 1 - 2 kg | 15.6 inches | GDDR4 | 2 GB | External GPU | LED | 10th Generation | 1035G1 | Windows 10 Home | Yes | 0 | 3.6 GHz | 1920 x 1080 | No | 8 GB | 512 GB | Notebook | Low |
3200 MHz | 1 - 2 kg | 15.6 inches | GDDR5 | 2 GB | External GPU | LED | 10th Generation | 1035G1 | Windows 10 Home | No | 0 | 3.6 GHz | 1920 x 1080 | No | 12 GB | 1 TB | Notebook | Low |
Solution
There is a pandas describe function:
dataset.describe(include = "all")
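Note that for object-dtype columns, describe reports count, unique, top, and freq rather than a mean; numeric statistics only appear for numeric columns. A minimal sketch with made-up data (the column names just mirror two columns from the question):

```python
import pandas as pd

# Toy frame with one object column and one numeric column
df = pd.DataFrame({
    "Price": ["Low", "Low", "High"],   # categorical, stored as object
    "Backlit Keyboard": [0, 0, 1],     # already numeric
})

summary = df.describe(include="all")
# Object columns get count/unique/top/freq; numeric ones get mean/std/quartiles,
# with NaN in the slots that do not apply
print(summary)
```

So describe alone gives you the mode-like top value for object columns, but no mean.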
For a better answer than describe alone, the real issue is converting the values to numeric types where possible. So I had to go through some tedious checking:
import pandas as pd
import numpy as np
df = pd.read_excel("Dataset.xlsx") # import data from op's file
df_new = pd.DataFrame(columns=df.columns) # generate an empty copy with headers
for col in df.columns:  # go through the columns
    if df[col].dtypes == "object":  # only object columns need cleaning
        values = list()  # collect the cleaned values for this column
        for val in df[col].values:  # for every value in the column
            if pd.isna(val):  # keep NaN as NaN
                values.append(np.nan)
            elif " " in val:  # a space hints at a "number + unit" pattern
                val_splitted = val.split(" ")  # split on the space
                if "," in val_splitted[0]:  # decimal separator is sometimes a comma
                    val_splitted[0] = val_splitted[0].replace(",", ".")  # normalize to a dot
                if len(val_splitted) == 2:  # exactly "number unit"
                    try:  # try to convert to float
                        if col == "SSD Capacity":  # only here the unit switches between GB and TB
                            if val_splitted[1] == "GB":
                                values.append(float(val_splitted[0]) / 1000)  # convert GB to TB
                            else:
                                values.append(float(val_splitted[0]))  # already in TB
                        else:  # any other column: keep the number as is
                            values.append(float(val_splitted[0]))
                    except ValueError:  # value cannot be converted, so it is not numeric
                        values.append(val)  # keep the original string
                else:
                    values.append(val)  # too many spaces, e.g. "1920 x 1080": keep as is
            else:
                values.append(val)  # no space and not NaN: keep as is
        df_new[col] = values  # assign the cleaned values to the column in df_new
    else:
        df_new[col] = df[col]  # non-object columns are copied over unchanged
print(df_new.describe(include = "all"))  # print the description
The results are like this:
| Memory Speed | Screen Size | Backlit | Max Processor Speed | RAM (System Memory) | SSD Capacity |
---|---|---|---|---|---|---|
count | 888.000 | 994.000 | 994.000 | 950.000 | 987.000 | 991.000 |
mean | 3339.874 | 15.336 | 0.243 | 4.293 | 17.431 | 0.638 |
std | 626.284 | 0.923 | 0.429 | 0.616 | 12.235 | 0.423 |
min | 1066.000 | 10.000 | 0.000 | 1.050 | 4.000 | 0.000 |
25% | 3200.000 | 15.600 | 0.000 | 4.200 | 8.000 | 0.500 |
50% | 3200.000 | 15.600 | 0.000 | 4.400 | 16.000 | 0.512 |
75% | 3200.000 | 15.600 | 0.000 | 4.700 | 16.000 | 1.000 |
max | 6400.000 | 18.400 | 1.000 | 5.600 | 128.000 | 4.000 |
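As an aside, much of the loop above can be expressed more compactly with pandas' vectorized string methods. This is a sketch on toy data, not tested against the original file; it assumes the same "number + unit" pattern and lets to_numeric turn anything non-numeric into NaN:

```python
import pandas as pd

# Toy data mimicking the "number + unit" pattern from the question,
# including a missing value and a decimal comma
s = pd.Series(["3200 MHz", "1066 MHz", None, "2,4 GHz"])

# Take the token before the first space, normalize the decimal comma,
# and coerce anything non-numeric to NaN
numeric = pd.to_numeric(
    s.str.split(" ").str[0].str.replace(",", ".", regex=False),
    errors="coerce",
)
print(numeric.mean())  # mean over the numeric values, NaN ignored
```

This skips the per-value try/except, though the SSD GB/TB unit conversion would still need the column-specific handling from the loop.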
Answered By - Tino D