Issue
Memory Speed | Device Weight | Screen Size | GPU Memory Type | GPU Memory Size | GPU Type | Panel Type | Processor Generation | Processor | Operating System | Card Reader | Backlit Keyboard | Max Processor Speed | Max Screen Resolution | Fingerprint Reader | RAM (System Memory) | SSD Capacity | Product Model | Price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2666 MHz | 2 - 4 kg | 15.6 inches | Onboard | Shared | Integrated Graphics | LED | 10th Generation | 1005G1 | Windows 11 Home | None | 0 | 3.4 GHz | 1920 x 1080 | None | 4 GB | 256 GB | Notebook | Very Low |
I have a dataset like the one above, stored as an .xlsx file. I'm doing data preprocessing now and have to find the mean, median, and mode. The exact task description: "Conducting a descriptive data analysis of the dataset involves examining record counts, attribute numbers, attribute types, measures of central tendency, measures of dispersion from the center, and generating five-number summaries."
Now I want to ask: how do I find the mean for object-dtype columns in my dataset? At the moment the Backlit Keyboard column's average is the only one I can compute.
import pandas as panda
dataset = panda.read_excel('data.xlsx')
print(dataset.info())
RangeIndex: 994 entries, 0 to 993
Data columns (total 19 columns):
Column Name | Non-Null Count | Dtype |
---|---|---|
Memory Speed | 888 | object |
Device Weight | 985 | object |
Screen Size | 994 | object |
GPU Memory Type | 884 | object |
GPU Memory Size | 946 | object |
GPU Type | 955 | object |
Panel Type | 994 | object |
Processor Generation | 946 | object |
Processor | 979 | object |
Operating System | 994 | object |
Card Reader | 864 | object |
Backlight | 994 | int64 |
Max Processor Speed | 950 | object |
Max Screen Resolution | 988 | object |
Fingerprint Reader | 886 | object |
RAM (System Memory) | 987 | object |
SSD Capacity | 991 | object |
Product Model | 994 | object |
Price | 994 | object |
The result of dataset.head():
Memory Speed | Device Weight | Screen Size | Graphics Card Memory Type | Graphics Card Memory | Graphics Card Type | Screen Panel Type | Processor Generation | Processor | Operating System | Card Reader | Backlit Keyboard | Max Processor Speed | Max Screen Resolution | Fingerprint Reader | RAM (System Memory) | SSD Capacity | Product Model | Price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1066 MHz | NaN | 10 inches | NaN | 1 GB | NaN | IPS | 1st Generation | 1000M | Android | NaN | 0 | 1.05 GHz | NaN | NaN | NaN | 1 TB | Notebook | High |
1066 MHz | NaN | 10 inches | NaN | 1 GB | NaN | IPS | 1st Generation | 1000M | Android | NaN | 0 | 1.05 GHz | NaN | NaN | NaN | 1 TB | Notebook | High |
1066 MHz | NaN | 10 inches | NaN | 1 GB | NaN | IPS | 1st Generation | 1000M | Android | NaN | 0 | 1.05 GHz | NaN | NaN | NaN | 1 TB | Notebook | Medium |
3200 MHz | 1 - 2 kg | 15.6 inches | GDDR4 | 2 GB | External GPU | LED | 10th Generation | 1035G1 | Windows 10 Home | Yes | 0 | 3.6 GHz | 1920 x 1080 | No | 8 GB | 512 GB | Notebook | Low |
3200 MHz | 1 - 2 kg | 15.6 inches | GDDR5 | 2 GB | External GPU | LED | 10th Generation | 1035G1 | Windows 10 Home | No | 0 | 3.6 GHz | 1920 x 1080 | No | 12 GB | 1 TB | Notebook | Low |
Solution
There is a pandas describe function:
dataset.describe(include = "all")
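Note that for object-dtype columns, describe reports count, unique, top, and freq rather than a mean; numeric statistics only appear for numeric columns. A minimal sketch with made-up data (the column names just mirror two columns from the question):

```python
import pandas as pd

# Toy frame with one object column and one numeric column
df = pd.DataFrame({
    "Price": ["Low", "Low", "High"],   # categorical, stored as object
    "Backlit Keyboard": [0, 0, 1],     # already numeric
})

summary = df.describe(include="all")
# Object columns get count/unique/top/freq; numeric ones get mean/std/quartiles,
# with NaN in the slots that do not apply
print(summary)
```

So describe alone gives you the mode-like top value for object columns, but no mean.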
For a better answer than describe alone, the real issue is converting the values to numeric types where possible. So I had to go through some tedious checking:
import pandas as pd
import numpy as np
df = pd.read_excel("Dataset.xlsx") # import data from op's file
df_new = pd.DataFrame(columns=df.columns) # generate an empty copy with headers
for col in df.columns:  # go through the columns
    if df[col].dtypes == "object":  # only object columns need cleaning
        values = list()  # collect the cleaned values for this column
        for val in df[col].values:  # for every value in the column
            if pd.isna(val):  # keep NaN as NaN
                values.append(np.nan)
            elif " " in val:  # a space hints at a "number + unit" pattern
                val_splitted = val.split(" ")  # split on the space
                if "," in val_splitted[0]:  # decimal separator is sometimes a comma
                    val_splitted[0] = val_splitted[0].replace(",", ".")  # normalize to a dot
                if len(val_splitted) == 2:  # exactly "number unit"
                    try:  # try to convert to float
                        if col == "SSD Capacity":  # only here the unit switches between GB and TB
                            if val_splitted[1] == "GB":
                                values.append(float(val_splitted[0]) / 1000)  # convert GB to TB
                            else:
                                values.append(float(val_splitted[0]))  # already in TB
                        else:  # any other column: keep the number as is
                            values.append(float(val_splitted[0]))
                    except ValueError:  # value cannot be converted, so it is not numeric
                        values.append(val)  # keep the original string
                else:
                    values.append(val)  # too many spaces, e.g. "1920 x 1080": keep as is
            else:
                values.append(val)  # no space and not NaN: keep as is
        df_new[col] = values  # assign the cleaned values to the column in df_new
    else:
        df_new[col] = df[col]  # non-object columns are copied over unchanged
print(df_new.describe(include = "all"))  # print the description
The results are like this:
| Memory Speed | Screen Size | Backlit | Max Processor Speed | RAM (System Memory) | SSD Capacity |
---|---|---|---|---|---|---|
count | 888.000 | 994.000 | 994.000 | 950.000 | 987.000 | 991.000 |
mean | 3339.874 | 15.336 | 0.243 | 4.293 | 17.431 | 0.638 |
std | 626.284 | 0.923 | 0.429 | 0.616 | 12.235 | 0.423 |
min | 1066.000 | 10.000 | 0.000 | 1.050 | 4.000 | 0.000 |
25% | 3200.000 | 15.600 | 0.000 | 4.200 | 8.000 | 0.500 |
50% | 3200.000 | 15.600 | 0.000 | 4.400 | 16.000 | 0.512 |
75% | 3200.000 | 15.600 | 0.000 | 4.700 | 16.000 | 1.000 |
max | 6400.000 | 18.400 | 1.000 | 5.600 | 128.000 | 4.000 |
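As an aside, much of the loop above can be expressed more compactly with pandas' vectorized string methods. This is a sketch on toy data, not tested against the original file; it assumes the same "number + unit" pattern and lets to_numeric turn anything non-numeric into NaN:

```python
import pandas as pd

# Toy data mimicking the "number + unit" pattern from the question,
# including a missing value and a decimal comma
s = pd.Series(["3200 MHz", "1066 MHz", None, "2,4 GHz"])

# Take the token before the first space, normalize the decimal comma,
# and coerce anything non-numeric to NaN
numeric = pd.to_numeric(
    s.str.split(" ").str[0].str.replace(",", ".", regex=False),
    errors="coerce",
)
print(numeric.mean())  # mean over the numeric values, NaN ignored
```

This skips the per-value try/except, though the SSD GB/TB unit conversion would still need the column-specific handling from the loop.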
Answered By - Tino D