Issue
My dataset contains 5 measurements taken daily over a 700-day timespan. I want to group these values by day of the week and then apply the trim_mean function from scipy.stats to each of the 5 measurements, using 1/stddev as the proportiontocut parameter.
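For reference, here is a minimal sketch of what trim_mean does with proportiontocut. The array values are made up for illustration and are not part of the dataset; the variable names are likewise hypothetical.

```python
import numpy as np
from scipy.stats import trim_mean

# Illustrative values only (not from the dataset in this question).
values = np.array([1, 2, 3, 4, 100])

# proportiontocut=0.2 drops int(0.2 * 5) = 1 value from each end of the
# sorted data, so the mean is taken over [2, 3, 4].
fixed_cut = trim_mean(values, proportiontocut=0.2)

# Tying the cut to the spread, as in this question: 1 / stddev.
# Here the standard deviation is large, so the fraction is tiny and
# int(fraction * 5) rounds down to 0, i.e. nothing is trimmed.
spread_cut = 1 / np.std(values)
spread_trimmed = trim_mean(values, proportiontocut=spread_cut)

print(fixed_cut, spread_trimmed)
```

With 100 observations per weekday and a standard deviation of roughly 29, 1/stddev comes to about 0.03, so around 3 values would be cut from each end of every group.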
My data:
import pandas as pd
import numpy as np
from scipy.stats import trim_mean
np.random.seed(42)
data = np.random.randint(0, 100, size=(5, 700))
col_names = pd.date_range('11-16-2023', periods=700)
df = pd.DataFrame(data, columns=col_names)
# df
   2023-11-16  2023-11-17  ...  2025-10-15
0          51          92  ...          57
1          88          48  ...          32
2          89          52  ...          96
3          61          99  ...          48
4           0           7  ...          34
Now, I can do this using the following (not very elegant) process:
df_T = df.T
df_T['Day of Week'] = pd.to_datetime(df_T.index).isocalendar().day
## Room for improvement here ##
# Apply calculation to each type of measurement
gb = df_T.groupby('Day of Week')
m0 = gb[0].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m1 = gb[1].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m2 = gb[2].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m3 = gb[3].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m4 = gb[4].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
results_df = pd.DataFrame([m0, m1, m2, m3, m4])
results_df.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# results_df
         Mon        Tue        Wed        Thu        Fri        Sat        Sun
0  50.936170  51.712766  44.659574  49.117021  48.702128  47.414894  51.223404
1  49.244681  49.000000  49.138298  49.191489  45.872340  49.010638  47.074468
2  49.436170  46.404255  49.021277  46.553191  55.031915  51.265957  50.638298
3  43.744681  47.787234  48.574468  45.882979  47.255319  47.914894  49.606383
4  49.265957  46.255319  50.276596  50.872340  46.723404  45.255319  49.904255
This is very inefficient and doesn't make much sense if I have a lot of measurements. Is there a clever way of applying/mapping my trim_mean function to achieve the same aim?
Solution
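A possible option is to stack the transposed frame into one long Series, group it by weekday and by measurement at once, and then unstack the weekday level back into columns: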
from calendar import day_abbr
results_df = (
    (ser := df.T.stack()).droplevel(0).groupby(
        [ser.index.get_level_values(0).dayofweek, pd.Grouper(level=0)])
    .apply(lambda g: trim_mean(g, proportiontocut=1/np.std(g)))
    .unstack(0).set_axis(list(day_abbr), axis=1)
)
Output:
print(results_df)
         Mon        Tue        Wed        Thu        Fri        Sat        Sun
0  50.936170  51.712766  44.659574  49.117021  48.702128  47.414894  51.223404
1  49.244681  49.000000  49.138298  49.191489  45.872340  49.010638  47.074468
2  49.436170  46.404255  49.021277  46.553191  55.031915  51.265957  50.638298
3  43.744681  47.787234  48.574468  45.882979  47.255319  47.914894  49.606383
4  49.265957  46.255319  50.276596  50.872340  46.723404  45.255319  49.904255
[5 rows x 7 columns]
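An alternative worth trying, not part of the answer above: because every measurement gets the same treatment, grouping the transposed frame once by weekday and aggregating all columns should give the same table without stacking. This is only a sketch under that assumption; the helper name trimmed, the result name alt_df, and the explicit day_names list are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Same synthetic data as in the question.
np.random.seed(42)
data = np.random.randint(0, 100, size=(5, 700))
df = pd.DataFrame(data, columns=pd.date_range('11-16-2023', periods=700))


def trimmed(x):
    """Trimmed mean of one measurement within one weekday group."""
    values = x.to_numpy()
    return trim_mean(values, proportiontocut=1 / np.std(values))


# Spelled out explicitly so the labels do not depend on the locale.
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

alt_df = (
    df.T                              # rows: dates, columns: measurements
    .groupby(df.columns.dayofweek)    # 0 = Monday ... 6 = Sunday
    .agg(trimmed)                     # called once per column per weekday
    .T                                # rows: measurements, columns: weekdays
    .set_axis(day_names, axis=1)
)
print(alt_df)
```

The grouping key comes straight from the DatetimeIndex, so no helper 'Day of Week' column is needed, and adding more measurements only adds rows to df rather than lines of code.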
Answered By - Timeless