Tuesday, November 28, 2023

[FIXED] Is there a more efficient way to Apply by Row, then by Column?

Labels: group-by, numpy, pandas, python

Issue

My dataset contains 5 measurements taken daily over a 700-day timespan. I want to group these values by day of the week, and then apply the trim_mean function from scipy.stats to each of the 5 measurements, using 1/stddev as the proportiontocut parameter.
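To make the calculation concrete, here is a minimal sketch of what trim_mean does with a fixed proportion and with 1/stddev. The arrays `values` and `sample` are made up for illustration and are not part of the dataset. The proportion is cut from both ends, so it should stay below 0.5, which for 1/stddev means the standard deviation has to exceed 2.

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical toy data, only to illustrate the trimming.
values = np.array([1, 2, 3, 4, 100])

# proportiontocut=0.2 drops int(0.2 * 5) = 1 value from each end,
# leaving [2, 3, 4] to be averaged.
trimmed = trim_mean(values, proportiontocut=0.2)
print(trimmed)  # 3.0

# With 1/std as the proportion, a wider spread means a smaller cut.
sample = np.arange(100)
cut = 1 / np.std(sample)
print(cut, trim_mean(sample, proportiontocut=cut))
```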

My data:

import pandas as pd
import numpy as np
from scipy.stats import trim_mean

np.random.seed(42)

data = np.random.randint(0, 100, size=(5, 700))
col_names = pd.date_range('11-16-2023', periods=700)
df = pd.DataFrame(data, columns=col_names)

# df
    2023-11-16  2023-11-17 ...  2025-10-15
0   51          92         ...  57
1   88          48         ...  32
2   89          52         ...  96
3   61          99         ...  48
4   0           7          ...  34

Now, I can do this using the following (not very elegant) process:

df_T = df.T
df_T['Day of Week'] = pd.to_datetime(df_T.index).isocalendar().day

## Room for improvement here ##
# Apply calculation to each type of measurement
gb = df_T.groupby('Day of Week')
m0 = gb[0].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m1 = gb[1].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m2 = gb[2].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m3 = gb[3].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))
m4 = gb[4].apply(lambda x: trim_mean(x, proportiontocut=1/np.std(x)))

results_df = pd.DataFrame([m0, m1, m2, m3, m4])
results_df.columns = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# results_df
    Mon         Tue         Wed         Thu         Fri         Sat         Sun
0   50.936170   51.712766   44.659574   49.117021   48.702128   47.414894   51.223404
1   49.244681   49.000000   49.138298   49.191489   45.872340   49.010638   47.074468
2   49.436170   46.404255   49.021277   46.553191   55.031915   51.265957   50.638298
3   43.744681   47.787234   48.574468   45.882979   47.255319   47.914894   49.606383
4   49.265957   46.255319   50.276596   50.872340   46.723404   45.255319   49.904255

This is repetitive and won't scale to a large number of measurements. Is there a cleaner way of applying/mapping my trim_mean function to achieve the same result?

Solution

One possible option:

from calendar import day_abbr

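results_df = (
    (ser := df.T.stack())   # long Series indexed by (date, measurement)
    .droplevel(0)           # keep only the measurement level in the index
    .groupby([ser.index.get_level_values(0).dayofweek, pd.Grouper(level=0)])
    .apply(lambda g: trim_mean(g, proportiontocut=1/np.std(g)))
    .unstack(0)             # weekdays (0 = Monday) become the columns
    .set_axis(list(day_abbr), axis=1)
)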

Output:

print(results_df)

         Mon        Tue        Wed        Thu        Fri        Sat        Sun
0  50.936170  51.712766  44.659574  49.117021  48.702128  47.414894  51.223404
1  49.244681  49.000000  49.138298  49.191489  45.872340  49.010638  47.074468
2  49.436170  46.404255  49.021277  46.553191  55.031915  51.265957  50.638298
3  43.744681  47.787234  48.574468  45.882979  47.255319  47.914894  49.606383
4  49.265957  46.255319  50.276596  50.872340  46.723404  45.255319  49.904255

[5 rows x 7 columns]
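A shorter variant, offered as a sketch rather than as part of the original answer: group the transposed frame by weekday and let agg apply the function to every measurement column at once. The names `trimmed`, `day_labels` and `alt_df` are introduced here for illustration. The labels are hard-coded in English because calendar.day_abbr depends on the locale.

```python
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Same data as in the question.
np.random.seed(42)
data = np.random.randint(0, 100, size=(5, 700))
df = pd.DataFrame(data, columns=pd.date_range('11-16-2023', periods=700))


def trimmed(x):
    # x holds one measurement's values for one weekday (a pandas Series).
    return trim_mean(x, proportiontocut=1 / np.std(x))


# Assumption: fixed English labels instead of locale-dependent day_abbr.
day_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

df_T = df.T
alt_df = (
    df_T.groupby(df_T.index.dayofweek)  # 0 = Monday ... 6 = Sunday
    .agg(trimmed)                       # one value per weekday and measurement
    .T                                  # measurements back as rows
    .set_axis(day_labels, axis=1)
)
print(alt_df)
```

Because the same function is applied to the same weekday groups, this should reproduce the table above without listing each measurement separately.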

Answered By - Timeless

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
