Issue
I have an example dataset attached. My actual dataset is much larger. The "yr" columns are split between "cd" and "qty" and run from yr10 down to yr1. Not every row contains a full set of data; some contain zeros. Example data:
Sales | yr10cd | yr10qty | yr9cd | yr9qty | yr8cd | yr8qty |
---|---|---|---|---|---|---|
42 | A | 45 | A | 47 | A | 49 |
56 | T | 58 | A | 52 | 0 | 0 |
78 | A | 75 | 0 | 0 | 0 | 0 |
I want to be able to take an average of the qty columns (yr10qty, yr9qty, yr8qty, etc.), but only if the associated indicator (yr10cd, yr9cd, yr8cd, etc.) in the adjacent column is an "A" value. If the associated indicator is 0 or any other value such as "T", I do not want to include that quantity in my average calculation.
I have tried writing a function that uses if statements to append values to a list and then averages them if non-zero, applying that function to each row with df.apply. Unfortunately, I keep getting 0 for all my averages, which is not expected.
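My attempt looked roughly along these lines (a simplified sketch for illustration; row_average and the hard-coded year list are placeholders, not my exact code):

def row_average(row):
    # Collect quantities whose paired indicator is 'A'.
    vals = []
    for yr in ['yr10', 'yr9', 'yr8']:
        if row[yr + 'cd'] == 'A':
            vals.append(row[yr + 'qty'])
    # Average the collected values if any were kept.
    return sum(vals) / len(vals) if vals else 0

df['average'] = df.apply(row_average, axis=1)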
The expected output is a new column added to my df containing the average for each row:
Sales | yr10cd | yr10qty | yr9cd | yr9qty | yr8cd | yr8qty | average |
---|---|---|---|---|---|---|---|
42 | A | 45 | A | 47 | A | 49 | 47 |
56 | T | 58 | A | 52 | 0 | 0 | 52 |
78 | A | 75 | 0 | 0 | 0 | 0 | 75 |
I have tried searching Stack Overflow for some time now and none of the solutions I have come across have worked for my specific scenario.
Solution
In situations where the structure of the columns may not be known in advance, it can be useful to expand them out into a MultiIndex. This lets us take advantage of pandas indexing to ensure data integrity during the computation.
Assuming a structure of yr[num]cd paired with yr[num]qty, we can isolate those columns and create a MultiIndex such that the [num] values are in their own level:
# Select only the yr<num>cd / yr<num>qty columns.
v = df.filter(regex=r'^yr\d+(cd|qty)$')
# Rename 'yr10cd' -> 'cd_10', etc., then split on '_' into a two-level MultiIndex.
v.columns = (
    v.columns
    .str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
    .str.split('_', expand=True)
)
Here I've isolated the interesting columns into the variable v and restructured the columns so that the cd and qty labels are in level 0 and the numbers are in level 1, through the use of replace and split. v looks like:
cd qty cd qty cd qty
10 10 9 9 8 8
0 A 45 A 47 A 49
1 T 58 A 52 0 0
2 A 75 0 0 0 0
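To see the intermediate renaming step on a couple of labels (a quick illustration, not part of the original answer):

cols = pd.Index(['yr10cd', 'yr9qty'])
cols.str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
# Index(['cd_10', 'qty_9'], dtype='object')
cols.str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True).str.split('_', expand=True)
# MultiIndex([('cd', '10'), ('qty', '9')])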
Note there are many ways to restructure the columns into a MultiIndex. Here's another example for reference:
v.columns = (
    v.columns
    .str.split(r'(\d+)', regex=True, expand=True)  # 'yr10cd' -> ('yr', '10', 'cd')
    .droplevel(0)                                  # drop the 'yr' level -> ('10', 'cd')
    .swaplevel(0, 1)                               # -> ('cd', '10')
)
Depending on the column name format, different approaches may be better suited to the restructuring.
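For instance, a third option (a sketch, not from the original answer) builds the (measure, year) tuples explicitly with the re module, assuming v was just created by df.filter and still has its original flat column labels:

import re

# Build (measure, year) pairs directly, e.g. 'yr10cd' -> ('cd', '10').
pairs = [re.fullmatch(r'yr(\d+)(cd|qty)', c).groups()[::-1] for c in v.columns]
v.columns = pd.MultiIndex.from_tuples(pairs)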
The primary benefit of a MultiIndex with this level order is that we can very easily access all of the cd columns and all of the qty columns with v['cd'] and v['qty'].
v['qty'] for reference:
10 9 8
0 45 47 49
1 58 52 0
2 75 0 0
The great thing about this is that regardless of column order, we can reliably align computations between 10, 9 and 8.
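For example, the boolean mask v['cd'].eq('A') carries the same year labels as v['qty'], so the two frames align by label rather than by position:

v['cd'].eq('A')

      10      9      8
0   True   True   True
1  False   True  False
2   True  False  False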
This allows us to keep only the quantities whose indicator equals 'A' with v['qty'].where(v['cd'].eq('A')):
10 9 8
0 45.0 47.0 49.0
1 NaN 52.0 NaN
2 75.0 NaN NaN
Then we take the mean of each row with v['qty'].where(v['cd'].eq('A')).mean(axis='columns'):
0 47.0
1 52.0
2 75.0
dtype: float64
This result has the same index as df, so we can very simply assign the values back:
df['Average'] = v['qty'].where(v['cd'].eq('A')).mean(axis='columns')
df with the new column:
   Sales yr10cd  yr10qty yr9cd  yr9qty yr8cd  yr8qty  Average
0     42      A       45     A      47     A      49     47.0
1     56      T       58     A      52     0       0     52.0
2     78      A       75     0       0     0       0     75.0
Again, something really great about this approach is that our initial data column order does not matter. Imagine a situation where our cd columns are grouped with numbers in ascending order and our qty columns are grouped with numbers in descending order:
Sales | yr8cd | yr9cd | yr10cd | yr10qty | yr9qty | yr8qty |
---|---|---|---|---|---|---|
42 | A | A | A | 45 | 47 | 49 |
56 | 0 | A | T | 58 | 52 | 0 |
78 | 0 | 0 | A | 75 | 0 | 0 |
Or, perhaps more realistically, imagine a scenario where someone accidentally dragged one of the columns out of order. The approach outlined here will still produce the correct averages because the calculation and value filtering align on the year labels, not on the columns' relative positions within the DataFrame. The result is unchanged:
0 47.0
1 52.0
2 75.0
dtype: float64
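To verify this concretely, here is a sketch (scrambled and v2 are illustrative names) that reorders the columns as in the table above and repeats the exact same steps:

scrambled = df[['Sales', 'yr8cd', 'yr9cd', 'yr10cd', 'yr10qty', 'yr9qty', 'yr8qty']]
v2 = scrambled.filter(regex=r'^yr\d+(cd|qty)$')
v2.columns = (
    v2.columns
    .str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
    .str.split('_', expand=True)
)
v2['qty'].where(v2['cd'].eq('A')).mean(axis='columns')  # same averages as before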
So, while this is likely not the fastest or most memory-efficient solution, it is fairly performant while not sacrificing the alignment integrity that label-based indexing provides.
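For comparison, a purely positional NumPy sketch (Average_np is an illustrative name) would likely be faster, but it silently depends on the cd and qty columns appearing in the same year order, which is exactly the fragility the MultiIndex approach avoids:

import numpy as np

# Positional pairing: assumes the i-th cd column matches the i-th qty column.
cd = df.filter(regex=r'^yr\d+cd$').to_numpy()
qty = df.filter(regex=r'^yr\d+qty$').to_numpy(dtype=float)
qty[cd != 'A'] = np.nan  # mask quantities whose indicator is not 'A'
df['Average_np'] = np.nanmean(qty, axis=1)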
Complete working example with version number
import pandas as pd  # v2.1.2

df = pd.DataFrame({
    'Sales': [42, 56, 78],
    'yr10cd': ['A', 'T', 'A'],
    'yr10qty': [45, 58, 75],
    'yr9cd': ['A', 'A', '0'],
    'yr9qty': [47, 52, 0],
    'yr8cd': ['A', '0', '0'],
    'yr8qty': [49, 0, 0]
})

# Isolate the paired columns and restructure into a (measure, year) MultiIndex.
v = df.filter(regex=r'^yr\d+(cd|qty)$')
v.columns = (
    v.columns
    .str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
    .str.split('_', expand=True)
)

# Average the qty values whose paired cd indicator is 'A'.
df['Average'] = v['qty'].where(v['cd'].eq('A')).mean(axis='columns')
print(df)
Answered By - Henry Ecker