Issue
I have an example dataset attached. My actual dataset is much larger. The "yr" columns are split between "cd" and "qty" and run from yr10 down to yr1. Not every row contains a full set of data; some contain zeros. Example data:
Sales | yr10cd | yr10qty | yr9cd | yr9qty | yr8cd | yr8qty |
---|---|---|---|---|---|---|
42 | A | 45 | A | 47 | A | 49 |
56 | T | 58 | A | 52 | 0 | 0 |
78 | A | 75 | 0 | 0 | 0 | 0 |
I want to be able to take an average of the qty columns (yr10qty, yr9qty, yr8qty, etc.), but only if the associated indicator (yr10cd, yr9cd, yr8cd, etc.) in the adjacent column is an "A" value. If the associated indicator is 0 or any other value such as "T", I do not want to include that quantity in my average calculation.
I have tried writing a function that uses if statements to append values to a list and then averages them if non-zero, applying that function to each row with df.apply. Unfortunately, I keep getting 0 for all my averages, which is not expected.
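My attempt looked roughly along these lines (a simplified sketch for illustration; row_average and the hard-coded year list are placeholders, not my exact code):

def row_average(row):
    # Collect quantities whose paired indicator is 'A'.
    vals = []
    for yr in ['yr10', 'yr9', 'yr8']:
        if row[yr + 'cd'] == 'A':
            vals.append(row[yr + 'qty'])
    # Average the collected values if any were kept.
    return sum(vals) / len(vals) if vals else 0

df['average'] = df.apply(row_average, axis=1)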
The expected output is a new column added to my df containing the average for each row:
Sales | yr10cd | yr10qty | yr9cd | yr9qty | yr8cd | yr8qty | average |
---|---|---|---|---|---|---|---|
42 | A | 45 | A | 47 | A | 49 | 47 |
56 | T | 58 | A | 52 | 0 | 0 | 52 |
78 | A | 75 | 0 | 0 | 0 | 0 | 75 |
I have tried searching Stack Overflow for some time now and none of the solutions I have come across have worked for my specific scenario.
Solution
In situations where the structure of the columns may not be known in advance, it can be useful to expand them out into a MultiIndex. This lets us take advantage of pandas indexing to ensure data integrity during the computation.
Assuming a structure of yr[num]cd paired with yr[num]qty, we can isolate those columns and create a MultiIndex such that the [num] values are in their own level:
# Select only the yr<num>cd / yr<num>qty columns.
v = df.filter(regex=r'^yr\d+(cd|qty)$')
# Rename 'yr10cd' -> 'cd_10', etc., then split on '_' into a two-level MultiIndex.
v.columns = (
    v.columns
    .str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
    .str.split('_', expand=True)
)
Here I've isolated the interesting columns into the variable v and restructured the columns so that the cd and qty labels are in level 0 and the numbers are in level 1, through the use of replace and split. v looks like:
cd qty cd qty cd qty
10 10 9 9 8 8
0 A 45 A 47 A 49
1 T 58 A 52 0 0
2 A 75 0 0 0 0
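To see the intermediate renaming step on a couple of labels (a quick illustration, not part of the original answer):

cols = pd.Index(['yr10cd', 'yr9qty'])
cols.str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
# Index(['cd_10', 'qty_9'], dtype='object')
cols.str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True).str.split('_', expand=True)
# MultiIndex([('cd', '10'), ('qty', '9')])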
Note there are many ways to restructure the columns into a MultiIndex. Here's another example for reference:
v.columns = (
    v.columns
    .str.split(r'(\d+)', regex=True, expand=True)  # 'yr10cd' -> ('yr', '10', 'cd')
    .droplevel(0)                                  # drop the 'yr' level -> ('10', 'cd')
    .swaplevel(0, 1)                               # -> ('cd', '10')
)
Depending on the column name format, different approaches may be better suited to the restructuring.
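For instance, a third option (a sketch, not from the original answer) builds the (measure, year) tuples explicitly with the re module, assuming v was just created by df.filter and still has its original flat column labels:

import re

# Build (measure, year) pairs directly, e.g. 'yr10cd' -> ('cd', '10').
pairs = [re.fullmatch(r'yr(\d+)(cd|qty)', c).groups()[::-1] for c in v.columns]
v.columns = pd.MultiIndex.from_tuples(pairs)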
The primary benefit of a MultiIndex with this level order is that we can very easily access all of the cd columns and all of the qty columns with v['cd'] and v['qty'].
v['qty'] for reference:
10 9 8
0 45 47 49
1 58 52 0
2 75 0 0
The great thing about this is that regardless of column order, we can reliably align computations between 10, 9 and 8.
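For example, the boolean mask v['cd'].eq('A') carries the same year labels as v['qty'], so the two frames align by label rather than by position:

v['cd'].eq('A')

      10      9      8
0   True   True   True
1  False   True  False
2   True  False  False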
This allows us to keep only the quantities whose indicator equals 'A' with v['qty'].where(v['cd'].eq('A')):
10 9 8
0 45.0 47.0 49.0
1 NaN 52.0 NaN
2 75.0 NaN NaN
Then we take the mean of each row with v['qty'].where(v['cd'].eq('A')).mean(axis='columns'):
0 47.0
1 52.0
2 75.0
dtype: float64
This result has the same index as df, so we can very simply assign the values back:
df['Average'] = v['qty'].where(v['cd'].eq('A')).mean(axis='columns')
df with the new column:
   Sales yr10cd  yr10qty yr9cd  yr9qty yr8cd  yr8qty  Average
0     42      A       45     A      47     A      49     47.0
1     56      T       58     A      52     0       0     52.0
2     78      A       75     0       0     0       0     75.0
Again, something really great about this approach is that our initial data column order does not matter. Imagine a situation where our cd columns are grouped with numbers in ascending order and our qty columns are grouped with numbers in descending order:
Sales | yr8cd | yr9cd | yr10cd | yr10qty | yr9qty | yr8qty |
---|---|---|---|---|---|---|
42 | A | A | A | 45 | 47 | 49 |
56 | 0 | A | T | 58 | 52 | 0 |
78 | 0 | 0 | A | 75 | 0 | 0 |
Or, perhaps more realistically, imagine a scenario where someone accidentally dragged one of the columns out of order. The approach outlined here will still produce the correct averages because the calculation and value filtering align on the year labels, not on the columns' relative positions within the DataFrame. The result is unchanged:
0 47.0
1 52.0
2 75.0
dtype: float64
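To verify this concretely, here is a sketch (scrambled and v2 are illustrative names) that reorders the columns as in the table above and repeats the exact same steps:

scrambled = df[['Sales', 'yr8cd', 'yr9cd', 'yr10cd', 'yr10qty', 'yr9qty', 'yr8qty']]
v2 = scrambled.filter(regex=r'^yr\d+(cd|qty)$')
v2.columns = (
    v2.columns
    .str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
    .str.split('_', expand=True)
)
v2['qty'].where(v2['cd'].eq('A')).mean(axis='columns')  # same averages as before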
So, while this is likely not the fastest or most memory-efficient solution, it is fairly performant while not sacrificing the alignment integrity that label-based indexing provides.
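For comparison, a purely positional NumPy sketch (Average_np is an illustrative name) would likely be faster, but it silently depends on the cd and qty columns appearing in the same year order, which is exactly the fragility the MultiIndex approach avoids:

import numpy as np

# Positional pairing: assumes the i-th cd column matches the i-th qty column.
cd = df.filter(regex=r'^yr\d+cd$').to_numpy()
qty = df.filter(regex=r'^yr\d+qty$').to_numpy(dtype=float)
qty[cd != 'A'] = np.nan  # mask quantities whose indicator is not 'A'
df['Average_np'] = np.nanmean(qty, axis=1)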
Complete working example with version number
import pandas as pd  # v2.1.2

df = pd.DataFrame({
    'Sales': [42, 56, 78],
    'yr10cd': ['A', 'T', 'A'],
    'yr10qty': [45, 58, 75],
    'yr9cd': ['A', 'A', '0'],
    'yr9qty': [47, 52, 0],
    'yr8cd': ['A', '0', '0'],
    'yr8qty': [49, 0, 0]
})

# Isolate the paired columns and restructure into a (measure, year) MultiIndex.
v = df.filter(regex=r'^yr\d+(cd|qty)$')
v.columns = (
    v.columns
    .str.replace(r'yr(\d+)(cd|qty)', r'\2_\1', regex=True)
    .str.split('_', expand=True)
)

# Average the qty values whose paired cd indicator is 'A'.
df['Average'] = v['qty'].where(v['cd'].eq('A')).mean(axis='columns')
print(df)
Answered By - Henry Ecker