Issue
I need to create a versatile function that I can pipe to a pandas
DataFrame that can run ANOVA on an unspecified number of groups and subgroups. I have a number of DataFrames that have two categorical columns (here, cat1
and cat2
) and a number of continuous columns (here, just contin_1
and contin2
).
cat1 cat2 contin_Z contin_Y ...
0 A 1 08 33
1 B 1 00 86
2 C 2 85 65
3 A 1 82 08
4 A 2 90 85
5 A 3 93 93
6 A 2 13 65
7 A 1 33 90
8 B 2 00 10
9 C 2 58 57
10 C 1 68 68
11 B 1 43 40
12 A 1 35 NaN
13 A 3 75 40
14 A 3 68 53
15 A 2 93 93
16 B 3 18 65
17 C 3 33 28
18 A 1 50 94
19 B 1 25 90
Taking the sample data above, I would want to group it by cat1
and run ANOVA on all the cat2
groups for each cat1
group and also output the mean for each cat1
/cat2
combination. I would want to be able to do this regardless of the number of continuous columns -- kind of like how pandas.DataFrame.groupby.mean
can operate on any number of columns.
The output should look like this, in some kind of shape: (This is a sample output just to show the shape of the output I'm seeking; the mean and p-values are nonsense values.)
cat1 cat2 Z_mean Z_anova_pval Y_mean Y_anova_pval ...
0 A 1 54 .005 43 .076
1 A 2 73 .005 34 .076
2 A 3 34 .005 42 .076
3 B 1 76 .567 32 .002
4 B 2 98 .567 78 .002
5 B 3 73 .567 101 .002
6 C 1 84 .043 15 .041
7 C 2 23 .043 43 .041
8 C 3 82 .043 87 .041
I'm imagining the function would be called something like this:
data.groupby('cat1').pipe(f)
The closest I've got to the function I need is this line passing all the cat2
subgroups to f_oneway
with a generator expression, but I can't figure out how to do this within a groupby
on cat1
, and especially not while also grabbing the mean, which I'm supposing might need to be done as a separate with groupby.agg
, but I'm unsure how to make that scalable to however many continuous columns happen to be in a DataFrame.
from scipy import stats
def run_anova(data):
return stats.f_oneway(*(data[data.cat2==cat].dropna() for cat in data.cat2.unique()))
Edit: Something along these lines seems to produce an accurate result for the p-value side of things.
from scipy.stats import f_oneway as anova
continuous_cols = [ ... list of all desired continuous cols ... ]
cats = data.cat1.unique()
pvals = (
data
.groupby(['cat2'])[continuous_cols]
.agg(
lambda x:
anova(*[x[data.cat1==cat].dropna() for cat in cats])[0]
)
)
> contin_Z contin_Y
> cat2
> 1 .045 .087
> 2 .654 .945
> 3 .943 .003
However, I can't figure out how to integrate this with needing the mean of each subgroup (other than to do two GroupBys and then merge after resetting one level of the index on the means table):
pvals = pvals.add_suffix('_pval')
means = data.groupby(['cat2', 'cat1'])[continuous_cols].mean()
means = means.add_suffix('_mean')
table = (
pvals
.merge(means.reset_index(level=1), left_index=True, right_index=True)
)
Solution
You already have a good idea of the steps involved. The gist here is that your means and pvalues are calculated on two different aggregation levels:
- mean is calculated at the
(cat2, cat1)
level - pvalue is calculated at the
cat2
level
The data.groupby(...).pipe(f)
idea isn't really workable because it restricts you to a single level. The solution below will compute the means and the pvalues separately then join them together.
It does not require you to list the continuous variables in advance, nor change the code to accomodate new variables. It only cares about the categorical variables, which are fixed in your use case.
# Move the category columns into the index so only we know that only continuos
# variables remain in the dataframe
tmp = data.set_index(["cat2", "cat1"])
# The means are easy to calculate. To distinguish them from the later pvalues,
# add a level to the column names
means = tmp.groupby(["cat2", "cat1"]).mean()
means.columns = pd.MultiIndex.from_product([["mean"], means.columns])
# Calculate the pvalues
from scipy.stats import f_oneway
cat1 = data["cat1"].unique()
def anova(group: pd.DataFrame):
# We know that every column in `group` is a continuous variable. The
# categorical variables have been moved to the index
result = {
("pvalue", col): f_oneway(
*[group.loc[(slice(None), cat), col].dropna() for cat in cat1]
).pvalue
for col in group.columns
}
return pd.Series(result)
pvalues = tmp.groupby("cat2").apply(anova)
# Merge the results
result = pd.merge(means.reset_index("cat1"), pvalues, on="cat2").set_index("cat1", append=True)
Result:
mean pvalue
contin_Z contin_Y contin_Z contin_Y
cat2 cat1
1 A 41.600000 56.25 0.341883 0.856163
B 22.666667 72.00 0.341883 0.856163
C 68.000000 68.00 0.341883 0.856163
2 A 65.333333 81.00 0.392376 0.034400
B 0.000000 10.00 0.392376 0.034400
C 71.500000 61.00 0.392376 0.034400
3 A 78.666667 62.00 0.086695 0.610693
B 18.000000 65.00 0.086695 0.610693
C 33.000000 28.00 0.086695 0.610693
Answered By - Code Different
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.