Issue
I have a data set with 5 years of data. I would like to create a dataframe that determines, for each County, the proportion of rows that meet a condition (i.e., Column 1 value > 10), and how important each county is in the data set (i.e., its share of rows). I would like to determine that separately for each year, so the results can be averaged across years. The code below accomplishes this for one year of data:
df_2018_1 = df[(df.Year=='2018')]
df_2018_2 = df[(df.column_1 > 10) & (df.Year=='2018')]
df_2018_cur = pd.DataFrame()
df_2018_cur['Column 1 > 10'] = df_2018_2.County.value_counts()/df_2018_1.County.value_counts()*100
# Percent of submissions by county out of all submissions (county importance).
df_2018_cur['PCT of State'] = df_2018_1.County.value_counts()/len(df_2018_1)*100
# Repeat for remaining years, then average across dataframes.
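For completeness, the repeat-and-average step mentioned in the comment above can be sketched like this. This is a minimal sketch on invented toy data; the helper name county_stats and the data values are hypothetical, not from the original post:

```python
import pandas as pd

# Hypothetical toy data standing in for the real five-year data set.
df = pd.DataFrame({
    'Year':     ['2018', '2018', '2018', '2019', '2019', '2019'],
    'County':   ['A', 'A', 'B', 'A', 'B', 'B'],
    'column_1': [15, 5, 12, 30, 9, 20],
})

def county_stats(df, year):
    # One year of the per-year approach from the question.
    df_year = df[df.Year == year]
    df_cond = df[(df.column_1 > 10) & (df.Year == year)]
    out = pd.DataFrame()
    out['Column 1 > 10'] = df_cond.County.value_counts() / df_year.County.value_counts() * 100
    out['PCT of State'] = df_year.County.value_counts() / len(df_year) * 100
    return out

# "Repeat for remaining years, then average across dataframes":
# stack the yearly tables and average by county.
years = ['2018', '2019']
avg = pd.concat(county_stats(df, y) for y in years).groupby(level=0).mean()
print(avg)
```

Note that the division of the two `value_counts()` results aligns on the county labels, so a county with no rows meeting the condition in a given year comes out as NaN rather than 0.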
I would love an alternative strategy with more concise code if possible. I don't believe that pivot_table() supports a value_counts() function. I'm wondering if groupby might be useful here, but if so, it has not occurred to me just how.
Thank you.
Solution
# Flag rows that meet the condition, then aggregate per (Year, County).
df['Column1_gt_10'] = df['column_1'] > 10
grouped = df.groupby(['Year', 'County'])
aggregated = grouped.agg(
    Column1_gt_10_pct=('Column1_gt_10', lambda x: x.mean() * 100),
    County_count=('Column1_gt_10', 'size'),
)

# Share of each county's rows within its year.
total_counts_by_year = df.groupby('Year')['County'].count()
aggregated['PCT_of_State'] = (
    aggregated['County_count']
    / aggregated.index.get_level_values('Year').map(total_counts_by_year)
    * 100
)

# Average the yearly figures for each county.
final_result = aggregated.groupby(level='County').mean()
As per my understanding of the question:
- Grouped by Year and County.
- Applied the condition column_1 > 10 within each group.
- Averaged across years, but first calculated the proportions at the county level.
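To make the pipeline above concrete, here is a minimal end-to-end run on hypothetical toy data (two years, two counties; all values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: two years, two counties.
df = pd.DataFrame({
    'Year':     ['2018', '2018', '2018', '2018', '2019', '2019', '2019', '2019'],
    'County':   ['A',    'A',    'A',    'B',    'A',    'B',    'B',    'B'],
    'column_1': [15,     5,      20,     12,     8,      30,     9,      1],
})

# Flag rows meeting the condition, then aggregate per (Year, County).
df['Column1_gt_10'] = df['column_1'] > 10
aggregated = df.groupby(['Year', 'County']).agg(
    Column1_gt_10_pct=('Column1_gt_10', lambda x: x.mean() * 100),
    County_count=('Column1_gt_10', 'size'),
)

# Share of each county's rows within its year.
total_counts_by_year = df.groupby('Year')['County'].count()
aggregated['PCT_of_State'] = (
    aggregated['County_count']
    / aggregated.index.get_level_values('Year').map(total_counts_by_year)
    * 100
)

# Average the yearly figures for each county.
final_result = aggregated.groupby(level='County').mean()
print(final_result)
```

For county A this gives a condition rate of (66.67 + 0) / 2 ≈ 33.33%, since A meets the condition in 2 of 3 rows in 2018 and 0 of 1 rows in 2019, and a PCT_of_State of (75 + 25) / 2 = 50%.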
Answered By - Arunbh Yashaswi