Issue
The question is: "Restricting to the female population, stratify the subjects into age bands no wider than ten years, and construct the distribution of marital status within each age band. Within each age band, present the distribution in terms of proportions that must sum to 1." The output I want is:
                        female
(0, 18]    married         123
           not married     123
           divorced        123
(18, 20]   married         123
           not married     123
           divorced        123
(20, 30]   married         123
           not married     123
           divorced        123
and so on
The code I have so far is:
age_distinct = da[["agegrp","RIAGENDRV2","DMDMARTLV2"]].dropna()
#da["agegrp"] = pd.cut(da.RIDAGEYR, [0, 18,20, 30, 40, 50, 60, 70, 80])
#da.groupby(["agegrp", "RIAGENDRV2"])["DMDMARTLV2"].value_counts()
(age_distinct.query('RIAGENDRV2 == "Female"'))
#da.groupby(by='RIAGENDRV2').size()
The output this gives is:
agegrp RIAGENDRV2 DMDMARTLV2
3 (50, 60] Female Living_With_Partner
4 (40, 50] Female Divorced
5 (70, 80] Female Separated
7 (30, 40] Female Married
12 (20, 30] Female Living_With_Partner
13 (60, 70] Female Married
15 (50, 60] Female Separated
16 (18, 20] Female Missing
17 (20, 30] Female Never_Married
18 (20, 30] Female Never_Married
19 (50, 60] Female Divorced
21 (70, 80] Female Widowed
22 (60, 70] Female Separated
23 (50, 60] Female Married
25 (20, 30] Female Never_Married
27 (50, 60] Female Divorced
29 (60, 70] Female Divorced
30 (60, 70] Female Married
33 (70, 80] Female Married
34 (30, 40] Female Married
35 (70, 80] Female Married
36 (20, 30] Female Married
38 (18, 20] Female Never_Married
39 (60, 70] Female Married
43 (70, 80] Female Widowed
46 (18, 20] Female Never_Married
47 (20, 30] Female Never_Married
50 (30, 40] Female Married
52 (40, 50] Female Separated
54 (0, 18] Female Missing
... ... ... ...
5678 (20, 30] Female Never_Married
5679 (20, 30] Female Married
5681 (50, 60] Female Married
5682 (70, 80] Female Divorced
5683 (20, 30] Female Never_Married
5684 (60, 70] Female Married
5685 (30, 40] Female Married
5686 (50, 60] Female Living_With_Partner
5689 (40, 50] Female Married
5692 (70, 80] Female Widowed
5696 (50, 60] Female Divorced
5697 (60, 70] Female Married
5699 (70, 80] Female Divorced
5703 (60, 70] Female Married
5704 (70, 80] Female Never_Married
5707 (20, 30] Female Living_With_Partner
5708 (20, 30] Female Married
5710 (70, 80] Female Widowed
5712 (20, 30] Female Living_With_Partner
5715 (30, 40] Female Married
5716 (70, 80] Female Widowed
5719 (60, 70] Female Married
5721 (30, 40] Female Divorced
5722 (30, 40] Female Never_Married
5723 (70, 80] Female Widowed
5724 (40, 50] Female Married
5727 (60, 70] Female Married
5730 (70, 80] Female Widowed
5732 (70, 80] Female Widowed
5734 (20, 30] Female Never_Married
Solution
- simulated some sample data that has the same characteristics
- generate an agegrp column (I just went with 20-year buckets)
- only take females, and drop the sex column going into groupby so it doesn't impact the output; count works fine as the aggregate
- rename an index level so "Female" comes out as the column header, as per your sample output
- meet the requirement by converting counts into percentages within each age group (drop the factor of 100 to get proportions that sum to 1)
import random
import pandas as pd

sex = ["Male","Female"]
s = ['Living_With_Partner','Divorced','Separated','Married','Missing','Never_Married','Widowed']
# simulate 200 subjects with a random age, sex and marital status
df = pd.DataFrame([[random.randint(15,80), sex[random.randint(0,1)], s[random.randint(0,len(s)-1)]] for r in range(200)],
                  columns=["age","sex","status"])
# 20-year age buckets
df["agegrp"] = pd.cut(df["age"], pd.interval_range(start=0, end=100, freq=20))
# females only; drop sex so it does not appear in the output, then count per (agegrp, status)
dfa = df[df["sex"]=="Female"].drop(columns="sex").groupby(["agegrp","status"]).agg({"age":"count"}).dropna()
dfa.index.names = ['agegrp', 'Female'] # rename index level from status to Female as per requirement
dfa = dfa[dfa["age"]>0] # exclude any aggregates where value is zero
(100 * dfa / dfa.groupby(level=0).transform("sum")).round(2) # change from counts to percentage of each age group
output sample
age
agegrp Female
(0, 20] Divorced 22.22
Living_With_Partner 5.56
Married 16.67
Missing 5.56
Never_Married 22.22
Separated 16.67
Widowed 11.11
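The question asks for proportions that sum to 1 within each age band, and for bands no wider than ten years. A minimal sketch of that variant follows, assuming a small made-up data set; the column names mirror the ones in the question, and the bin edges are taken from the question's commented-out pd.cut line. It counts first and then divides by the per-band total, which is the same idea as above without the factor of 100.

```python
import pandas as pd

# Hypothetical toy data; column names follow the question, values are made up.
da = pd.DataFrame({
    "RIDAGEYR": [19, 25, 27, 29, 35, 38, 45, 22],
    "RIAGENDRV2": ["Female", "Female", "Female", "Female",
                   "Female", "Female", "Male", "Female"],
    "DMDMARTLV2": ["Never_Married", "Married", "Married", "Never_Married",
                   "Married", "Divorced", "Married", "Married"],
})

# Age bands no wider than ten years (edges from the question's own pd.cut call).
da["agegrp"] = pd.cut(da["RIDAGEYR"], [0, 18, 20, 30, 40, 50, 60, 70, 80])

# Restrict to the female population.
females = da[da["RIAGENDRV2"] == "Female"]

# Count each (age band, marital status) pair that actually occurs.
counts = females.groupby(["agegrp", "DMDMARTLV2"], observed=True).size()

# Divide by the total of the age band, so each band sums to 1.
band_totals = counts.groupby(level="agegrp", observed=True).transform("sum")
props = counts / band_totals

print(props.round(2))
```

Multiplying props by 100 gives the percentages shown in the sample output above.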
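even-sized bins, none wider than 10 years
b = []
bs = 6
found = False
# increase the number of bins until no bin is wider than 10 years
while not found:
    found = True
    b = sorted([int(round(i.left)) for i in df["age"].value_counts(bins=bs).index] + [df["age"].max()])
    for i in range(1, len(b)):  # check every bin, including the last one
        if b[i]-b[i-1] > 10:
            bs += 1
            found = False
            break
# include_lowest=True keeps the minimum age inside the first bin
df["agegrp"] = pd.cut(df["age"], b, include_lowest=True)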
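The number of bins can also be computed directly instead of searched for. The sketch below is an alternative to the loop above, not the method from the answer: it takes the smallest bin count whose equal-width bins are at most max_width wide. The ages series and the names max_width, n_bins and edges are made up for illustration, and it assumes the ages span more than one distinct value.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([15, 18, 22, 29, 34, 41, 47, 55, 63, 72, 80])

max_width = 10  # widest bin allowed by the question
lo, hi = ages.min(), ages.max()

# Smallest number of equal-width bins whose width does not exceed max_width.
n_bins = max(1, math.ceil((hi - lo) / max_width))

# n_bins + 1 evenly spaced edges from the minimum to the maximum age.
edges = np.linspace(lo, hi, n_bins + 1)

# include_lowest=True keeps the minimum age inside the first bin.
agegrp = pd.cut(ages, edges, include_lowest=True)

print(agegrp.value_counts(sort=False))
```

The edges here are generally not whole numbers; round them, or pass fixed edges such as range(0, 90, 10), if readable band labels matter more than spanning exactly the observed ages.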
Answered By - Rob Raymond