Saturday, October 23, 2021

[FIXED] How to segregate data in jupyter python 3

jupyter-notebook, pandas, python

Issue

The question is: "Restricting to the female population, stratify the subjects into age bands no wider than ten years, and construct the distribution of marital status within each age band. Within each age band, present the distribution in terms of proportions that must sum to 1." The output I want is:

                  female
(0,18]        married       123
              not married   123
              divorced      123
(18,20]       married       123
              not married   123
              divorced      123
(20, 30]      married       123
              not married   123
              divorced      123
and so on

The code I have so far is:

age_distinct = da[["agegrp","RIAGENDRV2","DMDMARTLV2"]].dropna()
#da["agegrp"] = pd.cut(da.RIDAGEYR, [0, 18,20, 30, 40, 50, 60, 70, 80])
#da.groupby(["agegrp", "RIAGENDRV2"])["DMDMARTLV2"].value_counts()
(age_distinct.query('RIAGENDRV2 == "Female"'))
#da.groupby(by='RIAGENDRV2').size()

The output this gives is:

    agegrp  RIAGENDRV2  DMDMARTLV2
3   (50, 60]    Female  Living_With_Partner
4   (40, 50]    Female  Divorced
5   (70, 80]    Female  Separated
7   (30, 40]    Female  Married
12  (20, 30]    Female  Living_With_Partner
13  (60, 70]    Female  Married
15  (50, 60]    Female  Separated
16  (18, 20]    Female  Missing
17  (20, 30]    Female  Never_Married
18  (20, 30]    Female  Never_Married
19  (50, 60]    Female  Divorced
21  (70, 80]    Female  Widowed
22  (60, 70]    Female  Separated
23  (50, 60]    Female  Married
25  (20, 30]    Female  Never_Married
27  (50, 60]    Female  Divorced
29  (60, 70]    Female  Divorced
30  (60, 70]    Female  Married
33  (70, 80]    Female  Married
34  (30, 40]    Female  Married
35  (70, 80]    Female  Married
36  (20, 30]    Female  Married
38  (18, 20]    Female  Never_Married
39  (60, 70]    Female  Married
43  (70, 80]    Female  Widowed
46  (18, 20]    Female  Never_Married
47  (20, 30]    Female  Never_Married
50  (30, 40]    Female  Married
52  (40, 50]    Female  Separated
54  (0, 18]    Female  Missing
... ... ... ...
5678    (20, 30]    Female  Never_Married
5679    (20, 30]    Female  Married
5681    (50, 60]    Female  Married
5682    (70, 80]    Female  Divorced
5683    (20, 30]    Female  Never_Married
5684    (60, 70]    Female  Married
5685    (30, 40]    Female  Married
5686    (50, 60]    Female  Living_With_Partner
5689    (40, 50]    Female  Married
5692    (70, 80]    Female  Widowed
5696    (50, 60]    Female  Divorced
5697    (60, 70]    Female  Married
5699    (70, 80]    Female  Divorced
5703    (60, 70]    Female  Married
5704    (70, 80]    Female  Never_Married
5707    (20, 30]    Female  Living_With_Partner
5708    (20, 30]    Female  Married
5710    (70, 80]    Female  Widowed
5712    (20, 30]    Female  Living_With_Partner
5715    (30, 40]    Female  Married
5716    (70, 80]    Female  Widowed
5719    (60, 70]    Female  Married
5721    (30, 40]    Female  Divorced
5722    (30, 40]    Female  Never_Married
5723    (70, 80]    Female  Widowed
5724    (40, 50]    Female  Married
5727    (60, 70]    Female  Married
5730    (70, 80]    Female  Widowed
5732    (70, 80]    Female  Widowed
5734    (20, 30]    Female  Never_Married

Solution

Simulate some sample data with the same characteristics as yours.
Generate an agegrp column. I used 20-year buckets for simplicity.
Keep only females, and drop the sex column before the groupby so it doesn't appear in the output.
count works fine as the aggregate.
Rename the index level so that "Female" comes out as the header, as in your sample output.
Meet the requirement by converting counts into shares within each age band. The code below reports percentages that sum to 100; drop the factor of 100 to get proportions that sum to 1.

import random
import pandas as pd

sex = ["Male","Female"]
s = ['Living_With_Partner','Divorced','Separated','Married','Missing','Never_Married','Widowed']
# simulate 200 subjects with a random age, sex and marital status
df = pd.DataFrame([[random.randint(15,80), sex[random.randint(0,1)], s[random.randint(0,len(s)-1)]] for r in range(200)],
            columns=["age","sex","status"])

df["agegrp"] = pd.cut(df["age"], pd.interval_range(start=0, end=100, freq=20))
# drop(columns=...) replaces the positional axis argument, which newer pandas versions no longer accept
dfa = df[df["sex"]=="Female"].drop(columns="sex").groupby(["agegrp","status"], observed=True).agg({"age":"count"}).dropna()
dfa.index.names = ['agegrp', 'Female'] # rename index level from status to Female as per requirement
dfa = dfa[dfa["age"]>0] # exclude any aggregates where value is zero
# divide each count by its age band total; remove the factor of 100 for proportions that sum to 1
(100 * dfa / dfa.groupby(level=0, observed=True).transform("sum")).round(2)

output sample

                            age
agegrp  Female  
(0, 20] Divorced            22.22
        Living_With_Partner 5.56
        Married             16.67
        Missing             5.56
        Never_Married       22.22
        Separated           16.67
        Widowed             11.11
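
If you want proportions that sum to 1 directly, as the assignment asks, one possible shortcut is value_counts(normalize=True) on the grouped status column. This is only a sketch on simulated data: the column names age, sex and status and the ten-year edges are assumptions for illustration, not your real NHANES columns.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed so the simulated data is reproducible

statuses = ["Divorced", "Living_With_Partner", "Married",
            "Never_Married", "Separated", "Widowed"]

# Simulated stand-in for the survey data (hypothetical values and column names).
df = pd.DataFrame({
    "age": [random.randint(18, 80) for _ in range(500)],
    "sex": [random.choice(["Male", "Female"]) for _ in range(500)],
    "status": [random.choice(statuses) for _ in range(500)],
})

# Ten-year age bands: (10, 20], (20, 30], ..., (70, 80].
df["agegrp"] = pd.cut(df["age"], bins=list(range(10, 90, 10)))

females = df[df["sex"] == "Female"]

# normalize=True turns the counts into proportions within each age band.
proportions = (
    females.groupby("agegrp", observed=True)["status"]
    .value_counts(normalize=True)
)

# One row per age band, one column per marital status.
table = proportions.unstack(fill_value=0.0)
print(table.round(3))
```

Each row of table is one age band, and the entries in a row add up to 1.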

Even-sized bins, no wider than ten years

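b=[]
bs=6  # initial number of bins
found = False
# increase the number of bins until every bin is at most 10 years wide
while not found:
    found = True
    # left edges of the equal-width bins, plus the maximum age as the final right edge
    b = sorted([int(round(i.left)) for i in df["age"].value_counts(bins=bs).index] + [df["age"].max()])
    for i in range(1, len(b)):  # compare every pair of adjacent edges, including the last
        if b[i]-b[i-1]>10:
            bs += 1
            found = False
            break

# include_lowest=True keeps the minimum age when it lands exactly on the first edge
df["agegrp"] = pd.cut(df["age"], b, include_lowest=True)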
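
The loop above searches for a bin count by trial. As an alternative sketch (not what the answer does), the number of equal-width bins can be computed directly from the age range. The helper name age_band_edges and the sample ages are made up for illustration, and the helper assumes at least two distinct ages.

```python
import math

import numpy as np
import pandas as pd


def age_band_edges(ages, max_width=10):
    """Hypothetical helper: equal-width edges covering `ages`, each band at most `max_width` wide."""
    lo, hi = float(ages.min()), float(ages.max())
    # Smallest number of equal-width bins whose width does not exceed max_width.
    n_bins = max(1, math.ceil((hi - lo) / max_width))
    return np.linspace(lo, hi, n_bins + 1)


# Made-up ages, only to exercise the helper.
ages = pd.Series([15, 18, 22, 37, 41, 59, 63, 80], name="age")
edges = age_band_edges(ages)

# include_lowest=True keeps the youngest subject, who sits exactly on the first edge.
agegrp = pd.cut(ages, edges, include_lowest=True)
print(agegrp.value_counts(sort=False))
```

The edges are generally not whole numbers with this approach, so round them or pick fixed decade edges if you prefer tidy labels.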

Answered By - Rob Raymond

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
