Issue
I have a data frame:
import pandas as pd
data = {'userName': {0: 'john', 1: 'amanda', 2: 'sara', 3: 'john'}, 'serialNum': {0: '[a4G, bweQ, fp_dE4]', 1:'' , 2: '[H2dw45, IfC4, bAf23g, Lkfr54-op, a3dLa]', 3: '[Tr45b, kM30, a4G, riU91]'}}
df = pd.DataFrame(data)
df
*each userName is unique. for each userName I want to count number of data in serialNum column.
my code is:
df['serialNum'] = df['serialNum'].str.strip('[]').str.replace("'", "").str.replace('"', '').replace('', np.nan).str.split(',').fillna({i: [] for i in df.index})
df_count = (df['serialNum'].str.len()
.groupby(df['userName'], sort=False).sum()
).to_frame(name='count').reset_index()
my df_count data frame:
userName | serialNum |
---|---|
john | 7 |
amanda | 0 |
sara | 5 |
but I don't want to count repetitive data (serialNum) for each userName and true df_count is:
userName | serialNum |
---|---|
john | 6 |
amanda | 0 |
sara | 5 |
my question is: How to count without duplicate data ( in column serialNum)?
and if you have a better offer for counting the number of values in the serialNum column, please share with me.
note: I prefer to use pandas or numpy methods and functions instead of writing new functions or for loop.
Solution
Use explode
+ groupby agg
nunique
:
u_count_df = (
df.explode('serialNum')
.groupby('userName', sort=False).agg({'serialNum': 'nunique'})
.reset_index()
.rename(columns={'serialNum': 'uCount'})
)
u_count_df
:
userName uCount
0 john 6
1 amanda 0
2 sara 5
With the current setup you'll need to strip the extra whitespace off because of the way the strings are processed:
e = df.explode('serialNum')
e['serialNum'] = e['serialNum'].str.strip()
u_count_df = (
e.groupby('userName').agg({'serialNum': 'nunique'})
.reset_index()
.rename(columns={'serialNum': 'uCount'})
)
Or change the way the strings are split:
df['serialNum'] = (
df['serialNum'].str.strip('[]')
.replace({"'": '', '"': ''}, regex=True)
.replace({'': np.NaN})
.str.split(r'\s*,\s*')
.fillna({i: [] for i in df.index})
)
u_count_df = (
df.explode('serialNum')
.groupby('userName', sort=False).agg({'serialNum': 'nunique'})
.reset_index()
.rename(columns={'serialNum': 'uCount'})
)
u_count_df
:
userName uCount
0 john 6
1 amanda 0
2 sara 5
Answered By - Henry Ecker
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.