Issue
I have a df with 30 million rows of the form:
0;401
0;924
0;925
1;145
1;414
1;673
2;144
2;145
2;153
And I need to extract the rows whose first-column value is repeated many times (e.g. more than 100). I tried a crude method:
df1 = pd.DataFrame()
state_last = None
for index, row in df.iterrows():
    if row.loc['S1'] != state_last:  # skip iterations for S1 values already processed
        state_last = row.loc['S1']
        temp = df.loc[df['S1'] == row['S1']]
        if temp.shape[0] > 100:
            df1 = df1.append(temp)
I also tried:
for i in range(19709):  # the largest value in the first column
    temp = df.loc[df['S1'] == i]
    if temp.shape[0] > 100:
        df1 = df1.append(temp)
But both of these methods are too inefficient. Can this be done more quickly?
Thanks in advance
Solution
Assuming your columns are named first_column and second_column, you can do:
df = df.loc[df['first_column'].duplicated(keep=False), :]
You can read more in the pandas documentation for duplicated.
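A minimal sketch of this approach, using the sample data from the question (column names 'S1' and 'S2' are assumptions based on the question's code). Note that duplicated(keep=False) keeps every row whose first-column value appears at least twice, not more than 100 times; the EDIT below addresses the threshold:

```python
import pandas as pd

# Sample frame shaped like the question's input
df = pd.DataFrame({'S1': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'S2': [401, 924, 925, 145, 414, 673, 144, 145, 153]})

# keep=False marks every row whose 'S1' value occurs more than once
df1 = df.loc[df['S1'].duplicated(keep=False), :]
```

Here every 'S1' value occurs three times, so all nine rows are kept.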
EDIT:
You can group by first_column, count the number of rows in each group, and then use loc to keep the rows where the count is > 100.
check this answer from Pedro M Duarte
Answered By - Nishad Wadwekar