Issue
I would like to find the rows in the following table that contain repeated email addresses. The code below adds an extra column to the dataframe and sets it to 'ja' when an email address is repeated. This works fine for a small number of rows (150), but for a large number of rows (30000) the script hangs. Is there a better way to loop over the rows?
import pandas as pd
data={'Name':['Danny','Damny','Monny','Quony','Dimny','Danny'],
'Email':['[email protected]','[email protected]','[email protected]','[email protected]','[email protected]','[email protected]']}
df=pd.DataFrame(data)
df['email_repeated']=None
col_email=df.columns.get_loc("Email")
row_count=len(df.index)
for i in range(0,row_count):
    for k in range(0,row_count):
        emailadres=df.iloc[i,col_email]
        if k!=i:
            if emailadres==df.iloc[k,col_email]:
                df['email_repeated'][k] = 'ja'
Solution
df.duplicated('Email', keep=False)
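computes exactly what you want, in boolean form: it returns True for every row whose Email value occurs more than once.
If you insist on having 'ja'/None, you can keep your initial column creation and then set 'ja' only on the duplicated rows: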
df['email_repeated']=None
df.loc[df.duplicated('Email', keep=False), 'email_repeated']='ja'
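For reference, here is a minimal end-to-end sketch of the approach above. The email addresses in the question were obfuscated, so the example.com addresses below are hypothetical placeholders; the only assumption carried over from the question's loop is that the first and last rows share an address.

```python
import pandas as pd

# Placeholder data: the addresses are made up for illustration,
# with the first and last rows sharing the same address.
df = pd.DataFrame({
    "Name": ["Danny", "Damny", "Monny", "Quony", "Dimny", "Danny"],
    "Email": [
        "danny@example.com", "damny@example.com", "monny@example.com",
        "quony@example.com", "dimny@example.com", "danny@example.com",
    ],
})

# keep=False marks every occurrence of a repeated value, not just the later ones
mask = df.duplicated("Email", keep=False)

# Same 'ja'/None column as in the question, filled without any Python loop
df["email_repeated"] = None
df.loc[mask, "email_repeated"] = "ja"

print(df)
```

Using df.loc with the boolean mask also avoids the chained assignment (df['email_repeated'][k] = ...) used in the question's loop, which pandas warns about.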
As for the literal question (is there a better way to iterate over pandas rows): generally speaking, the answer is "don't". The better way to iterate is to avoid iteration, at all costs. Your nested loop performs row_count squared comparisons (900 million for 30000 rows), each one through a slow iloc lookup, which is why the script appears to hang. Of course, there is an iteration somewhere: duplicated surely iterates over the rows. But it does so inside pandas code, in C, not inside your interpreted Python code. It is very rare that you really need loops over a dataframe, and it is a good habit to think "if I am iterating over pandas rows, then I am doing something wrong". Even very convoluted "non-iterations" (I mean a succession of operations to achieve the result, when the algorithm seems straightforward using loops) are generally preferable to for loops.
In this case, it is not even convoluted (there is a function dedicated exactly to your task). But even solutions that merge the dataframe with itself to find duplicates, or things like that, would probably be way faster than anything with a for loop.
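To illustrate what such a "convoluted non-iteration" could look like, here is a rough sketch of the self-merge idea. It is not something you should prefer over duplicated; it only shows that the pairing done by the nested loop can be expressed as a merge. The data and the names left, pairs and dup_index are made up for this illustration.

```python
import pandas as pd

# Placeholder data (hypothetical addresses, first and last rows share one)
df = pd.DataFrame({
    "Name": ["Danny", "Damny", "Monny", "Quony", "Dimny", "Danny"],
    "Email": [
        "danny@example.com", "damny@example.com", "monny@example.com",
        "quony@example.com", "dimny@example.com", "danny@example.com",
    ],
})

# Expose the row labels as a regular column so they survive the merge
left = df.reset_index()

# Pair every row with every row that has the same Email (including itself)
pairs = left.merge(left, on="Email", suffixes=("_a", "_b"))

# Drop the self-pairs; what remains are rows matched with a *different* row
pairs = pairs[pairs["index_a"] != pairs["index_b"]]
dup_index = pairs["index_a"].unique()

df["email_repeated"] = None
df.loc[dup_index, "email_repeated"] = "ja"

print(df)
```

Note that the merge materializes one row per matching pair, so an address repeated many times produces many pairs; duplicated does not have that cost.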
Answered By - chrslg