Issue
My DataFrame has 3 fields: account, month, and salary.
account  month   salary
1        201501  10000
2        201506  20000
2        201506  20000
3        201508  30000
3        201508  30000
3        201506  10000
3        201506  10000
3        201506  10000
3        201506  10000
I am grouping by account and month, calculating the sum of salary for each group, and then removing duplicates:
MyDataFrame['salary'] = MyDataFrame.groupby(['account', 'month'])['salary'].transform(sum)
MyDataFrame = MyDataFrame.drop_duplicates()
The expected output is:
account  month   salary
1        201501  10000
2        201506  40000
3        201508  60000
3        201506  40000
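For reference, here is a minimal runnable version of the above; the literal values are just the sample data from this question, rebuilt inline so the snippet is self-contained:

import pandas as pd

# Sample data from the table above
MyDataFrame = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

# Replace each salary with its group total, then drop the duplicated rows
MyDataFrame['salary'] = (MyDataFrame
                         .groupby(['account', 'month'])['salary']
                         .transform(sum))
MyDataFrame = MyDataFrame.drop_duplicates()
print(MyDataFrame)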
It works well for a few records, but when I tried the same thing on 600 million records it has now been running for 4-5 hours. When I initially loaded the data with pd.read_csv(), it took 60 GB of RAM; for the first 1-2 hours RAM usage stayed between 90 and 120 GB, and after 3 hours the process is using 236 GB of RAM and is still running.
Please suggest a faster alternative way to do this, if one is available.
EDIT: Now 15 minutes into df.groupby(['account', 'month'], sort=False)['Salary'].sum()
Solution
Just to follow up on chrisb's answer and Alexander's comment, you will indeed get more performance out of the .sum() and .agg('sum') methods. Here's the Jupyter %%timeit output for the three:
So, the methods that chrisb and Alexander suggest are about twice as fast on your very small example dataset.
Also, according to the pandas API documentation, adding the kwarg sort=False will help performance. So your groupby should look something like df.groupby(['account', 'month'], sort=False)['Salary'].sum(). Indeed, when I ran it, it was about 10% faster than the runs shown in the above image.
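For completeness, here is a sketch of how such a timing comparison could be set up. The frame below is synthetic and its size is illustrative, so the exact numbers will differ from the screenshot and from the 600-million-row data in the question:

import timeit
import numpy as np
import pandas as pd

# A synthetic frame large enough for the differences to be measurable
n = 1_000_000
df = pd.DataFrame({
    'account': np.random.randint(0, 10_000, n),
    'month': np.random.choice([201501, 201506, 201508], n),
    'Salary': np.random.randint(1_000, 50_000, n),
})

# The original approach: transform(sum) plus drop_duplicates
def with_transform():
    tmp = df.copy()
    tmp['Salary'] = tmp.groupby(['account', 'month'])['Salary'].transform(sum)
    return tmp.drop_duplicates()

# Direct .sum() with sort=False, as suggested above
def with_sum():
    return df.groupby(['account', 'month'], sort=False)['Salary'].sum()

# The .agg('sum') variant
def with_agg():
    return df.groupby(['account', 'month'], sort=False)['Salary'].agg('sum')

for fn in (with_transform, with_sum, with_agg):
    print(fn.__name__, timeit.timeit(fn, number=5), 'seconds for 5 runs')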
Answered By - dagrha