Issue
The original dataframe is:
import pandas as pd
array = {'id': [1, 1, 1, 1, 2, 3],
'color': ['yellow', 'red', 'yellow', 'red', 'yellow', 'white']}
df = pd.DataFrame(array)
df
   id   color
0   1  yellow
1   1     red
2   1  yellow
3   1     red
4   2  yellow
5   3   white
I have transformed it to the following dataframe with get_dummies:
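# dtype=int keeps the 0/1 output shown below (recent pandas versions default to booleans)
df = pd.get_dummies(df, prefix='', prefix_sep='', dtype=int)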
df
   id  red  white  yellow
0   1    0      0       1
1   1    1      0       0
2   1    0      0       1
3   1    1      0       0
4   2    0      0       1
5   3    0      1       0
which I then want to group by the 'id' column:
df.groupby(['id']).max()
    red  white  yellow
id
1     1      0       1
2     0      0       1
3     0      1       0
However, my original dataframe is 8,000 rows by 1,500,000 columns, which makes this operation too slow.
Any ideas on how to make it quicker?
Solution
Update
Based on your original data frame, I would de-duplicate the data frame first and pivot (or one-hot encode) it afterwards. That way, you avoid any subsequent aggregation entirely.
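# df is the original two-column frame (id, color), before get_dummies
# assign() adds the column on a new frame and avoids a SettingWithCopyWarning
df_unique = df.drop_duplicates().assign(val=1)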
df_unique
   id   color  val
0   1  yellow    1
1   1     red    1
4   2  yellow    1
5   3   white    1
# values="val" keeps the columns single-level (red, white, yellow)
df_unique.pivot(index="id", columns="color", values="val").fillna(0)
color  red  white  yellow
id
1      1.0    0.0     1.0
2      0.0    0.0     1.0
3      0.0    1.0     0.0
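As a quick sanity check, here is a minimal sketch on the toy data from the question that compares the de-duplicate-then-pivot route with the original get_dummies + groupby route. The names baseline, deduped and same are only illustrative, and the final astype(int) is my addition to get integer 0/1 values instead of floats.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 3],
    "color": ["yellow", "red", "yellow", "red", "yellow", "white"],
})

# Baseline from the question: one-hot encode, then take the max per id.
baseline = (
    pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)
    .groupby("id")
    .max()
)

# De-duplicate first, then pivot; no aggregation step is needed.
deduped = (
    df.drop_duplicates()
    .assign(val=1)
    .pivot(index="id", columns="color", values="val")
    .fillna(0)
    .astype(int)
)

# Compare values only; the column index name differs ("color" vs. none).
same = (baseline.to_numpy() == deduped.to_numpy()).all()
print(same)
```

Whether this is actually faster on 8,000 x 1,500,000 data is something you would have to measure; the sketch only shows that both routes agree on the small example.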
Coding Alternatives
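Reshaping your data to long format is itself time-consuming, but aggregating there might be faster than in your current wide format:
# df here is the one-hot encoded frame returned by get_dummies
# first approach using melt.groupby.max
melted_and_aggregated_df = pd.melt(df, id_vars='id').groupby(["id", "variable"]).max()
# second approach using melt.sort.groupby.first
# (sort by value in descending order so that first() picks each group's maximum)
melted_and_aggregated_df = pd.melt(df, id_vars='id').sort_values(by="value", ascending=False).groupby(["id", "variable"]).first()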
You can run this afterward on the aggregated result to restore the desired wide shape:
melted_and_aggregated_df.reset_index(level=["variable"]).pivot(columns=["variable"], values="value")
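To make the round trip concrete, here is a small end-to-end sketch on the toy data: wide to long with melt, aggregate, then back to wide. Note that it swaps in unstack for the reset_index + pivot call above, which should give the same shape; wide, long_df and restored are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 3],
    "color": ["yellow", "red", "yellow", "red", "yellow", "white"],
})
wide = pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)

# Wide -> long: one row per (id, variable) pair with its 0/1 value.
long_df = pd.melt(wide, id_vars="id")

# Aggregate in long form; the result has a MultiIndex of (id, variable).
melted_and_aggregated_df = long_df.groupby(["id", "variable"]).max()

# Long -> wide again: move the "variable" index level back into columns.
restored = melted_and_aggregated_df["value"].unstack("variable")
print(restored)
```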
Data Size
Besides the pure coding efficiency, try to reduce your data.
- In case there are groups that only have a single row, you should use the max/first approach on the other groups only and combine the results afterward.
- Are there actually 1.5 million colors? That sounds enormous. Do you really need all of them, or can they be reduced or aggregated beforehand?
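The first point could look roughly like the sketch below, again on the toy data. It assumes the one-hot encoded frame is called wide; is_repeated, singles and aggregated are made-up names for illustration. Rows whose id occurs only once are already their own maximum, so only the repeated ids go through groupby.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 3],
    "color": ["yellow", "red", "yellow", "red", "yellow", "white"],
})
wide = pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)

# True for rows whose id occurs more than once in the frame.
is_repeated = wide["id"].duplicated(keep=False)

# Single-row groups are already their own maximum; just re-index them.
singles = wide[~is_repeated].set_index("id")

# Only the repeated ids go through the expensive aggregation.
aggregated = wide[is_repeated].groupby("id").max()

# Combine both parts and restore the id order.
result = pd.concat([aggregated, singles]).sort_index()
print(result)
```

How much this saves depends on how many ids are unique in your real data, so it is worth checking wide["id"].duplicated(keep=False).mean() first.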
Answered By - mnist