Issue
I have this
id phone1 phone2
1 300 301
1 303 300
1 300 303
2 400 401
Want this
id phone1 phone2 phone3
1 300 303 301
2 400 401
I have tried group by id and column phone1, apply count function, iterate over it adding to a list verifying if is already there the id and phone and sum the third column, and do the same thing with phone2 in the same list
After it reorganize the dataframe iterating the list but this is so slow with the millions of data that i have to proccess
dataframe1 = dataframe.groupby(['id', 'phone1']).count().reset_index()
dataframe2 = dataframe.groupby(['id', 'phone2']).count().reset_index()
result to add in a list
id phone1 phone2
1 300 2
1 303 1
2 401 1
id phone1 phone2
1 300 1
1 301 1
1 303 1
2 400 1
Solution
You can achieve this via grouping and sorting groups by count.
First, collect phones from each phone column:
phone_columns = df.columns[1:]
df["phone_list"] = df[phone_columns].apply(list, axis=1)
df = df.groupby("id").agg(phone_list_agg=("phone_list", lambda x: list(itertools.chain.from_iterable(x))))
phone_list_agg
id
1 [300, 301, 303, 300, 300, 303, 800, 800, 800, 800]
2 [400, 401]
Then, group phones and sort by count:
df["phone_tuples"] = df.apply(lambda x: [(k,len(list(g))) for k,g in itertools.groupby(sorted(x["phone_list_agg"]))], axis=1)
df = df.drop("phone_list_agg", axis=1)
df["phone_tuples"] = df.apply(lambda x: sorted(x["phone_tuples"], key=lambda y:y[1], reverse=True), axis=1)
phone_tuples
id
1 [(800, 4), (300, 3), (303, 2), (301, 1)]
2 [(400, 1), (401, 1)]
Finally, unpack tuples into separate columns:
df = pd.DataFrame(data=[[y[0] for y in x] for x in df["phone_tuples"]], index=df.index)
df.columns = [f"phone{i}" for i in range(1, len(df.columns) + 1)]
df = df.reset_index()
id phone1 phone2 phone3 phone4
0 1 800 300 303.0 301.0
1 2 400 401 NaN NaN
You can fill NaN above with some value (for ex. -1) and convert float to int using following:
df = df.fillna(-1).astype(int)
Sample dataset used:
data=[
(1,300,301),
(1,303,300),
(1,300,303),
(2,400,401),
(1,800,800),
(1,800,800),
]
columns = ["id", "phone1", "phone2"]
df = pd.DataFrame(data=data, columns=columns)
Answered By - Azhar Khan
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.