Tuesday, October 25, 2022

[FIXED] How to organize fields in dataframe by repetition and drop duplicates

October 25, 2022 dataframe, pandas, python No comments

Issue

I have this

id phone1 phone2 
1  300    301
1  303    300
1  300    303
2  400    401

Want this

id phone1 phone2 phone3
1  300    303    301
2  400    401

I have tried group by id and column phone1, apply count function, iterate over it adding to a list verifying if is already there the id and phone and sum the third column, and do the same thing with phone2 in the same list

After it reorganize the dataframe iterating the list but this is so slow with the millions of data that i have to proccess

dataframe1 = dataframe.groupby(['id', 'phone1']).count().reset_index()
dataframe2 = dataframe.groupby(['id', 'phone2']).count().reset_index()

result to add in a list

id phone1 phone2
1  300    2    
1  303    1
2  401    1

id phone1 phone2
1  300    1   
1  301    1
1  303    1
2  400    1

Solution

You can achieve this via grouping and sorting groups by count.

First, collect phones from each phone column:

phone_columns = df.columns[1:]
df["phone_list"] = df[phone_columns].apply(list, axis=1)
df = df.groupby("id").agg(phone_list_agg=("phone_list", lambda x: list(itertools.chain.from_iterable(x))))

                                        phone_list_agg
id                                                   
1   [300, 301, 303, 300, 300, 303, 800, 800, 800, 800]
2                                           [400, 401]

Then, group phones and sort by count:

df["phone_tuples"] = df.apply(lambda x: [(k,len(list(g))) for k,g in itertools.groupby(sorted(x["phone_list_agg"]))], axis=1)
df = df.drop("phone_list_agg", axis=1)
df["phone_tuples"] = df.apply(lambda x: sorted(x["phone_tuples"], key=lambda y:y[1], reverse=True), axis=1)

                                phone_tuples
id                                          
1   [(800, 4), (300, 3), (303, 2), (301, 1)]
2                       [(400, 1), (401, 1)]

Finally, unpack tuples into separate columns:

df = pd.DataFrame(data=[[y[0] for y in x] for x in df["phone_tuples"]], index=df.index)
df.columns = [f"phone{i}" for i in range(1, len(df.columns) + 1)]
df = df.reset_index()

   id  phone1  phone2  phone3  phone4
0   1     800     300   303.0   301.0
1   2     400     401     NaN     NaN

You can fill NaN above with some value (for ex. -1) and convert float to int using following:

df = df.fillna(-1).astype(int)

Sample dataset used:

data=[
(1,300,301),
(1,303,300),
(1,300,303),
(2,400,401),
(1,800,800),
(1,800,800),
]

columns = ["id", "phone1", "phone2"]

df = pd.DataFrame(data=data, columns=columns)

Answered By - Azhar Khan

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, October 25, 2022

[FIXED] How to organize fields in dataframe by repetition and drop duplicates

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels