Issue
I want to find the groups (rather than the grouping variable) in a pandas groupby. Here is an example:
Name   Col1  Col2  Col3
John   1     A     C
Sam    1     B     C
Mike   1     B     D
Kate   2     E     G
Fred   3     E     H
Liz    3     F     H
Jane   4     X     Y
Henry  4     Z     T
If I group them using Col1 and (Col2 or Col3), the corresponding groups will be
output = [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane'], ['Henry']]
because a group consists of people having the same Col1 value, as well as either the same Col2 or the same Col3 value.
I was able to get what I want by creating a graph and finding connected components. Grouping by Col1 first, then finding connected components is another idea. However, I believe there must be a simpler way.
I would also like to do this in a more general case, such as grouping by Col1 and Col2 and (Col3 or Col4) and (Col5 or Col6).
Solution
I've had a look around, and this question is effectively a duplicate of this post: Group a pandas dataframe by one column OR another one. So, I cannot - not remotely - take credit for the following solution, but let me just show how you can adjust the impressive answer provided there by @AmiTavory to suit your specific needs:
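import itertools

import networkx as nx
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Name': ['John', 'Sam', 'Mike', 'Kate', 'Fred', 'Liz', 'Jane', 'Henry'],
    'Col1': [1, 1, 1, 2, 3, 3, 4, 4],
    'Col2': ['A', 'B', 'B', 'E', 'E', 'F', 'X', 'Z'],
    'Col3': ['C', 'C', 'D', 'G', 'H', 'H', 'Y', 'T'],
})

# One node per person; an edge joins two people who share Col1
# and also share Col2 or Col3.
G = nx.Graph()
G.add_nodes_from(df.Name)
G.add_edges_from(
    [(r1[1]['Name'], r2[1]['Name'])
     for (r1, r2) in itertools.product(df.iterrows(), df.iterrows())
     if r1[1].Name < r2[1].Name and
     (r1[1]['Col1'] == r2[1]['Col1'] and
      (r1[1]['Col2'] == r2[1]['Col2'] or r1[1]['Col3'] == r2[1]['Col3']))]
)

# Label each name with the index of its connected component
df['group'] = df['Name'].map(
    dict(itertools.chain.from_iterable([[(ee, i) for ee in e]
         for (i, e) in enumerate(nx.connected_components(G))])))

# finally, we only need to add this to get the list with nested lists
# containing the names.
output = df.groupby('group')['Name'].apply(list).values.tolist()
output
# [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane'], ['Henry']]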
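The comprehension above compares every pair of rows, so the work grows quadratically with the number of rows. As a rough sketch of an alternative (not part of the linked answer), rows can instead be linked through the keys they share, using a small union-find written in plain Python. The helper name group_by_shared_keys is made up for illustration, and the code assumes the rule is exactly Col1 and (Col2 or Col3) and that the rows arrive as plain dicts, which is the shape df.to_dict("records") produces.

```python
from collections import defaultdict

# Same rows as the example table, as plain dicts.
rows = [
    {"Name": "John", "Col1": 1, "Col2": "A", "Col3": "C"},
    {"Name": "Sam", "Col1": 1, "Col2": "B", "Col3": "C"},
    {"Name": "Mike", "Col1": 1, "Col2": "B", "Col3": "D"},
    {"Name": "Kate", "Col1": 2, "Col2": "E", "Col3": "G"},
    {"Name": "Fred", "Col1": 3, "Col2": "E", "Col3": "H"},
    {"Name": "Liz", "Col1": 3, "Col2": "F", "Col3": "H"},
    {"Name": "Jane", "Col1": 4, "Col2": "X", "Col3": "Y"},
    {"Name": "Henry", "Col1": 4, "Col2": "Z", "Col3": "T"},
]


def group_by_shared_keys(rows):
    """Group rows sharing Col1 and (Col2 or Col3), transitively.

    Hypothetical helper for illustration only.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Walk up to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so groups stay in row order.
            parent[max(ri, rj)] = min(ri, rj)

    # Two rows belong together if they share (Col1, Col2) or (Col1, Col3).
    first_seen = {}
    for i, row in enumerate(rows):
        for col in ("Col2", "Col3"):
            key = (col, row["Col1"], row[col])
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[find(i)].append(row["Name"])
    return [groups[root] for root in sorted(groups)]


print(group_by_shared_keys(rows))
# [['John', 'Sam', 'Mike'], ['Kate'], ['Fred', 'Liz'], ['Jane'], ['Henry']]
```

Each row is touched once per "or" column, so this avoids the pairwise loop. The trade-off is that the grouping rule is baked into how the keys are built instead of being a free-form condition.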
In order to achieve other combinations of and/or, you will just have to rewrite this bit:
(r1[1]['Col1'] == r2[1]['Col1'] and
 (r1[1]['Col2'] == r2[1]['Col2'] or r1[1]['Col3'] == r2[1]['Col3']))
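To avoid rewriting that condition by hand for every variation, one option is to build it from lists of column names. The following is only a sketch: make_rule, and_cols and or_groups are names invented here, and the rows are plain dicts so that the example runs without pandas (rows taken from df.iterrows() support the same r[col] lookup).

```python
import itertools


def make_rule(and_cols, or_groups):
    """Build a pairwise test (hypothetical helper, for illustration).

    Two rows match when every column in and_cols is equal AND, for each
    list in or_groups, at least one of its columns is equal.
    """
    def same_group(r1, r2):
        return (all(r1[c] == r2[c] for c in and_cols)
                and all(any(r1[c] == r2[c] for c in group)
                        for group in or_groups))
    return same_group


# Col1 and (Col2 or Col3), as in the question
rule = make_rule(and_cols=["Col1"], or_groups=[["Col2", "Col3"]])

# Col1 and Col2 and (Col3 or Col4) and (Col5 or Col6) would be:
# make_rule(["Col1", "Col2"], [["Col3", "Col4"], ["Col5", "Col6"]])

rows = [
    {"Name": "John", "Col1": 1, "Col2": "A", "Col3": "C"},
    {"Name": "Sam", "Col1": 1, "Col2": "B", "Col3": "C"},
    {"Name": "Mike", "Col1": 1, "Col2": "B", "Col3": "D"},
    {"Name": "Kate", "Col1": 2, "Col2": "E", "Col3": "G"},
    {"Name": "Fred", "Col1": 3, "Col2": "E", "Col3": "H"},
    {"Name": "Liz", "Col1": 3, "Col2": "F", "Col3": "H"},
    {"Name": "Jane", "Col1": 4, "Col2": "X", "Col3": "Y"},
    {"Name": "Henry", "Col1": 4, "Col2": "Z", "Col3": "T"},
]

# combinations() yields each unordered pair once, so no name
# comparison is needed to skip duplicates.
edges = [(r1["Name"], r2["Name"])
         for r1, r2 in itertools.combinations(rows, 2)
         if rule(r1, r2)]
print(edges)
# [('John', 'Sam'), ('Sam', 'Mike'), ('Fred', 'Liz')]
```

A list of edges like this can be passed to G.add_edges_from in place of the hand-written comprehension; the connected-components step stays the same.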
Answered By - ouroboros1