Issue
Good evening.
I have two dataframes, each having several columns and a couple of million lines but right now we're interested only in two of columns: FOOs and IDs for DataFrame1, BARs and IDs for DataFrame2. IDs is basically a many-to-many relationship and it goes like this:
import pandas as pd
data1=['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3','foo4',
'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1=[1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]
data2=['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
'bar3', 'bar2', 'bar1' ]
id2=[1, 1, 1, 1, 2, 3, 4, 5, 6, 7]
df_foo=pd.DataFrame(data=zip(data1, id1), columns=['FOOs', 'IDs'],
dtype='object')
df_bar=pd.DataFrame(data=zip(data2, id2), columns=['BARs', 'IDs'],
dtype='object')
What I need to do is to aggregate all the FOOs with the BARs. I have a solution that works but it does look messy:
def my_agg(series):
return df_bar[df_bar.IDs.isin(series)].groupby('BARs').agg({'BARs': pd.unique})
print(df_foo.groupby('FOOs').agg({'FOOs': pd.unique, 'IDs': my_agg}).values)
And the output is:
[[array(['foo1'], dtype=object)
array([[array(['bar1'], dtype=object)],
[array(['bar2'], dtype=object)],
[array(['bar3'], dtype=object)]], dtype=object)]
[array(['foo2'], dtype=object)
array([[array(['bar1'], dtype=object)],
[array(['bar2'], dtype=object)]], dtype=object)]
[array(['foo3'], dtype=object)
array([[array(['bar1'], dtype=object)],
[array(['bar3'], dtype=object)]], dtype=object)]
[array(['foo4'], dtype=object)
array([[array(['bar1'], dtype=object)]], dtype=object)]]
Question is: Is there a way to make a clean solution with nice readable output like
FOOs BARs
foo1 bar1
bar2
bar3
foo2 bar1
bar2
foo3 bar1
bar3
foo4 bar1
Thanks in advance.
Solution
How about:
df_foo.merge(df_bar, on='IDs')[['FOOs', 'BARs']].drop_duplicates()
Answered By - ozacha
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.