Issue
All the dfs have a key col "id". pd.merge is not a viable option even with the suffix option. There are over 40k cols in each of the dfs so column binding and deleting later (suffix_x) is not an option. Exactly 50k (common) rows in each of the dfs identified by "id" col.
Minimal example with two common cols:
df1 = pd.DataFrame({
'id': ['a', 'b', 'c'],
'col1': [123, 121, 111],
'col2': [456, 454, 444],
'col3': [786, 787, 777],
})
df2 = pd.DataFrame({
'id': ['a', 'b', 'c'],
'col1': [123, 121, 111],
'col2': [456, 454, 444],
'col4': [11, 44, 77],
})
df3 = pd.DataFrame({
'id': ['a', 'b', 'c'],
'col1': [123, 121, 111],
'col2': [456, 454, 444],
'col5': [1786, 1787, 1777],
})
Final answer:
finaldf = pd.DataFrame({
'id': ['a', 'b', 'c'],
'col1': [123, 121, 111],
'col2': [456, 454, 444],
'col3': [786, 787, 777],
'col4': [11, 44, 77],
'col5': [1786, 1787, 1777],
})
Solution
If memory is limiting and the dataframes are already aligned, you could try to set up the output and update
it:
from functools import reduce
dfs = [df1, df2, df3]
cols = reduce(lambda a,b: a.union(b, sort=False),
(x.columns for x in dfs))
out = pd.DataFrame(index=dfs[0].index,
columns=cols)
for x in dfs:
out.update(x)
Variant for the last step:
out = pd.DataFrame(dfs[0],
columns=cols)
for x in dfs[1:]:
out.update(x)
Output:
id col1 col2 col3 col4 col5
0 a 123 456 786 11 1786
1 b 121 454 787 44 1787
2 c 111 444 777 77 1777
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.