Issue
I am using Pandas 2.1.3.
I am trying to join two DataFrames on multiple index levels, and one of the index levels contains NaN values. A minimal reproducible example looks like this:
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'idx_a': ['A', 'A', 'B'],
    'idx_b': ['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x': [10, 20, 30]
}).set_index(['idx_a', 'idx_b', 'idx_c'])

b = pd.DataFrame({
    'idx_b': ['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y': [100, 200, 300, 400, 500]
}).set_index(['idx_b', 'idx_c'])

# Join b onto a using two of a's index levels as the keys
c = a.join(
    b,
    how='inner',
    on=['idx_b', 'idx_c']
)
print(a)
                    x
idx_a idx_b idx_c
A     alpha 1.0    10
      beta  1.0    20
B     gamma 1.0    30

print(b)
                 y
idx_b   idx_c
gamma   1.0    100
delta   1.0    200
epsilon 1.0    300
NaN     1.0    400
        1.0    500

print(c)
                    x    y
idx_a idx_b idx_c
B     gamma 1.0    30  100
            1.0    30  400
            1.0    30  500
I would have expected:
print(c)
                    x    y
idx_a idx_b idx_c
B     gamma 1.0    30  100
Why is join matching on the NaN values?
Solution
You can resolve the problem by dropping the indexes and using merge instead of join:
import numpy as np
import pandas as pd

# Same data as in the question, but kept as plain columns (no set_index)
a = pd.DataFrame({
    'idx_a': ['A', 'A', 'B'],
    'idx_b': ['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x': [10, 20, 30]
})

b = pd.DataFrame({
    'idx_b': ['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y': [100, 200, 300, 400, 500]
})

c = a.merge(b, on=['idx_b', 'idx_c'], how='inner')
Output:
  idx_a  idx_b  idx_c   x    y
0     B  gamma    1.0  30  100
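One caveat worth knowing: the pandas documentation states that merge matches rows against each other when both key columns contain null values. In the question's data only b has NaN keys, so the merge above is unaffected. If both frames could contain missing keys, a defensive sketch is to drop those rows before merging. The extra NaN row in a below is a hypothetical addition for illustration and is not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the question's data: the left frame also has a NaN key
a = pd.DataFrame({
    'idx_a': ['A', 'A', 'B', 'B'],
    'idx_b': ['alpha', 'beta', 'gamma', np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0],
    'x': [10, 20, 30, 40]
})

b = pd.DataFrame({
    'idx_b': ['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y': [100, 200, 300, 400, 500]
})

keys = ['idx_b', 'idx_c']

# Drop rows whose join keys are missing on both sides, then merge as before
c = a.dropna(subset=keys).merge(b.dropna(subset=keys), on=keys, how='inner')
print(c)
```

Whether dropping is appropriate depends on what a missing key means in your data; if NaN rows must survive, filter them out only for the merge and concatenate them back afterwards.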
If you want to keep the indexes on a and b as they are in the question, you can do this (thanks @mozway):
c = (a
     .reset_index()
     .merge(b.reset_index(), on=['idx_b', 'idx_c'], how='inner')
     .set_index(list(dict.fromkeys(a.index.names+b.index.names)))
)
Output:
                    x    y
idx_a idx_b idx_c
B     gamma 1.0    30  100
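If you need this pattern in more than one place, the reset/merge/re-index chain can be wrapped in a small function. This is only a sketch of the approach above: merge_on_index_levels is a hypothetical name, not a pandas API, and it assumes every index level on both frames is named.

```python
import numpy as np
import pandas as pd


def merge_on_index_levels(left, right, on, how='inner'):
    """Hypothetical helper: merge two indexed frames on shared level names."""
    # Ordered union of the index level names, with duplicates removed
    index_names = list(dict.fromkeys(list(left.index.names) + list(right.index.names)))
    merged = left.reset_index().merge(right.reset_index(), on=on, how=how)
    # Restore the combined index after the column-based merge
    return merged.set_index(index_names)


a = pd.DataFrame({
    'idx_a': ['A', 'A', 'B'],
    'idx_b': ['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x': [10, 20, 30]
}).set_index(['idx_a', 'idx_b', 'idx_c'])

b = pd.DataFrame({
    'idx_b': ['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y': [100, 200, 300, 400, 500]
}).set_index(['idx_b', 'idx_c'])

c = merge_on_index_levels(a, b, on=['idx_b', 'idx_c'])
print(c)
```

The dict.fromkeys call keeps the level names in first-seen order, so the result is indexed by idx_a, idx_b, idx_c just like a.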
Answered By - Nick