Saturday, January 20, 2024

[FIXED] Pandas join with multi-index and NaN


Issue

I am using Pandas 2.1.3.

I am trying to join two DataFrames on multiple index levels, and one of the index levels contains NaN values. A minimal reproducible example looks like this:

import numpy as np
import pandas as pd

a = pd.DataFrame({
    'idx_a':['A', 'A', 'B'],
    'idx_b':['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x':[10, 20, 30]
}).set_index(['idx_a', 'idx_b', 'idx_c'])

b = pd.DataFrame({
    'idx_b':['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y':[100, 200, 300, 400, 500]
}).set_index(['idx_b', 'idx_c'])

c = a.join(
    b,
    how='inner',
    on=['idx_b', 'idx_c']
)

print(a)
                    x
idx_a idx_b idx_c    
A     alpha 1.0    10
      beta  1.0    20
B     gamma 1.0    30

print(b)
                y
idx_b   idx_c     
gamma   1.0    100
delta   1.0    200
epsilon 1.0    300
NaN     1.0    400
        1.0    500

print(c)
                    x    y
idx_a idx_b idx_c         
B     gamma 1.0    30  100
            1.0    30  400
            1.0    30  500

I would have expected:

print(c)
                    x    y
idx_a idx_b idx_c         
B     gamma 1.0    30  100

Why is join matching on the NaN values?

Solution

You can resolve the problem by keeping the key fields as ordinary columns (no set_index) and using merge instead of join:

import numpy as np
import pandas as pd

a = pd.DataFrame({
    'idx_a':['A', 'A', 'B'],
    'idx_b':['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x':[10, 20, 30]
})

b = pd.DataFrame({
    'idx_b':['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y':[100, 200, 300, 400, 500]
})

c = a.merge(b, on=['idx_b', 'idx_c'], how='inner')

Output:

  idx_a  idx_b  idx_c   x    y
0     B  gamma    1.0  30  100
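One caveat, sketched below under assumptions: merge pairs NaN keys with each other when both sides contain them, so the result above is clean only because a has no NaN in its keys. The frames left and right and the names with_nan and without_nan are made up for illustration and are not part of the question. If NaN keys should never match, dropping them before the merge is one possible approach:

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): both sides hold a NaN key.
left = pd.DataFrame({
    'idx_b': ['gamma', np.nan],
    'idx_c': [1.0, 1.0],
    'x': [30, 40]
})
right = pd.DataFrame({
    'idx_b': ['gamma', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0],
    'y': [100, 400, 500]
})
keys = ['idx_b', 'idx_c']

# Plain merge: the NaN key on the left pairs with both NaN keys on the right.
with_nan = left.merge(right, on=keys, how='inner')

# Dropping rows with a missing key first keeps only the genuine match.
without_nan = (left
    .dropna(subset=keys)
    .merge(right.dropna(subset=keys), on=keys, how='inner')
)

print(with_nan)
print(without_nan)
```

Whether NaN keys should match is a design decision; the dropna step simply makes the choice explicit.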

If you want to keep the indexes on a and b as they are in the question, you can reset them for the merge and restore them afterwards (thanks @mozway):

c = (a
    .reset_index()
    .merge(b.reset_index(), on=['idx_b', 'idx_c'], how='inner')
    # dict.fromkeys de-duplicates the level names while preserving their order
    .set_index(list(dict.fromkeys(a.index.names + b.index.names)))
)

Output:

                    x    y
idx_a idx_b idx_c
B     gamma 1.0    30  100
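If this pattern is needed in more than one place, it could be wrapped in a small function. The sketch below is only an illustration: merge_on_levels is a hypothetical helper name, not a pandas API, and it assumes every index level of both frames is named.

```python
import numpy as np
import pandas as pd

def merge_on_levels(left, right, on, how='inner'):
    """Hypothetical helper: merge two indexed frames on shared level names."""
    # Union of the level names, de-duplicated while preserving order.
    names = list(dict.fromkeys(list(left.index.names) + list(right.index.names)))
    # Merge on plain columns, then restore the combined index.
    merged = left.reset_index().merge(right.reset_index(), on=on, how=how)
    return merged.set_index(names)

# Same data as in the question.
a = pd.DataFrame({
    'idx_a': ['A', 'A', 'B'],
    'idx_b': ['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x': [10, 20, 30]
}).set_index(['idx_a', 'idx_b', 'idx_c'])

b = pd.DataFrame({
    'idx_b': ['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y': [100, 200, 300, 400, 500]
}).set_index(['idx_b', 'idx_c'])

c = merge_on_levels(a, b, on=['idx_b', 'idx_c'])
print(c)
```

The helper leaves a and b untouched, since reset_index returns new frames.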

Answered By - Nick

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
