Tuesday, January 2, 2024

[FIXED] Drop the duplicated index and keep one in the multi-index dataframe

Labels: dataframe, pandas, python-3.x

Issue

My toy DataFrame df has a three-level index: name, name, year. The two name levels duplicate each other in both name and contents, so I want to keep only one of them.

import pandas as pd

# create MultiIndex
index = pd.MultiIndex.from_tuples([
    ('name1', 'name1', '2020'),
    ('name1', 'name1', '2021'),
    ('name2', 'name2', '2020'),
    ('name2', 'name2', '2021'),
    ('name3', 'name3', '2020'),
    ('name3', 'name3', '2021')
], names=['name', 'name', 'year'])

df = pd.DataFrame({
    'quantity': [10, 15, 20, 25, 30, 35],
    'price': [100, 150, 200, 250, 300, 350]
}, index=index)

print(df)

Out:

                  quantity  price
name  name  year                 
name1 name1 2020        10    100
            2021        15    150
name2 name2 2020        20    200
            2021        25    250
name3 name3 2020        30    300
            2021        35    350

I tried the following code and did not succeed:

# Build a boolean mask where True marks an index entry that repeats an earlier one
duplicates = df.index.duplicated(keep='first')

# Use the boolean mask to keep only the rows that are not repeats
df = df[~duplicates]
df

Out:

                  quantity  price
name  name  year                 
name1 name1 2020        10    100
            2021        15    150
name2 name2 2020        20    200
            2021        25    250
name3 name3 2020        30    300
            2021        35    350

If I call reset_index() and then try to drop the duplicated columns, I get ValueError: cannot insert name, already exists.

How can I get the following result? Thanks.

            quantity  price
name  year                 
name1 2020        10    100
      2021        15    150
name2 2020        20    200
      2021        25    250
name3 2020        30    300
      2021        35    350

Solution

Just use droplevel:

df.droplevel(0)

Output:

            quantity  price
name  year
name1 2020        10    100
      2021        15    150
name2 2020        20    200
      2021        25    250
name3 2020        30    300
      2021        35    350
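
Note that droplevel returns a new DataFrame rather than modifying df in place, so assign the result if you want to keep it. A minimal sketch on a shortened version of the question's data; the check that the first two levels really hold the same values is an extra safeguard added here for illustration, not part of the original answer:

```python
import pandas as pd

# Shortened version of the question's data, with the duplicated 'name' level.
index = pd.MultiIndex.from_tuples(
    [("name1", "name1", "2020"), ("name1", "name1", "2021"),
     ("name2", "name2", "2020"), ("name2", "name2", "2021")],
    names=["name", "name", "year"],
)
df = pd.DataFrame({"quantity": [10, 15, 20, 25]}, index=index)

# Use positions, not labels: the label 'name' is ambiguous here.
same = df.index.get_level_values(0).equals(df.index.get_level_values(1))

# droplevel returns a new object, so keep the result; df itself is unchanged.
result = df.droplevel(0) if same else df
print(result)
```

Levels are accessed by position because the label 'name' matches two levels and cannot be used to select one of them.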

If you don't know the ordering of names in the index, you could find the first occurrence of name in the index:

level = df.index.names.index('name')
df.droplevel(level)
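
If several level names may be repeated, the same idea can be generalised. The sketch below uses a hypothetical helper, drop_repeated_levels (my own name, not a pandas function), and assumes that levels sharing a name also share their contents, as in the question. Unlike the snippet above, it keeps the first occurrence of each name and drops the later ones, which gives the same result when the contents are identical:

```python
import pandas as pd

def drop_repeated_levels(frame):
    """Hypothetical helper: drop index levels whose name already appeared."""
    seen = set()
    repeats = []
    for pos, name in enumerate(frame.index.names):
        # Unnamed levels (None) are never treated as repeats.
        if name is not None and name in seen:
            repeats.append(pos)
        else:
            seen.add(name)
    if not repeats:
        return frame  # nothing to drop; the input is returned as-is
    # droplevel accepts a list of level positions.
    return frame.droplevel(repeats)

index = pd.MultiIndex.from_tuples(
    [("name1", "name1", "2020"), ("name1", "name1", "2021"),
     ("name2", "name2", "2020"), ("name2", "name2", "2021")],
    names=["name", "name", "year"],
)
df = pd.DataFrame({"quantity": [10, 15, 20, 25]}, index=index)

out = drop_repeated_levels(df)
print(out)
```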

Answered By - Nick

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
