Tuesday, January 2, 2024

[FIXED] Efficiently filtering out the last row of a duplicate column

January 02, 2024 pandas, python No comments

Issue

I need to filter out the last row where col2 = 3 but preserve the rest of the dataframe.

I can do that like so, while maintaining the order relative to the index:

import pandas


d = {
     'col1': [0, 1, 2, 3, 3, 3, 3, 4, 5, 6],
     'col2': [0, 11, 21, 31, 32, 33, 34, 41, 51, 61]
    }

df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]

pandas.concat([df2,df3]).sort_index()

    col1 col2
0   0    0
1   1    11
2   2    21
3   3    31
4   3    32
5   3    33
7   4    41
8   5    51
9   6    61

But for a larger dataframe, this operation gets progressively more expensive to perform.

Is there a more efficient way?

UPDATE

Based on the answers provided this far, here are the results:

import pandas
import random

dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': [range for range in range(rows)]}
df = pandas.DataFrame(d)

df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()

df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())

df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)

df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]

df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])

df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())

df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]

df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)

703 ms ± 60.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
497 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
413 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
253 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
404 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
792 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
491 ms ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution

Also possible:

out = df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]

out:

   col1  col2
0     0     0
1     1    11
2     2    21
3     3    31
4     3    32
5     3    33
7     4    41
8     5    51
9     6    61

Answered By - Chrysophylaxs

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, January 2, 2024

[FIXED] Efficiently filtering out the last row of a duplicate column

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels