Issue
I need to filter out the last row where col2 = 3
but preserve the rest of the dataframe.
I can do that like so, while maintaining the order relative to the index:
import pandas
d = {
'col1': [0, 1, 2, 3, 3, 3, 3, 4, 5, 6],
'col2': [0, 11, 21, 31, 32, 33, 34, 41, 51, 61]
}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
pandas.concat([df2,df3]).sort_index()
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
But for a larger dataframe, this operation gets progressively more expensive to perform.
Is there a more efficient way?
UPDATE
Based on the answers provided this far, here are the results:
import pandas
import random
dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': [range for range in range(rows)]}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()
df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)
df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])
df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())
df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
703 ms ± 60.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
497 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
413 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
253 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
404 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
792 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
491 ms ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution
Also possible:
out = df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
out:
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
Answered By - Chrysophylaxs
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.