Issue
I have two dataframes: df1 with shape (34, 1151649) and df2 with shape (76, 3467). I would like to perform a pandas where + groupby operation on them, but it is quite slow. I'm wondering if there's any way to speed up the code. A sample is as follows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=[1, 2],
    columns=pd.MultiIndex.from_tuples((('a', 1), ('a', 2), ('b', 3)), names=['n1', 'n2']))
df2 = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=[0, 1, 2],
    columns=pd.Index(['a', 'c'], name='n1'))
>>> df1
n1  a     b
n2  1  2  3
1   0  1  2
2   3  4  5

>>> df2
n1  a  c
0   0  1
1   2  3
2   4  5
df1.where(df2 == 2).groupby(level=1, axis=1).sum()
The output is as follows:

n2    1    2    3
1   0.0  1.0  0.0
2   0.0  0.0  0.0
Solution
IIUC, you can use the following, which is >500x faster for large frames:
out = df1.mul(
    df2 == 2, fill_value=0
).groupby(level=1, axis=1).sum().reindex(
    index=df1.index, fill_value=0
)
On your example data:
>>> out
n2    1    2    3
1   0.0  1.0  0.0
2   0.0  0.0  0.0
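Side note: groupby(..., axis=1) is deprecated since pandas 2.1, so on newer pandas you may want an axis-free equivalent. A minimal sketch, assuming the same df1/df2 as above, transposes so the grouping runs over the index instead:

out = (
    df1.mul(df2 == 2, fill_value=0)
    .T.groupby(level='n2').sum()  # group the transposed rows by the 'n2' level
    .T.reindex(index=df1.index, fill_value=0)
)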
Speed
from string import ascii_lowercase

np.random.seed(0)
n = 20
r0, c0 = 10, 20_000
r1, c1 = 10, n
letters = list(ascii_lowercase[:n])
df1 = pd.DataFrame(
    np.random.randint(0, 10, (r0, c0)),
    columns=pd.MultiIndex.from_product([letters, range(c0 // n)], names=['n1', 'n2']))
df2 = pd.DataFrame(
    np.random.randint(0, 10, (r1, c1)), columns=pd.Index(letters, name='n1'))
Then:
orig_t = %timeit -o df1.where(df2 == 2).groupby(level=1, axis=1).sum()
# 1.6 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
okay_t = %timeit -o df1.mul(df2 == 2, fill_value=0).groupby(level=1, axis=1).sum().reindex(index=df1.index, fill_value=0)
# 2.82 ms ± 5.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> orig_t.best / okay_t.best
555.9134844909726
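As a rough intuition for the gap (my own reading, consistent with the dtype bonus below): where has to materialize a full float64 copy of df1 with NaN wherever the mask is False, and groupby.sum then pays for NaN-skipping, whereas mul by the boolean mask keeps the integer dtype and sums with no NaN handling at all. A quick check on the speed-test frames above:

# `where` forces a float64 copy (NaN needs a float dtype),
# while `mul` by the boolean mask keeps int64 here.
print(df1.where(df2 == 2).dtypes.unique())              # [float64]
print(df1.mul(df2 == 2, fill_value=0).dtypes.unique())  # [int64]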
Bonus: if df2 is "complete" (i.e., it has all the columns of level 0 of df1), then the output isn't converted to float because of NaN values.
For example, with the 20K columns example above:
>>> out.iloc[:5, :10]
n2   0   1   2   3   4   5   6   7   8   9
0    8  10   0   9   5  13   6   9  16  12
1   15  11  17   4  14  10  16   7   8  10
2   11  14   9  15  16  15  10  11  12   4
3    8   8   1   0   6   1   3   1   6   4
4    5   7   1   2   7   7   3   9   2   2
whereas, from the original code:
>>> df1.where(df2 == 2).groupby(level=1, axis=1).sum().iloc[:5, :10]
n2     0     1     2     3     4     5     6     7     8     9
0    8.0  10.0   0.0   9.0   5.0  13.0   6.0   9.0  16.0  12.0
1   15.0  11.0  17.0   4.0  14.0  10.0  16.0   7.0   8.0  10.0
2   11.0  14.0   9.0  15.0  16.0  15.0  10.0  11.0  12.0   4.0
3    8.0   8.0   1.0   0.0   6.0   1.0   3.0   1.0   6.0   4.0
4    5.0   7.0   1.0   2.0   7.0   7.0   3.0   9.0   2.0   2.0
If df2 is not "complete", one can still achieve the same outcome, with a small extra cost:
out = df1.mul(
    df2.reindex(
        index=df1.index, columns=df1.columns.unique(level=0),
        fill_value=-1
    ) == 2, fill_value=0).groupby(level=1, axis=1).sum()
On the speed test above, it is 13% slower (3.19 ms instead of 2.82 ms). This is still much faster than the original (1.6 s).
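As a quick sanity check of my own (assuming the toy df1/df2 from the "Issue" section are back in scope), the reindex variant returns the same values as the original where/groupby code, while keeping the integer dtype:

# Same values as the original where/groupby code, but integer dtype.
expected = df1.where(df2 == 2).groupby(level=1, axis=1).sum()
assert (out == expected).all().all()
print(out.dtypes.unique())  # [int64]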
Answered By - Pierre D