Issue
I have two dataframes: df1 with shape (34, 1151649) and df2 with shape (76, 3467). I would like to perform a pandas where + groupby operation on them, but it is quite slow. I'm wondering if there's any way to speed up the code. A sample is as follows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=[1, 2],
    columns=pd.MultiIndex.from_tuples((('a', 1), ('a', 2), ('b', 3)), names=['n1', 'n2']))
df2 = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=[0, 1, 2],
    columns=pd.Index(['a', 'c'], name='n1'))
>>> df1
n1  a     b
n2  1  2  3
1   0  1  2
2   3  4  5

>>> df2
n1  a  c
0   0  1
1   2  3
2   4  5
df1.where(df2 == 2).groupby(level=1, axis=1).sum()
The output is as follows:

n2    1    2    3
1   0.0  1.0  0.0
2   0.0  0.0  0.0
Solution
IIUC, you can use the following, which is >500x faster for large frames:
out = df1.mul(
    df2 == 2, fill_value=0
).groupby(level=1, axis=1).sum().reindex(
    index=df1.index, fill_value=0
)
On your example data:
>>> out
n2    1    2    3
1   0.0  1.0  0.0
2   0.0  0.0  0.0
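Side note: groupby(..., axis=1) is deprecated since pandas 2.1, so on newer pandas you may want an axis-free equivalent. A minimal sketch, assuming the same df1/df2 as above, transposes so the grouping runs over the index instead:

out = (
    df1.mul(df2 == 2, fill_value=0)
    .T.groupby(level='n2').sum()  # group the transposed rows by the 'n2' level
    .T.reindex(index=df1.index, fill_value=0)
)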
Speed
from string import ascii_lowercase

np.random.seed(0)
n = 20
r0, c0 = 10, 20_000
r1, c1 = 10, n
letters = list(ascii_lowercase[:n])
df1 = pd.DataFrame(
    np.random.randint(0, 10, (r0, c0)),
    columns=pd.MultiIndex.from_product([letters, range(c0 // n)], names=['n1', 'n2']))
df2 = pd.DataFrame(
    np.random.randint(0, 10, (r1, c1)), columns=pd.Index(letters, name='n1'))
Then:
orig_t = %timeit -o df1.where(df2 == 2).groupby(level=1, axis=1).sum()
# 1.6 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
okay_t = %timeit -o df1.mul(df2 == 2, fill_value=0).groupby(level=1, axis=1).sum().reindex(index=df1.index, fill_value=0)
# 2.82 ms ± 5.61 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> orig_t.best / okay_t.best
555.9134844909726
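As a rough intuition for the gap (my own reading, consistent with the dtype bonus below): where has to materialize a full float64 copy of df1 with NaN wherever the mask is False, and groupby.sum then pays for NaN-skipping, whereas mul by the boolean mask keeps the integer dtype and sums with no NaN handling at all. A quick check on the speed-test frames above:

# `where` forces a float64 copy (NaN needs a float dtype),
# while `mul` by the boolean mask keeps int64 here.
print(df1.where(df2 == 2).dtypes.unique())              # [float64]
print(df1.mul(df2 == 2, fill_value=0).dtypes.unique())  # [int64]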
Bonus: if df2 is "complete" (i.e., it has all the columns of level 0 of df1), then the output isn't converted to float because of NaN values.
For example, with the 20K columns example above:
>>> out.iloc[:5, :10]
n2   0   1   2   3   4   5   6   7   8   9
0    8  10   0   9   5  13   6   9  16  12
1   15  11  17   4  14  10  16   7   8  10
2   11  14   9  15  16  15  10  11  12   4
3    8   8   1   0   6   1   3   1   6   4
4    5   7   1   2   7   7   3   9   2   2
whereas, from the original code:
>>> df1.where(df2 == 2).groupby(level=1, axis=1).sum().iloc[:5, :10]
n2     0     1     2     3     4     5     6     7     8     9
0    8.0  10.0   0.0   9.0   5.0  13.0   6.0   9.0  16.0  12.0
1   15.0  11.0  17.0   4.0  14.0  10.0  16.0   7.0   8.0  10.0
2   11.0  14.0   9.0  15.0  16.0  15.0  10.0  11.0  12.0   4.0
3    8.0   8.0   1.0   0.0   6.0   1.0   3.0   1.0   6.0   4.0
4    5.0   7.0   1.0   2.0   7.0   7.0   3.0   9.0   2.0   2.0
If df2 is not "complete", one can still achieve the same outcome, with a small extra cost:
out = df1.mul(
    df2.reindex(
        index=df1.index, columns=df1.columns.unique(level=0),
        fill_value=-1
    ) == 2, fill_value=0).groupby(level=1, axis=1).sum()
On the speed test above, it is 13% slower (3.19 ms instead of 2.82 ms). This is still much faster than the original (1.6 s).
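As a quick sanity check of my own (assuming the toy df1/df2 from the "Issue" section are back in scope), the reindex variant returns the same values as the original where/groupby code, while keeping the integer dtype:

# Same values as the original where/groupby code, but integer dtype.
expected = df1.where(df2 == 2).groupby(level=1, axis=1).sum()
assert (out == expected).all().all()
print(out.dtypes.unique())  # [int64]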
Answered By - Pierre D