Issue
Consider the following solution to computing a within-group diff in Pandas:
df = df.set_index(['ticker', 'date']).sort_index()[['value']]
df['diff'] = np.nan
idx = pd.IndexSlice
for ix in df.index.levels[0]:
df.loc[ idx[ix,:], 'diff'] = df.loc[idx[ix,:], 'value' ].diff()
For:
> df
date ticker value
0 63 C 1.65
1 88 C -1.93
2 22 C -1.29
3 76 A -0.79
4 72 B -1.24
5 34 A -0.23
6 92 B 2.43
7 22 A 0.55
8 32 A -2.50
9 59 B -1.01
It returns:
> df
value diff
ticker date
A 22 0.55 NaN
32 -2.50 -3.05
34 -0.23 2.27
76 -0.79 -0.56
B 59 -1.01 NaN
72 -1.24 -0.23
92 2.43 3.67
C 22 -1.29 NaN
63 1.65 2.94
88 -1.93 -3.58
The solution does not scale well for large dataframes. It takes minutes for a dataframe with a shape (405344,2)
. This is presumably the case because I am iterating through each value for the first level in the main loop.
Is there any way of speeding this up in Pandas? Is looping through index values a good way of solving this problem? Could numba
perhaps be used for this?
Solution
Here's another way, which ought to be a lot faster.
First, sort based on ticker and date:
In [11]: df = df.set_index(['ticker', 'date']).sort_index()
In [12]: df
Out[12]:
value
ticker date
A 22 0.55
32 -2.50
34 -0.23
76 -0.79
B 59 -1.01
72 -1.24
92 2.43
C 22 -1.29
63 1.65
88 -1.93
Add the diff column:
In [13]: df['diff'] = df['value'].diff()
To fill in the NaNs, we can find the first line as follows (there may be a nicer way):
In [14]: s = pd.Series(df.index.labels[0])
In [15]: s != s.shift()
Out[15]:
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
9 False
dtype: bool
In [16]: df.loc[(s != s.shift()).values 'diff'] = np.nan
In [17]: df
Out[17]:
value diff
ticker date
A 22 0.55 NaN
32 -2.50 -3.05
34 -0.23 2.27
76 -0.79 -0.56
B 59 -1.01 NaN
72 -1.24 -0.23
92 2.43 3.67
C 22 -1.29 NaN
63 1.65 2.94
88 -1.93 -3.58
Answered By - Andy Hayden
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.