Issue
np.cumsum([1, 2, 3, np.nan, 4, 5, 6]) returns nan for every value from the first np.nan onward. Moreover, it will do the same for any generator. However, np.cumsum(df['column']) will not. What does np.cumsum(...) do, such that pandas objects are treated specially?
In [2]: df = pd.DataFrame({'column': [1, 2, 3, np.nan, 4, 5, 6]})
In [3]: np.cumsum(df['column'])
Out[3]:
0     1.0
1     3.0
2     6.0
3     NaN
4    10.0
5    15.0
6    21.0
Name: column, dtype: float64
Solution
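When you call np.cumsum(obj), NumPy first looks for a cumsum method on obj and, if it finds one, calls obj.cumsum(...) instead of doing the work itself. Only when the object has no such method (a plain list, for example) is the input converted to an array first. See this thread for details. You can also see it in the NumPy source.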
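As a rough illustration of that dispatch, the sketch below uses a made-up class (Recorder is hypothetical, not part of NumPy or pandas) that exposes its own cumsum method. It assumes np.cumsum forwards its optional arguments to that method as keywords, so the method accepts arbitrary arguments and simply records them.

```python
import numpy as np


class Recorder:
    """Hypothetical stand-in for any non-ndarray object that has cumsum()."""

    def __init__(self, values):
        self.values = list(values)
        self.calls = []  # one entry per time cumsum() is invoked

    def cumsum(self, *args, **kwargs):
        # Remember what we were called with, then compute a running total.
        self.calls.append((args, kwargs))
        total = 0
        running = []
        for value in self.values:
            total += value
            running.append(total)
        return running


rec = Recorder([1, 2, 3])
result = np.cumsum(rec)

print(result)          # whatever Recorder.cumsum returned
print(len(rec.calls))  # how many times NumPy called our method
```

If the method is called, the value np.cumsum returns is whatever the object's own cumsum produced, which is why pandas' defaults end up deciding how NaN is handled.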
The pandas method has a default of skipna=True. So np.cumsum(df) gets turned into the equivalent of df.cumsum(axis=None, skipna=True, *args, **kwargs), which skips the NaN values: each NaN stays NaN in the output, but the running total carries on past it. The NumPy function does not have a skipna option.
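To see the effect of skipna directly, a small comparison along these lines may help. The variable names are only illustrative; np.nancumsum is included as a NumPy-side alternative, with the caveat that it treats NaN as zero rather than leaving NaN in place.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, np.nan, 4, 5, 6]
s = pd.Series(values)

# A plain list has no cumsum method, so NumPy converts it to an array
# and the NaN propagates through the rest of the running total.
plain = np.cumsum(values)

# pandas default (skipna=True): NaN stays in place, the total continues.
skipping = s.cumsum()

# skipna=False makes pandas propagate NaN, matching the plain NumPy result.
propagating = s.cumsum(skipna=False)

# np.nancumsum treats NaN as zero, so no NaN appears in the output at all.
nan_as_zero = np.nancumsum(values)

print(plain)
print(skipping.to_numpy())
print(propagating.to_numpy())
print(nan_as_zero)
```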
You can also verify this yourself by overriding the pandas method with your own:
import numpy as np
import pandas as pd

class DF(pd.DataFrame):
    def cumsum(self, axis=None, skipna=True, *args, **kwargs):
        print('calling pandas cumsum')
        # pass the arguments through unchanged so that changing skipna has an effect
        return super().cumsum(axis, skipna, *args, **kwargs)

df = DF({'column': [1, 2, 3, np.nan, 4, 5, 6]})
# does calling the numpy function call your pandas method?
np.cumsum(df)
This will print
calling pandas cumsum
and return the expected result:
   column
0     1.0
1     3.0
2     6.0
3     NaN
4    10.0
5    15.0
6    21.0
You can then experiment by changing skipna=True to skipna=False and comparing the result.
Answered By - Mark