Issue
Given the following data frame:
a | b | c | d | e | sd |
---|---|---|---|---|---|
-100 | 2 | 3 | 60 | 4 | 1 |
7 | 5 | -50 | 9 | 130 | 2 |
How would I calculate the standard deviation sd
column which excludes the minimum and maximum values from each row?
The actual data frame is a few million rows long so something vectorised would be great!
To replicate:
df = pd.DataFrame(
{"a": [-100, 7], "b": [2, 5], "c": [3, -50], "d": [60, 9], "e": [4, 130]}
)
Solution
Excluding the first min/max
An approach that should be fast would be to use numpy to sort
the values per row, exclude the first and last and compute the std
:
df['sd'] = np.sort(df, axis=1)[:, 1:-1].std(axis=1, ddof=1)
Handling duplicate min/max (excluding all)
If you can have several times the same min/max and want to exclude all, then you could compute a mask:
m1 = df.ne(df.min(axis=1), axis=0)
m2 = df.ne(df.max(axis=1), axis=0)
df['sd'] = df.where(m1&m2).std(axis=1)
Output:
a b c d e sd
0 -100 2 3 60 4 1.0
1 7 5 -50 9 130 2.0
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.