Issue
So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3, 3, 2, 1], [4, 3, 6, 6, 3, 4], [7, 2, 9, 9, 2, 7]]),
                  columns=['a', 'b', 'c', 'a_select', 'b_select', 'c_select'])
df
Now, I may need to reorganize the dataframe (or use two) to accomplish this, but...
I'd like to select the 2 largest values among the '_select' columns in each row, then take the mean of the corresponding base columns.
For example, the first row would average the values from a & b, the second row a & c (NOT the values from the _select columns themselves).
Currently I'm just iterating over each row - that's simple enough, but slow with a large dataset - and I can't figure out how to do the equivalent with an apply or lambda function (or whether it's even possible).
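For reference, a sketch of the kind of row-by-row loop described above (the question doesn't show its loop; iterrows and the helper names here are assumptions, the column names come from the example frame):

# Hypothetical row-by-row baseline: for each row, find the two largest
# _select values and average the matching base columns.
means = []
for _, row in df.iterrows():
    top2 = row[['a_select', 'b_select', 'c_select']].nlargest(2)
    base_cols = [c.replace('_select', '') for c in top2.index]
    means.append(row[base_cols].mean())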
Solution
A simple one-liner using nlargest:
>>> df.filter(like='select').apply(lambda s: s.nlargest(2), 1).mean(1)
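On the example frame this should return a Series of the per-row means of the two largest _select values, along the lines of:

0    2.5
1    5.0
2    8.0
dtype: float64

Note that this (and the numpy version below) averages the _select values themselves; the argsort snippet further down pulls in the matching a, b, c values, which is what the question asks for.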
For performance, numpy may be useful:
>>> np.sort(df.filter(like='select').to_numpy(), 1)[:, -2:].mean(1)
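Since only the two largest values per row are needed, a partial sort can avoid ordering every element. This np.partition variant is an addition here (not part of the original answer; sel is just an illustrative name), but it gives the same values on the example:

>>> sel = df.filter(like='select').to_numpy()
>>> np.partition(sel, sel.shape[1] - 2, axis=1)[:, -2:].mean(1)
array([2.5, 5. , 8. ])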
To get the values from the base columns (a, b, c), use argsort:
>>> arr = df.filter(like='select').to_numpy()
>>> df[['a', 'b', 'c']].to_numpy()[[[x] for x in np.arange(len(arr))],
                                   np.argsort(arr, 1)][:, -2:].mean(1)
array([1.5, 5. , 8. ])
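A possibly more readable way to write the same indexing step is np.take_along_axis (an alternative added here, assuming NumPy 1.15+; vals is just an illustrative name). It produces the same result:

>>> vals = df[['a', 'b', 'c']].to_numpy()
>>> np.take_along_axis(vals, np.argsort(arr, 1), axis=1)[:, -2:].mean(1)
array([1.5, 5. , 8. ])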
Answered By - rafaelc