Issue
Let's say a user can input the columns and values to compare for a DF, so we can have:
column_list = ['col1', 'col2', 'col3']
value_list = [val1, val2, val3]
So to select the rows that satisfy where col1 >= val1 AND col2 >= val2 AND col3 >= val3 we would write:
selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))
or it can be in the form:
selection = df.loc[(df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3)]
The number of columns is not known in advance, so we can have n columns. We can try this approach:
if n=1:
selection = (df['col1'] >= val1))
elif n=2:
selection = (df['col1'] >= val1) & (df['col2'] >= val2))
elif n=3:
selection = (df['col1'] >= val1) & (df['col2'] >= val2) & (df['col3'] >= val3))
But this is neither scalable nor efficient. I tried by generating strings "df['col<>'] >= val<>)" with a foor loop given the input lists but it didn't work for Pandas because of the str format.
What would be the best pythonic approach for this? To avoid having all the options with if and else statements.
Thank you in advance!
Solution
To perform a comparison with the same operator for all columns, create a Series with the values and columns ids and use it to perform an aligned comparison with the dataframe:
df[df.gt(pd.Series(value_list, index=column_list)).all(1)]
Example input:
>>> value_list
[3, 7, 11]
>>> df
col1 col2 col3
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
output:
col1 col2 col3
4 12 13 14
intermediates:
>>> pd.Series(value_list, index=column_list)
col1 3
col2 7
col3 11
>>> df.gt(pd.Series(value_list, index=column_list))
col1 col2 col3
0 False False False
1 False False False
2 True False False
3 True True False
4 True True True
>>> df.gt(pd.Series(value_list, index=column_list)).all(1)
0 False
1 False
2 False
3 False
4 True
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.