Friday, February 4, 2022

[FIXED] Comparing pandas dataframes based on row values and a constant

February 04, 2022 dataframe, pandas, python, python-3.x No comments

Issue

I have 2 pandas dataframes:

df_test_values

column1 | col_values
0       | 0.99        
3       | 0.21       
1       | 0.55

df_reference_values

column1 | column2  | column3 ..... column100
0       | 0.80     |  0.21   |
3       | 0.22     |  0.11   |
1       | 0.11     |  0.33   |

Considering I have a constant value permitted_deviation and column1 is the same in both dataframes (same values and number of rows)

How can I compare all rows in df_test_values's col_values against a selected column in df_reference_values and determine how many of its values are within my variable permitted_deviation?

Input would be column_name for df_reference_values and permitted_deviation

An output could be something like this: "column_name has 100 values within the rage permitted_deviation"

Any inputs/ideas are appreciated.

Solution

This first suggestion is based on inferring that you want to evaluate each column individually against your permitted deviation, and not all the columns in aggregate.

(df_test_values["col_values"] - df_reference_values["column2"]).abs().apply(lambda x: 1 if x < permitted_deviation else 0).sum()

It uses the apply function, which can have performance issues for large dataframes, but provides a one-line way to count the number of records that meet your condition. I included absolute value on the deviation, figuring you care about the magnitude of the difference and not the direction, so adjust accordingly as necessary.

This also assumes that the rows in your 2 dataframes have the same number of rows (which you indicated), and that the rows they are in the exact same order. It looks like "column1" might be an index that identifies the corresponding rows, and if so you would need to make sure that they are sorted in the same order.

Here's another variation that performs the comparison and count on each column individually, but all at the same time, and it also makes sure that both original dataframes are sorted by column1, assuming that is suitable as an index for the rows:

df_reference_values.set_index("column1").sort_index() \
    .subtract(df_test_values.set_index("column1").sort_index()["col_values"], axis="index") \
    .abs() \
    .where(lambda x: x < permitted_deviation) \
    .count()

Here's another variant that performs the comparison across all of the columns, so that within each row all the columns must meet your criteria for the row to be counted as "passing":

df_reference_values.set_index("column1").sort_index() \
    .subtract(df_test_values.set_index("column1").sort_index()["col_values"], axis="index") \
    .abs() \
    .where(lambda x: x < permitted_deviation) \
    .dropna() \
    .shape[0]

Answered By - BioData41

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, February 4, 2022

[FIXED] Comparing pandas dataframes based on row values and a constant

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels