Monday, June 6, 2022

[FIXED] Pandas: merge_asof with conditions from other columns


Issue

I have a toy example as follows

I would like to merge the actions column from rules into the original df. The merging conditions are as follows:

(values >= lower) & (values < upper)
each date in df must match the nearest previous date in rules

The expected output is shown in the figure above. Here are df and rules:

df = pd.DataFrame({"date": ["2022-05-15", "2022-05-20", "2022-05-25", "2022-05-30"],
                   "values": [10, 20, 30, 80]})
df["date"] = pd.to_datetime(df["date"])

rules = pd.DataFrame({"lower": [0, 25, 50, 75, 0],
                      "upper": [25, 50, 75, float("inf"), 25],
                      "actions": [5, 10, 15, 20, 8],
                      "date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-01", "2022-05-18"]})
rules["date"] = pd.to_datetime(rules["date"])

Can anyone suggest an efficient way to do this?

As an alternative, I am trying to solve this with pandasql, because this kind of join is easy to express in SQL. Here is my code:

from pandasql import sqldf

sql = """SELECT DISTINCT on (df.date)
             df.date,
             df.values,
             rules.actions
         FROM df
         LEFT JOIN rules
         ON (df.date > rules.date) AND (df.values >= rules.lower) AND (df.values < rules.upper)
         ORDER BY df.date, rules.date DESC"""

pysqldf = lambda x: sqldf(x)
pysqldf(sql)

Although this SQL statement works in PostgreSQL, it fails when I run it with pandasql. I get the following error:

PandaSQLException: (sqlite3.OperationalError) near "on": syntax error
[SQL: SELECT DISTINCT on (df.date)
             df.date,
             df.values,
             rules.actions
         FROM df
         LEFT JOIN rules
         ON (df.date > rules.date) AND (df.values >= rules.lower) AND (df.values < rules.upper)
         ORDER BY df.date, rules.date DESC]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

Did I overlook something?

Solution
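
On the pandasql error first: the traceback shows that pandasql runs the query through SQLite (sqlite3.OperationalError), and SELECT DISTINCT ON (...) is a PostgreSQL extension that SQLite does not parse. Below is a minimal sketch of an SQLite-compatible rewrite on the toy data from the question. It is an illustration under stated assumptions, not part of the original answer: it uses the standard-library sqlite3 module with DataFrame.to_sql instead of pandasql, stores the dates as ISO strings so that they compare correctly as text, quotes "values" because VALUES is an SQL keyword, and replaces DISTINCT ON with a correlated subquery that picks the latest matching rule.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"date": ["2022-05-15", "2022-05-20", "2022-05-25", "2022-05-30"],
                   "values": [10, 20, 30, 80]})
df["date"] = pd.to_datetime(df["date"])

rules = pd.DataFrame({"lower": [0, 25, 50, 75, 0],
                      "upper": [25, 50, 75, float("inf"), 25],
                      "actions": [5, 10, 15, 20, 8],
                      "date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-01", "2022-05-18"]})
rules["date"] = pd.to_datetime(rules["date"])

# Assumption: an in-memory SQLite database stands in for pandasql here.
con = sqlite3.connect(":memory:")

# Write the dates as ISO strings; they then sort and compare correctly as text.
df.assign(date=df["date"].dt.strftime("%Y-%m-%d")).to_sql("df", con, index=False)
rules.assign(date=rules["date"].dt.strftime("%Y-%m-%d")).to_sql("rules", con, index=False)

# For each row of df, look up the action of the most recent earlier rule
# whose [lower, upper) band contains the value. Rows without a match get NULL,
# which mirrors the LEFT JOIN in the original query.
sql = """
SELECT df.date,
       df."values",
       (SELECT rules.actions
          FROM rules
         WHERE rules.date < df.date
           AND df."values" >= rules.lower
           AND df."values" < rules.upper
         ORDER BY rules.date DESC
         LIMIT 1) AS actions
  FROM df
 ORDER BY df.date
"""

out = pd.read_sql_query(sql, con)
con.close()
print(out)
```

The same query text should also be usable with pandasql, since it relies only on syntax that SQLite accepts, but that has not been verified here.
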

One option is conditional_join from pyjanitor. After the join, sort by the gap between the two dates and use groupby to keep, for each date, the row with the smallest gap (that is, the nearest previous rule):

# install from dev 
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

(df
.astype({'values':float})
.conditional_join(
    rules.astype({'lower':float}), 
    # pass the conditions as a variable arguments of tuples 
    ('values', 'lower', '>='), 
    ('values', 'upper', '<'), 
    ('date', 'date', '>'),
    # select required columns with df_columns, and right_columns
    df_columns = ['date','values'], 
    right_columns={'actions':'actions', 'date':'date_right'})
# get the difference and keep the smallest days
.assign(dff = lambda df: df.date.sub(df.date_right))
.sort_values(['date', 'dff'])
.drop(columns = ['dff', 'date_right'])
.groupby('date', sort = False, as_index = False)
.nth(0)
)

        date  values  actions
0 2022-05-15    10.0        5
3 2022-05-20    20.0        8
4 2022-05-25    30.0       10
5 2022-05-30    80.0       20
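
If installing pyjanitor is not an option, the same three conditions can be sketched in plain pandas with a cross join followed by a filter. This is an alternative illustration, not the method from the answer above, and it assumes pandas 1.2 or later (for how="cross") and inputs small enough that the full Cartesian product fits in memory. The names pairs, mask and date_rule are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2022-05-15", "2022-05-20", "2022-05-25", "2022-05-30"],
                   "values": [10, 20, 30, 80]})
df["date"] = pd.to_datetime(df["date"])

rules = pd.DataFrame({"lower": [0, 25, 50, 75, 0],
                      "upper": [25, 50, 75, float("inf"), 25],
                      "actions": [5, 10, 15, 20, 8],
                      "date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-01", "2022-05-18"]})
rules["date"] = pd.to_datetime(rules["date"])

# Pair every row of df with every rule; the rule's date becomes "date_rule".
pairs = df.merge(rules, how="cross", suffixes=("", "_rule"))

# Keep only the pairs that satisfy all three merge conditions.
mask = (
    (pairs["values"] >= pairs["lower"])
    & (pairs["values"] < pairs["upper"])
    & (pairs["date"] > pairs["date_rule"])
)

# For each date in df, keep the pair with the latest (nearest previous) rule date.
result = (
    pairs[mask]
    .sort_values(["date", "date_rule"])
    .drop_duplicates("date", keep="last")
    .loc[:, ["date", "values", "actions"]]
    .reset_index(drop=True)
)
print(result)
```

On the toy data this yields actions 5, 8, 10 and 20 for the four dates. Unlike a left join, rows of df with no matching rule are dropped by the filter; merging result back onto df with how="left" would restore them with missing actions.
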

Answered By - sammywemmy

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
