Issue
I have a toy example as follows
I would like to merge the actions column from rules into the original df. The merge conditions are:
- (values >= lower) & (values < upper)
- each date in df must match the nearest previous date in rules
The expected output is shown in the above figure. Here is the df and rules
df = pd.DataFrame({"date": ["2022-05-15", "2022-05-20", "2022-05-25", "2022-05-30"],
                   "values": [10, 20, 30, 80]})
df["date"] = pd.to_datetime(df["date"])
rules = pd.DataFrame({"lower": [0, 25, 50, 75, 0],
                      "upper": [25, 50, 75, float("inf"), 25],
                      "actions": [5, 10, 15, 20, 8],
                      "date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-01", "2022-05-18"]})
rules["date"] = pd.to_datetime(rules["date"])
Can anyone suggest an effective way to do this?
I'm also trying to solve this problem another way, using pandasql, because this kind of join is easy to express in SQL. Here is my code:
from pandasql import sqldf
sql = """SELECT DISTINCT on (df.date)
df.date,
df.values,
rules.actions
FROM df
LEFT JOIN rules
ON (df.date > rules.date) AND (df.values >= rules.lower) AND (df.values < rules.upper)
ORDER BY df.date, rules.date DESC"""
pysqldf = lambda x: sqldf(x)
pysqldf(sql)
Although the SQL statement works in PostgreSQL, it fails when I run it with pandasql. I get the following error:
PandaSQLException: (sqlite3.OperationalError) near "on": syntax error
[SQL: SELECT DISTINCT on (df.date)
df.date,
df.values,
rules.actions
FROM df
LEFT JOIN rules
ON (df.date > rules.date) AND (df.values >= rules.lower) AND (df.values < rules.upper)
ORDER BY df.date, rules.date DESC]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
Did I overlook something?
Solution
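One option is conditional_join from pyjanitor. After the merge, sort by the gap between the two dates and use a groupby to keep, for each date in df, the row with the smallest gap (that is, the nearest previous date in rules):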
# install from dev
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
 .astype({'values': float})
 .conditional_join(
     rules.astype({'lower': float}),
     # pass the conditions as variable arguments, each one a tuple of
     # (left column, right column, comparison operator)
     ('values', 'lower', '>='),
     ('values', 'upper', '<'),
     ('date', 'date', '>'),
     # select the required columns with df_columns and right_columns
     df_columns=['date', 'values'],
     right_columns={'actions': 'actions', 'date': 'date_right'})
 # compute the gap between the two dates and sort so that the smallest gap
 # comes first within each date
 .assign(dff=lambda df: df.date.sub(df.date_right))
 .sort_values(['date', 'dff'])
 .drop(columns=['dff', 'date_right'])
 # keep the first row per date, i.e. the nearest previous rule
 .groupby('date', sort=False, as_index=False)
 .nth(0)
)
        date  values  actions
0 2022-05-15    10.0        5
3 2022-05-20    20.0        8
4 2022-05-25    30.0       10
5 2022-05-30    80.0       20
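If installing pyjanitor is not an option, the same two steps (a conditional match, then keeping the nearest previous rule per date) can be sketched in plain pandas. This is only an illustration, not the answer's method: it assumes pandas 1.2 or later for how="cross", and the names candidates, date_rule, and result are made up for this example. The cross join builds len(df) * len(rules) rows, so it is only reasonable for small inputs.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2022-05-15", "2022-05-20", "2022-05-25", "2022-05-30"],
                   "values": [10, 20, 30, 80]})
df["date"] = pd.to_datetime(df["date"])
rules = pd.DataFrame({"lower": [0, 25, 50, 75, 0],
                      "upper": [25, 50, 75, float("inf"), 25],
                      "actions": [5, 10, 15, 20, 8],
                      "date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-01", "2022-05-18"]})
rules["date"] = pd.to_datetime(rules["date"])

# Pair every row of df with every rule; the rule's date becomes "date_rule".
candidates = df.merge(rules, how="cross", suffixes=("", "_rule"))

# Keep only the pairs that satisfy all three merge conditions.
mask = (
    (candidates["values"] >= candidates["lower"])
    & (candidates["values"] < candidates["upper"])
    & (candidates["date"] > candidates["date_rule"])
)

result = (
    candidates[mask]
    # Latest rule date first within each df date ...
    .sort_values(["date", "date_rule"], ascending=[True, False])
    # ... so the first row per date is the nearest previous rule.
    .drop_duplicates("date")
    [["date", "values", "actions"]]
    .reset_index(drop=True)
)
print(result)
```

Unlike a LEFT JOIN, this filter drops any row of df that matches no rule, and like the groupby on date above it assumes the dates in df are unique.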
Answered By - sammywemmy
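On the original error: pandasql runs queries on SQLite by default (note the sqlite3.OperationalError in the traceback), and DISTINCT ON is a PostgreSQL extension that SQLite does not support. A common substitute is a ROW_NUMBER() window function. The sketch below uses the standard sqlite3 module directly rather than pandasql, so it makes no claim about pandasql's own behavior. It assumes an SQLite build with window functions (3.25 or later), keeps the dates as ISO strings so they compare correctly as text, and quotes "values" because VALUES is an SQL keyword. The aliases d, r, and rn are made up for this example.

```python
import sqlite3
import pandas as pd

# Dates are kept as ISO-8601 strings, which sort correctly as plain text.
df = pd.DataFrame({"date": ["2022-05-15", "2022-05-20", "2022-05-25", "2022-05-30"],
                   "values": [10, 20, 30, 80]})
rules = pd.DataFrame({"lower": [0, 25, 50, 75, 0],
                      "upper": [25, 50, 75, float("inf"), 25],
                      "actions": [5, 10, 15, 20, 8],
                      "date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-01", "2022-05-18"]})

conn = sqlite3.connect(":memory:")
df.to_sql("df", conn, index=False)
rules.to_sql("rules", conn, index=False)

# ROW_NUMBER() ranks the matching rules per df date, newest rule first;
# keeping rn = 1 plays the role of PostgreSQL's DISTINCT ON (df.date).
query = """
SELECT date, "values", actions
FROM (
    SELECT d.date AS date,
           d."values" AS "values",
           r.actions AS actions,
           ROW_NUMBER() OVER (PARTITION BY d.date ORDER BY r.date DESC) AS rn
    FROM df AS d
    LEFT JOIN rules AS r
      ON d.date > r.date
     AND d."values" >= r.lower
     AND d."values" < r.upper
)
WHERE rn = 1
ORDER BY date
"""
out = pd.read_sql_query(query, conn)
conn.close()
print(out)
```

Because this keeps the LEFT JOIN, a row of df with no matching rule would still appear, with a missing value in actions.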