Issue
I want to conditionally fill the missing and update the value from another dataframe.
I want to fill missing and update the data on column values in dataframe smalldf.
The condition is, if the value in B column (large df) is in the range of columns Range_FROM and Range_TO in (small df). Always choose the minimum records in (largedf) to fill or update the values in smalldf.
- For example, in the second record of smalldf values (3), as there is a smaller value (1.3) in the largedf, it would end up being updated by a smaller value (1.3).
- Another example, the third record of smalldf (Nan) is being filled by 1.6, as it is the smallest value within the range of Range_FROM (1.5) and Range_TO (1.6)
This is the first dataframe (smalldf):
RoadNo | Range_FROM | Range_TO | values |
---|---|---|---|
A001 | 1.15 | 1.2 | 0.1 |
A001 | 1.35 | 1.4 | 3 |
A001 | 1.55 | 1.6 | Nan |
A001 | 1.75 | 1.8 | 0.1 |
A001 | 1.9 | 2 | Nan |
This is the second dataframe (largedf):
RoadNo | B | values |
---|---|---|
A001 | 1.1 | 0.2 |
A001 | 1.2 | 0.1 |
A001 | 1.3 | 1.9 |
A001 | 1.4 | 1.3 |
A001 | 1.5 | 1.6 |
A001 | 1.6 | 1.9 |
A001 | 1.7 | 0.2 |
A001 | 1.8 | 0.1 |
A001 | 1.9 | 1.9 |
A001 | 2 | 1.3 |
Below is the expected dataframe:
RoadNo | Range_FROM | Range_TO | values |
---|---|---|---|
A001 | 1.15 | 1.2 | 0.1 |
A001 | 1.35 | 1.4 | 1.3 |
A001 | 1.55 | 1.6 | 1.6 |
A001 | 1.75 | 1.8 | 0.1 |
A001 | 1.9 | 2 | 1.3 |
Below is the code to create the two dataframes:
smalldf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001"],
"Range_FROM": [1.15, 1.35, 1.55, 1.75, 1.9],
"Range_TO":[1.2, 1.4, 1.6, 1.8, 2],
"values": [0.1, 0.25, "Nan", 0.1, "Nan" ]})
largedf = pd.DataFrame(data={"RoadNo":["A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001", "A001"],
"B": [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2],
"values": [0.2, 0.1, 1.9, 1.3, 1.6, 1.9, 0.2, 0.1, 1.9, 1.3]})
Please keep in mind that there are other RoadNo (e.g. A002, A003).
Both left join and fillna(df) in pandas did not produce my desired result. Is there a function in Pandas or SQL that can help me with this operation?
Thank you very much!
Solution
In SQL your problem would look like:
SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2.values)
FROM df1
LEFT JOIN df2
ON df1.RoadNo = df2.RoadNo
AND df2.B >= df1.Range_FROM
AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO
So we can do:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values"
FROM df1
LEFT JOIN df2
ON df1.RoadNo = df2.RoadNo
AND df2.B > df1.Range_LAG
AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO'''
pysqldf(q)
df = pysqldf(q)
print(df)
Output:
RoadNo Range_FROM Range_TO values
0 A001 1.15 1.2 0.1
1 A001 1.35 1.4 1.3
2 A001 1.55 1.6 1.9
3 A001 1.75 1.8 0.1
4 A001 1.90 2.0 1.3
But since you actually want something a little different, let's try:
df1['Range_LAG'] = df1.Range_TO.shift(fill_value=0)
q = '''SELECT df1.RoadNo, df1.Range_FROM, df1.Range_TO, MIN(df2."values") AS "values"
FROM df1
LEFT JOIN df2
ON df1.RoadNo = df2.RoadNo
AND df2.B > df1.Range_LAG
AND df2.B <= df1.Range_TO
GROUP BY df1.RoadNo, df1.Range_FROM, df1.Range_TO'''
df = pysqldf(q)
print(df)
Output:
RoadNo Range_FROM Range_TO values
0 A001 1.15 1.2 0.1
1 A001 1.35 1.4 1.3
2 A001 1.55 1.6 1.6
3 A001 1.75 1.8 0.1
4 A001 1.90 2.0 1.3
Answered By - BeRT2me
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.