Issue
I have a dataframe that looks something like this:
df = pd.DataFrame([1,'A','X','1/3/22 12:00:00AM','1/1/22 12:00:00 AM','1/2/22 12:00:00 AM'],
[1,'A','X','1/4/22 12:00:00AM','1/3/22 12:00:00 AM','1/3/22 12:00:00 AM'],
[1,'A','Y','1/3/22 12:00:00AM','1/2/22 12:00:00 AM','1/1/22 12:00:00 AM'],
[1,'B','X','1/3/22 12:00:00AM','1/2/22 12:00:00 AM','1/3/22 12:00:00 AM'],
[2,'A','X','1/5/22 12:00:00AM','1/3/22 12:00:00 AM','1/4/22 12:00:00 AM'],
[2,'A','X','1/6/22 12:00:00AM','1/4/22 12:00:00 AM','1/5/22 12:00:00 AM']],
columns = ['ID','Category','Site','Task Completed','Access Completed', 'Upload Completed'])
ID | Category | Site | Task Completed | Access Completed | Upload Completed |
---|---|---|---|---|---|
1 | A | X | 1/3/22 12:00:00AM | 1/1/22 12:00:00 AM | 1/2/22 12:00:00 AM |
1 | A | X | 1/4/22 12:00:00AM | 1/3/22 12:00:00 AM | 1/3/22 12:00:00 AM |
1 | A | Y | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM | 1/1/22 12:00:00 AM |
1 | B | X | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM | 1/3/22 12:00:00 AM |
2 | A | X | 1/5/22 12:00:00AM | 1/3/22 12:00:00 AM | 1/4/22 12:00:00 AM |
2 | A | X | 1/6/22 12:00:00AM | 1/4/22 12:00:00 AM | 1/5/22 12:00:00 AM |
What I want to find is the time difference (in hours) between the later of max access completed date or min upload completed date and the first Task Completed date for every ID/Category/Site combination within the dataset.
In plain text, the logic is (the later date of (max access completed date vs min upload completed date)) - (min/first task completed date).
Based on this dataset, my intended result would look like:
ID | Category | Site | Time Difference | First Task Completed | Max Access Date | Min Upload Date |
---|---|---|---|---|---|---|
1 | A | X | 0 | 1/3/22 12:00:00AM | 1/3/22 12:00:00 AM | 1/2/22 12:00:00 AM |
1 | A | Y | 24 | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM | 1/1/22 12:00:00 AM |
1 | B | X | 0 | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM | 1/3/22 12:00:00 AM |
2 | A | X | 24 | 1/5/22 12:00:00AM | 1/4/22 12:00:00 AM | 1/4/22 12:00:00 AM |
I know how to get the max access date minus the first task completed date:
out = (df
.groupby(['ID', 'Category', 'Site'], as_index=False)
.agg({'Task Completed': 'first', 'Access Completed': 'max'})
.assign(**{'Time Difference': lambda d: d['Task Completed']
.sub(d['Access Completed'])
.dt.total_seconds().floordiv(3600)})
)
Which has output:
ID Category Site Task Completed Access Completed Time Difference
0 1 A X 2022-01-03 00:00:00 2022-01-02 24.0
1 1 A Y 2022-01-01 01:00:00 2022-01-01 1.0
2 1 B X 2022-01-01 01:00:00 2022-01-01 1.0
3 2 A X 2022-01-03 00:00:00 2022-01-02 24.0
But I don't know how to incorporate min upload completed date.
Solution
Try:
df["Task Completed"] = pd.to_datetime(
df["Task Completed"], format="%m/%d/%y %H:%M:%S%p"
)
df["Access Completed"] = pd.to_datetime(
df["Access Completed"], format="%m/%d/%y %H:%M:%S %p"
)
df["Upload Completed"] = pd.to_datetime(
df["Upload Completed"], format="%m/%d/%y %H:%M:%S %p"
)
out = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
{
"Task Completed": "first",
"Access Completed": "max",
"Upload Completed": "min",
}
)
out["Time Difference"] = np.where(
(out["Access Completed"] - out["Upload Completed"]) > pd.Timedelta(0),
(out["Access Completed"] - out["Task Completed"]).abs().dt.total_seconds()
/ 3600,
(out["Upload Completed"] - out["Task Completed"]).abs().dt.total_seconds()
/ 3600,
)
print(out)
Prints:
ID | Category | Site | Task Completed | Access Completed | Upload Completed | Time Difference | |
---|---|---|---|---|---|---|---|
0 | 1 | A | X | 2022-01-03 12:00:00 | 2022-01-03 12:00:00 | 2022-01-02 12:00:00 | 0 |
1 | 1 | A | Y | 2022-01-03 12:00:00 | 2022-01-02 12:00:00 | 2022-01-01 12:00:00 | 24 |
2 | 1 | B | X | 2022-01-03 12:00:00 | 2022-01-02 12:00:00 | 2022-01-03 12:00:00 | 0 |
3 | 2 | A | X | 2022-01-05 12:00:00 | 2022-01-04 12:00:00 | 2022-01-04 12:00:00 | 24 |
Answered By - Andrej Kesely
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.