Issue
For the given dataset below, I am able to calculate "distinct" visits to a hotel, which is defined as any visit where the traveler arrives at least one full calendar day (i.e., 2 or more days) since a previous departure. Any arrival that is less than one full calendar day since the previous visit will be merged with the previous visit. The negative days_between_visits
are simulating some data errors that are found throughout the dataset.
I get the new start date by running: df["NewStartDate"] = df.groupby((df["days_between_visits"]>=2).cumsum().shift().fillna(0))["arrival_date "].transform("min")
and get the new end date by running: df["NewEndDate"] = df.groupby((df["days_between_visits"]>=2).cumsum().shift().fillna(0))["departure_date "].transform("max")
Check here for additional info on NewStartDate
and NewEndDate
were calculated.
ID | arrival_date | departure_date | days_between_visits |
---|---|---|---|
1 | 2010-04-01 | 2010-04-02 | 0 |
1 | 2010-04-02 | 2010-04-02 | 1 |
1 | 2010-04-03 | 2010-04-03 | 1 |
1 | 2010-04-04 | 2010-04-04 | 1 |
1 | 2010-04-05 | 2010-04-05 | 15 |
1 | 2010-04-20 | 2010-04-21 | -1 |
1 | 2010-04-20 | 2010-04-22 | -2 |
1 | 2010-04-20 | 2010-04-27 | 39 |
1 | 2010-06-05 | 2010-06-05 | 1 |
1 | 2010-06-06 | 2010-06-06 | 2 |
1 | 2010-06-08 | 2010-09-22 | 375 |
1 | 2011-10-01 | 2011-10-02 | 0 |
1 | 2011-10-02 | 2011-10-04 | 1 |
1 | 2011-10-05 | 2011-10-06 | 87 |
1 | 2012-01-01 | 2012-01-02 |
The problem: I don't understand what is happening on the conditional groupby
with the .cumsum()
. The conditional groupby
(on its own) returns two groups (i.e., those with <2 days between visits (i.e., False
group) and those with >=2 days between visits (i.e., True
group). When adding the .cumsum()
to the process, we get a cumulative sum on the days_between_visits
within each group BUT only after new groups are created, which are split where the condition in the groupby
is True
(i.e., each row where days_between_visits >= 2
is the start of a new group). Can somebody explain what is happening here and why the .cumsum()
is resulting in more groups?
Final result:
ID | arrival_date | departure_date | days_between_visits | NewStartDate | NewEndDate |
---|---|---|---|---|---|
1 | 2010-04-01 | 2010-04-02 | 0 | 2010-04-01 | 2010-04-05 |
1 | 2010-04-02 | 2010-04-02 | 1 | 2010-04-01 | 2010-04-05 |
1 | 2010-04-03 | 2010-04-03 | 1 | 2010-04-01 | 2010-04-05 |
1 | 2010-04-04 | 2010-04-04 | 1 | 2010-04-01 | 2010-04-05 |
1 | 2010-04-05 | 2010-04-05 | 15 | 2010-04-01 | 2010-04-05 |
1 | 2010-04-20 | 2010-04-21 | -1 | 2010-04-20 | 2010-04-27 |
1 | 2010-04-20 | 2010-04-22 | -2 | 2010-04-20 | 2010-04-27 |
1 | 2010-04-20 | 2010-04-27 | 39 | 2010-04-20 | 2010-04-27 |
1 | 2010-06-05 | 2010-06-05 | 1 | 2010-06-05 | 2010-06-06 |
1 | 2010-06-06 | 2010-06-06 | 2 | 2010-06-05 | 2010-06-06 |
1 | 2010-06-08 | 2010-09-22 | 375 | 2010-06-08 | 2010-09-22 |
1 | 2011-10-01 | 2011-10-02 | 0 | 2011-10-01 | 2011-10-06 |
1 | 2011-10-02 | 2011-10-04 | 1 | 2011-10-01 | 2011-10-06 |
1 | 2011-10-05 | 2011-10-06 | 87 | 2011-10-01 | 2011-10-06 |
1 | 2012-01-01 | 2012-01-02 | 2012-01-01 | 2012-01-02 |
Solution
Let's do this piece by piece: (df["days_between_visits"]>=2)
returns a boolean series. We treat this series as an int64 because pandas
allows for some integer operations to be performed on boolean series (with False == 0
and True == 1
).
When we perform the .cumsum()
on the boolean series, we get the following:
However, because the True
values are indications of the last row for a given visit to the hotel (i.e., >=2 days until the next visit visit), we need to include the row that contains the True
(i.e., 1
) value into the visit by shifting (i.e., shift()
) the data down one row producing the following:
Now we can fillna(0)
and we have our distinct groups. We can now groupby this newly created series and perform our transform functions on the arrival_date
and departure_date
as needed.
Answered By - adm-gis
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.