Issue
I have a pandas DataFrame as shown below. The columns are date, color, time, and duration (in seconds). I need to calculate the amount of time throughout a day that we are displaying a color.
date color start time duration(seconds)
2021-07-06 RED 11:00:00.00 5
2021-07-06 RED 11:00:00.00 9
2021-07-06 BLUE 11:00:00.00 3
2021-07-06 RED 11:00:00.00 3
2021-07-06 BLUE 12:00:00.00 10
2021-07-06 BLUE 12:00:00.00 7
2021-07-06 RED 12:00:00.00 9
2021-07-06 BLUE 12:00:00.00 5
2021-07-06 RED 12:00:00.00 1
2021-07-06 RED 12:00:00.00 2
For example, in a 24 hour day I need to understand how long we are displaying a color. There will be a variable number of colors each day, staggered start times, and varying durations.
If we're looking at the color red in the example above, the duration of displaying red would be 18 seconds. We don't double count any display overlaps.
My desired output would be a DataFrame which tells me how long each color was displayed for, and how long all colors were displayed. The maximum amount of time for each color, or any color can only be 24 hours. For the example above, the answer would be:
Red Duration: 18 seconds
Blue Duration: 13 seconds
Total Duration: 19 seconds
How would I go about doing this?
Solution
Each row in your data corresponds to an interval. For any colour, how many intervals overlap a point in time is a step function. A package called staircase has been built upon pandas and numpy for analysis with step functions.
setup
I'm going to use different data, to highlight that this approach works with staggered start times (which your example doesn't have). I'm also combining date and start time to simplify
df = pd.DataFrame(
{
"color":["RED", "RED","BLUE","RED","BLUE","BLUE","RED","BLUE","RED","RED"],
"start":[
pd.Timestamp("2021-07-06 11:00:00"),
pd.Timestamp("2021-07-06 11:00:02"),
pd.Timestamp("2021-07-06 11:00:01"),
pd.Timestamp("2021-07-06 11:00:04"),
pd.Timestamp("2021-07-06 11:00:05"),
pd.Timestamp("2021-07-06 11:00:05"),
pd.Timestamp("2021-07-06 11:00:08"),
pd.Timestamp("2021-07-06 11:00:07"),
pd.Timestamp("2021-07-06 11:00:12"),
pd.Timestamp("2021-07-06 11:00:14"),
],
"duration":[5,4,7,3,4,6,4,7,2,3]
}
)
df["end"] = df["start"] + pd.to_timedelta(df["duration"], "s")
solution
We'll create a step function for each colour. A step function is represented by the staircase.Stairs clas
s. This class is to staircase as Series is to pandas. To do this we group the dataframe on this variable and pass the sub-dataframes to the Stairs constructor.
import staircase as sc
stepfunctions = df.groupby("color").apply(sc.Stairs, "start", "end")
Our stepfunctions
variable looks like this. It is a series, with a multi-index, and values are Stairs
objects.
color
BLUE <staircase.Stairs, id=2640555912520>
RED <staircase.Stairs, id=2640555814408>
dtype: object
We can add these step functions together to create one for the total
stepfunctions["TOTAL"] = stepfunctions.sum()
There are many things you can do with step functions in staircase including plotting.
stepfunctions["RED"].plot(style="hlines")
Since you do not want to count overlaps, we need to convert each step function to a boolean valued one - i.e. if it is non zero set it to 1.
stepfunctions = stepfunctions.apply(sc.Stairs.make_boolean)
Now plotting the step function for RED again gives us
What you are wanting to do is create bins and integrate (find the area under) these step functions. For a step function sf this means the following calculation. Let's say your bins are 5 second bins, (as opposed to daily), just to make it interesting
bins = pd.date_range(pd.Timestamp("2021-07-06 11:00:00"), freq = "5s", periods=4)
bin_sums = stepfunctions.apply(lambda sf: sf.slice(bins).integral())
This gives you a dataframe, indexed by color, with columns for each bin. We'll melt it to display it better
bin_sums.melt(var_name="bin", ignore_index=False)
color bin value
BLUE [2021-07-06 11:00:00, 2021-07-06 11:00:05) 0 days 00:00:04
RED [2021-07-06 11:00:00, 2021-07-06 11:00:05) 0 days 00:00:05
TOTAL [2021-07-06 11:00:00, 2021-07-06 11:00:05) 0 days 00:00:05
BLUE [2021-07-06 11:00:05, 2021-07-06 11:00:10) 0 days 00:00:05
RED [2021-07-06 11:00:05, 2021-07-06 11:00:10) 0 days 00:00:04
TOTAL [2021-07-06 11:00:05, 2021-07-06 11:00:10) 0 days 00:00:05
BLUE [2021-07-06 11:00:10, 2021-07-06 11:00:15) 0 days 00:00:04
RED [2021-07-06 11:00:10, 2021-07-06 11:00:15) 0 days 00:00:05
TOTAL [2021-07-06 11:00:10, 2021-07-06 11:00:15) 0 days 00:00:05
So this approach
- handles overlaps
- handles intervals crossing bin boundaries
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Answered By - Riley
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.