Sunday, June 19, 2022

[FIXED] Aggregate interval data by binning to create time series

June 19, 2022 dataframe, pandas, python No comments

Issue

I have a DataFrame containing records on the start time, end time, user, product and usage rate. A user is capable of using the same product multiple times at different rates. I want to transform this DataFrame into a timeseries dataset where for each user, bins of 4 hours are created in which the usage rate is added up. As start and end times can overlap with the bin boundaries, fractional usage over that bin would need to be calculated and each product receives their own column.

Original DataFrame:

start	end	user	product	usage_rate
1	12	A	X_1	10
8	15	A	X_1	20
3	7	B	X_1	3
3	8	B	X_2	70

Desired output:

user	bin	start	end	X_1	X_2
A	0-4	0	4	30	0
A	4-8	4	8	40	0
A	8-12	8	12	120	0
A	12-16	12	16	60	0
B	0-4	0	4	3	70
B	4-8	4	8	9	280

I've tried to use pd.cut to create bins, but only manage to create only DataFrames where the start time matches and/or the end time matches, but not the bins in between. I want to prevent using for loops as these can become quite slow when expanding this to 10k users and 20 products. The process I thought of first duplicates each row for the number of time bins required (0-4, 4-8, 8-12 for the first record), calculates the duration within that bin, multiplies duration by usage_rate to get used_in_bin value, then groupby(['user', 'bin']) and retrieves the sum of those used_in_bin values. Then the last step would be to pivot the product name and values to get the table.

The hard part is that I'm stuck creating the time bins for time ranges (either datetime or numerical). In what way can I create duplicate records for the missing bins between the start and end time of my range?

Solution

The data you have is described by mathematical step functions, and staircase has been built upon pandas and numpy for this purpose.

setup

df = pd.DataFrame(
    {
        "start":[1,8,3,3],
        "end":[12,15,7,8],
        "user":["A", "A", "B", "B"],
        "product":["X_1", "X_1", "X_1", "X_2"],
        "usage_rate":[10,20,3,70]
    }
)

solution

We'll create a step function for each user and product. A step function is represented by the staircase.Stairs class. This class is to staircase as Series is to pandas. To do this we group the dataframe on these variables and pass the sub-dataframes to the Stairs constructor.

import staircase as sc

stepfunctions = df.groupby(["user", "product"]).apply(sc.Stairs, "start", "end", "usage_rate")

Our stepfunctions variable looks like this. It is a series, with a multi-index, and values are Stairs objects.

user  product
A     X_1        <staircase.Stairs, id=2516839332104>
B     X_1        <staircase.Stairs, id=2516834889160>
      X_2        <staircase.Stairs, id=2516835627464>
dtype: object

There are many things you can do with step functions in staircase including plotting.

stepfunctions["A", "X_1"].plot(style="hlines")

What you are wanting to do is create bins and integrate (find the area under) these step functions. For a step function sf this means the following calculation

sf.slice([0,4,8,12,16]).integral()

We can use pandas.Series.apply to do this for all our step functions.

binned = stepfunctions.apply(lambda sf: sf.slice([0,4,8,12,16]).integral())

The binned variable will we a dataframe, with the same index as stepfunctions, with a column for each bin interval

               [0, 4)  [4, 8)  [8, 12)  [12, 16)
user product                                   
A    X_1        30.0    40.0    120.0      60.0
B    X_1         3.0     9.0      0.0       0.0
     X_2        70.0   280.0      0.0       0.0

To get this data in a tidy format the following can be used

tidy_result = binned.melt(ignore_index=False).rename({"variable":"bin"}).reset_index()

The tidy_result dataframe will look like this:

   user product       bin  value
0     A     X_1    [0, 4)   30.0
1     B     X_1    [0, 4)    3.0
2     B     X_2    [0, 4)   70.0
3     A     X_1    [4, 8)   40.0
4     B     X_1    [4, 8)    9.0
5     B     X_2    [4, 8)  280.0
6     A     X_1   [8, 12)  120.0
7     B     X_1   [8, 12)    0.0
8     B     X_2   [8, 12)    0.0
9     A     X_1  [12, 16)   60.0
10    B     X_1  [12, 16)    0.0
11    B     X_2  [12, 16)    0.0

This may be enough for your purposes. If you want to get it in the exact format you have submitted, then this should be easy enough:

use tidy_result.pivot(index=["user", "bin"], columns="product") to turn the product column into columns for each product. Fill NA values for A and X_2 with 0.
use tidy_result["start"] = pd.IntervalIndex(tidy_result["bin"]).left etc

To recap, the solution (after imports) boils down to the following three lines

stepfunctions = df.groupby(["user", "product"]).apply(sc.Stairs, "start", "end", "usage_rate")
binned = stepfunctions.apply(lambda sf: sf.slice([0,4,8,12,16]).integral())
tidy_result = binned.melt(ignore_index=False).rename({"variable":"bin"}).reset_index()

note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.

Answered By - Riley

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, June 19, 2022

[FIXED] Aggregate interval data by binning to create time series

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels