Issue
I want to know how long it has been driven during certain time segments, in this case we want to look at one hour segments. To get the segments I make the following code:
init = '2022-03-10 01:00:00'
end = '2022-03-10 06:00:00'
freq = '1h'
bucket = pd.DataFrame(
{'start_date': pd.date_range(start=init, end=end, freq=freq)}
)
bucket['end_date'] = bucket['start_date'] + pd.Timedelta(seconds=3600)
bucket
start_date end_date
0 2022-03-10 01:00:00 2022-03-10 02:00:00
1 2022-03-10 02:00:00 2022-03-10 03:00:00
2 2022-03-10 03:00:00 2022-03-10 04:00:00
3 2022-03-10 04:00:00 2022-03-10 05:00:00
4 2022-03-10 05:00:00 2022-03-10 06:00:00
5 2022-03-10 06:00:00 2022-03-10 07:00:00
From the database I get all the trips that have been made in the start and end time periods. For this example, the data obtained is:
df = pd.DataFrame({
'start_date': ['2022-03-10 01:20:00', '2022-03-10 02:18:00', '2022-03-10 02:10:00', '2022-03-10 02:40:00', '2022-03-10 02:45:00', '2022-03-10 03:05:00', '2022-03-10 03:12:00', '2022-03-10 05:30:00'],
'end_date': ['2022-03-10 01:32:00', '2022-03-10 02:42:00', '2022-03-10 02:23:00', '2022-03-10 03:20:00', '2022-03-10 02:58:00', '2022-03-10 03:28:00', pd.NA, '2022-03-10 05:48:00'],
'number_of_trip': ["637hui", "384nfj", "102fiu", "948pvc", "473mds", "103fkd", "905783", "498wsq"],
'id': [1, 2, 3, 4, 5, 6, 7, 8]
})
df = df.replace(pd.NA, datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
start_date end_date number_of_trip id seconds
0 2022-03-10 01:20:00 2022-03-10 01:32:00 637hui 1 720
1 2022-03-10 02:18:00 2022-03-10 02:42:00 384nfj 2 1440
2 2022-03-10 02:10:00 2022-03-10 02:23:00 102fiu 3 780
3 2022-03-10 02:40:00 2022-03-10 03:20:00 948pvc 4 2400
4 2022-03-10 02:45:00 2022-03-10 02:58:00 473mds 5 780
5 2022-03-10 03:05:00 2022-03-10 03:28:00 103fkd 6 1380
6 2022-03-10 03:12:00 2022-04-06 17:04:15 905783 7 0
7 2022-03-10 05:30:00 2022-03-10 05:48:00 498wsq 8 1080
Now, I can't do a direct summation like this:
times = df.resample(freq, on='start_date')['seconds'].sum()
start_date
2022-03-10 01:00:00 720
2022-03-10 02:00:00 5400
2022-03-10 03:00:00 1380
2022-03-10 04:00:00 0
2022-03-10 05:00:00 1080
The reason is that it only takes the start date but not the end date, this means that if a trip enters into two segments, it is only added in the first segment when in reality it should add one part in one segment and the other in another segment.
For the segment from 2:00:00 to 3:00:00
there are 4 rows that can enter, but not complete. The rows are:
1 2022-03-10 02:18:00 2022-03-10 02:42:00 384nfj 2 1440
2 2022-03-10 02:10:00 2022-03-10 02:23:00 102fiu 3 780
3 2022-03-10 02:40:00 2022-03-10 03:20:00 948pvc 4 2400 # This is in two segments 2022-03-10 02:00:00 to 2022-03-10 03:00:00 (1200) and 2022-03-10 03:00:00 to 2022-03-10 04:00:00 (1200)
4 2022-03-10 02:45:00 2022-03-10 02:58:00 473mds 5 780
The resample code adds these 4 rows 1440+780+2400+780
giving the result of 5400
, however row 3 has time of 2 segments from 2:00-3:00 and from 3:00-4:00
, so for the segment 2:00-3:00
the elapsed time must be taken, in this case it must be from 2:40:00 to 3:00:00
, which is 1200 seconds, that makes the sum for the 2:00-3:00
segment is 1440+780+1200+780 = 4200
I have some code that gives the values correctly, but it is very slow. is there a way in pandas to improve this?
CODE:
from datetime import timedelta, datetime
import pandas as pd
def bucket_count(seconds_bucket, data, inicio, fin):
result = pd.DataFrame()
bucket = pd.DataFrame(
{'start_date': pd.date_range(start=inicio, end=fin, freq="H"),
"color": "#FF0000"}
)
bucket['end_date'] = bucket['start_date'] + pd.Timedelta(seconds=seconds_bucket)
for index, row_bucket in bucket.iterrows():
inicio = row_bucket['start_date']
fin = inicio + timedelta(seconds=seconds_bucket)
df = data[(
((inicio < data['start_date']) & (
fin > data['start_date'])) |
((data['end_date'] > inicio) & (data['end_date'] < fin)))
]
for index, row in df.iterrows():
counter = 0
if row['start_date'] > inicio < fin and row['end_date'] < fin > inicio:
seconds = int((row['end_date'] - row['start_date']).total_seconds())
counter += seconds
elif row['start_date'] < inicio and row['end_date'] < fin:
seconds = int((row['end_date'] - inicio).total_seconds())
counter += seconds
elif row['start_date'] > inicio and row['end_date'] > fin:
seconds = int((fin - row['start_date']).total_seconds())
counter += seconds
row['start_date'] = inicio
row['end_date'] = fin
row['seconds'] = counter
result = pd.concat([result, row.to_frame().T], ignore_index=True)
return result.groupby(['start_date'])["seconds"].apply(lambda x: x.astype(int).sum()).reset_index()
inicio = '2022-03-10 01:00:00'
fin = '2022-03-10 06:00:00'
df = pd.DataFrame({
'start_date': ['2022-03-10 01:20:00', '2022-03-10 02:18:00', '2022-03-10 02:10:00', '2022-03-10 02:40:00', '2022-03-10 02:45:00', '2022-03-10 03:05:00', '2022-03-10 03:12:00', '2022-03-10 05:30:00'],
'end_date': ['2022-03-10 01:32:00', '2022-03-10 02:42:00', '2022-03-10 02:23:00', '2022-03-10 03:20:00', '2022-03-10 02:58:00', '2022-03-10 03:28:00', pd.NA, '2022-03-10 05:48:00'],
'number_of_trip': ["637hui", "384nfj", "102fiu", "948pvc", "473mds", "103fkd", "905783", "498wsq"],
'id': [1, 2, 3, 4, 5, 6, 7, 8]
})
df = df.replace(pd.NA, datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
a = bucket_count(3600, df, inicio, fin)
This result is correct
start_date seconds
0 2022-03-10 01:00:00 720
1 2022-03-10 02:00:00 4200
2 2022-03-10 03:00:00 5460
3 2022-03-10 05:00:00 1080
Solution
This is what I came up with.
First off the time on your method was:
%%timeit
bucket_count(3600, df, inicio, fin)
>>> 44 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Versus the method I came up with:
%%timeit
bucket_count_new(3600, df, inicio, fin)
>>>17.1 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So about 50% faster. I think it will be even quicker on larger samples but I did not test this.
Additionally, I think your method has a bug in it, but I did not troubleshoot it. Here is what I came up with and the result.
import pandas as pd
def bucket_count_new(seconds_bucket: int, data: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp):
# filter out rows that are not within the range of start to end
filtered_data = data.loc[data.start_date.lt(end) & data.end_date.gt(start)]
# cut records start and ending outside of range to time range
filtered_data.loc[filtered_data.start_date.lt(start), 'start_date'] = start
filtered_data.loc[filtered_data.end_date.gt(end) | filtered_data.end_date.isnull(), 'end_date'] = end
# add total trip seconds within date range
filtered_data['trip_seconds'] = (filtered_data.end_date - filtered_data.start_date).dt.total_seconds()
results = pd.DataFrame(
{'start_date': pd.date_range(start=start, end=end, freq=pd.Timedelta(seconds=seconds_bucket)),
'total_seconds': 0})
for ind, trip_start, trip_end, trip_num, trip_id, trip_seconds in filtered_data.to_records():
# get first bucket index after start
first_ind_after_start = results['start_date'].gt(trip_start).values.argmax()
# calculate first timedelta to get to next bucket start
first_timedelta = (results.loc[first_ind_after_start, 'start_date'] - trip_start).total_seconds()
# if the first timedelta is less than the total seconds for that trip add the trip_seconds else the timedelta
results.loc[
first_ind_after_start - 1, 'total_seconds'] += first_timedelta if first_timedelta <= trip_seconds else trip_seconds
# subtract time from trip added to first bucket for trip
trip_seconds -= first_timedelta
# iterate through next buckets subtracting time until gone
ind = first_ind_after_start
while trip_seconds >= seconds_bucket:
results.loc[ind, 'total_seconds'] += seconds_bucket
trip_seconds -= seconds_bucket
ind += 1
if trip_seconds > 0:
results.loc[ind, 'total_seconds'] = trip_seconds
return results
inicio = pd.Timestamp('2022-03-10 01:00:00')
fin = pd.Timestamp('2022-03-10 06:00:00')
df = pd.DataFrame({
'start_date': ['2022-03-10 01:20:00', '2022-03-10 02:18:00', '2022-03-10 02:10:00', '2022-03-10 02:40:00',
'2022-03-10 02:45:00', '2022-03-10 03:05:00', '2022-03-10 03:12:00', '2022-03-10 05:30:00'],
'end_date': ['2022-03-10 01:32:00', '2022-03-10 02:42:00', '2022-03-10 02:23:00', '2022-03-10 03:20:00',
'2022-03-10 02:58:00', '2022-03-10 03:28:00', pd.NA, '2022-03-10 05:48:00'],
'number_of_trip': ["637hui", "384nfj", "102fiu", "948pvc", "473mds", "103fkd", "905783", "498wsq"],
'id': [1, 2, 3, 4, 5, 6, 7, 8]
})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
a = bucket_count_new(3600, df.copy(deep=True), start=inicio, end=fin)
Result:
a
start_date total_seconds
0 2022-03-10 01:00:00 720
1 2022-03-10 02:00:00 4200
2 2022-03-10 03:00:00 2580
3 2022-03-10 04:00:00 0
4 2022-03-10 05:00:00 1080
5 2022-03-10 06:00:00 0
I believe this is correct unless I did my first filtering incorrectly because the total seconds within the range of start to end is the same as the total seconds in a. But I did not have time to debug it. Please let me know if this is the correct results.
Answered By - ak_slick
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.