Issue
I have a MySQL database where each record has a datetime. When several values fall within a 3-minute window, I want to compute their mean. I made a fake file to illustrate.
#dataSample.csv
;y;datetime
0;1.885539280369374;2020-12-18 00:16:59
1;88.87944658745302;2020-12-18 00:18:26
2;5.4934801892366645;2020-12-18 00:21:47
3;27.481240675960745;2020-12-22 02:22:43
4;78.20955112191257;2021-03-12 00:01:45
5;69.20174844202616;2021-03-12 00:03:01
6;92.452056802478;2021-03-12 00:04:10
7;65.44391665410022;2021-03-12 00:06:12
8;40.59036279552053;2021-03-13 11:07:40
9;97.28850548113896;2021-03-13 11:08:46
10;94.73214209590618;2021-03-13 11:09:52
11;15.032038741334246;2021-03-14 00:50:10
12;26.96629037360529;2021-03-14 00:51:17
13;57.257554884427755;2021-03-14 00:52:20
14;18.845976481042804;2021-03-17 13:52:00
15;57.19160644979182;2021-03-17 13:53:48
16;3.81419643210113;2021-03-17 13:54:50
17;46.65212265222033;2021-03-17 20:00:06
18;78.99788944141437;2021-03-17 20:01:28
19;72.57950242929162;2021-03-17 20:02:18
20;31.953619913660063;2021-03-20 16:40:04
21;71.03880579866258;2021-03-20 16:41:14
22;80.07721218822367;2021-03-20 16:42:03
23;84.4974927845413;2021-03-23 23:51:04
24;23.332882564418554;2021-03-23 23:52:37
25;24.84651458538292;2021-03-23 23:53:44
26;3.2905723920299073;2021-04-13 01:07:13
27;95.00543057651691;2021-04-13 01:08:53
28;46.02579988887248;2021-04-13 01:10:03
29;71.73362449536457;2021-04-13 07:54:22
30;93.17353939667422;2021-04-13 07:56:03
31;28.06669274690586;2021-04-13 07:57:04
32;10.733532291051478;2021-04-21 23:52:19
33;92.92374999199961;2021-04-21 23:53:02
34;59.68694726616824;2021-04-21 23:54:12
35;30.01172074266929;2021-11-29 00:21:09
36;34.905022198511915;2021-11-29 00:23:09
37;25.149590827473055;2021-11-29 00:24:13
38;82.09740354280564;2021-12-01 08:30:00
39;25.58339148753002;2021-12-01 08:32:00
40;72.7009145748645;2021-12-01 08:34:00
41;8.43474445404563;2021-12-01 13:18:58
42;57.95936012084567;2021-12-01 13:19:45
43;31.118114587376713;2021-12-01 13:21:19
44;42.082098854369576;2021-12-01 20:24:46
45;75.8402567179772;2021-12-01 20:25:45
46;55.29546227636972;2021-12-01 20:26:20
47;72.52918512264547;2021-12-02 08:35:42
48;77.81077056479849;2021-12-02 08:36:35
49;34.63717484559066;2021-12-02 08:37:22
50;71.65724478546072;2021-12-06 00:05:00
51;19.54082334014094;2021-12-06 00:08:00
52;48.28967362303979;2021-12-06 00:10:00
53;34.894095185290105;2021-12-03 08:36:00
54;58.187428474357375;2021-12-03 08:40:00
55;94.53441120864328;2021-12-03 08:45:00
56;12.272217150555866;2021-12-03 13:10:00
57;87.21292441413424;2021-12-03 13:11:00
58;86.35470090744712;2021-12-03 13:12:00
59;50.23396755270806;2021-12-06 23:46:00
60;73.30424413459407;2021-12-06 23:48:00
61;60.48531615320234;2021-12-06 23:49:00
62;56.10336877052336;2021-12-06 23:51:00
63;87.6451368964707;2021-12-07 08:37:00
64;11.902048844734905;2021-12-07 10:48:00
65;57.596744167099494;2021-12-07 10:58:00
66;61.77125104854312;2021-12-07 11:05:00
67;21.542193987296695;2021-12-07 11:28:00
68;91.64520146457525;2021-12-07 11:29:00
69;78.42486998655676;2021-12-07 16:06:00
70;79.51721853991806;2021-12-07 16:08:00
71;54.46969194684532;2021-12-07 16:09:00
72;56.092025088935785;2021-12-07 16:12:00
73;2.546437552510464;2021-12-07 18:35:00
74;11.598686235757118;2021-12-07 18:40:00
75;40.26003639570842;2021-12-07 18:45:00
76;30.697636730470848;2021-12-07 23:39:00
77;66.3177096178856;2021-12-07 23:42:00
78;73.16870525875022;2021-12-07 23:47:00
79;61.68994018242363;2021-12-08 13:47:00
80;38.06598256433572;2021-12-08 13:48:00
81;43.91268499464372;2021-12-08 13:49:00
82;33.166594417250735;2021-12-15 00:23:00
83;52.68422837459157;2021-12-15 00:24:00
84;86.01398356923765;2021-12-15 00:26:00
85;21.444108620566542;2021-12-15 00:31:00
86;86.6839608035921;2021-12-18 01:09:00
87;43.83047571188636;2022-01-06 00:24:00
Here is my code:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

fileName = "dataSample.csv"
df = pd.read_csv(fileName, sep=";", index_col=0)
df['datetime_object'] = df['datetime'].apply(datetime.fromisoformat)

def define_mask(d, delta_minutes):
    # Rows whose datetime falls within [d, d + delta_minutes]
    return (d <= df["datetime_object"]) & (df["datetime_object"] <= d + timedelta(minutes=delta_minutes))

group = []
i = 0
while i < len(df):
    # Open a window at the current row and collect every row within 3 minutes
    d = df.loc[i]["datetime_object"]
    mask = define_mask(d, 3)
    for k in range(len(df[mask].index)):
        group.append(i)
    i += len(df[mask].index)

df["group"] = group
df_new = df.groupby("group").apply(np.mean)
It works well, but I am wondering whether this is good pandas practice.
I have two questions:
- Is there another way to do this with pandas?
- Is there an SQL command to do it directly?
Solution
You can use resample:
df = pd.read_csv('data.csv', sep=';', index_col=0, parse_dates=['datetime'])
out = df.resample('3min', on='datetime').mean().dropna().reset_index()
print(out)
# Output
datetime y
0 2020-12-18 00:15:00 1.885539
1 2020-12-18 00:18:00 88.879447
2 2020-12-18 00:21:00 5.493480
3 2020-12-22 02:21:00 27.481241
4 2021-03-12 00:00:00 78.209551
.. ... ...
59 2021-12-15 00:21:00 33.166594
60 2021-12-15 00:24:00 69.349106
61 2021-12-15 00:30:00 21.444109
62 2021-12-18 01:09:00 86.683961
63 2022-01-06 00:24:00 43.830476
[64 rows x 2 columns]
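Note that resample aligns its 3-minute bins to fixed clock boundaries (00:15, 00:18, 00:21, ...), so the bin label is the start of the interval rather than the timestamp of the first record in it. If you want to check how many records land in each bin, you can aggregate the count alongside the mean; a minimal sketch reusing the df loaded above:

# Sketch: compute both the mean and the number of rows per 3-minute bin,
# then keep only the bins that actually contain data.
check = df.resample('3min', on='datetime')['y'].agg(['mean', 'count'])
check = check[check['count'] > 0].reset_index()
print(check.head())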
Another way, which keeps the first datetime value of each 3-minute group:
out = df.groupby(pd.Grouper(freq='3min', key='datetime'), as_index=False) \
.agg({'y': 'mean', 'datetime': 'first'}) \
.dropna(how='all').reset_index(drop=True)
print(out)
# Output
y datetime
0 1.885539 2020-12-18 00:16:59
1 88.879447 2020-12-18 00:18:26
2 5.493480 2020-12-18 00:21:47
3 27.481241 2020-12-22 02:22:43
4 78.209551 2021-03-12 00:01:45
.. ... ...
59 33.166594 2021-12-15 00:23:00
60 69.349106 2021-12-15 00:24:00
61 21.444109 2021-12-15 00:31:00
62 86.683961 2021-12-18 01:09:00
63 43.830476 2022-01-06 00:24:00
[64 rows x 2 columns]
Or
out = df.resample('3min', on='datetime') \
.agg({'y': 'mean', 'datetime': 'first'}) \
      .dropna(how='all').reset_index(drop=True)
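If you would rather anchor the bins to the data than to fixed clock boundaries, resample also accepts an origin parameter. This is only closer in spirit to the original while loop (it still tiles fixed 3-minute bins, starting from the first timestamp, rather than restarting a window at each record); a sketch, assuming pandas >= 1.1:

# Sketch (assumes pandas >= 1.1): start the 3-minute bins at the first
# timestamp in the data instead of at fixed clock boundaries.
out = df.resample('3min', on='datetime', origin='start') \
        .agg({'y': 'mean', 'datetime': 'first'}) \
        .dropna(how='all').reset_index(drop=True)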
Answered By - Corralien