Issue
I have a MySQL database where each record has a datetime. When several values fall within a 3-minute window, I want to compute their mean. I made a fake file to illustrate.
#dataSample.csv
;y;datetime
0;1.885539280369374;2020-12-18 00:16:59
1;88.87944658745302;2020-12-18 00:18:26
2;5.4934801892366645;2020-12-18 00:21:47
3;27.481240675960745;2020-12-22 02:22:43
4;78.20955112191257;2021-03-12 00:01:45
5;69.20174844202616;2021-03-12 00:03:01
6;92.452056802478;2021-03-12 00:04:10
7;65.44391665410022;2021-03-12 00:06:12
8;40.59036279552053;2021-03-13 11:07:40
9;97.28850548113896;2021-03-13 11:08:46
10;94.73214209590618;2021-03-13 11:09:52
11;15.032038741334246;2021-03-14 00:50:10
12;26.96629037360529;2021-03-14 00:51:17
13;57.257554884427755;2021-03-14 00:52:20
14;18.845976481042804;2021-03-17 13:52:00
15;57.19160644979182;2021-03-17 13:53:48
16;3.81419643210113;2021-03-17 13:54:50
17;46.65212265222033;2021-03-17 20:00:06
18;78.99788944141437;2021-03-17 20:01:28
19;72.57950242929162;2021-03-17 20:02:18
20;31.953619913660063;2021-03-20 16:40:04
21;71.03880579866258;2021-03-20 16:41:14
22;80.07721218822367;2021-03-20 16:42:03
23;84.4974927845413;2021-03-23 23:51:04
24;23.332882564418554;2021-03-23 23:52:37
25;24.84651458538292;2021-03-23 23:53:44
26;3.2905723920299073;2021-04-13 01:07:13
27;95.00543057651691;2021-04-13 01:08:53
28;46.02579988887248;2021-04-13 01:10:03
29;71.73362449536457;2021-04-13 07:54:22
30;93.17353939667422;2021-04-13 07:56:03
31;28.06669274690586;2021-04-13 07:57:04
32;10.733532291051478;2021-04-21 23:52:19
33;92.92374999199961;2021-04-21 23:53:02
34;59.68694726616824;2021-04-21 23:54:12
35;30.01172074266929;2021-11-29 00:21:09
36;34.905022198511915;2021-11-29 00:23:09
37;25.149590827473055;2021-11-29 00:24:13
38;82.09740354280564;2021-12-01 08:30:00
39;25.58339148753002;2021-12-01 08:32:00
40;72.7009145748645;2021-12-01 08:34:00
41;8.43474445404563;2021-12-01 13:18:58
42;57.95936012084567;2021-12-01 13:19:45
43;31.118114587376713;2021-12-01 13:21:19
44;42.082098854369576;2021-12-01 20:24:46
45;75.8402567179772;2021-12-01 20:25:45
46;55.29546227636972;2021-12-01 20:26:20
47;72.52918512264547;2021-12-02 08:35:42
48;77.81077056479849;2021-12-02 08:36:35
49;34.63717484559066;2021-12-02 08:37:22
50;71.65724478546072;2021-12-06 00:05:00
51;19.54082334014094;2021-12-06 00:08:00
52;48.28967362303979;2021-12-06 00:10:00
53;34.894095185290105;2021-12-03 08:36:00
54;58.187428474357375;2021-12-03 08:40:00
55;94.53441120864328;2021-12-03 08:45:00
56;12.272217150555866;2021-12-03 13:10:00
57;87.21292441413424;2021-12-03 13:11:00
58;86.35470090744712;2021-12-03 13:12:00
59;50.23396755270806;2021-12-06 23:46:00
60;73.30424413459407;2021-12-06 23:48:00
61;60.48531615320234;2021-12-06 23:49:00
62;56.10336877052336;2021-12-06 23:51:00
63;87.6451368964707;2021-12-07 08:37:00
64;11.902048844734905;2021-12-07 10:48:00
65;57.596744167099494;2021-12-07 10:58:00
66;61.77125104854312;2021-12-07 11:05:00
67;21.542193987296695;2021-12-07 11:28:00
68;91.64520146457525;2021-12-07 11:29:00
69;78.42486998655676;2021-12-07 16:06:00
70;79.51721853991806;2021-12-07 16:08:00
71;54.46969194684532;2021-12-07 16:09:00
72;56.092025088935785;2021-12-07 16:12:00
73;2.546437552510464;2021-12-07 18:35:00
74;11.598686235757118;2021-12-07 18:40:00
75;40.26003639570842;2021-12-07 18:45:00
76;30.697636730470848;2021-12-07 23:39:00
77;66.3177096178856;2021-12-07 23:42:00
78;73.16870525875022;2021-12-07 23:47:00
79;61.68994018242363;2021-12-08 13:47:00
80;38.06598256433572;2021-12-08 13:48:00
81;43.91268499464372;2021-12-08 13:49:00
82;33.166594417250735;2021-12-15 00:23:00
83;52.68422837459157;2021-12-15 00:24:00
84;86.01398356923765;2021-12-15 00:26:00
85;21.444108620566542;2021-12-15 00:31:00
86;86.6839608035921;2021-12-18 01:09:00
87;43.83047571188636;2022-01-06 00:24:00
Here is my code:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

fileName = "dataSample.csv"
df = pd.read_csv(fileName, sep=";", index_col=0)
df['datetime_object'] = df['datetime'].apply(datetime.fromisoformat)

def define_mask(d, delta_minutes):
    # Rows whose datetime falls within [d, d + delta_minutes]
    return (d <= df["datetime_object"]) & (df["datetime_object"] <= d + timedelta(minutes=delta_minutes))

group = []
i = 0
while i < len(df):
    # Open a window at the current row and collect every row within 3 minutes
    d = df.loc[i]["datetime_object"]
    mask = define_mask(d, 3)
    for k in range(len(df[mask].index)):
        group.append(i)
    i += len(df[mask].index)

df["group"] = group
df_new = df.groupby("group").apply(np.mean)
It works well, but I am wondering whether this is good pandas practice.
I have two questions:
- Is there another way to do this with pandas?
- Is there an SQL command to do it directly?
Solution
You can use resample:
df = pd.read_csv('data.csv', sep=';', index_col=0, parse_dates=['datetime'])
out = df.resample('3min', on='datetime').mean().dropna().reset_index()
print(out)
# Output
datetime y
0 2020-12-18 00:15:00 1.885539
1 2020-12-18 00:18:00 88.879447
2 2020-12-18 00:21:00 5.493480
3 2020-12-22 02:21:00 27.481241
4 2021-03-12 00:00:00 78.209551
.. ... ...
59 2021-12-15 00:21:00 33.166594
60 2021-12-15 00:24:00 69.349106
61 2021-12-15 00:30:00 21.444109
62 2021-12-18 01:09:00 86.683961
63 2022-01-06 00:24:00 43.830476
[64 rows x 2 columns]
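Note that resample aligns its 3-minute bins to fixed clock boundaries (00:15, 00:18, 00:21, ...), so the bin label is the start of the interval rather than the timestamp of the first record in it. If you want to check how many records land in each bin, you can aggregate the count alongside the mean; a minimal sketch reusing the df loaded above:

# Sketch: compute both the mean and the number of rows per 3-minute bin,
# then keep only the bins that actually contain data.
check = df.resample('3min', on='datetime')['y'].agg(['mean', 'count'])
check = check[check['count'] > 0].reset_index()
print(check.head())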
Another way, which keeps the first datetime value of each 3-minute group:
out = df.groupby(pd.Grouper(freq='3min', key='datetime'), as_index=False) \
.agg({'y': 'mean', 'datetime': 'first'}) \
.dropna(how='all').reset_index(drop=True)
print(out)
# Output
y datetime
0 1.885539 2020-12-18 00:16:59
1 88.879447 2020-12-18 00:18:26
2 5.493480 2020-12-18 00:21:47
3 27.481241 2020-12-22 02:22:43
4 78.209551 2021-03-12 00:01:45
.. ... ...
59 33.166594 2021-12-15 00:23:00
60 69.349106 2021-12-15 00:24:00
61 21.444109 2021-12-15 00:31:00
62 86.683961 2021-12-18 01:09:00
63 43.830476 2022-01-06 00:24:00
[64 rows x 2 columns]
Or
out = df.resample('3min', on='datetime') \
.agg({'y': 'mean', 'datetime': 'first'}) \
      .dropna(how='all').reset_index(drop=True)
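If you would rather anchor the bins to the data than to fixed clock boundaries, resample also accepts an origin parameter. This is only closer in spirit to the original while loop (it still tiles fixed 3-minute bins, starting from the first timestamp, rather than restarting a window at each record); a sketch, assuming pandas >= 1.1:

# Sketch (assumes pandas >= 1.1): start the 3-minute bins at the first
# timestamp in the data instead of at fixed clock boundaries.
out = df.resample('3min', on='datetime', origin='start') \
        .agg({'y': 'mean', 'datetime': 'first'}) \
        .dropna(how='all').reset_index(drop=True)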
Answered By - Corralien