Tuesday, January 16, 2024

[FIXED] Pandas Dataframe to sliding window

January 16, 2024 keras, pandas, python, tensorflow No comments

Issue

I have a pandas DataFrame:

timestamp	A	B	C	D
2023-09-27 14:05:50	1	2	3	4
2023-09-27 14:05:51	5	6	7	8
2023-09-27 14:05:52	9	10	11	12
2023-09-27 14:05:53	13	14	15	16
2023-09-27 14:05:54	17	18	19	20
2023-09-27 14:05:55	21	22	23	24

To feed it to a keras autoencoder I need a windowed version of the data (eg. window=3):

timestamp	0	1	2	3	4	5	6	7	8	9	10	11
2023-09-27 14:05:50	1	2	3	4	5	6	7	8	9	10	11	12
2023-09-27 14:05:51	5	6	7	8	9	10	11	12	13	14	15	16
2023-09-27 14:05:52	9	10	11	12	13	14	15	16	17	18	19	20
2023-09-27 14:05:53	13	14	15	16	17	18	19	20	21	22	23	24

I wrote a function but I think I might miss the point. I got problems in the later process and it takes an incredible amount of time (>10 hours, on a machine with 128 cores, a lot of RAM and 32 GPU cards) on my data.

def makeWindowDataFrame(df, windowSize):
    table = []
    for window in df.rolling(window=windowSize):
        if len(window) >= windowSize:
            arr = []
            for el in window.iloc:
                arr.extend(el.to_numpy().reshape(-1))
            table.append(arr)
    longest = len(max(table, key=len))
    return pd.DataFrame(table, columns=[a for a in range(longest)])

Is there a simpler way to create this dataset? This operation is on of the longest running in my setup.

EDIT 1:

def win(df, N):
    return pd.DataFrame(sliding_window_view(df, N, axis=0).swapaxes(1, 2).reshape(len(df)-N+1, -1), index=df.index[:len(df)-N+1])

df = pd.DataFrame( {'timestamp': {28384: pd.Timestamp('2023-09-27 14:05:50'), 28385: pd.Timestamp('2023-09-27 14:05:52'), 28386: pd.Timestamp('2023-09-27 14:05:54'), 28387: pd.Timestamp('2023-09-27 14:05:56'), 28388: pd.Timestamp('2023-09-27 14:05:58')}, 'p1l4e0': {28384: 0.8869906663894653, 28385: 0.9212895035743713, 28386: 0.9084778428077698, 28387: 0.8959079384803772, 28388: 0.9066142439842224}, 'p1l4e1': {28384: 0.3119787573814392, 28385: 0.31039634346961975, 28386: 0.3139703571796417, 28387: 0.3119153082370758, 28388: 0.30586937069892883}, 'p1l4e2': {28384: 0.9320452809333801, 28385: 0.9452565312385559, 28386: 0.9435424208641052, 28387: 0.9356696605682373, 28388: 0.9325512647628784}, 'p1l4e3': {28384: 0.10841193050146103, 28385: 0.1134769469499588, 28386: 0.11245745420455933, 28387: 0.109752357006073, 28388: 0.10924666374921799}} )

win(df, 3)

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
28384	2023-09-27 14:05:50	0.886991	0.311979	0.932045	0.108412	2023-09-27 14:05:52	0.92129	0.310396	0.945257	0.113477	2023-09-27 14:05:54	0.908478	0.31397	0.943542
28385	2023-09-27 14:05:52	0.92129	0.310396	0.945257	0.113477	2023-09-27 14:05:54	0.908478	0.31397	0.943542	0.112457	2023-09-27 14:05:56	0.895908	0.311915	0.93567
28386	2023-09-27 14:05:54	0.908478	0.31397	0.943542	0.112457	2023-09-27 14:05:56	0.895908	0.311915	0.93567	0.109752	2023-09-27 14:05:58	0.906614	0.305869	0.932551

EDIT 2:

It seems like the index is not set. This would explain why it is not working.

df = df.set_index('timestamp')
df.head().to_dict('tight')

{
'index': [28384, 28385, 28386, 28387, 28388],
'columns': ['timestamp', 'p1l4e0', 'p1l4e1', 'p1l4e2', 'p1l4e3'],
'data': ...,
'index_names': [None],
'column_names': [None]
}

EDIT 3:

After a restart of the kernel it works. A test showed that the solution gains a speedup of 1000x at minimum and 1k-10k times faster on bigger datasets. Thank you.

Solution

Why not use numpy's sliding_window_view:

from numpy.lib.stride_tricks import sliding_window_view as svw

N = 3
out = pd.DataFrame(svw(df, N, axis=0).swapaxes(1, 2).reshape(len(df)-N+1, -1),
                   index=df.index[:len(df)-N+1])

Output:

                     0   1   2   3   4   5   6   7   8   9   10  11
timestamp                                                          
2023-09-27 14:05:50   1   2   3   4   5   6   7   8   9  10  11  12
2023-09-27 14:05:51   5   6   7   8   9  10  11  12  13  14  15  16
2023-09-27 14:05:52   9  10  11  12  13  14  15  16  17  18  19  20
2023-09-27 14:05:53  13  14  15  16  17  18  19  20  21  22  23  24

Output with N=2:

                      0   1   2   3   4   5   6   7
timestamp                                          
2023-09-27 14:05:50   1   2   3   4   5   6   7   8
2023-09-27 14:05:51   5   6   7   8   9  10  11  12
2023-09-27 14:05:52   9  10  11  12  13  14  15  16
2023-09-27 14:05:53  13  14  15  16  17  18  19  20
2023-09-27 14:05:54  17  18  19  20  21  22  23  24

timings

Your code is very slow for large inputs

on 6 rows, N=3

# original approach
925 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# sliding window view
69.7 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

on 60k rows, N=3

# original approach
6.66 s ± 57.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# sliding window view
852 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

on 1.2M rows, N=3

# original approach
2min 16s ± 685 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# sliding window view
28.5 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Handling extra columns:

N = 3
cols = ['A', 'B', 'C', 'D']

out = (df[df.columns.difference(cols)].head(-N+1)
       .join(pd.DataFrame(svw(df[cols], N, axis=0)
                          .swapaxes(1, 2)
                          .reshape(len(df)-N+1, -1),
                          index=df.index[:len(df)-N+1])
             )
       )

Answered By - mozway

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, January 16, 2024

[FIXED] Pandas Dataframe to sliding window

Issue

Solution

timings

on 6 rows, N=3

on 60k rows, N=3

on 1.2M rows, N=3

0 comments:

Post a Comment

Popular Posts

Labels