Issue
I have a pandas DataFrame:
timestamp | A | B | C | D |
---|---|---|---|---|
2023-09-27 14:05:50 | 1 | 2 | 3 | 4 |
2023-09-27 14:05:51 | 5 | 6 | 7 | 8 |
2023-09-27 14:05:52 | 9 | 10 | 11 | 12 |
2023-09-27 14:05:53 | 13 | 14 | 15 | 16 |
2023-09-27 14:05:54 | 17 | 18 | 19 | 20 |
2023-09-27 14:05:55 | 21 | 22 | 23 | 24 |
To feed it to a keras autoencoder I need a windowed version of the data (eg. window=3):
timestamp | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2023-09-27 14:05:50 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
2023-09-27 14:05:51 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
2023-09-27 14:05:52 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
2023-09-27 14:05:53 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
I wrote a function but I think I might miss the point. I got problems in the later process and it takes an incredible amount of time (>10 hours, on a machine with 128 cores, a lot of RAM and 32 GPU cards) on my data.
def makeWindowDataFrame(df, windowSize):
table = []
for window in df.rolling(window=windowSize):
if len(window) >= windowSize:
arr = []
for el in window.iloc:
arr.extend(el.to_numpy().reshape(-1))
table.append(arr)
longest = len(max(table, key=len))
return pd.DataFrame(table, columns=[a for a in range(longest)])
Is there a simpler way to create this dataset? This operation is on of the longest running in my setup.
EDIT 1:
def win(df, N):
return pd.DataFrame(sliding_window_view(df, N, axis=0).swapaxes(1, 2).reshape(len(df)-N+1, -1), index=df.index[:len(df)-N+1])
df = pd.DataFrame( {'timestamp': {28384: pd.Timestamp('2023-09-27 14:05:50'), 28385: pd.Timestamp('2023-09-27 14:05:52'), 28386: pd.Timestamp('2023-09-27 14:05:54'), 28387: pd.Timestamp('2023-09-27 14:05:56'), 28388: pd.Timestamp('2023-09-27 14:05:58')}, 'p1l4e0': {28384: 0.8869906663894653, 28385: 0.9212895035743713, 28386: 0.9084778428077698, 28387: 0.8959079384803772, 28388: 0.9066142439842224}, 'p1l4e1': {28384: 0.3119787573814392, 28385: 0.31039634346961975, 28386: 0.3139703571796417, 28387: 0.3119153082370758, 28388: 0.30586937069892883}, 'p1l4e2': {28384: 0.9320452809333801, 28385: 0.9452565312385559, 28386: 0.9435424208641052, 28387: 0.9356696605682373, 28388: 0.9325512647628784}, 'p1l4e3': {28384: 0.10841193050146103, 28385: 0.1134769469499588, 28386: 0.11245745420455933, 28387: 0.109752357006073, 28388: 0.10924666374921799}} )
win(df, 3)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28384 | 2023-09-27 14:05:50 | 0.886991 | 0.311979 | 0.932045 | 0.108412 | 2023-09-27 14:05:52 | 0.92129 | 0.310396 | 0.945257 | 0.113477 | 2023-09-27 14:05:54 | 0.908478 | 0.31397 | 0.943542 |
28385 | 2023-09-27 14:05:52 | 0.92129 | 0.310396 | 0.945257 | 0.113477 | 2023-09-27 14:05:54 | 0.908478 | 0.31397 | 0.943542 | 0.112457 | 2023-09-27 14:05:56 | 0.895908 | 0.311915 | 0.93567 |
28386 | 2023-09-27 14:05:54 | 0.908478 | 0.31397 | 0.943542 | 0.112457 | 2023-09-27 14:05:56 | 0.895908 | 0.311915 | 0.93567 | 0.109752 | 2023-09-27 14:05:58 | 0.906614 | 0.305869 | 0.932551 |
EDIT 2:
It seems like the index is not set. This would explain why it is not working.
df = df.set_index('timestamp')
df.head().to_dict('tight')
{
'index': [28384, 28385, 28386, 28387, 28388],
'columns': ['timestamp', 'p1l4e0', 'p1l4e1', 'p1l4e2', 'p1l4e3'],
'data': ...,
'index_names': [None],
'column_names': [None]
}
EDIT 3:
After a restart of the kernel it works. A test showed that the solution gains a speedup of 1000x at minimum and 1k-10k times faster on bigger datasets. Thank you.
Solution
Why not use numpy's sliding_window_view
:
from numpy.lib.stride_tricks import sliding_window_view as svw
N = 3
out = pd.DataFrame(svw(df, N, axis=0).swapaxes(1, 2).reshape(len(df)-N+1, -1),
index=df.index[:len(df)-N+1])
Output:
0 1 2 3 4 5 6 7 8 9 10 11
timestamp
2023-09-27 14:05:50 1 2 3 4 5 6 7 8 9 10 11 12
2023-09-27 14:05:51 5 6 7 8 9 10 11 12 13 14 15 16
2023-09-27 14:05:52 9 10 11 12 13 14 15 16 17 18 19 20
2023-09-27 14:05:53 13 14 15 16 17 18 19 20 21 22 23 24
Output with N=2
:
0 1 2 3 4 5 6 7
timestamp
2023-09-27 14:05:50 1 2 3 4 5 6 7 8
2023-09-27 14:05:51 5 6 7 8 9 10 11 12
2023-09-27 14:05:52 9 10 11 12 13 14 15 16
2023-09-27 14:05:53 13 14 15 16 17 18 19 20
2023-09-27 14:05:54 17 18 19 20 21 22 23 24
timings
Your code is very slow for large inputs
on 6 rows, N=3
# original approach
925 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# sliding window view
69.7 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
on 60k rows, N=3
# original approach
6.66 s ± 57.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# sliding window view
852 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
on 1.2M rows, N=3
# original approach
2min 16s ± 685 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# sliding window view
28.5 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Handling extra columns:
N = 3
cols = ['A', 'B', 'C', 'D']
out = (df[df.columns.difference(cols)].head(-N+1)
.join(pd.DataFrame(svw(df[cols], N, axis=0)
.swapaxes(1, 2)
.reshape(len(df)-N+1, -1),
index=df.index[:len(df)-N+1])
)
)
Answered By - mozway
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.