Thursday, May 12, 2022

[FIXED] For loop on pandas dataframe causing slow performance - can rolling be used?

dataframe, pandas, performance, python, rolling-computation

Issue

I currently have a loop in a script that processes a raw test data file and performs a series of calculations on the sanitised data. During the script, I need to work out exactly how many cycles there are in each test. A new cycle begins wherever the step value at position i is greater than the step value at position i+1. For example, if the step count reaches 4 and the next step is 1, that next step is the beginning of a new cycle. So far I am calculating this with this simple loop:

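import pandas as pd

raw_data = pd.DataFrame({'Step':[1,1,2,2,2,3,3,4,4,4,1,2,2,3,3,3,4,4,4,4,1,2,2,3,3,4,4,4]})
raw_data['CyclesTest'] = 0  # placeholder column, filled in by the loop

cycle_test = 1

for i in range(len(raw_data) - 1):
    # Row i belongs to the current cycle
    raw_data.loc[i, 'CyclesTest'] = cycle_test
    # A drop in Step means the next row starts a new cycle
    if raw_data['Step'][i] > raw_data['Step'][i + 1]:
        cycle_test += 1

# The loop stops one row early, so label the last row as well
raw_data.loc[len(raw_data) - 1, 'CyclesTest'] = cycle_test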

This works fine, but the raw_data being provided is very large, and my script is taking forever on this calculation. I've used rolling before to do max and min comparisons, but is it possible to use it to replace this for loop? I'm just getting back into programming, so every day is a school day again! Any help would be greatly appreciated.

Solution

You can do it like this:

import pandas as pd

raw_data = {'Step':[1,1,2,2,2,3,3,4,4,4,1,2,2,3,3,3,4,4,4,4,1,2,2,3,3,4,4,4]}

df = pd.DataFrame(raw_data)
df['CycleTest'] = (df['Step'].diff() < 0).cumsum() + 1

print(df)

    Step  CycleTest
0      1          1
1      1          1
2      2          1
3      2          1
4      2          1
5      3          1
6      3          1
7      4          1
8      4          1
9      4          1
10     1          2
11     2          2
12     2          2
13     3          2
14     3          2
15     3          2
16     4          2
17     4          2
18     4          2
19     4          2
20     1          3
21     2          3
22     2          3
23     3          3
24     3          3
25     4          3
26     4          3
27     4          3

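diff() gives each row's change from the previous row, so a negative value marks the point where the step value drops and a new cycle begins. Comparing against 0 turns that into a boolean column, and cumsum() keeps a running count of those drops; adding 1 makes the numbering start at cycle 1.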
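
To see what each stage produces, here is a minimal sketch on a shortened sample. The sample values, the intermediate names (delta, is_new_cycle) and the cycles_with_loop helper are made up for illustration and are not part of the original answer; the helper is only there to check the vectorised result against a plain loop.

```python
import pandas as pd

# Shortened, made-up sample: Step drops twice (4 -> 1), so three cycles are expected
df = pd.DataFrame({'Step': [1, 2, 2, 3, 4, 1, 2, 4, 4, 1, 3]})

delta = df['Step'].diff()   # change from the previous row; NaN for the first row
is_new_cycle = delta < 0    # True where Step dropped (NaN < 0 evaluates to False)
df['CycleTest'] = is_new_cycle.cumsum() + 1  # running count of drops, starting at 1


def cycles_with_loop(steps):
    """Plain-Python reference (hypothetical helper) used to cross-check the result."""
    cycles = []
    cycle = 1
    for i, step in enumerate(steps):
        # A value smaller than its predecessor starts a new cycle
        if i > 0 and step < steps[i - 1]:
            cycle += 1
        cycles.append(cycle)
    return cycles


loop_result = cycles_with_loop(df['Step'].tolist())

print(df)
print(df['CycleTest'].tolist() == loop_result)  # both approaches should agree
```

Note that rolling is not needed here: the comparison only ever looks at one neighbouring row, which is what diff() (or an equivalent shift() comparison) already covers.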

Answered By - user2246849

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
