Tuesday, April 19, 2022

[FIXED] How to iterate over rows of each column in a dataframe

April 19, 2022 data-science, dataframe, numpy, pandas, python No comments

Issue

My current code functions and produces a graph if there is only 1 sensor, i.e. if col2, and col3 are deleted in the example data provided below, leaving one column.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}

df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)

index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3

baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = DataFrame(baseline)
baseline_num = DataFrame(baseline_num)


v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)

sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count  = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000

for i in range(0, (rows)): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.

    v[i] = dfn.loc[i].to_numpy().sum() - sensors


for colname,colitems in df.iteritems():
    for rownum,rowitem in colitems.iteritems():

        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]

        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0

        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, (off_require)):
                    baseline_num[colname][rownum] += df[colname][rownum - x]

            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established

        baseline[colname][rownum] = ((baseline_num[colname][rownum])//(off_require)) #mean of the last "off_require" points



dfx = DataFrame(v, columns =['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)

dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)

Basically, instead of 1 graph as this produces, I would like to iterate over each column only for this loop:

for colname,colitems in df.iteritems():
    for rownum,rowitem in colitems.iteritems():

        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]

        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0

        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, (off_require)):
                    baseline_num[colname][rownum] += df[colname][rownum - x]

            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established

I've tried some other solutions from other questions but none of them seem to solve this case. As of now, I've tried multiple conversions to things like lists and tuples, and then calling them something like this:

baseline_num[i,column] += d[i - x,column]

as well as

baseline_num[i][column += d[i - x][column]

while iterating over the loop using

for column in columns

However no matter how I seem to arrange the solution, there is always some keyerror of expecting integer or slice indices, among other errors. See pictures for expected/possible outputs of one column on actual data.with varying input parameters (sensitivity value, and off_require is varied in different cases.) One such solution which didn't work was the looping method from this link:

https://www.geeksforgeeks.org/iterating-over-rows-and-columns-in-pandas-dataframe/

I've also tried creating a loop using iteritems as the outer loop. This did not function as well.

Below are links to possible graph outputs for various sensitivity values, and windows in my actual dataset, with only one column. (i.e i manually deleted other columns, and plotted just the one using the current program)

sensitivity 1000, window 8

sensitivity 800, window 5

sensitivity 1500, window 5

If there's anything I've left out that would be helpful to solving this, please let me know so I can rectify it immediately.

See this picture for my original df.head: df.head

Solution

Did you try,

for colname,colitems in df.iteritems():
    for rownum,rowitem in colitems.iteritems():
        print(df[colname][rownum])

The first loop iterates over all the columns, and the 2nd loops iterates over all the rows for that column.

edit:

From our conversation below, I think that your baseline and df dataframes don't have the same column names because of how you created them and how you are accessing the elements.

My suggestion is that you create the baseline dataframe to be a copy of your df dataframe and edit the information within it from there.

Edit:

I have managed to make your code work for 1 loop, but I run into an index error, I am not sure what your optimisation function does but i think that is what is causing it, take a look.

It is this part baseline_num[colname][rownum - 1], in the second loop i guess because you do rownum (0) -1, you get index -1. You need to change it so that in the first loop rownum is 1 or something, I am not sure what you are trying to do there.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}

df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)

index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3

baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = pd.DataFrame(df)
baseline_num = pd.DataFrame(df)
#print(baseline_num)


v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)

sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count  = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000

for i in range(0, (rows)): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.

    v[i] = dfn.loc[i].to_numpy().sum() - sensors


for colname,colitems in df.iteritems():
    #print(colname)
    for rownum,rowitem in colitems.iteritems():
        #print(rownum)
        #display(baseline[colname][rownum])
        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]

        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0

        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, (off_require)):
                    baseline_num[colname][rownum] += df[colname][rownum - x]

            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established

        baseline[colname][rownum] = ((baseline_num[colname][rownum])//(off_require)) #mean of the last "off_require" points

        print(baseline[colname][rownum])


dfx = pd.DataFrame(v, columns =['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)

dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)

My output looks like this,

-324.0
-238.0
-314.0
-276.0
-264.0
-306.0
-371.0
-806.0
638.0
-412.0

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    354                 try:
--> 355                     return self._range.index(new_key)
    356                 except ValueError as err:

ValueError: -1 is not in range


The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

3 frames

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    355                     return self._range.index(new_key)
    356                 except ValueError as err:
--> 357                     raise KeyError(key) from err
    358             raise KeyError(key)
    359         return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: -1

Answered By - anarchy

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, April 19, 2022

[FIXED] How to iterate over rows of each column in a dataframe

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels