Issue
I have the following dataframe df:
                      crs         Band1  level
lat       lon
34.595694 32.929028   b''  4.000000e+00   1000
          32.937361   b''  1.200000e+01    950
          32.945694   b''  2.900000e+01    925
34.604028 32.929028   b''  7.000000e+00   1000
          32.937361   b''  1.300000e+01    950
...                   ...            ...    ...
71.179028 25.679028   b''  6.000000e+01    750
71.187361 25.662361   b''  1.000000e+00    725
          25.670694   b''  6.000000e+01   1000
          25.679028   b''  4.000000e+01    800
71.529028 19.387361   b''  1.843913e-38   1000

[17671817 rows x 3 columns]
and two arrays:
lon1=np.arange(-11,47,0.25)
lat1=np.arange(71.5,34.5,-0.25)
These two arrays (lat1, lon1) produce coordinate pairs spaced 0.25 deg apart.
The dataframe df contains points (lat, lon) which are densely spaced between the points defined by the lon1 and lat1 arrays. What I want to do is:
- find (filter) all points from df within 0.125 deg of the points defined by lat1, lon1
- get the max and min value of level from this sub-dataframe and store them in separate arrays of the same size as lon1 and lat1.
What I did so far is filter the dataframe:
for x1 in lon1:
    for y1 in lat1:
        df3 = df[(df.index.get_level_values('lon') > x1 - 0.125) & (df.index.get_level_values('lon') < x1 + 0.125)]
        df3 = df3[(df3.index.get_level_values('lat') > y1 - 0.125) & (df3.index.get_level_values('lat') < y1 + 0.125)]
But this has very slow performance. I believe there is a faster way. I have also tagged scikit-learn since this can probably be done with it, but I lack experience with that package. Any help is appreciated.
Solution
Before we start, let's convert your arrays so they hold the bin edges rather than the bin centers:
lon1=np.arange(-11.125,47.125,0.25)
lat1=np.arange(71.625,34.125,-0.25)
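If you want to run the steps below end to end, here is a minimal, self-contained stand-in for df, built from a few of the rows shown in the question (same (lat, lon) MultiIndex and crs/Band1/level columns); it is purely for illustration, since the real dataframe already exists on your side:
import numpy as np
import pandas as pd

# Toy stand-in for df, using a handful of the rows printed in the question.
idx = pd.MultiIndex.from_tuples(
    [(34.595694, 32.929028), (34.595694, 32.937361), (34.604028, 32.929028),
     (71.179028, 25.679028), (71.187361, 25.670694), (71.529028, 19.387361)],
    names=['lat', 'lon'])
df = pd.DataFrame({'crs': [b''] * 6,
                   'Band1': [4.0, 12.0, 7.0, 60.0, 60.0, 1.843913e-38],
                   'level': [1000, 950, 1000, 750, 1000, 1000]},
                  index=idx)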
Assign latitude and longitude bins for every row (note the reversed order of lat1, otherwise you need to pass ordered=False to pd.cut()):
df['latcat'] = pd.cut(df.index.get_level_values(0), lat1[::-1])
df['loncat'] = pd.cut(df.index.get_level_values(1), lon1)
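For orientation (an equivalent spelling, not part of the original answer): level 0 of the index is lat and level 1 is lon, so the same assignment can be written with level names instead of positions:
df['latcat'] = pd.cut(df.index.get_level_values('lat'), lat1[::-1])
df['loncat'] = pd.cut(df.index.get_level_values('lon'), lon1)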
For your example data we now have:
                      crs         Band1  level            latcat            loncat
lat       lon
34.595694 32.929028   b''  4.000000e+00   1000  (34.375, 34.625]  (32.875, 33.125]
          32.937361   b''  1.200000e+01    950  (34.375, 34.625]  (32.875, 33.125]
          32.945694   b''  2.900000e+01    925  (34.375, 34.625]  (32.875, 33.125]
34.604028 32.929028   b''  7.000000e+00   1000  (34.375, 34.625]  (32.875, 33.125]
          32.937361   b''  1.300000e+01    950  (34.375, 34.625]  (32.875, 33.125]
71.179028 25.679028   b''  6.000000e+01    750  (71.125, 71.375]  (25.625, 25.875]
71.187361 25.662361   b''  1.000000e+00    725  (71.125, 71.375]  (25.625, 25.875]
          25.670694   b''  6.000000e+01   1000  (71.125, 71.375]  (25.625, 25.875]
          25.679028   b''  4.000000e+01    800  (71.125, 71.375]  (25.625, 25.875]
71.529028 19.387361   b''  1.843913e-38   1000  (71.375, 71.625]  (19.375, 19.625]
Now use groupby to get the min and max level in each region:
res = df.groupby([df.latcat.cat.codes, df.loncat.cat.codes])['level'].agg(['min', 'max'])
Which gives you:
          min   max
0   176   925  1000
147 147   725  1000
148 122  1000  1000
The first level of the index is the position in the reversed lat1 array, with -1 meaning "out of range", which applies to some of your data. The second level is the position in the lon1 array.
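If you prefer to see the actual intervals rather than integer positions, the codes map back through the categoricals (a small sketch; the codes 0 and 176 are simply taken from the sample output above):
lat_bins = df['latcat'].cat.categories   # ascending IntervalIndex over latitude
lon_bins = df['loncat'].cat.categories
print(lat_bins[0], lon_bins[176])        # (34.375, 34.625] (32.875, 33.125]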
To convert to matrices as requested:
minlevel = np.full((len(lat1), len(lon1)), np.nan)
maxlevel = np.full((len(lat1), len(lon1)), np.nan)
x = len(lat1) - res.index.get_level_values(0) - 1 # reverse to original order
y = res.index.get_level_values(1)
minlevel[x, y] = res['min']
maxlevel[x, y] = res['max']
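One caveat, depending on your data: groups whose coordinates fell outside the bin edges carry the category code -1, and with the indexing above a -1 latitude code would point one past the end of the matrices, while a -1 longitude code would silently wrap to the last column. A defensive variant (my addition, not part of the original answer) drops those groups before scattering:
# Keep only groups whose lat and lon codes are valid (>= 0, i.e. inside the bins).
valid = (res.index.get_level_values(0) >= 0) & (res.index.get_level_values(1) >= 0)
res_valid = res[valid]

x = len(lat1) - res_valid.index.get_level_values(0) - 1  # back to descending lat order
y = res_valid.index.get_level_values(1)
minlevel[x, y] = res_valid['min']
maxlevel[x, y] = res_valid['max']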
Answered By - John Zwinck