Tuesday, October 5, 2021

[FIXED] How do I map a numpy array and an indices array to a pandas dataframe?

Labels: dataframe, numpy, pandas, python, scikit-learn

Issue

I have been following this tutorial on how to find the nearest neighbors of a point with scikit-learn.

However, when it comes to displaying the data, the tutorial merely mentions that "the indices can be mapped to useful values and the two arrays merged with the rest of the data".

But there's no actual explanation of how to do this. I'm not very well-versed in pandas and I don't know how to perform this merge, so I just end up with two two-dimensional arrays, and I don't know how to map them back to the original data to study the example and experiment with it.

This is the code:

import numpy as np
from sklearn.neighbors import BallTree, KDTree
import pandas as pd

# Column names for the example DataFrame.
column_names = ["STATION NAME", "LAT", "LON"]

# A list of locations that will be used to construct the binary
# tree.
locations_a = [['BEAUFORT', 32.4, -80.633],
       ['CONWAY HORRY COUNTY AIRPORT', 33.828, -79.122],
       ['HUSTON/EXECUTIVE', 29.8, -95.9],
       ['ELIZABETHTON MUNI', 36.371, -82.173],
       ['JACK BARSTOW AIRPORT', 43.663, -84.261],
       ['MARLBORO CO JETPORT H E AVENT', 34.622, -79.734],
       ['SUMMERVILLE AIRPORT', 33.063, -80.279]]

# A list of locations that will be used to construct the queries
# for neighbors.
locations_b = [['BOOMVANG HELIPORT / OIL PLATFORM', 27.35, -94.633],
       ['LEE COUNTY AIRPORT', 36.654, -83.218],
       ['ELLINGTON', 35.507, -86.804],
       ['LAWRENCEVILLE BRUNSWICK MUNI', 36.773, -77.794],
       ['PUTNAM CO', 39.63, -86.814]]

# Converting the lists to DataFrames. We will build the tree with
# the first and execute the query on the second.

locations_a = pd.DataFrame(locations_a, columns = column_names)
locations_b = pd.DataFrame(locations_b, columns = column_names)

# Creates new columns converting coordinate degrees to radians.
for column in locations_a[["LAT", "LON"]]:
    rad = np.deg2rad(locations_a[column].values)
    locations_a[f'{column}_rad'] = rad
for column in locations_b[["LAT", "LON"]]:
    rad = np.deg2rad(locations_b[column].values)
    locations_b[f'{column}_rad'] = rad

# Takes the first group's latitude and longitude values to construct
# the ball tree.
ball = BallTree(locations_a[["LAT_rad", "LON_rad"]].values, metric='haversine')

# The number of neighbors to return.
k = 1

# Executes a query with the second group. This returns two
# arrays: the distances (in radians) and the indices of the neighbors.
distances, indices = ball.query(locations_b[["LAT_rad", "LON_rad"]].values, k = k)
# Converting from radians to kilometers (Earth's mean radius is about 6371 km).
distances = distances * 6371

So how do I take distances and indices and map them to my dataframe to visually see the nearest neighbor of each point?

Solution

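Each integer in indices is the row position in locations_a of the nearest station. Because locations_a has the default integer index (0, 1, 2, ...), row positions and index labels coincide, so you can use locations_a.loc[] to convert these indices to their corresponding station names as a numpy array: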

nearest_station_names = locations_a.loc[indices.flatten()]['STATION NAME'].to_numpy()

(Why indices.flatten() instead of just indices? ball.query returns distances and indices as two-dimensional numpy arrays of shape (number of query points, k). Here k = 1, so the second dimension (the number of columns) is 1. For indices to work in df.loc[], you need to "flatten" it into a one-dimensional array whose only dimension is the number of rows.)
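The shape difference can be seen in a small sketch. The indices array here is a hand-written stand-in for what ball.query returns with k = 1, and stations is a hypothetical, abbreviated copy of locations_a. Neither comes from actually running the query:

```python
import numpy as np
import pandas as pd

# Stand-in (assumed values) for the (n_queries, k) index array with k = 1.
indices = np.array([[2], [3], [3], [5], [4]])

# Hypothetical, abbreviated copy of locations_a with its default 0..6 index.
stations = pd.DataFrame({"STATION NAME": [
    "BEAUFORT", "CONWAY", "HUSTON", "ELIZABETHTON",
    "JACK BARSTOW", "MARLBORO", "SUMMERVILLE"]})

# Collapse the single-column 2-D array into a 1-D array of row labels.
flat = indices.flatten()
print(indices.shape, flat.shape)  # (5, 1) (5,)

# Look up one station name per query point; repeated labels are allowed.
names = stations.loc[flat, "STATION NAME"].to_numpy()
print(names)
```

If the DataFrame did not have the default integer index, stations.iloc[flat] would be the position-based equivalent.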

Next, insert the names as a new column into locations_b:

locations_b['nearest_stn'] = nearest_station_names
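The .to_numpy() call in the lookup matters here. Assigning a pandas Series to a column aligns values on index labels rather than on position, which would scramble the result (and can fail outright when labels repeat). A minimal sketch of the difference, using two hypothetical frames a and b:

```python
import pandas as pd

# Hypothetical frames: `a` plays the role of the lookup table,
# `b` the frame that receives the new column.
a = pd.DataFrame({"name": ["x", "y", "z"]})
b = pd.DataFrame({"q": [10, 20, 30]})

# Selecting rows 2, 0, 1 yields a Series whose index is [2, 0, 1].
picked = a.loc[[2, 0, 1], "name"]

b["as_array"] = picked.to_numpy()  # assigned by position
b["as_series"] = picked            # re-aligned on b's index labels 0, 1, 2

print(b)
```

Only the as_array column keeps the order in which the rows were selected.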

Then insert distances as another new column (no need to .flatten() in this case, because with k = 1 the array has a single column):

locations_b['nearest_stn_km'] = distances

# Print without radian columns for brevity
print(locations_b.drop(columns=['LAT_rad', 'LON_rad']))

                       STATION NAME     LAT     LON                    nearest_stn  nearest_stn_km
0  BOOMVANG HELIPORT / OIL PLATFORM  27.350 -94.633               HUSTON/EXECUTIVE      299.198339
1                LEE COUNTY AIRPORT  36.654 -83.218              ELIZABETHTON MUNI       98.550423
2                         ELLINGTON  35.507 -86.804              ELIZABETHTON MUNI      427.798176
3      LAWRENCEVILLE BRUNSWICK MUNI  36.773 -77.794  MARLBORO CO JETPORT H E AVENT      296.458070
4                         PUTNAM CO  39.630 -86.814           JACK BARSTOW AIRPORT      496.025005
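As an optional sanity check on the kilometre values, the first row can be recomputed by hand with the haversine formula. This is a minimal NumPy sketch: the helper name haversine_km is hypothetical, and the Earth radius of 6371 km is an assumed mean value.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = np.deg2rad([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# BOOMVANG HELIPORT / OIL PLATFORM -> HUSTON/EXECUTIVE
d = haversine_km(27.35, -94.633, 29.8, -95.9)
print(round(d, 1))
```

The result should land close to the 299.198339 km shown in the first row of the table.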

Answered By - Peter Leimbigler

This answer was collected from Stack Overflow and tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
