Issue
I have been following this tutorial on how to find the nearest neighbors of a point with scikit-learn.
However, when it comes to displaying the data, the tutorial merely mentions that "the indices can be mapped to useful values and the two arrays merged with the rest of the data".
But there's no actual explanation of how to do this. I'm not very well-versed in Pandas and I don't know how to perform this merge, so I just end up with two two-dimensional arrays and I don't know how to map them to the original data to study the example and experiment with it.
This is the code:
import numpy as np
from sklearn.neighbors import BallTree, KDTree
import pandas as pd
# Column names for the example DataFrame.
column_names = ["STATION NAME", "LAT", "LON"]
# A list of locations that will be used to construct the binary
# tree.
locations_a = [['BEAUFORT', 32.4, -80.633],
               ['CONWAY HORRY COUNTY AIRPORT', 33.828, -79.122],
               ['HUSTON/EXECUTIVE', 29.8, -95.9],
               ['ELIZABETHTON MUNI', 36.371, -82.173],
               ['JACK BARSTOW AIRPORT', 43.663, -84.261],
               ['MARLBORO CO JETPORT H E AVENT', 34.622, -79.734],
               ['SUMMERVILLE AIRPORT', 33.063, -80.279]]
# A list of locations that will be used to construct the queries
# for neighbors.
locations_b = [['BOOMVANG HELIPORT / OIL PLATFORM', 27.35, -94.633],
               ['LEE COUNTY AIRPORT', 36.654, -83.218],
               ['ELLINGTON', 35.507, -86.804],
               ['LAWRENCEVILLE BRUNSWICK MUNI', 36.773, -77.794],
               ['PUTNAM CO', 39.63, -86.814]]
# Converting the lists to DataFrames. We will build the tree with
# the first and execute the query on the second.
locations_a = pd.DataFrame(locations_a, columns = column_names)
locations_b = pd.DataFrame(locations_b, columns = column_names)
# Creates new columns converting coordinate degrees to radians.
for column in locations_a[["LAT", "LON"]]:
    rad = np.deg2rad(locations_a[column].values)
    locations_a[f'{column}_rad'] = rad
for column in locations_b[["LAT", "LON"]]:
    rad = np.deg2rad(locations_b[column].values)
    locations_b[f'{column}_rad'] = rad
# Takes the first group's latitude and longitude values to construct
# the ball tree.
ball = BallTree(locations_a[["LAT_rad", "LON_rad"]].values, metric='haversine')
# The amount of neighbors to return.
k = 1
# Executes a query with the second group. This will also return two
# arrays.
distances, indices = ball.query(locations_b[["LAT_rad", "LON_rad"]].values, k = k)
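# Converting to kilometers: the haversine metric returns distances in
# radians on the unit sphere, so multiply by Earth's mean radius (6371 km).
distances = distances * 6371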
So how do I take distances and indices and map them to my DataFrame to visually see the nearest neighbor of each point?
Solution
Each integer in indices refers to an index value (row number) of locations_a. Because locations_a has the default integer index, row labels and row positions coincide, so you can use locations_a.loc[] to convert these indices to their corresponding station names as a NumPy array:
nearest_station_names = locations_a.loc[indices.flatten()]['STATION NAME'].to_numpy()
(Why indices.flatten() instead of just indices? ball.query returns distances and indices as two-dimensional NumPy arrays with one row per query point and k columns, so with k = 1 the second dimension has size 1. For indices to work in df.loc[], you need to "flatten" it into a one-dimensional array whose only dimension is the number of rows.)
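To make the shape issue concrete, here is a minimal sketch using toy stand-ins rather than the station data. The `stations` frame and the values in `indices` are made up for illustration; only the shapes mirror what the query returns for k = 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the stations from the question.
stations = pd.DataFrame({"STATION NAME": ["A", "B", "C"]})

# Same layout as the query result for k = 1:
# one row per query point, one column.
indices = np.array([[2], [0], [2]])

# Collapse shape (3, 1) into shape (3,).
flat = indices.flatten()

# Look up one station name per query point, as a plain array.
names = stations.loc[flat, "STATION NAME"].to_numpy()

print(indices.shape, flat.shape)
print(names)
```

For k = 1, indices[:, 0] or indices.ravel() would give the same one-dimensional array.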
Next, insert the names as a new column into locations_b:
locations_b['nearest_stn'] = nearest_station_names
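The .to_numpy() call in the lookup matters at this step. A pandas Series assigned to a column is aligned on index labels, not on position, and the looked-up Series carries the row labels of locations_a. A small sketch of the difference, using hypothetical toy frames (`targets` and `looked_up` are illustrative names, not part of the original code):

```python
import pandas as pd

# Hypothetical toy frames, not the station data from the question.
targets = pd.DataFrame({"query": ["q0", "q1", "q2"]})

# A lookup result whose index holds the source row labels.
looked_up = pd.Series(["C", "A", "B"], index=[2, 0, 1])

# Assigning the Series aligns on index labels, so values get reordered.
targets["aligned"] = looked_up

# Assigning the bare array keeps the positional order of the lookup.
targets["positional"] = looked_up.to_numpy()

print(targets)
```

Only the "positional" column keeps the order in which the neighbors were returned, which is what the nearest-neighbor mapping needs.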
Then insert distances as another new column (no need to call .flatten() in this case):
locations_b['nearest_stn_dist'] = distances
# Print without radian columns for brevity
print(locations_b.drop(columns=['LAT_rad', 'LON_rad']))
                       STATION NAME     LAT      LON                    nearest_stn  nearest_stn_dist
0  BOOMVANG HELIPORT / OIL PLATFORM  27.350  -94.633               HUSTON/EXECUTIVE        299.198339
1                LEE COUNTY AIRPORT  36.654  -83.218              ELIZABETHTON MUNI         98.550423
2                         ELLINGTON  35.507  -86.804              ELIZABETHTON MUNI        427.798176
3      LAWRENCEVILLE BRUNSWICK MUNI  36.773  -77.794  MARLBORO CO JETPORT H E AVENT        296.458070
4                         PUTNAM CO  39.630  -86.814           JACK BARSTOW AIRPORT        496.025005
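As a sanity check on the distance column, the first row can be recomputed by hand with the haversine formula. This is a rough sketch, not part of the original answer: `haversine_km` is a hypothetical helper, and the 6371 km mean Earth radius is an assumed constant. It should come out at about 299 km, in line with row 0 of the table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# BOOMVANG HELIPORT / OIL PLATFORM -> HUSTON/EXECUTIVE (row 0 of the table)
d = haversine_km(27.35, -94.633, 29.8, -95.9)
print(round(d, 1))
```

If the distances in your own DataFrame are off by a factor of 1000, the radius used in the conversion is the first thing to check.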
Answered By - Peter Leimbigler