Saturday, October 9, 2021

[FIXED] SciKits BallTree method gives me incorrect "nearest neighbor"

October 09, 2021 data-analysis, nearest-neighbor, python, scikit-learn No comments

Issue

I'm using code from the source given below to get the nearest "site".

Source: https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

My Code:

# Read data from a DB
test_df = pd.read_sql_query(sql, conn)

# Calculates distance between 2 points on a map using lat and long 
# (Source: https://towardsdatascience.com/heres-how-to-calculate-distance-between-2-geolocations-in-python-93ecab5bbba4)
def haversine_distance(lat1, lon1, lat2, lon2):
   r = 6371
   phi1 = np.radians(float(lat1))
   phi2 = np.radians(float(lat2))
   delta_phi = np.radians(lat2 - lat1)
   delta_lambda = np.radians(lon2- lon1)
   a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) *   np.sin(delta_lambda / 2)**2
   res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
   return np.round(res, 2)

test_df["actualDistance (km)"] = test_df.apply(lambda row: haversine_distance(row['ClientLat'],row['ClientLong'],row['actual_SLa'],row['actual_SLo']), axis=1)

test_gdf = geopandas.GeoDataFrame(test_df, geometry=geopandas.points_from_xy(test_df.ClientLong, test_df.ClientLat))
site_gdf = geopandas.GeoDataFrame(site_df, geometry=geopandas.points_from_xy(site_df.SiteLong, site_df.SiteLat))

#-------Set up the functions as shown in the tutorial-------

def get_nearest(src_points, candidates, k_neighbors=1):
    """Find nearest neighbors for all source points from a set of candidate points"""

    # Create tree from the candidate points
    tree = BallTree(candidates, leaf_size=15, metric='haversine')

    # Find closest points and distances
    distances, indices = tree.query(src_points, k=k_neighbors)

    # Transpose to get distances and indices into arrays
    distances = distances.transpose()
    indices = indices.transpose()

    # Get closest indices and distances (i.e. array at index 0)
    # note: for the second closest points, you would take index 1, etc.
    closest = indices[0]
    closest_dist = distances[0]

    # Return indices and distances
    return (closest, closest_dist)


def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
    """
    For each point in left_gdf, find closest point in right GeoDataFrame and return them.

    NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
    """

    left_geom_col = left_gdf.geometry.name
    right_geom_col = right_gdf.geometry.name

    # Ensure that index in right gdf is formed of sequential numbers
    right = right_gdf.copy().reset_index(drop=True)

    # Parse coordinates from points and insert them into a numpy array as RADIANS
    left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
    right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())

    # Find the nearest points
    # -----------------------
    # closest ==> index in right_gdf that corresponds to the closest point
    # dist ==> distance between the nearest neighbors (in meters)

    closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)

    # Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
    closest_points = right.loc[closest]

    # Ensure that the index corresponds the one in left_gdf
    closest_points = closest_points.reset_index(drop=True)

    # Add distance if requested
    if return_dist:
        # Convert to meters from radians
        earth_radius = 6371000  # meters
        closest_points['distance'] = dist * earth_radius

    return closest_points

closest_sites = nearest_neighbor(test_gdf, site_gdf, return_dist=True)

# Rename the geometry of closest sites gdf so that we can easily identify it
closest_sites = closest_sites.rename(columns={'geometry': 'closest_site_geom'})

# Merge the datasets by index (for this, it is good to use '.join()' -function)
test_gdf = test_gdf.join(closest_sites)

#Extracted closest site latitude and longitude for data analysis
test_gdf['CS_lo'] = test_gdf.closest_site_geom.apply(lambda p: p.x)
test_gdf['CS_la'] = test_gdf.closest_site_geom.apply(lambda p: p.y)

The code is a replica of the tutorial link I provided. And based on their explanation it should've worked.

To verify this data I got some statistical data using .describe(), and it showed me that the tutorials method did indeed give me a mean distance that was much closer than the distance in the actual data (792 m vs the actual distance which was 1.80 km). Closest Distance generated using the BallTree method Actual Distance in the data

However when I plotted them out on a map using plotly I noticed that the BallTree method's outputs weren't closer than the "actual" distance. This is generally what the plotted data looks like (Blue: predetermined site, Red: site predicted using the BallTree method Could someone help me track down the discrepancy

Solution

I'm not sure why this works but it did. I decided to just write the code based on the docs instead of following the tutorial and this worked:

# Build BallTree with haversine distance metric, which expects (lat, lon) in radians and returns distances in radians
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(np.radians(site_df[['SiteLat', 'SiteLong']]), metric=dist)

test_coords = np.radians(test_df[['ClientLat', 'ClientLong']])
dists, ilocs = tree.query(test_coords)

Answered By - JoeShmo

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Saturday, October 9, 2021

[FIXED] SciKits BallTree method gives me incorrect "nearest neighbor"

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels