Friday, April 1, 2022

[FIXED] Faster methods to create geodataframe from a Dask or Pandas dataframe

Tags: geopandas, geospatial, gis, pandas, python

Issue

Problem

I'm trying to clip a very large block model (a 5.8 GB CSV file) containing centroid x, y, and z coordinates against an elevation raster. The goal is to keep only the blocks lying just above the raster surface.

I usually do this in ArcGIS by clipping my block model points to the outline of my raster and then extracting the raster values to the block model points. For large datasets this takes an ungodly amount of time (yes, that's a technical term) in ArcGIS.
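As a rough sketch of the selection step itself (not part of the original question), suppose the raster elevation has already been sampled onto each block as a column. The column name `elev` and the toy values below are assumptions for illustration; the idea is to keep, for each (X, Y) column of blocks, the lowest block whose centroid is above the surface.

```python
import pandas as pd

# Toy block model: two (X, Y) columns of blocks, three levels each.
# "elev" is a hypothetical column holding the raster value sampled at (X, Y).
blocks = pd.DataFrame({
    "X":    [0.0, 0.0, 0.0, 10.0, 10.0, 10.0],
    "Y":    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Z":    [95.0, 105.0, 115.0, 95.0, 105.0, 115.0],
    "Lith": ["a", "b", "c", "a", "b", "c"],
    "elev": [100.0, 100.0, 100.0, 110.0, 110.0, 110.0],
})

# Keep only blocks whose centroid is above the surface.
above = blocks[blocks["Z"] > blocks["elev"]]

# Within each (X, Y) column, keep the lowest remaining block.
lowest_idx = above.groupby(["X", "Y"])["Z"].idxmin()
just_above = above.loc[lowest_idx].reset_index(drop=True)
```

Whether "just above" should mean the single lowest block or every block within some height band is a modelling choice the question does not settle; the sketch assumes the former.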

How I want to solve it

I want to speed this up by importing the CSV to Python. Using Dask, this is quick and easy:

from dask import dataframe as dd

# BM_path is the path to the block model CSV.
# Skip the first two lines and name the columns explicitly.
BM = dd.read_csv(BM_path, skiprows=2, names=["X", "Y", "Z", "Lith"])

But creating a GeoDataFrame using geopandas is not a fast process whatsoever. I thought that speeding it up using the following multiprocessing code might work:

import multiprocessing as mp
from multiprocessing import pool
import geopandas as gpd

pool=mp.Pool(mp.cpu_count())
geometry = pool.apply(gpd.points_from_xy, args=(BM.X,BM.Y,BM.Z))
pool.close()

However, I am an hour into waiting for this to process with no end in sight.
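One likely reason the attempt above does not speed anything up: `Pool.apply` makes a single call in a single worker process and blocks until it returns, so the work is never split up. A minimal sketch of the chunked alternative follows, using only the standard library. `make_points` is a hypothetical stand-in for a per-chunk call such as `gpd.points_from_xy`, and the chunk count is arbitrary.

```python
import multiprocessing as mp


def make_points(chunk):
    """Hypothetical stand-in for building geometry from one chunk of coordinates."""
    xs, ys, zs = chunk
    return list(zip(xs, ys, zs))


def split(xs, ys, zs, n_chunks):
    """Split three equal-length lists into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(xs) // n_chunks))  # ceiling division
    return [
        (xs[i:i + size], ys[i:i + size], zs[i:i + size])
        for i in range(0, len(xs), size)
    ]


def build_points(xs, ys, zs, pool, n_chunks=4):
    """Map make_points over the chunks with the given pool and flatten the result."""
    parts = pool.map(make_points, split(xs, ys, zs, n_chunks))
    return [point for part in parts for point in part]


if __name__ == "__main__":
    xs, ys, zs = [0, 1, 2, 3, 4], [10, 11, 12, 13, 14], [5, 5, 6, 6, 7]
    with mp.Pool(2) as pool:
        print(build_points(xs, ys, zs, pool, n_chunks=2))
```

Whether this helps in practice depends on how expensive the per-chunk work is compared with pickling the inputs and results between processes; for cheap calls the transfer cost can dominate.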

I have also tried building the entire GeoDataFrame in one step with the following code, but it contains syntax errors that I don't know how to correct, particularly in passing "geometry=" through args=:

pool = mp.Pool(mp.cpu_count())
results = pool.apply(gpd.GeoDataFrame, args=(BM,geometry=(BM.X,BM.Y,BM.Z)))
pool.close()
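On the syntax error: `args=` takes a plain tuple of positional arguments, so `geometry=...` cannot appear inside it. `Pool.apply` takes keyword arguments separately, as a dict passed to `kwds=`. The sketch below shows the calling convention with a hypothetical `describe` function. It uses `ThreadPool`, which shares the `apply` signature with `mp.Pool`, so that it runs without pickling concerns. It only illustrates the syntax and does not make GeoDataFrame construction any faster.

```python
from multiprocessing.pool import ThreadPool


def describe(table, geometry=None):
    """Hypothetical function standing in for a constructor with a keyword argument."""
    return f"{table} with geometry={geometry}"


with ThreadPool(2) as pool:
    # Positional arguments go in args=, keyword arguments in kwds=.
    result = pool.apply(describe, args=("BM",), kwds={"geometry": "points"})

print(result)
```

Note also that a tuple of three columns is not a geometry array; the coordinates would still need to be turned into points first, for example with `gpd.points_from_xy`.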

I was wondering if anyone had a better idea as to how I could speed this up and make this process more efficient, whether or not I am able to parallelize.

Solution

The optimal way of linking dask and geopandas is the dask-geopandas package.

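from dask import dataframe as dd
import dask_geopandas

# Read the CSV lazily as a partitioned Dask DataFrame.
BM = dd.read_csv(BM_path, skiprows=2, names=["X", "Y", "Z", "Lith"])
# Build point geometries partition by partition from the X, Y, Z columns.
BM["geometry"] = dask_geopandas.points_from_xy(BM, "X", "Y", "Z")
# Convert to a dask_geopandas.GeoDataFrame using that geometry column.
gdf = dask_geopandas.from_dask_dataframe(BM, geometry="geometry")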

This gives you a partitioned dask_geopandas.GeoDataFrame. If you want to convert it to a standard geopandas.GeoDataFrame, call compute(), which loads all partitions into memory:

gpd_gdf = gdf.compute()
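For a plain pandas DataFrame that fits in memory, the same idea can be sketched without Dask. The example below is an illustration on toy data, not part of the original answer; it passes the coordinate columns to `geopandas.points_from_xy`, which builds the whole geometry array in a single call rather than point by point in Python code.

```python
import geopandas as gpd
import pandas as pd

# Toy stand-in for the block model; the values are made up.
df = pd.DataFrame({
    "X": [0.0, 10.0, 20.0],
    "Y": [0.0, 5.0, 10.0],
    "Z": [100.0, 110.0, 120.0],
    "Lith": ["a", "b", "c"],
})

# Build all 3D points at once and attach them as the geometry column.
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["X"], df["Y"], df["Z"]))

print(gdf.geometry.iloc[0])
```

No coordinate reference system is set here; if the block model has a known CRS, it can be passed via the `crs` argument.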

Answered By - martinfleis

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
