Issue
Problem
I'm trying to clip a very large block model (a 5.8 GB CSV file) containing centroid x, y, and z coordinates with an elevation raster, keeping only the blocks that lie just above the raster surface.
I usually do this in ArcGIS by clipping my block model points to the outline of my raster and then extracting the raster values to the block model points. For large datasets this takes an ungodly amount of time (yes, that's a technical term) in ArcGIS.
How I want to solve it
I want to speed this up by importing the CSV to Python. Using Dask, this is quick and easy:
import dask
from dask import dataframe as dd
BM = dd.read_csv(BM_path, skiprows=2,names=["X","Y","Z","Lith"])
But creating a GeoDataFrame using geopandas is not a fast process whatsoever. I thought that speeding it up using the following multiprocessing code might work:
import multiprocessing as mp
from multiprocessing import pool
import geopandas as gpd
pool=mp.Pool(mp.cpu_count())
geometry = pool.apply(gpd.points_from_xy, args=(BM.X,BM.Y,BM.Z))
pool.close()
However, I am an hour into waiting for this to process with no end in sight.
I also tried building the entire GeoDataFrame in one step with the following code, but I know it contains syntax errors that I don't know how to correct, particularly when passing "geometry=" through args=:
pool = mp.Pool(mp.cpu_count())
results = pool.apply(gpd.GeoDataFrame, args=(BM,geometry=(BM.X,BM.Y,BM.Z)))
pool.close()
I was wondering if anyone had a better idea as to how I could speed this up and make this process more efficient, whether or not I am able to parallelize.
Solution
The optimal way of linking dask and geopandas is the dask-geopandas package.
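import dask.dataframe as dd
import dask_geopandas
# Lazily read the CSV into a partitioned dask DataFrame
BM = dd.read_csv(BM_path, skiprows=2, names=["X", "Y", "Z", "Lith"])
# Build point geometries from the coordinate columns, partition by partition
BM["geometry"] = dask_geopandas.points_from_xy(BM, "X", "Y", "Z")
# Wrap the result as a dask_geopandas.GeoDataFrame
gdf = dask_geopandas.from_dask_dataframe(BM, geometry="geometry")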
This gives you a partitioned dask_geopandas.GeoDataFrame. If you want to convert it to a standard geopandas.GeoDataFrame, just call compute(), which loads the full result into memory.
gpd_gdf = gdf.compute()
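As for why the multiprocessing attempt in the question did not help: Pool.apply hands a single call to a single worker and blocks until that call returns, so nothing is split across cores. It also takes keyword arguments through kwds=, not inside the args tuple. The sketch below illustrates both points on toy data. It is only an illustration under stated assumptions: make_points and make_points_chunk are hypothetical stand-ins for a point constructor such as gpd.points_from_xy, and ThreadPool is used because it exposes the same apply/map interface as mp.Pool while running anywhere without an `if __name__ == "__main__":` guard.

```python
from multiprocessing.pool import ThreadPool


def make_points(xs, ys, zs, scale=1.0):
    """Hypothetical stand-in for a point constructor: returns (x, y, z) tuples."""
    return [(x * scale, y * scale, z * scale) for x, y, z in zip(xs, ys, zs)]


def make_points_chunk(chunk):
    """Unpack one (xs, ys, zs) chunk so it can be passed to pool.map."""
    xs, ys, zs = chunk
    return make_points(xs, ys, zs)


# Toy coordinates standing in for the block model columns
xs = [0.0, 1.0, 2.0, 3.0]
ys = [10.0, 11.0, 12.0, 13.0]
zs = [5.0, 6.0, 7.0, 8.0]

with ThreadPool(2) as pool:
    # apply(): ONE call, run by ONE worker; the caller blocks until it returns.
    # Keyword arguments are passed via kwds=, not inside the args tuple.
    single = pool.apply(make_points, args=(xs, ys, zs), kwds={"scale": 1.0})

    # map(): one call PER CHUNK, so the work can be spread across workers.
    chunks = [(xs[i:i + 2], ys[i:i + 2], zs[i:i + 2]) for i in range(0, len(xs), 2)]
    parts = pool.map(make_points_chunk, chunks)

# Flatten the per-chunk results back into one list
chunked = [pt for part in parts for pt in part]
```

Chunking by hand like this is roughly what dask does for you with its partitions, which is why the dask-geopandas route above is the simpler choice here.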
Answered By - martinfleis