Issue
The xarray package in Python seems to use "lazy loading" to point to structured data on disk (e.g. netCDF, HDF5), then loads the data into memory only "when necessary." How can I check whether a given Dataset or DataArray object, in an interactive Python session or in a script, is actually loaded?
Ideally, something like
import xarray as xr
dataset = xr.open_dataset('data.nc')
dataset.is_loaded() # is it loaded into memory?
I'm not sure if this is a meaningful question, but I want to be able to safely and confidently control this behavior for giant datasets, so the whole file doesn't get read unnecessarily.
Solution
This isn't currently possible using the public API, but the information is available through private APIs. If you look at DataArray.variable._data on an array loaded from disk, you'll see a MemoryCachedArray object (as of xarray v0.9) if it's being cached:
>>> import xarray
>>> xarray.DataArray([[1, 2], [3, 4]]).to_netcdf('foo.nc')
>>> array = xarray.open_dataarray('foo.nc')
>>> array.variable._data
MemoryCachedArray(array=CopyOnWriteArray(array=LazilyIndexedArray(array=ScipyArrayWrapper(array=array([[1, 2],
[3, 4]], dtype=int32)), key=(slice(None, None, None), slice(None, None, None)))))
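Since there's no public API for this, one stopgap is to wrap that private-attribute check in a small helper. This is only a sketch: appears_loaded is a hypothetical name, and relying on the private _data attribute may break between xarray versions.

```python
import numpy as np
import xarray as xr

def appears_loaded(da):
    """Hypothetical helper: True if the DataArray's values are already a
    plain in-memory numpy array. Relies on the private ``_data`` attribute,
    so it may break across xarray versions."""
    return isinstance(da.variable._data, np.ndarray)

arr = xr.DataArray([[1, 2], [3, 4]])
print(appears_loaded(arr))   # constructed in memory -> True

arr.to_netcdf('foo.nc')
lazy = xr.open_dataarray('foo.nc')
print(appears_loaded(lazy))  # lazy wrapper around the file -> False

lazy.load()                  # force the read from disk
print(appears_loaded(lazy))  # now backed by a numpy array -> True
```

Calling .load() (a public method) replaces the lazy wrapper with an actual numpy array, which is why the check flips to True afterwards.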
If your data is large enough that you're concerned about caching being problematic, I definitely recommend opening any files with cache=False, e.g., xarray.open_dataarray('foo.nc', cache=False). In that case, you won't see the MemoryCachedArray object in _data:
>>> array.variable._data
CopyOnWriteArray(array=LazilyIndexedArray(array=ScipyArrayWrapper(array=array([[1, 2],
[3, 4]], dtype=int32)), key=(slice(None, None, None), slice(None, None, None))))
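For the giant-dataset workflow from the question, cache=False pairs well with subsetting before an explicit .load(): only the selected slice is ever read into memory. A minimal sketch (the file, variable name 'temperature', and dimension 'time' are made-up stand-ins for the question's data.nc):

```python
import numpy as np
import xarray as xr

# Create a small example file as a stand-in for the question's giant dataset.
xr.Dataset(
    {'temperature': (('time', 'x'), np.arange(6.0).reshape(2, 3))}
).to_netcdf('data.nc')

# cache=False keeps xarray from holding values in memory after access,
# so repeated reads go back to disk instead of accumulating in RAM.
ds = xr.open_dataset('data.nc', cache=False)

subset = ds['temperature'].isel(time=0)  # still lazy: nothing read yet
loaded = subset.load()                   # reads only this slice into memory
ds.close()

print(loaded.values)  # -> [0. 1. 2.]
```

Indexing operations like .isel() stay lazy, so the disk read happens only at the final .load() and covers just the requested slice.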
If you still think you need to be able to check whether caching is possible on existing xarray objects, please raise an issue on our GitHub page to discuss potential new API.
Answered By - shoyer