Issue
I am currently working in a Jupyter notebook on Kaggle. After performing the desired transformations on my numpy array, I pickled it so that it could be stored on disk. I did this to free up the memory consumed by the large array.
The memory consumed after pickling the array was about 8.7 GB.
I decided to run this code snippet provided by @jan-glx to find out which variables were consuming my memory:
import sys

def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera, https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

# Print the ten largest variables in the current namespace.
for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
After performing this step I noticed that the size of my array was 3.3 GB, and the size of all the other variables summed together was about 0.1 GB.
I decided to delete the array and see if that would fix the problem, by performing the following:
import gc
del my_array
gc.collect()
After doing this, the memory consumption decreased from 8.7 GB to 5.4 GB. In theory that makes sense, but it still didn't explain what was consuming the rest of the memory.
I decided to continue anyway and reset all my variables to see whether this would free up the memory, with:
%reset
As expected, it freed up the memory of the variables that were printed out by the snippet above, and I was still left with 5.3 GB of memory in use.
One thing to note is that I noticed a memory spike while pickling the file itself, so a summary of the process would be something like this:
- performed operations on array -> memory consumption increased from about 1.9 GB to 5.6 GB
- pickled file -> memory consumption increased from 5.6 GB to about 8.7 GB
- while the file was being pickled, memory suddenly spiked to 15.2 GB, then dropped back to 8.7 GB
- deleted array -> memory consumption decreased from 8.7 GB to 5.4 GB
- performed reset -> memory consumption decreased from 5.4 GB to 5.3 GB
Please note that the above is loosely based on monitoring the memory on Kaggle and may be inaccurate. I have also checked this question but it was not helpful for my case.
Would this be considered a memory leak? If so, what do I do in this case?
EDIT 1:
After some further digging, I noticed that others are facing this problem too. It seems to stem from the pickling process: pickling creates a copy in memory but, for some reason, does not release it. Is there a way to release the memory after the pickling process is complete?
EDIT 2:
I deleted the pickled file from disk using:
!rm my_array
This freed the disk space and also freed up space in memory. I don't know whether this tidbit is of use, but I decided to include it anyway as every bit of info might help.
Solution
There is one basic drawback that you should be aware of: the CPython interpreter can barely free memory and return it to the OS. For most workloads, you can assume that memory is not freed during the lifetime of the interpreter's process. However, the interpreter can re-use the memory internally. So looking at the memory consumption of the CPython process from the operating system's perspective really does not help at all. A rather common workaround is to run memory-intensive jobs in a subprocess / worker process (via multiprocessing, for instance) and "only" return the result to the main process. Once the worker dies, the memory is actually freed.
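The worker-process workaround could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the asker's actual pipeline: `transform_and_save` and `run_in_worker` are hypothetical names, the array of ones is a stand-in for the real transformations, and `np.save` is swapped in for the pickling step. Only a small summary travels back to the main process.

```python
import multiprocessing as mp
import os
import tempfile

import numpy as np


def transform_and_save(n_rows, out_path):
    """Hypothetical memory-heavy job: build an array, save it, return a small summary."""
    arr = np.ones((n_rows, 100), dtype=np.float64)  # stand-in for the real data
    arr *= 2.0                                      # stand-in for the real transformations
    np.save(out_path, arr)                          # the large array goes to disk ...
    return arr.shape, float(arr.sum())              # ... only a tiny result is returned


def run_in_worker(func, *args):
    """Run func(*args) in a short-lived child process and return its result."""
    with mp.Pool(processes=1) as pool:
        # apply() blocks until the worker has finished; leaving the
        # with-block terminates the worker process.
        return pool.apply(func, args)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "my_array.npy")
    shape, total = run_in_worker(transform_and_save, 1_000, path)
    print(shape, total)
```

Because the large array only ever exists inside the worker, the main process never allocates it. On platforms that start child processes with the spawn method, the job function has to live in an importable module rather than in a notebook cell.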
Second, using sys.getsizeof on ndarrays can be impressively misleading. Use the ndarray.nbytes property instead and be aware that this may also be misleading when dealing with views.
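A small sketch of how the two measurements can disagree, using made-up array sizes. For a view, `sys.getsizeof` reports only the array header, while `nbytes` reports the size of the data the view spans even though that data belongs to the base array.

```python
import sys

import numpy as np

arr = np.zeros((1000, 1000), dtype=np.float64)  # owns its 8,000,000-byte buffer
view = arr[:500]                                # a view onto the first half of arr

print(arr.nbytes)            # size of the data buffer in bytes
print(view.nbytes)           # bytes spanned by the view, not extra memory
print(view.base is arr)      # the view borrows arr's buffer
print(view.flags.owndata)    # False: the view owns no data itself
print(sys.getsizeof(view))   # header only, far smaller than view.nbytes
```

Summing `nbytes` over an array and its views therefore counts the same buffer more than once, and deleting `arr` alone does not release the buffer while `view` is still alive.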
Besides, I am not entirely sure why you "pickle" numpy arrays. There are better tools for this job. Just to name two: h5py (a classic, based on HDF5) and zarr. Both libraries allow you to work with ndarray-like objects directly on disk (with compression), essentially eliminating the pickling step. Besides, zarr also allows you to create compressed ndarray-compatible data structures in memory. Most ufuncs from numpy, scipy & friends will happily accept them as input parameters.
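As a rough sketch of the h5py route, assuming h5py is installed: the helper names `save_hdf5` and `read_rows`, the dataset name "my_array", and the array contents are all made up for illustration. The array is written once as a compressed dataset, and later only the requested rows are read back into memory.

```python
import os
import tempfile

import h5py
import numpy as np


def save_hdf5(arr, path, name="my_array"):
    """Write arr to an HDF5 file as a gzip-compressed dataset."""
    with h5py.File(path, "w") as f:
        f.create_dataset(name, data=arr, compression="gzip")


def read_rows(path, start, stop, name="my_array"):
    """Read only rows [start, stop) into memory; the rest stays on disk."""
    with h5py.File(path, "r") as f:
        return f[name][start:stop]


if __name__ == "__main__":
    arr = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
    path = os.path.join(tempfile.mkdtemp(), "my_array.h5")
    save_hdf5(arr, path)
    del arr  # the full array no longer has to stay in memory
    print(read_rows(path, 0, 5).shape)
```

Slicing the dataset returns a regular numpy array containing just that slice, so downstream code can keep working with ndarrays while the bulk of the data stays in the file.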
Answered By - s-m-e