Issue
[Running in a Jupyter Lab environment] When training my CNN on TensorFlow with:
history = model.fit(
    train_generator,
    steps_per_epoch=3,
    epochs=5,
    verbose=1,
)
I get an 'OOM when allocating tensor with shape' error when I run my algorithm.
From what I understand, this means I'm running out of GPU memory. How can I connect Jupyter to a server to access more memory for training my NN?
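For context, a quick back-of-the-envelope calculation (using the batch_size and target_size from the code below) shows why this configuration runs out of memory: a single input batch already needs hundreds of MiB before any layer activations are counted, and convolutional activations typically multiply that several-fold.

```python
# Rough estimate of the memory one input batch needs, using the
# batch_size and target_size values from the question's code.
batch_size = 425
height, width, channels = 300, 300, 3
bytes_per_float32 = 4

batch_bytes = batch_size * height * width * channels * bytes_per_float32
print(f"{batch_bytes / 1024**2:.0f} MiB")  # ~438 MiB for the raw input batch alone
```

This is why reducing the batch size is usually the first thing to try for an OOM error.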
I am using the following package and code to load the images:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Pre-process the data: read and feed the images from the directories into the CNN.
# Re-scale the data, as pixels have values of 0-255.
train_datagen = ImageDataGenerator(rescale=1/255)
validation_datagen = ImageDataGenerator(rescale=1/255)

# Feed training dataset images in via mini-batches to make the CNN more efficient
train_generator = train_datagen.flow_from_directory(
    r'Users\cats-or-dogs\PetImages',  # Directory with training set images
    target_size=(300, 300),           # Re-size target images
    batch_size=425,                   # mini-batch size
    class_mode='binary'
)
Solution
Please let me know if this works for you.
Typically we can enable mixed precision after importing the necessary packages, as follows. It allows faster computation and also consumes less GPU memory, so we can increase the batch size as well. But the hardware has to support it, so please check that first. The Keras mixed-precision (MP) API is available in TensorFlow 2.x. Jokes aside, if you want more GPU memory, then add more GPUs and do multi-GPU training. To stay with a single GPU, mixed precision is one of the tricks; otherwise, reducing the batch size may solve the OOM problem.
# TF <= 2.3 (experimental API):
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)

# TF >= 2.4, where the API is no longer experimental:
# tf.keras.mixed_precision.set_global_policy('mixed_float16')
Quoting a performance tip from the official docs on using mixed precision on GPUs:
Increasing your batch size: If it doesn't affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing batch size typically increases training throughput, i.e. the number of training elements per second your model can process.
In addition, we can call gc.collect() after each epoch to collect garbage, which frees up some memory space; see below. Also, del any unused large variables that may be consuming a sizeable amount of memory.
import tensorflow as tf
import gc

class RemoveGarbageCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()

...
model.fit(train_generator, ...,
          callbacks=[RemoveGarbageCallback()])
Alternatively, we can use clear_session() with tf.keras, which cleans up everything. This is recommended if we create models inside a loop, so we can use the following code snippet at each iteration.
for _ in range(no_of_iteration):
    # With `clear_session()` called at the beginning,
    # Keras starts with a blank state at each iteration
    # and memory consumption is constant over time.
    tf.keras.backend.clear_session()  # Resets all state generated by Keras

    train_generator = ...
    valid_generator = ...
    model = create_model()
    history = model.fit(..., callbacks=[RemoveGarbageCallback()])

    # free up some memory space
    del model
    del train_generator, valid_generator
Update
As you've encountered:
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x0000019F9BC1E950>
This happens when there are unsupported files in the training directory. To check which file formats are present, run the following function:
from collections import Counter
import os

def img_extensions(img_path):
    """Count the file extensions found in a directory."""
    extension_type = []
    for file in os.listdir(img_path):
        if "." in file:
            extension_type.append(file.rsplit(".", 1)[1].lower())
    counts = Counter(extension_type)
    print(counts.keys())
    print(counts.values())

train_dir = './images'  # directory that contains the training samples
img_extensions(img_path=train_dir)
In this case, as expected, it should contain only image file formats, i.e. jpg, jpeg, png, etc. The issue is that when working in a Jupyter environment, it auto-saves .ipynb checkpoints into a hidden .ipynb_checkpoints folder. This folder is probably being created inside your training directory alongside the image files, and that format is not supported. All you have to do is change the project directory, or change the autosave location. Some pointers: 1, 2
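Before changing directories, it can help to confirm exactly which files would trip up the loader. Here is a small sketch that walks the training directory, prunes hidden folders such as .ipynb_checkpoints, and reports anything without a decodable image extension (the extension list is an assumption of what Keras' PIL-based loader typically accepts):

```python
import os

# Assumed set of extensions the PIL-based Keras loader can decode.
VALID_EXTS = {"jpg", "jpeg", "png", "bmp", "gif"}

def find_unreadable_files(img_dir):
    """Return paths under img_dir that the image loader would likely reject,
    skipping hidden folders such as Jupyter's .ipynb_checkpoints."""
    bad = []
    for root, dirs, files in os.walk(img_dir):
        dirs[:] = [d for d in dirs if not d.startswith(".")]  # prune hidden dirs
        for name in files:
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
            if ext not in VALID_EXTS:
                bad.append(os.path.join(root, name))
    return bad
```

Anything this returns can be deleted or moved out of the training directory before calling flow_from_directory.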
If you are using a custom data generator, I would advise using try and except to skip the files that are not supported. Also, with flow_from_dataframe instead of flow_from_directory, we can pass x_col="id" and y_col="label" explicitly; in that case we may not face this issue.
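As one possible shape for that try/except guard in a custom generator, the helper below (a sketch; the function name and the checks are my own, built on Pillow's Image.open and UnidentifiedImageError) yields only the paths that can actually be decoded:

```python
from PIL import Image, UnidentifiedImageError

def readable_images(paths):
    """Yield only the paths PIL can actually decode; silently skip the rest.
    A sketch of the try/except guard for a custom data generator."""
    for path in paths:
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check, no full decode
            yield path
        except (UnidentifiedImageError, OSError):
            continue
```

Filtering the file list once up front this way avoids the UnidentifiedImageError surfacing mid-epoch inside model.fit.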
Answered By - M.Innat