Issue
[Running in a Jupyter Lab environment] When training my CNN on TensorFlow with:
history = model.fit(
    train_generator,
    steps_per_epoch=3,
    epochs=5,
    verbose=1,
)
I get an 'OOM when allocating tensor with shape' error when I run my algorithm.
From what I understand, this means I'm running out of GPU memory. How can I connect Jupyter to a server to access more memory for training my NN?
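For context, a quick back-of-the-envelope calculation (using the batch_size and target_size from the code below) shows why this configuration runs out of memory: a single input batch already needs hundreds of MiB before any layer activations are counted, and convolutional activations typically multiply that several-fold.

```python
# Rough estimate of the memory one input batch needs, using the
# batch_size and target_size values from the question's code.
batch_size = 425
height, width, channels = 300, 300, 3
bytes_per_float32 = 4

batch_bytes = batch_size * height * width * channels * bytes_per_float32
print(f"{batch_bytes / 1024**2:.0f} MiB")  # ~438 MiB for the raw input batch alone
```

This is why reducing the batch size is usually the first thing to try for an OOM error.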
I am using the following package and code to load the images:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Pre-process the data: read and feed the images from the directories into the CNN.
# Re-scale the data, as pixels have values of 0-255.
train_datagen = ImageDataGenerator(rescale=1/255)
validation_datagen = ImageDataGenerator(rescale=1/255)

# Feed training dataset images in via mini-batches to make the CNN more efficient
train_generator = train_datagen.flow_from_directory(
    r'Users\cats-or-dogs\PetImages',  # Directory with training set images
    target_size=(300, 300),           # Re-size target images
    batch_size=425,                   # mini-batch size
    class_mode='binary'
)
Solution
Please let me know if this works for you.
Typically we can enable mixed precision after importing the necessary packages, as follows. It allows faster computation and also consumes less GPU memory, so we can increase the batch size as well. But the hardware has to support it, so please check that first. The Keras mixed-precision (MP) API is available in TensorFlow 2.x. Jokes aside, if you want more GPU memory, then add more GPUs and do multi-GPU training. To stay with a single GPU, mixed precision is one of the tricks; otherwise, reducing the batch size may solve the OOM problem.
# TF <= 2.3 (experimental API):
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)

# TF >= 2.4, where the API is no longer experimental:
# tf.keras.mixed_precision.set_global_policy('mixed_float16')
Quoting a performance tip from the official docs on using mixed precision on GPUs:
Increasing your batch size: If it doesn't affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing batch size typically increases training throughput, i.e. the number of training elements per second your model can process.
In addition, we can call gc.collect() after each epoch to collect garbage, which frees up some memory space; see below. Also, del any unused large variables that may be consuming a sizeable amount of memory.
import tensorflow as tf
import gc

class RemoveGarbageCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()

...
model.fit(train_generator, ...,
          callbacks=[RemoveGarbageCallback()])
Alternatively, we can use clear_session() with tf.keras, which cleans up everything. This is recommended if we create models inside a loop, so we can use the following code snippet at each iteration.
for _ in range(no_of_iteration):
    # With `clear_session()` called at the beginning,
    # Keras starts with a blank state at each iteration
    # and memory consumption is constant over time.
    tf.keras.backend.clear_session()  # Resets all state generated by Keras

    train_generator = ...
    valid_generator = ...
    model = create_model()
    history = model.fit(..., callbacks=[RemoveGarbageCallback()])

    # free up some memory space
    del model
    del train_generator, valid_generator
Update
As you've encountered:
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x0000019F9BC1E950>
This happens when there are unsupported files in the training directory. To check which file formats are present, run the following function:
from collections import Counter
import os

def img_extensions(img_path):
    """Count the file extensions found in a directory."""
    extension_type = []
    for file in os.listdir(img_path):
        if "." in file:
            extension_type.append(file.rsplit(".", 1)[1].lower())
    counts = Counter(extension_type)
    print(counts.keys())
    print(counts.values())

train_dir = './images'  # directory that contains the training samples
img_extensions(img_path=train_dir)
In this case, as expected, it should contain only image file formats, i.e. jpg, jpeg, png, etc. The issue is that when working in a Jupyter environment, it auto-saves .ipynb checkpoints into a hidden .ipynb_checkpoints folder. This folder is probably being created inside your training directory alongside the image files, and that format is not supported. All you have to do is change the project directory, or change the autosave location. Some pointers: 1, 2
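Before changing directories, it can help to confirm exactly which files would trip up the loader. Here is a small sketch that walks the training directory, prunes hidden folders such as .ipynb_checkpoints, and reports anything without a decodable image extension (the extension list is an assumption of what Keras' PIL-based loader typically accepts):

```python
import os

# Assumed set of extensions the PIL-based Keras loader can decode.
VALID_EXTS = {"jpg", "jpeg", "png", "bmp", "gif"}

def find_unreadable_files(img_dir):
    """Return paths under img_dir that the image loader would likely reject,
    skipping hidden folders such as Jupyter's .ipynb_checkpoints."""
    bad = []
    for root, dirs, files in os.walk(img_dir):
        dirs[:] = [d for d in dirs if not d.startswith(".")]  # prune hidden dirs
        for name in files:
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
            if ext not in VALID_EXTS:
                bad.append(os.path.join(root, name))
    return bad
```

Anything this returns can be deleted or moved out of the training directory before calling flow_from_directory.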
If you are using a custom data generator, I would advise using try and except to skip the files that are not supported. Also, with flow_from_dataframe instead of flow_from_directory, we can pass x_col="id" and y_col="label" explicitly; in that case we may not face this issue.
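As one possible shape for that try/except guard in a custom generator, the helper below (a sketch; the function name and the checks are my own, built on Pillow's Image.open and UnidentifiedImageError) yields only the paths that can actually be decoded:

```python
from PIL import Image, UnidentifiedImageError

def readable_images(paths):
    """Yield only the paths PIL can actually decode; silently skip the rest.
    A sketch of the try/except guard for a custom data generator."""
    for path in paths:
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check, no full decode
            yield path
        except (UnidentifiedImageError, OSError):
            continue
```

Filtering the file list once up front this way avoids the UnidentifiedImageError surfacing mid-epoch inside model.fit.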
Answered By - M.Innat