Tuesday, January 25, 2022

[FIXED] How to create checkpoint filenames with epoch or batch number when using ModelCheckpoint() with save_freq as integer?


Issue

I have TensorFlow 2.5.0 installed and am using Jupyter notebooks with Python 3.10.

I'm practicing using the save_freq argument as an integer, following an online course (they use TensorFlow 2.0.0, where the following code runs fine, but it does not work in my more recent version).

Here's the link to the relevant documentation, which has no example of using an integer for save_freq: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint

Here is my code:

    import tensorflow as tf
    from tensorflow.keras.callbacks import ModelCheckpoint
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
    
    # Use the CIFAR-10 dataset
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    x_train = x_train / 255.0
    x_test = x_test / 255.0
    
    # using a smaller subset -- speeds things up
    x_train = x_train[:10000]
    y_train = y_train[:10000]
    x_test = x_test[:1000]
    y_test = y_test[:1000]
    
    # define a function that creates a new instance of a simple CNN.
    def create_model():
        model = Sequential([
            Conv2D(filters=16, input_shape=(32, 32, 3), kernel_size=(3, 3), 
                   activation='relu', name='conv_1'),
            Conv2D(filters=8, kernel_size=(3, 3), activation='relu', name='conv_2'),
            MaxPooling2D(pool_size=(4, 4), name='pool_1'),
            Flatten(name='flatten'),
            Dense(units=32, activation='relu', name='dense_1'),
            Dense(units=10, activation='softmax', name='dense_2')
        ])
        model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        return model
    
    
    # Create Tensorflow checkpoint object with epoch and batch details 
    
    checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:04d}'
    checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                     save_weights_only = True,
                                     save_freq = 5000,
                                     verbose = 1)
    
    
    # Create and fit model with checkpoint
    
    model = create_model()
    model.fit(x = x_train,
              y = y_train,
              epochs = 3,
              validation_data = (x_test, y_test),
              batch_size = 10,
              callbacks = [checkpoint_5000])

I want the checkpoint filenames to include the epoch and batch number. However, the files are not created and I get 'File not found'. After I manually create the directory model_checkpoints_5000, still no files are added to it.

(We can check the directory contents by running '! dir -a model_checkpoints_5000' on Windows or 'ls -lh model_checkpoints_5000' on Linux.)
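The same check can also be done from Python, which avoids the differences between shells. This is only a minimal sketch using the standard library; the helper name `list_checkpoint_files` is made up for illustration, and it creates the directory if it does not exist yet so that the listing itself cannot fail.

```python
from pathlib import Path


def list_checkpoint_files(directory):
    """Return the sorted entry names in `directory`, creating it if missing."""
    path = Path(directory)
    # parents=True also creates intermediate directories;
    # exist_ok=True makes the call harmless when the directory is already there.
    path.mkdir(parents=True, exist_ok=True)
    return sorted(entry.name for entry in path.iterdir())
```

Calling `list_checkpoint_files('model_checkpoints_5000')` after training returns an empty list when no checkpoint was written.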

I have also tried changing the path to 'model_checkpoints_5000/cp_{epoch:02d}'; it still does not save a file numbered for each epoch.

Then I tried to follow the example under "Checkpoint callback options" that uses save_freq, which does save files for me: https://www.tensorflow.org/tutorials/keras/save_and_load

Yet my version of it still does not save any files:

    import os

    checkpoint_path = "model_checkpoints_5000/cp-{epoch:02d}.ckpt"
    checkpoint_dir = os.path.dirname(checkpoint_path)

    batch_size = 10

    # verbose is an argument of ModelCheckpoint, so it goes inside this call
    checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_path,
                                     save_weights_only = True,
                                     save_freq = 500*batch_size,
                                     verbose = 1)

    model = create_model()

    model.fit(x = x_train,
              y = y_train,
              epochs = 3,
              validation_data = (x_test, y_test),
              batch_size = batch_size,
              callbacks = [checkpoint_5000])

Any suggestions on how to make it work, other than downgrading TensorFlow?

Solution

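The parameter save_freq is too large. As an integer it counts batches, not samples: with 10,000 training samples and a batch size of 10 there are only 1,000 batches per epoch (3,000 over three epochs), so a threshold of 5,000 is never reached. (As far as I know, TensorFlow 2.0 counted samples instead, which would explain why the course code worked there.) It needs to be save_freq = training_samples // batch_size or less. Maybe try something like this: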

    batch_size = 10
    checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:1d}'
    checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                      save_weights_only = True,
                                      save_freq = len(x_train) // batch_size // batch_size,
                                      verbose = 1)
    model = create_model()
    model.fit(x = x_train,
              y = y_train,
              epochs = 3,
              validation_data = (x_test, y_test),
              batch_size = batch_size,
              callbacks = [checkpoint_5000])

Epoch 1/3
  97/1000 [=>............................] - ETA: 3s - loss: 2.2801 - accuracy: 0.1536
Epoch 00001: saving model to model_checkpoints_5000/cp_01-100
 198/1000 [====>.........................] - ETA: 3s - loss: 2.2347 - accuracy: 0.1500
Epoch 00001: saving model to model_checkpoints_5000/cp_01-200
 288/1000 [=======>......................] - ETA: 3s - loss: 2.1979 - accuracy: 0.1736
Epoch 00001: saving model to model_checkpoints_5000/cp_01-300
 397/1000 [==========>...................] - ETA: 2s - loss: 2.1337 - accuracy: 0.2020
Epoch 00001: saving model to model_checkpoints_5000/cp_01-400
 497/1000 [=============>................] - ETA: 2s - loss: 2.0952 - accuracy: 0.2197
Epoch 00001: saving model to model_checkpoints_5000/cp_01-500
 598/1000 [================>.............] - ETA: 1s - loss: 2.0496 - accuracy: 0.2395
Epoch 00001: saving model to model_checkpoints_5000/cp_01-600
 698/1000 [===================>..........] - ETA: 1s - loss: 2.0122 - accuracy: 0.2520
Epoch 00001: saving model to model_checkpoints_5000/cp_01-700
 703/1000 [====================>.........] - ETA: 1s - loss: 2.0082 - accuracy: 0.2538
...
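The filenames in the log can be reproduced with plain `str.format`, which is a quick way to preview what a filepath template will produce before training. This is only a sketch: the helper `checkpoint_name` is hypothetical, and it assumes the callback fills in 1-based epoch and batch numbers, as the log output suggests.

```python
# Template taken from the code above.
checkpoint_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:1d}'


def checkpoint_name(template, epoch, batch):
    """Fill a filepath template with the given epoch and batch numbers."""
    return template.format(epoch=epoch, batch=batch)


# {epoch:02d} pads the epoch to two digits; {batch:1d} adds no padding.
names = [checkpoint_name(checkpoint_path, 1, b) for b in (100, 200, 300)]
print(names)

# The question's original {batch:04d} would pad the batch to four digits.
print(checkpoint_name('cp_{epoch:02d}-{batch:04d}', 2, 50))
```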

In this example, a checkpoint is created every 100 batches (len(x_train) // batch_size // batch_size = 10000 // 10 // 10), i.e. ten times per epoch.
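To see why the original save_freq = 5000 never fired while 100 does, the counting can be simulated without TensorFlow. This is a rough sketch, not the Keras implementation: the function `saving_batches` is made up for illustration, and it assumes that save_freq counts batches and that the counter carries over from one epoch to the next.

```python
def saving_batches(num_samples, batch_size, epochs, save_freq):
    """Return 1-based (epoch, batch) pairs at which a save would trigger."""
    steps_per_epoch = -(-num_samples // batch_size)  # ceiling division
    saves = []
    seen_since_last_save = 0
    for epoch in range(1, epochs + 1):
        for batch in range(1, steps_per_epoch + 1):
            seen_since_last_save += 1
            if seen_since_last_save >= save_freq:
                saves.append((epoch, batch))
                seen_since_last_save = 0
    return saves


# Question: 10,000 samples, batch size 10, 3 epochs, save_freq = 5000.
original = saving_batches(10000, 10, 3, 5000)
# Answer: save_freq = len(x_train) // batch_size // batch_size = 100.
fixed = saving_batches(10000, 10, 3, 10000 // 10 // 10)

print(len(original))         # number of saves with save_freq = 5000
print(len(fixed), fixed[:3])  # number of saves and the first few with 100
```

Under these assumptions only 3,000 batches are ever processed, so the first setting produces no saves at all, while the second produces one every 100 batches.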

Answered By - AloneTogether

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
