Issue
I have TensorFlow 2.5.0 installed and am using Jupyter notebooks with Python 3.10.
I'm practicing passing an integer as the save_freq argument, following an online course (they use TensorFlow 2.0.0, where the following code runs fine, but it does not work in my more recent version).
Here is the link to the relevant documentation, which has no example of using an integer for save_freq: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint
Here is my code:
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
# Use the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0
# using a smaller subset -- speeds things up
x_train = x_train[:10000]
y_train = y_train[:10000]
x_test = x_test[:1000]
y_test = y_test[:1000]
# define a function that creates a new instance of a simple CNN.
def create_model():
    model = Sequential([
        Conv2D(filters=16, input_shape=(32, 32, 3), kernel_size=(3, 3),
               activation='relu', name='conv_1'),
        Conv2D(filters=8, kernel_size=(3, 3), activation='relu', name='conv_2'),
        MaxPooling2D(pool_size=(4, 4), name='pool_1'),
        Flatten(name='flatten'),
        Dense(units=32, activation='relu', name='dense_1'),
        Dense(units=10, activation='softmax', name='dense_2')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
# Create Tensorflow checkpoint object with epoch and batch details
checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:04d}'
checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                  save_weights_only = True,
                                  save_freq = 5000,
                                  verbose = 1)
# Create and fit model with checkpoint
model = create_model()
model.fit(x = x_train,
          y = y_train,
          epochs = 3,
          validation_data = (x_test, y_test),
          batch_size = 10,
          callbacks = [checkpoint_5000])
I want the checkpoint filenames to include the epoch and batch number. However, no files are created, and I get a 'File not found' message. After I manually create the directory model_checkpoints_5000, still no files are added to it.
(We can check the directory contents by running ' ! dir -a model_checkpoints_5000' on Windows, or 'ls -lh model_checkpoints_5000' on Linux.)
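A portable alternative to the shell commands above is to list the directory from Python itself. This is only a minimal sketch: the helper name list_checkpoint_files is hypothetical, and it assumes the checkpoint files are written directly into the given directory.

```python
from pathlib import Path


def list_checkpoint_files(directory):
    """Return the sorted names of the entries in `directory` (hypothetical helper).

    Returns an empty list when the directory does not exist, which is the
    situation described above when no checkpoint has been written yet.
    """
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())


# Directory name taken from the checkpoint path used in the question.
print(list_checkpoint_files("model_checkpoints_5000"))
```

An empty list here means either that the directory is missing or that nothing has been saved into it yet.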
I have also tried changing the path to 'model_checkpoints_5000/cp_{epoch:02d}', but it still does not save a file for each epoch.
Then I tried to follow the example from "Checkpoint callback options" with save_freq, which does save files for me: https://www.tensorflow.org/tutorials/keras/save_and_load
Yet, adapted to my model, it still does not save any files:
import os
checkpoint_path = "model_checkpoints_5000/cp-{epoch:02d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
batch_size = 10
checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_path,
                                  save_weights_only = True,
                                  save_freq = 500*batch_size,
                                  verbose = 1)
model = create_model()
model.fit(x = x_train,
          y = y_train,
          epochs = 3,
          validation_data = (x_test, y_test),
          batch_size = batch_size,
          callbacks = [checkpoint_5000])
Any suggestions on how to make it work, other than downgrading TensorFlow?
Solution
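The parameter save_freq is too large. In recent TensorFlow versions an integer save_freq counts batches rather than samples, and with 10,000 training samples and a batch size of 10 there are only 1,000 batches per epoch (3,000 over 3 epochs), so a threshold of 5,000 is never reached. It needs to be save_freq = training_samples // batch_size or less. Maybe try something like this: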
batch_size = 10
checkpoint_5000_path = 'model_checkpoints_5000/cp_{epoch:02d}-{batch:1d}'
checkpoint_5000 = ModelCheckpoint(filepath = checkpoint_5000_path,
                                  save_weights_only = True,
                                  save_freq = len(x_train) // batch_size // batch_size,
                                  verbose = 1)
model = create_model()
model.fit(x = x_train,
          y = y_train,
          epochs = 3,
          validation_data = (x_test, y_test),
          batch_size = batch_size,
          callbacks = [checkpoint_5000])
Epoch 1/3
97/1000 [=>............................] - ETA: 3s - loss: 2.2801 - accuracy: 0.1536
Epoch 00001: saving model to model_checkpoints_5000/cp_01-100
198/1000 [====>.........................] - ETA: 3s - loss: 2.2347 - accuracy: 0.1500
Epoch 00001: saving model to model_checkpoints_5000/cp_01-200
288/1000 [=======>......................] - ETA: 3s - loss: 2.1979 - accuracy: 0.1736
Epoch 00001: saving model to model_checkpoints_5000/cp_01-300
397/1000 [==========>...................] - ETA: 2s - loss: 2.1337 - accuracy: 0.2020
Epoch 00001: saving model to model_checkpoints_5000/cp_01-400
497/1000 [=============>................] - ETA: 2s - loss: 2.0952 - accuracy: 0.2197
Epoch 00001: saving model to model_checkpoints_5000/cp_01-500
598/1000 [================>.............] - ETA: 1s - loss: 2.0496 - accuracy: 0.2395
Epoch 00001: saving model to model_checkpoints_5000/cp_01-600
698/1000 [===================>..........] - ETA: 1s - loss: 2.0122 - accuracy: 0.2520
Epoch 00001: saving model to model_checkpoints_5000/cp_01-700
703/1000 [====================>.........] - ETA: 1s - loss: 2.0082 - accuracy: 0.2538
...
In this example, save_freq evaluates to 10000 // 10 // 10 = 100, so a checkpoint is created every 100 batches (steps) within each epoch.
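To see why the original value never triggered a save, the arithmetic can be sketched in plain Python without TensorFlow. The helper names below (steps_per_epoch, save_points_in_first_epoch) are hypothetical, and the sketch assumes an integer save_freq is counted in batches, as the log above suggests. It only models the first epoch and is not the library's implementation.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches in one epoch (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


def save_points_in_first_epoch(num_samples, batch_size, save_freq):
    """1-based batch numbers in epoch 1 at which a save would be expected.

    Assumption: an integer save_freq counts batches, so a save is expected
    after every `save_freq` batches.
    """
    steps = steps_per_epoch(num_samples, batch_size)
    return list(range(save_freq, steps + 1, save_freq))


num_samples, batch_size = 10000, 10

# Value from the question: larger than the 1000 batches in an epoch.
print(save_points_in_first_epoch(num_samples, batch_size, 5000))  # []

# Value from the answer: 10000 // 10 // 10 == 100.
save_freq = num_samples // batch_size // batch_size
points = save_points_in_first_epoch(num_samples, batch_size, save_freq)
print(points[:3])  # [100, 200, 300]

# The filename template is filled in with str.format-style fields.
print('cp_{epoch:02d}-{batch:1d}'.format(epoch=1, batch=points[0]))  # cp_01-100
```

The last line reproduces the first filename in the log above, cp_01-100.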
Answered By - AloneTogether