Issue
I originally used this implementation of mrcnn: https://github.com/matterport/Mask_RCNN. I later switched to this version: https://github.com/sabderra/Mask_RCNN to support TensorFlow 2.
I'm running the training code on my university Linux VM with a GPU, and it doesn't save the weights after every epoch. When I first ran the code with a small training set (4 images) for 5 epochs, the weights were saved for every epoch except epoch 4, for some reason. When I ran it for 10 epochs with a larger training set of 700 images, it saved the weights only for epochs 1 and 2 (sometimes only for epochs 1 and 3), even though training continued through the last epoch. Has anybody experienced this or know how to fix it? Thanks!
Edit:
It uses the Keras ModelCheckpoint callback to save the model weights, using the path defined here:
# Path to save after each epoch. Include placeholders that get filled by Keras.
self.checkpoint_path = os.path.join(self.log_dir, "mask_rcnn_{}_*epoch*.h5".format(
    self.config.NAME.lower()))
self.checkpoint_path = self.checkpoint_path.replace(
    "*epoch*", "{epoch:04d}")
This is the entire train function, which calls the Keras fit function:
def train(self, train_dataset, val_dataset, learning_rate, epochs, layers,
          augmentation=None, custom_callbacks=None, no_augmentation_sources=None,
          patience=10):
    """Train the model.
    train_dataset, val_dataset: Training and validation Dataset objects.
    learning_rate: The learning rate to train with
    epochs: Number of training epochs. Note that previous training epochs
            are considered to be done already, so this actually determines
            the epochs to train in total rather than in this particular
            call.
    layers: Allows selecting which layers to train. It can be:
        - A regular expression to match layer names to train
        - One of these predefined values:
          heads: The RPN, classifier and mask heads of the network
          all: All the layers
          3+: Train Resnet stage 3 and up
          4+: Train Resnet stage 4 and up
          5+: Train Resnet stage 5 and up
    augmentation: Optional. An imgaug (https://github.com/aleju/imgaug)
        augmentation. For example, passing imgaug.augmenters.Fliplr(0.5)
        flips images right/left 50% of the time. You can pass complex
        augmentations as well. This augmentation applies 50% of the
        time, and when it does it flips images right/left half the time
        and adds a Gaussian blur with a random sigma in range 0 to 5.
            augmentation = imgaug.augmenters.Sometimes(0.5, [
                imgaug.augmenters.Fliplr(0.5),
                imgaug.augmenters.GaussianBlur(sigma=(0.0, 5.0))
            ])
    custom_callbacks: Optional. Add custom callbacks to be called
        with the keras fit_generator method. Must be list of type keras.callbacks.
    no_augmentation_sources: Optional. List of sources to exclude for
        augmentation. A source is string that identifies a dataset and is
        defined in the Dataset class.
    """
    assert self.mode == "training", "Create model in training mode."

    # Pre-defined layer regular expressions
    layer_regex = {
        # all layers but the backbone
        "heads": r"(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
        # From a specific Resnet stage and up
        "3+": r"(res3.*)|(bn3.*)|(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
        "4+": r"(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
        "5+": r"(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
        # All layers
        "all": ".*",
    }
    if layers in layer_regex.keys():
        layers = layer_regex[layers]

    # Data generators
    train_generator = DataGenerator(train_dataset, self.config, shuffle=True,
                                    augmentation=augmentation)
    val_generator = DataGenerator(val_dataset, self.config, shuffle=True)

    # Create log_dir if it does not exist
    if not os.path.exists(self.log_dir):
        os.makedirs(self.log_dir)

    # Callbacks
    callbacks = [
        keras.callbacks.TensorBoard(log_dir=self.log_dir,
                                    histogram_freq=0, write_graph=True, write_images=False),
        keras.callbacks.ModelCheckpoint(self.checkpoint_path,
                                        verbose=1,
                                        save_best_only=True,
                                        save_weights_only=True,
                                        period=1),
    ]

    # Add custom callbacks to the list
    if custom_callbacks:
        callbacks += custom_callbacks

    # Train
    log(f"\nStarting at epoch {self.epoch}. LR={learning_rate}\n")
    log(f"Checkpoint Path: {self.checkpoint_path}")
    self.set_trainable(layers)
    self.compile(learning_rate, self.config.LEARNING_MOMENTUM)

    # Work-around for Windows: Keras fails on Windows when using
    # multiprocessing workers. See discussion here:
    # https://github.com/matterport/Mask_RCNN/issues/13#issuecomment-353124009
    if os.name == 'nt':
        workers = 0
    else:
        workers = multiprocessing.cpu_count()

    history = self.keras_model.fit(
        train_generator,
        initial_epoch=self.epoch,
        epochs=epochs,
        verbose=1,
        steps_per_epoch=self.config.STEPS_PER_EPOCH,
        callbacks=callbacks,
        validation_data=val_generator,
        validation_steps=self.config.VALIDATION_STEPS,
        max_queue_size=100,
        workers=workers,
        use_multiprocessing=self.config.USE_MULTIPROCESSING,
    )
    self.epoch = max(self.epoch, epochs)
    return history
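For reference, here is a minimal sketch of how the checkpoint path template ends up as one file name per epoch. The values of log_dir and config_name are made up for illustration, and checkpoint_file_for is a hypothetical helper that only mimics how ModelCheckpoint fills the {epoch:04d} placeholder (Keras hands callbacks a zero-based epoch index and formats the path with epoch + 1).

```python
import os

# Hypothetical stand-ins for self.log_dir and self.config.NAME.
log_dir = "logs"
config_name = "Shapes"

# Same two-step construction as in the question: the "*epoch*" marker keeps
# the first str.format call from seeing an unfilled {epoch} field.
checkpoint_path = os.path.join(
    log_dir, "mask_rcnn_{}_*epoch*.h5".format(config_name.lower()))
checkpoint_path = checkpoint_path.replace("*epoch*", "{epoch:04d}")


def checkpoint_file_for(epoch_index):
    """Hypothetical helper: file path for a zero-based epoch index."""
    # Mimics ModelCheckpoint, which formats the path with epoch + 1.
    return checkpoint_path.format(epoch=epoch_index + 1)


print(os.path.basename(checkpoint_file_for(0)))  # mask_rcnn_shapes_0001.h5
print(os.path.basename(checkpoint_file_for(9)))  # mask_rcnn_shapes_0010.h5
```

So each epoch has its own file name, and a missing file means the callback decided not to save for that epoch rather than that a file was overwritten.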
Solution
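I didn't realize that save_best_only=True was set for the ModelCheckpoint callback. With that setting, Keras writes a checkpoint only when the monitored metric (val_loss by default) improves on the best value seen so far, so epochs without an improvement are skipped.
After changing it to save_best_only=False, the weights are saved after every epoch.
Thanks!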
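To illustrate why only some epochs produced a file, here is a simplified sketch of the save_best_only decision. It is not the Keras implementation: epochs_checkpointed is a hypothetical function, the validation losses are made-up numbers, and it assumes the callback monitors val_loss and treats lower as better.

```python
import math


def epochs_checkpointed(val_losses, save_best_only):
    """Return the 1-based epochs at which a checkpoint would be written.

    Simplified model of the behaviour, assuming the monitored metric is a
    loss where lower is better.
    """
    best = math.inf
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if not save_best_only:
            # Every epoch is saved unconditionally.
            saved.append(epoch)
        elif loss < best:
            # Only a strict improvement over the best value triggers a save.
            best = loss
            saved.append(epoch)
    return saved


# Made-up validation losses: the loss improves twice, then plateaus.
val_losses = [1.20, 0.95, 1.01, 0.97, 0.99]

print(epochs_checkpointed(val_losses, save_best_only=True))   # [1, 2]
print(epochs_checkpointed(val_losses, save_best_only=False))  # [1, 2, 3, 4, 5]
```

With a validation loss that stops improving after the second epoch, this reproduces the pattern from the question: files for epochs 1 and 2 only, while training itself runs to the end.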
Answered By - Avner St