Monday, June 13, 2022

[FIXED] Mask RCNN model doesn't save weights after epoch 2

Issue

I was originally using this implementation of Mask R-CNN: https://github.com/matterport/Mask_RCNN. I later switched to this version: https://github.com/sabderra/Mask_RCNN to support TensorFlow 2.

I'm running the training code on my university Linux VM with a GPU, and it doesn't save the weights after every epoch. When I first ran the code with a small training set (4 images) for 5 epochs, all the weights were saved except for epoch 4, for some reason. When I ran it for 10 epochs with a larger training set of 700 images, it saved the weights only for epochs 1 and 2 (sometimes only for epochs 1 and 3), while training still ran through to the last epoch. Has anybody experienced this, or does anyone know how to fix it? Thanks!

Edit:

It uses the Keras ModelCheckpoint callback to save the model weights, using the path defined here:

    # Path to save after each epoch. Include placeholders that get filled by Keras.
    self.checkpoint_path = os.path.join(self.log_dir, "mask_rcnn_{}_*epoch*.h5".format(
        self.config.NAME.lower()))
    self.checkpoint_path = self.checkpoint_path.replace(
        "*epoch*", "{epoch:04d}")
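
To make the placeholder concrete, here is a minimal sketch of how that template expands into per-epoch filenames. The values of `log_dir` and `config_name` are hypothetical stand-ins for `self.log_dir` and `self.config.NAME`; the sketch relies only on Python's `str.format`, which is how ModelCheckpoint fills `{epoch:04d}` with the 1-based epoch number.

```python
import os

# Hypothetical stand-ins for self.log_dir and self.config.NAME.
log_dir = "logs"
config_name = "Coco"

# Same two steps as the snippet above: insert the model name first,
# then swap the temporary *epoch* marker for a format placeholder.
checkpoint_path = os.path.join(
    log_dir, "mask_rcnn_{}_*epoch*.h5".format(config_name.lower()))
checkpoint_path = checkpoint_path.replace("*epoch*", "{epoch:04d}")

# Expand the placeholder for a few epoch numbers (zero-padded to 4 digits).
filenames = [os.path.basename(checkpoint_path.format(epoch=e))
             for e in (1, 2, 10)]
print(filenames)
```

Each epoch therefore maps to its own file, so a missing file means the callback did not write a checkpoint for that epoch, not that an earlier file was overwritten.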

This is the entire train function, which calls the Keras fit function:

    def train(self, train_dataset, val_dataset, learning_rate, epochs, layers,
              augmentation=None, custom_callbacks=None, no_augmentation_sources=None,
              patience=10):
        """Train the model.
        train_dataset, val_dataset: Training and validation Dataset objects.
        learning_rate: The learning rate to train with
        epochs: Number of training epochs. Note that previous training epochs
                are considered to be done already, so this actually determines
                the epochs to train in total rather than in this particular
                call.
        layers: Allows selecting which layers to train. It can be:
            - A regular expression to match layer names to train
            - One of these predefined values:
              heads: The RPN, classifier and mask heads of the network
              all: All the layers
              3+: Train Resnet stage 3 and up
              4+: Train Resnet stage 4 and up
              5+: Train Resnet stage 5 and up
        augmentation: Optional. An imgaug (https://github.com/aleju/imgaug)
            augmentation. For example, passing imgaug.augmenters.Fliplr(0.5)
            flips images right/left 50% of the time. You can pass complex
            augmentations as well. This augmentation applies 50% of the
            time, and when it does it flips images right/left half the time
            and adds a Gaussian blur with a random sigma in range 0 to 5.

                augmentation = imgaug.augmenters.Sometimes(0.5, [
                    imgaug.augmenters.Fliplr(0.5),
                    imgaug.augmenters.GaussianBlur(sigma=(0.0, 5.0))
                ])
        custom_callbacks: Optional. Add custom callbacks to be called
            with the keras fit_generator method. Must be list of type keras.callbacks.
        no_augmentation_sources: Optional. List of sources to exclude for
            augmentation. A source is string that identifies a dataset and is
            defined in the Dataset class.
        """
        assert self.mode == "training", "Create model in training mode."

        # Pre-defined layer regular expressions
        layer_regex = {
            # all layers but the backbone
            "heads": r"(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            # From a specific Resnet stage and up
            "3+": r"(res3.*)|(bn3.*)|(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            "4+": r"(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            "5+": r"(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            # All layers
            "all": ".*",
        }
        if layers in layer_regex.keys():
            layers = layer_regex[layers]

        # Data generators
        train_generator = DataGenerator(train_dataset, self.config, shuffle=True,
                                        augmentation=augmentation)
        val_generator = DataGenerator(val_dataset, self.config, shuffle=True)

        # Create log_dir if it does not exist
        if not os.path.exists(self.log_dir):
            os.makedirs(self.log_dir)

        # Callbacks
        callbacks = [
            keras.callbacks.TensorBoard(log_dir=self.log_dir,
                                        histogram_freq=0, write_graph=True, write_images=False),
            keras.callbacks.ModelCheckpoint(self.checkpoint_path,
                                            verbose=1,
                                            save_best_only=True,
                                            save_weights_only=True,
                                            period=1),
        ]

        # Add custom callbacks to the list
        if custom_callbacks:
            callbacks += custom_callbacks

        # Train
        log(f"\nStarting at epoch {self.epoch}. LR={learning_rate}\n")
        log(f"Checkpoint Path: {self.checkpoint_path}")
        self.set_trainable(layers)
        self.compile(learning_rate, self.config.LEARNING_MOMENTUM)

        # Work-around for Windows: Keras fails on Windows when using
        # multiprocessing workers. See discussion here:
        # https://github.com/matterport/Mask_RCNN/issues/13#issuecomment-353124009
        if os.name == 'nt':
            workers = 0
        else:
            workers = multiprocessing.cpu_count()

        history = self.keras_model.fit(
            train_generator,
            initial_epoch=self.epoch,
            epochs=epochs,
            verbose=1,
            steps_per_epoch=self.config.STEPS_PER_EPOCH,
            callbacks=callbacks,
            validation_data=val_generator,
            validation_steps=self.config.VALIDATION_STEPS,
            max_queue_size=100,
            workers=workers,
            use_multiprocessing=self.config.USE_MULTIPROCESSING,
        )
        self.epoch = max(self.epoch, epochs)
        return history

Solution

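I didn't realize that save_best_only=True was set on the ModelCheckpoint callback. With that setting, Keras writes a checkpoint only at epochs where the monitored metric (val_loss by default) improves on the best value seen so far, which is why some epochs were skipped. After changing it to save_best_only=False, the weights are saved after every epoch. Thanks!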
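
To illustrate why only some epochs produced a file, here is a rough pure-Python sketch of the decision rule behind save_best_only. It is a simplified model of the callback's behavior, not the Keras source; the helper `epochs_saved` and the validation losses are made up for illustration.

```python
import math


def epochs_saved(val_losses, save_best_only):
    """Return the 1-based epochs at which a checkpoint would be written.

    Simplified model: with save_best_only=True a checkpoint is written only
    when the monitored loss is strictly lower than the best seen so far.
    """
    best = math.inf
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if not save_best_only:
            saved.append(epoch)      # write a file every epoch
        elif loss < best:
            best = loss              # new best: write a file
            saved.append(epoch)
    return saved


# Made-up validation losses for a 5-epoch run; epoch 4 is worse than epoch 3.
val_losses = [1.00, 0.90, 0.80, 0.85, 0.70]

best_only = epochs_saved(val_losses, save_best_only=True)
every_epoch = epochs_saved(val_losses, save_best_only=False)
print(best_only)    # [1, 2, 3, 5]
print(every_epoch)  # [1, 2, 3, 4, 5]
```

With these made-up numbers, epoch 4 is skipped because its loss did not improve, which matches the pattern in the question. On a larger dataset the validation loss can stop improving early, so only the first few epochs get saved.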

Answered By - Avner St

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
