Issue
I've subclassed the tensorflow.keras.models.Model class and written a custom train_step, following the process described here. The model takes two 2d arrays as input (it is a multi-input model) and produces a single float value as output.
I'm passing a TFRecord dataset to the model using the following, where parse_element_func returns a tuple of 4 items: (2d array, 2d array, float, float). The first and second items are input data, the third is the target value, and the last is a number used in a custom loss function that varies by training example. Each of these items gains an extra leading dimension during training because they are batched.
train_dataset = tf.data.TFRecordDataset(records_train).map(parse_element_func).batch(batch_size).prefetch(tf.data.AUTOTUNE)
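The question doesn't show parse_element_func itself; a hypothetical version matching the 4-tuple described above might look like the following (the feature names and the 4x3 / 2x5 array shapes are made up for illustration):

```python
import tensorflow as tf

# Assumed feature names and shapes -- adjust to match your TFRecord schema.
FEATURE_SPEC = {
    "input_1": tf.io.FixedLenFeature([4 * 3], tf.float32),
    "input_2": tf.io.FixedLenFeature([2 * 5], tf.float32),
    "target": tf.io.FixedLenFeature([], tf.float32),
    "loss_modifier": tf.io.FixedLenFeature([], tf.float32),
}

def parse_element_func(serialized):
    # Parse one serialized tf.train.Example into the 4-tuple
    # (2d array, 2d array, float target, float loss modifier).
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    x1 = tf.reshape(parsed["input_1"], (4, 3))
    x2 = tf.reshape(parsed["input_2"], (2, 5))
    return x1, x2, parsed["target"], parsed["loss_modifier"]
```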
The class looks like this:
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.metrics import Mean
from tensorflow.keras.models import Model

loss_tracker = Mean(name="loss")
custom_metric_tracker = Mean(name="custom_metric")
magic_number = 4

class CustomModel(Model):
    def __init__(self, *args, clip_global_norm: float = 1.0, **kwargs):
        super(CustomModel, self).__init__(*args, **kwargs)
        self.clip_global_norm = clip_global_norm

    def train_step(self, data):
        # unpack data
        x_input_1, x_input_2, y_true, loss_modifier = data

        with tf.GradientTape() as tape:
            # predict
            y_pred = self((x_input_1, x_input_2), training=True)
            # calculate loss
            weights = K.pow(K.square(loss_modifier + magic_number), -1)
            squared_error = K.square(y_pred - y_true)
            loss = K.mean(weights * squared_error, axis=0)

        # calculate custom metric
        num = K.sum(K.square(y_pred - y_true), axis=0)
        denom = K.sum(y_true - K.mean(y_true), axis=0)
        custom_metric_value = 1 - num / (denom + 0.000001)  # to prevent division by 0

        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=self.clip_global_norm)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        loss_tracker.update_state(loss)
        custom_metric_tracker.update_state(custom_metric_value)
        return {"loss": loss_tracker.result(), "custom_metric": custom_metric_tracker.result()}
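The question's build_model_func is not shown; a minimal two-input model along the lines it implies might look like this (the layer sizes and input shapes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model_func(input_shape_1=(4, 3), input_shape_2=(2, 5), hidden_units=16):
    # Two 2d inputs, flattened and concatenated, then a single float output.
    in_1 = tf.keras.Input(shape=input_shape_1)
    in_2 = tf.keras.Input(shape=input_shape_2)
    x = layers.Concatenate()([layers.Flatten()(in_1), layers.Flatten()(in_2)])
    x = layers.Dense(hidden_units, activation="relu")(x)
    out = layers.Dense(1)(x)
    # In the real script this would construct CustomModel from the class above;
    # plain tf.keras.Model is used here so the sketch stands on its own.
    return tf.keras.Model(inputs=[in_1, in_2], outputs=out)
```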
The model builds and compiles just fine, and I've checked that all the shapes are correct using plot_model. When I test loading the data, everything is there with the correct shapes and values. No matter what, I get the same ValueError:
ValueError: Unexpected result of `train_function` (Empty logs).
This is the only message I get. It doesn't tell me anything about what is wrong besides that it has something to do with the training function, and it happens during model.fit. When I call it, it looks like this in my script:
train_dataset = tf.data.TFRecordDataset(records_train).map(parse_element_func).batch(batch_size).prefetch(tf.data.AUTOTUNE)
val_dataset = tf.data.TFRecordDataset(records_val).map(parse_element_func).batch(batch_size).prefetch(tf.data.AUTOTUNE)
model = build_model_func(**model_build_params)
model.compile(optimizer="adam")
history = model.fit(
    train_dataset,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=val_dataset,
)
Whether I run it eagerly or not makes no difference. I thought my dataset yielding a tuple of 4 values might be the issue, but as far as I can tell from the documentation it should be fine, and even if I modify the TFRecord dataset element parser to provide only inputs and outputs (so 2 values instead of 4), I still get the same error.
I've spent hours on this and have no idea why I'm getting this error or what is wrong with this function or my process. Can anyone help me figure out how to get past this error?
Solution
I finally figured it out, while creating reproducible code at M.Innat's suggestion. The error message led me to believe it had something to do with the custom training function, but it actually had to do with the TFRecordDataset.
It turns out that at some point in the script, records_train, which originally held a list of TFRecord filenames, became an empty list. So basically no data was being passed to model.fit.
For reference, this is the line of code that produced the error:
history = model.fit(
    train_dataset,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=val_dataset,
)
But the actual source of the error, which never appeared in the stack trace or was otherwise mentioned, was this line, where records_train = []:
train_dataset = tf.data.TFRecordDataset(records_train).map(parse_element_func).batch(batch_size).prefetch(tf.data.AUTOTUNE)
And this was the error message:
ValueError: Unexpected result of `train_function` (Empty logs).
A pretty unhelpful error message, but maybe this post will help someone in the future.
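One way to fail fast instead of hitting that opaque message is to validate the file list before building the dataset. A minimal sketch (the glob pattern is a hypothetical path; adjust it to your data directory):

```python
import glob

def checked_tfrecord_list(pattern):
    """Return matching TFRecord paths, failing loudly if none are found.

    tf.data.TFRecordDataset([]) silently yields an empty dataset, and
    model.fit() then fails with the opaque "Unexpected result of
    `train_function` (Empty logs)" ValueError, so it pays to check here.
    """
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"No TFRecord files matched {pattern!r}")
    return files

# records_train = checked_tfrecord_list("data/train/*.tfrecord")  # hypothetical path
```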
Answered By - Galen