Issue
I'm trying to get my feet wet with TensorFlow by solving this challenge: https://www.kaggle.com/c/integer-sequence-learning.
My work is based on these blog posts:
- https://danijar.com/variable-sequence-lengths-in-tensorflow/
- https://gist.github.com/evanthebouncy/8e16148687e807a46e3f
A complete working example - with my data - can be found here: https://github.com/bottiger/Integer-Sequence-Learning. Running the example will print out a lot of debug information. Execute rnn-lstm-my.py (requires tensorflow and pandas).
The approach is pretty straightforward. I load all of my training sequences, store their lengths in a vector, and store the length of the longest one in a variable I call max_length.
From my training data I strip out the last element of every sequence and store those values in a vector called "train_solutions".
Then I store all the sequences, padded with zeros, in a matrix of shape [n_seq, max_length].
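For reference, a minimal sketch of that preparation step might look like this (the variable names and toy sequences here are illustrative, not the ones used in the repository):
import numpy as np

sequences = [[1, 2, 3, 5, 8], [2, 4, 8, 16]]          # toy training sequences
train_solutions = [seq[-1] for seq in sequences]       # last element of each sequence = target
stripped = [seq[:-1] for seq in sequences]             # inputs without the target
train_length = [len(seq) for seq in stripped]          # original (unpadded) lengths
max_length = max(train_length)
# Zero-pad every sequence to max_length -> matrix of shape [n_seq, max_length]
train_input = np.zeros((len(stripped), max_length), dtype=np.float32)
for i, seq in enumerate(stripped):
    train_input[i, :len(seq)] = seq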
Since I want to predict the next number in a sequence, my output should be a single number and my input should be a sequence.
I use an RNN (tf.nn.rnn) with a BasicLSTMCell as the cell, with 24 hidden units. The output is fed into a basic linear model (xW + b), which should produce my prediction.
My cost function is the L2 loss between my model's prediction and the target number (tf.nn.l2_loss computes sum(t ** 2) / 2, i.e. half the sum of squared errors). I calculate the cost like this:
cost = tf.nn.l2_loss(tf_result - prediction)
The basic dimensions seem to be correct because the code actually runs. However, after only one or two iterations NaNs start to occur, and they quickly spread until everything becomes NaN.
Here is the important part of the code, where I define and run the graph. I have omitted the loading/preparation of the data; please look at the git repo for details about that - but I'm pretty sure that part is correct.
cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, state_is_tuple=True)
num_inputs = tf.placeholder(tf.int32, name='NumInputs')
seq_length = tf.placeholder(tf.int32, shape=[batch_size], name='SeqLength')
# Define the input as a list (num elements = batch_size) of sequences
inputs = [tf.placeholder(tf.float32,shape=[1, max_length], name='InputData') for _ in range(batch_size)]
# Result should be a batch_size x 1 vector
result = tf.placeholder(tf.float32, shape=[batch_size, 1], name='OutputData')
tf_seq_length = tf.Print(seq_length, [seq_length, seq_length.get_shape()], 'SequenceLength: ')
outputs, states = tf.nn.rnn(cell, inputs, dtype=tf.float32)
# Print the output. The NaN first shows up here
outputs2 = tf.Print(outputs, [outputs], 'Last: ', name="Last", summarize=800)
# Define the model
tf_weight = tf.Variable(tf.truncated_normal([batch_size, num_hidden, frame_size]), name='Weight')
tf_bias = tf.Variable(tf.constant(0.1, shape=[batch_size]), name='Bias')
# Debug the model parameters
weight = tf.Print(tf_weight, [tf_weight, tf_weight.get_shape()], "Weight: ")
bias = tf.Print(tf_bias, [tf_bias, tf_bias.get_shape()], "bias: ")
# More debug info
print('bias: ', bias.get_shape())
print('weight: ', weight.get_shape())
print('targets ', result.get_shape())
print('RNN input ', type(inputs))
print('RNN input len()', len(inputs))
print('RNN input[0] ', inputs[0].get_shape())
# Calculate the prediction
tf_prediction = tf.batch_matmul(outputs2, weight) + bias
prediction = tf.Print(tf_prediction, [tf_prediction, tf_prediction.get_shape()], 'prediction: ')
tf_result = result
# Calculate the cost
cost = tf.nn.l2_loss(tf_result - prediction)
#optimizer = tf.train.AdamOptimizer()
learning_rate = 0.05
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
minimize = optimizer.minimize(cost)
mistakes = tf.not_equal(tf.argmax(result, 1), tf.argmax(prediction, 1))
error = tf.reduce_mean(tf.cast(mistakes, tf.float32))
init_op = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_op)
no_of_batches = int(len(train_input) / batch_size)
epoch = 1
val_dict = get_input_dict(val_input, val_output, train_length, inputs, batch_size)
for i in range(epoch):
    ptr = 0
    for j in range(no_of_batches):
        print('eval w: ', weight.eval(session=sess))
        # inputs batch
        t_i = train_input[ptr:ptr+batch_size]
        # output batch
        t_o = train_output[ptr:ptr+batch_size]
        # sequence lengths
        t_l = train_length[ptr:ptr+batch_size]
        sess.run(minimize, feed_dict=get_input_dict(t_i, t_o, t_l, inputs, batch_size))
        ptr += batch_size
        print("result: ", tf_result)
        print("result len: ", tf_result.get_shape())
        print("prediction: ", prediction)
        print("prediction len: ", prediction.get_shape())
    c_val = sess.run(error, feed_dict=val_dict)
    print("Validation cost: {}, on Epoch {}".format(c_val, i))
    print("Epoch ", str(i))
print('test input: ', type(test_input))
print('test output: ', type(test_output))
incorrect = sess.run(error, get_input_dict(test_input, test_output, test_length, inputs, batch_size))
sess.close()
And here are the first lines of the output it produces; you can see that everything becomes NaN: http://pastebin.com/TnFFNFrr (I could not post the full output here due to the body limit).
The first time I see the NaN is here:
I tensorflow/core/kernels/logging_ops.cc:79] Last: [0 0.76159418 0 0 0 0 0 -0.76159418 0 -0.76159418 0 0 0 0.76159418 0.76159418 0 -0.76159418 0.76159418 0 0 0 0.76159418 0 0 0 nan nan nan nan 0 0 nan nan 1 0 nan 0 0.76159418 nan nan nan 1 0 nan 0 0.76159418 nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
I hope I have made my problem clear. Thanks in advance.
Solution
RNNs suffer from exploding gradients, so you should clip the gradients of the RNN parameters. Look at this post:
How to effectively apply gradient clipping in TensorFlow?
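As an illustration of how that might look in this graph, here is a minimal sketch that replaces the optimizer.minimize(cost) line above with explicit gradient computation, clipping by global norm, and application; the clip norm of 5.0 is an arbitrary value chosen for the example, not something prescribed by the linked answer:
learning_rate = 0.05
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Compute the raw gradients instead of calling optimizer.minimize(cost) directly
grads_and_vars = [(g, v) for g, v in optimizer.compute_gradients(cost) if g is not None]
grads = [g for g, v in grads_and_vars]
variables = [v for g, v in grads_and_vars]
# Rescale the gradients so their combined global norm never exceeds 5.0
clipped_grads, _ = tf.clip_by_global_norm(grads, 5.0)
minimize = optimizer.apply_gradients(list(zip(clipped_grads, variables)))
Clipping each gradient's values individually with tf.clip_by_value is a common alternative to clipping by global norm.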
Answered By - Vincent Renkens