Issue
When adding an ExponentialDecay learning rate schedule to my Adam optimizer, it changed the training behavior even before the decay should take effect. I used the following definition for the schedule:
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3, decay_steps=25, decay_rate=0.95, staircase=True)
Since I'm using staircase=True, there should be no difference during the first 25 epochs compared to using a static learning rate of the same value. So the following two optimizers should yield identical training results for the first 25 epochs:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
However, I observed that the behavior already differs before that point.
This is the test code I used:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
np.random.seed(0)
x_data = 2*np.random.random(size=(1000, 1))
y_data = np.random.normal(loc=x_data**2, scale=0.05)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3, decay_steps=25, decay_rate=0.95, staircase=True)
histories = []
learning_rates = [1e-3, lr_schedule]
for lr in learning_rates:
    tf.random.set_seed(0)
    model = tf.keras.models.Sequential([
        Dense(10, activation='tanh', input_dim=1), Dropout(0.2),
        Dense(10, activation='tanh'), Dropout(0.2),
        Dense(1)
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=optimizer, loss='mse')
    history = model.fit(x_data, y_data, epochs=50)
    histories.append(history.history['loss'])
fig, ax = plt.subplots()
ax.set(xlabel='Epoch', ylabel='Loss')
ax.plot(histories[0], label='Static learning rate')
ax.plot(histories[1], label='Learning rate schedule')
ax.legend()
plt.show()
I'm using Python 3.7.9 and the following install of TensorFlow:
$ conda list | grep tensorflow
tensorflow 2.1.0 mkl_py37h80a91df_0
tensorflow-base 2.1.0 mkl_py37h6d63fb7_0
tensorflow-estimator 2.1.0 pyhd54b08b_0
Solution
When you use ExponentialDecay, the learning rate is decayed according to:
def decayed_learning_rate(step):
    return initial_learning_rate * decay_rate ** (step / decay_steps)
When you set staircase=True, step / decay_steps becomes an integer division, so the rate follows a staircase function.
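To see this directly, a LearningRateSchedule instance in TF 2.x is callable with a step count, so you can print the values the schedule from the question would produce. A minimal check, assuming the same schedule parameters as above:

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    1e-3, decay_steps=25, decay_rate=0.95, staircase=True)

# The schedule maps an optimizer step count to a learning rate.
for step in [0, 10, 24, 25, 49, 50]:
    print(step, float(lr_schedule(step)))

# With staircase=True the value stays at 1e-3 for steps 0-24,
# drops to 1e-3 * 0.95 at step 25, and to 1e-3 * 0.95**2 at step 50.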
Now, let's take a look at the source code:
# ...setup for the step computation...
global_step_recomp = math_ops.cast(step, dtype)  # step is the current optimizer step count
p = global_step_recomp / decay_steps
if self.staircase:
    p = math_ops.floor(p)
return math_ops.multiply(
    initial_learning_rate, math_ops.pow(decay_rate, p), name=name)
We can see that the variable p changes at every multiple of decay_steps, i.e. at step 25, 50, 75 and so on. In other words, the learning rate is constant for every 25 steps, not epochs, which is why it already changes before the first 25 epochs are over. A good explanation of the difference can be found at What is the difference between steps and epochs in TensorFlow?
Answered By - qedk