Issue
I have a deep SARSA algorithm which works great in PyTorch on LunarLander-v2, and I would like to use it with Keras/TensorFlow. It uses mini-batches of size 64, which are used 128 times to train at each episode.
Here are the results I get. As you can see, it works great with PyTorch but not with Keras/TensorFlow... So I think I do not correctly implement the training function in Keras/TensorFlow (code is below).
It seems that the loss oscillates in Keras because epsilon decays too early to a low value, but it works very well in PyTorch...
Do you see anything that could explain why it does not work in Keras/TensorFlow? Thanks a lot for your help and any idea that could help me...
Network information:
It uses the Adam optimizer and a network with two hidden layers of 256 and 128 units, with ReLU on each:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Q_Network(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Q_Network, self).__init__()
        self.x_layer = nn.Linear(state_dim, 256)
        self.h_layer = nn.Linear(256, 128)
        self.y_layer = nn.Linear(128, action_dim)
        print(self.x_layer)

    def forward(self, state):
        xh = F.relu(self.x_layer(state))
        hh = F.relu(self.h_layer(xh))
        state_action_values = self.y_layer(hh)
        return state_action_values
For Keras/TensorFlow I use this one:
import tensorflow as tf
from tensorflow import keras

def CreationModele(dimension):
    entree_etat = keras.layers.Input(shape=(dimension,))
    sortie = keras.layers.Dense(units=256, activation='relu')(entree_etat)
    sortie = keras.layers.Dense(units=128, activation='relu')(sortie)
    sortie = keras.layers.Dense(units=4)(sortie)
    modele = keras.Model(inputs=entree_etat, outputs=sortie)
    return modele
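As a quick sanity check (not from the original post), the Keras model can be built and inspected to confirm it matches the PyTorch architecture (8 → 256 → 128 → 4 for LunarLander-v2) and returns one Q-value per action; the variable names below are illustrative:

import numpy as np

# Hypothetical sanity check of the Keras network defined above.
modele = CreationModele(8)
modele.summary()                                       # Dense(256) -> Dense(128) -> Dense(4)
etats_factices = np.zeros((64, 8), dtype=np.float32)   # fake mini-batch of states
valeurs_q = modele(etats_factices)
print(valeurs_q.shape)                                 # expected: (64, 4)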
Training code
In PyTorch, the training is done by:
def update_Sarsa_Network(self, state, next_state, action, next_action, reward, ends):
    actions_values = torch.gather(self.qnet(state), dim=1, index=action.long())
    next_actions_values = torch.gather(self.qnet(next_state), dim=1, index=next_action.long())
    next_actions_values = reward + (1.0 - ends) * (self.discount_factor * next_actions_values)
    q_network_loss = self.MSELoss_function(actions_values, next_actions_values.detach())
    self.qnet_optim.zero_grad()
    q_network_loss.backward()
    self.qnet_optim.step()
    return q_network_loss
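For reference, a minimal sketch of the per-episode update loop described at the top (64-sample mini-batches, 128 updates per episode); the agent object and the replay buffer's sample(batch_size) method are assumptions, since the post does not show them:

# Hypothetical per-episode training loop: 128 updates, each on a mini-batch of 64.
for _ in range(128):
    state, next_state, action, next_action, reward, ends = agent.replay_buffer.sample(64)
    loss = agent.update_Sarsa_Network(state, next_state, action, next_action, reward, ends)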
And in Keras/TensorFlow by:
mse = keras.losses.MeanSquaredError(reduction=keras.losses.Reduction.SUM)

@tf.function
def train(model, batch_next_states_tensor, batch_next_actions_tensor, batch_reward_tensor, batch_end_tensor, batch_states_tensor, batch_actions_tensor, optimizer, gamma):
    with tf.GradientTape() as tape:
        # Estimate the values of the current actions
        actions_values = model(batch_states_tensor)  # (mini_batch_size, 4)
        actions_values = tf.linalg.diag_part(tf.gather(actions_values, batch_actions_tensor, axis=1))  # (mini_batch_size,)
        actions_values = tf.expand_dims(actions_values, -1)  # (mini_batch_size, 1)

        # Estimate the values of the next actions
        next_actions_values = model(batch_next_states_tensor)  # (mini_batch_size, 4)
        next_actions_values = tf.linalg.diag_part(tf.gather(next_actions_values, batch_next_actions_tensor, axis=1))  # (mini_batch_size,)
        cibles = batch_reward_tensor + (1.0 - batch_end_tensor) * gamma * tf.expand_dims(next_actions_values, -1)  # (mini_batch_size, 1)

        error = mse(cibles, actions_values)
    grads = tape.gradient(error, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return error
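One difference from the PyTorch version worth noting: there the target is detached (next_actions_values.detach()), while here cibles is computed inside the GradientTape without tf.stop_gradient, so gradients also flow through the next-state branch. A minimal sketch of how the target could be treated as a constant to mirror the PyTorch behaviour (a variant for comparison, not the original code):

# Hypothetical variant of the body of train(): stop the gradient through the
# SARSA target so it behaves like .detach() in the PyTorch update.
with tf.GradientTape() as tape:
    actions_values = model(batch_states_tensor)
    actions_values = tf.expand_dims(tf.linalg.diag_part(tf.gather(actions_values, batch_actions_tensor, axis=1)), -1)

    next_actions_values = model(batch_next_states_tensor)
    next_actions_values = tf.linalg.diag_part(tf.gather(next_actions_values, batch_next_actions_tensor, axis=1))
    cibles = batch_reward_tensor + (1.0 - batch_end_tensor) * gamma * tf.expand_dims(next_actions_values, -1)
    cibles = tf.stop_gradient(cibles)  # target treated as a constant during backprop

    error = mse(cibles, actions_values)
grads = tape.gradient(error, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))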
Error function and Optimizer code
The optimizer is Adam in both PyTorch and TensorFlow with lr=0.001. In PyTorch:
def __init__(self, state_dim, action_dim):
    self.qnet = Q_Network(state_dim, action_dim)
    self.qnet_optim = torch.optim.Adam(self.qnet.parameters(), lr=0.001)
    self.discount_factor = 0.99
    self.MSELoss_function = nn.MSELoss(reduction='sum')
    self.replay_buffer = ReplayBuffer()
In Keras/TensorFlow:
alpha = 1e-3
# Initialize the model
modele_Keras = CreationModele(8)
optimiseur_Keras = keras.optimizers.Adam(learning_rate=alpha)
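For completeness, a hypothetical call to train() with a mini-batch of 64 from LunarLander-v2; the NumPy batch arrays and the dtypes/shapes (float32 states, int32 actions, rewards and end flags as float32 column vectors) are assumptions, not shown in the post:

# Hypothetical invocation, assuming NumPy batches sampled from a replay buffer.
batch_states_tensor       = tf.convert_to_tensor(batch_states, dtype=tf.float32)            # (64, 8)
batch_next_states_tensor  = tf.convert_to_tensor(batch_next_states, dtype=tf.float32)       # (64, 8)
batch_actions_tensor      = tf.convert_to_tensor(batch_actions, dtype=tf.int32)             # (64,)
batch_next_actions_tensor = tf.convert_to_tensor(batch_next_actions, dtype=tf.int32)        # (64,)
batch_reward_tensor       = tf.convert_to_tensor(batch_rewards, dtype=tf.float32)[:, None]  # (64, 1)
batch_end_tensor          = tf.convert_to_tensor(batch_ends, dtype=tf.float32)[:, None]     # (64, 1)

erreur = train(modele_Keras, batch_next_states_tensor, batch_next_actions_tensor,
               batch_reward_tensor, batch_end_tensor, batch_states_tensor,
               batch_actions_tensor, optimiseur_Keras, gamma=0.99)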
Solution
OK, I finally found a solution by de-correlating the target and action values using two models, one of which is updated periodically and used for the target value calculation.
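A minimal sketch of that two-model setup in Keras/TensorFlow, assuming the target network is a copy of the online network whose weights are refreshed every fixed number of training steps (the variable names and the update period are illustrative, not from the answer):

# Hypothetical two-model (target network) setup: the online model is trained
# every step, the target model is only refreshed periodically.
modele_Keras = CreationModele(8)                       # online network
modele_cible = CreationModele(8)                       # target network
modele_cible.set_weights(modele_Keras.get_weights())   # start from the same weights

periode_maj_cible = 1000  # illustrative refresh period, in training steps

def maj_modele_cible(step):
    # Copy the online weights into the target network every `periode_maj_cible` steps.
    if step % periode_maj_cible == 0:
        modele_cible.set_weights(modele_Keras.get_weights())

# In the training function, the next-state values would then be computed with
# modele_cible(batch_next_states_tensor) instead of model(batch_next_states_tensor).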
Answered By - rdpdo