Issue
I am working on a DQN training model for the game "CartPole-v1". During training, the system did not report any errors in the terminal; however, the evaluation results got worse over time. This is the output data:
episode: 85 score: 18 average score: 20.21 epsilon: 0.66
episode: 86 score: 10 average score: 20.09 epsilon: 0.66
episode: 87 score: 9 average score: 19.97 epsilon: 0.66
episode: 88 score: 14 average score: 19.90 epsilon: 0.65
episode: 89 score: 9 average score: 19.78 epsilon: 0.65
episode: 90 score: 10 average score: 19.67 epsilon: 0.65
episode: 91 score: 14 average score: 19.60 epsilon: 0.64
episode: 92 score: 13 average score: 19.53 epsilon: 0.64
episode: 93 score: 17 average score: 19.51 epsilon: 0.64
episode: 94 score: 10 average score: 19.40 epsilon: 0.63
episode: 95 score: 16 average score: 19.37 epsilon: 0.63
episode: 96 score: 16 average score: 19.33 epsilon: 0.63
episode: 97 score: 10 average score: 19.24 epsilon: 0.62
episode: 98 score: 13 average score: 19.17 epsilon: 0.62
episode: 99 score: 12 average score: 19.10 epsilon: 0.62
episode: 100 score: 11 average score: 19.02 epsilon: 0.61
episode: 101 score: 17 average score: 19.00 epsilon: 0.61
episode: 102 score: 11 average score: 18.92 epsilon: 0.61
episode: 103 score: 9 average score: 18.83 epsilon: 0.61
I'll show my code here. First, I constructed a neural network:
import random
from torch.autograd import Variable
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import gym
from collections import deque

# construct a neural network (prepare for step1, step3.2 and 3.3)
class DQN(nn.Module):
    def __init__(self, s_space, a_space) -> None:
        # inherit from nn.Module in pytorch
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(s_space, 360)
        self.fc2 = nn.Linear(360, 360)
        self.fc3 = nn.Linear(360, a_space)

    # DNN operation architecture
    def forward(self, input):
        out = self.fc1(input)
        out = F.relu(out)
        out = self.fc2(out)
        out = F.relu(out)
        out = self.fc3(out)
        return out
Instead of creating an agent class, I directly defined a global select function, which selects an action according to epsilon, as well as a global back-propagation function that updates the network by gradient descent:
# define the action selection according to epsilon using the neural network (prepare for step3.2)
def select(net, epsilon, env, state):
    # randomly select an action if not greedy
    if np.random.rand() <= epsilon:
        action = env.action_space.sample()
        return action
    # select the maximum Q-value action by the NN and the given state if greedy
    else:
        actions = net(Variable(th.Tensor(state))).detach().numpy()
        action = np.argmax(actions[0])
        return action
This is the back-propagation function and the function for decreasing epsilon:
# using the loss function to improve the neural network (prepare for step3.3)
def backprbgt(net, store, batch_size, gamma, learning_rate):
    # step1: create loss function and Adam optimizer
    loss_F = nn.MSELoss()
    opt = th.optim.Adam(net.parameters(), lr=learning_rate)

    # step2: extract the sample in memory
    materials = random.sample(store, batch_size)

    # step3: calculate arguments of the loss function:
    for t in materials:
        Q_value = net(Variable(th.Tensor(t[0])))
        # step3.1 calculate tgt_Q_value in terms of greedy:
        reward = t[3]
        if t[4] == True:
            tgt = reward
        else:
            tgt = reward + gamma * np.amax(net(Variable(th.Tensor(t[2]))).detach().numpy()[0])
        # print(tgt)
        # tgt_Q_value = Variable(th.Tensor([[float(tgt)]]), requires_grad=True)
        # print("Q_value:", Q_value)
        Q_value[0][t[1]] = tgt
        tgt_Q_value = Variable(th.Tensor(Q_value))
        # print("tgt:", tgt_Q_value)

        # step3.2 calculate evlt_Q_value
        # index = th.tensor([[t[1]]])
        # evlt_Q_value = Q_value.gather(1, index)  # gather tgt into the corresponding action
        evlt_Q_value = net(Variable(th.Tensor(t[0])))
        # print("evlt:", evlt_Q_value)

        # step4: backward and optimization
        loss = loss_F(evlt_Q_value, tgt_Q_value)
        # print(loss)
        opt.zero_grad()
        loss.backward()
        opt.step()

# step5: decrease epsilon for exploitation
def decrease(epsilon, min_epsilon, decrease_rate):
    if epsilon > min_epsilon:
        epsilon *= decrease_rate
After that, the parameters and the training process look like this:
# training process
# step 1: set parameters and NN
episode = 1500
epsilon = 1.0
min_epsilon = 0.01
dr = 0.995
gamma = 0.9
lr = 0.001
batch_size = 40
memory_store = deque(maxlen=1500)

# step 2: define game category and associated states and actions
env = gym.make("CartPole-v1")
s_space = env.observation_space.shape[0]
a_space = env.action_space.n
net = DQN(s_space, a_space)
score = 0

# step 3: training
for e in range(0, episode):
    # step3.1: at the start of each episode, the current result should be refreshed
    # set initial state matrix
    s = env.reset().reshape(-1, s_space)

    # step3.2: iterate the state and action
    for run in range(500):
        # select action and get the next state according to current state "s"
        a = select(net, epsilon, env, s)
        obs, reward, done, info = env.step(a)
        next_s = obs.reshape(-1, s_space)
        s = next_s
        score += 1
        if done:
            reward = -10.0
            memory_store.append((s, a, next_s, reward, done))
            avs = score / (e + 1)
            print("episode:", e+1, "score:", run+1, "average score: {:.2f}".format(avs), "epsilon: {:.2}".format(epsilon))
            break
        # save sample data
        memory_store.append((s, a, next_s, reward, done))
        if run == 499:
            print("episode:", e+1, "score:", run+1, "average score:", avs)

    # step3.3 whenever the episode count reaches an integer multiple of the batch size,
    # we should backpropagate to improve the NN
    if len(memory_store) > batch_size:
        backprbgt(net, memory_store, batch_size, gamma, lr)  # here we need a backprbgt function to backward
    if epsilon > min_epsilon:
        epsilon = epsilon * dr
During the entire training process there were no errors or exceptions. However, instead of the score increasing, the model achieved lower scores in later episodes. I think the theory behind this model is correct, but I cannot find where the error is, although I have tried many ways to improve my code, including rechecking the input arguments of the network and modifying the data structures of the two arguments of the loss function. I am pasting my code here and hope to get some help on how to fix it. Thanks!
Solution
Check out the code. For the most part it is the same as the snippet above, but there are some changes (a minimal sketch of these changes follows the list):
- For a step in the replay buffer (which is called memory_store in the code) a namedtuple is used; in the update it is much easier to read t.reward than to work out what every index of a step t means.
- The DQN class has an update method; it is better to keep the optimizer as an attribute of the class than to create it every time the backprbgt function is called.
- The usage of torch.autograd.Variable here is unnecessary, so it was also taken away.
- The update in backprbgt is done per batch.
- The size of the hidden layer is decreased from 360 to 32, while the batch size is increased from 40 to 128.
- The network is updated once every 10 episodes, but on 10 batches from the replay buffer.
- The average score is printed every 50 episodes, based on the last 10 episodes.
- Seeds are added.
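Since the linked code itself is not reproduced in this post, here is a minimal sketch of what those changes could look like. It is not the answerer's exact code: the Transition namedtuple, the update method signature, the replay-buffer size, the episode count, and the seeding calls are illustrative assumptions, and the same pre-0.26 gym API as in the question (reset returning only the observation, step returning four values) is assumed:

import random
from collections import deque, namedtuple

import gym
import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as F

# a namedtuple makes t.reward far more readable than t[3]
Transition = namedtuple("Transition", ("state", "action", "next_state", "reward", "done"))

class DQN(nn.Module):
    def __init__(self, s_space, a_space, hidden=32, lr=0.001, gamma=0.9):
        super().__init__()
        self.fc1 = nn.Linear(s_space, hidden)   # hidden size 32 instead of 360
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, a_space)
        self.gamma = gamma
        # optimizer kept as an attribute instead of being re-created on every update
        self.opt = th.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

    def update(self, batch):
        # whole batch at once; no torch.autograd.Variable needed
        states = th.tensor(np.array([t.state for t in batch]), dtype=th.float32)
        actions = th.tensor([t.action for t in batch], dtype=th.int64).unsqueeze(1)
        rewards = th.tensor([t.reward for t in batch], dtype=th.float32)
        next_states = th.tensor(np.array([t.next_state for t in batch]), dtype=th.float32)
        dones = th.tensor([float(t.done) for t in batch], dtype=th.float32)

        q = self(states).gather(1, actions).squeeze(1)   # Q(s, a) of the taken actions
        with th.no_grad():                               # target must not carry gradients
            q_next = self(next_states).max(1).values
        target = rewards + self.gamma * q_next * (1.0 - dones)

        loss = F.mse_loss(q, target)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()

# seeds for reproducibility
random.seed(0)
np.random.seed(0)
th.manual_seed(0)

env = gym.make("CartPole-v1")
env.seed(0)                                   # pre-0.26 gym API assumed
net = DQN(env.observation_space.shape[0], env.action_space.n)
memory_store = deque(maxlen=10000)            # illustrative buffer size
batch_size, epsilon, min_epsilon, dr = 128, 1.0, 0.01, 0.995
scores = deque(maxlen=10)                     # lengths of the last 10 episodes

for e in range(70000):                        # long run, as in the answer's plot
    s = env.reset()
    for run in range(500):
        # epsilon-greedy action selection
        if np.random.rand() <= epsilon:
            a = env.action_space.sample()
        else:
            with th.no_grad():
                a = int(th.argmax(net(th.tensor(s, dtype=th.float32))))
        next_s, reward, done, _ = env.step(a)
        memory_store.append(Transition(s, a, next_s, reward, done))
        s = next_s
        if done:
            scores.append(run + 1)
            break
    # update once every 10 episodes, on 10 sampled batches
    if e % 10 == 0 and len(memory_store) >= batch_size:
        for _ in range(10):
            net.update(random.sample(memory_store, batch_size))
    epsilon = max(min_epsilon, epsilon * dr)
    # report the average of the last 10 episodes every 50 episodes
    if (e + 1) % 50 == 0 and scores:
        print("episode:", e + 1, "average score (last 10): {:.2f}".format(np.mean(scores)))

Keeping the optimizer as an attribute matters because re-creating Adam on every call (as backprbgt does) throws away its running gradient statistics; batching the update and detaching the target with th.no_grad() are the other key differences from the per-sample loop in the question.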
Also, RL takes a long time to learn anything, so hoping that after 100 episodes it will reach even 100 points is somewhat optimistic. For the code in the link, averaging over 5 runs results in the following dynamics:
X axis -- number of episodes (yes, 70K, but that is only about 20 minutes of real time)
Y axis -- number of steps per episode
As can be seen, after 70K episodes the algorithm achieves a reward comparable to the highest possible in this environment (the highest is 500). By tweaking the hyperparameters a faster rate can be achieved, but also remember that this is DQN without any modifications.
Answered By - draw