Issue
For my basic evaluation of learning algorithms I defined a custom environment. In the standard examples for Stable Baselines, learning always seems to be initiated by Stable Baselines automatically (it chooses random actions itself and evaluates the rewards). The standard learning seems to be done like this:
model.learn(total_timesteps=10000)
and this call tries out different actions and optimizes the action-observation relations while learning.
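For reference, the standard setup I mean looks roughly like this (a minimal sketch; MyCustomEnv is only a made-up placeholder environment, and I am assuming the gym-style API used by Stable Baselines3 1.x):

import gym
import numpy as np
from gym import spaces
from stable_baselines3 import PPO

# Hypothetical minimal custom environment, only to make the sketch self-contained.
class MyCustomEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self.observation_space.sample()

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        reward = 1.0 if action == 1 else 0.0
        done = self._steps >= 100
        return obs, reward, done, {}

env = MyCustomEnv()
model = PPO("MlpPolicy", env, verbose=1)
# Stable Baselines collects its own experience here: it picks the actions itself,
# steps the environment and optimizes the policy from the observed rewards.
model.learn(total_timesteps=10000)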
I would like to try out a really basic approach: for my custom environment I would generate lists of examples specifying which actions should be taken in certain relevant situations (so there is a list of predefined observation-action-reward examples).
And I would like to train the model with this list.
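Concretely, the kind of list I have in mind would look something like this (purely hypothetical format and values, just to illustrate the idea):

import numpy as np

# Each entry means: "in this observation, this action should be taken (and it gave this reward)."
expert_examples = [
    # (observation,                                action, reward)
    (np.array([0.1, -0.3, 0.7], dtype=np.float32), 1,      1.0),
    (np.array([0.5, 0.2, -0.1], dtype=np.float32), 0,      0.0),
    # ... more hand-written situation/action pairs
]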
What would be the most appropriate way to implement this with Stable Baselines3 (using PyTorch)?
Additional information: the idea behind the question could be compared, in the case of an Atari game, to not always training on a whole game sequence at once (from start to end of the game, and then restarting until training ends), but instead training the agent only on some specific, representative situations of importance. Or in chess: there seems to be a huge difference between letting an agent select its own or randomly chosen moves and letting it follow moves played by masters in particularly interesting situations.
Maybe one could make the lists the main part of the environment's reaction (so e.g. train the agent with environment 1 for 1000 steps, then train with environment 2 for 1000 steps, and so on), roughly as sketched below. This could be a solution.
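Something along these lines (a rough sketch; Env1 and Env2 are hypothetical custom environments built around the prepared situation lists, and model.set_env() is the Stable Baselines3 call for swapping environments):

from stable_baselines3 import PPO

# Env1 and Env2 are hypothetical environments, each replaying one list of prepared situations.
envs = [Env1(), Env2()]

model = PPO("MlpPolicy", envs[0], verbose=1)
for env in envs:
    model.set_env(env)  # switch to the next environment
    # continue training on this environment without resetting the timestep counter
    model.learn(total_timesteps=1000, reset_num_timesteps=False)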
But the problem would be that Stable Baselines would still choose the actions itself, so the agent could not learn a complete sequence of "correct" steps, like masterfully chosen chess moves played in sequence.
So again the practical question is: is it possible to get Stable Baselines to train on predefined actions instead of self-chosen ones while training/learning?
Solution
Imitation learning is essentially what you are looking for. There is an imitation library that sits on top of Stable Baselines that you can use to achieve this.
See this example of how to create a policy that mimics expert behavior and train the network with it. The behavior in this case comes from a set of action sequences, or rollouts. In the example the rollouts come from an expertly trained policy, but you can probably create a hand-written one. See this for how to create a rollout.
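As a rough illustration, behavior cloning on hand-written demonstrations could look something like this (a minimal sketch assuming the imitation library's bc.BC, Trajectory and rollout helpers; the spaces and demonstration data are made up, and argument names such as rng may differ between imitation versions):

import numpy as np
from gym import spaces  # newer imitation versions use gymnasium.spaces instead
from imitation.algorithms import bc
from imitation.data import rollout
from imitation.data.types import Trajectory

# Spaces of the (hypothetical) custom environment.
observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
action_space = spaces.Discrete(2)

# Hand-written "expert" rollout: obs has one more row than acts,
# because the last observation is the state reached after the final action.
obs = np.array([[0.1, -0.3, 0.7],
                [0.5, 0.2, -0.1],
                [0.0, 0.0, 0.0]], dtype=np.float32)
acts = np.array([1, 0], dtype=np.int64)
demo = Trajectory(obs=obs, acts=acts, infos=None, terminal=True)

# Flatten the trajectory into individual transitions for behavior cloning.
transitions = rollout.flatten_trajectories([demo])

bc_trainer = bc.BC(
    observation_space=observation_space,
    action_space=action_space,
    demonstrations=transitions,
    rng=np.random.default_rng(0),  # required by recent imitation versions
)
bc_trainer.train(n_epochs=10)

# The result is an SB3-compatible policy that imitates the hand-written actions.
action, _ = bc_trainer.policy.predict(obs[0], deterministic=True)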
Answered By - Bhupen