[Lecture 8] (2/2) Planning and Models
https://www.youtube.com/watch?v=FKl8kM4finE&list=PLqYmG7hTraZDVH599EItlEWsUOsJbAodm&index=8
Let's discuss a concrete instantiation of Dyna, because this idea is very general and of course we could plug many different algorithms into it and apply very different updates to both the simulated and real experience.
So what does this actually mean in practice? A popular version of this idea is what is typically referred to as Dyna-Q, where we use Q-learning on both the real and simulated experience.
First we need to initialize not only a value function, like in Q-learning, but also a model, because we'll be training it alongside.
And then on each step, we can select actions using an epsilon-greedy policy like in standard Q-learning, and after executing this action, we get a successor state and a reward from the environment. With this real transition, we can directly apply Q-learning by updating the current Q values towards the usual target, constructed by summing the reward and the discounted max Q value at the next state.
But we can also update the model in a supervised fashion. In a tabular deterministic environment, this would just amount to storing the next state and the reward.
The next point is where the secret sauce of Dyna, in some sense, comes in. For each real step in the environment, we now perform N planning updates, where we take some past state-action pair that we encountered during training, use the current model to sample a possible next state and a possible subsequent reward, and use this imagined transition to perform an additional Q-learning update.
If we increase the number of imagined planning steps that we do, we can reduce our reliance on real transitions and become more data-efficient, as we were hoping.
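To make the loop concrete, here is a minimal sketch of tabular Dyna-Q in Python. It assumes a deterministic environment exposed through an illustrative interface (`env.reset()` returning a state, `env.step(a)` returning a reward and next state); the names and interface are assumptions, not the exact code behind the lecture's experiments.

```python
# Minimal sketch of tabular Dyna-Q, assuming a deterministic environment
# with a gym-like interface (names are illustrative).
import random
from collections import defaultdict

def dyna_q(env, n_actions, n_steps=10000, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    q = defaultdict(float)        # Q values, keyed by (state, action)
    model = {}                    # learned model: (state, action) -> (reward, next_state)
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: q[(s, b)])
        r, s_next = env.step(a)

        # (1) direct RL: Q-learning update on the real transition
        target = r + gamma * max(q[(s_next, b)] for b in range(n_actions))
        q[(s, a)] += alpha * (target - q[(s, a)])

        # (2) model learning: tabular deterministic case, just store the outcome
        model[(s, a)] = (r, s_next)

        # (3) planning: N extra Q-learning updates on imagined transitions
        for _ in range(n_planning):
            ps, pa = random.choice(list(model.keys()))   # previously observed state-action
            pr, ps_next = model[(ps, pa)]                # model's predicted reward and next state
            ptarget = pr + gamma * max(q[(ps_next, b)] for b in range(n_actions))
            q[(ps, pa)] += alpha * (ptarget - q[(ps, pa)])

        s = s_next
    return q
```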
Interestingly, there are many different ways of extending or implementing these ideas. For instance, we could use a different model-free algorithm instead of Q-learning. Also, the algorithm is written here for the tabular setting, but we could use linear functions or deep neural networks to represent both the values and the model.
Alternatively, we could vary the amount of planning that we do over time. The model is typically improved and updated along the way, so it's hopefully getting better over time, and we could imagine performing more planning updates as training progresses and we become more confident in our model.
Ideally, we would even have some form of uncertainty estimates that we can use to choose how much planning to perform, how much to trust our planning updates depending on the accuracy of our learned models.
But regardless of the details and potential enhancements, the important thing is that even this basic algorithm already instantiates the most fundamental property that we were looking for: we can sink in more compute to learn more efficiently. This is really critical if the data is slow, expensive, or unsafe to collect, which is often the case in many real-world applications.
Let's take a look at what happens when we actually run these algorithms on a simple problem just to gain more understanding and intuition about planning for credit assignment.
To do so, we will consider a simple maze where, on each episode, the agent starts on the cell marked with an S, and from any cell the agent can take one of four actions, which correspond to moving up, down, left, and right respectively. The dark cells are walls, meaning that if we try to move into one of these darker cells, the action has no effect and we stay where we are. The reward is zero everywhere except for a single positive reward that you collect when you reach the goal cell, which is the one marked with a G.
Under this setting, it's fairly obvious that the optimal policy for any discount factor smaller than one is going to be the policy that leads you from S to G along the shortest path, because any choice of actions that takes you along a longer path will result in the final reward being discounted more strongly.
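To make the discounting argument explicit: with a single reward collected at the goal after T steps and zero reward everywhere else, the return from the start state is just that reward discounted T - 1 times,

```latex
% Return from the start state under a policy that reaches the goal in T steps,
% with a single reward R_goal at the goal and zero reward everywhere else:
G_0 = \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} = \gamma^{T-1} R_{\text{goal}}, \qquad 0 < \gamma < 1,
```

which is strictly decreasing in T, so minimizing the number of steps to the goal maximizes the return.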
And this means that we can evaluate the quality of the policy learned by an agent by plotting the number of steps that the agent takes in order to reach the goal as a function of the number of episodes that it has experienced.
And this plot shows three instantiations of the algorithm that were run with different amounts of planning. Specifically, it was run with zero steps of planning, meaning the algorithm reduces to vanilla Q-learning, with five steps of planning, and with 50 steps of planning.
What is apparent from this plot is that vanilla Q-learning takes many tens of episodes to converge to the optimal solution, while just adding a tiny bit of planning, so just doing five additional updates for every real update, lets the agent converge to the optimum much quicker, in less than 10 episodes. And as we increase to 50 planning steps, the agent converges to the optimal policy in just a couple of episodes, which is really impressive.
I want to give you an opportunity to peek behind the scenes and see what happens to the Q-value estimates as we run this algorithm. Specifically, let's consider what happens on the very last step of the first episode, so the first time the agent observes the reward.
We will assume that the Q values were initialized to zero everywhere. Under this assumption, without planning, the first episode is fairly boring: on all intermediate steps, the values were zero when the agent entered the cell and they stay zero, because the TD error is always zero; our estimates are zero, the rewards are zero, and the max Q value we bootstrap on is also zero. So no update is done throughout the entire first episode.
Only a single Q value is updated, on the last transition. This is the Q value corresponding to moving upwards from the cell immediately below the goal. When the agent makes this final transition of the first episode, that Q value is correctly updated towards the goal reward, and this immediately makes a greedy policy optimal in that one cell, because all the other action values stay at zero while the action value of moving up from the state just below the goal now has a non-zero value. All the other states that we have seen throughout the first episode have not been updated; their Q values stay at zero.
When we run the same algorithm with 50 steps of planning for just one episode, you can see that the Q values have now been updated throughout the maze. It's not just in the final state that the agent has realized what the optimal action is: the Q-value estimates that the agent has updated throughout this first episode show a meaningful direction for many other states, and the information about how to move from a state to the goal has propagated almost all the way back to the start, which is quite impressive.
And the reason is that, once you update that last cell in the episode, you then perform 50 more updates on imagined transitions sampled from the model, which means that on each of these 50 planning updates we can propagate information from the newly updated Q values backwards in state space. This doesn't quite solve the problem in a single episode, but it takes just a couple more episodes to correctly propagate all the information through the state space and figure out what the optimal policy is.
This is not all roses. There are certain critical issues with model-based RL, including the fact that in some cases our model will be wrong. In this experiment, I'm trying to give you some intuition of how wrong models can affect learning in a model-based RL system.
Let's use the same algorithm again to investigate how an incorrect model may affect training, and we will do so by simulating the model being wrong through a sudden change in the dynamics of the problem.
This means that the model learned by the agent in the first half of training will suddenly be wrong, because the world itself has changed. But the effect of this on learning will be quite different depending on the type of error.
In the first setting, we assume that at the beginning of training the environment looks like the grid on the left, where there is a short path between the state marked with S, as usual, and the goal marked with G in the top right corner. Once the Dyna-Q agent has learned this policy correctly, which is shown in the plot by the cumulative reward becoming linear in the number of steps, we change the environment to the one on the right. This is a very similar grid, but now the optimal policy is no longer to move to the right to reach the goal; rather, the agent needs to take a detour to circumnavigate the wall that is now blocking the path.
When you make this change in the environment, the cumulative reward collected by Dyna-Q suddenly flattens, because at first the agent will try to keep moving along the same path that it had learned earlier, but that path is now blocked. This means that for a few steps the agent will be unable to collect any rewards. But if you look at the plot, after a while the Dyna-Q agent does start collecting rewards again, because the agent updates the model to correctly understand that it cannot move through that cell.
And through planning, it will then re-route the policy of the agent by updating the Q values to take a different route, which results in the reward again growing linearly with the number of steps.
The plot shows several agents corresponding to slightly different instantiations of Dyna. The middle one is Dyna-Q, which is the canonical one, while Dyna-Q+ adds a small exploration bonus, which means that it is able to re-route slightly faster.
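As a hedged sketch of the kind of exploration bonus Dyna-Q+ uses (following the Sutton and Barto formulation rather than anything specific to this lecture): during planning, a transition that has not been tried for real in a long time receives a small bonus reward proportional to the square root of the elapsed time, which encourages the agent to re-test it.

```python
# Sketch of the Dyna-Q+ exploration bonus (Sutton & Barto style);
# kappa and the bookkeeping of "last tried" steps are illustrative.
import math

def bonus_reward(r, last_tried_step, current_step, kappa=1e-3):
    tau = current_step - last_tried_step   # real time steps since (s, a) was last executed
    return r + kappa * math.sqrt(tau)      # planning uses r + kappa * sqrt(tau) as its reward
```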
The core message remains that if the model is inaccurate in this way, where the world is harder than the model was initially assuming, the behavior of the agent leads to real transitions that correct for the model error, and therefore the learning dynamics are overall quite benign. It's good to realize that this is not always the case.
Let's consider a very similar setting with a slight difference. Again the world changes, but now the change in environment makes things easier. At the beginning, you have the situation on the left, where the agent needs to take a detour around the wall in order to reach the goal. After the change, the detour is still available to the agent, but another path has opened up: the agent could take a path to the right of the wall, which is slightly shorter and would result in higher rewards.
What happens is that, once the Dyna-Q agent has learned to correctly take this long detour around the wall in order to reach the goal, there is nothing that pushes the agent to explore the other path.
The way in which the model is incorrect, namely the fact that there is a shorter path, is not reflected in the data that the agent sees, because the error in the model doesn't result in data that can correct for the model error itself.
So what happens is that, once the agent has converged on the solution of taking the long detour, it might just stick with this behavior, and it will take a very, very long time to realize that there was an error.
The important thing to understand is that thinking deeply about exploration is still important even in model-based RL, because certain types of errors in your model might not be discovered unless something pushes your agent to execute the behavior that results in useful data.
The Dyna algorithm is a beautiful instantiation of a system that integrates learning and planning, so model-free and model-based updates. But it's not the only one.
In this section, I want to discuss Experience Replay and its connection to more conventional model-based algorithms such as Dyna.
Traditionally, RL algorithms did not explicitly store experience.
So it was trivial to place them in one of two groups: model-free methods, which do not attempt to explicitly model the dynamics of the environment and directly update values and policies online with the latest data, and model-based methods, which learn a transition and reward model and then plan value functions using such a model.
But this kind of sharp distinction is blurred not just by algorithms like Dyna that do both, but by other developments in the space of model-free algorithms.
Many modern RL systems now store transitions in an experience replay buffer, and then apply model-free algorithms not just to the latest transitions but also to past transitions in the buffer.
This is a fairly common technique for model-free agents, but it can also be seen as just instantiating Dyna, with the experience replay buffer playing the role of a non-parametric model.
This is not just a superficial similarity, because both systems actually share a fundamental property, which is their ability to sink in compute, with no additional environment interactions, to improve values and policies.
In this case, the experience replay system does so by making many updates on past experience for every new real step in the environment.
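Here is a minimal sketch of this view, with illustrative names: the replay buffer stands in for the model, and the "planning" loop is just extra Q-learning updates on sampled past transitions.

```python
# Experience replay viewed as a non-parametric model: instead of sampling
# (r, s') from a learned model, we re-sample stored real transitions and
# apply the same Q-learning update to them.
import random

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.data = []

    def add(self, s, a, r, s_next):
        if len(self.data) >= self.capacity:
            self.data.pop(0)            # drop the oldest transition
        self.data.append((s, a, r, s_next))

    def sample(self):
        return random.choice(self.data)

def replay_updates(q, buffer, n_actions, n_updates=5, alpha=0.1, gamma=0.95):
    # the analogue of Dyna's planning loop: N extra updates per real step
    for _ in range(n_updates):
        s, a, r, s_next = buffer.sample()
        target = r + gamma * max(q[(s_next, b)] for b in range(n_actions))
        q[(s, a)] += alpha * (target - q[(s, a)])
```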
This plot shows how this scalability property of planning algorithms is instantiated by both a Dyna agent and a replay-based Q-learning agent in a simple grid world.
You start in the state S and you need to reach the goal G as fast as possible, because there is some discount strictly smaller than one.
The plot shows the total number of steps required to reach the goal in the top right 25 times, starting each time from the cell S on the left, as a function of the amount of planning, that is, the number of additional updates performed for each real step in the environment.
And you can see that, regardless of whether an entire transition is replayed, as in an experience replay system, or whether a model is used to infer the subsequent reward and state given a state-action pair, the profile looks very similar.
The more additional updates you do for every real step in the environment, the more the performance improves, and so the faster you reach the goal, which means the lower the total number of steps required to reach the goal 25 times.
You might even be tempted to take this to the extreme: are experience replay systems and algorithms based on learned parametric models completely equivalent?
In the tabular setting, we can derive exact equivalences between certain model-based and model-free algorithms. More generally, if we had a perfect model that outputs exactly the same true reward and successor state as the non-parametric replay system, the two would be exactly equivalent by construction.
In general, though, any model that we use will not be perfect. So one question you may ask is: will it ever be better? Could the inaccuracies that almost certainly come from using a parametric model provide benefits on top of a non-parametric replay system?
This seems unlikely and it's fair to say that, if we only use the model to reconstruct rewards and successor states corresponding to an observed state action pair, it seems hard for a parametric model to improve over just replaying those transitions.
But the reason that algorithms like Dyna and model-based RL methods that use learned parametric models are so important is that a parametric model allows flexibility that goes way beyond that.
For instance, we could use the model to plan for action selection. If you can query a parametric model for actions that you might want to take in the future, then you can use this to construct a plan to decide how to act. And this is something that you cannot do if you just have a non-parametric replay system that can only tell you which action you took in the past in a given state.
Additionally, with a parametric model, you could do things like counterfactual planning. So query models for actions that you could have taken in the past but did not.
And related to counterfactual planning, you could do what is called backward planning. Instead of modeling the actual dynamics of the problem, you model the inverse dynamics: given a state and an action, the model tries to predict the previous reward and previous state. Then you can use this to assign credit to states that could have led to a certain outcome, in addition to assigning credit, as in standard RL, to the states that actually preceded that outcome.
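Here is a hedged sketch of what a backward-planning update could look like, assuming a hypothetical learned inverse model with a `sample_predecessors` method; the interface and names are illustrative, not a standard API.

```python
# Backward planning with an (assumed) inverse model that, given a state,
# proposes predecessor (state, action, reward) triples.
def backward_planning_step(q, inverse_model, s_next, n_actions,
                           n_predecessors=5, alpha=0.1, gamma=0.95):
    # assign credit to states that *could have* led to s_next,
    # not only to the one that actually preceded it
    for s_prev, a_prev, r_prev in inverse_model.sample_predecessors(s_next, n_predecessors):
        target = r_prev + gamma * max(q[(s_next, b)] for b in range(n_actions))
        q[(s_prev, a_prev)] += alpha * (target - q[(s_prev, a_prev)])
```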
Finally, if you have a parametric model, you could train this model at different time scales. An algorithm like Dyna need not be restricted to training a model exactly on one-step transitions, on the true native time scale of the environment.
You could train a model to make jumpy predictions, which is sometimes called jumpy planning.
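As a hedged illustration of one way to build such training targets from stored trajectories (the names and recipe are assumptions, not the lecture's): regress the model onto the state k steps ahead and the discounted k-step reward.

```python
# Construct "jumpy" (k-step) model targets from a stored trajectory,
# i.e. a list of (state, action, reward) tuples; assumes t + k is in range.
def jumpy_targets(trajectory, t, k, gamma=0.95):
    s_t, a_t, _ = trajectory[t]
    k_step_reward = sum(gamma**i * trajectory[t + i][2] for i in range(k))
    s_t_plus_k = trajectory[t + k][0]
    # model input (s_t, a_t) and its regression targets (k-step reward, k-step state)
    return (s_t, a_t), (k_step_reward, s_t_plus_k)
```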
In conclusion, it's worth noting that there are computational differences between parametric models and experience replay, and these might also play an important role in choosing between the two.
For instance, querying a replay buffer is very cheap: given a state-action pair, it immediately gives you the reward and the successor state that you observed in the past.
In contrast, given a state-action pair, generating a subsequent reward and state using a learned model could be very expensive if the learned model is parametrized by a big neural network.
And if you look at memory, things change: the memory requirements of a replay buffer can be quite harsh, because the memory will scale linearly with its capacity, while a parametric model could achieve good accuracy with a fixed and comparably smaller memory footprint.
The key takeaway is that both are important and powerful ideas that implement this core principle of planning: the capability of sinking in compute in order to improve our learning algorithms.
What is more important is to think deeply about the problem that you want to solve and whether a parametric model is the best fit for that specific problem.
So far we discussed how planning can be used to improve our estimates of a global value function or a global policy that is applicable in all the states of our MDP.
Now, I want to tell you about a different form of planning where we sink in compute as usual without requiring additional interactions with the environment, but for the purpose of selecting a single action in a single state.
It may seem like a special case of the previous problem, and of course, in a sense, it is: if we could get perfect values and perfect policies everywhere, then we could just use those perfect values in any one state.
But the motivation for investigating planning for action selection is that sometimes it's easier to fit a very accurate local value function than it is to fit a global value function. The reason is that, when planning a value function for the current state, we only need to take into account the distribution of states that are reachable from the current state, and that might be a very small portion of the overall MDP.
Additionally, planning for action selection has a few other appealing properties. For instance, suppose you have some inaccuracy in your model in some states; that will only affect the local, throwaway values that you're using to select an action right now, but it won't pollute a shared global value function that is going to be reused everywhere. So it might result in selecting a sub-optimal action in certain poorly understood states, but maybe that just leads to reasonable exploratory behavior that can help you correct the model itself.
The simplest form of planning for the purpose of action selection is what is called forward search. This approach selects the best action in a given state by building the entire search tree that has the current state as its root and then following all possible trajectories from the current state onwards until episode termination.
This approach amounts to representing as a tree the entire sub-MDP of states reachable from the current one. In some cases, this sub-MDP might be fairly tiny, and then you can maybe solve it in full every time you need to select an action.
In general, though, this might not be the case: the number of states in the tree will grow exponentially with the depth, so even with a fairly small action space, it might be computationally intractable if the horizon that you need to look ahead goes beyond a handful of steps.
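A sketch of exhaustive forward search under stated assumptions: a deterministic model with an illustrative `model.step(s, a) -> (reward, next_state, done)` interface and a depth cap to keep the recursion finite. The exponential blow-up described above is exactly why this is rarely practical.

```python
# Exhaustive forward search over the sub-MDP reachable from state s,
# assuming a deterministic model (interface is illustrative).
def forward_search(model, s, actions, gamma=0.95, depth=10):
    """Value of the best action sequence from s, searching the full tree up to `depth`."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions:
        r, s_next, done = model.step(s, a)
        value = r if done else r + gamma * forward_search(model, s_next, actions, gamma, depth - 1)
        best = max(best, value)
    return best

def select_action(model, s, actions, gamma=0.95, depth=10):
    # pick the action whose one-step lookahead value is highest
    def q(a):
        r, s_next, done = model.step(s, a)
        return r if done else r + gamma * forward_search(model, s_next, actions, gamma, depth - 1)
    return max(actions, key=q)
```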
But this is still a reasonable thing to consider, at least from a conceptual perspective, because then we have a problem with branching, and we have dealt with branching in the past. So, similarly to how we did for learning global value functions, we can again use sampling to solve these local MDPs just for the purpose of action selection. This results in what is called simulation-based search.
Here, we construct a partial tree. We start as usual from the current state as the root, but then we construct only a subset of the full tree by simulating multiple episodes of experience that all start from the current state, using the model to roll forward possible sequences of actions.
If you construct such a partial tree, then you can just apply model-free RL to the simulated episodes to estimate the values in the root.
You could instantiate simulation-based search by using Monte Carlo prediction to estimate the state value of the root node using two components: a learned model and a simulation policy. The state value of the root is then estimated by sampling K episodes from the root, using the model and the policy, and averaging the returns.
If we're interested in planning for action selection, just the state value of the root might not be that useful, so often we will want to construct action values. The same principles apply to planning local action value functions.
In this case, we'll need K episodes per action, so K times the number of actions in total, to ensure we sample each action K times in the root. Apart from that, we can use the same mechanism to generate complete episodes from the model and the simulation policy, and then again use averaging to estimate the value of each action in the root as the average of the returns that executed that action immediately in the root and then followed the simulation policy afterwards.
If we construct such a local action value function, then it's trivial to turn this into a mechanism for action selection, because in each state we can always pick the action with the highest value according to our local search.
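Here is a minimal sketch of this simple Monte Carlo search, under the same assumed deterministic `model.step(s, a) -> (reward, next_state, done)` interface as before: for each action at the root we simulate K episodes that start with that action and then follow the simulation policy, and we average the discounted returns.

```python
# Simulation-based (Monte Carlo) search for local action values at the root.
def mc_search(model, root, actions, simulation_policy, k=50, gamma=0.95, max_len=200):
    q_root = {}
    for a0 in actions:
        returns = []
        for _ in range(k):
            s, a = root, a0                  # first action is forced to a0
            ret, discount = 0.0, 1.0
            for _ in range(max_len):
                r, s, done = model.step(s, a)
                ret += discount * r
                discount *= gamma
                if done:
                    break
                a = simulation_policy(s)     # then follow the simulation policy
            returns.append(ret)
        q_root[a0] = sum(returns) / len(returns)   # Monte Carlo average per action
    # action selection: greedy with respect to the local action values
    return max(q_root, key=q_root.get), q_root
```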
In the simulation-based search algorithms that I just described, each simulation is independent of the previous ones. This has some computational advantages, because it's fully parallelizable, but it also means that we might not be making the best use of the available compute: we are not using what we have learned by rolling the model forward until the end of the episode to guide our behavior in the next simulation.
So I want to discuss a different approach where we build the tree incrementally, so that we can leverage knowledge from previous simulations in order to focus our computational resources where they matter most.
The algorithm is called Monte Carlo Tree Search and was at the heart of the famous AlphaGo and AlphaZero systems. The algorithm is fairly simple, because it just repeats the following four steps until you have exhausted the computational time or resources that you have allocated to action selection.
On each simulated episode, you start from the root and use your current estimates of the Q values, based on the previous simulations, to walk all the way from the root to a leaf node in the tree. Then you add a node to the tree by expanding an action at that leaf. You then use a rollout until episode termination, with a fixed simulation policy, to get a complete evaluation of that path in the tree. Finally, you walk backwards all the way to the root, updating the Q values of all the ancestor nodes in the tree.
And once you have exhausted the available time and resources, just as in the previous algorithm, you select the action in the root that has the highest value.
The important feature of this approach is that effectively we have two policies. We are not always expanding nodes and building the tree using a fixed simulation policy. Instead, we have a tree policy that we use to navigate from the root to a leaf, which is based on all the previous simulations, so it's guided by everything we have learned in the previous simulations from this state.
And then we have a rollout policy, which is indeed fixed, that we use to get the Monte Carlo estimate of the return from the leaf onwards.
The advantage is that the rollout policy can be very cheap; we could even just use a random policy, which can be surprisingly effective. But because we are iterating on the tree policy that we use to navigate all the way to the leaf, the system allocates resources and compute quite effectively, and this can result in very good value estimates.
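The following is a compact, hedged sketch of this scheme, again assuming the illustrative deterministic `model.step(s, a) -> (reward, next_state, done)` interface. The tree policy here is simply greedy with respect to the current Q estimates and expansion picks a random untried action, whereas practical implementations (including AlphaGo) use confidence-bound selection rules and learned priors, which this sketch omits.

```python
# Minimal Monte Carlo Tree Search sketch (deterministic model assumed).
import random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> (reward, child Node, done flag)
        self.q = {}          # action -> running mean of backed-up returns
        self.n = {}          # action -> visit count

def rollout(model, s, actions, gamma, max_len=200):
    # fixed (here: uniformly random) rollout policy until termination
    ret, discount = 0.0, 1.0
    for _ in range(max_len):
        r, s, done = model.step(s, random.choice(actions))
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret

def mcts(model, root_state, actions, gamma=0.95, n_simulations=200):
    root = Node(root_state)
    for _ in range(n_simulations):
        node, path = root, []
        # 1. selection: walk down the tree greedily w.r.t. the current Q estimates
        while node is not None and len(node.children) == len(actions):
            a = max(actions, key=lambda b: node.q.get(b, 0.0))
            r, child, done = node.children[a]
            path.append((node, a, r))
            node = None if done else child
        ret = 0.0
        if node is not None:
            # 2. expansion: add one untried action below the selected leaf
            a = random.choice([b for b in actions if b not in node.children])
            r, s_next, done = model.step(node.state, a)
            node.children[a] = (r, Node(s_next), done)
            path.append((node, a, r))
            # 3. simulation: evaluate the new leaf with the fixed rollout policy
            ret = 0.0 if done else rollout(model, s_next, actions, gamma)
        # 4. backup: update the Q estimates of all ancestors along the path
        for parent, a, r in reversed(path):
            ret = r + gamma * ret
            parent.n[a] = parent.n.get(a, 0) + 1
            parent.q[a] = parent.q.get(a, 0.0) + (ret - parent.q.get(a, 0.0)) / parent.n[a]
    # finally, select the root action with the highest estimated value
    return max(actions, key=lambda a: root.q.get(a, float('-inf')))
```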
Let's consider a situation where we have some state in which we want to select an action, and we always have two actions in every state. At the beginning, the tree is empty, so the initial state is a leaf itself and we just use the default policy to roll out until episode termination. If we roll out the policy in the model until the model tells us that the episode has terminated, maybe we get a return of one. Then we update our value of the root node to be one, because that is the average of the existing simulations.
Then we expand one node. We have no additional knowledge, so let's say we simply choose the right action at random. Again we have reached a leaf, and we use the default rollout policy to get an evaluation of the whole episode. This time we observe a return of zero, but now we go back and update not just the node we have just expanded but all its ancestors, which includes the root node. So the evaluation of the node we just expanded will be zero, but the value of the root will have been updated to one half.
Since the value in the root is higher than the value of the action that we have selected, let's try expanding the other action. Again, after reaching this leaf node, we use the rollout policy to get an evaluation. This time we observe a return of one, so we update the value of this newly expanded node but also the root. The root now has a value of two thirds, and the two actions in the root have estimates of one and zero respectively.
Since the value on the left side is higher, we again navigate to a leaf and expand one node, then do an evaluation. This time we observe a zero and back up all the updates: the newly added node has a value of zero, its parent now has a value of one half, and the root node also has a value of one half, because two out of the four episodes that started in the root had a reward of one and the other two had no reward at all.
Again, we navigate to the leaf, pick the action with the highest value, and expand that node. We roll out until the end and get an evaluation of one. Now again we have to update all the values in the parent nodes to include the latest information that we have generated with this rollout.
We start at the root, follow the Q values to reach a leaf by always selecting the highest value, and then expand again.
And if you iterate this process, you get a highly selective, best-first search where states are evaluated dynamically and we are using sampling to break the curse of dimensionality.
All this while still being computationally efficient, because the algorithm is anytime: you can iterate as long as you have computational resources, as long as you have time to think, and at any point you get the best estimate of what the agent can assume the values in the root to be, given the policy and the model it had available.
An important thing to realize, though, is that Monte Carlo tree search, all simulation-based search, and even forward search are essentially table-lookup approaches, but based not on instantiating a table of all possible states, rather a partial table that only includes the states that are most likely or directly reachable from the current state.
We discussed quite extensively that for model-free RL, table lookup is maybe a bit naive: you cannot possibly store values for all states, and you don't get any generalization, so you will likely not have trained a value or a policy for any new state that you encounter when interacting with the environment.
But for simulation-based search, table lookup is less naive, because you only need to store values for the easily reachable states that are likely under the policy and under your Q-value estimates. So even if you don't have extensive generalization, this can actually be a system that works quite well in practice.
At the same time, there are limits. It is still a partial instantiation of a table, and that could grow to be very large. So for very big search spaces, it will still be useful to combine these ideas from MCTS with, for instance, value function approximation. This is what the AlphaGo system actually did: it used value functions and policies to guide the search, and in its later iterations it even replaced the rollouts, used to get a Monte Carlo evaluation by rolling from a leaf all the way to the end, with just a learned value function, to make the process more efficient.
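As a hedged sketch of that last idea, reusing the `rollout` helper from the MCTS sketch above: the leaf evaluation can bootstrap on a learned value function, optionally mixed with a rollout estimate. Here `value_fn` is an assumed trained approximator, and the mixture is modeled on what was reported for AlphaGo rather than taken from this lecture.

```python
# Leaf evaluation that can mix a rollout return with a learned value estimate.
def evaluate_leaf(model, s_leaf, done, actions, value_fn, gamma=0.95,
                  mixing=0.5, use_rollout=True):
    if done:
        return 0.0                                   # terminal leaves have no future value
    v = value_fn(s_leaf)                             # learned value estimate for the leaf
    if not use_rollout:
        return v                                     # value function only, no rollout at all
    z = rollout(model, s_leaf, actions, gamma)       # Monte Carlo rollout estimate
    return (1.0 - mixing) * v + mixing * z           # weighted mixture of the two estimates
```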