[Lecture 11] (1/2) Off-Policy and Multi-Step Learning
https://www.youtube.com/watch?v=u84MFu1nG4g&list=PLqYmG7hTraZDVH599EItlEWsUOsJbAodm&index=11
Why is this important? There are a number of reasons. First, what is Off-policy learning? Off-policy learning is learning about a policy that is different from the one used to generate the data.
There are several reasons why you might want to do that, but in general it means you're posing what-if questions: hypothetical, counterfactual questions. What if I had done this rather than the other thing that I actually did?
One use case: you might want to learn about the greedy policy, but you might not want to follow the greedy policy all the time, because that doesn't explore sufficiently.
Q-learning is an example of an algorithm that does this, and it's often named as the canonical example of an Off-policy learning algorithm. That is true, but it's not the only Off-policy learning you could do. In addition, or instead, you could also learn about many other policies. This could be useful because you care about those policies, but also because you might just want to learn a lot about the world and condense all of that information into the agent somehow, to reuse it later.
The greedy policy is typically a tempting thing to approximate, because then you can do something like policy iteration, but that doesn't mean it's most effective to only learn about the greedy policy all the time.
In addition, you might have other constraints. For instance, you might want to learn from observed data, such as logs you've stored or information gathered from other agents, and then learn about the greedy policy or about some different policy. This could simply be because you're interested in a certain application: you have a data store collected by watching humans interact with a system, recording which actions the humans took, and you might want to ask what-if questions. What if they had done this other thing, what would the value have been, or what would my predictions have been?
And of course, this also includes learning from your own past policies, because that is also just a data store. So you could imagine doing something called replay, where you collect a lot of data but keep changing your policy, because you want to keep getting better. That means the old data you've collected no longer corresponds to your current policy, so you might want to do Off-policy learning from it, because the behavior that generated the old data doesn't match the current behavior you're interested in. That is another reason to want Off-policy learning: extensive replay lets you reuse all the data you've generated so far, which makes you very data efficient.
Now, there is a different reason why Off-policy learning is important, beyond learning about many things or about the greedy policy in a value-based method: it's also very important in policy gradients, to correct for mismatching data distributions.
If we want to do Off-policy learning about some different policy pi, so we're following a behavior policy mu that might differ from the policy pi we're actually interested in, then one thing you could do is use this general Q-learning or expected SARSA update.
This is a straightforward application of bootstrapping, where we consider taking a certain action A_t in a state S_t, and then we see what happens when we take that action.
But in the next state, we consider following this policy pi, and pi does not have to match the behavior. It could match the behavior, in which case you're just doing On-policy expected SARSA. But pi could also be greedy, in which case you're doing Q-learning. Or it could be something else altogether.
So in expected SARSA, pi was meant to be the current behavior policy.
SARSA is also interesting to point out: it still corresponds to this update, but then pi is a stochastic policy that assigns all probability mass to the action you actually pick in the next state. That is an On-policy algorithm, but in general you can plug in anything there and learn about whatever policy is of interest to you.
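To make this concrete, here is a minimal sketch of the general update being described, in Python. The names (`q`, `target_policy`, `alpha`, `gamma`) are my own illustrative assumptions, not taken from the lecture slides.

```python
import numpy as np

def general_update(q, s, a, r, s_next, target_policy, alpha=0.1, gamma=0.99):
    """One bootstrapped update towards an arbitrary target policy pi.

    q             : 2D array of action values, q[state, action]
    target_policy : maps a state index to a probability vector over actions
    If target_policy is greedy in q this is Q-learning; if it equals the
    behaviour policy it is expected SARSA; SARSA corresponds to putting all
    probability mass on the action actually taken next.
    """
    pi_next = target_policy(s_next)            # pi(. | S_{t+1})
    bootstrap = np.dot(pi_next, q[s_next])     # sum_a pi(a | S_{t+1}) q(S_{t+1}, a)
    q[s, a] += alpha * (r + gamma * bootstrap - q[s, a])
    return q
```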
We might want to do multi-step updates, and then we can't rely on this trick of bootstrapping alone, because now we have to consider what happens over a longer trajectory. So just bootstrapping is not enough anymore, and that is where we introduce importance-sampling corrections.
If we have a trajectory tau that spans from S_t to S_T, maybe the end of the episode, then we can correct for the distribution that was used to generate it by multiplying the observed return G_t by the probability of that trajectory under the target policy we're interested in, divided by the probability of the same trajectory under the behavior policy.
If we do this, then the expectation of this return will be equal to the expectation of the return under the target policy.
So note that G_t is the uncorrected return and G_t with a hat is the corrected return.
What this is saying is: if you make this correction and then take the expectation under the behavior policy mu, then the expectation of this whole quantity, which we call G_hat, will be equal to the expectation of the uncorrected return under the target policy.
So this gives us a method that allows us to take any return generated under any behavior, and correct for the distribution that generated the trajectory leading to that return, so that it has the right expected value.
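Written out in standard notation (my reconstruction; the lecture presents this on a slide), the corrected return and the claim about its expectation are:

$$\hat G_t = \frac{p(\tau_{t:T} \mid \pi)}{p(\tau_{t:T} \mid \mu)}\, G_t, \qquad \mathbb{E}_{\mu}\big[\hat G_t \mid S_t = s\big] = \mathbb{E}_{\pi}\big[G_t \mid S_t = s\big].$$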
Let me make that a little more concrete. You might have a trajectory that was quite unlikely under your behavior policy but just happened to occur, and that would happen all the time under your target policy. In that case you'd have a relatively high number divided by a relatively low number, so we would upweight that return. That makes sense, because this trajectory would be more common under the target policy.
Conversely, we could generate a trajectory under the behavior that is quite likely to happen but would be very unlikely under the target policy, and then we down-weight it. This means we're re-weighting the samples in such a way that, in the end, the expectation equals the thing we want.
Now, the probability of a trajectory under one policy divided by the probability of that same trajectory under a different policy can be written as the product of the individual importance sampling ratios for each step. The transition probabilities, which are also contained in the probability of the trajectory, are the same above and below the line, so they cancel out, and what we're left with is just the probabilities of selecting each of the actions in the trajectory.
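In the same notation (again a reconstruction, not copied from the slide), the cancellation looks like this:

$$\frac{p(\tau_{t:T} \mid \pi)}{p(\tau_{t:T} \mid \mu)} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} \mu(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}.$$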
Why do we consider importance-sampling ratios? Because this allows us to do multi-step updates and this can be beneficial to trade-off bias and variance.
Consider that a certain path has been taken. If you do one-step updating, as with one-step SARSA, you would only ever update the state just before reaching the goal, and then you'd have to wait all the way until the next episode before you can propagate the information from this newly updated action.
That can be quite slow, so we argue that sometimes it's better to do a multi-step update: maybe not full Monte Carlo, maybe not all the way back to where we started, but more than one step, to propagate the information a little faster.
One way to interpret this is as a bias-variance trade-off: multi-step returns increase the variance but decrease the bias of your update, because we're not relying as much on estimated state values.
And in addition to that, you can see that the information will propagate faster in an episode when you do this.
Typically the trade-off is that a multi-step return tends to be more effective, while both extremes, full Monte Carlo and one-step TD, tend to be worse than doing a couple of steps. So neither full Monte Carlo nor one-step bootstrapping is the best trade-off.
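For reference, the n-step return being contrasted with one-step TD and full Monte Carlo is, in standard notation (not reproduced from the slides):

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} v(S_{t+n}),$$

where n = 1 recovers one-step bootstrapping and n = T - t recovers the full Monte Carlo return.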
This is also quite important for policy gradients, and that turns out to really be the case in practice.
Recall that for policy gradient methods, we want to sample something similar to this equation: an expectation over the dynamics of the Markov Decision Process we're in, and over the policy we're currently following, which is pi.
And then we want to estimate this term, which is the actual value of the policy, q_pi (not an estimate, the actual value), times the gradient of the logarithm of the policy at that state and action.
If we could sample this, this would give us an unbiased estimate, a stochastic estimate to the true policy gradient and would allow us to make our policy better by following that gradient.
Now, On-policy, we could sample a return, maybe a multi-step return or even a Monte Carlo return, such that the expectation of that return would be equal or approximately equal to q_pi.
If G_t here is a Monte Carlo return and we're On-policy, then this would just be an equality. But full Monte Carlo can be too high variance, and in the common case we might actually want to bootstrap, which is why I put an approximate equality here: G_t might be a multi-step return instead of a full Monte Carlo return, and in that case the expectation is not exactly equal to q_pi. But it's typically close, and that's good enough.
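In standard notation (my reconstruction of the quantity being discussed), the policy gradient and the sampled estimate are:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\big[\, q_\pi(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \,\big] \approx \mathbb{E}_{\pi}\big[\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \,\big],$$

where the approximation is exact when G_t is a full Monte Carlo return and only approximate when G_t bootstraps.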
But what if we're Off-policy? Then this isn't necessarily the case anymore. If we're following mu, then we have to correct G somehow to get a good estimate for q_pi.
So if the behavior is different, we need to do something to correct, because otherwise the gradient won't be pointing in the right direction. That's important for policy gradient algorithms: if we get a wrong estimate for the return here, it might mean we're no longer following any gradient, or we're following the gradient of the wrong objective, and that means your policy could get worse rather than better.
That's not just a hypothetical, theoretical thing. It turns out that policy gradients are quite sensitive to Off-policy updates, and you really have to be careful to correct for them. A little bit Off-policy might be okay, but if you're too Off-policy with policy gradient updates, the gradient is quite likely to point in a direction that doesn't make your policy better, and sometimes even makes it worse.
So that's another motivation for Off-policy learning, specific to policy gradients where it may be even more important to correct completely for the Off-policiness than in value-based methods.
Now we're going to talk about a couple of issues in Off-policy learning. These issues can be grouped into two main categories.
The first one is high variance and this is especially true when using multi-step updates.
A separate issue is that sometimes you can get divergent and inefficient learning, because of a subtle property: if you go Off-policy and you correct for that in the targets, there may still be a mismatch in your state visitation distribution.
The first issue that we're going to talk about is variance. This is specific to importance sampling corrections. But it's also somewhat inherent to Off-policy learning.
One way to think about this is to consider a one-step reward and look at the expectation of the importance-sampled reward. For simplicity we look at the one-step case, but some of the conclusions transfer to the multi-step case. So we're just going to verify that this is true.
We're going to sample an action under a behavior policy mu, and then correct for that. So we look at the probability of selecting that one action under the target policy we're interested in, whatever it is (it might be greedy, it might be something else altogether), and divide by the probability of actually selecting it under the behavior mu.
So this is just to verify that indeed the expectation is the same if we do the importance sampling corrections. But the variance is not necessarily the same. And sometimes the difference can be quite large.
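A small numerical sketch of this one-step case (my own illustrative example, not from the lecture): the corrected estimate matches the target expectation, while the uncorrected one does not, but the variance of the corrected samples can be much larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-action bandit: expected reward per action, behaviour mu, target pi.
r_mean = np.array([1.0, 0.0])          # E[R | a]
mu = np.array([0.1, 0.9])              # behaviour policy (rarely picks the good action)
pi = np.array([0.9, 0.1])              # target policy (mostly picks the good action)

n = 100_000
a = rng.choice(2, size=n, p=mu)        # actions sampled from mu
r = r_mean[a] + rng.normal(0, 0.1, n)  # noisy rewards
rho = pi[a] / mu[a]                    # per-sample importance ratios

print("target expectation   :", np.dot(pi, r_mean))   # what we want: E_pi[R]
print("corrected estimate   :", np.mean(rho * r))     # matches in expectation
print("uncorrected estimate :", np.mean(r))           # biased towards E_mu[R]
print("variance of rho * R  :", np.var(rho * r))      # much larger than the variance of R
```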
Now it turns out that in some cases, the variance of an importance-weighted return can even be infinite.
We're going to talk about multiple different ways in which we can reduce the variance. So we will discuss specifically three different strategies.
One is per-decision importance weighting, which is an older idea, some 20 years old now, but quite useful in reinforcement learning. It gets rid of some unnecessary variance in our updates.
We will also talk about control variates, and about bootstrapping, specifically adaptive bootstrapping, which will allow us to pick automatically, by looking at the importance sampling ratios, when it's important to bootstrap in order to minimize the variance.
What does this term mean, per-decision importance weighting?
We're going to consider some arbitrary state s, and we're going to look at some random variable X. Think of X here as a reward.
Now we're going to observe that for any X that does not correlate with the action, we don't actually need to importance sample.
The expectation of this random variable X (for instance, your reward) under the target policy pi, which is the thing we want, is equal to the expectation of X under the policy mu, as long as we multiply by this importance sampling ratio.
But now what we're claiming is that, if X is uncorrelated with the action, then this thing is actually equal to the expectation of the uncorrected random variable X.
Intuitively, this makes sense. If X does not depend on the policy, we don't need to correct, because the expectation will be about everything else that is random. It will be about your MDP for instance. But if it doesn't depend on the action, then the policy shouldn't matter. We shouldn't need to correct.
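The argument in one line (reconstructed in standard notation): if X is independent of the action A drawn from mu(.|s), the ratio factors out and its expectation under mu is one, so

$$\mathbb{E}_{\mu}\!\left[\frac{\pi(A \mid s)}{\mu(A \mid s)}\, X\right] = \mathbb{E}[X]\; \mathbb{E}_{\mu}\!\left[\frac{\pi(A \mid s)}{\mu(A \mid s)}\right] = \mathbb{E}[X] \sum_a \mu(a \mid s)\, \frac{\pi(a \mid s)}{\mu(a \mid s)} = \mathbb{E}[X].$$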
Why is this a good thing? It means we might be able to get lower variance, by not adding the extra variance of these corrections and simply not correcting random variables that don't correlate with the actions under consideration.
In order to correct this return, if it was generated according to some behavior policy mu but we're interested in the target policy pi, we can multiply by the importance sampling ratio for the full duration of the trajectory.
So in this case, the return spans a trajectory from time step t to T, and that means we're going to correct with all of the ratios multiplied together.
This gives us a very condensed notation, where the total ratio captures the probability of the trajectory under the target policy divided by the probability of the trajectory under the behavior policy, and we multiply the return by it to make it an unbiased estimate of the return under the target policy.
However, we can also note that this return is a summation of individual rewards. So instead of viewing this ratio as correcting the full return, we can also push it into the summation and apply it to each of the individual rewards within that return.
This raises a really interesting question: is this the best importance sampling ratio to apply to each of these individual rewards? It turns out we can maybe do a little better.
So what we're going to do is, we're going to get rid of some terms that we don't need.
To remind you, rho_{t:T-1} is the importance sampling ratio for the full trajectory that this return spans, and we can push it inside the summation that the return is, multiplying each of the individual rewards.
But now recall the fact that earlier rewards cannot depend on later actions. Once you've received a reward, you can't change it in hindsight by picking actions later on; that would violate basic causality. So if the policies are fixed, which is what we're considering here (a fixed target policy and a fixed behavior policy), then earlier rewards cannot correlate with later actions: later actions have no influence on those earlier rewards. This means that for each of the individual importance sampling ratios multiplying each reward, we can truncate the product at the action taken just before that reward.
We're cutting off all of the actions that happened after we've already received this specific reward.
What does this mean? It means this is an equality if they don't correlate: the actions after time step k cannot influence the reward received at k+1. So these expectations will be equal even if we ignore those ratios, because of the earlier property we talked about: if the ratio is uncorrelated with the random variable it multiplies, you can pull them apart, and the expected ratio equals one, so it disappears from the expectation.
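In formulas (a reconstruction in standard notation, with rho_{t:k} denoting the product of the per-step ratios from t up to k), dropping the ratios for actions taken after each reward leaves the expectation unchanged:

$$\mathbb{E}_{\mu}\!\left[\sum_{k=t}^{T-1} \gamma^{k-t}\, \rho_{t:T-1}\, R_{k+1}\right] = \mathbb{E}_{\mu}\!\left[\sum_{k=t}^{T-1} \gamma^{k-t}\, \rho_{t:k}\, R_{k+1}\right] = \mathbb{E}_{\pi}\big[G_t\big], \qquad \rho_{t:k} = \prod_{j=t}^{k} \frac{\pi(A_j \mid S_j)}{\mu(A_j \mid S_j)}.$$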
But this doesn't mean the two have the same properties. In fact, this term might have much lower variance in some cases, because it's a shorter partial trajectory that we're considering.
Interestingly, and maybe aesthetically pleasing as well, this new quantity can be written down quite nicely in a recursive form, as shown here.
We're considering Monte Carlo returns now, and this term can be written as follows: at time step t, the importance-sampled return applies just one importance sampling ratio, the one corresponding to that time step, to the reward and to the return at the next time step.
Recall that rho_t here is the target policy at that time step divided by the behavior policy at that time step. So what we're saying is: we get a reward, and we need to correct for the fact that we took this one action under the behavior policy, which might be a different distribution than the target policy. So we need to correct for that, and we also need to correct the rest of the return.
But the subsequent corrections for all of the future time steps t+1, t+2 and so on are recursively contained within this term: at the next time step we apply rho_{t+1}, then rho_{t+2} at the time step after, and so on.
So instead of multiplying this first reward with the importance sampling ratio corresponding to the full trajectory, we only multiply it with this one step. This can reduce variance quite a bit, especially if we later combine it with bootstrapping. This is called per-decision importance weighting, because we only add each importance sampling ratio when we need it, and no earlier.
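A minimal sketch of this recursion in Python (the argument names are illustrative; it assumes a full episode's rewards and per-step ratios have already been collected):

```python
def per_decision_return(rewards, rhos, gamma=0.99):
    """Per-decision importance-weighted Monte Carlo return.

    rewards : [R_{t+1}, ..., R_T] observed along the trajectory
    rhos    : [rho_t, ..., rho_{T-1}] per-step ratios pi(A_k|S_k) / mu(A_k|S_k)
    Implements the recursion G^rho_k = rho_k * (R_{k+1} + gamma * G^rho_{k+1}),
    so each ratio only multiplies the rewards that come after its action.
    """
    g = 0.0
    for r, rho in zip(reversed(rewards), reversed(rhos)):
        g = rho * (r + gamma * g)
    return g
```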
So we have this recursive definition, and now we can prove that the expectation of this return is an unbiased estimate of v_pi. The rho corrects the mismatch between the target policy pi and the behavior policy mu. The assumption here is that we generate the data under the behavior policy mu, and then G_rho is an unbiased estimate of v_pi. So we can use it in a TD update, for instance, or a Monte Carlo update.
We can also do the same for action values. There's a slight subtlety there: for action values, we don't need to correct for the first action, because we're already conditioning on having taken that action. If we want to learn action values, we typically say: we were in some state S_t, we took some action A_t, and we want to update the value conditioned on the fact that we were in that state and already took that first action.
What this means is that we need a slightly different return: one that doesn't apply the first importance sampling ratio, because the first action is already given when we update our action values. So we should only start applying the importance sampling ratios a little bit later.
So this one is unbiased for v_pi, and this one is unbiased for q_pi, for the state and action that we already took.
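In the same recursive notation (my reconstruction), the two returns differ only in whether the first ratio is applied:

$$G_t^{\rho} = \rho_t\big(R_{t+1} + \gamma\, G_{t+1}^{\rho}\big) \;\;\text{(unbiased for } v_\pi(S_t)\text{)}, \qquad R_{t+1} + \gamma\, G_{t+1}^{\rho} \;\;\text{(unbiased for } q_\pi(S_t, A_t)\text{)},$$

i.e. the action-value version simply omits the first ratio rho_t, because A_t is already given.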