[Lecture 9] (1/2) Policy Gradients and Actor Critics
https://www.youtube.com/watch?v=y3oqOjHilio&list=PLqYmG7hTraZDVH599EItlEWsUOsJbAodm&index=9
One should not solve a more general problem as an intermediate step.
If you are going to solve a more general problem, then this is going to almost necessarily be harder. It's better to solve the thing that you actually care about directly. If we apply this in reinforcement learning, there's a question that we can ask ourselves.
If we care about optimal behavior in the end, why do we not learn the policy directly?
Let's compare to other approaches to AI. We've talked about model-based reinforcement learning. This has some benefits, including the fact that it's relatively easy to learn a model in the sense that it is a well-understood process: it's supervised learning. When we learn a model, typically we learn either an observation-to-observation or a state-to-state model, plus a reward model, and at least the mechanisms we could use to learn these are fairly well understood, because they are just supervised learning. Of course the model itself could be very complicated, and this can be a problem.
The other benefit of learning a model is that in some sense you are going to extract all that you can from the data. You could imagine that if you see a transition in which nothing too exciting happens in terms of rewards, and maybe there's not much to learn about your policy directly, it could still be useful to condense some information from that transition into some structures in your head, some knowledge. If you're trying to learn about everything by learning a model, then at least you're extracting this information. But of course there are also downsides, including the fact that you might spend quite a bit of computation and capacity on irrelevant details.
In addition, even if you have a very good model, and you were able to learn maybe even a perfect model of the environment, you would still have to compute a policy. This we could call planning, and it is typically non-trivial and can be quite expensive in terms of computation. Especially if you want a very accurate model of a very complicated world, you could imagine that this model internally will also be quite a complicated thing, and therefore even computing one step into the future, one imagined step, can be quite computationally heavy.
We talked a lot about value-based reinforcement learning, so let's also list some of its pros and cons. First, it's a lot easier to generate a policy if you have a value function, in particular if we've learned an action value function. In the case where we have a discrete set of actions, picking the greedy action with respect to these action values is very easy, and that is a valid policy, the greedy policy.
We could also consider a soft greedy policy or other things, but generally it tends to be relatively easy to generate a policy. We don't need to have a very slow planning process to extract the policy from the values.
In addition, this is fairly close to the true objective, closer than the model. Because at least when we learn values, that's what we wanted to optimize with our policy. So learning values is less likely to capture all sorts of irrelevant details, and maybe it's more closely aligned with the true objective.
It's also very well-understood. Because there's been lots of research in value-based reinforcement learning and very good algorithms do exist although sometimes there's caveats or things that are a little bit less well-understood or even understood not to work very well and then sometimes solutions for this are proposed as well. But it's still fairly well-understood. It's maybe a little bit harder than supervised learning in general, but we do have good algorithms.
However, it's still not the true objective, and you might still spend capacity on irrelevant details. For instance, if you're trying to learn a value function, you might sink quite a bit of function approximation capacity into learning the accurate values for one action versus another action. Whereas maybe the difference in values could be huge, and so for the optimal policy it's clear much earlier that one action is strictly better than the other one. Sinking more data, compute, and function approximation capacity into figuring out exactly how large the difference is might then be irrelevant in terms of which policy you're going to follow.
In addition to this, because the objectives, although aligned, are not fully aligned, a small value error can sometimes lead to a larger policy error. This is particularly the case when you have value function approximation errors, because your capacity is limited or because your data is limited. What if we can't actually accurately model the values for all actions? In that case, you're going to have to make some trade-offs, and these trade-offs, these function approximation errors, might sometimes make a different action seem to have a higher value. In fact that might really not be a good action, and it's just because of generalization or function approximation error that you think it has a good value, and this might lead to a fairly large policy error.
So, in this lecture, we're going to talk about policy-based reinforcement learning, which at least has the right objective. If we're interested in finding the best policy, maybe we can optimize for this directly. But we'll talk about more properties, pros and cons, of this approach.
All of these different approaches generalize in different ways.
Sometimes learning a model is easier, for instance if its dynamics are particularly simple. You could think of a very simple game in which you just move in a grid, and it's very predictable what will happen if you pick a certain action. It might not take you a lot of data to figure out that, in this grid, if my action is to move right, I will literally move right one step, unless there is a wall, in which case I'll just stay where I am. That might be a model that is relatively easy to learn. And it could be very local in some sense, where you only need to know a little bit about the immediate vicinity of where you are in order to accurately model what will happen next. And then maybe you could learn a model that is easy to learn, and you could use that to plan very deep into the future.
But of course, sometimes learning a policy is easier. If we consider the real world, that's hard to model. We find it very hard to predict exactly what will happen, and it's a very messy stream of observations that you get through your eyes, or that a robot could get through its camera, and there will be some noise perhaps. In general, it's just very hard to model the whole real world. But the policy could still be very easy, in some cases at least. It could be that you just always need to go forward. You're a robot, you just need to go forward, and maybe that's the optimal thing to be doing, and maybe you could learn that quite easily without having to worry about all of the intricate details of the world.
We've approximated value functions parametrically. So v_pi and q_pi here denote the true values of a policy pi, and v_w and q_w denote the approximations. In that case, we can generate a policy by acting greedily with respect to the approximate values. But now we're going to do something different.
We're going to have a different set of parameters theta and these will be used to directly parameterize the policy.
So the previous lectures, we still had a policy because you always have to pick your actions in some way, but the policy was inferred from your value function.
Now, we're just going to have some sort of parameterized function that directly outputs the policy.
For instance, this could be a neural network or a linear function, where theta would then be the weights of the neural network or the parameters of your linear function.
Value-based reinforcement learning uses values, but typically when people say value-based, they mean that the policy is implicit. Instead, if you just have a policy, you could call that policy-based reinforcement learning.
If both of those are there, a value function and a policy, people typically use the terminology Actor-Critic, where the actor refers to the policy and the critic refers to the value function.
The reason to use that word is that, the value function is then used to update the policy, to critique the policy in a sense. So this is where those terms come from.
One of the prime advantages is that it's the true objective. If we really try to optimize the policy, why not tackle it head-on?
It also turns out to be relatively easy to extend the algorithms that we'll talk about to high-dimensional or even continuous action spaces. For instance, if you think of a robot, a robot doesn't pick between three different actions or five different actions. Instead, it could send electrical signals to its motors, maybe in an almost continuous fashion.
In addition, another benefit from parameterizing the policy directly is that, we can parameterize this in such a way that the policy is stochastic which means that it's a random policy rather than a fully deterministic one.
We saw stochastic policies before: the greedy policy is an example of a deterministic policy, and epsilon-greedy is an example of a stochastic policy, where there's some randomness in which action you pick.
In addition, sometimes policies are very simple while values and models are complicated. This has multiple benefits. One is that sometimes learning the policy can be a lot easier, and in addition to this, it's sometimes also much easier to represent the policy. You could imagine that a value function could be really complicated, because maybe it matters a lot how close or far you are from a certain goal. And then even if you have a lot of data, you learn really well, and you have really good algorithms, it could be that if you pick a certain function class, it just doesn't fit: you can't actually represent the whole value function. But it could be that in exactly the same problem the policy is relatively simple, and you can pick a function class that is relatively simple and still fit the optimal policy exactly.
There are also disadvantages. For one, if we do policy search, there are different ways to do that, but typically we do some sort of hill climbing. We use gradient algorithms, or even when we use something else like evolutionary algorithms, those tend to be local in the sense that you're considering certain policies and searching around those policies for better ones. It turns out you could get stuck in a local optimum, which means that you find a policy that is relatively good, but there could have been much better policies that are too different for you to incrementally move towards from your current policy.
In addition to this, the obtained knowledge can be very specific, which means it doesn't always generalize well. This is related to the earlier point about model-based reinforcement learning capturing everything that you possibly can about the environment.
Policy-based reinforcement learning doesn't do that. In fact, it tries to reduce the data to the most narrow thing that you could need for optimizing your behavior, which is the behavior itself. But that means that if, up until now, in all of the situations you were in, it was always optimal to move forward, and then suddenly it isn't, because you maybe move into a new room and now you have to do something else, it could be that, by that time, the policy has learned to completely ignore its observations and just move forward, and then it can be hard to unlearn this. It might not generalize well in that sense.
And this is related to the property that, at a high level, if you learn a policy directly, you're not extracting all the useful information from the data, and that means it might be harder to adapt to new situations.
Why would we need stochastic policies? You might know that in a Markov Decision Process there is always an optimal deterministic policy. But it turns out most problems are not fully observable. For instance, consider a robot with a camera looking in front of it; it cannot see what's behind it. That would be a not fully observable, or partially observable, Markov Decision Process.
This is the common case, and in fact, even if you have a Markov Decision Process (you could say the world is a Markov Decision Process, it's just a really complicated one), if you're using function approximation, then the agent may still not be able to distinguish between states, which means you're still in a partially observable setting. Even if the world itself is Markovian, that doesn't mean the learning algorithm can fully exploit this, and if that is the case, then the optimal policy itself might be stochastic.
In addition to this, the search space can be smoother for stochastic policies. You could imagine that for a deterministic policy, in every state you have a discrete choice of either doing this or doing that, which might be a hard space to optimize in: there's a big combinatorial search problem if you're only searching within deterministic policies.
If however you're thinking about smoothly changing the probability of selecting one action above the other, that could be a much smoother surface and that turns out to be important for optimization.
In particular we can then use gradients which are very successful at optimizing deep neural networks and similar structures. And these tend to lead to successful and impactful learning algorithms.
Finally, having a stochastic policy can be beneficial because you automatically get some exploration. Maybe this is not exactly the right type of exploration. Just picking actions a little bit randomly might not actually give you enough coverage, it might not seek out important information about the world. But still in many cases, it's better than nothing. And it might still lead you to reasonably explore.
Especially if your policy is stochastic in some states and less stochastic in others, this might be reflective of the fact that you don't really know yet which actions are correct in this one state, but in the other state you've seen enough data that you know now, this state I need to do this and then that might sometimes lead to an appropriate amount of exploration.
Now I'm going to show you an example of an aliased grid world, to show that a stochastic policy can be optimal in a partially observable setting.
You're in a grid world where you start in one of the top corners, and the gray states look the same. It could be the case that you have features that represent the states, and these features only look at where the walls are.
Now in the top left, the feature for the wall above you and a wall to your left would both be on and the features for a wall to your right or a wall below you would be off. And if we would have this feature representation, note that these two gray states would indeed have exactly the same features and therefore, would be fully aliased. They would look alike.
Now in the top row, you can just move back and forth. You can move left and right; you can move up, but you would just bump into the wall; and you can move down, where if there's a wall below you, you would bump into it and stay in the same place, but if you're above one of the three other states, you would go into a terminal state and either you die or you get the money. For simplicity, we could assume that in either case the episode ends. So you either die or you get the money.
Now in this setting, we could imagine doing learning but we don't even have to think about a specific learning algorithm here. We're just going to consider what the policies should be. Particularly, we're going to compare deterministic and stochastic policies.
Here's an example of a deterministic policy. If you're above the money, you go down. If you're above the skull and bones, that is, in one of the corners, you can see by looking which corner you are in and therefore which direction you should go: always move away from the side wall. Obviously if you're in a corner, you never go down, because that's bad, and you don't need to go up, because that doesn't help you. You also don't want to bump into the wall, so instead you move in the other direction.
If you start at the top right corner, you'd move left, left, down, and you win the game. However, you begin in one corner or the other randomly every episode. If this were your deterministic policy: because it's a deterministic policy and because the two gray states look exactly alike as far as the policy is concerned, if this is all that it knows, it cannot distinguish between these two states, so the policy must be the same in both of them.
If you do function approximation and the world is complicated, you cannot assume you have a fully observable representation. And then we can see that a deterministic policy will either move left in both gray states or move right in both gray states, and neither choice is optimal.
What else could we do? What if we had a stochastic policy that randomly moves left or right with equal probability in each of these gray states?
With this stochastic policy, it doesn't matter whether you start in one corner or the other: in expectation it will take you the same number of steps to get to the goal, every episode reliably ends with you reaching the gold, you never enter any of the bad states, and you never get stuck indefinitely.
And you could assign rewards to this.
This is an example to show that sometimes having a stochastic policy can be beneficial, because even the optimal policy could be stochastic.
Importantly, this stochastic policy can be learned if we learn the policy parameters directly instead of just learning value functions.
In addition to this, note that this is just an example in which we happen to have equal probability across actions in some of these states, but the example can be extended to having non-equal probability.
You could also have situations in which it's optimal to pick one action with, say, 75 percent probability and a different one with, say, 12 percent. And the stochastic policy could be arbitrarily stochastic: it could be almost greedy, or very uniform, or anything in between.
The goal at high level is just to find a policy and if we parameterize the policy with just some parameters thetas, then this translates into the goal of finding these parameters theta. But how do we measure the quality of the policy? In episodic environments, we can use the average total return per episode. In continuing environments that do not have terminations, we could use the average reward per step.
We introduce some objective J in terms of G, where G corresponds to the return. The definition of the objective could be this expectation, where the expectation is taken over a potentially random distribution over start states, so we sample some S0 from some distribution d0, and then from that state onwards we take actions according to our policy. So the actions are random because the policy is random, and in addition to that, there will be some randomness due to the Markov Decision Process, so the transitions themselves might be random.
So note that the expectation is not conditioned on anything else. It's conditioned on the distribution of start states, implicitly on the MDP, and then on the policy.
And then our goal is to maximize the discounted return from the starting state.
So we only look at this initial state, and then we are going to roll forward throughout the whole episode until it ends.
We want to optimize our parameters theta in such a way that the actual value v of the policy that is parameterized for theta will be maximized under the distribution that generates the starting state.
For simplicity, we could consider a special case where the starting state distribution is a Dirac distribution, a deterministic process which always picks the same state. Then our objective is just the value of that starting state.
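In symbols, a sketch of this episodic objective in the notation above (d0 is the start-state distribution, v_{pi_theta} the true value of the parameterized policy; indexing rewards as R_{t+1} is the usual convention, assumed here):

```latex
J(\theta)
  = \mathbb{E}_{S_0 \sim d_0}\!\big[ v_{\pi_\theta}(S_0) \big]
  = \mathbb{E}\!\big[ G_0 \mid S_0 \sim d_0,\ \pi_\theta \big],
\qquad
G_0 = \sum_{t=0}^{T} \gamma^{t} R_{t+1}.
```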
The expectation here is no longer over a start state distribution, because we're considering a continuing setting now, where we're just going to take actions indefinitely and it's never going to stop. You're never going to start a new episode. And we're interested in the long-term average reward.
This can be considered an expectation where the state is now drawn from a different distribution which is the distribution that you're in a state under this policy and implicitly in the Markov Decision Process. This is the long-term probability of being in a state.
So think of it in this way. Even in a continuing setting, you might still start in some states according to some distribution or deterministic in some state. But if it's a continuing setting, and you're going to be in there indefinitely long under some mild assumptions, there's going to be some frequency in which you visit states and this frequency in the long term won't depend on where you started. It will depend on your policy and the dynamics of the Markov Decision Process.
So people will assume things such as that the Markov Decision Process is mixing or ergodic which essentially means that, this distribution exists and that you always can recover states that you visited before, that the MDP is in some sense connected, and if that is the case, then this distribution is well defined, and it will just be the frequency of how often you are in each state.
So we can consider this average expected reward to be an expectation where the state you're in is drawn according to what is called the steady-state distribution, and then, conditioned on that state, we draw an action according to our policy and observe the immediate reward. That's the thing we're interested in now.
We can write this differently with an explicit summation, where we sum over states and look at the probability of being in each state under this stationary distribution. If the state space is continuous, we could just write an integral there and everything would be similar.
And we sum over actions. So in each state, we look at how likely we are to be in that state, then how likely we are to pick each action given that we're in that state, and then the probability of getting each reward after picking that action. And then you multiply that with the reward. So this is writing out the expectation explicitly as a summation.
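Written out, a sketch of this average-reward objective (writing d_{pi_theta} for the steady-state distribution, which is assumed to exist under the ergodicity conditions mentioned above):

```latex
\rho(\theta)
  = \mathbb{E}_{S \sim d_{\pi_\theta},\, A \sim \pi_\theta(\cdot \mid S)}\!\big[ R \big]
  = \sum_{s} d_{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s) \sum_{r} p(r \mid s, a)\, r.
```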
Now we're going to talk about how to optimize either of these objectives. And to do so, we're going to use gradient-based algorithms which are called policy-gradients.
Policy-based reinforcement learning is an optimization problem. We want to optimize something. We want to find the theta that maximizes this J(theta) where J(theta) is one of the two objectives.
We will focus on stochastic gradient descent because this is a powerful method which is often quite efficient and it's easy to use with deep neural networks which are also a very good tool in this context.
Policy-gradient algorithms update your parameters theta in the direction of the gradient of J. So we have this gradient of J, and we are going to update theta: delta theta, the change we make to theta, is a small step size times this gradient.
This raises the question. So how does the expectation of R actually depend on theta? It's not immediately obvious and we'll dig into this.
We're going to start with the contextual bandit case. Consider a one-step episode, such that the average reward is well-defined. We are talking about the average reward, but we're only interested in the immediate reward, because we can assume now that the distribution of states does not depend on your policy. This is why we go for the contextual bandit: it makes some things a little bit easier, because normally our distribution of states will depend on our policy. But if we're in a contextual bandit, and if in addition the agent state is a pure function of the observation (so it's not a parameterized agent update function; the state could just be the observation), then the distribution of those states does not depend on your policy. That's a property of the contextual bandit. So it's a more limited case, and we're starting there because it's simpler to reason about.
So the expectation here is over actions and over states but the distribution d does not depend on pi. Later we will consider the case where it does depend on pi, so this is just a temporary assumption.
We'll see some context S, this will be out of our control but then, we want to pick an action and then we'll see a reward R which depends on the state and the action. And then we want to optimize the policy such that the rewards become higher.
We can't just sample the reward and then take its gradient, because the sampled reward is just a number; it doesn't depend on theta.
We're going to use this following identity where the gradient of the expected reward turns out to be equal to the expectation of the reward times the gradient of the logarithm of the policy.
The importance of this equality is that the thing on the RHS can be sampled, whereas the thing on the LHS you can't just sample. If we rewrite this as an expectation of a gradient, then we can sample this expectation and get an unbiased estimate of the gradient that we're actually interested in. And this will give us concrete algorithms.
This is the expectation of the reward R_sa times the gradient of the logarithm of the policy.
Now we have something we can sample, and our stochastic policy-gradient update can be this: we update our parameters theta by adding a small step size times the reward times the gradient of the logarithm of the probability of the action that we selected, A_t.
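In symbols, a sketch of the identity and the resulting sampled update (in the contextual bandit, the state distribution d does not depend on theta; alpha denotes the step size):

```latex
\nabla_\theta\, \mathbb{E}\big[ R \big]
  = \sum_{s} d(s) \sum_{a} \pi_\theta(a \mid s)\,
      \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, \mathbb{E}[R \mid s, a]
  = \mathbb{E}\big[ R\, \nabla_\theta \log \pi_\theta(A \mid S) \big],
\qquad
\Delta\theta_t = \alpha\, R_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t).
```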
In expectation, this is an unbiased algorithm and therefore this is pure unbiased stochastic gradient ascent.
The intuition is, if the reward is high, you will change the parameters such that the policy goes up, more specifically, the logarithm of the policy goes up, but the logarithm is itself an increasing function, it's a monotonically increasing function, so increasing the logarithm of the probability of selecting an action is equal to increasing the probability of selecting that action itself.
If all of your rewards are positive, this means that whatever action you select, you will make that action more likely to be selected whenever you perform this update. It turned out that most of the time, increasing the probability of one action means decreasing the probability of the other actions. So if the rewards are all positive, if you perform this update, you would always push up the probability of selecting the action that you selected. However if the rewards are not equal for all actions, you would push up the probabilities of actions with high rewards more so, than the probabilities of actions with low rewards. And in the limit, then you would still find the actions with the highest rewards.
Now let's introduce a little trick to reduce the variance.
We can pick any b which doesn't depend on your actions, and then if you multiply b with the gradient of the logarithm of your policy, we can go through the following steps.
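A sketch of those steps (assuming the baseline b may depend on the state but not on the action):

```latex
\mathbb{E}\big[ b(S)\, \nabla_\theta \log \pi_\theta(A \mid S) \big]
  = \sum_{s} d(s)\, b(s) \sum_{a} \pi_\theta(a \mid s)\,
      \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = \sum_{s} d(s)\, b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s)
  = 0,
```

since the action probabilities sum to one in every state, so their gradient with respect to theta is zero.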
This implies that we can subtract a baseline to reduce the variance. So what we're going to do is allow something to be part of the update which won't change the expected value of the update, but which is allowed to vary per state. That's important. It's a kind of control variate. By picking this smartly, you can pick something that reduces the variance of the updates without changing the expectation. So we're still doing a valid stochastic gradient ascent algorithm. The only difference is that the variance might be lower, which is a benefit.
Intuitively this also makes sense, because all your rewards might be positive. Or let's do a different example: say the reward is +1 if you win a game and 0 otherwise. The algorithm would only update when you win, because if the reward is zero, your parameters will not be updated. That means that if you lose a lot, you will not learn anything from those losing games. It's only when you win that you change your policy in a direction that will help you improve.
If you only win very occasionally, like one percent of the times, you're only ever going to update your policy parameters one percent of the times. So this is an example of a high variance update. Because most of the time, you're not doing anything.
If you introduce a baseline, you could do something else: pick the baseline to be equal to a half. If it's equal to a half, that means that whenever you win, you update, and whenever you lose, you also update, but in the opposite direction. And this means you can now also learn from the games that you lose. We haven't changed the expected value; in expectation, we're doing the exact same thing. The only thing we've changed is the variance of the updates. But it makes a real practical difference.
We can parameterize the policy by parameterizing preferences h. A parameterized policy could then exponentiate each of these preferences, take the one corresponding to the action, and divide it by a normalization term.
Note that the division by the summation of the exponentiated preferences implies that the total sum of this thing over all actions will be equal to one. This is a well-defined policy. The thing that we're dividing by is simply a normalization term to make sure that it sums to one.
Then if we take the gradient, it turns out that the gradient of the logarithm of this quantity will look like this.
The gradient of the logarithm of the policy for A_t in S_t will be equal to the gradient of the preference for (S_t, A_t), minus the expected gradient of the preferences under the policy, where the expectation runs over all actions, including A_t but also all the other ones.
And this turns out to be the grad log pi term for the softmax.
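In symbols, a sketch of this standard softmax result, with preferences h_theta(s, a):

```latex
\pi_\theta(a \mid s) = \frac{e^{h_\theta(s,a)}}{\sum_{b} e^{h_\theta(s,b)}},
\qquad
\nabla_\theta \log \pi_\theta(A_t \mid S_t)
  = \nabla_\theta h_\theta(S_t, A_t)
    - \sum_{a} \pi_\theta(a \mid S_t)\, \nabla_\theta h_\theta(S_t, a).
```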
Now we're going to go into the sequential case. And we're going to look at the policy gradient theorem which is a generic theorem that proves what policy gradients can look like and how you could use them as an update.
Now we're going to move to the multi-step Markov decision setting, and the difference is that the state distribution will now also depend on our policy. This is different from the contextual bandit case. And we're no longer going to consider only the immediate reward.
In the contextual case, only the immediate reward depends on your policy but the next state doesn't. So you can simplify things a little bit.
But now we're going to go back to the full case, in which not just the immediate reward but also your next state depends on your action, and therefore on your policy. And this will make things slightly more complicated.
There's these two different objectives, the average return per episode and the average reward per step.
The average return per episode is applicable to episodic problems, so whenever you have an episodic problem, that's the one you should be using.
But if you have non-episodic problem in which there are no terminal states, then the more appropriate objective is, the average reward per step.
You shouldn't use the average reward per step for an episodic task, and let me give you a very simple example of why that is. Consider a maze in which you get a -1 reward on every step. Your goal is to exit the maze as quickly as possible. The -1 reward is a penalty for every step, which means that the optimal policy according to the average return per episode would be to exit the maze as quickly as possible: a policy is better if it exits the maze faster, because the episodic return would then be higher. However, if we consider the average reward per step, nothing matters; none of your policies will differ, because you always just get -1 per step. So that is an example showing that you shouldn't use the average reward as an objective for an episodic task.
The inverse is also true. If you have a continuing task with no episode terminations, then you shouldn't use the average return per episode, because you will only ever be in one episode; it's not a well-defined objective for that case. Instead, you should use the average reward per step.
We're going to start in the episodic case and here is the policy gradient theorem for the episodic case.
Now we're in the full MDP setting. We have some differentiable policy pi with parameters theta, and we also have some initial starting state distribution d0: every episode starts somewhere, and where the episode starts does not depend on your policy. The trajectory during the episode depends on your policy, since the actions you select along the way depend on your policy, but when you terminate, you transition back to some starting state distribution, or maybe a deterministic starting state, and that state does not depend on your policy, because you haven't yet taken any actions in that episode. So we have this d0, which needs to be given, and our objective will be as written here.
The expected return: the expectation depends on the MDP dynamics and on your policy, and it's conditioned on the starting state being sampled from the starting state distribution.
Now if we have this objective, then it turns out that its gradient can be written as follows.
So we have an expectation under your policy and it's also conditioned on the starting state being sampled from the starting state distribution.
And then, we have a summation over the whole episode. So T is the last step in the episode, so we have an episode that lasted T+1 steps.
And then we sum over these terms in the episode, where there's a discount to the power t (gamma^t), times the value of the action that you took at that time step t, times the gradient of the logarithm of your policy. The value q here is the expected discounted return from that point in time.
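In symbols, a sketch consistent with this description:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta,\ S_0 \sim d_0}\!\left[
      \sum_{t=0}^{T} \gamma^{t}\,
        q_{\pi_\theta}(S_t, A_t)\,
        \nabla_\theta \log \pi_\theta(A_t \mid S_t)
    \right].
```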
This is the policy gradient theorem. It says that, if you would walk through an episode, and you would accumulate all of these terms within this sum, and then at the end of the episode, you apply these to your parameters, this would give you an unbiased estimate to the actual policy gradient.
Now you might think: okay, but it's a long episode, maybe I don't want to accumulate everything and wait all the way until the end of the episode before applying it.
So what people often do is, instead of summing it over the whole episode and then applying this thing at the end, you could also just look at every single step within the episode at this term and just use that to update your parameters. Then you get a slight bias to your gradients.
Because it might be that you ended up in a certain state because your policy was a certain way, but then you update your policy in such a way that you would actually never end up in that state in the first place. If you continue updating from that point, you would have a slightly biased gradient with respect to your current policy, but it's kind of okay. People do this all the time, and it's quite common to update during the episode.
Similarly, this term, discount to the power t, this means that the further you are in the episode, the less you're going to update your policy and this makes sense because, if you're in an episodic task, but you also still have a discount, in some sense the farther you are from the starting point according to this objective, the less it matters. Because we are considering a discounted objective.
So one could argue that maybe in the episodic case the most natural thing to do is not to discount at all, because your episode is going to end in finitely many steps anyway, and then this term would also disappear.
In practice, people do discount, because the algorithms tend to work a little bit better with a discount factor; it's easier to estimate the values. But people often drop this discount-to-the-power-t term, and it turns out the algorithms typically still work well. I do want you to be aware that this gives you a biased gradient, and in some edge cases it can point in the wrong direction.
We're going to point out that, actually the policy gradient does not need to know the Markov Decision Process dynamics. You don't need to know the transition dynamics. And that's a little bit surprising.
Shouldn't we need to know how the policy influences the states? We do, but it's captured implicitly in this value estimate. So this value estimate does capture how your policy influences the states.
We're going to introduce tau, a random variable which captures your whole trajectory. So tau is defined as the initial state, then the action in that state, the reward, then the next state, and so on.
And then we can write the return as a function of this random trajectory. It's easy to prove that the gradient of our objective, the gradient of the expected return, will be equal to the expectation of the return times the gradient of the logarithm of the probability of the full trajectory.
This is just using the score function trick.
Here we could write it as an integral over all possible trajectories or sum over all possible trajectories. And then in that sum, we could have the probability of each trajectory times the return if you saw that trajectory. And then we can use the score function trick (the log likelihood trick).
The gradient of the logarithm of the probability of the trajectory will be equal to the gradient of the logarithm of this probability written out. So the probability of a specific trajectory happening is equal to the probability of the initial state in that trajectory, times the probability of the action that you took in that state, times the probability of then transitioning to the next state that you actually saw in this trajectory tau given that you took that action in that state, and so on.
So note that these are all probabilities that are all values between 0 and 1. This means that this total multiplication of things is probably a very small number but that makes intuitive sense because the probability of any specific trajectory is also probably going to be very very low. There's many different trajectories that could have happened. You happened to see one specific one, this is the probability of that specific trajectory happening.
We can push this logarithm inside this multiplication and turn it into a summation, so we see that it's the gradient of a big sum. Now some of these terms do not depend on the parameters of our policy, so their gradient will just be zero.
(For example, the probability of transitioning to a state S1 given that we were in state S0 and took action A0 does not depend on our policy parameters, because we're already conditioning on the action.) So the initial state distribution and each of the transition terms do not depend directly on our policy parameters, and the gradient with respect to those will be zero.
The gradient with respect to the next policy step, so the probability of selecting action A1 in S1 does depend on our policy parameters. So that one sticks.
We can get rid of all the other terms whose gradient is zero, and just write it as the gradient of a summation of the logarithms of the policy. So we have replaced the gradient of the logarithm of the probability of the trajectory with a summation of the gradients of the logarithms of the probabilities of taking each action along the way.
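Sketched in symbols (writing p(tau) for the probability of the trajectory tau, with T the last step of the episode):

```latex
\nabla_\theta\, \mathbb{E}\big[ G(\tau) \big]
  = \mathbb{E}\big[ G(\tau)\, \nabla_\theta \log p(\tau) \big],
\quad
p(\tau) = d_0(S_0) \prod_{t=0}^{T} \pi_\theta(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t),
\quad
\nabla_\theta \log p(\tau)
  = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t \mid S_t).
```

The start-state and transition terms drop out because they do not depend on theta.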
So we see that the dynamics of the MDP do not have to be estimated here directly. Because they drop out. We don't need them for the policy gradient.
Now we're going to continue a little bit farther.
First we pull the summation to the front, and then we plug in the definition of this return. What is this return? The return is a discounted sum of rewards: by definition, the return of the trajectory is the discounted sum of all of the rewards within that trajectory.
Now we notice that we have a nested sum: a summation over t from 0 to T, and inside it a summation over k from 0 to T. And we have this grad-log-pi term.
If something doesn't depend on your actions, then the expectation of that thing times grad log pi will be zero. That means that all of the rewards that happened before time step t are uncorrelated with the probability of taking the action at time step t: the action cannot influence those rewards because they happened in the past, so the expectation of those past rewards times grad log pi will be zero. So we can start the inner sum at k = t rather than at zero, and the expectation will be the same.
And now we can pull out the discount factor, because there's something here that looks a lot like the return from time step t, but there's too much discounting happening. So instead we can start this summation at time step t and only start discounting at that time step: the summation now starts at k = t, the first reward is not discounted, the second reward is discounted once, and so on. In order to do that, we need to pull out a term equal to gamma to the power t.
And then we can finally rewrite this summation as just the return G_t. Because we pulled out the discount to the power t, the term inside is just the reward at time step t+1, plus the discounted reward at t+2, plus the twice-discounted reward at t+3, and so on, which is by definition the return.
So by doing these steps, we can rewrite the thing that we had at the top, to something that has this discount to the power t and the return at time step t.
And the reason to do this, rather than just sticking with the original expression that used the return from time step zero, is that the original could have higher variance: with these steps we've gotten rid of some terms that would add to the variance but not necessarily help us get a better estimate.
So it's just to rewrite the thing at the top, and then we get the equation back that we had before except that we now have Gt rather than q.
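So, sketched in symbols, the rewritten gradient is:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[
      \sum_{t=0}^{T} \gamma^{t}\, G_t\,
        \nabla_\theta \log \pi_\theta(A_t \mid S_t)
    \right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} R_{k+1}.
```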
But because we're within the expectation, we can replace the random return Gt with the expected return q_pi.
Now we can sample this if we have a whole episode, and turn that into an algorithm.
People typically pull out the summation and split it up into separate gradients for every time step.
So you get some term on every time step and if you would add all of those together, then you'd get an unbiased estimate for the policy gradient.
If instead of adding them all together for the whole episode and then applying them, if instead you apply them on every step, you don't get an unbiased estimate for the policy gradient. But it's typically still okay and it allows you to start learning during the episode already.
And people typically ignore the gamma-to-the-power-t term. Similarly, this will bias your policy gradient, but in practice discounts are quite close to one, and it turns out that this is also kind of okay.
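As a minimal sketch of the resulting algorithm (REINFORCE with a linear softmax policy over discrete actions; the environment interface, the feature-vector representation, and all names here are illustrative assumptions, not something given in the lecture):

```python
import numpy as np

def softmax(h):
    """Softmax over a vector of action preferences."""
    z = h - h.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, num_features, num_actions,
              episodes=1000, alpha=0.01, gamma=0.99, use_gamma_t=True):
    """Monte-Carlo policy gradient (REINFORCE) with a linear softmax policy.

    Assumes a hypothetical `env` with reset() -> feature vector x and
    step(a) -> (next feature vector, reward, done).
    """
    theta = np.zeros((num_actions, num_features))    # policy parameters

    for _ in range(episodes):
        # Roll out one full episode, storing features, actions, rewards.
        x, done = env.reset(), False
        xs, acts, rews = [], [], []
        while not done:
            probs = softmax(theta @ x)                # pi_theta(. | s)
            a = np.random.choice(num_actions, p=probs)
            x_next, r, done = env.step(a)
            xs.append(x); acts.append(a); rews.append(r)
            x = x_next

        # Compute returns G_t backwards through the episode.
        returns, G = np.zeros(len(rews)), 0.0
        for t in reversed(range(len(rews))):
            G = rews[t] + gamma * G
            returns[t] = G

        # Apply the sampled policy-gradient update for every step.
        for t, (x_t, a_t) in enumerate(zip(xs, acts)):
            probs = softmax(theta @ x_t)
            # grad log pi for softmax: features of chosen action minus expectation
            grad_log_pi = -np.outer(probs, x_t)
            grad_log_pi[a_t] += x_t
            scale = gamma ** t if use_gamma_t else 1.0   # gamma^t is often dropped
            theta += alpha * scale * returns[t] * grad_log_pi

    return theta
```

Setting use_gamma_t=False corresponds to the common practice of dropping the gamma^t term at the cost of a slightly biased gradient, and a baseline could be subtracted from returns[t] in the same update to reduce variance.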
Now there's a different policy gradient theorem for the average reward case. Again we assume a differentiable policy pi, and we consider the gradient of the long-term average reward under pi, where we now take into account that our policy also changes our state distribution.
This can then be written as the expectation of the q value times the gradient of the logarithm of pi.
There's a difference in the definition of this q value: it's the average-reward value. So what is the average-reward value? It's undiscounted, and it adds rewards together while subtracting the average reward. The rho here is the average reward under your policy, and q here is defined as the summation of rewards over time, but on every step also subtracting the average reward.
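Putting these pieces in symbols, a sketch (with d_{pi_theta} the steady-state distribution):

```latex
\nabla_\theta \rho(\theta)
  = \mathbb{E}_{S \sim d_{\pi_\theta},\ A \sim \pi_\theta}\!\big[
      q_{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S)
    \big],
\qquad
q_{\pi_\theta}(s, a)
  = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \big( R_{t+1} - \rho(\theta) \big)
      \,\middle|\, S_0 = s,\ A_0 = a,\ \pi_\theta \right].
```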
So you might think, doesn't that mean that, this is just zero? Because aren't we just subtracting the expected reward from the expectation of the reward, so isn't this just zero? No, it's not zero, because in the q value, we're conditioning on a specific state and action, and rho is across all states and actions under the steady state distribution.
So the difference here is that, your q value captures if I'm in the specific state and action, is my average reward conditioned on being in that state and action, is it going to be a little bit lower or a little bit higher than the overall average?
Alternatively, we can state the same theorem that we have here in a different way. And this might be slightly more intuitive in some cases.
So this is exactly the same objective. We're still considering a differentiable policy and the policy gradient will still be defined as the expected reward given that policy and all of the consequences that this policy has including on the state visitation distribution.
Now if you have the same objective, then you can rewrite the thing that we had before as the instantaneous reward times the summation into the past of the gradient of the logarithm of your policy where the expectation is again over states and actions.
The difference is that, previously we had this value which is a summation into the future. So we have a summation of rewards into the future times the gradient of the logarithm of your policy. Here, we have a single reward and a summation into the past.
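In symbols, a rough sketch of this alternative form as just described, with the expectation again taken over states and actions under the policy:

```latex
\nabla_\theta \rho(\theta)
  = \mathbb{E}\!\left[ R_{t+1} \sum_{k=0}^{t} \nabla_\theta \log \pi_\theta(A_k \mid S_k) \right],
```

where the sum into the past plays the role of a trace.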
If you remember eligibility traces from when we were talking about value learning, this might look a little bit familiar; this grad-log-pi term is called the "characteristic eligibility".
The fact that, we're summing into the past makes it a trace.
This is an equivalent way to write down the average reward case, and let me give you one intuition for why this is the case. In some sense, this trace captures the steady-state distribution: this trace of grad-log-pi terms going into the past captures how my state visitation depends on my policy parameters. So you can view this as similar to the probability of the trajectory we saw before.
We have the gradient of the probability of the trajectory up to that point in time, and, by a similar reasoning as before, you can write this as a summation into the past.
Subtle, very subtle. Even after going back over eligibility traces, I still haven't fully internalized this, haha. I guess I was only used to rolling out in the future direction. Then again, a trace into the past can indeed tell you something about the distribution, right?
'Research > RL_DeepMind' 카테고리의 다른 글
[Lecture 11] (1/2) Off-Policy and Multi-Step Learning (0) 2024.08.08 [Lecture 9] (2/2) Policy Gradients and Actor Critics (0) 2024.08.05 [Lecture 8] (2/2) Planning and Models (0) 2024.08.04 [Lecture 8] (1/2) Planning and Models (0) 2024.08.03 [Lecture 7] (2/2) Function Approximation (0) 2024.08.03