An Introduction to Approximate Solution Methods for Reinforcement Learning

This post continues our series on Reinforcement Learning (RL), following Sutton and Barto's famous book “Reinforcement Learning: An Introduction” [1].
In a previous post we finished dissecting Part I of this book, which introduces the basic solution techniques that form the basis of many RL methods: Dynamic Programming (DP), Monte Carlo (MC) methods, and Temporal Difference (TD) learning. What separates Part I from Part II of Sutton's book, and justifies the split, is the size of the problems considered: while Part I covered tabular methods, we now dare to go deeper into this interesting topic and turn to approximate solution methods.
To be clear, in Part I we assumed that the state space of the investigated problems is small enough to represent it, and the solutions obtained, with a simple table (think of a table that assigns a certain “goodness” – value – to each state). Now, in Part II, we let go of this assumption, and are thus able to deal with arbitrarily large problems.
And this modified setup is very much needed, as we can easily convince ourselves: in the previous post we managed to learn to play Tic-Tac-Toe, but we would already fail at Connect Four – since the number of states there is on the order of 10²⁰. Or, consider an RL problem whose states are camera images: the number of possible camera images is greater than the number of atoms in the known universe [1].
These numbers should convince everyone that approximate solution methods are indeed necessary. In addition to enabling us to deal with such large problems, they also provide generalization: in tabular methods, two close but different states are treated completely separately – while with function approximation, we can hope that our approximator recognizes such similar states and generalizes between them.
With that, let's get started. In the next few paragraphs, we will:
- give an introduction to function approximation
- present ways to solve the resulting prediction problem
- discuss the different choices of approximation functions.
Introduction to Function Approximation
In contrast to the tabular solution methods, where we used a table to represent e.g. value functions, we now use a parameterized function

v̂(s, w) ≈ v_π(s)

with the weight vector

w ∈ ℝᵈ.
v̂ could be anything, e.g. a linear function of the input features, or a deep neural network. Later in this post we will discuss the different possibilities in detail.
Usually, the number of weights is much smaller than the number of states – which produces generalization: when we update our function by adjusting certain weights, we don't just update one entry in a table – the change will (likely) affect the values of many other states, too.
Let's revisit the update rules of a few methods that we saw in the previous posts.
MC methods assign the observed return G_t as the target for the state value:

S_t ↦ G_t
TD(0) bootstraps the target using the value estimate of the following state:

S_t ↦ R_{t+1} + γ v̂(S_{t+1}, w_t)
While DP uses an expected target:

s ↦ E_π[R_{t+1} + γ v̂(S_{t+1}, w_t) | S_t = s]
From now on, we will interpret updates of the form s ↦ u as input / output pairs of the function we would like to estimate, and correspondingly use machine learning techniques, in particular: supervised learning. Since the outputs (u) to be approximated are numbers, this task is known as function approximation, or regression.
To solve this problem, we can in principle use any method capable of performing such a task. We will come to this in a bit, but we should note that there are certain requirements for such methods: first, they must be able to handle incrementally acquired data – as in RL we often build up knowledge over time, in contrast to, e.g., classical supervised learning tasks. In addition, the chosen method must be able to handle nonstationary targets – which we will see in the next subsection.
The Prediction Objective
Throughout Part I of Sutton's book, we never needed a prediction objective or the like – after all, we could always get by with computing the value of each state exactly. For the reasons mentioned above, this is no longer possible – which requires us to define an objective, a cost function, that we want to optimize.
We use the following:

VE(w) = Σₛ µ(s) [v_π(s) − v̂(s, w)]²
Let's try to understand this. It is a weighted average of the squared difference between predicted and true values – a reasonable choice, and a common one in supervised learning. Note that it requires us to define a distribution µ, which specifies how much we care about the error in certain states.
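To make the objective concrete, here is a minimal numeric sketch; the states, values, and distribution are made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical toy problem: 3 states with known true values v_pi,
# an approximation v_hat, and a state distribution mu describing
# how much we care about each state.
mu = np.array([0.5, 0.3, 0.2])        # state weighting, sums to 1
v_pi = np.array([1.0, 2.0, 3.0])      # true state values
v_hat = np.array([1.1, 1.8, 3.5])     # approximate state values

# VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2
ve = np.sum(mu * (v_pi - v_hat) ** 2)
print(ve)  # 0.067
```

Note how the error in the rarely visited third state (µ = 0.2) contributes less to the objective than it would under a uniform weighting.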
In general, µ is simply chosen as the fraction of time spent in each state – the on-policy distribution, which we will focus on in this post.
However, note that it is not really clear whether this is the right objective: in RL, we ultimately care about finding good policies. Our surrogate objective may be optimized well, but still fail to solve the problem at hand – e.g. if the policy spends too much time in undesirable states. Still, as discussed, we need some such objective – and for lack of better alternatives, we simply optimize this one.
Next, let's introduce a way to minimize this objective.
Minimizing the Prediction Objective
Our tool of choice for this task is Stochastic Gradient Descent (SGD). Unlike Sutton, I don't want to go into too much detail here, and will only focus on the RL part – so I'd like to refer the interested reader to [1] or any other SGD course / tutorial.
But, in principle, SGD uses batches (or minibatches) of samples to compute the gradient of the objective, and then updates the weights a small step in the direction that reduces this objective.
For our objective, this yields the update:

w_{t+1} = w_t + α [v_π(S_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t)
Now the interesting part: assume that we do not have access to the true value v_π, but only some (possibly noisy) approximation of it, say U_t:

w_{t+1} = w_t + α [U_t − v̂(S_t, w_t)] ∇v̂(S_t, w_t)
One can show that if U_t is an unbiased estimate of v_π, the solution found by SGD converges to a local optimum – neat. Now we can use e.g. the MC return G_t as U_t, and we obtain our very first RL method with function approximation, gradient MC prediction:

w_{t+1} = w_t + α [G_t − v̂(S_t, w_t)] ∇v̂(S_t, w_t)
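As a sketch, this is how gradient MC prediction could look with a linear approximator; the episode data, features, and hyperparameters below are all illustrative assumptions, not from the book:

```python
import numpy as np

def v_hat(w, x):
    """Linear value estimate: inner product of weights and features."""
    return w @ x

def gradient_mc_update(w, episode, alpha=0.1, gamma=1.0):
    """One gradient MC sweep over a finished episode.

    episode: list of (feature_vector, reward) pairs, in time order.
    For a linear v_hat, the gradient w.r.t. w is just the feature vector.
    """
    # Compute the return G_t for every time step (backwards pass).
    G = 0.0
    returns = []
    for _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # SGD update: w <- w + alpha * (G_t - v_hat(S_t, w)) * grad v_hat
    for (x, _), G_t in zip(episode, returns):
        w = w + alpha * (G_t - v_hat(w, x)) * x
    return w

# Tiny made-up episode with 2-dimensional features.
episode = [(np.array([1.0, 0.0]), 0.0), (np.array([0.0, 1.0]), 1.0)]
w = gradient_mc_update(np.zeros(2), episode)
print(w)
```

As in tabular MC, the update can only run once the episode has finished and all returns are known.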
It is also possible to use other choices of U_t, in particular ones that bootstrap, i.e. use the current value estimates. When we do, we lose the above convergence guarantees – but empirically this often still works well. Such methods are called semi-gradient methods – since they only consider the effect of changing the weights on the estimate, but not on the target.
Based on this, we can introduce TD(0) with function approximation, semi-gradient TD(0):

w_{t+1} = w_t + α [R_{t+1} + γ v̂(S_{t+1}, w_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t)
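A minimal sketch of one such semi-gradient TD(0) step with a linear approximator; the transition data and step size are invented for illustration:

```python
import numpy as np

def semi_gradient_td0_update(w, x, r, x_next, alpha=0.1, gamma=0.9, done=False):
    """One semi-gradient TD(0) update for a linear v_hat(s, w) = w @ x(s).

    The TD target r + gamma * v_hat(s', w) is treated as a constant:
    we only differentiate through v_hat(s, w), hence 'semi'-gradient.
    """
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    td_error = r + gamma * v_next - v
    # For linear v_hat, grad_w v_hat(s, w) = x(s).
    return w + alpha * td_error * x

# Illustrative single transition with 2-dimensional features.
w = np.array([0.0, 1.0])
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = semi_gradient_td0_update(w, x, r=1.0, x_next=x_next)
print(w)
```

Unlike the MC version above, this update can be applied online after every single transition.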
A natural extension of this, analogous to the corresponding n-step tabular method, is n-step semi-gradient TD:

w_{t+n} = w_{t+n−1} + α [G_{t:t+n} − v̂(S_t, w_{t+n−1})] ∇v̂(S_t, w_{t+n−1}),

with the n-step return G_{t:t+n} = R_{t+1} + γ R_{t+2} + … + γⁿ⁻¹ R_{t+n} + γⁿ v̂(S_{t+n}, w_{t+n−1}).
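To illustrate the target, here is a small sketch computing G_{t:t+n} from stored rewards and a bootstrap state; rewards, features, and discount are made up:

```python
import numpy as np

def n_step_target(rewards, x_boot, w, n, gamma=0.5):
    """n-step return: discounted sum of n rewards plus bootstrapped tail.

    rewards: the n rewards R_{t+1}, ..., R_{t+n}
    x_boot:  feature vector of the bootstrap state S_{t+n}
    """
    assert len(rewards) == n
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    # Bootstrapped tail: gamma^n * v_hat(S_{t+n}, w), linear v_hat here.
    return G + gamma ** n * (w @ x_boot)

w = np.array([1.0, 2.0])
G = n_step_target([1.0, 0.0], np.array([0.0, 1.0]), w, n=2)
print(G)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

For n = 1 this recovers the TD(0) target; letting n run to the episode end recovers the MC return.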
Function Approximation Methods
In the remainder of Chapter 9, Sutton describes different ways of representing the approximate function: most of the chapter covers linear function approximation and the corresponding feature design, while nonlinear function approximation with artificial neural networks is only touched upon. We will cover these topics briefly, since on this blog we mainly work with (deep) neural networks rather than simple linear approximations, and we expect the interested reader to already be familiar with the basics of deep learning and neural networks.
Linear Function Approximation
Still, let's briefly discuss the linear case. Here, the state-value function is approximated by the inner product

v̂(s, w) = wᵀ x(s).

The state is represented by a feature vector

x(s) ∈ ℝᵈ

– and, as we see, the value estimate is a linear combination of features, weighted by the weights.
Due to the simplicity of this representation, there are nice mathematical results (and closed-form expressions) for the solution, as well as convergence guarantees.
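As a sketch of why the linear case is so tractable: fitting the weights to a batch of (feature vector, target) pairs is ordinary least squares, which has a closed-form solution. The batch below is made up for illustration:

```python
import numpy as np

# Made-up batch: one feature vector per row, one value target per row.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
u = np.array([1.0, 2.0, 3.0])

# Closed-form least-squares weights: argmin_w ||X w - u||^2
w, *_ = np.linalg.lstsq(X, u, rcond=None)

# The linear value estimates are then just X @ w.
print(w)      # [1. 2.]
print(X @ w)  # [1. 2. 3.] -- an exact fit for this toy batch
```

No such closed form exists for nonlinear approximators like neural networks, which is one reason the linear theory is so much more complete.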
Feature Construction for Linear Methods
A limitation of the linear approximation presented above is that each feature is used separately; interactions between features cannot be captured. Sutton cites the pole-balancing (cart-pole) problem as an example: here, a high angular velocity can be good or bad, depending on the context. When the pole is upright, one should avoid quick, jerky movements. However, when the pole is close to falling, a fast movement may be required to catch it.
Thus, there is a whole branch of research about designing efficient feature representations (although one might argue that, due to the rise of deep learning, this is becoming less important).
One such representation is polynomials. As an introductory example, assume that the state vector consists of two components, s₁ and s₂. Then we can define the feature vector

x(s) = (1, s₁, s₂, s₁s₂)ᵀ.
Using this feature vector, we can still perform linear function approximation – i.e. we apply the four weights to the four newly created features, and overall we still have a function that is linear in the weights.
In general, each polynomial basis feature x_i of order n for a k-dimensional state can be written as

x_i(s) = Π_{j=1…k} s_j^{c_{i,j}},

where the c_{i,j} are integers in {0, …, n}, giving (n+1)ᵏ features in total.
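A small sketch of constructing such order-n polynomial features; the function name and interface are my own:

```python
from itertools import product

def polynomial_features(s, n):
    """All order-n polynomial basis features of the state vector s.

    Each feature is prod_j s_j ** c_j with exponents c_j in {0, ..., n},
    giving (n + 1) ** len(s) features in total.
    """
    feats = []
    for exps in product(range(n + 1), repeat=len(s)):
        f = 1.0
        for s_j, c in zip(s, exps):
            f *= s_j ** c
        feats.append(f)
    return feats

# For s = (s1, s2) and n = 1 this recovers the introductory example
# (1, s2, s1, s1*s2), just in a different order.
print(polynomial_features([2.0, 3.0], n=1))  # [1.0, 3.0, 2.0, 6.0]
```

Note how quickly the feature count (n+1)ᵏ grows with the state dimension k, which is why higher-order polynomial bases are rarely used for large states.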
Other commonly used basis functions are the Fourier basis, coarse coding and tile coding, and radial basis functions – but as mentioned we won't go deeper here.
Conclusion
In this post we took a significant step beyond the previous posts towards sending RL algorithms “into the wild”. In the previous posts, we focused on presenting the core methods of RL, albeit in tabular form. We have seen that these quickly reach their limits when applied to large problems, and thus that approximate solution methods are needed.
In this post we presented the basics of these. Besides enabling the handling of large, real-world problems, such methods also introduce generalization – a vital ingredient of any successful RL algorithm.
We started by introducing the prediction objective and ways to minimize it.
We then presented gradient and semi-gradient RL algorithms for the prediction problem – i.e. learning the value function of a given policy.
Finally, we discussed different ways of doing the actual function approximation.
As always, thanks for reading! And if you're interested, stay tuned for the next post where we'll dive into the associated control problem.
Other Posts in this Series
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
[2]



