TP Approximate Reinforcement Learning

Introduction: Environment and Algorithm

In this new episode of “Fun with Reinforcement Learning” (do you have the reference ?), we are about to explore approximate methods for Reinforcement Learning, especially for continuous-state environements. We will still be using our virtual python environment with the OpenAI gymnasium modules.

First step is setting the virtual python environment, or reuse the one from the privious episode/TP. You can skip the next section if reusing the environment, provided you remember how to activate and deactive this environment.

Setting up the python3 virtual environment

Goal: set up a proper python3 virtual programming environment, with the gym module from OpenAI.

IF you already have a working virtual environment from previous Labwork, you only need to activate it with source ./rlenv/bin/activate and update the requirements. For this, grab this requirements.txt file. Then use it to install all needed modules with

(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt

IF you do not have a virtual environment or prefer to start from scratch, follow these instructions.

Using virtualenv, the following commands will help you create a basic virtual environment (not to be confused with the environment or model in Reinforcement Learning) called rlenv with python3 as default python. In this virtual environment, you can add/remove/use any python module or version without damaging your default python environment.

First, you need to be sure that virtualenv for python3 is installed on your computer.

mylogin@mymachine:~$ sudo apt install python3-venv

Create a new directory (say oh_what_a_nice_TP) and cd into it

mylogin@mymachine:~$ mkdir oh_what_a_nice_TP
mylogin@mymachine:~$ cd oh_what_a_nice_TP

Then create a virtual environment named rlenv, with python3 being the default python in it.

mylogin@mymachine:~$ virtualenv -p python3 rlenv

Activate this rlenv

mylogin@mymachine:~$ source ./rlenv/bin/activate

You should notice a (rlenv) in front of your prompt, as a reminder.

We will now install all required python modules. For this, grab this requirements.txt file. Then use it to install all needed modules with

(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt

You can check what has been locally installed in this virtual env by

(rlenv) mylogin@mymachine:~$ pip list

hopefully, openAI gym will be listed :o)

Check also that the default python is python3.

(rlenv) mylogin@mymachine:~$ python --version
Python 3.7.3

When you are finisehd, at the end of this TP, to get out of this virtual environment, just do

(rlenv) mylogin@mymachine:~$ deactivate
mylogin@mymachine:~$

notice that the (rlenv) has disappeared from before your prompt.

Warning: if you just deactivated the rlenv virtual environment, activate it again now as it is needed for the rest of the TP. (You know, the source ./rlenv/bin/activate command we just saw).

The MountainCar problem

Goals:

set up and understand the peculiarity of the Mountain Car environment
test with simple policy taking random action
test with “classical” RL algo like Q-Learning
explore a naïve discretization of the state space.

You need these files:

interactive_environment.py: small script allowing you (Human) to try to solve problems by hands (meaning with the keys of your keyboard) => it is nearly a game :o)
test_rl_mcar.py: main python script that should put you on tracks to infer what should be done. => you might need to complete this file.
algo.py: implements a Policy class and valid version of some RL algorithms (along with previous Dynamic Programming Algorithms) => you might have to complete this file (BEWARE, it will not work with algo.py of the first labwork).
coding.py: module for encoding continuous variable to discrete ones => you will only use this file, not modifications.
plotting.py: module for plotting QValues, Policy and Trajectories => you will only use this file, not modifications.
if you are curious, mountain_car.py (not needed) shows you how the problem has been programmed in the OpenAI gymnasium library.

0) Set up and try “Moutain Car”

The “Moutain Car” problem is described by Sutton & Barto (yes, again!) as:

As usual, this environement make use of the reset() and step() functions. But, it has a continuous observation space.

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 0

You can try your skill using the keyboard to choose action in real time. Run:

(rlenv) mylogin@mymachine:~$ python interactive_environment.py

^C quits and ‘a’, ‘z’, ‘e’ keys choses actions.

__Mountain Car___________
- a : accel LEFT
- z : NOTHING
- e : accel RIGHT

You can have a look at interacive_environment.py, it simply uses the play() util from the gymnasium.

1) Test using a simple policy.

By using the following command, you can test a simple policy on the Mountain Car problem.

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 1

❓ Question (1.1) ❓ What is the policy used in the last command ?

❓ Question (1.2) ❓ On average, how long does it take for that policy to reach the target ? Update the program around question 1 in test_rl_mcar.py to compute the empirical value of the time needed to reach the target. For example by averaging on 20 episodes.

Remark, by default an episod terminates after 200 steps, if you want to change this limit, use, for example:

env._max_episode_steps = 5000

2) Find the optimal policy using Q-Learning.

Thanks to the hard work of the previous TP session, we have the Q-Learning algorithm at our disposal (in fact, algo.py includes a correct version of this algorithm). Test it with:

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 3

❓ Question (1.3) ❓ What is wrong ? Why can’t you compute this optimal policy using Q-Leraning ?

Discretization of continuous space (Coarse coding)

0) Understand `digitize()`

We will make use of the numpy.digitize() function which, given a list of values, gives you the index of the interval. Pay attention to the right argument of this function.

You can try it with

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 4

1) Complete the `Encoder`

The idea is to discretize the input values (array of 2 floats) and then to generate a unique code_index that will be the future state of our QLearning algorithm.

❓ Question (1.4) ❓ Complete the class Encode so that inputs can be normalized with

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 5

and then encoded as a code_index using

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 6

❓ Question (1.6) ❓ Check that the minimum index (for minimal input values) is 0 ? And what is the maximum index ?

2) Wrap the continuous environment into a discrete one (and use QLearning)

As we know how to encode our continuous state into a discrete one, we are one step away to be able to use Q-Learning on the Mountain Car problem. As our algorithm needs a discrete Env, we will build one by wrapping our continuous Env : i.e. providing a new environment where the continuous obs has been transformed to a discrete one.

If you have a proper Encoder with the encode() methode of part 6 of test_rl_mcar.py, the following lines do the trick

## Create a discretized_env with discrete observation space
from gymnasium.wrappers import TransformObservation
discretized_env = TransformObservation( env, encoder.encode )
## need also to change its observation_space
discretized_env.observation_space = gym.spaces.Discrete( encoder.encode( obs_max )+1 )

And you should now be able to run Q-Learning using:

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 7

❓ Question (1.7) ❓ You should not have any errors. But does it “learns” ? Are you sure all Qvalues are not zero ? Is there a problem with gamma ? What do you think ?

4) Test and monitor learning performances

To monitor the learning algorithm, one way is to test, from time to time, the performance of a greedy policy relative to the current estimation of the QValue. Plotting the Value (or QValue) function and this greedy policy can also help.

In this part, you will be able to use some methods/class devoted to this kind of plotting. For that, will need to provide two function to the plotting module : one to compute the Value at a continuous state (encoded_value_f()) and one to compute the greedy action at a state (encoded_policy_f()).

Your job is to evaluate, from time to time, the current greedy policy. You have to implement some part of method test_greedypolicy() that returns a trajectory generated by the greedy policy, and the reward cumulated along this trajectory.

❓ Question (1.8) ❓ Implement missing parts of test_greedypolicy(). Then, the following line should allow you to monitor the learning progress. Do not expect any serious progress before 7.000 or 8.000 episodes.

This pyhton command should allow you to monitor, using plots, the performances of the algorithm. Position the plots, and be patient : plot refresh is only every 500 episodeds, and is sometimes a bit awkward.

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 8

Value Function plots the Value function for every state in \([-1.2, 0.6] \times [-0.07, 0.07]\), and, in red, the trajectory generated by test_greedypolicy().

Policy plots the Greed Action (0 or 1 or 2) for every state in \([-1.2, 0.6] \times [-0.07, 0.07]\), and, in red, the trajectory generated by test_greedypolicy().

Learning Plot plots in blue the sum of the reward accumulated for each learning episode and, in red, the reward accumulated when testing using the GreedyPolicy.

❓ Question (1.8) ❓ Still not learning ? The trick is to use an EpsilonGreedyPolicy instead of a random policy to explore so as to begin to exploit the current estimation of the QValue. You only have line of code to change, between while... and policy = algo.EpsilonGreed.... in part 8.

Gradient ascent to the optimal QValue

0) QLearning with Linear approximation and gradient ascent.

The idea is to use a linear representation of the QValue function. Say we have \(M\) feature functions \(\Phi_i(s)\) and \(M\times|A|\) weights (\(M\) weights for each action), then we will write

\[ \hat{Q}(s,a) = \sum_{i=0}^{M-1} w_{a,i} \times \Phi_i(s)\].

After a transition \(s, a \rightarrow s', r\), we can compute the Temporal Difference error \(\delta = (r + \gamma \max_{a'} \hat{Q}(s',a') - \hat{Q}(s,a))\) and update the weights \(w_{a,i}\) using the gradient of this approximation (\(\alpha\) being a learning coefficient).

\[\forall i \in {0,...,M-1}; \;\;\;\;\; w_{a,i} \leftarrow w_{a,i} + \alpha \frac{\partial \hat{Q}(s,a)}{\partial w_{a,i}} \delta\]

❓ Question (2.1) ❓ By the way, what is \(\frac{\partial \hat{Q}(s,a)}{\partial w_{a,i}}\) ? Isn’t that easy to use in a program then ? Can you garante that the algo will converge to the optimal QValue function ?

1) RBF

In that part, we use Gaussian Kernel as feature in a simple linear approximation scheme. The center of the Gaussian \(k\) are on a grid \(c_k=(i,j)\) where \(i,j \in {0,...,N-1}\). The value of each feature is computed as

\[ \Phi_k(s) = \exp\left( \frac{- ||s - c_k||_2^2}{2\sigma^2} \right)\].

❓ Question (2.2) ❓ Can you make a schematic view of the way he algoritm work, from obs given by the continuous environment to the learning and selection of a “good” action.

2) Let’s test it

Everything is ready (or nearly) in question_10 of test_rl_car.py. You just have to implement the crucial qval_approx() method. Then you will be able to play with

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 10

and the parameters of the algorithm (tiling_size, sigma2, alpha, etc).

❓ Question (2.3) ❓ Complete qval_approx() in question_10 of test_rl_car.py.

❓ Question (2.4) ❓ When we look at the Value or Policy plot, it seems that the Value Function is defined everywhere, which was not the case whith the discretized QLearning. Is it the result of a better exploration ?

Least-Square Policy Iteration with RBF linear approximation

0) Another look at the same linera approximation

Linear combination of features can also be written, with the same number of parameters \(w\) as

\[ \hat{Q}(s,a) = \sum_{i=0}^{|A|.(M-1)} w_{i} \times \Phi_i(s,a)\].

Under that form, the LSTD-Q algorihm that compute an estimation of the \(Q^\pi\) function for policy \(\pi\) can be written as

**Algorithm: Least Square Temporal Differences - Q**

1) Implement LSPI

Least Square Policy Iteration (LSPI) alternates between two steps:

evaluate \(Q^\pi\) for the actual policy \(\pi\), using LSTD-Q (see class LSTDQSimple)
improve this policy using a greed operator. Because \(Q^\pi\) is a continuous function of \(s\), the policy cannot be stored as a vector of action. Instead, we use class EpsilonGreedyApproxPolicy (in algo.py). Each time an action is asked, this class will get values for each action thanks to a qvalue_func function and thus compute, on the fly, the epsilon-best action. If qvalue_func is computed using the weights w comming from solving with LSTDQSimple, this greedy operator is implicitely defined in EpsilonGreeddyApproxPolicy.

❓ Question (3.1) ❓ Have you noticed something odd in class GaussianFeatures when generating \(\phi(s)\) or \(\phi(s,a)\) ? Why is that ?

❓ Question (3.2) ❓ Complete class LSTDQSimple.add_sample() so that you can run LSPI with

(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 12

2) Play with LSPI

❓ Question (3.3) ❓ List the parameters, hyper-parameters and notions you can play with to improve the efficiency of LSPI. Hint: this list is not limited to nb_rbf, max_episod and max_sample.

Play with these different aspects of LSPI to get better solutions, in less iterations. If possible…

Solutions

Complete working code is here : tp_approxRL.tgz