License : Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Copyright :
CentraleSupelec
Last modified : April 15, 2024 10:31
Link to the source : index.md
In this new episode of “Fun with Reinforcement Learning” (do you have the reference?), we are about to explore approximate methods for Reinforcement Learning, especially for continuous-state environments. We will still be using our virtual python environment with the OpenAI gymnasium modules.
The first step is to set up the virtual python environment, or to reuse the one from the previous episode/TP. You can skip the next section if you are reusing the environment, provided you remember how to activate and deactivate it.
Goal: set up a proper python3 virtual programming environment, with the `gym` module from OpenAI.
IF you already have a working virtual environment from the previous labwork, you only need to activate it with `source ./rlenv/bin/activate` and update the requirements. For this, grab this requirements.txt file. Then use it to install all needed modules with
(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt
IF you do not have a virtual environment or prefer to start from scratch, follow these instructions.
Using `virtualenv`, the following commands will help you create a basic virtual environment (not to be confused with the environment or model in Reinforcement Learning) called `rlenv`, with python3 as the default python. In this virtual environment, you can add/remove/use any python module or version without damaging your default python environment.
First, you need to be sure that `virtualenv` for python3 is installed on your computer.
mylogin@mymachine:~$ sudo apt install virtualenv
Create a new directory (say `oh_what_a_nice_TP`) and `cd` into it
mylogin@mymachine:~$ mkdir oh_what_a_nice_TP
mylogin@mymachine:~$ cd oh_what_a_nice_TP
Then create a virtual environment named `rlenv`, with python3 being the default python in it.
mylogin@mymachine:~$ virtualenv -p python3 rlenv
Activate this `rlenv`
mylogin@mymachine:~$ source ./rlenv/bin/activate
You should notice a `(rlenv)` in front of your prompt, as a reminder.
We will now install all required python modules. For this, grab this requirements.txt file. Then use it to install all needed modules with
(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt
You can check what has been locally installed in this virtual env by
(rlenv) mylogin@mymachine:~$ pip list
Hopefully, gymnasium will be listed :o)
Check also that the default python is python3.
(rlenv) mylogin@mymachine:~$ python --version
Python 3.7.3
When you are finished, at the end of this TP, you can get out of this virtual environment with
(rlenv) mylogin@mymachine:~$ deactivate
mylogin@mymachine:~$
Notice that the `(rlenv)` has disappeared from before your prompt.
Warning: if you just deactivated the `rlenv` virtual environment, activate it again now, as it is needed for the rest of the TP. (You know, the `source ./rlenv/bin/activate` command we just saw.)
Goals:
You need these files:
- a file with the `Policy` class and valid versions of some RL algorithms (along with the previous Dynamic Programming algorithms) => you might have to complete this file (BEWARE, it will not work with the `algo.py` of the first labwork).
- the OpenAI `gymnasium` library.

The “Mountain Car” problem is described by Sutton & Barto (yes, again!) as:
As usual, this environment makes use of the `reset()` and `step()` functions. But it has a continuous observation space.
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 0
You can try your skill using the keyboard to choose actions in real time. Run:
(rlenv) mylogin@mymachine:~$ python interactive_environment.py
^C quits, and the ‘a’, ‘z’, ‘e’ keys choose actions.
__Mountain Car___________
- a : accel LEFT
- z : NOTHING
- e : accel RIGHT
You can have a look at `interactive_environment.py`; it simply uses the `play()` utility from `gymnasium`.
By using the following command, you can test a simple policy on the Mountain Car problem.
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 1
❓ Question (1.1) ❓ What is the policy used in the last command?
❓ Question (1.2) ❓ On average, how long does it take for that policy to reach the target? Update the program around question 1 in `test_rl_mcar.py` to compute the empirical value of the time needed to reach the target, for example by averaging over 20 episodes.
Remark: by default, an episode terminates after 200 steps; if you want to change this limit, use, for example:
env._max_episode_steps = 5000
Thanks to the hard work of the previous TP session, we have the Q-Learning algorithm at our disposal (in fact, `algo.py` includes a correct version of this algorithm). Test it with:
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 3
❓ Question (1.3) ❓ What is wrong? Why can’t you compute this optimal policy using Q-Learning?
`digitize()`
We will make use of the `numpy.digitize()` function which, given a list of values, gives you the index of the interval each value falls in. Pay attention to the `right` argument of this function.
You can try it with
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 4
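If you want a minimal illustration of how `numpy.digitize()` indexes intervals, independent of the TP files (the bin edges below are arbitrary):

```python
import numpy as np

# 4 bin edges define 5 intervals, indexed 0 (below the first edge) to 4 (above the last)
bins = np.array([0.0, 0.5, 1.0, 1.5])
values = np.array([-0.2, 0.0, 0.6, 1.5, 2.0])

# Default right=False: index i such that bins[i-1] <= x < bins[i]
print(np.digitize(values, bins))               # [0 1 2 4 4]
# right=True: index i such that bins[i-1] < x <= bins[i]
print(np.digitize(values, bins, right=True))   # [0 0 2 3 4]
```

Notice how the values 0.0 and 1.5, which sit exactly on an edge, change interval depending on `right`.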
Encoder
The idea is to discretize the input values (array of 2 floats) and then to generate a unique `code_index` that will be the future state of our QLearning algorithm.
❓ Question (1.4) ❓ Complete the class `Encoder` so that inputs can be normalized with
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 5
and then encoded as a code_index using
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 6
❓ Question (1.6) ❓ Check that the minimum index (for minimal input values) is 0. And what is the maximum index?
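To fix the idea, here is a minimal sketch of such an encoder (all names and details are hypothetical; the real class in the TP files certainly differs):

```python
import numpy as np

class Encoder:
    """Discretize a 2D continuous observation into a single integer code_index."""
    def __init__(self, obs_min, obs_max, nb_bins):
        self.obs_min = np.asarray(obs_min, dtype=float)
        self.obs_max = np.asarray(obs_max, dtype=float)
        self.nb_bins = nb_bins
        # nb_bins+1 edges per dimension; keep only the interior edges so that
        # digitize() returns indices in 0..nb_bins-1
        self.edges = [np.linspace(lo, hi, nb_bins + 1)[1:-1]
                      for lo, hi in zip(self.obs_min, self.obs_max)]

    def normalize(self, obs):
        # rescale each coordinate to [0, 1]
        return (np.asarray(obs) - self.obs_min) / (self.obs_max - self.obs_min)

    def encode(self, obs):
        # per-dimension bin index, then a mixed-radix combination -> unique code_index
        idx = [int(np.digitize(x, e)) for x, e in zip(obs, self.edges)]
        return idx[0] * self.nb_bins + idx[1]

# Mountain Car bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
enc = Encoder(obs_min=[-1.2, -0.07], obs_max=[0.6, 0.07], nb_bins=10)
print(enc.encode([-1.2, -0.07]))  # minimal inputs -> 0
print(enc.encode([0.6, 0.07]))    # maximal inputs -> 10*10 - 1 = 99
```

With `nb_bins` bins per dimension, the indices range from 0 to `nb_bins**2 - 1`, which answers the bounds asked about in question 1.6 for this sketch.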
As we know how to encode our continuous state into a discrete one, we are one step away from being able to use Q-Learning on the Mountain Car problem. As our algorithm needs a discrete `Env`, we will build one by wrapping our continuous `Env`: i.e. providing a new environment where the continuous `obs` has been transformed into a discrete one.
If you have a proper Encoder with the `encode()` method of part 6 of `test_rl_mcar.py`, the following lines do the trick:
## Create a discretized_env with discrete observation space
from gymnasium.wrappers import TransformObservation
discretized_env = TransformObservation( env, encoder.encode )
## need also to change its observation_space
discretized_env.observation_space = gym.spaces.Discrete( encoder.encode( obs_max )+1 )
And you should now be able to run Q-Learning using:
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 7
❓ Question (1.7) ❓ You should not have any errors. But does it “learn”? Are you sure all Qvalues are not zero? Is there a problem with `gamma`? What do you think?
To monitor the learning algorithm, one way is to test, from time to time, the performance of a greedy policy relative to the current estimation of the QValue. Plotting the Value (or QValue) function and this greedy policy can also help.
In this part, you will be able to use some methods/classes devoted to this kind of plotting. For that, you will need to provide two functions to the plotting module: one to compute the Value at a continuous state (`encoded_value_f()`) and one to compute the greedy action at a state (`encoded_policy_f()`).
Your job is to evaluate, from time to time, the current greedy policy. You have to implement some parts of the method `test_greedypolicy()`, which returns a trajectory generated by the greedy policy and the reward accumulated along this trajectory.
❓ Question (1.8) ❓ Implement the missing parts of `test_greedypolicy()`. Then the following line should allow you to monitor the learning progress. Do not expect any serious progress before 7,000 or 8,000 episodes.
This python command should allow you to monitor, using plots, the performance of the algorithm. Position the plots, and be patient: the plots refresh only every 500 episodes, and the refresh is sometimes a bit awkward.
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 8
Value Function plots the Value function for every state in \([-1.2, 0.6] \times [-0.07, 0.07]\) and, in red, the trajectory generated by `test_greedypolicy()`.
Policy plots the Greedy Action (0, 1 or 2) for every state in \([-1.2, 0.6] \times [-0.07, 0.07]\) and, in red, the trajectory generated by `test_greedypolicy()`.
Learning Plot plots in blue the sum of the reward accumulated for each learning episode and, in red, the reward accumulated when testing using the GreedyPolicy.
❓ Question (1.8) ❓ Still not learning? The trick is to use an EpsilonGreedyPolicy instead of a random policy for exploration, so as to begin to exploit the current estimation of the QValue. You only have one line of code to change, between `while...` and `policy = algo.EpsilonGreed....` in part 8.
The idea is to use a linear representation of the QValue function. Say we have \(M\) feature functions \(\Phi_i(s)\) and \(M\times|A|\) weights (\(M\) weights for each action), then we will write
\[ \hat{Q}(s,a) = \sum_{i=0}^{M-1} w_{a,i} \times \Phi_i(s)\].
After a transition \(s, a \rightarrow s', r\), we can compute the Temporal Difference error \(\delta = (r + \gamma \max_{a'} \hat{Q}(s',a') - \hat{Q}(s,a))\) and update the weights \(w_{a,i}\) using the gradient of this approximation (\(\alpha\) being a learning coefficient).
\[\forall i \in {0,...,M-1}; \;\;\;\;\; w_{a,i} \leftarrow w_{a,i} + \alpha \frac{\partial \hat{Q}(s,a)}{\partial w_{a,i}} \delta\]
❓ Question (2.1) ❓ By the way, what is \(\frac{\partial \hat{Q}(s,a)}{\partial w_{a,i}}\)? Isn’t that easy to use in a program then? Can you guarantee that the algorithm will converge to the optimal QValue function?
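The update above is cheap to code. Here is a minimal numpy sketch of one such gradient step (dimensions, feature vectors and names are illustrative, not taken from the TP files):

```python
import numpy as np

M, nb_actions = 4, 3          # M features, |A| actions
alpha, gamma = 0.1, 0.99      # learning rate and discount
W = np.zeros((nb_actions, M)) # M weights per action

def q_hat(phi_s, a):
    # linear approximation: Q_hat(s,a) = sum_i W[a,i] * Phi_i(s)
    return W[a] @ phi_s

# One transition (s, a) -> (s', r), with precomputed feature vectors
phi_s  = np.array([1.0, 0.5, 0.0, 0.0])   # Phi(s)
phi_s2 = np.array([0.0, 1.0, 0.5, 0.0])   # Phi(s')
a, r = 1, -1.0

# TD error delta = r + gamma * max_a' Q_hat(s',a') - Q_hat(s,a)
delta = r + gamma * max(q_hat(phi_s2, a2) for a2 in range(nb_actions)) - q_hat(phi_s, a)

# Gradient step: dQ_hat/dW[a,i] is just Phi_i(s)
W[a] += alpha * delta * phi_s
print(W[a])   # the weights of action a moved in proportion to Phi(s)
```

Only the weights of the action actually taken are updated, each in proportion to its feature value.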
In this part, we use Gaussian kernels as features in a simple linear approximation scheme. The centers of the Gaussians \(c_k\) are on a grid \(c_k=(i,j)\) where \(i,j \in \{0,...,N-1\}\). The value of each feature is computed as
\[ \Phi_k(s) = \exp\left( \frac{- ||s - c_k||_2^2}{2\sigma^2} \right)\].
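This formula translates directly into a few lines of numpy (a sketch assuming the state has been normalized to \([0,1]^2\); the grid size and `sigma2` values are illustrative):

```python
import numpy as np

N, sigma2 = 5, 0.05
# Centers c_k on an N x N grid over the normalized state space [0,1]^2
centers = np.array([(i, j) for i in range(N) for j in range(N)], dtype=float) / (N - 1)

def gaussian_features(s, centers=centers, sigma2=sigma2):
    # Phi_k(s) = exp( -||s - c_k||^2 / (2 sigma^2) ), computed for all k at once
    d2 = np.sum((centers - np.asarray(s)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma2))

phi = gaussian_features([0.5, 0.5])
print(phi.shape)   # (25,) : one value per center
print(phi.max())   # 1.0   : the feature whose center sits exactly at s
```

Each feature responds most strongly near its own center, so \(\Phi(s)\) is a smooth, localized encoding of the state.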
❓ Question (2.2) ❓ Can you make a schematic view of the way the algorithm works, from the `obs` given by the continuous environment to the learning and selection of a “good” action?
Everything is ready (or nearly) in question_10 of `test_rl_mcar.py`. You just have to implement the crucial `qval_approx()` method. Then you will be able to play with
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 10
and the parameters of the algorithm (tiling_size, sigma2, alpha, etc).
❓ Question (2.3) ❓ Complete `qval_approx()` in `question_10` of `test_rl_mcar.py`.
❓ Question (2.4) ❓ When we look at the Value or Policy plot, it seems that the Value Function is defined everywhere, which was not the case with the discretized QLearning. Is it the result of a better exploration?
A linear combination of features can also be written, with the same number of parameters \(w\), as
\[ \hat{Q}(s,a) = \sum_{i=0}^{|A| \times M - 1} w_{i} \times \Phi_i(s,a)\].
where \(\Phi(s,a)\) is a vector of size \(|A|\times|\Phi(s)|\), filled with 0 except for the coordinates \([idxa\times|\Phi(s)|, (idxa+1)\times|\Phi(s)|)\), which are set to \(\Phi(s)\).
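Building this block vector is a one-liner with numpy slicing (a sketch; the function name is hypothetical):

```python
import numpy as np

def phi_sa(phi_s, action, nb_actions):
    """Block vector of size nb_actions*len(phi_s): zeros everywhere except the
    slice [action*len(phi_s), (action+1)*len(phi_s)), which is set to phi_s."""
    m = len(phi_s)
    out = np.zeros(nb_actions * m)
    out[action * m:(action + 1) * m] = phi_s
    return out

phi_s = np.array([0.2, 0.8])
print(phi_sa(phi_s, action=1, nb_actions=3))   # [0.  0.  0.2 0.8 0.  0. ]
```

The action index only selects which block of the vector is non-zero; the state features themselves are unchanged.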
Under that form, the LSTD-Q algorithm, which computes an estimation of the \(Q^\pi\) function for policy \(\pi\), can be written as
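(For reference, a standard way of writing the LSTD-Q estimate, as in the LSPI literature, is to accumulate a matrix and a vector over the observed transitions \((s_t, a_t, r_t, s'_t)\) and solve a linear system:)

\[ A = \sum_{t} \Phi(s_t, a_t)\,\big(\Phi(s_t, a_t) - \gamma\, \Phi(s'_t, \pi(s'_t))\big)^\top, \qquad b = \sum_{t} \Phi(s_t, a_t)\, r_t, \qquad w = A^{-1} b \]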
Least Square Policy Iteration (LSPI) alternates between two steps:
- a policy evaluation step (using `class LSTDQSimple`)
- a policy improvement step (using `class EpsilonGreedyApproxPolicy`, in `algo.py`). Each time an action is asked for, this class gets a value for each action thanks to a `qvalue_func` function and thus computes, on the fly, the epsilon-best action. If `qvalue_func` is computed using the weights `w` coming from solving with `LSTDQSimple`, this greedy operator is implicitly defined in `EpsilonGreedyApproxPolicy`.

❓ Question (3.1) ❓ Have you noticed something odd in `class GaussianFeatures` when generating \(\phi(s)\) or \(\phi(s,a)\)? Why is that?
❓ Question (3.2) ❓ Complete `class LSTDQSimple.add_sample()` so that you can run LSPI with
(rlenv) mylogin@mymachine:~$ python test_rl_mcar.py 12
❓ Question (3.3) ❓ List the parameters, hyper-parameters and notions you can play with to improve the efficiency of LSPI. Hint: this list is not limited to `nb_rbf`, `max_episod` and `max_sample`.
Play with these different aspects of LSPI to get better solutions, in fewer iterations. If possible…
Complete working code is here : tp_approxRL.tgz