License : Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Copyright :
CentraleSupelec
Last modified : April 15, 2024 10:31
Link to the source : index.md
In this series of TP (Practical Exercises), we are going to use the very trendy OpenAI Gymnasium API. Unfortunately for Mr Frezza-Buet (and for me, in fact), it means we are going to use the python language…
Let’s be serious now. In the following parts, we are going to separate the models/environments from the algorithms/agents. The main idea is to be able to use a given algorithm on any model (provided that algorithm is compatible with the model’s characteristics, as you were about to tell me, of course…)
This TP is organized in three major parts, each subdivided into numbered sections. Of course this advice is silly and not needed, but I strongly advise you to read an entire section before answering the questions at the end of that section.
Let’s start. For this, we first need a proper development context and to become familiar with the gym.Env
class that we will extend to be used with Dynamic Programming algorithms.
Goal: set up a proper python3 virtual programming environment, with the gym
module from OpenAI.
IF you already have a working virtual environment from a previous labwork, you only need to activate it with source ./rlenv/bin/activate.
IF you do not have a virtual environment or prefer to start from scratch, follow these instructions.
Using virtualenv
, the following commands will help you create a basic virtual environment (not to be confused with the environment or model in Reinforcement Learning) called rlenv
with python3 as default python. In this virtual environment, you can add/remove/use any python module or version without damaging your default python environment.
First, you need to be sure that virtualenv
for python3
is installed on your computer.
mylogin@mymachine:~$ sudo apt install virtualenv
Create a new directory (say oh_what_a_nice_TP
) and cd
into it
mylogin@mymachine:~$ mkdir oh_what_a_nice_TP
mylogin@mymachine:~$ cd oh_what_a_nice_TP
Then create a virtual environment named rlenv
, with python3 being the default python in it.
mylogin@mymachine:~$ virtualenv -p python3 rlenv
Activate this rlenv
mylogin@mymachine:~$ source ./rlenv/bin/activate
You should notice a (rlenv) in front of your prompt, as a reminder.
We will now install all required python modules. For this, grab this requirements.txt file. Then use it to install all needed modules with
(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt
You can check what has been locally installed in this virtual env by
(rlenv) mylogin@mymachine:~$ pip list
Hopefully, OpenAI gym will be listed :o)
Check also that the default python is python3.
(rlenv) mylogin@mymachine:~$ python --version
Python 3.7.3
When you are finished, at the end of this TP, to get out of this virtual environment, just do
(rlenv) mylogin@mymachine:~$ deactivate
mylogin@mymachine:~$
Notice that the (rlenv) has disappeared from before your prompt.
Warning: if you just deactivated the rlenv
virtual environment, activate it again now as it is needed for the rest of the TP. (You know, the source ./rlenv/bin/activate
command we just saw).
Goals:

- write a `gym.Env` with methods for Dynamic Programming for the n-Chain problem
- evaluate a given policy (using the `Policy` class), and compute the value of this policy

You need these files:

- `basic_env.py`: the `gym.Env` class and its main methods (especially in the class description), copied from the OpenAI git repository => you will not have to use or modify this file, it is only given for information
- `algo.py`: the `Policy` class and a nearly working implementation of Policy Evaluation (the complete algorithm is given below) => you will need to complete this file
- `environments.py`: where you will complete a `gym.Env` for the n-Chain problem
Run this part with (the ‘0’ argument means you are using question 0)
(rlenv) mylogin@mymachine:~$ python test_nchain.py 0
Remark: In the text, we still use gym
where in fact the gym
modules are now the gymnasium
modules. In the first lines of test_nchain.py
, you will find a classical “trick” to update “old” scripts to the new gymnasium
. These lines are:
# many "old" codes are written using `gym`; this trick allows a transparent
# use of the newer `gymnasium`
import gymnasium as gym
Go on and open test_nchain.py: the first part of the script (before ‘question 1’) shows the classical use of a gym.Env: create, initialize (using reset) and simulate using step. Use this file to understand the basic methods of the gym.Env class and their use (read the explanations in the comments of the class Env in basic_env.py). If you mastered the first TP on “car rental”, all this should be familiar. Once this is understood, you can attack questions 1.1 and 1.2 below.
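To fix ideas, here is a minimal sketch of this create/reset/step cycle; `TinyEnv` and its dynamics are invented for illustration (the real class and its documentation are in basic_env.py), only the shapes of the return values follow the gymnasium interface:

```python
import random

class TinyEnv:
    """A made-up stand-in mimicking the gymnasium Env interface:
    reset() -> (observation, info)
    step(action) -> (observation, reward, terminated, truncated, info)."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self, seed=None):
        if seed is not None:
            random.seed(seed)
        self.state = 0
        return self.state, {}

    def step(self, action):
        # action 0: move forward along the chain, action 1: back to the start
        self.state = min(self.state + 1, self.n_states - 1) if action == 0 else 0
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        terminated = self.state == self.n_states - 1
        return self.state, reward, terminated, False, {}

env = TinyEnv()
obs, info = env.reset(seed=0)          # create and initialize
total, terminated = 0.0, False
while not terminated:                  # simulate using step
    obs, reward, terminated, truncated, info = env.step(0)
    total += reward
```

The loop above always plays action 0, which is enough to see the five values returned by step at work.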
Write a `gym.Env` for ‘n-Chain’ and complete `eval_policy()`
Run this part with (the ‘1’ argument means you are using question 1)
(rlenv) mylogin@mymachine:~$ python test_nchain.py 1
By running test_nchain.py we should be able to create a gym.Env for the n-Chain problem illustrated in the Figure below and compute the value of a randomly initialized policy (once completed, as, for now, it raises a NotImplementedError 😈). This file is abundantly commented and, by answering the questions below, at the end of this section, you should be able to run Iterative Policy Evaluation on the n-Chain problem. The next two Figures and the code/comments in environments.py, algo.py and test_nchain.py will get you going.
❓ Question (1.1) ❓ Why is the raw gym.Env not adapted to Dynamic Programming algorithms like Policy Iteration or Value Iteration?
❓ Question (1.2) ❓ Complete the proper methods in class NChainEnv (in environments.py) to allow querying the missing information, and update the method policy_evaluation() in algo.py accordingly.
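As a hint of the expected shape, here is a sketch of iterative policy evaluation on a made-up tabular model; the arrays `P` and `R`, the numbers, and the function name are all invented (the real model and signatures are in environments.py and algo.py):

```python
import numpy as np

# Made-up tabular model of a tiny 3-state, 2-action chain
# (NOT the real n-Chain numbers from environments.py).
nS, nA = 3, 2
P = np.zeros((nS, nA, nS))              # P[s, a, s'] = transition probability
R = np.zeros((nS, nA))                  # R[s, a]     = expected immediate reward
for s in range(nS):
    P[s, 0, min(s + 1, nS - 1)] = 1.0   # action 0: one step forward
    P[s, 1, 0] = 1.0                    # action 1: back to the start
R[:, 1] = 2.0                           # going back always pays 2
R[nS - 2, 0] = 10.0                     # reaching the last state pays 10

def eval_policy(pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: sweep the Bellman expectation
    backup until the value function is stable."""
    V = np.zeros(nS)
    while True:
        Q = R + gamma * P @ V           # Q[s, a]: one-step lookahead
        newV = (pi * Q).sum(axis=1)     # average over the policy's actions
        if np.max(np.abs(newV - V)) < tol:
            return newV
        V = newV

pi = np.full((nS, nA), 0.5)             # uniform random policy
V = eval_policy(pi)
```

Note that everything the sweep needs (P, R, gamma) is exactly the information the raw gym.Env does not expose, which is the point of question (1.1).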
❓ Question (1.3) ❓ OPTIONAL Can you guess or estimate (without simulation, by analytic calculus) the maximum (and minimum) values of the value function of an optimal policy? That could help you when testing your code…
Run this part with (the ‘4’ argument is the question number).
(rlenv) mylogin@mymachine:~$ python test_nchain.py 4
We are now concerned with Question 4 in test_nchain.py. Knowing how to find the value function of a policy, we “only” need a greedy update of this policy using this value function to design a newer and better policy. This is the essence of the (mixed) Policy Iteration algorithm of the next Figure. The next two questions will lead you through two phases (as always, the idea is to build one brick after another: testing small bits is easier than debugging large parts).
❓ Question (1.4) ❓ Implement the method policy_improvement() of algo.py. The signature of the function is given, and test_nchain.py tests that it works.
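A possible shape for this greedy improvement step, assuming a tabular model stored as arrays `P[s, a, s']` and `R[s, a]` (hypothetical names; the real signature is the one given in algo.py):

```python
import numpy as np

def policy_improvement(P, R, V, gamma=0.9):
    """Greedy improvement: for each state, pick the action maximizing the
    one-step lookahead Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')."""
    Q = R + gamma * P @ V                  # Q[s, a]
    greedy = Q.argmax(axis=1)              # best action in each state
    pi = np.zeros_like(R)                  # deterministic policy, as a table
    pi[np.arange(len(V)), greedy] = 1.0
    return pi

# tiny made-up 2-state, 2-action model to exercise the function
P = np.zeros((2, 2, 2))
P[:, 0, 1] = 1.0                           # action 0 always leads to state 1
P[:, 1, 0] = 1.0                           # action 1 always leads to state 0
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])
pi = policy_improvement(P, R, V=np.zeros(2))
```

With V set to zero, the greedy choice is driven by immediate rewards only, which makes the result easy to check by hand.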
Now, we can run the program for question 5.
(rlenv) mylogin@mymachine:~$ python test_nchain.py 5
You have all the parts needed to implement Policy Iteration. You will see that the function signature of policy_iteration() in algo.py asks for a max_iterations argument; it is one crude way to stop the algorithm when the bug mentioned below the algorithm is encountered.
❓ Question (1.5) ❓ Implement the method policy_iteration() of algo.py.
❓ Question (1.6) ❓ OPTIONAL Why is this algorithm called “mixed” Policy Iteration? (The answer is not in any part of the TP.)
You can run this part with
(rlenv) mylogin@mymachine:~$ python test_nchain.py 7
ONLY IF you feel you are ahead of schedule, you can implement Value Iteration. The signature of the method is given in algo.py and the algorithm, as written by Sutton and Barto, is in the following Figure.
❓ Question (1.7) ❓ Implement Value Iteration and discuss its performance compared to Policy Iteration.
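A compact sketch of the backup at the heart of Value Iteration, on a made-up tabular model (invented names and numbers, not the algo.py signature):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Back up the max over actions until the value function is stable,
    then read off the greedy policy."""
    V = np.zeros(R.shape[0])
    while True:
        newV = (R + gamma * P @ V).max(axis=1)     # Bellman optimality backup
        if np.max(np.abs(newV - V)) < tol:
            break
        V = newV
    return (R + gamma * P @ V).argmax(axis=1), V

# tiny made-up model: looping through state 1 (action 0) pays best
P = np.zeros((2, 2, 2))
P[:, 0, 1] = 1.0                                   # action 0 leads to state 1
P[:, 1, 0] = 1.0                                   # action 1 leads to state 0
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
greedy, V = value_iteration(P, R)
```

Compared to Policy Iteration, there is a single loop: the max over actions replaces the separate evaluation and improvement phases.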
Goals:
We will now address the problem of reinforcement learning with algorithms that can be used even when the learning agent has no access to the transition probabilities. In this part, the central notion of the action-value function, denoted Q(s,a), will be stored in a tabular way. And I strongly suggest you use numpy.array to do so.
So, from a more technical point of view: the tabular ‘Q’ function can be created and initialized to 0 with:
import numpy as np
qvalue = np.zeros( (nb_states, nb_actions) )
Then, Q(s,a) is qvalue[state_index, action_index] where state_index and action_index are valid int, meaning they are in the state and action spaces. For example, one can set and get q-values using:
qvalue_of_s_and_a = qvalue[s,a]
qvalue[s,a] = 2 + qvalue[next_s,a] * 0.1
qvalue[s,a] *= 0.2
# etc...
You need a new file: test_rl_nchain.py
You can run this part with
(rlenv) mylogin@mymachine:~$ python test_rl_nchain.py 1
In ‘test_rl_nchain.py’, the idea is to run Q-Learning on the previous n-Chain problem. The “only” thing to do is to implement the function q_learning() in ‘algo.py’; the pseudo-code of the algorithm is given in the Figure below.
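The core of the update can be sketched as follows; `step_fn` and all the other names are invented stand-ins for the real environment and for the q_learning() signature of ‘algo.py’:

```python
import random
import numpy as np

def q_learning(step_fn, n_states, n_actions, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.1, max_iteration=50):
    """Tabular Q-Learning. step_fn(s, a) -> (next_s, reward, done) is a
    stand-in for env.step(); none of these names come from algo.py."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0
        for _ in range(max_iteration):
            # epsilon-greedy exploration
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(Q[s].argmax())
            next_s, r, done = step_fn(s, a)
            # off-policy update: bootstrap on the *greedy* next action
            target = r + gamma * Q[next_s].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = next_s
            if done:
                break
    return Q

# toy deterministic chain: action 0 moves right (the goal, state 2, pays 1),
# action 1 goes back to the start with no reward
def step_fn(s, a):
    if a == 0:
        return s + 1, float(s + 1 == 2), s + 1 == 2
    return 0, 0.0, False

random.seed(0)
Q = q_learning(step_fn, n_states=3, n_actions=2)
```

Note that the target uses the max over next actions, regardless of which action the exploratory policy will actually take: that is what “off-policy” means here.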
Once this algorithm has computed the optimal action-value function ‘Q’, you can derive the optimal greedy policy using class EpsilonGreedyPolicy
in ‘algo.py’. An example is given in ‘test_rl_nchain.py’ (and, yes, you can ask yourself, and ourselves, why a 0.0 epsilon is chosen).
❓ Question (2.1) ❓ Implement Q-Learning in ‘algo.py’. Check on the ‘n-Chain problem’ that the optimal policy is the same as the one computed using the Dynamic Programming algorithms. What is the fastest method? Any comment?
❓ Question (2.2) ❓ At this point, you can/should be curious and look at the evolution of the optimal policy when you increase the length of the chain (between 5 and 10, for example). For a chain of length 7, which parameter of the algorithm can you play with so that the policy which always chooses the action “forward” (action 0) is one of the optimal policies? Explain why.
While Q-Learning is an off-policy algorithm, SARSA (see Figure below) is an on-policy algorithm.
Now, surprisingly (!), you can use:
(rlenv) mylogin@mymachine:~$ python test_rl_nchain.py 3
❓ Question (2.3) ❓ Implement the method sarsa() in file ‘algo.py’. Compare Q-Learning and SARSA on the n-Chain problem of length 10, with max_iteration=20. What happens? Why?
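For comparison with the off-policy update, here is a sketch of the on-policy one, again with invented names (not the real sarsa() signature of ‘algo.py’):

```python
import random
import numpy as np

def sarsa(step_fn, n_states, n_actions, episodes=300,
          alpha=0.5, gamma=0.9, epsilon=0.1, max_iteration=50):
    """Tabular SARSA. On-policy: the update bootstraps on the action the
    exploratory policy actually chooses next (invented names throughout)."""
    Q = np.zeros((n_states, n_actions))

    def choose(s):                       # the epsilon-greedy behaviour policy
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return int(Q[s].argmax())

    for _ in range(episodes):
        s = 0
        a = choose(s)
        for _ in range(max_iteration):
            next_s, r, done = step_fn(s, a)
            next_a = choose(next_s)      # chosen now, and used in the target
            target = r + gamma * Q[next_s, next_a] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = next_s, next_a
            if done:
                break
    return Q

# toy deterministic chain as a test bed: the goal (state 2) pays 1,
# action 1 resets to the start with no reward
def step_fn(s, a):
    if a == 0:
        return s + 1, float(s + 1 == 2), s + 1 == 2
    return 0, 0.0, False

random.seed(0)
Q = sarsa(step_fn, n_states=3, n_actions=2)
```

The only structural difference with Q-Learning is the target: Q[next_s, next_a] instead of Q[next_s].max(), which is exactly the on-policy/off-policy distinction.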
Goals
You will need a new file: test_rl_cliff.py in this part.
In Sutton & Barto’s book on Reinforcement Learning (2nd edition), the problem is described as:
[As depicted in the next Figure,] this is a standard [gridworld] undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is −1 on all transitions except those into the region marked “The Cliff.” Stepping into this region incurs a reward of −100 and sends the agent instantly back to the start.
Using
(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 0
you can see how the CliffWalking environment is created and how it can be displayed.
Then for the next question, use
(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 1
❓ Question (3.1) ❓ Implement the step() method of class CliffWalking in file ‘environments.py’. Optionally, you can use the p_slip parameter of the class to set up a non-deterministic environment: at each step, the agent has a probability p_slip of slipping and taking a random action (while still believing it performed its chosen action).
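One possible way to organize that slip logic, as a standalone function on the standard 4x12 grid (this is not the real CliffWalking class from ‘environments.py’; only the rewards −1 and −100 and the back-to-start rule come from the Sutton & Barto description quoted above, everything else is an assumption):

```python
import random

N_ROWS, N_COLS = 4, 12
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # up, down, right, left
START, GOAL = (3, 0), (3, 11)

def step(state, action, p_slip=0.0):
    # with probability p_slip, a random action replaces the chosen one
    # (the agent is not told about it)
    if random.random() < p_slip:
        action = random.randrange(4)
    dr, dc = MOVES[action]
    r = min(max(state[0] + dr, 0), N_ROWS - 1)   # stay inside the grid
    c = min(max(state[1] + dc, 0), N_COLS - 1)
    if r == 3 and 1 <= c <= 10:                  # "The Cliff"
        return START, -100.0, False              # back to start, big penalty
    return (r, c), -1.0, (r, c) == GOAL

next_state, reward, done = step(START, 0)        # move up from the start
```

Keeping the slip draw at the very top of step() means the rest of the function never needs to know whether the action was the intended one.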
and
(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 2
❓ Question (3.2) ❓ Using file ‘test_rl_cliff.py’, see what kind of trajectory a random policy can generate on this environment. Can it easily reach the goal? On average, how many iterations does this policy need to reach the goal?
This concerns question 3.3 and further, so use
(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 3
This should allow you to run SARSA and Q-Learning on the ‘Cliff Walking’ problem for 500 episodes. Both use EpsilonGreedyPolicy(epsilon=0.1) as an “exploratory” policy. Then, a plot giving the evolution of the sum of rewards for each learning episode is generated.
❓ Question (3.4) ❓ Analyze the generated plot. What seems to be the better algorithm? Why? Or, if you cannot decide, describe the advantages and limitations of both.
❓ Question (3.5) ❓ Is it better to learn with longer or shorter episodes (play with the max_iteration parameter of the algorithm)? Can you explain why?
In fact, the python code of question (3.4) is not “fair”, in the sense that these algorithms do not have the same goal. Q-Learning was not made to optimize reward during learning; in its original form, it could even be run with a uniform random policy until convergence (remember it is off-policy). And, once it has converged, the policy used to test it should be the deterministic greedy policy.
❓ Question (3.6) ❓ Use Q-Learning (with whatever exploration policy you want) to learn the optimal policy. Display and test this optimal deterministic policy. In fact, can you make a plot where the current greedy policy learned by Q-Learning is tested every 10 episodes until convergence?
❓ Question (3.7) ❓ How can you use SARSA to learn the same “true” optimal policy? Do it! :o)
In Sutton & Barto’s book on Reinforcement Learning (2nd edition), they introduce the ‘Cliff Walking’ example to “highlight the difference between on-policy (Sarsa) and off-policy (Q-learning) methods”. They present the Figure below and write:
The lower part of the figure shows the performance of the Sarsa and Q-learning methods with ε-greedy action selection, ε = 0.1. After an initial transient, Q-learning learns values for the optimal policy, that which travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the ε-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid. Although Q-learning actually learns the values of the optimal policy, its on-line performance is worse than that of Sarsa, which learns the roundabout policy.
❓ Question (3.8) ❓ What surprises you in this plot, compared to the plot you had in Question (3.3)? Can you find an explanation?
No more questions, but you can improve many parts of your code or play with different environments. Some suggestions are given below.
Value Iteration. If you did not have enough time during the TP session to implement Value Iteration, now is the time to do it.
Boltzmann Q-Learning. One of the most effective algorithms (without an explicit focus on exploration) is a version of Q-Learning where Gibbs sampling is used to choose actions. In short, the exploration policy chooses an action a in state s with probability

\[P(a | s) = \frac{\exp( \beta Q(s,a))}{\sum_{a'}{\exp( \beta Q(s,a'))}}\]

where \(\beta\) is slowly increased. Maybe you could define a BoltzmanPolicy class?
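Such a policy could sample actions like this (a hypothetical helper, not part of the TP files, with the usual max-subtraction trick for numerical stability):

```python
import math
import random

def boltzmann_action(q_row, beta):
    """Sample an action with probability proportional to exp(beta * Q(s, a)).
    q_row is the row of q-values for the current state."""
    m = max(q_row)                       # subtract the max before exponentiating
    weights = [math.exp(beta * (q - m)) for q in q_row]
    r = random.random() * sum(weights)
    for a, w in enumerate(weights):      # roulette-wheel selection
        r -= w
        if r <= 0:
            return a
    return len(q_row) - 1                # guard against rounding

random.seed(0)
# with a large beta the choice is almost greedy...
mostly_greedy = [boltzmann_action([1.0, 2.0], beta=5.0) for _ in range(100)]
# ...and with beta = 0 it is uniform over the actions
uniform = [boltzmann_action([1.0, 2.0], beta=0.0) for _ in range(200)]
```

Slowly increasing beta then moves the policy smoothly from pure exploration to pure exploitation.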
Toy-Text environments from OpenAI Gym (see Toy_text) have discrete spaces, so you can use your algorithms on them. For example, if you want to try to optimize your taxi-driving skills, try
import gymnasium as gym
env = gym.make('Taxi-v3')
...
Complete working code is here : tp_basicRL.tgz