License : Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Copyright : CentraleSupelec
Last modified : April 15, 2024 10:31
Link to the source : index.md

TP Classical Dynamic Programming and Reinforcement Learning

Introduction: Environment and Algorithm

In this series of TP (practical exercises), we are going to use the very trendy OpenAI Gymnasium API. Unfortunately for Mr Frezza-Buet (and for me, in fact), it means we are going to use the Python language…

Let’s be serious now. In the following parts, we are going to separate the models/environments from the algorithms/agents. The main idea is to be able to use a given algorithm on any model (provided the algorithm is compatible with the characteristics of the model, as you were about to tell me, of course…)

This TP is organized in three major parts, each subdivided into numbered sections. Of course, this advice is silly and not needed, but I strongly advise you to read an entire section before answering the questions at the end of it.

Let’s start. For this, we first need a proper development context, and we need to become familiar with the gym.Env class, which we will extend so it can be used with Dynamic Programming algorithms.

Setting up the python3 virtual environment

Goal: set up a proper python3 virtual programming environment, with the gym module from OpenAI.

IF you already have a working virtual environment from a previous labwork, you only need to activate it with source ./rlenv/bin/activate.

IF you do not have a virtual environment or prefer to start from scratch, follow these instructions.

Using virtualenv, the following commands will help you create a basic virtual environment (not to be confused with the environment or model in Reinforcement Learning) called rlenv, with python3 as the default python. In this virtual environment, you can add/remove/use any python module or version without damaging your default python environment.

First, you need to be sure that virtualenv for python3 is installed on your computer.

mylogin@mymachine:~$ sudo apt install python3-venv

Create a new directory (say oh_what_a_nice_TP) and cd into it

mylogin@mymachine:~$ mkdir oh_what_a_nice_TP
mylogin@mymachine:~$ cd oh_what_a_nice_TP

Then create a virtual environment named rlenv, with python3 being the default python in it.

mylogin@mymachine:~$ virtualenv -p python3 rlenv

Activate this rlenv

mylogin@mymachine:~$ source ./rlenv/bin/activate

You should notice a (rlenv) in front of your prompt, as a reminder.

We will now install all required python modules. For this, grab this requirements.txt file. Then use it to install all needed modules with

(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt

You can check what has been locally installed in this virtual env by

(rlenv) mylogin@mymachine:~$ pip list

Hopefully, OpenAI gym will be listed :o)

Check also that the default python is python3.

(rlenv) mylogin@mymachine:~$ python --version
Python 3.7.3

When you are finished, at the end of this TP, you can get out of this virtual environment with

(rlenv) mylogin@mymachine:~$ deactivate
mylogin@mymachine:~$ 

Notice that the (rlenv) has disappeared from before your prompt.

Warning: if you just deactivated the rlenv virtual environment, activate it again now, as it is needed for the rest of the TP. (You know, the source ./rlenv/bin/activate command we just saw.)

n-Chain Problem and Dynamic Programming

Goals:

  1. update gym.Env with methods for Dynamic Programming for the n-Chain problem.
  2. test with a simple policy (from the Policy class), and compute the value of this policy
  3. find the optimal policy using (mixte) Policy Iteration
  4. develop Value Iteration

You need these files: environments.py, algo.py, basic_env.py and test_nchain.py.

0) Familiarize yourself with OpenAI gym.Env

Run this part with the following command (the ‘0’ argument means you are working on question 0):

(rlenv) mylogin@mymachine:~$ python test_nchain.py 0

Remark: in the text we still write gym, even though the gym modules are now in fact the gymnasium modules. In the first lines of test_nchain.py, you will find a classical “trick” to adapt “old” scripts to the new gymnasium. These lines are:

# many "old" code are written using `gym`, this trick allow a transparent
# use of the newer `gymnasium`
import gymnasium as gym   

Go on and open test_nchain.py. The first part of the script (before ‘question 1’) shows the classical use of a gym.Env: create it, initialize it (using reset) and simulate it (using step). Use this file to understand the basic methods of the gym.Env class and how they are used (read the explanations in the comments of the class Env in basic_env.py). If you mastered the first TP on “car rental”, all of this should already be familiar. Once this is understood, you can attack questions 1.1 and 1.2 below.
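
As a reminder of this pattern, here is a minimal sketch of the classic interaction loop, written directly against the gymnasium API. The ‘CartPole-v1’ environment is only an illustrative placeholder, not the one used in this TP.

import gymnasium as gym

# Minimal sketch of the classic gym.Env loop: create, reset, then step until
# the episode ends ('CartPole-v1' is just an example environment).
env = gym.make('CartPole-v1')
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # here, a purely random policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()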

1) Update the gym.Env for ‘n-Chain’ and complete eval_policy()

Run this part with the following command (the ‘1’ argument means you are working on question 1):

(rlenv) mylogin@mymachine:~$ python test_nchain.py 1

By running test_nchain.py you should be able to create a gym.Env for the n-Chain problem illustrated in the Figure below, and to compute the value of a randomly initialized policy (once the code is complete; for now, it raises a NotImplementedError 😈). This file is abundantly commented and, by answering the questions at the end of this section, you should be able to run Iterative Policy Evaluation on the n-Chain problem. The next two Figures and the code/comments in environments.py, algo.py and test_nchain.py will get you going.

n-Chain problem: n states, 2 actions, a probability p of slipping (the effect of the action is the opposite of what is intended); rewards are denoted in red as +X
Iterative Policy Evaluation
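
To fix ideas, here is a minimal sketch of this iterative scheme in plain numpy. It assumes the environment can be queried for its transition model as P[s][a], a list of (proba, next_state, reward) triplets; this layout is hypothetical, and the actual methods you add to environments.py in Question (1.2) below may expose the information differently.

import numpy as np

def evaluate_policy_sketch(P, nb_states, policy, gamma=0.95, tol=1e-6):
    """Iterative evaluation of a deterministic policy.

    P[s][a] is assumed to be a list of (proba, next_state, reward) triplets
    (a hypothetical layout); policy[s] is the action chosen in state s.
    """
    value = np.zeros(nb_states)
    while True:
        delta = 0.0
        for s in range(nb_states):
            v_new = sum(p * (r + gamma * value[s_next])
                        for p, s_next, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - value[s]))
            value[s] = v_new
        if delta < tol:
            return value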

Question (1.1) Why is the raw gym.Env not adapted to Dynamic Programming algorithms like Policy Iteration or Value Iteration?

Question (1.2) Complete the proper methods in the class NChainEnv (in environments.py) to allow querying the missing information, and update the method policy_evaluation() in algo.py accordingly.

Question (1.3) OPTIONAL Can you guess or estimate (without simulation, by an analytic calculation) the maximum (and minimum) values of the value function of an optimal policy? That could help you when testing your code…

2) Implement ‘(mixte) Policy Iteration’ to find the optimal policy

Run this part with the following command (the argument is the question number, here 4).

(rlenv) mylogin@mymachine:~$ python test_nchain.py 4

We are now concerned with Question 4 in test_nchain.py. Knowing how to compute the value function of a policy, we “only” need a greedy update of this policy, using this value function, to design a new and better policy. This is the essence of the (mixte) Policy Iteration algorithm of the next Figure. The next two questions will lead you through two phases (as always, the idea is to build one brick after another: testing small bits is easier than debugging large parts).

Policy Iteration: this pseudo-code has a small bug, the algorithm could switch between two different optimal policies without ever terminating.

Question (1.4) Implement the method policy_improvement() of algo.py. The signature of the function is given, and test_nchain.py tests that it works.
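
If it helps, a possible shape for this greedy step is sketched below. It reuses the hypothetical P[s][a] layout of the previous sketch; the real signature given in algo.py takes precedence.

import numpy as np

def improve_policy_sketch(P, nb_states, nb_actions, value, gamma=0.95):
    """Greedy improvement: in each state, pick the action with the best
    one-step look-ahead according to the current value function.
    P[s][a] is the same hypothetical (proba, next_state, reward) layout."""
    new_policy = np.zeros(nb_states, dtype=int)
    for s in range(nb_states):
        q_sa = [sum(p * (r + gamma * value[s_next]) for p, s_next, r in P[s][a])
                for a in range(nb_actions)]
        new_policy[s] = int(np.argmax(q_sa))
    return new_policy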

Now, we can run the program for question 5.

(rlenv) mylogin@mymachine:~$ python test_nchain.py 5

You now have all the parts needed to implement Policy Iteration. You will see that the function signature of policy_iteration() in algo.py asks for a max_iterations argument; it is one crude way to stop the algorithm when the bug mentioned under the pseudo-code is encountered.
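
A possible outer loop, reusing the two previous sketches (again, this is only a sketch; the real policy_iteration() in algo.py has its own signature):

import numpy as np

def policy_iteration_sketch(P, nb_states, nb_actions, gamma=0.95, max_iterations=100):
    """Alternate evaluation and greedy improvement until the policy is stable,
    or until max_iterations is reached (the crude guard against the bug
    mentioned under the pseudo-code).  Uses evaluate_policy_sketch() and
    improve_policy_sketch() from the sketches above."""
    policy = np.zeros(nb_states, dtype=int)
    value = np.zeros(nb_states)
    for _ in range(max_iterations):
        value = evaluate_policy_sketch(P, nb_states, policy, gamma)
        new_policy = improve_policy_sketch(P, nb_states, nb_actions, value, gamma)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, value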

Question (1.5) Implement the method policy_iteration() of algo.py.

Question (1.6) OPTIONAL Why is this algorithm called “mixte” Policy Iteration? (The answer is not in any part of the TP.)

3) [OPTIONAL] Value Iteration

You can run this part with

(rlenv) mylogin@mymachine:~$ python test_nchain.py 7

ONLY IF you feel you are ahead of schedule, you can implement Value Iteration. The signature of the method is given in algo.py and the algorithm, as written by Sutton and Barto, is in the following figure.

Value Iteration
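
For reference, a compact sketch of the corresponding update, still with the hypothetical P[s][a] layout used in the previous sketches:

import numpy as np

def value_iteration_sketch(P, nb_states, nb_actions, gamma=0.95, tol=1e-6):
    """Value Iteration: repeatedly apply the Bellman optimality backup, then
    read the greedy policy off the converged value function."""
    value = np.zeros(nb_states)
    while True:
        delta = 0.0
        for s in range(nb_states):
            best = max(sum(p * (r + gamma * value[s_next]) for p, s_next, r in P[s][a])
                       for a in range(nb_actions))
            delta = max(delta, abs(best - value[s]))
            value[s] = best
        if delta < tol:
            break
    # greedy policy extraction: one last one-step look-ahead per state
    policy = np.array([
        max(range(nb_actions),
            key=lambda a: sum(p * (r + gamma * value[s_next]) for p, s_next, r in P[s][a]))
        for s in range(nb_states)])
    return policy, value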

Question (1.7) Implement Value Iteration and discuss its performance compared to Policy Iteration.

Tabular Reinforcement Learning

Goals:

  1. learn an optimal action-value function using Q-Learning
  2. use an on-policy learning algorithm (SARSA)

We will now address the problem of reinforcement learning with algorithms that can be used even when the learning agent has no access to the transition probabilities. In this part, the central object, the action-value function Q(s,a), will be stored in a tabular way, and I strongly suggest you use a numpy.array to do so.

So, from a more technical point of view, the tabular ‘Q’ function can be created and initialized to 0 with:

import numpy as np
qvalue = np.zeros( (nb_states, nb_actions) )

Then, Q(s,a) is qvalue[state_index, action_index], where state_index and action_index are valid ints, meaning they belong to the state and action spaces. For example, one can get and set q-values using:

qvalue_of_s_and_a = qvalue[s,a]
qvalue[s,a] = 2 + qvalue[next_s,a] * 0.1
qvalue[s,a] *= 0.2
# etc...

1) Basic tabular Q-Learning

You need a new file: test_rl_nchain.py

You can run this part with

(rlenv) mylogin@mymachine:~$ python test_rl_nchain.py 1

In ‘test_rl_nchain.py’, the idea is to run Q-Learning on the previous n-Chain problem. The “only” thing to do is to implement the function q_learning() in ‘algo.py’; the pseudo-code of the algorithm is given in the Figure below.

Q-Learning algorithm: an off-policy TD control algorithm
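
The heart of the algorithm is the temporal-difference backup below; this is only a sketch of the update itself (the episode loop, the exploration policy and the exact signature are up to your q_learning() in ‘algo.py’, and the function name q_learning_update is hypothetical).

def q_learning_update(qvalue, s, a, reward, next_s, alpha=0.1, gamma=0.95):
    """One Q-Learning backup on the tabular array described above:
    move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = reward + gamma * qvalue[next_s].max()   # bootstrap on the greedy next action
    qvalue[s, a] += alpha * (td_target - qvalue[s, a])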

Once this algorithm has computed the optimal action-value function ‘Q’, you can derive the optimal greedy policy using the class EpsilonGreedyPolicy in ‘algo.py’. An example is given in ‘test_rl_nchain.py’ (and, yes, you can ask yourself, and ourselves, why an epsilon of 0.0 is chosen).
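
If you prefer to see it without the class: on a tabular Q, the purely greedy policy is nothing more than an argmax over the actions of each state (a one-liner on the qvalue array created above; EpsilonGreedyPolicy with epsilon=0.0 plays the same role).

import numpy as np

# With the tabular qvalue array defined earlier, the greedy policy simply picks,
# in each state, the action with the largest Q-value.
greedy_policy = np.argmax(qvalue, axis=1)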

Question (2.1) Implement Q-Learning in ‘algo.py’. Check on the ‘n-Chain problem’ that the optimal policy is the same as the one computed with the Dynamic Programming algorithms. What is the fastest method? Any comment?

Question (2.2) At this point, you can/should be curious and look at the evolution of the optimal policy when you increase the length of the chain (between 5 and 10, for example). For a chain of length 7, which parameter of the algorithm can you play with so that the policy which always chooses the action “forward” (action 0) is one of the optimal policies? Explain why.

2) SARSA, on-policy Reinforcement Learning

While Q-Learning is an off-policy algorithm, SARSA (see Figure below) is an on-policy algorithm.

SARSA algorithm: an on-policy TD control algorithm
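
The update is almost the same as for Q-Learning, except that the bootstrap uses the action actually selected in the next state by the behaviour policy, not the max. A sketch (the name sarsa_update is hypothetical; the real work happens in sarsa() in ‘algo.py’):

def sarsa_update(qvalue, s, a, reward, next_s, next_a, alpha=0.1, gamma=0.95):
    """One SARSA backup: move Q(s,a) toward r + gamma * Q(s',a'),
    where a' is the action the behaviour policy actually chose in s'."""
    td_target = reward + gamma * qvalue[next_s, next_a]   # on-policy bootstrap
    qvalue[s, a] += alpha * (td_target - qvalue[s, a])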

Now, surprisingly (!), you can use:

(rlenv) mylogin@mymachine:~$ python test_rl_nchain.py 3

Question (2.3) Implement the method sarsa() in the file ‘algo.py’. Compare Q-Learning and SARSA on the n-Chain problem of length 10, with max_iteration=20. What happens? Why?

Exploration / Exploitation dilemma

Goals:

  1. implement the ‘Cliff Walking’ problem, a more complex setting
  2. try various exploration schemes with SARSA and Q-Learning
  3. discuss the results of Sutton & Barto

In this part, you will need a new file: test_rl_cliff.py.

1) The ‘Cliff Walking’ problem

In Sutton & Barto’s book on Reinforcement Learning (2nd edition), the problem is described as:

[As depicted in the next Figure,] this is a standard [gridworld] undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is −1 on all transitions except those into the region marked “The Cliff.” Stepping into this region incurs a reward of −100 and sends the agent instantly back to the start.

Cliff Walking Problem: from Sutton & Barto book on Reinforcement Learning (2nd edition)

Using

(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 0

you can see how the CliffWalking environment is created and how it can be displayed.

Then for the next question, use

(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 1

Question (3.1) Implement the step() method of the class CliffWalking in the file ‘environments.py’. Optionally, you can use the p_slip parameter of the class to set up a non-deterministic environment: at each step, the agent has a probability p_slip of slipping and taking a random action (while still believing it tried its chosen action). A possible sketch of this slipping mechanic is given just below.
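
The slipping mechanic itself can be as simple as the following sketch (the helper name apply_slip is hypothetical; only the real CliffWalking class in ‘environments.py’ is authoritative):

import numpy as np

def apply_slip(action, nb_actions, p_slip):
    """With probability p_slip, the intended action is replaced by a uniformly
    random one; otherwise the intended action is kept."""
    if np.random.rand() < p_slip:
        return np.random.randint(nb_actions)   # the agent slips
    return action                              # no slip: intended action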

Then, for the next question, use

(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 2

Question (3.2) Using the file ‘test_rl_cliff.py’, see what kind of trajectory a random policy can generate in this environment. Can it easily get to the goal? On average, how many iterations does this policy need to reach the goal?

2) Exploration vs Exploitation

This is question 3.3 and the following ones, so run

(rlenv) mylogin@mymachine:~$ python test_rl_cliff.py 3

This should allow you to run SARSA and Q-Learning on the ‘Cliff Walking’ problem for 500 episodes. Both use EpsilonGreedyPolicy(epsilon=0.1) as an “exploratory” policy. Then, a plot showing the evolution of the sum of rewards for each learning episode is generated.

Question (3.4) Analyze the generated plot. Which algorithm seems to be the better one? Why? Or, if you cannot decide, describe the advantages and limitations of both.

Question (3.5) Is it better to learn with longer or with shorter episodes (play with the max_iteration parameter of the algorithm)? Can you explain why?

In fact, the python code of question (3.4) is not “fair”, in the sense that these algorithms do not have the same goal. Q-Learning was not designed to optimize the reward gathered during learning; in its original form, it could even be run with a uniform random policy until convergence (remember, it is off-policy). And, once it has converged, the policy used to test it should be the deterministic greedy policy.
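
One possible way to make the comparison fairer is sketched below. It assumes two hypothetical helpers that you would write yourself: learn_some_episodes(env, qvalue, n), which runs n Q-Learning episodes with an exploratory policy, and run_greedy_episode(env, policy), which plays one episode with a fixed deterministic policy and returns its sum of rewards.

import numpy as np

def greedy_learning_curve(env, qvalue, learn_some_episodes, run_greedy_episode,
                          every=10, n_checkpoints=50):
    """Alternate `every` learning episodes with one evaluation episode played by
    the current deterministic greedy policy (epsilon = 0), and record the return
    of each evaluation episode."""
    curve = []
    for _ in range(n_checkpoints):
        learn_some_episodes(env, qvalue, every)          # Q-Learning keeps exploring here
        greedy = np.argmax(qvalue, axis=1)               # frozen greedy policy
        curve.append(run_greedy_episode(env, greedy))    # measured without learning
    return curve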

Question (3.6) Use Q-Learning (with whatever exploration policy you want) to learn the optimal policy. Display and test this optimal deterministic policy. In fact, can you make a plot where the current greedy policy learned by Q-Learning is tested every 10 episodes until convergence?

Question (3.7) How can you use SARSA to learn the same “true” optimal policy? Do it! :o)

3) What about the results of Sutton & Barto

In Sutton & Barto’s book on Reinforcement Learning (2nd edition), they introduce the ‘Cliff Walking’ example to highlight “the difference between on-policy (Sarsa) and off-policy (Q-learning) methods”. They present the Figure below and write:

The lower part of the figure shows the performance of the Sarsa and Q-learning methods with ε-greedy action selection, ε = 0.1. After an initial transient, Q-learning learns values for the optimal policy, that which travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the ε-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid. Although Q-learning actually learns the values of the optimal policy, its on-line performance is worse than that of Sarsa, which learns the roundabout policy.

Cliff Walking Results: from Sutton & Barto book on Reinforcement Learning (2nd edition)

Question (3.8) What surprises you in this plot, compared to the plot you obtained in Question (3.3)? Can you find an explanation?

Play by yourself

No more questions, but you can improve many parts of your code or play with different environments. Some suggestions are given below.

Value Iteration. If you did not have enough time during the TP session to implement Value Iteration, now is the time to do it.

Boltzmann Q-Learning. One of the most effective algorithms (without explicit focus on exploration) is a version of Q-Learning where actions are chosen by sampling from a Gibbs/Boltzmann (softmax) distribution. In short, the exploration policy chooses an action a in state s with probability

\[P(a | s) = \frac{\exp( \beta Q(s,a))}{\sum_{a'}{\exp( \beta Q(s,a'))}}\].

where \(\beta\) is a slowly increasing parameter (an inverse temperature). Maybe you could define a BoltzmannPolicy class? A possible sketch is given below.
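
A possible shape for such a class, as a sketch only (the class name comes from the suggestion above; everything else, constructor arguments included, is an assumption):

import numpy as np

class BoltzmannPolicy:
    """Softmax (Boltzmann/Gibbs) action selection over a tabular Q function."""

    def __init__(self, qvalue, beta=1.0):
        self.qvalue = qvalue      # numpy array of shape (nb_states, nb_actions)
        self.beta = beta          # inverse temperature, to be increased slowly

    def __call__(self, state):
        prefs = self.beta * self.qvalue[state]
        prefs -= prefs.max()                     # shift for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))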

Toy-Text environments from OpenAI Gym (see Toy_text) have discrete spaces, so you can use your algorithms on them. For example, if you want to try to optimize your taxi-driving skills, try

import gymnasium as gym
env = gym.make('Taxi-v3')
...

Solutions

Complete working code is here : tp_basicRL.tgz