License : Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Copyright : CentraleSupelec
Last modified : April 15, 2024 10:31
Link to the source : index.md

TP Deep Reinforcement Learning

Introduction: Environment and Algorithm

In this last episode of our series, we will explore ONE algorithm/article of Deep Reinforcement Learning : DQN, for Deep Q-Network. I am sure you have already guessed what the purpose of this Artificial (Deep) Neural Network is. If not, well…

WARNING We will run this Labwork on GPU (not CPU) WARNING

Virtual environment for GPU

As we are on the GPU (yes, I insist), we need to create a new virtual environment for Python. This time, we will use a different command to take advantage of the fact that PyTorch is already installed on the GPU cluster. So, here are the commands.

You need to download the requirements.txt file.

mylogin@mymachine:~$ mkdir oh_what_a_nice_TP
mylogin@mymachine:~$ cd oh_what_a_nice_TP
mylogin@mymachine:~$ python3 -m venv --system-site-packages rlenv
mylogin@mymachine:~$ source ./rlenv/bin/activate
(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt

The new argument is --system-site-packages, which allows the virtual environment to use the libraries and modules already installed on the system (such as PyTorch). For other explanations, refer to the previous Labworks.
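
To check that PyTorch is indeed visible from inside the virtual environment (and that the GPU is reachable), you can run a quick one-liner such as:

(rlenv) mylogin@mymachine:~$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If torch.cuda.is_available() prints False, you are probably not running on a GPU node.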

A) The environment : Space Invaders

For this part, you need these 4 files : env_wrapper.py, test_wrapper.py, memory.py, interactive_environment.py.

As shown by the following Python script, we will be playing, nearly literally, with the Space Invaders Atari environment from OpenAI gym.

(rlenv) mylogin@mymachine:~$ python test_wrapper.py 0

In addition to displaying the obs/state of the environment, this script also gives you the semantics of the different actions.

You can then try to test your skills - and get some knowledge of the environment - by triggering actions with your keyboard. Keys Z, Q and D are mapped to actions 1, 2, 3 (FIRE, RIGHT, LEFT).

(rlenv) mylogin@mymachine:~$ python interactive_environment.py

Question (A.0) What are the observation space and the action space ?

Question (A.1) Does the game start at once ? What happens ? Can you maybe modify the script test_wrapper.py, around question 0, to get a more precise idea of what happens ?
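
To help answer these two questions, a small probe along the following lines can be useful. It is only a sketch: the exact game id and the wrapper chain built in test_wrapper.py may differ (ALE/SpaceInvaders-v5 is assumed here).

    # Hypothetical probe (a sketch; adapt it to what test_wrapper.py actually builds)
    import gym

    env = gym.make("ALE/SpaceInvaders-v5", frameskip=1)   # assumed game id
    print(env.observation_space)                  # shape and dtype of the obs
    print(env.action_space)                       # number of discrete actions
    print(env.unwrapped.get_action_meanings())    # semantics of each action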

Understand DQN (from the original paper)

The original paper from the DeepMind team is here : mnih15_dqnAtari_color.pdf. You can take the time to read the whole paper if you like, but it is only necessary to read the highlighted passages.

B) Main idea of DQN

Question (B.1) From reading the paragraph highlighted in green in the Mnih et al. paper, explain, maybe with a quick drawing, what the main idea used in this paper is.

1) What is a “state”

Question (B.2) From reading the paragraph highlighted in pink in the Mnih et al. paper, explain how the network input is produced from the observation received from the environment.

It seems that one of the steps is to reduce/crop the image and transform it to greyscale. The OpenAI gym way of applying such transformations to the state/obs is to use a Wrapper environment around the original env. An example of such a wrapped env is seen with:

(rlenv) mylogin@mymachine:~$ python test_wrapper.py 1
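
For reference, an observation wrapper along the following lines could perform the greyscale + resize step. This is only a sketch under the standard Gym API; the wrappers actually used in this Labwork are the ones provided in env_wrapper.py.

    # Sketch of a Gym observation wrapper (illustration only, see env_wrapper.py)
    # Requires opencv-python (cv2) for the image operations.
    import cv2
    import numpy as np
    import gym

    class GrayResizeWrapper(gym.ObservationWrapper):
        """Convert RGB frames to 84x84 greyscale, as in Mnih et al. (2015)."""
        def __init__(self, env):
            super().__init__(env)
            self.observation_space = gym.spaces.Box(
                low=0, high=255, shape=(84, 84), dtype=np.uint8)

        def observation(self, obs):
            gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
            return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)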

Question (B.3) Using the class FrameStack in env_wrapper.py, build an environment that returns a state made of 2 stacked images of the original environment. You may need to act once if you want to get two really different images. Such an environment is tested with:

(rlenv) mylogin@mymachine:~$ python test_wrapper.py 2

Question (B.4) And now, produce an environment that stacks 4 grayscale images. Test with:

(rlenv) mylogin@mymachine:~$ python test_wrapper.py 3
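
As a point of comparison only (the questions above expect the FrameStack class from env_wrapper.py), recent gym releases ship built-in wrappers that achieve a similar preprocessing chain, roughly:

    # Comparison only; wrapper names and behaviour depend on your gym version
    import gym

    env = gym.make("ALE/SpaceInvaders-v5", frameskip=1)    # assumed game id
    env = gym.wrappers.GrayScaleObservation(env)           # RGB -> greyscale
    env = gym.wrappers.ResizeObservation(env, (84, 84))    # downscale frames
    env = gym.wrappers.FrameStack(env, num_stack=4)        # state = last 4 frames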

2) Build the DQN network

Question (B.5) From reading the paragraph highlighted in blue in the Mnih et al. paper, make a schematic of the various layers of the DQN network. Specify the type, dimension and connectivity of each layer.

The next step is, of course, to implement this network. Please do that in class DQN_bn of file nn_models.py.
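
As a reference point, the architecture described in the paper looks roughly like the sketch below. This is not the lab's DQN_bn class (the bn suffix suggests batch normalization layers are expected as well); it is only a minimal PyTorch transcription of the Nature-paper network, assuming 84x84 inputs with 4 stacked frames.

    # Minimal sketch of the Nature-paper DQN (not the DQN_bn class itself)
    import torch
    import torch.nn as nn

    class DQNSketch(nn.Module):
        def __init__(self, nb_input_channels=4, nb_actions=6):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(nb_input_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 assumes 84x84 inputs
                nn.Linear(512, nb_actions),              # one Q-value per action
            )

        def forward(self, x):
            return self.head(self.features(x))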

Question (B.6) After having implemented your DQN, test to see if your architecture seems correct by running:

(rlenv) mylogin@mymachine:~$ python nn_models.py

This test uses a valid Atari environment, as close as possible to the one really used in the original paper. See env_wrapper.py (or ask questions) for details :o)

C) Run and Test…

Still a few steps, and you will be able to run and test DQN on the Space Invaders environment. Get the file dqn_atari.py. We need to talk about the difference between the policy net and the target net, and also a bit about the Replay Buffer.

Let’s write down the pseudo-code of the algorithm we use in the learn_net() and test_net() methods of dqn_atari.py.

for i_ep in range( num_episodes ):
    # 0. Initialize the environment and state
    # Loop "forever" with count() from itertools;
    # break out when the episode is done
    for t in count():
        # 1+2. Select and perform an action, using the (epsilon-)greedy policy
        # 3. Store the transition in memory (i.e. the Replay Buffer)
        # 4. Optimize the model using a mini-batch from the Replay Buffer
    # 5. Update the target network, copying all weights and biases of the DQN
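
For steps #1+2, the usual ingredient is an epsilon-greedy action selection. Here is a minimal sketch; the helper in dqn_atari.py may be named and structured differently, and the exploration schedule is up to you.

    # Sketch of epsilon-greedy action selection (illustration only)
    import random
    import torch

    def select_action(policy_net, state, epsilon, nb_actions, device):
        if random.random() < epsilon:
            # explore: pick a random action
            return torch.tensor([[random.randrange(nb_actions)]],
                                device=device, dtype=torch.long)
        with torch.no_grad():
            # exploit: pick the action with the highest Q-value
            return policy_net(state).max(1)[1].view(1, 1)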

1) Replay-buffer

As highlighted in yellow in the original paper, DQN relies on a Replay-Buffer to do experience replay. It is linked to steps #3 and #4 in the previous pseudo-code.
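
Conceptually, the Replay-Buffer is just a bounded store of past transitions from which random mini-batches are sampled. A minimal sketch, for illustration only; the lab's own implementation is in memory.py and its interface may differ.

    # Minimal replay-buffer sketch (see memory.py for the one used in the lab)
    import random
    from collections import deque, namedtuple

    Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

    class ReplayBufferSketch:
        def __init__(self, capacity):
            self.buffer = deque(maxlen=capacity)   # oldest transitions dropped first

        def push(self, *args):
            self.buffer.append(Transition(*args))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)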

In fact, this Replay-Buffer is hidden in some of the environment wrappers we are using when we call

    # create environment
    env = gym.make( "ALE/"+args.game, frameskip=1)
    env = env_wrapper.make_env( env, ...

in dqn_atari.py.

Question (C.1) Can you pinpoint in which part of dqn_atari.py the replay-buffer should be used ? Using which class ?

2) policy_net and target_net

In my implementation of DQN, I use 2 neural networks. The “classical” net described in the paper is in fact what I call the policy_net. It is defined in dqn_atari.py with

    # Policy Net: QValue net used to select_action
    policy_net = nn_models.DQN_bn(logger,
                                  nb_input_channels=4,
                                  nb_actions=dim_output).to(device)
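
The .to(device) call above assumes a device object defined earlier in the script; a typical definition (dqn_atari.py has its own) is:

    # Typical device selection (sketch; check how dqn_atari.py actually defines it)
    import torch
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")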

The role of the target_net is to bring a bit more stability to the learning, which can otherwise be unstable. It is used in two places. In the optimize() method, you can see:

    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch
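
For context, expected_state_action_values is the Q-learning target y = r + gamma * max_a' Q_target(s', a'); it is then compared to Q(s, a) computed by the policy net, typically with a Huber (smooth L1) loss. A sketch of the remaining lines (the exact code lives in optimize() and the tensor names may differ):

    # Sketch of the rest of the loss computation (names may differ in optimize())
    state_action_values = policy_net(state_batch).gather(1, action_batch)
    loss = torch.nn.SmoothL1Loss()(state_action_values,
                                   expected_state_action_values.unsqueeze(1))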

Question (C.2) To which line of the ‘Algorithm 1’ in the original paper does this target net contribute ?

Besides, in step #5 of the learn_net() method of dqn_atari.py, you find these lines:

        if i_ep % target_update == 0:
            target_net.load_state_dict(policy_net.state_dict())

Question (C.3) What is the purpose of these lines ? What value of the parameter target_update will give you a behavior as close as possible to the original paper ?

Before being able to use this target network, we must define it.

Question (C.4) Complete the dqn_atari.py script by creating the target net after the policy_net.

    # Target Net: QValue net used to compute V(state),
    #             in computing loss for optimizing model
    target_net = ...

3) Learn to learn

It is time to complete our last function, the one that implements Algorithm 1 of the original paper. Taking inspiration from the test_net() method, fill in steps #1, #2, #3 and #4 of the learn_net() method of dqn_atari.py.

Question (C.5) Get a proper and running algorithm.

4) Run, Forrest, run…

Everything is in place: you can run your tests… and choose the correct values for the parameters. Some parameters MUST be set on the command line. Get help with

(rlenv) mylogin@mymachine:~$ python dqn_atari.py --help

Some parameters are set directly in the script, in particular we have :

    # hyperparameters
    BATCH_SIZE = 32
    GAMMA = 0.99
    EPSILON = 0.1
    LEARN_RATE = 1e-4
    TARGET_UPDATE = 10
    NUM_EPISODES_TRAIN = args.num_train
    NUM_EPISODES_TEST = args.num_test
    MEMORY_SIZE = 10000

As a bare minimum, you have to specify --num_train and --num_test, for example:

(rlenv) mylogin@mymachine:~$ python dqn_atari.py --num_train 1 --num_test 1

5) As a bonus.

You can also use TensorBoard while or after the script is running. Just run

(rlenv) mylogin@mymachine:~$ tensorboard --logdir runs

and open your browser on localhost:6006.

Plus, you can make movies of the test runs after learning, using the -m option of dqn_atari.py. The movies end up in the $SRC/videos/dqnBN_SpaceInvaders_video directory and can be played with VLC.

Going further.

And one partial solution is…

Here is a link to how I did it : tp_deepRL.tgz