License : Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Copyright :
CentraleSupelec
Last modified : April 15, 2024 10:31
Link to the source : index.md
In this last episode of our series, we will explore ONE algorithm/article of Deep Reinforcement Learning: DQN, for Deep Q-Network. I am sure you have already guessed the purpose of this Artificial (Deep) Neural Network. If not, well…
WARNING We will run this Labwork on GPU (not CPU) WARNING
As we are on the GPU (yes, I insist), we need to create a new virtual environment for Python. This time, we will use a different command to take advantage of the fact that pytorch is already installed on the GPU cluster. So, here are the commands:
You need to download the requirements.txt file.
mylogin@mymachine:~$ mkdir oh_what_a_nice_TP
mylogin@mymachine:~$ cd oh_what_a_nice_TP
mylogin@mymachine:~$ python3 -m venv --system-site-packages rlenv
mylogin@mymachine:~$ source ./rlenv/bin/activate
(rlenv) mylogin@mymachine:~$ pip install -r requirements.txt
The new argument is --system-site-packages, which allows the virtual environment to use the already installed libs and modules (like pytorch). For other explanations, refer to the previous Labworks.
For this part, you need these 4 files: env_wrapper.py, test_wrapper.py, memory.py, interactive_environment.py.
As shown by the following python script, we will be playing, nearly literally, with the Space Invaders Atari environment from OpenAI gym.
(rlenv) mylogin@mymachine:~$ python test_wrapper.py 0
In addition to displaying the obs/state of the environment, this script also gives you the semantics of the different actions. You can then test your skills - and get some knowledge of the environment - by triggering actions with your keyboard. Keys Z, Q and D are mapped to actions 1, 2, 3 (FIRE, RIGHT, LEFT).
(rlenv) mylogin@mymachine:~$ python interactive_environment.py
❓ Question (A.0) ❓ What are the observation and action spaces?
❓ Question (A.1) ❓ Does the game start at once? What happens? Can you modify the script test_wrapper.py, around question 0, to get a more precise idea of what happens?
The original paper from the DeepMind team is here: mnih15_dqnAtari_color.pdf. You can take the time to read the whole paper if you like, but only the highlighted passages are required reading.
❓ Question (B.1) ❓ From reading the paragraph highlighted in green in the Mnih et al. paper, explain, maybe with a quick drawing, the main idea used in this paper.
❓ Question (B.2) ❓ From reading the paragraph highlighted in pink in the Mnih et al. paper, explain how the network input is produced from the observation received from the environment.
It seems that one of the steps is to reduce/crop the image and transform it to grayscale. The OpenAI gym way of applying such transformations to the state/obs is to use a Wrapper environment around the original env. An example of such a wrapped env is seen with:
(rlenv) mylogin@mymachine:~$ python test_wrapper.py 1
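As an illustration of what such a wrapper computes, here is a dependency-free sketch of the transformation itself (this is NOT the actual code of env_wrapper.py; the function name and the 84×84 target size are assumptions based on the paper):

```python
import numpy as np

def preprocess(frame, out_h=84, out_w=84):
    """Turn an RGB Atari frame (H, W, 3) into a small grayscale image.

    Grayscale uses the standard luminance weights; resizing is a crude
    nearest-neighbour index map to avoid extra libraries (a real wrapper
    would typically use cv2 or PIL instead).
    """
    gray = frame @ np.array([0.299, 0.587, 0.114])   # (H, W) floats
    rows = np.arange(out_h) * frame.shape[0] // out_h  # row indices to keep
    cols = np.arange(out_w) * frame.shape[1] // out_w  # column indices to keep
    return gray[rows][:, cols].astype(np.uint8)        # (out_h, out_w)
```

On a raw Space Invaders frame of shape (210, 160, 3), this returns an (84, 84) uint8 array, i.e. the kind of input the network will consume.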
❓ Question (B.3) ❓ Using the class FrameStack in env_wrapper.py, build an environment that returns a state made of 2 stacked images of the original environment. You may need to act once if you want two really different images. Such an environment is tested with:
(rlenv) mylogin@mymachine:~$ python test_wrapper.py 2
❓ Question (B.4) ❓ And now, produce an environment that stacks 4 grayscale images. Test with:
(rlenv) mylogin@mymachine:~$ python test_wrapper.py 3
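The mechanism behind frame stacking is simple enough to sketch with a deque (illustrative only; the class name FrameStacker and its API below are made up and differ from the FrameStack class of env_wrapper.py):

```python
from collections import deque
import numpy as np

class FrameStacker:
    """Keep the k most recent frames and return them stacked along a
    new leading axis, so the network can perceive motion."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)  # old frames fall off automatically

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame so the state
        # shape is valid from step 0
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return np.stack(self.frames)

    def step(self, new_frame):
        self.frames.append(new_frame)
        return np.stack(self.frames)
```

With k=4 and 84×84 grayscale frames, each state has shape (4, 84, 84), which matches the 4-channel network input discussed in the next questions.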
❓ Question (B.5) ❓ From reading the paragraph highlighted in blue in the Mnih et al. paper, make a schematic of the various layers of the DQN network. Specify the type, dimension and connectivity of each layer.
The next step is, of course, to implement this network. Please do that in the class DQN_bn of the file nn_models.py.
❓ Question (B.6) ❓ After having implemented your DQN, test to see if your architecture seems correct by running:
(rlenv) mylogin@mymachine:~$ python nn_models.py
This test uses a valid Atari environment as close as possible to the one actually used in the original paper. See env_wrapper.py (or ask questions) for details :o)
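To sanity-check the dimensions in your schematic and your implementation: a convolution without padding maps a spatial size s to (s - kernel) // stride + 1. With the layer parameters of the 2015 paper (32 filters 8×8 stride 4, then 64 filters 4×4 stride 2, then 64 filters 3×3 stride 1, then a 512-unit dense layer), an 84×84×4 input gives:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution with no padding."""
    return (size - kernel) // stride + 1

h = conv_out(84, 8, 4)  # after conv1: 20x20, 32 channels
h = conv_out(h, 4, 2)   # after conv2: 9x9, 64 channels
h = conv_out(h, 3, 1)   # after conv3: 7x7, 64 channels
flat = 64 * h * h       # 3136 features feeding the 512-unit dense layer
```

If your flattened size is not 3136, one of your kernel/stride choices does not match the paper.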
Still a few steps, and you will be able to run and test DQN on the Space Invaders environment. Get dqn_atari.py. We need to talk about the difference between the policy net and the target net, and also a bit about the Replay Buffer.
Let’s write down the pseudo-code of the algorithm we use in the learn_net() and test_net() methods of dqn_atari.py.
for i_ep in range(num_episodes):
    # 0. Initialize the environment and state
    # Inner loop runs until break (called when the episode is done)
    for t in count():
        # 1+2. Select and perform an action, using an epsilon-greedy policy
        # 3. Store the transition in memory (i.e. the Replay Buffer)
        # 4. Optimize the model using a mini-batch from the Replay Buffer
    # 5. Update the target network, copying all weights and biases of the DQN
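Step #1+2 balances exploration and exploitation: with probability EPSILON take a random action, otherwise take the argmax of the Q-values. A minimal, framework-free sketch (the function name and signature are assumptions, not the actual code of dqn_atari.py):

```python
import random

def select_action(q_values, epsilon, num_actions):
    """Epsilon-greedy selection over already-computed Q-values."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                   # explore
    return max(range(num_actions), key=lambda a: q_values[a])  # exploit
```

In the real script, q_values would come from a forward pass of the policy net on the current state.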
As highlighted in yellow in the original paper, DQN relies on a Replay-Buffer to do experience replay
. It is linked to steps #3 and #4 in the previous pseudo-code.
In fact, this Replay-Buffer
is hidden in some of the environment wrappers we are using when we call
# create environment
env = gym.make( "ALE/"+args.game, frameskip=1)
env = env_wrapper.make_env( env, ...
in dqn_atari.py
.
❓ Question (C.1) ❓ Can you pinpoint in which part of dqn_atari.py the replay-buffer should be used? Using which class?
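Independently of where it lives in the lab's code, the replay buffer itself is a small data structure. Here is an illustrative sketch (the lab's own version is in memory.py and may well differ):

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition",
                        ("state", "action", "reward", "next_state", "done"))

class ReplayBuffer:
    """Fixed-capacity buffer of transitions with uniform sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions,
        # which is the key point of experience replay
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from old and recent experience is exactly the "experience replay" mechanism highlighted in yellow in the paper.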
In my implementation of DQN, I use two neural networks. The “classical” net described in the paper is in fact what I call the policy_net. It is defined in dqn_atari.py with
# Policy Net: QValue net used to select_action
policy_net = nn_models.DQN_bn(logger,
nb_input_channels=4,
nb_actions=dim_output).to(device)
The role of the target_net is to bring a bit more stability to the learning, which can be unstable. It is used in two places. In the optimize() method, you can see:
next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
expected_state_action_values = (next_state_values * GAMMA) + reward_batch
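Per transition, those two lines implement the Bellman target, with the bootstrap term dropped for terminal states (that is what non_final_mask encodes). In scalar form (a sketch with a hypothetical function name, not code from the script):

```python
GAMMA = 0.99

def td_target(reward, next_q_max, done, gamma=GAMMA):
    """y = r + gamma * max_a' Q_target(s', a');
    terminal transitions keep only the immediate reward."""
    return reward + (0.0 if done else gamma * next_q_max)
```

The policy_net is then optimized so that Q(s, a) matches y; the detach() call ensures no gradient flows back into the target_net.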
❓ Question (C.2) ❓ To which line of ‘Algorithm 1’ in the original paper does this target net contribute?
Besides, in step #5 of the learn_net()
method of dqn_atari.py
, you find these lines:
if i_ep % target_update == 0:
target_net.load_state_dict(policy_net.state_dict())
❓ Question (C.3) ❓ What is the purpose of these lines? What value of the parameter target_update will give you a behavior as close as possible to the original paper?
Before being able to use this target network, we must define it.
❓ Question (C.4) ❓ Complete the dqn_atari.py
script by creating the target net after the policy_net.
# Target Net: QValue net used to compute V(state),
# in computing loss for optimizing model
target_net = ...
It is time to complete our last function, the one that implements Algorithm 1 of the original paper. Taking inspiration from the test_net() method, fill in steps #1, #2, #3 and #4 of the learn_net() method of dqn_atari.py.
❓ Question (C.5) ❓ Get a proper, running algorithm.
Everything is in place, you can run your tests… and choose the correct values for the parameters. Some parameters MUST be set on the command line. Get help with:
(rlenv) mylogin@mymachine:~$ python dqn_atari.py --help
Some parameters are set directly in the script; in particular we have:
# hyperparameters
BATCH_SIZE = 32
GAMMA = 0.99
EPSILON = 0.1
LEARN_RATE = 1e-4
TARGET_UPDATE = 10
NUM_EPISODES_TRAIN = args.num_train
NUM_EPISODES_TEST = args.num_test
MEMORY_SIZE = 10000
As a bare minimum, you have to specify num_train and num_test, for example:
(rlenv) mylogin@mymachine:~$ python dqn_atari.py --num_train 1 --num_test 1
You can also use Tensorboard
while/after the script is running. Just run
(rlenv) mylogin@mymachine:~$ tensorboard --logdir runs
and open your browser on localhost:6006
.
Plus, you can make movies of the test runs after learning, using the -m option of dqn_atari. Movies are saved in the $SRC/videos/dqnBN_SpaceInvaders_video directory and can be played with vlc.
Modify dqn_atari.py to use a decreasing EPSILON during learning. Some hints are given in Section 5 of the original paper. Here is a link to how I did it: tp_deepRL.tgz
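A common choice is a linear annealing schedule: in the paper, epsilon goes from 1.0 to 0.1 over the first million frames and stays at 0.1 afterwards. A sketch (the function name and keyword defaults are mine, not from the lab code):

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end over
    decay_steps environment steps, then hold it constant."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Early on the agent explores almost uniformly at random; by the end of the schedule it mostly exploits its learned Q-values.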