
intrinsic motivation experiments

dependencies

pip3 install numpy matplotlib sklearn torch opencv-python Pillow pyglet

for RTX 30-series GPUs:

pip3 install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

gym environments installation

pip3 install 'gym[atari, accept-rom-license]==0.21'
pip3 install pybullet
pip3 install procgen

random network distillation

main idea: two models, one random (target) and a second trained to imitate it; the difference between their outputs is the motivation (internal reward)

paper: Exploration by Random Network Distillation

diagram

my idea: siamese network distillation (SND)

main idea: the target RND model should be learned too, to achieve metric-like properties and provide a stronger motivation signal, without vanishing during training

diagram

  • projection of the RND feature space into 2D (512 dims to 2 dims using t-SNE)
  • colors represent different rooms in the Montezuma's Revenge environment
  • the common RND (random orthogonal init) produces isolated islands, where in-room states are tightly grouped

diagram

  • the siamese distiller spreads features not only across different rooms, but also within each room

diagram

  • this agent was able to discover 15 rooms (vs. 9) and achieves a higher score

graph

  • as the contrastive loss, a plain L2 (MSE) loss was used, where close states were labeled 0 and different states were labeled 1; a minimal sketch is shown below
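
A minimal sketch of this objective, assuming a `target_model` that maps states to feature vectors and batches of state pairs with 0/1 float labels (all names here are illustrative, not the repository's actual API):

```python
import torch
import torch.nn.functional as F

def siamese_target_loss(target_model, states_a, states_b, labels):
    # labels: 0.0 for close (similar) state pairs, 1.0 for different state pairs
    za = target_model(states_a)                  # (batch, features)
    zb = target_model(states_b)                  # (batch, features)

    # per-pair squared L2 distance between feature vectors
    distance = ((za - zb) ** 2).mean(dim=1)

    # plain MSE: pull close pairs toward distance 0, push different pairs toward distance 1
    return F.mse_loss(distance, labels)
```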

key points

generic bullet points

  • the input for all experiments is 4x96x96
  • the state is divided by 256, to provide values in $$ s \in \langle 0, 1) $$
  • for RND, only the mean has to be subtracted from the state; dividing by the std causes huge instability (see the preprocessing sketch after this list)
  • the internal motivation is not divided by its std - it can be, but I haven't observed an improvement (yet)
  • advantages are not normalised by mean or std
  • two heads are used for the critic, ValueInt and ValueExt
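
A minimal preprocessing sketch following the points above (the running mean `rnd_state_mean` is assumed to be tracked elsewhere; names are illustrative):

```python
import numpy as np

def preprocess_state(frame_u8, rnd_state_mean):
    # policy input: scale raw uint8 pixels into <0, 1)
    state = frame_u8.astype(np.float32) / 256.0

    # RND input: subtract the running mean only; dividing by the std
    # is avoided because it makes training unstable
    rnd_state = state - rnd_state_mean

    return state, rnd_state
```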

hyperparameters

  • two gammas are used, $$ \gamma_{ext} = 0.998 $$ and $$ \gamma_{int} = 0.99 $$
  • 128 parallel envs
  • 128 PPO steps
  • 4 training epochs
  • entropy regularisation 0.001
  • learning rate is set to 0.0001, but the loss for the critic is scaled by 0.5
  • the MSE loss for RND is regularised by keeping a random 25% of its elements non-zero (for 128 envs; 100% for 32 envs); this effectively slows down RND learning and keeps the internal motivation in a useful range (see the sketch after this list)
  • advantage scaling is $$ adv_{ext} = 2.0 $$ and $$ adv_{int} = 1.0 $$
  • PPO clip set to 0.1
  • gradient clip set to 0.5
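
A sketch of the RND loss regularisation and the advantage scaling from the list above, assuming PyTorch tensors (the helper names are hypothetical):

```python
import torch

def rnd_loss(features_target, features_pred, keep_ratio=0.25):
    # per-element MSE between (detached) target and predictor features
    loss = (features_target.detach() - features_pred) ** 2

    # keep only a random fraction of the elements (25% for 128 envs, 100% for 32 envs);
    # this slows RND learning and keeps the internal motivation in a useful range
    mask = (torch.rand_like(loss) < keep_ratio).float()
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

def combined_advantages(adv_ext, adv_int):
    # external and internal advantage streams, mixed with the fixed scaling above
    return 2.0 * adv_ext + 1.0 * adv_int
```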

Results

comparison of RND baseline, SND, and SND + entropy:

RND model

  • for the RND model use ELU activation (ReLU or leaky ReLU doesn't work)
  • for the RND model use orthogonal weight init (gain sqrt(2)), to provide sufficient output amplitude
  • RND motivation is computed as the sum of squared feature differences divided by 2, not as the mean: $$ reward_{int} = \frac{1}{2} \sum_{i} \left( RNDfeatures_{target,i} - RNDfeatures_{pred,i} \right)^2 $$ (see the sketch below)
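
Expressed as a small PyTorch helper (a hypothetical function, following the formula above):

```python
import torch

def rnd_internal_reward(features_target, features_pred):
    # sum of squared feature differences over the feature dimension, divided by 2
    return ((features_target - features_pred) ** 2).sum(dim=1) / 2.0
```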

PPO model

  • both 8x8 and 3x3 kernels on the first layer seem to work
  • after 3..4 conv layers, an additional FC hidden layer is required for the features (512..800 units)
  • the actor and the critic have another hidden layer (512..800 units)
  • the features backbone weight init is best orthogonal with gain sqrt(2)
  • the policy (actor) weight init is best orthogonal with gain 0.01
  • the critic weight init is best orthogonal, with gain 0.1 for the hidden layer and 0.01 for the output (see the init sketch after this list)
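
A small sketch of the init scheme described above (the layer names in the usage comments are illustrative):

```python
import torch.nn as nn

def init_orthogonal(layer, gain):
    # orthogonal weight init with the given gain, zero bias
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.zeros_(layer.bias)

# usage, following the gains listed above:
# init_orthogonal(backbone_fc,   gain=2.0 ** 0.5)  # features backbone: sqrt(2)
# init_orthogonal(actor_output,  gain=0.01)        # policy (actor) output
# init_orthogonal(critic_hidden, gain=0.1)         # critic hidden layer
# init_orthogonal(critic_output, gain=0.01)        # critic output
```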

model details

model RND

  • input is a single frame, 1x96x96
  • 3 conv layers
  • the trained (predictor) model needs to be deeper (3 FC layers); a sketch follows the diagram below
  • ELU activations perform best
  • to avoid an "almost zero" output, both models use orthogonal weight init with gain sqrt(2) and zero bias

diagram
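
A minimal sketch of such a target/predictor pair (the channel counts and the 512-dim feature size are assumptions for illustration, not the exact architecture from the diagram):

```python
import torch.nn as nn

def conv_elu(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ELU())

class RNDTarget(nn.Module):
    # fixed random target: 3 conv layers on a single 1x96x96 frame, one FC output
    def __init__(self, features=512):
        super().__init__()
        self.layers = nn.Sequential(
            conv_elu(1, 32), conv_elu(32, 64), conv_elu(64, 64),
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, features),
        )
        # orthogonal init with gain sqrt(2) and zero bias, as noted above
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.orthogonal_(m.weight, 2.0 ** 0.5)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.layers(x)

class RNDPredictor(nn.Module):
    # trained predictor: same conv trunk, but with a deeper head (3 FC layers)
    def __init__(self, features=512):
        super().__init__()
        self.layers = nn.Sequential(
            conv_elu(1, 32), conv_elu(32, 64), conv_elu(64, 64),
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, features), nn.ELU(),
            nn.Linear(features, features), nn.ELU(),
            nn.Linear(features, features),
        )
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.orthogonal_(m.weight, 2.0 ** 0.5)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.layers(x)
```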

model PPO

  • input is four frames, 4x96x96 or 12x96x96

  • 3..4 conv layers

  • separate critic heads for internal and external motivation - the internal motivation is non-stationary

  • a few FC layers are needed for the features head, otherwise the agent learns only the pure policy

  • ReLU activations perform best

  • for the features, orthogonal init with gain sqrt(2)

  • all models use zero initial bias (a sketch of such a model follows this list)
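
An illustrative actor-critic along these lines (layer sizes, kernel sizes and the action count are assumptions, not the exact models A..D below):

```python
import torch.nn as nn

class PPOModel(nn.Module):
    # shared conv backbone + FC features head, one actor head,
    # and separate external/internal value heads
    def __init__(self, in_channels=4, n_actions=18, features=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, features), nn.ReLU(),
        )
        self.actor = nn.Sequential(nn.Linear(features, features), nn.ReLU(),
                                   nn.Linear(features, n_actions))
        self.critic_ext = nn.Sequential(nn.Linear(features, features), nn.ReLU(),
                                        nn.Linear(features, 1))
        self.critic_int = nn.Sequential(nn.Linear(features, features), nn.ReLU(),
                                        nn.Linear(features, 1))

    def forward(self, state):
        z = self.backbone(state)
        return self.actor(z), self.critic_ext(z), self.critic_int(z)
```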

model A

diagram

model B

diagram

model C

diagram

model D

diagram diagram diagram

parameters

| name | model architecture | normalise int reward | critic hidden init range |
|------|--------------------|----------------------|--------------------------|
| ppo_rnd_a0 | A | false | 0.1 |
| ppo_rnd_b0 | B | false | 0.1 |
| ppo_rnd_b1 | B | false | sqrt(2) |
| ppo_rnd_b2 | B | true | 0.1 |
| ppo_rnd_b3 | B | true | sqrt(2) |
| ppo_rnd_c0 | C | false | 0.1 |
| ppo_rnd_c1 | C | false | sqrt(2) |
| ppo_rnd_d0 | D | false | 0.1 |

montezuma revenge RND results

animation

model A result

ppo_rnd_a0 results

model ppo_rnd_a0

graph

model B result

ppo_rnd_b0 results

model ppo_rnd_b0

graph

ppo_rnd_b1 results

model ppo_rnd_b1

graph

ppo_rnd_b2 results

model ppo_rnd_b2

graph

ppo_rnd_b3 results

model ppo_rnd_b3

graph

model C result

ppo_rnd_c0 results

model ppo_rnd_c0

graph

ppo_rnd_c1 results

model ppo_rnd_c1

graph

model D result

ppo_rnd_d0 results

model ppo_rnd_d0

graph

montezuma revenge SND results

  • the agent is able to achieve a score of 10 000+
  • it explores 14..16 rooms
  • this score is achieved with 100x fewer samples than RND

animation

model C result

ppo_snd_c_0 results

model ppo_snd_c_0

graph

ppo_snd_c_3 results

model ppo_snd_c_3

graph

ppo_snd_c_4 results

model ppo_snd_c_4

graph

gravitar

animation

graph

graph

venture

animation

graph

graph

pitfall

animation

graph

graph

breakout - WITHOUT external rewards

animation

graph

pacman - WITHOUT external rewards

animation

graph

dependencies

cmake python3 python3-pip

basic python libs: pip3 install numpy matplotlib torch torchviz pillow opencv-python networkx

environments: pip3 install gym pybullet pybulletgym 'gym[atari]' 'gym[box2d]' gym-super-mario-bros gym_2048

my RLAgents github: RLAgents
