MAPPO Project Report

Result

In this project, I modified the interface of the Overcooked environment and implemented a runner for it so that it can be trained with the mappo library. I trained agents on five scenarios, each with $400$ time steps and a policy network shared between the agents. For each scenario I record two curves: the average number of successes across rollouts during training, and the number of successes during evaluation.

[Figures: training-time and evaluation success curves for the five scenarios, two curves per scenario.]

As the curves show, in all scenarios the agents reach a reward of $200$ during evaluation; since each delivered soup yields a reward of $20$, this corresponds to successfully delivering $10$ soups.

Details

I modified the interface of overcooked_ai_py.mdp.overcooked_env.Overcooked to make it compatible with the mappo library, implemented the OvercookedRunner class, and wrote the training scripts. To reproduce the results, first install the dependencies as described in the Overcooked and mappo repositories, then run the scripts inside ./on-policy/onpolicy/scripts/train_overcooked_scripts/.
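To give a sense of what this interface change involves, below is a minimal sketch of the kind of adapter needed: it wraps a two-player Overcooked environment so that each agent receives its own observation, reward, and done flag, which is the shape of data an on-policy multi-agent runner typically consumes. The class and helper names (OvercookedMAWrapper, featurize_fn) are assumptions for illustration, not the repository's actual code.

```python
# Illustrative adapter sketch; not the repository's actual implementation.
import numpy as np


class OvercookedMAWrapper:
    def __init__(self, base_env, featurize_fn, num_agents=2):
        self.base_env = base_env          # underlying Overcooked environment
        self.featurize_fn = featurize_fn  # maps a raw state to per-agent features
        self.num_agents = num_agents

    def reset(self):
        self.base_env.reset()
        obs = self.featurize_fn(self.base_env.state)  # one array per agent
        return [np.asarray(o, dtype=np.float32) for o in obs]

    def step(self, joint_action):
        # Assumes base_env.step takes the joint action and returns
        # (next_state, shared_reward, done, info).
        next_state, shared_reward, done, info = self.base_env.step(joint_action)
        obs = [np.asarray(o, dtype=np.float32)
               for o in self.featurize_fn(next_state)]
        # Overcooked gives a single shared reward; duplicate it for each agent.
        rewards = [[shared_reward] for _ in range(self.num_agents)]
        dones = [done for _ in range(self.num_agents)]
        return obs, rewards, dones, info
```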

PPO Algorithm

I have written a blog post about PPO for the course Introduction to AI taught by Prof. Yi Wu.

Key Hyperparameters

  • ppo_epoch: number of epochs in one round of PPO optimization, i.e. how many passes are made over the collected rollout data.
  • clip_param: the $\epsilon$ inside $\mathrm{clip}(r_{t}(\theta), 1-\epsilon, 1+\epsilon)$, which prevents overly large policy updates (see the sketch after this list).
  • num_mini_batch: number of mini-batches the rollout data is split into for each PPO epoch.
  • entropy_coef: weight of the entropy bonus in the loss function, which encourages exploration.
  • gamma: the ordinary discount factor for rewards in the Markov decision process.
  • value_loss_coef: weight of the value-function error in the loss function.
  • gae_lambda: the $\lambda$ in generalized advantage estimation, used to estimate the advantage function; $\lambda$ trades off bootstrapping from the value function against Monte Carlo estimation.
  • huber_delta: the parameter $\delta$ in the Huber loss, which makes training less sensitive to outliers than a squared-error loss.
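
The sketch below shows how these hyperparameters enter a PPO update, assuming log-probabilities, value predictions, and rollout rewards have already been collected. Function names and default values are illustrative, not the mappo library's code; done-flag masking is omitted for brevity.

```python
# Illustrative PPO pieces, assuming 1-D tensors of rollout data.
import torch
import torch.nn.functional as F


def compute_gae(rewards, values, gamma=0.99, gae_lambda=0.95):
    # rewards: tensor of length T; values: tensor of length T + 1
    # (includes a bootstrap value for the state after the last step).
    # gae_lambda trades off bootstrapping from the value function (low lambda)
    # against Monte Carlo returns (high lambda).
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
    return advantages


def ppo_loss(new_log_prob, old_log_prob, advantages, values, returns, entropy,
             clip_param=0.2, value_loss_coef=0.5, entropy_coef=0.01,
             huber_delta=10.0):
    # Probability ratio r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t).
    ratio = torch.exp(new_log_prob - old_log_prob)
    # Clipped surrogate objective: clip(r_t, 1 - eps, 1 + eps) limits the update size.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()
    # Huber loss on the value function is less sensitive to outliers than squared error.
    value_loss = F.huber_loss(values, returns, delta=huber_delta)
    # Entropy bonus encourages exploration.
    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy.mean()
```

During optimization this loss would be minimized for ppo_epoch passes over the rollout data, with each pass split into num_mini_batch mini-batches.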

Problem and Discussion

I found that it is extremely inefficient for the agents to learn from scratch, as they do in this project. It can take the agents a long time to discover the actions that yield large rewards; once such actions are found, learning becomes fast. I therefore think it may be better to pretrain the agents on human behaviour and only then run PPO training.
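
As a rough illustration of that pretraining idea, here is a hedged sketch of behaviour cloning the shared policy on human trajectories before PPO fine-tuning. The dataset tensors and the policy interface are assumptions; this is not part of the repository.

```python
# Illustrative behaviour-cloning pretraining sketch, not repository code.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def pretrain_behaviour_cloning(policy, human_obs, human_actions,
                               epochs=10, lr=1e-3, batch_size=256):
    # policy: module mapping an observation batch to action logits.
    # human_obs: (N, obs_dim) float tensor; human_actions: (N,) long tensor.
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(human_obs, human_actions),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for obs, actions in loader:
            logits = policy(obs)
            loss = loss_fn(logits, actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```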
