Super Mario AI

This repository explores multiple artificial intelligence approaches used to train agents to play Super Mario Bros and Super Mario World.

Here's my YouTube video where I explain this project.

An AI agent trained with the NeuroEvolution of Augmenting Topologies (NEAT) algorithm successfully beating Level 1-1
(it even learned a frame-perfect, pixel-perfect glitch known as the 'wall jump').

Best fitness over generations of the previous agent (The best individual of generation 352 managed to beat the level).

NEAT Implementation
PPO Implementation
DDQN Implementation (WIP)

NEAT implementation

About NEAT

NEAT (NeuroEvolution of Augmenting Topologies) is an evolutionary algorithm used to evolve artificial neural networks. It gradually optimizes both network structure and weights through evolutionary processes. A full description of the algorithm is available in the original paper.
In this project, NEAT is implemented using the NEAT-Python library. The official documentation can be found here.

Here's the topology of the best evolved network from generation 352.

Environment

The Gym Super Mario Bros environment was used to provide the game interface and training environment for the agent. Documentation can be found here.

pip install gym-super-mario-bros

To play manually:

gym_super_mario_bros -e 'SuperMarioBrosRandomStages-v0' -m 'human' --stages '1-4'

Requirements

pip install gym==0.25.1 gym-super-mario-bros==7.4.0 nes-py==8.2.1 neat-python==1.1.0 numpy==1.26.4 opencv-python==4.12.0.88

Defining an input

To select an action, the neural network is fed the current environment observation at each timestep. In this implementation, the agent captures a screenshot of the current frame, converts it to grayscale, normalizes pixel values to the range [0,1], and reduces the image to a 1024-dimensional input vector.

In Mario, a single frame is not Markovian because it does not encode velocity. This is why stacking 4 frames together is typically considered a good way to approximate movement information. This implementation addresses the issue with a recurrent network instead, so temporal information is carried by the network’s internal state rather than by explicit frame stacking. In this particular case, experiments with frame stacking using a similar number of inputs proved ineffective, which further motivated the recurrent approach.
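The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the 1024 inputs come from a 32×32 image, uses plain NumPy with nearest-neighbour sampling instead of OpenCV, and the function name `preprocess_frame` is hypothetical.

```python
import numpy as np

def preprocess_frame(frame, size=32):
    """Turn an RGB frame (H, W, 3) into a flat vector of size*size values in [0, 1].

    Assumption: a 32x32 target, which yields the 1024 inputs mentioned above.
    """
    # Standard luminance weights for RGB -> grayscale conversion.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights            # shape (H, W)

    # Nearest-neighbour downsampling: pick evenly spaced rows and columns.
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    small = gray[np.ix_(rows, cols)]                      # shape (size, size)

    # Normalize to [0, 1] and flatten into the network's input vector.
    return np.clip(small / 255.0, 0.0, 1.0).ravel()

# Usage with a dummy NES-sized frame (240x256 RGB).
dummy = np.zeros((240, 256, 3), dtype=np.uint8)
inputs = preprocess_frame(dummy)
```

Because the network is recurrent, only one such vector is passed per timestep; no frame stack is built.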

The input image and its magnified version side by side.

Defining a Fitness Function

When working with genetic algorithms, success largely comes down to defining a fitness function that accurately evaluates how well individuals perform a given task.
In particular, to train an AI agent to clear, say, level 1-1 of Super Mario Bros, we first need to analyze what it means to be a good Mario player.

Here's an example of what a poorly defined fitness function can lead to (in this case jumping was rewarded too much).

In this implementation, the fitness function evaluates Mario’s speed and position along the x-axis. Penalties are applied if Mario gets stuck or dies. In addition, a small reward is given for jumping, to encourage the evolution of individuals that jump over obstacles and advance further.

Fitness Function

Let $t = 1, \dots, T$ denote timesteps until termination. The total fitness $F$ of an individual is defined as:

$$F = \max \Bigg( 0.1,\; \sum_{t=1}^{T} \Big( r_t + 0.01 + 0.1 \,(x_t - x_{t-1}) + 0.1 \,\mathbf{1}_{y_t > y_{t-1}} \Big) + \max_t(x_t) + R_{\text{term}} \Bigg)$$

where the terminal reward $R_{\text{term}}$ is:

$$R_{\text{term}} = \begin{cases} +10000 & \text{if the flag is reached} \\\ -150 & \text{if the agent dies before reaching the flag} \\\ -100 & \text{if the agent gets stuck} \end{cases}$$

$r_t$ is the environment reward from gym_super_mario_bros
$x_t, y_t$ are Mario's horizontal and vertical positions at timestep $t$ (with $x_0, y_0$ the starting position)
$\mathbf{1}_{y_t > y_{t-1}}$ is an indicator function that equals 1 when Mario's vertical position increased since the previous timestep (he is moving upward in a jump) and 0 otherwise
"stuck" is defined as no forward progress for more than 250 consecutive timesteps
The very large reward for reaching the flagpole makes it easy to see in which generation the agent first beat the level
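The formula above can be evaluated on a recorded episode as in the sketch below. It is an illustration only, not the code from the NEAT folder: the function name `fitness`, the outcome labels `"flag"`, `"dead"` and `"stuck"`, and the convention that `xs[0]`, `ys[0]` hold the starting position are all assumptions.

```python
def fitness(rewards, xs, ys, outcome):
    """Compute F for one episode.

    rewards: environment rewards r_1..r_T (length T)
    xs, ys:  positions x_0..x_T and y_0..y_T (length T + 1)
    outcome: hypothetical label, one of "flag", "dead", "stuck"
    """
    terminal = {"flag": 10000.0, "dead": -150.0, "stuck": -100.0}[outcome]

    total = 0.0
    for t in range(1, len(xs)):
        moved_up = 1.0 if ys[t] > ys[t - 1] else 0.0   # indicator term
        total += rewards[t - 1] + 0.01 + 0.1 * (xs[t] - xs[t - 1]) + 0.1 * moved_up

    # Add the furthest x reached and the terminal reward, then floor at 0.1.
    return max(0.1, total + max(xs[1:]) + terminal)

# Two timesteps, one upward movement, flag reached.
example = fitness(rewards=[1, 1], xs=[0, 10, 20], ys=[0, 5, 5], outcome="flag")
```

The floor at 0.1 keeps the fitness strictly positive, even for an individual that dies immediately.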

Try it Yourself

The code to train the agent can be found in the NEAT folder of this repository. It can easily be modified to change the agent's behaviour: for example, you can penalize slow agents to encourage the evolution of faster specimens, possibly approaching the pace of glitchless world records.

PPO implementation

About PPO

An agent trained only on Yoshi Island 1 beating Yoshi Island 2 (look at what he does with the green koopa).

Proximal Policy Optimization (PPO) is a policy gradient reinforcement learning algorithm developed by researchers at OpenAI. It has proven effective in challenging environments: OpenAI Five, an agent trained with PPO, was the first AI to beat the world champions in an esports game (Dota 2), which makes it a promising fit for Mario. Unlike off-policy algorithms such as DDQN, PPO does not use a replay buffer. As an on-policy method, it updates its policy using only data generated by the current policy.

This time, I implemented it with the stable-baselines3 library, following ClarityCoders's tutorial on PPO and adapting it to Super Mario World.

If you want to learn more about this algorithm, the original paper that describes it can be found here.
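For intuition, the core of PPO is its clipped surrogate objective, which limits how far a single update can move the policy away from the one that collected the data. The sketch below computes that objective with NumPy for a small batch. It is a standalone illustration, not stable-baselines3 code; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are choices made for this example.

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(np.asarray(new_logp, dtype=float) - np.asarray(old_logp, dtype=float))
    adv = np.asarray(advantages, dtype=float)

    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv

    # Taking the element-wise minimum removes the incentive to push the
    # ratio outside the [1 - eps, 1 + eps] interval.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Ratios 1.5 and 0.5 with advantages +1 and -1.
value = ppo_clip_objective(np.log([1.5, 0.5]), [0.0, 0.0], [1.0, -1.0])
```

In the example, both samples are clipped (to 1.2 and 0.8), so the objective stops rewarding changes beyond the clipping range.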

Note
The trained model zip files are approximately 260 MB and cannot be uploaded to GitHub due to size limits.

Requirements

You can import the requirements from this file.
This part uses Python 3.8.10 and pip 22.3.1.

Environment

For this part I used OpenAI's Gym Retro because it supports custom levels and allows direct access to RAM values, making it possible to inspect memory locations during gameplay.
I chose to train the agent on Super Mario World because it offers a more complex action space. In particular, I was curious whether an agent could learn trick jumps or glitches (given the right environment, an agent might even learn arbitrary code execution).

To do this I had to manually integrate the game in Gym Retro by:

Finding the memory addresses of relevant variables (position, lives, score, etc.)
Implementing a custom reward function
Adding the necessary JSON configuration files

The same agent from before beating Yoshi Island 1.

To use this integration, replace the files of the same name in your game directory with these JSON files.

For example, in scenario.json you can tweak these parameters, or add others, to influence the reward function. You can also feed the neural network only part of the screen by modifying "crop" (in my case the network received the full image as input). I also chose to end each episode (done) as soon as the player dies once, even though Mario starts with 5 lives.

{
  "crop": [
    0,
    0,
    0,
    0
  ],
  "done": {
    "variables": {
      "dead": {
        "op": "equal",
        "reference": 0
      }
    }
  },
  "reward": {
    "variables": {
      "mario_x": {
        "reward": 1.0
      },
      "score": {
        "reward": 0.01
      },
      "checkpoint": {
        "op": "delta",
        "measurement": "nonzero",
        "reward": 100.0
      },
      "level_end": {
        "reward": 2000.0
      }
    }
  }
}
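To get a feel for how the weights above balance each other, the sketch below treats each `"reward"` value as a simple multiplier on the step-to-step change of its variable. This is a deliberately simplified model and an assumption on my part: it does not reproduce Gym Retro's actual evaluation logic, and it leaves out `checkpoint` and `level_end`. The function name `step_reward` is hypothetical.

```python
# Weights copied from the "reward" section of scenario.json above.
WEIGHTS = {
    "mario_x": {"reward": 1.0},
    "score": {"reward": 0.01},
}

def step_reward(prev, curr, weights=WEIGHTS):
    """Simplified model: reward = sum of weight * change in each variable."""
    return sum(
        spec["reward"] * (curr[name] - prev[name])
        for name, spec in weights.items()
    )

# Moving 3 pixels right and gaining 200 points in one step.
r = step_reward({"mario_x": 100, "score": 0}, {"mario_x": 103, "score": 200})
```

Under this model, 200 points of score are worth about as much as 2 pixels of forward progress, so horizontal movement dominates the reward signal.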

Me searching for the memory address for holding a shell.

Network Architecture

This PPO implementation uses a Convolutional Neural Network (CNN) policy to process visual observations directly from the game screen.
You can visualize the architecture by printing model.policy:

ActorCriticCnnPolicy(
  (features_extractor): NatureCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 32, kernel_size=(8, 8), stride=(4, 4))
      (1): ReLU()
      (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (3): ReLU()
      (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
      (5): ReLU()
      (6): Flatten(start_dim=1, end_dim=-1)
    )
    (linear): Sequential(
      (0): Linear(in_features=43008, out_features=512, bias=True)
      (1): ReLU()
    )
  )
  (mlp_extractor): MlpExtractor(
    (shared_net): Sequential()
    (policy_net): Sequential()
    (value_net): Sequential()
  )
  (action_net): Linear(in_features=512, out_features=12, bias=True)
  (value_net): Linear(in_features=512, out_features=1, bias=True)
)
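The `in_features=43008` of the linear layer follows from the three convolutions printed above. The sketch below reproduces that arithmetic; the 224×256 input resolution is an assumption (it is the frame size that yields 43008), and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1

def nature_cnn_flat_size(height, width, out_channels=64):
    """Flattened feature count after the three conv layers shown above."""
    # (kernel, stride) pairs taken from the printed architecture.
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return out_channels * height * width

# Assumed input resolution: 224 x 256 RGB frames.
flat = nature_cnn_flat_size(224, 256)
```

The spatial size shrinks from 224×256 to 55×63, then 26×30, then 24×28, and 64 × 24 × 28 = 43008.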

Training On Random Levels

An agent trained on random levels beating Yoshi Island 2.

To encourage generalization, the agent can be trained across a list of similar levels rather than a single stage. During training, a level is randomly selected at the start of each episode. This setup reduces overfitting to a single stage and pushes the agent toward strategies that transfer across levels, resulting in more robust and versatile behavior (here, robust means succeeding on a variety of levels without per-level hyperparameter tuning).

Here is my implementation of this training setup.
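The per-episode level selection can be sketched independently of the emulator, as below. This is not the linked implementation: the class name `RandomLevelSampler` and the state names in `LEVELS` are hypothetical placeholders, and actually loading the chosen state into the environment is left out.

```python
import random

# Hypothetical state names; replace with the states available in your integration.
LEVELS = ["YoshiIsland1", "YoshiIsland2", "DonutPlains1"]

class RandomLevelSampler:
    """Picks the level to load at the start of each episode."""

    def __init__(self, levels, seed=None):
        if not levels:
            raise ValueError("at least one level is required")
        self.levels = list(levels)
        self.rng = random.Random(seed)   # private RNG, reproducible when seeded

    def next_level(self):
        # Uniform choice, so every level is seen equally often on average.
        return self.rng.choice(self.levels)

sampler = RandomLevelSampler(LEVELS, seed=0)
episode_levels = [sampler.next_level() for _ in range(5)]
```

In a training loop, `next_level()` would be called inside the environment's reset logic so that each episode starts from the sampled state.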

Donut Plains 3

An agent trained on Donut Plains 3 (In gym-retro it is called Donut Plains 4).

Donut Plains 3 is a challenging level for an AI because it includes many moving platforms. After some trial and error, the agent advances through the level surprisingly well. Unfortunately, midway through it tries to use the on/off switch but fails to wait for the moving platform. Nonetheless, it is impressive that an agent trained with PPO adapts to moving platforms in such a short time. With enough training time, and with a reward function shaped to also encourage waiting for the platform, an AI agent could possibly beat this level.

DDQN implementation

$${\color{red}WIP}$$

An agent I made in the past balancing an inverted pendulum (using a Q-table).

About DDQN

paper
