This project provides a framework for experimenting with the Reinforcement Learning concepts from [1], using the Swift programming language. It is primarily intended to help me build an understanding of the general reinforcement learning problem and of some of the approaches to solving it.
A Markov Decision Process (MDP) is a way of modeling a transition system with the Markov property. Simply put, the Markov property states that the next state of the system depends only on its current state. In other words, it does not matter what events occurred up until the current state; everything that is important about the system is captured in the current state. Systems that exhibit the Markov property can be analyzed using techniques like policy iteration and value iteration.
When presented with an MDP, we have all the information necessary to compute an optimal policy, which can then be used to determine the optimal value function.
An MDP consists of the following:
- S - the set of states that the system can be in.
- A(s) - the set of actions that can be taken from a given state.
- T(s, a) -> (s', r) - the transition function, which, given a state and an action, transitions to the next state, and yields a reward.
It is worth noting that the transition function T can be non-deterministic. When an action is selected for a given state, one of several (or many!) possible transitions may actually occur, according to a distribution of transitions.
The notion of a state is quite generic in this framework. You can define any type that conforms to the `Hashable` protocol. Your choice of data representation is very important, since the state representation is what determines whether your transition system will have the Markov property. Strive to provide as compact a representation as possible. Also, states should use value semantics.
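For example, a compact grid-position state might look like the following minimal sketch (the type name and fields are illustrative, not part of the framework):

```swift
// A compact, value-typed state for a grid world.
// Hashable conformance is synthesized because both stored
// properties are themselves Hashable.
struct GridState: Hashable {
    let row: Int
    let column: Int
}
```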
An action represents a choice made by the agent. Actions are best represented with Swift `enum` types. Not only does this keep all of the actions for a particular MDP together, it also lets you use features like associated values to allow an agent to pass data along with an action (e.g. placing a bet in a game of chance). Like states, actions must conform to the `Hashable` protocol.
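A sketch of what such an action type might look like, including an associated value for passing data along with the action (the names here are illustrative only):

```swift
// Actions for a simple betting game. The `bet` case carries an
// associated value so the agent can choose how much to wager.
// Hashable conformance is synthesized because the associated
// value type (Int) is itself Hashable.
enum BlackjackAction: Hashable {
    case hit
    case stand
    case bet(amount: Int)
}
```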
Rewards are numerical amounts which are given each time a transition occurs. In this framework they are of type `Double`, and can be positive, negative, or zero. No special care is taken to prevent infinite rewards, but these should be avoided, since they will destroy any hope of convergence.
A value function is a mapping either from states to expected long-term rewards, or from state-action pairs to expected long-term rewards. Value functions are defined with respect to a given policy, because it is necessary to know what actions will be taken, either from the given state (for a state-value function) or after taking the given action from the given state (for an action-value function).
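Conceptually, tabular versions of the two kinds of value function can be pictured as dictionaries keyed by state or by state-action pair. The sketch below is an illustration only (it reuses the illustrative `GridState` type from above and invents a `Move` enum); the framework's own value-function types may differ.

```swift
// A simple action type for the illustration.
enum Move: Hashable { case up, down, left, right }

// A state-value function maps each state to the expected long-term
// reward obtained by following a particular policy from that state.
var stateValues: [GridState: Double] = [:]
stateValues[GridState(row: 1, column: 1)] = 0.5

// An action-value function maps each state-action pair to the expected
// long-term reward obtained by taking that action and then following
// the policy afterwards.
struct StateAction: Hashable {
    let state: GridState
    let move: Move
}
var actionValues: [StateAction: Double] = [:]
actionValues[StateAction(state: GridState(row: 1, column: 1), move: .up)] = 0.25
```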
The `Transition` type pairs a state with a reward. It is important to realize that a transition represents an edge label in the graph representation of a transition system. As mentioned before, the transition function in an MDP is actually stochastic. This can be modelled by defining a `Distribution` of `Transition` objects for each action that can be selected in a state.
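As a rough picture of the idea, the sketch below pairs a state with a reward and weights the possible outcomes of a single action. The field names and the generic declaration are guesses based on the description above, not the framework's actual declarations, and the weighted-pair representation stands in for a real `Distribution`.

```swift
// A transition labels an edge with the state it leads to and the
// reward collected for following it.
struct Transition<State: Hashable>: Hashable {
    let state: State
    let reward: Double
}

// A stochastic transition function can be pictured as mapping each
// (state, action) pair to a distribution over transitions. Here the
// "distribution" is just a list of weighted outcomes summing to 1.0.
let windyStep: [(weight: Double, outcome: Transition<GridState>)] = [
    (weight: 0.8, outcome: Transition(state: GridState(row: 1, column: 2), reward: 0.0)),  // intended move
    (weight: 0.2, outcome: Transition(state: GridState(row: 0, column: 1), reward: -1.0)), // blown off course
]
```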
While the Markov Decision Process model provides all of the information necessary to compute an optimal policy, this is not usually available to an agent. Instead, the agent is typically confronted with an environment, where it must learn from experience what "rewards" await it (remember, a reward can be negative as well as positive). In this framework, the agent-environment interface is provided by the `Environment` protocol. If you already have an MDP, then you can easily hide it behind an `Environment` interface with the `MdpEnvironment` class. All that is required besides the actual MDP is the initial state to place the environment in.
The framework defines the following protocols in order to lay out the Reinforcement Learning problem in general terms. These protocols employ associated types so that they can be used to model different types of worlds.
- `Distribution` - A basic interface for a probability distribution that lets you define the event type. Useful for modeling problems with random variables. Used to provide fuzzy transitions.
- `MarkovDecisionProcess` - A stochastic process that is modeled as a transition machine, which is composed of a set of states that are connected by actions. The actions are fuzzy, meaning that an agent can choose to perform an action, but the model may yield a transition drawn from a distribution. A transition is composed of a state and a reward. A transition causes the machine to enter the state and yield the reward, which is collected each time that transition is followed. This model provides more information than an agent would normally receive.
- `Environment` - In a true Reinforcement Learning setting the agent does not have the insight into the world that is provided by a Markov Decision Process. The `Environment` protocol presents a narrower interface to agents, one that provides a new state and reward when an agent chooses an action from the current state. Think of sitting down in front of the game Zork for the first time. In that setting you were the agent, and the command line was the environment.
- `Policy` - A policy represents a plan or strategy that an agent will follow ("when in this state, perform this action") in order to maximize long-term gains. Policies are probabilistic. Rather than give a single action, they return the probability of taking an action from the given state.
- `StepFunction` - A function that controls the step size during Q-Learning.
- `WeightedDistribution` - Represents a simple distribution of weighted events. The weights, which are of type `Double`, must sum to `1.0`.
- `BinDistribution` - Represents a distribution of events that all have an equal chance of occurring (think of a bin full of balls). Very convenient, and very space-inefficient, since even identical events are represented by separate values.
- `DistributionEstimator` - Represents a sample-based estimate of a distribution, built by counting events as they occur.
- `RandomSelectPolicy` - A policy that randomly selects from all available actions, with equal probability.
- `StochasticPolicy` - A policy that randomly selects one of a distribution of equally probable actions. This differs from `RandomSelectPolicy` in that the distributions are configurable, and it does not require a `MarkovDecisionProcess` to provide the available actions. This is used by the `PolicyImprover` to track more than one "best" action for any given situation, which allows for more exploration of the state space when using the policy.
- `EpsilonGreedyPolicy` - A policy that chooses the best possible action for a given state most of the time, but occasionally (with probability `epsilon`) selects a random action. This is used to tune the exploit/explore qualities of a reinforcement learning algorithm. If `epsilon` is `0.0` then this will act as a greedy policy. A sketch of the idea follows this list.
- `ConstantStepFunction` - A step function that maintains a constant step size over all episodes.
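The epsilon-greedy idea itself is simple. The standalone sketch below illustrates the technique; it is not the framework's `EpsilonGreedyPolicy` implementation, and the function name and dictionary-based estimates are assumptions made for the example.

```swift
// Standalone illustration of epsilon-greedy action selection: with
// probability epsilon pick a random action (explore), otherwise pick
// the action with the highest estimated value (exploit).
func epsilonGreedy<Action: Hashable>(
    estimates: [Action: Double],
    epsilon: Double
) -> Action? {
    guard !estimates.isEmpty else { return nil }
    if Double.random(in: 0..<1) < epsilon {
        return estimates.keys.randomElement()
    }
    return estimates.max { $0.value < $1.value }?.key
}

// With epsilon == 0.0 this always exploits, i.e. it acts greedily.
```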
- `TableDrivenMDP` - Provides a tabular implementation of the `MarkovDecisionProcess` protocol. Tabular representations provide a common means of defining MDPs, but they require space for all of the possible states, along with all of the transitions between those states. Great for smaller problems.
- `GridWorld` - A tabular implementation of the Grid World MDP from Sutton and Barto. This implementation sets up the grid with walls around the perimeter, each of which has a reward value of -1.0. The grid can then be customized by adding two types of features. The first is a nexus: a grid square that, once entered, leads to a target grid square no matter which direction the agent chooses. The second is a vortex: a grid square from which there is no escape. The reward for each feature can be specified. If you want to create a GridWorld with an end goal, you can do so by adding a vortex with a positive reward.
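To make the features concrete, here is a purely illustrative layout of such a grid, expressed as plain data rather than framework calls (none of this is the `GridWorld` API):

```swift
// Illustrative 4x4 grid surrounded by walls. "W" marks a perimeter
// wall square (reward -1.0), "N" a nexus square that jumps the agent
// to the square marked "T", "V" a vortex used as the goal (no escape,
// positive reward), and "." an ordinary square.
let layout: [[Character]] = [
    ["W", "W", "W", "W", "W", "W"],
    ["W", "T", ".", ".", "N", "W"],
    ["W", ".", ".", ".", ".", "W"],
    ["W", ".", ".", ".", ".", "W"],
    ["W", ".", ".", ".", "V", "W"],
    ["W", "W", "W", "W", "W", "W"],
]
```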
- `PolicyEvaluator` - This class implements policy evaluation in order to determine the value function for a given policy.
- `PolicyIterator` - This class implements policy iteration in order to improve a given policy. It also implements generalized policy iteration (GPI) in order to find an optimal policy within a given tolerance.
- `ValueIterator` - This class implements value iteration in order to find an optimal policy within a given tolerance.
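As a refresher on what value iteration actually does, here is a self-contained sketch over a toy deterministic tabular MDP. It does not use the framework's `ValueIterator`; the states, actions, and rewards are made up for the example.

```swift
// Self-contained sketch of value iteration on a tiny deterministic MDP.
// States are Ints; transitions[s][a] gives (next state, reward).
let transitions: [Int: [String: (next: Int, reward: Double)]] = [
    0: ["right": (next: 1, reward: 0.0)],
    1: ["right": (next: 2, reward: 1.0)],
    2: [:],  // terminal: no actions available
]

let gamma = 0.9      // discount factor
let tolerance = 1e-6
var values = [0: 0.0, 1: 0.0, 2: 0.0]

// Repeatedly back up each state from its best available action until
// the largest change falls below the tolerance.
var delta = Double.infinity
while delta > tolerance {
    delta = 0.0
    for (state, actions) in transitions {
        guard !actions.isEmpty else { continue }
        let best = actions.values
            .map { $0.reward + gamma * values[$0.next]! }
            .max()!
        delta = max(delta, abs(best - values[state]!))
        values[state] = best
    }
}
// `values` now approximates the optimal state-value function.
```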
- `QLearner` - Implements Q-learning, which is an off-policy reinforcement learning technique. Supports a "backwards" learning mode that updates the estimates at the end of the episode by capturing all transitions and then replaying them in reverse. This allows all updates to be based on estimates from the current episode, which can improve convergence [2]. The update rule is sketched after this list.
- `DecayingStepFunction` - A step function that decays as more values are seen for each state/action pair. Can be configured to cap the decay at a minimum step size.
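The core of the one-step Q-learning update, independent of the `QLearner` class itself, is shown below. The function name and the dictionary-of-dictionaries table are illustrative choices, not the framework's API.

```swift
// Illustrative one-step Q-learning update. `q` holds the current
// action-value estimates, keyed by state and then by action.
func qLearningUpdate(
    q: inout [Int: [String: Double]],
    state: Int,
    action: String,
    reward: Double,
    nextState: Int,
    alpha: Double,   // step size
    gamma: Double    // discount factor
) {
    // Best estimated value available from the next state (0 if unseen).
    let nextBest = q[nextState]?.values.max() ?? 0.0
    let current = q[state]?[action] ?? 0.0
    // Move the estimate toward the bootstrapped target:
    // r + gamma * max over a' of Q(s', a').
    q[state, default: [:]][action] = current + alpha * (reward + gamma * nextBest - current)
}
```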
- `getStateValue` - A state-value function (i.e. V_pi(s)) used for policy evaluation.
- `getActionValue` - An action-value function (i.e. Q_pi(s, a)) used for policy improvement.
[1] R. S. Sutton and A. G. Barto, *Reinforcement Learning: An Introduction*, 2nd ed. Cambridge, MA: The MIT Press, 2018.
[2] T. Mitchell, *Machine Learning*. McGraw-Hill, 1997.