This whole note is under review. The following text is only a draft.
Reinforcement Learning (RL) is the study of agents that learn how to achieve some goal by maximizing rewards in some environment. It's inspired by the punishment and pleasure mechanisms present in animals, including humans. A living being, using its senses to get signals from the environment, learns from experience which actions should be taken and which should be avoided, all without a detailed prior model of how the environment works. RL builds upon this idea to create intelligent algorithms.
Reinforcement Learning is considered a third paradigm of Machine Learning, alongside supervised and unsupervised learning. It's not supervised because the agent doesn't have access to a previously labeled dataset to learn from. It's also not unsupervised because it's not purely trying to find patterns in unlabeled data.
When applying RL, it's important to model how the agent collects rewards as time goes by. A simple mathematical tool for modeling the passage of time is the Markov process. In the context of RL, an extension of the Markov process called the Markov Decision Process (MDP) is often used. MDPs are Markov processes with a reward attached to each state transition, and each transition is triggered by an action taken by the agent. Transitions may be stochastic, which means the same action in the same state may lead the agent to different states.
More formally, an MDP is a tuple M = (S, A, T, R, γ) where:
- S is the finite set of states of the environment
- A is the finite set of actions
- T: S × A × S → [0, 1] is the state transition function that gives the probability of ending up in state s' from state s after performing action a
- R: S × A × S → ℝ is the reward function that gives the reward obtained by the agent if it is in state s, takes action a and ends up in state s'
- 0 < γ < 1 is the discount factor (see discussion below)
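In code, one convenient way to represent this tuple is with arrays indexed by states and actions. Below is a minimal sketch, assuming NumPy and a made-up two-state, two-action MDP (the numbers are purely illustrative):

```python
import numpy as np

n_states, n_actions = 2, 2   # S and A are just index ranges here
gamma = 0.9                  # discount factor

T = np.zeros((n_states, n_actions, n_states))  # T[s, a, s2] = P(s2 | s, a)
R = np.zeros((n_states, n_actions, n_states))  # R[s, a, s2] = reward for that transition

for s in range(n_states):
    T[s, 0, s] = 1.0      # action 0: stay in the same state (reward 0)
    T[s, 1, 1 - s] = 1.0  # action 1: move to the other state...
    R[s, 1, 1 - s] = 1.0  # ...and collect a reward of 1
```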
MDPs can represent continuing and episodic tasks. A continuing task is one that is supposed to run forever. As an example, imagine a vacuum robot that is always cleaning and recharging itself. Conversely, episodic tasks are tasks that have a well-defined ending (such as a round of a game of chess). Episodic tasks have terminal states representing the end of an episode (e.g. the moment a player loses or wins the round). We can treat episodic tasks in the same way as continuing tasks if we consider that each terminal state only transitions to itself and generates a reward of zero. In this fashion, terminal states can also be called absorbing states.
In the context of MDPs, a policy π is a function that returns an action given a state (i.e. π: S → A). A policy models the behavior of an agent. From a policy, we can sample a walk in the MDP. This can be represented as a sequence of state-action-reward triplets. In mathematical notation:
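(s0, a0, r1), (s1, a1, r2), (s2, a2, r3), ...

where ai = π(si) and ri+1 is the reward obtained in the transition from si to si+1.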
Notice that the states si in the sequence may repeat, since an MDP can contain loops.
The return of a sample measures how good it is in terms of the rewards generated. More specifically, we are interested in knowing how well a sample has done from a specific state onward. This metric helps the agent pick its next move: not only do we want the agent to get a high reward when it transitions, but we also want it to move to a state with the potential of generating more good rewards. Focusing solely on the next reward could make the agent take actions that are good at first but lead it to a very bad state. It could also make the agent lose the opportunity of passing through low-reward states that would ultimately take it to a "pot of gold" in the end.
Although we could define the return of a state as the sum of the rewards obtained after that state, this would cause problems with continuing tasks that never reach a terminal state, or even with episodic tasks that contain deadlocks. A simple sum of rewards could make the return of a state infinite, and no comparisons would be possible. A better definition of return uses a discount factor γ that makes future rewards less important and yields a finite number even for an infinite sequence. Specifically, the return Gt from time step t in a sequence can be defined as:
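Gt = rt+1 + γrt+2 + γ²rt+3 + ... = Σk=0..∞ γ^k rt+k+1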
Notice the γ factor determines how much weight is given to future rewards. If γ is close to 0, the agent will act "myopically". As γ approaches 1, the agent gets more far-sighted. Nevertheless, the way γ is applied to future rewards forms a geometric series, which makes the return converge to a finite number even in an infinite sequence (assuming the rewards are bounded).
Because of the stochasticity of the MDP, we are not really interested in the return, which is specific to a single walk, but in the expectation of returns. We call that expectation the value of a state. The value function V is defined as follows:
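Vπ(s) = Eπ[Gt | st = s] = Eπ[rt+1 + γGt+1 | st = s] = Eπ[rt+1 + γVπ(st+1) | st = s]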
where π is the policy being followed by the agent. In its last, recursive form, the equation for V is called a Bellman equation.
From the V formula, we can devise a policy π*(s) that always picks the action that generates the highest value. This π* is what we call an optimal policy.
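In symbols, this greedy choice can be written with the transition and reward functions of the MDP as:

π*(s) = argmaxa Σs' T(s, a, s') [R(s, a, s') + γV(s')]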
The optimal policy π* is initially unknown and of great interest: having it means we can make our agent use the best strategy possible in a given MDP. There are some algorithms to find π*, most notably the value iteration and policy iteration algorithms. Value iteration starts with arbitrary values attached to each state and computes a new value for each state at every step. This method has been proven to converge to the optimal state values. Once we have them, it's trivial to extract the optimal policy: in each state, the agent should take the action that leads to the best value.
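Here is a compact sketch of value iteration, reusing the toy array-based MDP from the earlier snippet (redefined so the example runs on its own); the iteration limit and tolerance are arbitrary choices:

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
T = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    T[s, 0, s] = 1.0      # action 0: stay (reward 0)
    T[s, 1, 1 - s] = 1.0  # action 1: switch states (reward 1)
    R[s, 1, 1 - s] = 1.0

V = np.zeros(n_states)   # arbitrary initial values
for _ in range(1000):
    # Expected return of taking each action a in each state s and following V afterwards.
    Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
    new_V = Q.max(axis=1)
    if np.max(np.abs(new_V - V)) < 1e-8:  # values have (numerically) converged
        V = new_V
        break
    V = new_V

optimal_policy = Q.argmax(axis=1)  # greedy action with respect to the converged values
```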
Policy iteration, in its turn, starts with a random policy and then calculates V for each state under that policy (the policy evaluation step). In a second step, the action that generates the best value in each state is compared to the one prescribed by the current policy; if a different action is better, the policy is updated to prescribe this new action instead (the policy improvement step). The method repeats these two steps until convergence. Policy iteration exploits the fact that we don't need to know the exact optimal state values in order to find the optimal policy.
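And a corresponding sketch of policy iteration on the same toy MDP. The policy evaluation step here solves V exactly as a linear system, which is only feasible because the state space is tiny:

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
T = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    T[s, 0, s] = 1.0      # action 0: stay (reward 0)
    T[s, 1, 1 - s] = 1.0  # action 1: switch states (reward 1)
    R[s, 1, 1 - s] = 1.0

policy = np.zeros(n_states, dtype=int)  # start with an arbitrary policy
while True:
    # Policy evaluation: solve V = R_pi + gamma * T_pi V for the current policy.
    T_pi = T[np.arange(n_states), policy]                               # shape (S, S)
    R_pi = np.einsum('sp,sp->s', T_pi, R[np.arange(n_states), policy])  # expected reward per state
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
    # Policy improvement: act greedily with respect to the evaluated V.
    Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break  # the policy is stable, hence optimal
    policy = new_policy
```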
As we saw in the last section, there are solutions for the problem of finding an optimal policy when all the components of an MDP are known. But what if we don't know the transition probabilities? This is a more realistic scenario, as it is often impossible or infeasible to model the T function beforehand.
In such an environment, we can apply model-free methods that are akin to policy iteration: we try some policy, evaluate it and improve it whenever other actions turn out to be better.
Since we now don't know T, we can't compute V directly from its Bellman equation, so first we must find a new way to evaluate a policy. For episodic tasks, the simplest idea is Monte-Carlo (MC) policy evaluation, which basically consists of simulating many episodes and averaging out the results. An overview of the process goes like this:
- Select a policy to be evaluated.
- Simulate an agent that follows that policy until a terminal state is reached. In other words, sample a walk using that policy.
- Each transition along the walk generates a reward. These rewards can be propagated back, using the definition of Gt, to calculate the return of every state found along the way.
- Repeat steps 2 and 3 many times.
- For each state, take the final estimated value to be the mean of the returns calculated for that state over all the episodes.
The last step involves calculating a mean. Mean calculation can be done incrementally, so an incremental algorithm can be devised. Given the sequence x1, x2, ..., the sequence of means μ1, μ2, ... can be computed as follows:
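μk = μk-1 + (1/k)(xk - μk-1) = μk-1 + α(xk - μk-1), with α = 1/k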
The α term can be viewed as a parameter that yields different kinds of means (for example, α can be a constant instead of 1/k). This type of iterative mean calculation is widespread in AI models and can be interpreted as the correction of an estimate: we predicted μk-1 but got xk instead. In this view, α acts as a learning rate and (xk - μk-1) as an error term.
Using the same logic, a Monte-Carlo update that calculates the new value for a state given the result of an episode can be written as V(st) ← V(st) + α[Gt - V(st)]. This procedure converges to the true state values under a given policy.
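A sketch of every-visit MC evaluation with this incremental update, assuming a Gym-like environment object whose reset() returns a state and whose step(a) returns (next state, reward, done); these names are illustrative, not part of the text:

```python
from collections import defaultdict

def mc_evaluate(env, policy, n_episodes=1000, alpha=0.05, gamma=0.99):
    V = defaultdict(float)  # value estimates, default 0
    for _ in range(n_episodes):
        # Sample a walk (episode) following the policy until a terminal state.
        episode, s, done = [], env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            episode.append((s, r))
            s = s2
        # Propagate the rewards backwards to compute the return Gt of every visited state.
        G = 0.0
        for s, r in reversed(episode):
            G = r + gamma * G
            V[s] += alpha * (G - V[s])  # Monte-Carlo update toward the sampled return
    return V
```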
MC policy evaluation always uses the full sequence of rewards observed after a given state. Temporal-Difference (TD) policy evaluation, instead, builds upon the idea of using the current estimated values of some of the future states, a concept known as bootstrapping.
The simplest TD algorithm is called TD(0). It only considers the value of the next state. For example, if an agent performs some action in a state st and transitions to a state st+1, only V(st+1) will be used to update the value of V(st). The update is given by V(st) ← V(st) + α[rt+1 + γV(st+1) - V(st)]. Compared to the Monte-Carlo update, we can see that Gt is replaced by rt+1 + γV(st+1).
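A minimal sketch of TD(0) policy evaluation, under the same assumed reset()/step() interface as before:

```python
from collections import defaultdict

def td0_evaluate(env, policy, n_steps=10_000, alpha=0.05, gamma=0.99):
    V = defaultdict(float)
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s2, r, done = env.step(a)
        # Bootstrapped target: immediate reward plus the discounted estimate of the next state.
        target = r if done else r + gamma * V[s2]
        V[s] += alpha * (target - V[s])
        s = env.reset() if done else s2
    return V
```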
Why consider only the next state instead of waiting for the episode to finish and getting a more precise value, as is done in Monte-Carlo learning? A first advantage of TD(0) is that it works for continuing tasks: the TD(0) update allows us to calculate a new value right after each state transition, so we no longer have to wait for the end of an episode.
But an even more compelling advantage of TD(0) over Monte-Carlo is that it usually converges faster. This is a surprising fact given that Monte-Carlo always uses the true outcomes of an MDP, while TD(0) uses biased estimates by not looking all the way to the end of the road.
This result is a consequence of TD(0)'s implicit consideration of the MDP structure and its Markov properties when updating the values. For example, imagine a simple MDP with two states A and B. Three episodes in this MDP generate the following sequences of states and rewards: (B, 1), (B, 1) and (A, 0, B, 0). In order to calculate V(A), Monte-Carlo learning will look at all the episodes that passed through A, propagate their rewards back to A using the return function and average the results. In this case, there is only one episode containing A, and it generates only rewards of 0, so V(A) becomes 0.
TD(0), in its turn, will calculate V(A) from the immediate reward and the current estimate of V(B). Because B generated positive rewards in the previous episodes, it will have a value higher than 0, and this will be reflected in the calculation of V(A), which will also be greater than 0. TD(0) is implicitly accounting for the fact that if the agent can reach B from A, then B must have a major influence on A's value, even if the experience so far with A tells another story. This connection between A and B is ignored by Monte-Carlo learning.
TD(0) and Monte-Carlo can be combined to form a more general policy evaluation algorithm called TD(λ). As its name implies, it makes use of a λ parameter (0 ≤ λ ≤ 1) to control how much the learning should consider the future states of the episode. If λ = 0, TD(λ) behaves the same way as TD(0). If λ = 1, TD(λ) becomes MC evaluation.
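One way to make this mixing concrete is the forward view of TD(λ), which replaces Gt in the update by the λ-return:

Gt(λ) = (1 - λ) Σn=1..∞ λ^(n-1) Gt(n)

where Gt(n) is the n-step return rt+1 + γrt+2 + ... + γ^(n-1)rt+n + γ^n V(st+n).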
Again, why bother mixing TD(0) and MC approaches if the former is considered better than the latter? It turns out that intermediate values of λ make learning even more efficient than TD(0) for many MDPs in practice.
The TD methods presented so far (which include Monte-Carlo, thanks to TD(λ)) only evaluate a given policy. The bigger question of finding the optimal policy still remains. This is known in the literature as the control problem and, as mentioned before, can be solved with methods based on the same principles as policy iteration.
Remember that policy iteration is composed of two steps: policy evaluation and policy improvement. Using TD, we can perform the policy evaluation step. What about the policy improvement step? The first idea is to do that in the same greedy way as in the original policy iteration. After the evaluation gives us the state values, if we find an action in some state that does better than the current action prescribed by the policy, we change the policy to take that action in that state. We repeat these steps until convergence.
But notice that in order to calculate how well an action a does in a state s, as in the original policy iteration algorithm, we need to sum the possible rewards weighted by the probability of transitioning to each possible next state. This means we need to know T, which we don't have. To circumvent this issue, we can use a variant of the V function, called Q, that takes a state and an action as inputs and returns the expected return under some policy. In mathematical notation, Q is defined as follows:
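Qπ(s, a) = Eπ[Gt | st = s, at = a]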
Since we can write V in terms of Q:
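Vπ(s) = Qπ(s, π(s))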
We can build the following Bellman equation for Q:
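Qπ(s, a) = Eπ[rt+1 + γQπ(st+1, at+1) | st = s, at = a]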
A useful property of Q is that the action that leads to the highest value under V is the same one that maximizes Q(s, a). So, to do policy improvement, we can simply look for the actions with the best Q values. No multiplication of rewards and transition probabilities is needed anymore: the Q function caches all this information.
Unfortunately, the introduction of Q makes our purely greedy approach break down in many cases. As Q is action dependent, the policy improvement step will stick to the action that happened to generate the best Q value in the first iterations and never try other, potentially better actions. Solutions to this problem are divided into on-policy and off-policy methods.
On-policy methods improve a stochastic policy that always has some positive probability of picking any possible action in any state. This prevents the policy from getting stuck with an action that is not optimal.
The simplest application of this principle is to use a policy that always has a probability ε of selecting a completely random action, and otherwise picks the greedy action. We call this an ε-greedy policy.
With an ε-greedy policy, we can apply policy iteration using Q in the same way we use V. In particular, if we use TD(0) as the policy evaluation method, we have an update rule in the form of Q(st, at) ← Q(st, at) + α[rt+1 + γQ(st+1, at+1) - Q(st, at)]. The policy iteration algorithm that uses this rule is called SARSA, because of the sequence State-Action-Reward-State-Action used in the update.
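A sketch of tabular SARSA with an ε-greedy policy, again assuming the reset()/step() interface used in the earlier snippets:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, n_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < eps:  # explore with probability eps
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])  # otherwise exploit

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2)
            # State-Action-Reward-State-Action: a2 is the action the agent will actually take next.
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```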
In the limit, this scheme will converge to a policy that is very close to the optimal one. It will never reach the optimal policy, though, because we are making the compromise of keeping the policy stochastic. The resulting policy exploits the environment by taking optimal actions a portion of the time, but it also dedicates some time to taking non-optimal actions in order to explore the environment and hopefully find better ones. This exploration-exploitation tension is a pervasive theme in RL, and the perfect balance is still an open problem.
A more sophisticated model-free approach to find an optimal policy is to use off-policy methods. Off-policy methods use two policies, a behavior policy and a target policy. The behavior policy is the one that is actually used by the agent in the environment. The target policy is the policy used to update the Q values during policy iteration.
In particular, if we set the behavior policy as the ε-greedy policy and the target policy as the fully greedy policy (the one that always picks the action with the best Q value), we have an update rule in the form of Q(st, at) ← Q(st, at) + α[rt+1 + γ max_a Q(st+1, a) - Q(st, at)]. This is known as Q-learning.
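A sketch of tabular Q-learning under the same assumptions; note that the target uses the greedy (max) action, which is not necessarily the action the ε-greedy behavior policy will take next:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy over the current Q values.
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # Target policy: fully greedy (max over the actions of the next state).
            best_next = max(Q[(s2, x)] for x in range(n_actions))
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```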
An important implementation detail of algorithms such as SARSA and Q-learning is that they demand a separate variable for each state-action pair to hold its Q value. Therefore, in order to use them, we need some sort of table where we can query the value of a particular pair. For this reason, these algorithms are said to be tabular.
The need for a table is problematic when the MDP has a huge (or even infinite) number of states or actions, as every computer has its memory limits. A workaround is to use a function approximator instead of an actual table: the tabular Q(s, a) is replaced by a function capable of returning a good estimate for every state-action pair, even the ones the agent has never experienced.
Currently, the most appealing function approximator is the neural network. It has been proven that neural networks, under the right conditions, can serve as universal approximators. When using a neural network, we can adapt the update rules of SARSA and Q-learning to update the network weights. Unfortunately, these new versions bring their own set of challenges. Most notably, once a non-linear approximator such as a neural network is used, there are no longer guarantees of convergence.
One of the state-of-the-art algorithms in RL is DQN, a variant of Q-learning that uses neural networks as approximators. In order to circumvent the divergence that may arise, DQN employs two techniques. First, each update is made using a random subset of past transitions (an experience replay buffer), not only the immediate transition the agent has just experienced. Second, the bootstrapped targets are computed with a separate copy of the network whose weights are kept fixed (the target network); after a number of transitions, this copy is refreshed with the weights of the network being trained. Both techniques have been shown to help stabilize training. Using DQN, researchers were able to train an agent to play Atari games with impressive results.
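A compressed sketch of these two techniques, assuming PyTorch; the network architecture, hyperparameters and replay-buffer handling are illustrative choices, not details taken from the original DQN work:

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_qnet(n_states, n_actions):
    # Small fully connected Q-network: maps a state vector to one Q value per action.
    return nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))

n_states, n_actions, gamma = 4, 2, 0.99
q_net = make_qnet(n_states, n_actions)       # network being trained
target_net = make_qnet(n_states, n_actions)  # frozen copy used to compute the targets
target_net.load_state_dict(q_net.state_dict())

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # the interaction loop is assumed to push (s, a, r, s2, done) tuples here

def update(batch_size=32):
    if len(replay) < batch_size:
        return
    # Technique 1: learn from a random subset of past transitions, not just the latest one.
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # Technique 2: compute the bootstrapped target with the frozen target network.
    with torch.no_grad():
        target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - done.float())
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every so many transitions, the target network copies the weights of the trained network:
# target_net.load_state_dict(q_net.state_dict())
```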
So far, we have seen the theory around MDPs where there is only one agent in the environment. Since the introduction of other agents makes it possible to represent more realistic scenarios, it is of research interest to extend that framework to environments with multiple agents.
In the single-agent case, the goal of MDP-solving methods is to find an optimal policy, defined as the one that generates the greatest expected return. In the multi-agent case, the definition of the optimal policy is less straightforward. Since each agent has the potential to change not only its own return but also the other agents' returns, simply stating that each agent should seek the best expected return would not suffice: a good return for one agent could mean a poor return for another.
Many of the problems involving multiple agents have been studied by Game Theory (GT), a field with roots in Economics. From that field, we borrow the concept of equilibrium. The objects of study of GT are called games. In a game, an equilibrium is reached when no agent (or player in GT jargon) can obtain a better expected return by unilaterally changing its policy (or strategy). Here, the focus is on games where there are only 2 players.
Finding equilibria is appealing because, once one is reached, we have a set of policies with a guaranteed minimum expected return for the player who follows them, no matter what the other player does, and even if the player publicly announces their strategy beforehand. This also means that if the other player does not go along and acts to break the equilibrium, they can only decrease their own return.
Equilibrium is also a stable concept that can be defined as the goal of solving the multi-agent problem. This avoids a more convoluted “guessing game” where each player tries to predict what the other is thinking. Hence, we define the goal of solving an MDP with multiple agents as finding policies that make the agents behave in an equilibrium.
There can be multiple equilibria in the same game. One particular equilibrium is achieved when both players assume the opponent will always act to generate the worst possible expected return for them, no matter what strategy they choose. The equilibrium is reached when both players settle on a strategy that maximizes the minimum expected return allowed by the opponent. This is called the minimax equilibrium.
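In symbols, if Gi denotes player i's return, player 1's minimax strategy is π1* = argmaxπ1 minπ2 E[G1 | π1, π2], and symmetrically for player 2.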
Another kind of equilibrium is the correlated equilibrium. It assumes the actions of each player are suggested by a trusted third party that samples two actions (one for each player) from a public probability distribution. Importantly, the probability of a certain action being suggested to a player may not be independent of the action being suggested to the other. Moreover, each player is only aware of the probability distribution over action pairs, not the actual suggestion made to the other player.
It has been demonstrated that, for some probability distributions, players who follow the suggestions made by the third party are in equilibrium, i.e. a player would not obtain a better expected return by deviating from their suggestion while the other follows theirs. These probability distributions are the ones that satisfy rationality constraints (along with the regular constraints for probabilities). These constraints make sure that the expected return obtained by a player who follows the suggestions is at least as great as the return obtained by deviating (while the other player obeys theirs).
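For a two-player game with joint distribution p over action pairs and expected payoffs u1 and u2 (one value per action pair), the rationality constraints for player 1 can be written, one for each pair of actions a1 and a1', as:

Σa2 p(a1, a2) [u1(a1, a2) - u1(a1', a2)] ≥ 0

and symmetrically for player 2.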
There can be multiple probability distributions that satisfy those constraints. We can fix an objective function and solve a linear programming problem to obtain a specific distribution. Greenwald and Hall propose the utilitarian correlated equilibrium (uCE), where the objective function to be maximized is the sum of the players' expected returns. This equilibrium can be interpreted as the one that yields the best “social welfare” in the game.
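Using the same notation as above, the uCE distribution is the solution of the linear program: maximize Σa1,a2 p(a1, a2) [u1(a1, a2) + u2(a1, a2)], subject to the rationality constraints, p(a1, a2) ≥ 0 and Σa1,a2 p(a1, a2) = 1.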