Documentation: theory page #76
Merged (5 commits) on Aug 4, 2020

Changes from 3 commits
3 changes: 3 additions & 0 deletions docs/conf.py.in
@@ -11,6 +11,9 @@ extensions = [
"sphinx.ext.viewcode",
]

# Math setting
extensions += ["sphinx.ext.mathjax"]

# Code style
pygments_style = "monokai"

163 changes: 163 additions & 0 deletions docs/discussion/theory.rst
@@ -1,2 +1,165 @@
Ecole Theoretical Model
=======================

The Ecole API and classes directly relate to the different components of
an episodic `partially-observable Markov decision process <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>`_
(PO-MDP).

Markov decision process
-----------------------
Consider a regular Markov decision process
:math:`(\mathcal{S}, \mathcal{A}, p_{init}, p_{trans}, R)`, whose components are

* a state space :math:`\mathcal{S}`
* an action space :math:`\mathcal{A}`
* an initial state distribution :math:`p_{init}: \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a state transition distribution
:math:`p_{trans}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}`
* a reward function :math:`R: \mathcal{S} \to \mathbb{R}`.

.. note::

The choice of deterministic rewards :math:`r_t = R(s_t)` is made here to
best fit the Ecole API. It is not a restrictive choice, however, as any
MDP with stochastic rewards
:math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
can be converted into an equivalent MDP with deterministic ones,
by considering the reward as part of the state.

Together with an action policy

.. math::

\pi: \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}

an MDP can be unrolled to produce state-action trajectories

.. math::

\tau=(s_0,a_0,s_1,\dots)

that obey the following joint distribution

.. math::

\tau \sim \underbrace{p_{init}(s_0)}_{\text{initial state}}
\prod_{t=0}^\infty \underbrace{\pi(a_t | s_t)}_{\text{next action}}
\underbrace{p_{trans}(s_{t+1} | a_t, s_t)}_{\text{next state}}
\text{.}
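
As a concrete illustration, this unrolling can be sketched in a few lines of
Python. The ``sample_initial_state``, ``sample_action``, ``sample_next_state``
and ``reward`` callables below are hypothetical stand-ins for
:math:`p_{init}`, :math:`\pi`, :math:`p_{trans}` and :math:`R`; they are not
part of the Ecole API.

.. code-block:: python

   def unroll(sample_initial_state, sample_action, sample_next_state, reward, horizon):
       """Sample a finite prefix (s_0, a_0, s_1, ...) of an MDP trajectory."""
       trajectory = []
       state = sample_initial_state()                # s_0 ~ p_init
       for _ in range(horizon):
           action = sample_action(state)             # a_t ~ pi(. | s_t)
           trajectory.append((state, reward(state), action))
           state = sample_next_state(state, action)  # s_{t+1} ~ p_trans(. | a_t, s_t)
       return trajectory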

MDP control problem
^^^^^^^^^^^^^^^^^^^
We define the MDP control problem as that of finding a policy
:math:`\pi^\star` which is optimal with respect to the expected total
reward,

.. math::
:label: mdp_control

\pi^\star = \underset{\pi}{\operatorname{arg\,max}}
\lim_{T \to \infty} \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
\text{,}

where :math:`r_t := R(s_t)`.
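
In practice, this expected total reward can be estimated for a fixed policy by
averaging episode returns over rollouts. The sketch below assumes a
hypothetical ``run_episode`` callable that returns the list of rewards
collected during one full episode; it is not an Ecole function.

.. code-block:: python

   def estimate_total_reward(run_episode, n_episodes=100):
       """Monte Carlo estimate of E_tau[sum_t r_t] under a fixed policy."""
       returns = [sum(run_episode()) for _ in range(n_episodes)]
       return sum(returns) / len(returns)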

.. note::

In the general case this quantity may not be bounded, for example for MDPs
that correspond to continuing tasks. In Ecole we guarantee that all
environments correspond to **episodic** tasks, that is, each episode is
guaranteed to start from an initial state :math:`s_0` and end in a
terminal state :math:`s_{final}`. For convenience this terminal state can
be considered absorbing, i.e.,
:math:`p_{trans}(s_{t+1}|a_t,s_t=s_{final}) := \delta_{s_{final}}(s_{t+1})`,
and associated with a null reward, :math:`R(s_{final}) := 0`, so that all
future states encountered after :math:`s_{final}` can be safely ignored in
the MDP control problem.

Partially-observable Markov decision process
--------------------------------------------
In the PO-MDP setting, complete information about the current MDP state
is not necessarily available to the decision-maker. Instead,
at each step only a partial observation :math:`o \in \Omega`
is made available, which can be seen as the result of applying an observation
function :math:`O: \mathcal{S} \to \Omega` to the current state. As a result,
PO-MDP trajectories take the form

.. math::

\tau=(o_0,r_0,a_0,o_1,\dots)
\text{,}

where :math:`o_t:= O(s_t)` and :math:`r_t:=R(s_t)` are respectively the
observation and the reward collected at time step :math:`t`. Due to the
non-Markovian nature of those trajectories, that is,

.. math::

o_{t+1},r_{t+1} \mathop{\rlap{\perp}\mkern2mu{\not\perp}} o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
\text{,}

the decision-maker must take into account the whole history of past
observations, rewards and actions in order to decide on an optimal action
at the current time step :math:`t`. The PO-MDP policy then takes the form

.. math::

\pi:\mathcal{A} \times \mathcal{H} \to \mathbb{R}_{\geq 0}
\text{,}

where :math:`h_t:=(o_0,r_0,a_0,\dots,o_t,r_t)\in\mathcal{H}` represents the
PO-MDP history at time step :math:`t`, so that :math:`a_t \sim \pi(a_t|h_t)`.
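
A PO-MDP policy is therefore any function of the full history. As a minimal
sketch (not an Ecole class), it can be implemented by accumulating
observations, rewards and actions as they arrive:

.. code-block:: python

   class HistoryPolicy:
       """Hypothetical policy conditioning its action on the whole history h_t."""

       def __init__(self, choose_action):
           # ``choose_action`` maps the history [(o_0, r_0, a_0), ..., (o_t, r_t)] to a_t.
           self.choose_action = choose_action
           self.history = []

       def __call__(self, observation, reward):
           self.history.append((observation, reward))
           action = self.choose_action(list(self.history))
           # Record the chosen action so that it becomes part of h_{t+1}.
           self.history[-1] = (observation, reward, action)
           return action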

PO-MDP control problem
^^^^^^^^^^^^^^^^^^^^^^
The PO-MDP control problem can then be written identically to the MDP one,

.. math::
:label: pomdp_control

\pi^\star = \underset{\pi}{\operatorname{arg\,max}} \lim_{T \to \infty}
\mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
\text{.}

Ecole as PO-MDP components
--------------------------

The following Ecole components can be directly translated into PO-MDP
components from the above formulation:

* :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
* :py:class:`~ecole.typing.ObservationFunction` <=> :math:`O`
* :py:meth:`~ecole.typing.Dynamics.reset_dynamics` <=> :math:`p_{init}(s_0)`
* :py:meth:`~ecole.typing.Dynamics.step_dynamics` <=> :math:`p_{trans}(s_{t+1}|s_t,a_t)`

The :py:class:`~ecole.environment.EnvironmentComposer` class wraps all of
those components together to form the PO-MDP. Its API can be interpreted as
follows:

* :py:meth:`~ecole.environment.EnvironmentComposer.reset` <=>
:math:`s_0 \sim p_{init}(s_0), r_0=R(s_0), o_0=O(s_0)`
* :py:meth:`~ecole.environment.EnvironmentComposer.step` <=>
:math:`s_{t+1} \sim p_{trans}(s_{t+1}|a_t,s_t), r_{t+1}=R(s_{t+1}), o_{t+1}=O(s_{t+1})`
* ``done == True`` <=> the PO-MDP has reached the terminal state
:math:`s_{t+1} = s_{final}`, and the current episode ends.
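
Put together, a typical episode can be sketched as below. The exact return
values of ``reset`` and ``step`` used here are assumptions for illustration
only; refer to the environment documentation for the precise signatures. The
``env``, ``instance`` and ``policy`` objects are placeholders.

.. code-block:: python

   def run_episode(env, instance, policy):
       """Sketch of one PO-MDP episode (return values of reset/step are assumed)."""
       observation, action_set, reward, done = env.reset(instance)
       cumulated_reward = reward                     # r_0 = R(s_0)
       while not done:
           action = policy(observation, action_set)  # a_t ~ pi(a_t | h_t)
           observation, action_set, reward, done, info = env.step(action)
           cumulated_reward += reward                # r_{t+1} = R(s_{t+1})
       return cumulated_reward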

The state space :math:`\mathcal{S}` can be considered to be the whole computer
memory occupied by the environment, which includes the state of the underlying
SCIP solver instance. The action space :math:`\mathcal{A}` is specific to each
environment.

.. note::
We allow the environment to specify a set of valid actions at each time
step :math:`t`. The ``action_set`` value returned by
:py:meth:`~ecole.environment.EnvironmentComposer.reset` and
:py:meth:`~ecole.environment.EnvironmentComposer.step` serves this purpose,
and may be left as ``None`` when the action set is implicit.
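
For example, a simple baseline policy could pick uniformly at random among the
valid actions whenever an explicit ``action_set`` is provided. The sketch below
is illustrative only and assumes the action set is iterable:

.. code-block:: python

   import random

   def random_policy(observation, action_set):
       """Hypothetical policy picking uniformly from the current action set."""
       return random.choice(list(action_set))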


.. note::

As can be seen from :eq:`pomdp_control`, the initial reward :math:`r_0`
returned by :py:meth:`~ecole.environment.EnvironmentComposer.reset`
does not affect the control problem. In Ecole we
nevertheless chose to preserve this initial reward, in order to obtain
meaningful cumulative episode rewards (e.g., total running time).
49 changes: 49 additions & 0 deletions docs/static/css/custom.css
@@ -120,3 +120,52 @@
.highlight .k {
color: #77D1F6 !important;
}

/* CSS to fix Mathjax equation numbers displaying above.
*
* Credit to @hagenw https://github.com/readthedocs/sphinx_rtd_theme/pull/383
*/
div.math {
position: relative;
padding-right: 2.5em;
}
.eqno {
height: 100%;
position: absolute;
right: 0;
padding-left: 5px;
padding-bottom: 5px;
/* Fix for mouse over in Firefox */
padding-right: 1px;
}
.eqno:before {
/* Force vertical alignment of number */
display: inline-block;
height: 100%;
vertical-align: middle;
content: "";
}
.eqno .headerlink {
display: none;
visibility: hidden;
font-size: 14px;
padding-left: .3em;
}
.eqno:hover .headerlink {
display: inline-block;
visibility: hidden;
margin-right: -1.05em;
}
.eqno .headerlink:after {
visibility: visible;
content: "\f0c1";
font-family: FontAwesome;
display: inline-block;
margin-left: -.9em;
}
/* Make responsive */
.MathJax_Display {
max-width: 100%;
overflow-x: auto;
overflow-y: hidden;
}