
Commit

Switched back to Mathjax, replaced the argmax and nindep commands
gasse committed Aug 3, 2020
1 parent 586e85b commit c50d70d
Showing 2 changed files with 10 additions and 24 deletions.
docs/conf.py.in (14 changes: 0 additions & 14 deletions)
@@ -53,20 +53,6 @@ napoleon_google_docstring = False
napoleon_numpy_docstring = True


-# LaTex configuration (for math)
-extensions += ["sphinx.ext.imgmath"]
-imgmath_image_format = "svg"
-imgmath_latex_preamble = r'''
-\DeclareMathOperator*{\argmax}{arg\,max}
-\DeclareMathOperator*{\argmin}{arg\,min}
-\newcommand\indep{\protect\mathpalette{\protect\independenT}{\perp}}
-\def\independenT#1#2{\mathop{\rlap{$#1#2$}\mkern2mu{#1#2}}}
-\newcommand\nindep{\protect\mathpalette{\protect\nindependenT}{\perp}}
-\def\nindependenT#1#2{\mathop{\rlap{$#1#2$}\mkern2mu{\not#1#2}}}
-\newcommand{\overbar}[1]{\mkern 1.5mu\overline{\mkern-1.5mu#1\mkern-1.5mu}\mkern 1.5mu}
-'''


# Preprocess docstring to remove "core" from type name
def preprocess_signature(app, what, name, obj, options, signature, return_annotation):
if signature is not None:
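Since this commit only removes the imgmath setup, math rendering falls back to Sphinx's default MathJax extension. Below is a minimal sketch of what the MathJax side of docs/conf.py.in could look like; it is an illustration and not part of the diff above, and the macro re-declaration in particular is hypothetical, since the commit instead inlines the operators directly in the .rst sources.

# Sketch of a MathJax-based setup (illustrative, not from this commit).
# sphinx.ext.mathjax is the default HTML math renderer since Sphinx 1.8,
# so simply not loading sphinx.ext.imgmath is enough; listing it here only
# makes the choice explicit.
extensions += ["sphinx.ext.mathjax"]

# Hypothetical alternative: keep \argmax and \argmin as MathJax TeX macros
# instead of a LaTeX preamble (the commit spells them out in the .rst instead).
mathjax_config = {
    "TeX": {
        "Macros": {
            "argmax": r"\mathop{\mathrm{arg\,max}}",
            "argmin": r"\mathop{\mathrm{arg\,min}}",
        }
    }
}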
docs/discussion/theory.rst (20 changes: 10 additions & 10 deletions)
@@ -1,7 +1,7 @@
Ecole Theoretical Model
=======================

-The ECOLE API and classes directly relate to the different components of
+The Ecole API and classes directly relate to the different components of
an episodic `partially-observable Markov decision process <https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process>`_
(PO-MDP).

@@ -20,7 +20,7 @@ Consider a regular Markov decision process
.. note::

The choice of having deterministic rewards :math:`r_t = R(s_t)` is
-    arbitrary here, in order to best fit the ECOLE API. Note that it is
+    arbitrary here, in order to best fit the Ecole API. Note that it is
not a restrictive choice though, as any MDP with stochastic rewards
:math:`r_t \sim p_{reward}(r_t|s_{t-1},a_{t-1},s_{t})`
can be converted into an equivalent MDP with deterministic ones,
@@ -56,16 +56,16 @@ reward,
.. math::
:label: mdp_control
-\pi^\star = \argmax_{\pi} \lim_{T \to \infty}
-\mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
+\pi^\star = \underset{\pi}{\operatorname{arg\,max}}
+\lim_{T \to \infty} \mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
\text{,}
where :math:`r_t := R(s_t)`.
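The rewrite above shows the general substitution this commit makes: \argmax is not a TeX or MathJax built-in (the deleted imgmath preamble declared it with \DeclareMathOperator*), so it is now spelled out with constructs MathJax renders without any preamble. A generic sketch of the pattern, not quoted from the repository:

% requires \DeclareMathOperator*{\argmax}{arg\,max} in a preamble
\argmax_{\pi} f(\pi)

% preamble-free equivalent that MathJax handles out of the box
\underset{\pi}{\operatorname{arg\,max}} f(\pi)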

.. note::

In the general case this quantity may not be bounded, for example for MDPs
-    that correspond to continuing tasks. In ECOLE we garantee that all
+    that correspond to continuing tasks. In Ecole we garantee that all
environments correspond to **episodic** tasks, that is, each episode is
garanteed to start from an initial state :math:`s_0`, and end in a
terminal state :math:`s_{final}`. For convenience this terminal state can
@@ -95,7 +95,7 @@ non-Markovian nature of those trajectories, that is,

.. math::
-o_{t+1},r_{t+1} \nindep o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
+o_{t+1},r_{t+1} \mathop{\rlap{\perp}\mkern2mu{\not\perp}} o_0,r_0,a_0,\dots,o_{t-1},r_{t-1},a_{t-1} \mid o_t,r_t,a_t
\text{,}
the decision-maker must take into account the whole history of past
@@ -117,14 +117,14 @@ The PO-MDP control problem can then be written identically to the MDP one,
.. math::
:label: pomdp_control
-\pi^\star = \argmax_{\pi} \lim_{T \to \infty}
+\pi^\star = \underset{\pi}{\operatorname{arg\,max}} \lim_{T \to \infty}
\mathbb{E}_\tau\left[\sum_{t=0}^{T} r_t\right]
\text{.}
-ECOLE as PO-MDP components
+Ecole as PO-MDP components
--------------------------

-The following ECOLE components can be directly translated into PO-MDP
+The following Ecole components can be directly translated into PO-MDP
components from the above formulation:

* :py:class:`~ecole.typing.RewardFunction` <=> :math:`R`
@@ -160,6 +160,6 @@ environment.

As can be seen from :eq:`pomdp_control`, the initial reward :math:`r_0`
returned by :py:meth:`~ecole.environment.EnvironmentComposer.reset`
-does not affect the control problem. In ECOLE we
+does not affect the control problem. In Ecole we
nevertheless chose to preserve this initial reward, in order to obtain
meaningfull cumulated episode rewards (e.g., total running time).
