23 | 23 | \Delta(S)`, where :math:`q(s'|s, a)` is the probability that the state
24 | 24 | in the next period is :math:`s'` when the current state is :math:`s`
25 | 25 | and the action chosen is :math:`a`; and
26 |    | -* discount factor :math:`\beta \in [0, 1)`.
   | 26 | +* discount factor :math:`0 \leq \beta < 1`.
27 | 27 |
28 | 28 | For a policy function :math:`\sigma`, let :math:`r_{\sigma}` and
29 | 29 | :math:`Q_{\sigma}` be the reward vector and the transition probability
30 | 30 | matrix for :math:`\sigma`, which are defined by :math:`r_{\sigma}(s) =
31 | 31 | r(s, \sigma(s))` and :math:`Q_{\sigma}(s, s') = q(s'|s, \sigma(s))`,
32 | 32 | respectively. The policy value function :math:`v_{\sigma}` for
33 |    | -:math`\sigma` is defined by
   | 33 | +:math:`\sigma` is defined by
34 | 34 |
35 |    | -..math::
   | 35 | +.. math::
36 | 36 |
37 | 37 | v_{\sigma}(s) = \sum_{t=0}^{\infty}
38 | 38 | \beta^t (Q_{\sigma}^t r_{\sigma})(s)
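Because :math:`0 \leq \beta < 1`, the series above converges and :math:`v_{\sigma}` is also the unique solution of the linear system :math:`v_{\sigma} = r_{\sigma} + \beta Q_{\sigma} v_{\sigma}`. A minimal NumPy sketch of that direct solve (illustration only, not part of this diff; ``r_sigma`` is assumed to be a length-n reward vector and ``Q_sigma`` an n-by-n stochastic matrix):

    import numpy as np

    def policy_value(r_sigma, Q_sigma, beta):
        # Solve (I - beta * Q_sigma) v = r_sigma, which equals the
        # discounted sum sum_t beta^t Q_sigma^t r_sigma defining v_sigma.
        n = len(r_sigma)
        return np.linalg.solve(np.eye(n) - beta * Q_sigma, r_sigma)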

45 | 45 |
46 | 46 | The *Bellman equation* is written as
47 | 47 |
48 |    | -..math::
   | 48 | +.. math::
49 | 49 |
50 | 50 | v(s) = \max_{a \in A(s)} r(s, a)
51 | 51 | + \beta \sum_{s' \in S} q(s'|s, a) v(s') \quad (s \in S).
52 | 52 |
53 | 53 | The *Bellman operator* :math:`T` is defined by the right hand side of
54 | 54 | the Bellman equation:
55 | 55 |
56 |    | -..math::
   | 56 | +.. math::
57 | 57 |
58 | 58 | (T v)(s) = \max_{a \in A(s)} r(s, a)
59 | 59 | + \beta \sum_{s' \in S} q(s'|s, a) v(s') \quad (s \in S).
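In the product formulation used later in the file (reward array ``R`` of shape (n, m) and transition array ``Q`` of shape (n, m, n), as in the shape check below), one application of :math:`T` is a single vectorized step. A rough NumPy sketch, for illustration only (names are not from the library):

    import numpy as np

    def bellman_operator(R, Q, beta, v):
        # vals[s, a] = r(s, a) + beta * sum_{s'} q(s'|s, a) * v(s')
        vals = R + beta * (Q @ v)
        # (T v)(s) is the max over actions; argmax gives a v-greedy policy
        return vals.max(axis=1), vals.argmax(axis=1)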
60 | 60 |
61 | 61 | For a policy function :math:`\sigma`, the operator :math:`T_{\sigma}` is
62 | 62 | defined by
63 | 63 |
64 |    | -..math::
   | 64 | +.. math::
65 | 65 |
66 | 66 | (T_{\sigma} v)(s) = r(s, \sigma(s))
67 | 67 | + \beta \sum_{s' \in S} q(s'|s, \sigma(s)) v(s')
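With the same (n, m) and (n, m, n) arrays, :math:`r_{\sigma}` and :math:`Q_{\sigma}` are obtained by picking, in each state, the entry of ``R`` and the slice of ``Q`` selected by :math:`\sigma`. A hedged sketch (``sigma`` is assumed to be a length-n integer array of action indices):

    import numpy as np

    def apply_T_sigma(R, Q, beta, sigma, v):
        states = np.arange(R.shape[0])
        r_sigma = R[states, sigma]        # r_sigma(s) = r(s, sigma(s))
        Q_sigma = Q[states, sigma]        # Q_sigma(s, s') = q(s'|s, sigma(s))
        return r_sigma + beta * (Q_sigma @ v)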

117 | 117 |
118 | 118 |
119 | 119 | class DiscreteDP(object):
120 |     | - """
    | 120 | + r"""
121 | 121 | Class for dealing with a discrete dynamic program.
122 | 122 |
123 | 123 | There are two ways to represent the data for instantiating a
@@ -165,7 +165,7 @@ class DiscreteDP(object):
165 | 165 | Transition probability array.
166 | 166 |
167 | 167 | beta : scalar(float)
168 |     | - Discount factor. Must be in [0, 1).
    | 168 | + Discount factor. Must be 0 <= beta < 1.
169 | 169 |
170 | 170 | s_indices : array_like(int, ndim=1), optional(default=None)
171 | 171 | Array containing the indices of the states.
@@ -297,7 +297,7 @@ def __init__(self, R, Q, beta, s_indices=None, a_indices=None):
297 | 297 | raise ValueError('R must be 1- or 2-dimensional')
298 | 298 |
299 | 299 | msg_dimension = 'dimensions of R and Q must be either 1 and 2, ' \
300 |     | - 'of 2 and 3'
    | 300 | + 'or 2 and 3'
301 | 301 | msg_shape = 'shapes of R and Q must be either (n, m) and (n, m, n), ' \
302 | 302 | 'or (L,) and (L, n)'
303 | 303 |
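On the shapes checked here: in the product formulation, ``R`` has shape (n, m) and ``Q`` has shape (n, m, n), and the constructor signature is the one shown in the hunk header above. A small hedged construction example with n = 2 states and m = 2 actions (the import path is an assumption, not part of this diff):

    import numpy as np
    from quantecon.markov import DiscreteDP  # assumed import path

    R = np.array([[5.0, 10.0],                # R[s, a]: reward in state s, action a
                  [-1.0, 2.0]])
    Q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # Q[s, a, s']: transition probabilities
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta = 0.95                               # must satisfy 0 <= beta < 1
    ddp = DiscreteDP(R, Q, beta)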