Issues with Stochastic MuZero #60

carlosgmartin · 2023-08-08T20:40:02Z

I'm having issues with mctx.stochastic_muzero_policy. Here's an example:

import jax
import mctx
from jax import numpy as jnp

num_actions = 4
num_chance_outcomes = 2


def decision_recurrent_fn(params, key, action, state):
    return (
        mctx.DecisionRecurrentFnOutput(
            chance_logits=jnp.full(num_chance_outcomes, 0.0),
            afterstate_value=jnp.array(0.0),
        ),
        state,
    )


def chance_recurrent_fn(params, key, action, afterstate):
    return (
        mctx.ChanceRecurrentFnOutput(
            action_logits=jnp.full(num_actions, 0.0),
            value=jnp.array(0.0),
            # reward=jnp.array(1.),
            reward=1 + (action == 0) * 100,
            discount=jnp.array(0.0),
        ),
        afterstate,
    )


def root_fn(state):
    return mctx.RootFnOutput(
        prior_logits=jnp.full(num_actions, 0.0),
        value=jnp.array(0.0),
        embedding=state,
    )


def main():
    root = root_fn(jnp.full(4, 0.0))
    root = jax.tree_util.tree_map(lambda x: x[None], root)

    key = jax.random.PRNGKey(0)

    output = mctx.stochastic_muzero_policy(
        params=jnp.full(20, 0.0),
        rng_key=key,
        root=root,
        decision_recurrent_fn=jax.vmap(decision_recurrent_fn, [None, None, 0, 0]),
        chance_recurrent_fn=jax.vmap(chance_recurrent_fn, [None, None, 0, 0]),
        num_simulations=1000,
        num_actions=num_actions,
        num_chance_outcomes=num_chance_outcomes,
    )
    assert (output.search_tree.children_rewards == 0).all()
    print(output.action_weights)  # [[0.007 0.451 0.063 0.479]]


if __name__ == "__main__":
    main()

The first issue is that the children_rewards are all 0, despite the fact that chance_recurrent_fn always yields a positive reward.

The second issue is that the final weight of the zeroth action (which receives an additional reward of 100) is not higher than the rest, despite a large number of simulations.

Any idea what might be causing these issues?

fidlej · 2023-08-09T22:08:01Z

Thanks for sharing the minimal example. I can clear up one confusion: the action passed to chance_recurrent_fn(params, key, action, afterstate) is actually the chance outcome. To give different actions different rewards, modify decision_recurrent_fn to output a different afterstate for each action.

You can take inspiration from the bandit in the tests:
https://github.com/deepmind/mctx/blob/bfb7316b96f9e5b04744e8872c1abba9b2dac6b9/mctx/_src/tests/policies_test.py#L42

I will improve the documentation for the chance_recurrent_fn. Sorry for the confusion.

carlosgmartin · 2023-08-09T22:12:39Z

@fidlej Thanks for your reply. Perhaps the argument can be renamed to outcome, for clarity?

carlosgmartin · 2023-08-13T07:35:46Z

@fidlej Any idea about the children_rewards issue?

fidlej · 2023-08-13T15:22:44Z

You can see that the output.search_tree contains only the actions relevant for the decision nodes.
The masking is done here:
https://github.com/deepmind/mctx/blob/bfb7316b96f9e5b04744e8872c1abba9b2dac6b9/mctx/_src/policies.py#L366

The zeros in children_rewards then make sense: the children of the decision nodes are afterstates, and the reward for stepping into them is zero. Only chance_recurrent_fn produces a reward.

copybara-service bot pushed a commit that referenced this issue Aug 9, 2023

Clarify the chance_recurrent_fn arguments.

e1b12f6

Fixes #60. PiperOrigin-RevId: 555288953

copybara-service bot mentioned this issue Aug 9, 2023

Clarify the chance_recurrent_fn arguments. #61

Closed

copybara-service bot pushed a commit that referenced this issue Aug 15, 2023

Clarify the chance_recurrent_fn arguments.

d38b186

Fixes #60. PiperOrigin-RevId: 555288953

copybara-service bot pushed a commit that referenced this issue Aug 15, 2023

Clarify the chance_recurrent_fn arguments.

80bee3d

Fixes #60. PiperOrigin-RevId: 555288953

copybara-service bot closed this as completed in c13a660 Aug 15, 2023
