Skip to content

Quantum reinforcement learning tutorial#541

Merged
MichaelBroughton merged 12 commits into
tensorflow:masterfrom
sjerbi:master
May 28, 2021
Merged

Quantum reinforcement learning tutorial#541
MichaelBroughton merged 12 commits into
tensorflow:masterfrom
sjerbi:master

Conversation

@sjerbi
Copy link
Copy Markdown
Contributor

@sjerbi sjerbi commented Apr 13, 2021

This is the tutorial on "Parametrized Quantum Circuits for Reinforcement Learning" that we discussed. @MichaelBroughton and others, I'll be happy to take your comments and do some revision.

@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-cla
Copy link
Copy Markdown

google-cla Bot commented Apr 13, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented Apr 13, 2021

@googlebot I signed it!

@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented Apr 13, 2021

@MichaelBroughton seems like the gym package (providing the RL environment) is missing from the notebook dependencies of Continuous Integration

Copy link
Copy Markdown
Collaborator

@MichaelBroughton MichaelBroughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening the PR, this is a good first pass. I think we will need to make some modifications before it's ready to merge. I didn't go into a ton of detail on this review, but tried to stick to more thematic points. Here are the major points where I think we need to improve:

  1. Change of language to 2nd person everywhere. "We will show" -> "You can do". Also keeping the language a little more brief and to the point (at the expense of some specificity) appeals a lot to the more hands on audience.
  2. The ordering of the sections seems a little off, I think after you show the visual of your re-uploading PQC circuit it makes a lot more sense to then go into immediately constructing that circuit in Cirq so the reader has a better feel for what's happening while it's fresh in their mind. From there the transition to building up a Keras layer around that circuit feels much more natural.
  3. It looks like the Keras layers you are using could be combined down into just one layer, which would also simplify a lot of the model construction code as well.
  4. The training loops are pretty complicated. It might make sense to try and break them down into simpler helpers that are easier to read so that one can follow the basic steps of "Ok now we're gathering trajectories from the environment with our agent" vs "Ok now we're updating our agent using the trajectories we just gathered" etc. The TF-Agens folks do a pretty good job of this in their REINFORCE tutorial: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial which could be a good template to follow in this case.
  5. There are some performance gotchas. We shouldn't recommend users disable cores to get better performance, and in a case like this one I don't think we should be seeing worse performance the more cores we add so we might want to take a look into why that's happening. Also, every time we push changes to this repo we check that the tutorials still run to completion, we can't have this one go for several hours :). Using things like model.compile and calling into your agent from python one time with batch_size==batch_size instead of calling into the agent from python batch_size times with batch_size==1 will also certainly help too.

Once we tackle all of these I think we will have a pretty awesome QRL tutorial on our hands!

Re: the pip install for gym not working on CI, if that isn't working you can add an install line into ci_validate_tutorials.sh to add gym ( https://github.com/tensorflow/quantum/blob/master/scripts/ci_validate_tutorials.sh )

Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
@MichaelBroughton
Copy link
Copy Markdown
Collaborator

Is this notebook ready for another round of review ?

@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented Apr 25, 2021

@MichaelBroughton Yes, I've changed the language, restructured and detailed some of the code, as you suggested. Regarding the performance catch, I've added "dummy" cells that make the notebook run in a flash when the only purpose is to check that the code runs properly. If one wants to train agents until optimal performance, more compute time is needed, and I don't think that it can be improved much without big code changes.

You can have a look at my replies on the comments you left during your previous review.

Copy link
Copy Markdown
Collaborator

@MichaelBroughton MichaelBroughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things are looking good! I like how the structure of the tutorial has changed. It feels like the user has a much more natural path through understanding the material and the language is a bit more digestible. That being said I think there are still some large fixups needed in the code before we can merge. At a high level:

  1. Certain patterns like the ones below should be removed from the code:
n = 5
qubits = [cirq.GridQubit(1, i) for i in range(n)]

Can be problematic because n doesn't really tell the user anything. A more informative variable name would be n_qubits. In addition to this there are more pythonic ways of constructing qubits using the built in functions in cirq with something like:

qubits = cirq.GridQubit.rect(1, n_qubits)

OR

L = 3
theta_dim = 3 * L + whatever

Where L doesn't follow the snake case convention and in fact the overall logic of the code itself can be simplified if variables were removed in place of more pythonic constructs.

A good example of something in this category that could probably be done with relative ease is: instead of defining a second interaction_env function try to make the first one general enough to support both RL use cases (also might be done with the cell the re-instantiates the optimizers for a second time).

  1. We should be keeping an eye to performance patterns and best practices for TF code. I went over the Reinforce section fairly closely and found several large areas where the code could not only be shortened to fit more natural TF paradigms, but also see a speed boost. If we want to forgo using common patterns like:
@tf.function
def my_update_step(stuff)
.
.
.

We should be sure that it is for good reason. When switching to these common TF patterns I was able to take the runtime for 500 episodes from ~50 mins on colab (this version) down to ~15 minutes ( qrl.ipynb.tar.gz and incorporates most of the comments from above). Could you please do the same for the 2nd RL example you give in the tutorial and make sure we aren't leaving any performance on the table or forgoing good TF style when we don't have to ?

  1. Right now there is one picture and two plots in the tutorial. Is there any chance we could add a few more visuals ? Maybe an animated gif at the top of cartpole ? A schematic diagram of how the state information is fed the QC into the agent and back (outside of just the circuit structure) ? Maybe make a gif using the env.render() to let the user visualize their agents performance ?

Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented May 10, 2021

Thanks for the second review @MichaelBroughton. Cool that you managed to get the policy-gradient code to run faster. In my opinion, this makes the code a bit harder to parse, but I guess speed is an important concern.

I've implemented the changes you suggested (including the correction I mention above in the ReUploading layer + some minor changes in naming and simplifications here and there), added a rendering of the CartPole env, and an intro sentence.

Regarding the Q-learning code, I don't see how it can be sped-up without changing the implementation compared to our paper. We can't run episodes in parallel as the updates of the model happen at the interaction-step level and not episode-level. Maybe changing the Q_learning_loss function into Q_learning_update and decorating it with a @tf.function could slightly speed things up but I don't think by much and don't have time to test.

I don't think I will have time to look into this tutorial further in the upcoming weeks. Would you be ok with calling this last version final and putting it online?

@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented May 13, 2021

@MichaelBroughton I checked, the Q_learning_update function with a @tf.function decorator runs ~3x to 6x slower (depending on how much I delegate to it) than the current implementation. I think we're good to go as is.

@MichaelBroughton
Copy link
Copy Markdown
Collaborator

MichaelBroughton commented May 13, 2021

@MichaelBroughton I checked, the Q_learning_update function with a @tf.function decorator runs ~3x to 6x slower (depending on how much I delegate to it) than the current implementation. I think we're good to go as is.

I don't think we're good to go in this case for a couple of reasons.

  1. When writing a tutorial we want to lead by example with good quality code that beginner to intermediate users shouldn't be confused by when using the standard @tf.function results in slow downs instead of speed ups.
  2. I took another look at the code and placed a very simple timing block around the update portion of the training loop. So now it looks like this:
# Update model
if step_count % update_VQC == 0 and len(replay_memory) >= batch_size:
    # Compute loss
    t = time.time()
    tape, loss = Q_learning_loss(replay_memory, batch_size, loss_function, model, model_target, 
                                 gamma, n_actions)

    # Backpropagation
    grads = tape.gradient(loss, model.trainable_variables)
    for optimizer, w in zip([optimizer_in, optimizer_var, optimizer_out], [w_in, w_var, w_out]):
        optimizer.apply_gradients([(grads[w], model.trainable_variables[w])])
    t2 = time.time()
   print('Update time:', t2 - t)

Across training with your implementation this averaged around 0.08 (s) per update in a defalut google colab runtime. Adding:

@tf.function
def Q_learning_update2(states, rewards, next_states, actions, done, gamma, n_actions):
    """Update model."""
    states = tf.convert_to_tensor(states)
    rewards = tf.convert_to_tensor(rewards)
    next_states = tf.convert_to_tensor(next_states)
    actions = tf.convert_to_tensor(actions)
    done = tf.convert_to_tensor(done)

    # Compute their target q_values and the masks on sampled actions
    future_rewards = model_target([next_states])
    target_q_values = rewards + (gamma * tf.reduce_max(future_rewards, axis=1)
                                                   * (1.0 - done))
    masks = tf.one_hot(actions, n_actions)

    # Train the model on the states and target Q-values
    with tf.GradientTape() as tape:
        tape.watch(model.trainable_variables)
        q_values = model([states])
        # Apply the masks to the Q-values
        q_values_masked = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
        # Calculate loss between target Q-values and model Q-values
        loss = tf.keras.losses.Huber()(target_q_values, q_values_masked)

    # Backpropagation
    grads = tape.gradient(loss, model.trainable_variables)
    for optimizer, w in zip([optimizer_in, optimizer_var, optimizer_out], [w_in, w_var, w_out]):
        optimizer.apply_gradients([(grads[w], model.trainable_variables[w])])

Took this time down to 0.04 (s) per update, with more time spent interacting with the environment vs updating the model ( as one would expect). So instead of a ~3-6x slowdown it was a factor of ~1.5-2x faster, even with Q_learning_update2 being far from perfect.

The likely reason that just decorating the existing implementation with @tf.function caused a slowdown is due to this warning "Caution: Passing python scalars or lists as arguments to tf.function will always build a new graph. To avoid this, pass numeric arguments as Tensors whenever possible:" which can be found on the docs page for @tf.function

  1. Lastly, the Q_learning_update2 function follows more closely with the common patterns and best practices that are mentioned throughout the TensorFlow website and featured in a lot of guides + tutorials. We should make an effort to try and remain consistent with best practices for style and performance (like I mentioned in my earlier review as well). See here, here and here for a few examples of common tf.function patterns being used to decorate functions that only accept tf.Tensor or np.ndarray arguments.

On the topic of the v3.1 changes: I think things are really starting to come together nicely in this tutorial, but the second half still seems unnecessarily complex and could use another pass based on the 2nd round of feedback I left (for example there are still two different environment interaction functions that could likely be simplified down to just one , along with some style inconsistencies). I think adressing these points and then one final round of thorough review would put us in good shape to merge the tutorial and put it up live on the website.

I don't think I will have time to look into this tutorial further in the upcoming weeks. Would you be ok with calling this last version final and putting it online?

I don't think we're ready just yet. If time is of the essence, I'd be happy to chip into cleaning up some of the code myself and getting review some someone like @zaqqwerty or @lamberta if we just want to get this merged and live ASAP.

@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented May 13, 2021

@MichaelBroughton I checked, the Q_learning_update function with a @tf.function decorator runs ~3x to 6x slower (depending on how much I delegate to it) than the current implementation. I think we're good to go as is.

I don't think we're good to go in this case for a couple of reasons.

1. When writing a tutorial we want to lead by example with good quality code that beginner to intermediate users shouldn't be confused by when using the standard `@tf.function` results in slow downs instead of speed ups.

2. I took another look at the code and placed a very simple timing block around the update portion of the training loop. So now it looks like this:
# Update model
if step_count % update_VQC == 0 and len(replay_memory) >= batch_size:
    # Compute loss
    t = time.time()
    tape, loss = Q_learning_loss(replay_memory, batch_size, loss_function, model, model_target, 
                                 gamma, n_actions)

    # Backpropagation
    grads = tape.gradient(loss, model.trainable_variables)
    for optimizer, w in zip([optimizer_in, optimizer_var, optimizer_out], [w_in, w_var, w_out]):
        optimizer.apply_gradients([(grads[w], model.trainable_variables[w])])
    t2 = time.time()
   print('Update time:', t2 - t)

Across training with your implementation this averaged around 0.08 (s) per update in a defalut google colab runtime. Adding:

@tf.function
def Q_learning_update2(states, rewards, next_states, actions, done, gamma, n_actions):
    """Update model."""
    states = tf.convert_to_tensor(states)
    rewards = tf.convert_to_tensor(rewards)
    next_states = tf.convert_to_tensor(next_states)
    actions = tf.convert_to_tensor(actions)
    done = tf.convert_to_tensor(done)

    # Compute their target q_values and the masks on sampled actions
    future_rewards = model_target([next_states])
    target_q_values = rewards + (gamma * tf.reduce_max(future_rewards, axis=1)
                                                   * (1.0 - done))
    masks = tf.one_hot(actions, n_actions)

    # Train the model on the states and target Q-values
    with tf.GradientTape() as tape:
        tape.watch(model.trainable_variables)
        q_values = model([states])
        # Apply the masks to the Q-values
        q_values_masked = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
        # Calculate loss between target Q-values and model Q-values
        loss = tf.keras.losses.Huber()(target_q_values, q_values_masked)

    # Backpropagation
    grads = tape.gradient(loss, model.trainable_variables)
    for optimizer, w in zip([optimizer_in, optimizer_var, optimizer_out], [w_in, w_var, w_out]):
        optimizer.apply_gradients([(grads[w], model.trainable_variables[w])])

Took this time down to 0.04 (s) per update, with more time spent interacting with the environment vs updating the model ( as one would expect). So instead of a ~3-6x slowdown it was a factor of ~1.5-2x faster, even with Q_learning_update2 being far from perfect.

The likely reason that just decorating the existing implementation with @tf.function caused a slowdown is due to this warning "Caution: Passing python scalars or lists as arguments to tf.function will always build a new graph. To avoid this, pass numeric arguments as Tensors whenever possible:" which can be found on the docs page for @tf.function

That's exactly the function I tested (I checked again to make sure I wasn't missing on something). Running it on my laptop (i.e., allowing multiprocessing), I get 0.05s per update for the current implementation v.s. 0.35s for yours.

On the topic of the v3.1 changes: I think things are really starting to come together nicely in this tutorial, but the second half still seems unnecessarily complex and could use another pass based on the 2nd round of feedback I left (for example there are still two different environment interaction functions that could likely be simplified down to just one , along with some style inconsistencies).

As I told you in a previous comment, the environment interaction function for Q-learning cannot be made to play episodes in parallel (as you did for the policy gradient approach), since the updates of the model happen during episodes rather than between batches episodes. i.e., the model used for episode n+1 will be most of the time different from the model used for episode n. Changing this will not only change the implementation of the algorithm this time, but the algorithm itself, and hence won't be reproducing the results of our paper.

I don't think we're ready just yet. If time is of the essence, I'd be happy to chip into cleaning up some of the code myself and getting review some someone like @zaqqwerty or @lamberta if we just want to get this merged and live ASAP.

All this being said, I don't object for extra help if you deem changes necessary.

@MichaelBroughton
Copy link
Copy Markdown
Collaborator

MichaelBroughton commented May 17, 2021

Ok, I took a quick pass over the second half of the tutorial and refactored the code so that both algorithms can make use of the environment interaction function (which also needed a rewrite to support this). Overall the tutorial is now shorter, which is good. Here it is:
qrl_pass.ipynb.tar.gz
I also cleaned up some of the indentation that wasn't consistent. Removed some lingering "We"s in the language. Tightened up some of the variable names and in general smoothed out the sharper edges. Removed the extremely large docstring from the ReUploadingLayer since it is already explained in the text and not worth repeating again right below it. Since the runtime is so long I also added break statements in both training loops once a perfect score was reached to save on time.

I'd be happy to move forward to merging this implementation with some minimal changes if need be for anything I might've missed or you feel very strongly about.

Since we are going to be close to merging now, would you be able to also update the following in your PR:

  1. Add an entry for your tutorial in _book.yaml
  2. Add a regex replace rule in the test_tutorials.py for setting n_examples=50 So that the CI doesn't run the tutorial for an extremely long time and instead just quickly test that things run end to end.
  3. When you are done making any edits can you make sure that you run the notebook end to end in a cloud based google colab runtime, get results you are happy with, download the .ipynb file and then use that file as the final version you place in the PR. This is so that we can verify it performs as expected on colab (how we expect a lot of people to first run this) and that we don't get any unexpected behavior when we move to render it on the website. The renderings and outputs in the downloaded .ipynb will be the ones that appear on the site.

@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented May 17, 2021

@MichaelBroughton thanks for going through the second part of the tutorial again.

refactored the code so that both algorithms can make use of the environment interaction function (which also needed a rewrite to support this).

Sooo, at first you wanted to avoid batch_size=1 evaluations, and now to avoid having two functions that interact with the environment you ended up using only batch_size=1 evaluations again? while keeping two functions? (although now one after the other) I strongly feel that this is not what we want for this tutorial, for all the following reasons:

  • The code for gather_steps is quite hard to parse and uses only batch_size=1 (if it's hard to parse, at least it should be for a performance gain..). If we're using batch_size=1 evaluations, we may as well go back to the interaction_step function for both parts and avoid using more complicated functions that basically do the same thing.
  • The implementation with the stochastic variable makes no sense without the context of Q-learning, which is only introduced later
  • A user trying to understand the Q-learning code has to basically go back and forth between the beginning and the end of the tutorial in order to make sense of the function calls
  • The gain in length is really not that big, relatively to the gain in clarity and ease of navigation with two separate interaction functions.
  • Also makes the code for the whitepaper longer

For these reasons, I implemented all your other changes but strongly feel that we shouldn't be trying to merge the interaction functions.

I also cleaned up some of the indentation that wasn't consistent. Removed some lingering "We"s in the language. Tightened up some of the variable names and in general smoothed out the sharper edges.

Since the runtime is so long I also added break statements in both training loops once a perfect score was reached to save on time.

These changes I implemented.

Removed the extremely large docstring from the ReUploadingLayer since it is already explained in the text and not worth repeating again right below it.

I kept the part of the DateReuploading docstring that wasn't explained elsewhere (and that I think a user defining a new custom layer or adapting this one will definitely want to have).

  • Add an entry for your tutorial in _book.yaml

  • Add a regex replace rule in the test_tutorials.py for setting n_examples=50 So that the CI doesn't run the tutorial for an extremely long time and instead just quickly test that things run end to end.

  • When you are done making any edits can you make sure that you run the notebook end to end in a cloud based google colab runtime, get results you are happy with, download the .ipynb file and then use that file as the final version you place in the PR. This is so that we can verify it performs as expected on colab (how we expect a lot of people to first run this) and that we don't get any unexpected behavior when we move to render it on the website. The renderings and outputs in the downloaded .ipynb will be the ones that appear on the site.

Did all this.

Copy link
Copy Markdown
Collaborator

@MichaelBroughton MichaelBroughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sooo, at first you wanted to avoid batch_size=1 evaluations, and now to avoid having two functions that interact with the environment you ended up using only batch_size=1 evaluations again? while keeping two functions? (although now one after the other) I strongly feel that this is not what we want for this tutorial, for all the following reasons: The code for gather_steps is quite hard to parse and uses only batch_size=1 (if it's hard to parse, at least it should be for a performance gain..). If we're using batch_size=1 evaluations, we may as well go back to the interaction_step function for both parts and avoid using more complicated functions that basically do the same thing.
The implementation with the stochastic variable makes no sense without the context of Q-learning, which is only introduced later
A user trying to understand the Q-learning code has to basically go back and forth between the beginning and the end of the tutorial in order to make sense of the function calls
The gain in length is really not that big, relatively to the gain in clarity and ease of navigation with two separate interaction functions.
Also makes the code for the whitepaper longer
For these reasons, I implemented all your other changes but strongly feel that we shouldn't be trying to merge the interaction functions.

Fair enough. I had figured we ought to at least see what having a single environment interaction function looks like. If you feel that strongly we should keep two separate ones that's fine by me.

In this last round of review I have:

  1. Changing update_VQC to use snake case.
  2. Clarifying whether the update_target variable is reffering to 30 steps or 30 episodes. In the training loop below it appears as through it is episodes, can you confirm this and change the comment in the code if need be ?
  3. The simplification to the final training loop that removes one of the "if" statements as well as a few loop variables that need tracking. I feel fairly strongly that this is a valuable change that will help make this part of the tutorial more pedagogical and easier to digest for our readers when there is less code and variables to keep track of.
  4. A missed "we".

That's it. Once these few comments are addressed we will merge. I'm excited to see it up on the site!

Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment thread docs/tutorials/quantum_reinforcement_learning.ipynb Outdated
Comment on lines +1416 to +1467
"env = gym.make(\"CartPole-v1\")\n",
"\n",
"interaction = namedtuple('interaction', ('state', 'action', 'reward', 'next_state', 'done'))\n",
" \n",
"episode_reward_history = []\n",
"step_count = 0\n",
"for episode in range(n_episodes):\n",
" episode_reward = 0\n",
" state = env.reset()\n",
" done = False\n",
" \n",
" while not done:\n",
" # Increase step count\n",
" step_count += 1\n",
"\n",
" # Interact with env\n",
" state, action, next_state, reward, done = interaction_env(state, model, epsilon, n_actions, env)\n",
" \n",
" # Store last interaction in the replay memory\n",
" sarsd = interaction(np.copy(state), action, reward, np.copy(next_state), float(done))\n",
" replay_memory.append(sarsd)\n",
" \n",
" state = np.array(next_state)\n",
" episode_reward += reward\n",
" \n",
" # Update model\n",
" if step_count % update_VQC == 0 and len(replay_memory) >= batch_size:\n",
" # Sample a batch of interactions\n",
" batch = random.choices(replay_memory, k=batch_size)\n",
" batch = interaction(*zip(*batch))\n",
" \n",
" # Update Q-function\n",
" Q_learning_update(np.asarray(batch.state),\n",
" np.asarray(batch.reward, dtype=np.float32),\n",
" np.asarray(batch.next_state),\n",
" np.asarray(batch.action),\n",
" np.asarray(batch.done, dtype=np.float32),\n",
" model, gamma, n_actions)\n",
" \n",
" # Update target model\n",
" if episode % update_target == 0:\n",
" model_target.set_weights(model.get_weights())\n",
"\n",
" # Decay epsilon\n",
" epsilon = max(epsilon * decay_epsilon, epsilon_min)\n",
" episode_reward_history.append(episode_reward)\n",
" if (episode+1)%10 == 0:\n",
" avg_rewards = np.mean(episode_reward_history[-10:])\n",
" print(\"Episode {}/{}, average last 10 rewards {}\".format(\n",
" episode+1, n_episodes, avg_rewards))\n",
" if avg_rewards >= 500.0:\n",
" break"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to have a second environment interaction function there is no need for this training loop function to remain this complicated. In particular in the last review I proposed this change:

env = gym.make("CartPole-v1")
env.reset()

# Gather initial batch_size replays.
replay_memory += gather_steps(state_bounds, n_actions, model, batch_size, env, False, 1.0)

for episode in range(n_episodes):
    episode_reward = 0
    env.reset()
    
    while True:
        # Gather and store n_steps_update new episode steps.
        trajectories = gather_steps(
            state_bounds, n_actions, model, n_steps_update, env, False, epsilon)
        replay_memory += trajectories

        # Sample batch_size steps and train.
        training_batch = np.random.choice(replay_memory, size=batch_size)
        q_learning_update(
          np.asarray([x['state'] for x in training_batch]),
          np.asarray([x['reward'] for x in training_batch], dtype=np.float32),
          np.asarray([x['next_state'] for x in training_batch]),
          np.asarray([x['action'] for x in training_batch]),
          np.asarray([x['done'] for x in training_batch], dtype=np.float32),
          gamma, n_actions)

        episode_reward += np.sum([x['reward'] for x in trajectories])
    
        # Update target model. Is this correct ?
        if episode % update_target == 0:
            model_target.set_weights(model.get_weights())

        # The episode is finished.
        if trajectories[-1]['done']:
            break

    # Decay epsilon.
    epsilon = max(epsilon * decay_epsilon, epsilon_min)
    episode_reward_history.append(episode_reward)
    if (episode+1)%10 == 0:
        avg_reward = np.mean(episode_reward_history[-10:])
        print("Episode {}/{}, average last 10 rewards {}".format(
            episode+1, n_episodes, avg_reward))
        if avg_reward >= 500.0:
            break

Which entirely eliminates the need for the if step_count % update_VQC ==0 and len(...) >= memory_size, done and the variable state from the training loop, as well as the need for a namedtuple. In the snippet I use gather_step, but this can be replaced with interaction_env and a small change to interaction_env to add a variable number k steps from the environment at a time. I feel that this is a very useful simplification to this training loop and is something worth incorporating for ease of understanding for the readers.

@sjerbi
Copy link
Copy Markdown
Contributor Author

sjerbi commented May 21, 2021

That's it. Once these few comments are addressed we will merge. I'm excited to see it up on the site!

Cool! We're getting there!

2\. Clarifying whether the `update_target` variable is reffering to 30 steps or 30 episodes. In the training loop below it appears as through it is episodes, can you confirm this and change the comment in the code if need be ?

You are right in pointing out that something is wrong, it's supposed to be every 30 steps.

Given this, doesn't it make sense that the model updates and target-model updates are taken on the same footing, i.e. both implemented with similar if conditions? Otherwise steps_per_update_target has to be a multiple of steps_per_update and we have to keep track of how many model updates have been done, right?

Which entirely eliminates the need for the if step_count % update_VQC ==0 and len(...) >= memory_size, done and the variable state from the training loop, as well as the need for a namedtuple. In the snippet I use gather_step, but this can be replaced with interaction_env and a small change to interaction_env to add a variable number k steps from the environment at a time. I feel that this is a very useful simplification to this training loop and is something worth incorporating for ease of understanding for the readers.

You'll see that I nevertheless made the training loop a bit easier to parse: got rid of the done and the namedtupe, moved variables to interact_env, and removed the len(...) >= memory_size condition which wasn't doing much anyway. Let me know if you like this version, otherwise we can try to find a nice way to remove the if step_count % steps_per_update == 0 condition entirely.

Copy link
Copy Markdown
Collaborator

@MichaelBroughton MichaelBroughton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MichaelBroughton MichaelBroughton merged commit ec232e6 into tensorflow:master May 28, 2021
jaeyoo pushed a commit to jaeyoo/quantum that referenced this pull request Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants