Quantum reinforcement learning tutorial#541
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
|
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
|
@googlebot I signed it! |
|
@MichaelBroughton seems like the |
There was a problem hiding this comment.
Thanks for opening the PR, this is a good first pass. I think we will need to make some modifications before it's ready to merge. I didn't go into a ton of detail on this review, but tried to stick to more thematic points. Here are the major points where I think we need to improve:
- Change of language to 2nd person everywhere. "We will show" -> "You can do". Also keeping the language a little more brief and to the point (at the expense of some specificity) appeals a lot to the more hands on audience.
- The ordering of the sections seems a little off, I think after you show the visual of your re-uploading PQC circuit it makes a lot more sense to then go into immediately constructing that circuit in Cirq so the reader has a better feel for what's happening while it's fresh in their mind. From there the transition to building up a Keras layer around that circuit feels much more natural.
- It looks like the Keras layers you are using could be combined down into just one layer, which would also simplify a lot of the model construction code as well.
- The training loops are pretty complicated. It might make sense to try and break them down into simpler helpers that are easier to read so that one can follow the basic steps of "Ok now we're gathering trajectories from the environment with our agent" vs "Ok now we're updating our agent using the trajectories we just gathered" etc. The TF-Agens folks do a pretty good job of this in their REINFORCE tutorial: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial which could be a good template to follow in this case.
- There are some performance gotchas. We shouldn't recommend users disable cores to get better performance, and in a case like this one I don't think we should be seeing worse performance the more cores we add so we might want to take a look into why that's happening. Also, every time we push changes to this repo we check that the tutorials still run to completion, we can't have this one go for several hours :). Using things like
model.compileand calling into your agent from python one time withbatch_size==batch_sizeinstead of calling into the agent from pythonbatch_sizetimes withbatch_size==1will also certainly help too.
Once we tackle all of these I think we will have a pretty awesome QRL tutorial on our hands!
Re: the pip install for gym not working on CI, if that isn't working you can add an install line into ci_validate_tutorials.sh to add gym ( https://github.com/tensorflow/quantum/blob/master/scripts/ci_validate_tutorials.sh )
|
Is this notebook ready for another round of review ? |
|
@MichaelBroughton Yes, I've changed the language, restructured and detailed some of the code, as you suggested. Regarding the performance catch, I've added "dummy" cells that make the notebook run in a flash when the only purpose is to check that the code runs properly. If one wants to train agents until optimal performance, more compute time is needed, and I don't think that it can be improved much without big code changes. You can have a look at my replies on the comments you left during your previous review. |
MichaelBroughton
left a comment
There was a problem hiding this comment.
Things are looking good! I like how the structure of the tutorial has changed. It feels like the user has a much more natural path through understanding the material and the language is a bit more digestible. That being said I think there are still some large fixups needed in the code before we can merge. At a high level:
- Certain patterns like the ones below should be removed from the code:
n = 5
qubits = [cirq.GridQubit(1, i) for i in range(n)]Can be problematic because n doesn't really tell the user anything. A more informative variable name would be n_qubits. In addition to this there are more pythonic ways of constructing qubits using the built in functions in cirq with something like:
qubits = cirq.GridQubit.rect(1, n_qubits)OR
L = 3
theta_dim = 3 * L + whatever
Where L doesn't follow the snake case convention and in fact the overall logic of the code itself can be simplified if variables were removed in place of more pythonic constructs.
A good example of something in this category that could probably be done with relative ease is: instead of defining a second interaction_env function try to make the first one general enough to support both RL use cases (also might be done with the cell the re-instantiates the optimizers for a second time).
- We should be keeping an eye to performance patterns and best practices for TF code. I went over the Reinforce section fairly closely and found several large areas where the code could not only be shortened to fit more natural TF paradigms, but also see a speed boost. If we want to forgo using common patterns like:
@tf.function
def my_update_step(stuff)
.
.
.We should be sure that it is for good reason. When switching to these common TF patterns I was able to take the runtime for 500 episodes from ~50 mins on colab (this version) down to ~15 minutes ( qrl.ipynb.tar.gz and incorporates most of the comments from above). Could you please do the same for the 2nd RL example you give in the tutorial and make sure we aren't leaving any performance on the table or forgoing good TF style when we don't have to ?
- Right now there is one picture and two plots in the tutorial. Is there any chance we could add a few more visuals ? Maybe an animated gif at the top of cartpole ? A schematic diagram of how the state information is fed the QC into the agent and back (outside of just the circuit structure) ? Maybe make a gif using the
env.render()to let the user visualize their agents performance ?
|
Thanks for the second review @MichaelBroughton. Cool that you managed to get the policy-gradient code to run faster. In my opinion, this makes the code a bit harder to parse, but I guess speed is an important concern. I've implemented the changes you suggested (including the correction I mention above in the Regarding the Q-learning code, I don't see how it can be sped-up without changing the implementation compared to our paper. We can't run episodes in parallel as the updates of the model happen at the interaction-step level and not episode-level. Maybe changing the I don't think I will have time to look into this tutorial further in the upcoming weeks. Would you be ok with calling this last version final and putting it online? |
|
@MichaelBroughton I checked, the |
I don't think we're good to go in this case for a couple of reasons.
# Update model
if step_count % update_VQC == 0 and len(replay_memory) >= batch_size:
# Compute loss
t = time.time()
tape, loss = Q_learning_loss(replay_memory, batch_size, loss_function, model, model_target,
gamma, n_actions)
# Backpropagation
grads = tape.gradient(loss, model.trainable_variables)
for optimizer, w in zip([optimizer_in, optimizer_var, optimizer_out], [w_in, w_var, w_out]):
optimizer.apply_gradients([(grads[w], model.trainable_variables[w])])
t2 = time.time()
print('Update time:', t2 - t)Across training with your implementation this averaged around 0.08 (s) per update in a defalut google colab runtime. Adding: @tf.function
def Q_learning_update2(states, rewards, next_states, actions, done, gamma, n_actions):
"""Update model."""
states = tf.convert_to_tensor(states)
rewards = tf.convert_to_tensor(rewards)
next_states = tf.convert_to_tensor(next_states)
actions = tf.convert_to_tensor(actions)
done = tf.convert_to_tensor(done)
# Compute their target q_values and the masks on sampled actions
future_rewards = model_target([next_states])
target_q_values = rewards + (gamma * tf.reduce_max(future_rewards, axis=1)
* (1.0 - done))
masks = tf.one_hot(actions, n_actions)
# Train the model on the states and target Q-values
with tf.GradientTape() as tape:
tape.watch(model.trainable_variables)
q_values = model([states])
# Apply the masks to the Q-values
q_values_masked = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
# Calculate loss between target Q-values and model Q-values
loss = tf.keras.losses.Huber()(target_q_values, q_values_masked)
# Backpropagation
grads = tape.gradient(loss, model.trainable_variables)
for optimizer, w in zip([optimizer_in, optimizer_var, optimizer_out], [w_in, w_var, w_out]):
optimizer.apply_gradients([(grads[w], model.trainable_variables[w])])Took this time down to 0.04 (s) per update, with more time spent interacting with the environment vs updating the model ( as one would expect). So instead of a ~3-6x slowdown it was a factor of ~1.5-2x faster, even with The likely reason that just decorating the existing implementation with @tf.function caused a slowdown is due to this warning
On the topic of the v3.1 changes: I think things are really starting to come together nicely in this tutorial, but the second half still seems unnecessarily complex and could use another pass based on the 2nd round of feedback I left (for example there are still two different environment interaction functions that could likely be simplified down to just one , along with some style inconsistencies). I think adressing these points and then one final round of thorough review would put us in good shape to merge the tutorial and put it up live on the website.
I don't think we're ready just yet. If time is of the essence, I'd be happy to chip into cleaning up some of the code myself and getting review some someone like @zaqqwerty or @lamberta if we just want to get this merged and live ASAP. |
That's exactly the function I tested (I checked again to make sure I wasn't missing on something). Running it on my laptop (i.e., allowing multiprocessing), I get 0.05s per update for the current implementation v.s. 0.35s for yours.
As I told you in a previous comment, the environment interaction function for Q-learning cannot be made to play episodes in parallel (as you did for the policy gradient approach), since the updates of the model happen during episodes rather than between batches episodes. i.e., the model used for episode
All this being said, I don't object for extra help if you deem changes necessary. |
|
Ok, I took a quick pass over the second half of the tutorial and refactored the code so that both algorithms can make use of the environment interaction function (which also needed a rewrite to support this). Overall the tutorial is now shorter, which is good. Here it is: I'd be happy to move forward to merging this implementation with some minimal changes if need be for anything I might've missed or you feel very strongly about. Since we are going to be close to merging now, would you be able to also update the following in your PR:
|
|
@MichaelBroughton thanks for going through the second part of the tutorial again.
Sooo, at first you wanted to avoid
For these reasons, I implemented all your other changes but strongly feel that we shouldn't be trying to merge the interaction functions.
These changes I implemented.
I kept the part of the DateReuploading docstring that wasn't explained elsewhere (and that I think a user defining a new custom layer or adapting this one will definitely want to have).
Did all this. |
MichaelBroughton
left a comment
There was a problem hiding this comment.
Sooo, at first you wanted to avoid batch_size=1 evaluations, and now to avoid having two functions that interact with the environment you ended up using only batch_size=1 evaluations again? while keeping two functions? (although now one after the other) I strongly feel that this is not what we want for this tutorial, for all the following reasons: The code for gather_steps is quite hard to parse and uses only batch_size=1 (if it's hard to parse, at least it should be for a performance gain..). If we're using batch_size=1 evaluations, we may as well go back to the interaction_step function for both parts and avoid using more complicated functions that basically do the same thing.
The implementation with the stochastic variable makes no sense without the context of Q-learning, which is only introduced later
A user trying to understand the Q-learning code has to basically go back and forth between the beginning and the end of the tutorial in order to make sense of the function calls
The gain in length is really not that big, relatively to the gain in clarity and ease of navigation with two separate interaction functions.
Also makes the code for the whitepaper longer
For these reasons, I implemented all your other changes but strongly feel that we shouldn't be trying to merge the interaction functions.
Fair enough. I had figured we ought to at least see what having a single environment interaction function looks like. If you feel that strongly we should keep two separate ones that's fine by me.
In this last round of review I have:
- Changing
update_VQCto use snake case. - Clarifying whether the
update_targetvariable is reffering to 30 steps or 30 episodes. In the training loop below it appears as through it is episodes, can you confirm this and change the comment in the code if need be ? - The simplification to the final training loop that removes one of the "if" statements as well as a few loop variables that need tracking. I feel fairly strongly that this is a valuable change that will help make this part of the tutorial more pedagogical and easier to digest for our readers when there is less code and variables to keep track of.
- A missed "we".
That's it. Once these few comments are addressed we will merge. I'm excited to see it up on the site!
| "env = gym.make(\"CartPole-v1\")\n", | ||
| "\n", | ||
| "interaction = namedtuple('interaction', ('state', 'action', 'reward', 'next_state', 'done'))\n", | ||
| " \n", | ||
| "episode_reward_history = []\n", | ||
| "step_count = 0\n", | ||
| "for episode in range(n_episodes):\n", | ||
| " episode_reward = 0\n", | ||
| " state = env.reset()\n", | ||
| " done = False\n", | ||
| " \n", | ||
| " while not done:\n", | ||
| " # Increase step count\n", | ||
| " step_count += 1\n", | ||
| "\n", | ||
| " # Interact with env\n", | ||
| " state, action, next_state, reward, done = interaction_env(state, model, epsilon, n_actions, env)\n", | ||
| " \n", | ||
| " # Store last interaction in the replay memory\n", | ||
| " sarsd = interaction(np.copy(state), action, reward, np.copy(next_state), float(done))\n", | ||
| " replay_memory.append(sarsd)\n", | ||
| " \n", | ||
| " state = np.array(next_state)\n", | ||
| " episode_reward += reward\n", | ||
| " \n", | ||
| " # Update model\n", | ||
| " if step_count % update_VQC == 0 and len(replay_memory) >= batch_size:\n", | ||
| " # Sample a batch of interactions\n", | ||
| " batch = random.choices(replay_memory, k=batch_size)\n", | ||
| " batch = interaction(*zip(*batch))\n", | ||
| " \n", | ||
| " # Update Q-function\n", | ||
| " Q_learning_update(np.asarray(batch.state),\n", | ||
| " np.asarray(batch.reward, dtype=np.float32),\n", | ||
| " np.asarray(batch.next_state),\n", | ||
| " np.asarray(batch.action),\n", | ||
| " np.asarray(batch.done, dtype=np.float32),\n", | ||
| " model, gamma, n_actions)\n", | ||
| " \n", | ||
| " # Update target model\n", | ||
| " if episode % update_target == 0:\n", | ||
| " model_target.set_weights(model.get_weights())\n", | ||
| "\n", | ||
| " # Decay epsilon\n", | ||
| " epsilon = max(epsilon * decay_epsilon, epsilon_min)\n", | ||
| " episode_reward_history.append(episode_reward)\n", | ||
| " if (episode+1)%10 == 0:\n", | ||
| " avg_rewards = np.mean(episode_reward_history[-10:])\n", | ||
| " print(\"Episode {}/{}, average last 10 rewards {}\".format(\n", | ||
| " episode+1, n_episodes, avg_rewards))\n", | ||
| " if avg_rewards >= 500.0:\n", | ||
| " break" |
There was a problem hiding this comment.
If we are going to have a second environment interaction function there is no need for this training loop function to remain this complicated. In particular in the last review I proposed this change:
env = gym.make("CartPole-v1")
env.reset()
# Gather initial batch_size replays.
replay_memory += gather_steps(state_bounds, n_actions, model, batch_size, env, False, 1.0)
for episode in range(n_episodes):
episode_reward = 0
env.reset()
while True:
# Gather and store n_steps_update new episode steps.
trajectories = gather_steps(
state_bounds, n_actions, model, n_steps_update, env, False, epsilon)
replay_memory += trajectories
# Sample batch_size steps and train.
training_batch = np.random.choice(replay_memory, size=batch_size)
q_learning_update(
np.asarray([x['state'] for x in training_batch]),
np.asarray([x['reward'] for x in training_batch], dtype=np.float32),
np.asarray([x['next_state'] for x in training_batch]),
np.asarray([x['action'] for x in training_batch]),
np.asarray([x['done'] for x in training_batch], dtype=np.float32),
gamma, n_actions)
episode_reward += np.sum([x['reward'] for x in trajectories])
# Update target model. Is this correct ?
if episode % update_target == 0:
model_target.set_weights(model.get_weights())
# The episode is finished.
if trajectories[-1]['done']:
break
# Decay epsilon.
epsilon = max(epsilon * decay_epsilon, epsilon_min)
episode_reward_history.append(episode_reward)
if (episode+1)%10 == 0:
avg_reward = np.mean(episode_reward_history[-10:])
print("Episode {}/{}, average last 10 rewards {}".format(
episode+1, n_episodes, avg_reward))
if avg_reward >= 500.0:
breakWhich entirely eliminates the need for the if step_count % update_VQC ==0 and len(...) >= memory_size, done and the variable state from the training loop, as well as the need for a namedtuple. In the snippet I use gather_step, but this can be replaced with interaction_env and a small change to interaction_env to add a variable number k steps from the environment at a time. I feel that this is a very useful simplification to this training loop and is something worth incorporating for ease of understanding for the readers.
Cool! We're getting there!
You are right in pointing out that something is wrong, it's supposed to be every 30 steps. Given this, doesn't it make sense that the model updates and target-model updates are taken on the same footing, i.e. both implemented with similar
You'll see that I nevertheless made the training loop a bit easier to parse: got rid of the |
bump cibuildwheel version
This is the tutorial on "Parametrized Quantum Circuits for Reinforcement Learning" that we discussed. @MichaelBroughton and others, I'll be happy to take your comments and do some revision.