Need a way to do batched conditionals in TensorFlow (an element-wise select such as tf.where, rather than a per-example Python if).
We currently aren't calculating the gamma loss with the reward function.
Add a replay_memory to the subcritic network instead of the polynomial critic network.
Use a mini-batch of 64 instead of 1 (switch from online updates to mini-batch updates).
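
A minimal sketch of what the replay memory and the mini-batch sampling could look like, using only the Python standard library. The names ReplayMemory, Transition, push, sample and BATCH_SIZE are hypothetical and not taken from the existing code, and the transition fields are an assumption about what the subcritic network needs; wiring the sampled batch into the network is left out.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition layout; the real fields depend on the subcritic's inputs.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)

BATCH_SIZE = 64  # mini-batch size from the note above


class ReplayMemory:
    """Fixed-capacity buffer that evicts the oldest transitions first."""

    def __init__(self, capacity):
        # deque with maxlen drops items from the left once it is full.
        self._buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        """Store a single transition."""
        self._buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size=BATCH_SIZE):
        """Return batch_size distinct transitions chosen uniformly at random."""
        if batch_size > len(self._buffer):
            raise ValueError("not enough transitions stored to fill a batch")
        # Copy to a list so random.sample indexes in O(1) per element.
        return random.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

In a training loop, each step would call push(...) once, and an update would run only when len(memory) >= BATCH_SIZE, using memory.sample() in place of the single most recent transition.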