Set up a short-term replay buffer.
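A rough sketch of what such a buffer could look like, using only the standard library. The class name `ReplayBuffer`, the transition layout (state, action, reward, next_state, done), and the capacity of 1000 are all assumptions for illustration, not the project's actual design.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of transitions (hypothetical helper)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item once it is full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def __len__(self):
        return len(self._items)

    def add(self, state, action, reward, next_state, done):
        # Assumed transition layout; adapt to whatever the networks consume.
        self._items.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly, grouped by field."""
        batch = self._rng.sample(list(self._items), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones


# Fill past capacity so the 200 oldest transitions are evicted.
buffer = ReplayBuffer(capacity=1000, seed=0)
for step in range(1200):
    buffer.add(step, 0, 1.0, step + 1, False)

states, actions, rewards, next_states, dones = buffer.sample(64)
```

Sampling uniformly from recent transitions, rather than training on each step as it arrives, is what makes a mini-batch update possible.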

Need to be able to do batched (per-sample) conditionals in TensorFlow.
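One possible way to do this is `tf.where`, which selects element-wise between two tensors based on a boolean mask, so every sample in the batch takes its own branch. The sketch below assumes TensorFlow 2.x with eager execution; the function name `td_targets` and the terminal-state use case are illustrative assumptions, not something these notes specify.

```python
import tensorflow as tf


def td_targets(rewards, next_values, done, gamma):
    """Per-sample conditional: bootstrap unless the transition is terminal."""
    bootstrapped = rewards + gamma * next_values
    # Both branches are computed for the whole batch; the mask picks per row.
    return tf.where(done, rewards, bootstrapped)


rewards = tf.constant([1.0, 0.0, 2.0])
next_values = tf.constant([10.0, 20.0, 30.0])
done = tf.constant([False, True, False])

targets = td_targets(rewards, next_values, done, gamma=0.5)
```

Because both branches are evaluated for every row, this only suits branches that are cheap and free of side effects.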

We currently aren't calculating the gamma loss with the reward function.
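If "gamma" here means the usual discount factor (an assumption; the notes do not say), the missing piece might resemble a discounted-return calculation over the rewards. The helper name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
```

These returns could then serve as regression targets for the critic's loss.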

Add a replay_memory to the subcritic network instead of the polynomial critic network. 

Use a mini-batch of 64 instead of 1 (switch from online to mini-batch updates).