Improve memory efficiency of the distributed wrappers

For example, the distributed loss wrapper computes the loss over the global batch, but keeps track of only the local gradients. It would be more memory efficient to compute only the part of the loss that is relevant to the local batch, by making use of ```indices_tuple```.
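
A rough sketch of the idea, not the wrapper's actual code. It assumes ```indices_tuple``` is a triplet-style tuple of (anchor, positive, negative) index lists into the global batch, and that each rank holds an equal-sized contiguous slice of that batch. The helper name ```filter_indices_tuple``` and its parameters are hypothetical, and plain Python lists stand in for tensors.

```python
def filter_indices_tuple(indices_tuple, rank, local_batch_size):
    """Keep only the tuples whose anchor belongs to this rank's local batch.

    Assumption: rank r owns global indices
    [r * local_batch_size, (r + 1) * local_batch_size).
    """
    start = rank * local_batch_size
    end = start + local_batch_size
    anchors = indices_tuple[0]
    # Positions of the tuples whose anchor is a local sample.
    keep = [i for i, a in enumerate(anchors) if start <= a < end]
    # Apply the same selection to every component (anchor, positive, negative).
    return tuple([idx[i] for i in keep] for idx in indices_tuple)


# Global batch of 4 samples split across 2 ranks (2 samples each).
global_triplets = ([0, 1, 2, 3], [1, 0, 3, 2], [2, 3, 0, 1])
local_triplets = filter_indices_tuple(global_triplets, rank=1, local_batch_size=2)
print(local_triplets)
```

With this filtering, each rank would evaluate only the tuples anchored in its own samples, while positives and negatives can still come from anywhere in the global batch.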