Open
Description
If we can distill or prune NLLB-200 shortly after starting fine-tuning, we may be able to dramatically reduce the training and inference time needed (by 50% or more). The workflow could look something like this (a rough pruning sketch follows the list):
- Take the 3.3B-parameter model and train for 1000 steps on the 2x A100s. Prune and save.
- Load the pruned model on a single A100 and finish training and inference.
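
A minimal sketch of the prune-and-save step, assuming the Hugging Face `facebook/nllb-200-3.3B` checkpoint and PyTorch's built-in magnitude pruning utilities; the 30% sparsity target is a placeholder, not a measured setting:

```python
# Hypothetical sketch: magnitude-prune NLLB-200 after a short fine-tuning warm-up,
# then save a checkpoint that can be reloaded on a single A100.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")

# ... fine-tune for ~1000 steps on the 2x A100 setup here ...

# Apply L1-unstructured magnitude pruning to every linear layer's weights.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)  # assumed 30% sparsity
        # Make the pruning permanent so the saved checkpoint contains plain tensors.
        prune.remove(module, "weight")

# Save the pruned model for the single-A100 stage of training and inference.
model.save_pretrained("nllb-200-pruned")
```

Note that unstructured pruning only zeroes weights; by itself it does not shrink the dense tensors or speed up inference. Realizing the 50%+ reduction would likely require structured pruning, sparse kernels, or distillation into a smaller architecture.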
Metadata
Status: 📋 Backlog