During training, the low-rank matrices A and B form a bypass alongside the frozen weight W, which adds some extra computation. So why not merge the low-rank update into W during training, the same way we do at inference time? Would merging like that affect gradient flow?
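For context, here is a minimal sketch (assuming a PyTorch-style setup) of what the question refers to: a LoRA-style linear layer with the low-rank bypass used during training and the merge typically done at inference. The class name `LoRALinear` and the hyperparameters `r` and `alpha` are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-style layer: frozen W plus a trainable low-rank bypass."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W (not updated during fine-tuning)
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # Trainable low-rank factors A and B (the "bypass")
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Training-time forward: frozen path plus the low-rank bypass.
        # Only lora_A and lora_B receive gradients; W stays fixed.
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merge(self):
        # Inference-time merge: fold B·A into W so the bypass (and its extra
        # compute) disappears from the forward pass.
        self.weight += (self.lora_B @ self.lora_A) * self.scaling
```

The merge step is wrapped in `torch.no_grad()` because, once B·A is folded into W, A and B no longer exist as a separate path in the forward computation, which is the crux of the question about doing this during training.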