Why not merge BA during training?

During training, the low-rank matrices form a bypass alongside **_W_**, which adds latency to every forward pass. So why not merge the low-rank product *BA* directly into **_W_** during training, just like we do at inference time? I'm not sure whether doing so would affect gradient flow.
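
To make the gradient-flow part of the question concrete, here is a minimal toy check in plain Python. Everything in it is an assumption made for illustration: the helper names (`bypass_forward`, `merged_forward`, `grad_B_entry`), the tiny matrix sizes, and the sum-of-squares loss are hypothetical and not taken from any LoRA implementation. The "merged" variant here rebuilds *BA* from the current *A* and *B* on every forward call; it does not overwrite **_W_** in place.

```python
# Toy comparison of two forward passes for a LoRA-style layer.
# All names and sizes are illustrative assumptions, not a library API.

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def add(P, Q):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def bypass_forward(W, B, A, x):
    # h = W x + B (A x): the low-rank path runs next to the frozen weight.
    return add(matmul(W, x), matmul(B, matmul(A, x)))

def merged_forward(W, B, A, x):
    # h = (W + B A) x: B A is recomputed from the current A and B each call.
    return matmul(add(W, matmul(B, A)), x)

def loss(forward, W, B, A, x):
    # Sum of squared outputs, chosen only to have something to differentiate.
    return sum(v * v for row in forward(W, B, A, x) for v in row)

def grad_B_entry(forward, W, B, A, x, i, j, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(B[i][j])."""
    B_plus = [row[:] for row in B]
    B_minus = [row[:] for row in B]
    B_plus[i][j] += eps
    B_minus[i][j] -= eps
    return (loss(forward, W, B_plus, A, x)
            - loss(forward, W, B_minus, A, x)) / (2 * eps)

# Hypothetical sizes: d = 2 outputs, k = 3 inputs, rank r = 1.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen weight, d x k
B = [[0.5], [-1.0]]                      # trainable, d x r
A = [[1.0, 0.0, 2.0]]                    # trainable, r x k
x = [[1.0], [2.0], [3.0]]                # one input column, k x 1

h_bypass = bypass_forward(W, B, A, x)
h_merged = merged_forward(W, B, A, x)
g_bypass = grad_B_entry(bypass_forward, W, B, A, x, 0, 0)
g_merged = grad_B_entry(merged_forward, W, B, A, x, 0, 0)

print(h_bypass, h_merged)
print(g_bypass, g_merged)
```

In this sketch both variants compute the same function of *A* and *B*, so the outputs and the estimated gradient with respect to *B* agree. It says nothing about latency or memory, and it does not cover the case where *BA* is added into **_W_** in place, after which *A* and *B* would no longer exist as separate parameters to receive gradients.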