Mix ViT in_channels restriction

What's the reasoning behind limiting the Mix Visual Transformer encoder (#632) to 3 input channels?

https://github.com/qubvel/segmentation_models.pytorch/blob/master/segmentation_models_pytorch/encoders/mix_transformer.py#L468

I couldn't find anything in the SegFormer paper or the original SegFormer implementation that requires exactly 3 input channels.
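
For context, here is a rough sketch of why the restriction looks unnecessary at the architecture level. It assumes the first stage is a strided-convolution patch embedding with kernel 7, stride 4, padding 3, and an embedding width of 32 (roughly the MiT-B0 first stage). The helper name `patch_embed_shapes` and these defaults are illustrative assumptions, not code from this repository.

```python
def patch_embed_shapes(in_channels, height, width,
                       embed_dim=32, kernel=7, stride=4, padding=3):
    """Return (weight_shape, output_shape) for a strided-conv patch embedding.

    Hypothetical helper: the defaults are assumed values for illustration.
    """
    # Conv weight layout: (out_channels, in_channels, kernel_h, kernel_w).
    weight_shape = (embed_dim, in_channels, kernel, kernel)
    # Standard convolution output-size formula; in_channels does not appear in it.
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return weight_shape, (embed_dim, out_h, out_w)


# Compare a 3-channel (RGB) input with a 4-channel input of the same size.
rgb_weight, rgb_out = patch_embed_shapes(3, 224, 224)
four_weight, four_out = patch_embed_shapes(4, 224, 224)

print(rgb_weight, rgb_out)
print(four_weight, four_out)
```

Under these assumptions only the second dimension of the first convolution's weight depends on the channel count. The feature map passed to the rest of the encoder has the same shape either way, so the open point would be how pretrained weights for that one layer are adapted.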