Accessing image x text features

I'm interested in inspecting the image x text features produced during the fusion step. However, when I extract them, the multi-scale image features appear to be concatenated along the token dimension, giving an approximate shape of (B, 10000+, 256). The number of image patches isn't a perfect square, so I can't just reshape the tensor to (B, H, W, 256). How can I split it back into the individual multi-scale feature maps?
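To make the question concrete, here is a minimal NumPy sketch of the kind of split I have in mind. It rests on assumptions I haven't confirmed: that each level is flattened row-major, that the levels are concatenated in order along dimension 1, and that the per-level (H, W) sizes can be obtained from somewhere. The names `split_levels` and `spatial_shapes` and the example sizes are hypothetical, not taken from the model code.

```python
import numpy as np

def split_levels(flat, spatial_shapes):
    """Split (B, sum(H_l * W_l), C) into a list of (B, H_l, W_l, C) arrays.

    Assumes each level was flattened row-major and the levels were
    concatenated in the same order as `spatial_shapes` (hypothetical input).
    """
    batch, total, channels = flat.shape
    sizes = [h * w for h, w in spatial_shapes]
    if sum(sizes) != total:
        raise ValueError(f"shapes cover {sum(sizes)} tokens, got {total}")
    # Cumulative offsets mark where each level ends along the token axis.
    offsets = np.cumsum(sizes)[:-1]
    chunks = np.split(flat, offsets, axis=1)
    return [
        chunk.reshape(batch, h, w, channels)
        for chunk, (h, w) in zip(chunks, spatial_shapes)
    ]

# Made-up level sizes, only to exercise the function.
spatial_shapes = [(8, 12), (4, 6), (2, 3)]
num_tokens = sum(h * w for h, w in spatial_shapes)
features = np.zeros((2, num_tokens, 256), dtype=np.float32)
levels = split_levels(features, spatial_shapes)
for level in levels:
    print(level.shape)
```

If that layout assumption holds, the missing piece is where to read the per-level (H, W) values from. The same split should carry over to PyTorch with `torch.split` and `reshape`.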