Thank you for your amazing work. I searched online and found that the CLIP-ViT-L/14@336px model divides an image into 14*14=196 patches, with an embedding dimension of 768. In your work, however, the features coming out of the CLIP visual encoder have shape (576, 1024). Where does this shape come from?
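
For reference, here is a quick arithmetic check of one possible reading. It is only a sketch under two assumptions that are not confirmed above: that the "14" in the model name is the patch side length in pixels (not the number of patches per side), and that 1024 is the transformer's hidden width while 768 is the size of the final projected embedding. The variable names are illustrative only.

```python
# Assumption: "336px" is the input image side and "14" is the patch side, both in pixels.
image_size = 336
patch_size = 14

# Number of patches along one side, then the total number of patch tokens
# (the class token is not counted here).
grid_side = image_size // patch_size
num_patches = grid_side * grid_side

# Assumption: 1024 is the ViT-L hidden width (features taken before the final
# projection); 768 would be the dimension after that projection.
hidden_width = 1024
projected_dim = 768

feature_shape = (num_patches, hidden_width)
print("patches per side:", grid_side)
print("feature shape:", feature_shape)
```

Under these assumptions the patch grid is 24 x 24, which would match the 576 in the observed shape, and 14*14=196 would instead correspond to a 224px input with a patch size of 16.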