Reproducing the performance of (CLIP w/ RN-50) and (CLIP w/ EN-B5)

Hello, thank you for the great work and for publishing the code!

Is there a quick way to use your released codebase to reproduce the results of (CLIP w/ RN-50) and (CLIP w/ EN-B5)? For example, for image classification, can I take an RN-50 or EN-B5 checkpoint (e.g., pre-trained on ImageNet) and fine-tune it directly on the VinDr/RSNA training sets to obtain those numbers? Thanks!