
Hello authors, Figure 4 only seems to show that the features within each modality are clustered more tightly. That alone does not demonstrate the reduction in the modality gap claimed in the paper, does it? The modality gap is represented by the distance between the two groups of differently colored points, not by how compact each group is.
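To make the distinction concrete, here is a minimal sketch of what I mean. It is my own illustration on synthetic 2-D points, not the paper's metric or data. I assume the gap is measured as the Euclidean distance between the two modality centroids, which is one common choice, and the function names (`modality_gap`, `within_spread`, `shrink_toward_centroid`) are hypothetical.

```python
import numpy as np


def modality_gap(a, b):
    """Euclidean distance between the centroids of two feature sets."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def within_spread(x):
    """Mean distance of each feature to its own modality centroid."""
    return float(np.linalg.norm(x - x.mean(axis=0), axis=1).mean())


def shrink_toward_centroid(x, factor):
    """Pull every point toward its centroid; the centroid itself stays put."""
    c = x.mean(axis=0)
    return c + factor * (x - c)


# Synthetic stand-ins for the two modalities (assumed, not from the paper).
rng = np.random.default_rng(0)
loose_a = np.array([1.0, 0.0]) + 0.5 * rng.standard_normal((500, 2))
loose_b = np.array([-1.0, 0.0]) + 0.5 * rng.standard_normal((500, 2))

# Tighten each cluster without moving it.
tight_a = shrink_toward_centroid(loose_a, 0.2)
tight_b = shrink_toward_centroid(loose_b, 0.2)

print("spread before:", within_spread(loose_a), within_spread(loose_b))
print("spread after: ", within_spread(tight_a), within_spread(tight_b))
print("gap before:   ", modality_gap(loose_a, loose_b))
print("gap after:    ", modality_gap(tight_a, tight_b))
```

In this toy setup the within-modality spread drops by the shrink factor while the centroid distance stays the same, so a plot of the "after" points would look cleaner even though the gap has not changed. Reporting a between-modality distance of this kind alongside Figure 4 would make the claim easier to verify.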