Line 400 in 508ffa3: retrieve_logits = logit_scale * torch.matmul(sequence_output, visual_output.t())
if self.training:
    # Gather features from every GPU so each rank holds the full global batch.
    visual_output = allgather(visual_output, self.task_config)
    video_mask = allgather(video_mask, self.task_config)
    sequence_output = allgather(sequence_output, self.task_config)
    torch.distributed.barrier()

# Mean-pool frame features into a single normalized video embedding.
visual_output = visual_output / visual_output.norm(dim=-1, keepdim=True)
visual_output = self._mean_pooling_for_similarity_visual(visual_output, video_mask)
visual_output = visual_output / visual_output.norm(dim=-1, keepdim=True)

sequence_output = sequence_output.squeeze(1)
sequence_output = sequence_output / sequence_output.norm(dim=-1, keepdim=True)

# Full global similarity matrix: (world_bs, world_bs), computed on every GPU.
logit_scale = self.clip.logit_scale.exp()
retrieve_logits = logit_scale * torch.matmul(sequence_output, visual_output.t())
The current code appears to compute the loss on the full global similarity matrix on every GPU. Computing logits only between each GPU's local features and the gathered global features, as described in openai/CLIP#132, would be more computationally and memory efficient: each rank would hold a (local_bs, world_bs) logits matrix instead of (world_bs, world_bs). A sketch of that scheme follows below.
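For reference, here is a minimal sketch of the local-vs-global scheme from openai/CLIP#132, not this repo's implementation. It reuses allgather and task_config from the snippet above, assumes features are already normalized and pooled as there, and assumes equal per-GPU batch sizes; local_global_loss and the label construction are illustrative names.

import torch
import torch.nn.functional as F

def local_global_loss(sequence_output, visual_output, logit_scale, task_config):
    # Gather candidate features from all GPUs (assumes allgather is the
    # repo's differentiable AllGather, so gradients reach the local shard).
    all_sequence = allgather(sequence_output, task_config)  # (world_bs, d)
    all_visual = allgather(visual_output, task_config)      # (world_bs, d)

    # Score only the LOCAL rows against the GLOBAL columns:
    # (local_bs, world_bs) logits per GPU instead of (world_bs, world_bs).
    t2v = logit_scale * sequence_output @ all_visual.t()
    v2t = logit_scale * visual_output @ all_sequence.t()

    # Each local sample matches its own slot in the gathered batch
    # (assumes every rank contributes the same local batch size).
    local_bs = sequence_output.size(0)
    offset = torch.distributed.get_rank() * local_bs
    labels = torch.arange(local_bs, device=sequence_output.device) + offset

    # Symmetric InfoNCE over both retrieval directions.
    return 0.5 * (F.cross_entropy(t2v, labels) + F.cross_entropy(v2t, labels))

Note that for this to train correctly, allgather must backpropagate to each rank's local slice (as the repo's AllGather autograd op appears to), with DDP averaging gradients across ranks afterward.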
Sorry to bother you if I have misunderstood the code.