CLIP loss
# b - batch size
# d - feature dimension
# t - learned temperature parameter
# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[b, h, w, c] - minibatch of aligned images
# T[b, l] - minibatch of aligned texts

# extract feature representations of each modality
F_i = image_encoder(I)  # [b, d]
F_t = text_encoder(T)   # [b, d]

# scaled pairwise cosine similarities [b, b]
sim = cosine_similarity(F_i, F_t) * np.exp(t)

# symmetric loss function: the i-th image is paired with the i-th text,
# so the target class for row i (and column i) is index i
labels = np.arange(b)
loss_i = cross_entropy_loss(sim, labels)   # image -> text
loss_t = cross_entropy_loss(sim.T, labels) # text -> image
loss = (loss_i + loss_t) / 2
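
Below is a minimal, runnable PyTorch sketch of the same symmetric loss, useful for checking the pseudocode against real tensors. The function name clip_loss, the logit_scale parameter (standing in for the learned temperature t), and the random feature tensors are illustrative assumptions, not part of the original gist; the encoders are assumed to already return [b, d] features.

import torch
import torch.nn.functional as F

def clip_loss(image_features: torch.Tensor,
              text_features: torch.Tensor,
              logit_scale: torch.Tensor) -> torch.Tensor:
    # L2-normalise so the dot products below are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # scaled pairwise cosine similarities, shape [b, b]
    logits = logit_scale.exp() * image_features @ text_features.T

    # the i-th image matches the i-th text, so the targets are 0..b-1
    labels = torch.arange(logits.size(0), device=logits.device)

    loss_i = F.cross_entropy(logits, labels)    # image -> text
    loss_t = F.cross_entropy(logits.T, labels)  # text -> image
    return (loss_i + loss_t) / 2

# Example with random features as stand-ins for encoder outputs
b, d = 8, 512
img = torch.randn(b, d)
txt = torch.randn(b, d)
logit_scale = torch.nn.Parameter(torch.tensor(2.6593))  # log(1/0.07), the CLIP init
print(clip_loss(img, txt, logit_scale))

Keeping the scale as a learnable log-parameter (exponentiated before use) mirrors how CLIP handles the temperature: it stays positive and can be clamped during training to avoid instability.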