CLIP Model and The Importance of Multimodal Embeddings | by Fahim Rustamy, PhD | Dec, 2023

Fahim Rustamy, PhD

Towards Data Science

10 min read

Dec 11, 2023

CLIP, which stands for Contrastive Language-Image Pretraining, is a deep learning model developed by OpenAI in 2021. CLIP's embeddings for images and text share the same space, enabling direct comparisons between the two modalities. This is achieved by training the model to bring related images and texts closer together while pushing unrelated ones apart.

Some applications of CLIP include:

  1. Image Classification and Retrieval: CLIP can be used for image classification tasks by associating images with natural language descriptions. It allows for more versatile and flexible image retrieval systems where users can search for images using textual queries.
  2. Content Moderation: CLIP can be used to moderate content on online platforms by analyzing images and accompanying text to identify and filter out inappropriate or harmful content.
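
The retrieval use case above can be sketched in a few lines. This is a minimal illustration, not CLIP itself: the embeddings are random stand-ins, the function name rank_images is hypothetical, and the "query" is simulated as a noisy copy of one image embedding. It only shows that, once images and text live in one space, search reduces to ranking by cosine similarity.

```python
import torch
import torch.nn.functional as F

def rank_images(text_embedding: torch.Tensor, image_embeddings: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return the indices of the k images most similar to a text query (hypothetical helper)."""
    text_embedding = F.normalize(text_embedding, dim=-1)
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    scores = image_embeddings @ text_embedding  # cosine similarities, shape [num_images]
    return torch.topk(scores, k=k).indices

torch.manual_seed(0)
image_embeddings = torch.randn(10, 512)  # stand-in for encoded images
# Stand-in for an encoded caption that describes image 7
text_embedding = image_embeddings[7] + 0.1 * torch.randn(512)

top = rank_images(text_embedding, image_embeddings)
print(top)
```

In a real system the two tensors would come from the trained image and text encoders rather than from a random generator.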

The original CLIP model aimed to unite image and text modalities within a shared embedding space. This concept, along with its techniques, extends beyond images and text to embrace other modalities. Netflix, in this blog post, trained a model by combining video and text modalities in the common embedding space to enhance search within video applications. Contrastive Language-Audio Pretraining (CLAP) is another model that integrates text and audio modalities within the same embedding space, making it valuable for improving search functionalities within audio applications.

The underlying technology for CLIP is very simple but very powerful, opening the door for many multi-modal machine learning techniques. Meta AI recently released ImageBind, which learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. CLIP, the first large-scale AI model that accepts two modalities, is a prerequisite to understanding ImageBind and other multi-modality AI systems.

ImageBind from Meta AI accepts six different modalities as input (taken from ImageBind's official GitHub page).

What is CLIP

CLIP is designed to predict which of the N × N possible (image, text) pairings within the batch are actual matches. To achieve this, CLIP establishes a multi-modal embedding space through the joint training of an image encoder and a text encoder. The CLIP loss aims to maximize the cosine similarity between the image and text embeddings for the N genuine pairs in the batch while minimizing the cosine similarity for the N² − N incorrect pairings. The optimization process uses a symmetric cross-entropy loss function that operates on these similarity scores. The following presents pseudocode (taken from the original paper) outlining the core implementation of CLIP.

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
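
The pseudocode leaves image_encoder, text_encoder, l2_normalize, and cross_entropy_loss undefined. The sketch below is one way to make it executable in NumPy, under loud assumptions: the encoder outputs I_f and T_f are replaced by random features, the sizes are arbitrary, the helper implementations are my own, and t is fixed at log(1/0.07) instead of being learned.

```python
import numpy as np

def l2_normalize(x, axis=1):
    # Scale each row to unit Euclidean length
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy_loss(logits, labels, axis):
    # Numerically stable log-softmax along the requested axis
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    idx = np.arange(len(labels))
    # axis=1: entry (i, labels[i]) per row; axis=0: entry (labels[j], j) per column
    picked = log_probs[idx, labels] if axis == 1 else log_probs[labels, idx]
    return -picked.mean()

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 16, 12, 4      # arbitrary illustrative sizes
I_f = rng.normal(size=(n, d_i))      # stand-in for image_encoder(I)
T_f = rng.normal(size=(n, d_t))      # stand-in for text_encoder(T)
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))
t = np.log(1 / 0.07)                 # fixed here; learned in CLIP

I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
logits = np.dot(I_e, T_e.T) * np.exp(t)

labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t) / 2
print(logits.shape, float(loss))
```

Because the features are random, the printed loss has no meaning beyond confirming that the shapes line up.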

Here's a step-by-step description of each line in the pseudocode and its implementation using PyTorch:

Model Architecture:

CLIP uses two separate architectures as the backbone for encoding vision and text datasets:

  • image_encoder: Represents the neural network architecture (e.g., ResNet or Vision Transformer) responsible for encoding images.
  • text_encoder: Represents the neural network architecture (e.g., CBOW, BERT, or Text Transformer) responsible for encoding textual information.

The original CLIP model was trained from scratch, without initializing the image encoder and the text encoder with pre-trained weights, because of the large volume of the dataset (400 million image-text pairs) used to train it. In the example in this blog post, we'll do things a bit differently. We'll start with pre-trained weights from resnet (for images) and distilbert (for text) models to initialize these parts.

Architecture of CLIP model (taken from the original paper)

Input Data:

The model takes a batch of n pairs of images and texts as input where:

  • I[n, h, w, c]: Represents a minibatch of aligned images, where n is the batch size, h is the image height, w is the image width, and c is the number of channels.
  • T[n, l]: Represents a minibatch of aligned texts, where n is the batch size, and l is the length of the textual sequence.

One batch of image and caption pairs for a batch size of 128
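
One practical detail: the pseudocode writes images as [n, h, w, c], while PyTorch vision models such as torchvision's ResNet expect channels first, [n, c, h, w]. A minimal sketch of the two input tensors with dummy data follows; the batch size of 4 and the vocabulary size of 1000 are made-up values for illustration only.

```python
import torch

n, h, w, c = 4, 224, 224, 3
I_nhwc = torch.rand(n, h, w, c)                    # layout used in the pseudocode
I_nchw = I_nhwc.permute(0, 3, 1, 2).contiguous()   # layout expected by PyTorch CNNs

max_len = 32                                       # l, the token sequence length
T = torch.randint(0, 1000, (n, max_len))           # dummy token ids, shape [n, l]

print(I_nchw.shape, T.shape)
```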

Feature Extraction:

  • I_f = image_encoder(I): Extracts feature representations (I_f) from the image encoder. The shape of I_f is [n, d_i], where d_i is the dimensionality of the image features.
  • T_f = text_encoder(T): Extracts feature representations (T_f) from the text encoder. The shape of T_f is [n, d_t], where d_t is the dimensionality of the text features.

I_f = models.resnet34(pretrained=True)      # for encoding images
T_f = AutoModel.from_pretrained("distilbert-base-multilingual-cased") # for encoding captions

Learned Projections:

  • W_i[d_i, d_e]: Represents the learned projection matrix for mapping image features (I_f) to an embedding space (I_e). The shape of W_i is [d_i, d_e], where d_e is the desired dimensionality of the joint embedding space.
  • W_t[d_t, d_e]: Represents the learned projection matrix for mapping text features (T_f) to the same embedding space (T_e). The shape of W_t is [d_t, d_e].

The projection operation can be coded using a neural network with two linear layers, whose weights are the learned projection matrix. In most cases, the projection weights are the only weights with active gradients that can be trained on new datasets. Additionally, the projection layer plays a crucial role in aligning the dimensions of image and text embeddings, ensuring that they have the same size.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Projection(nn.Module):
    def __init__(self, d_in: int, d_out: int, p: float = 0.5) -> None:
        super().__init__()
        self.linear1 = nn.Linear(d_in, d_out, bias=False)
        self.linear2 = nn.Linear(d_out, d_out, bias=False)
        self.layer_norm = nn.LayerNorm(d_out)
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Linear projection plus a GELU-activated residual branch
        embed1 = self.linear1(x)
        embed2 = self.drop(self.linear2(F.gelu(embed1)))
        embeds = self.layer_norm(embed1 + embed2)
        return embeds
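
To see the dimension alignment in action, the sketch below pushes dummy features through two projection heads. It repeats the Projection module so that it runs on its own, and it assumes d_i = 512 (the width of ResNet-34's final feature layer), d_t = 768, and d_e = 512, matching the configuration used later in this post. The input features are random.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projection(nn.Module):
    # Same module as above, repeated so this snippet is self-contained
    def __init__(self, d_in: int, d_out: int, p: float = 0.5) -> None:
        super().__init__()
        self.linear1 = nn.Linear(d_in, d_out, bias=False)
        self.linear2 = nn.Linear(d_out, d_out, bias=False)
        self.layer_norm = nn.LayerNorm(d_out)
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        embed1 = self.linear1(x)
        embed2 = self.drop(self.linear2(F.gelu(embed1)))
        return self.layer_norm(embed1 + embed2)

torch.manual_seed(0)
image_proj = Projection(512, 512).eval()   # d_i -> d_e
text_proj = Projection(768, 512).eval()    # d_t -> d_e

image_features = torch.randn(4, 512)       # dummy I_f
text_features = torch.randn(4, 768)        # dummy T_f

image_embeds = image_proj(image_features)
text_embeds = text_proj(text_features)
print(image_embeds.shape, text_embeds.shape)
```

Both outputs have the same width, which is what makes the later matrix product between image and text embeddings possible.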

Embedding and Normalization:

  • I_e = l2_normalize(np.dot(I_f, W_i), axis=1): Embeds and normalizes image features in the joint embedding space (I_e).
  • T_e = l2_normalize(np.dot(T_f, W_t), axis=1): Embeds and normalizes text features in the joint embedding space (T_e).

The code below illustrates the sequential processing of image and text data. First, the data is processed by the base encoder, followed by the projection layer. Finally, normalized embeddings are generated for both modalities and returned.

from torchvision import models
from transformers import AutoModel

class VisionEncoder(nn.Module):
    def __init__(self, d_out: int) -> None:
        super().__init__()
        base = models.resnet34(pretrained=True)
        d_in = base.fc.in_features
        base.fc = nn.Identity()  # drop the classification head
        self.base = base
        self.projection = Projection(d_in, d_out)
        # Freeze the backbone; only the projection is trained
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, x):
        projected_vec = self.projection(self.base(x))
        projection_len = torch.norm(projected_vec, dim=-1, keepdim=True)
        return projected_vec / projection_len  # L2-normalized embedding

class TextEncoder(nn.Module):
    def __init__(self, d_out: int) -> None:
        super().__init__()
        self.base = AutoModel.from_pretrained(Config.text_model)
        self.projection = Projection(Config.transformer_embed_dim, d_out)
        # Freeze the backbone; only the projection is trained
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, x):
        out = self.base(x)[0]
        out = out[:, 0, :]  # get CLS token output
        projected_vec = self.projection(out)
        projection_len = torch.norm(projected_vec, dim=-1, keepdim=True)
        return projected_vec / projection_len  # L2-normalized embedding

vision_encoder = VisionEncoder(Config.embed_dim)
I_e = vision_encoder(images)
caption_encoder = TextEncoder(Config.embed_dim)
T_e = caption_encoder(text["input_ids"])
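
The manual division by the norm in both encoders is an L2 normalization. As a small check on random data (not on real encoder outputs), the sketch below compares it with PyTorch's built-in F.normalize and confirms that, for unit-length vectors, a plain dot product equals the cosine similarity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
projected_vec = torch.randn(4, 512)  # stand-in for projection outputs

# Normalization as written in the encoders
manual = projected_vec / torch.norm(projected_vec, dim=-1, keepdim=True)
# Built-in equivalent
builtin = F.normalize(projected_vec, p=2, dim=-1)

# Dot product of normalized vectors vs. cosine similarity of the raw vectors
dot = manual[0] @ manual[1]
cos = F.cosine_similarity(projected_vec[0], projected_vec[1], dim=0)
print(torch.allclose(manual, builtin, atol=1e-6), torch.allclose(dot, cos, atol=1e-6))
```

This is why the next step can compute cosine similarities for the whole batch with a single matrix multiplication.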

Cosine Similarities:

  • logits = np.dot(I_e, T_e.T) * np.exp(t): Computes pairwise cosine similarities between image and text embeddings, scaled by a learned temperature parameter t.

In this example, we use similarity and logits interchangeably, in the same way as the original paper. We will not include the temperature parameter t in this blog post.

logits = I_e @ T_e.T
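
This post leaves the temperature out, but for readers who want to restore it, here is one possible sketch. The module name ScaledSimilarity is hypothetical. The initial value of 0.07 and the cap of 100 on the scale follow the description in the CLIP paper; everything else, including the random embeddings, is illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledSimilarity(nn.Module):
    """Cosine-similarity logits with a learnable log-temperature t (hypothetical module)."""

    def __init__(self, init_temperature: float = 0.07, max_scale: float = 100.0) -> None:
        super().__init__()
        # Store t in log space so that exp(t) is always positive
        self.t = nn.Parameter(torch.tensor(math.log(1.0 / init_temperature)))
        self.max_scale = max_scale

    def forward(self, I_e: torch.Tensor, T_e: torch.Tensor) -> torch.Tensor:
        scale = self.t.exp().clamp(max=self.max_scale)
        return scale * (I_e @ T_e.T)

torch.manual_seed(0)
I_e = F.normalize(torch.randn(4, 512), dim=-1)  # stand-in image embeddings
T_e = F.normalize(torch.randn(4, 512), dim=-1)  # stand-in text embeddings

similarity = ScaledSimilarity()
logits = similarity(I_e, T_e)
print(logits.shape)
```

Without the scale, cosine similarities stay within [-1, 1], which limits how peaked the softmax over a batch can become.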

Symmetric Loss Function:

CLIP uses contrastive loss (first introduced in Representation Learning with Contrastive Predictive Coding) to bring related images and texts closer together while pushing unrelated ones apart.

  • labels = np.arange(n): Generates labels representing the indices of the batch.
  • loss_i = cross_entropy_loss(logits, labels, axis=0): Computes the cross-entropy loss along the image axis.
  • loss_t = cross_entropy_loss(logits, labels, axis=1): Computes the cross-entropy loss along the text axis.
  • loss = (loss_i + loss_t)/2: Computes the symmetric average of the image and text losses.

def CLIP_loss(logits: torch.Tensor) -> torch.Tensor:
    n = logits.shape[1]  # number of samples
    # Matching pairs sit on the diagonal, so the target for row i is i
    labels = torch.arange(n, device=logits.device)
    # Calculate cross entropy losses along axis 0 and 1
    loss_i = F.cross_entropy(logits.transpose(0, 1), labels, reduction="mean")
    loss_t = F.cross_entropy(logits, labels, reduction="mean")
    # Calculate the final loss
    loss = (loss_i + loss_t) / 2

    return loss
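
Two extreme inputs make the behavior of this loss easier to read. The sketch below re-declares the function (as clip_loss, so the snippet runs alone) and evaluates it on a similarity matrix with no signal and on one where every matching pair dominates. Both matrices are synthetic.

```python
import math
import torch
import torch.nn.functional as F

def clip_loss(logits: torch.Tensor) -> torch.Tensor:
    # Same computation as CLIP_loss above
    n = logits.shape[1]
    labels = torch.arange(n, device=logits.device)
    loss_i = F.cross_entropy(logits.transpose(0, 1), labels)
    loss_t = F.cross_entropy(logits, labels)
    return (loss_i + loss_t) / 2

n = 128
uninformative = torch.zeros(n, n)   # every pair looks equally similar
aligned = 50.0 * torch.eye(n)       # matching pairs dominate by a wide margin

loss_uninformative = clip_loss(uninformative).item()
loss_aligned = clip_loss(aligned).item()
print(loss_uninformative, math.log(n), loss_aligned)
```

With no signal the loss equals ln(n), and it approaches zero as the diagonal entries pull away from the rest.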

Final Custom CLIP Model

Combining all the different pieces together, the final custom CLIP model looks like the following:

class CustomModel(nn.Module):
    def __init__(self, lr: float = 1e-3) -> None:
        super().__init__()
        self.vision_encoder = VisionEncoder(Config.embed_dim)
        self.caption_encoder = TextEncoder(Config.embed_dim)
        # Tokenizer and metrics are helper utilities that are not shown in this post
        self.tokenizer = Tokenizer(AutoTokenizer.from_pretrained(Config.text_model))
        self.lr = lr
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def forward(self, images, text):
        text = self.tokenizer(text).to(self.device)

        image_embed = self.vision_encoder(images)
        caption_embed = self.caption_encoder(text["input_ids"])
        similarity = caption_embed @ image_embed.T

        loss = CLIP_loss(similarity)
        img_acc, cap_acc = metrics(similarity)
        return loss, img_acc, cap_acc
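
The metrics helper is not shown in this post. The version below is my own guess at a reasonable implementation, not the repository's code: it assumes that similarity has captions on the rows and images on the columns (as in caption_embed @ image_embed.T) and reports top-1 retrieval accuracy in each direction.

```python
import torch

def metrics(similarity: torch.Tensor):
    """Top-1 retrieval accuracy in both directions (hypothetical implementation)."""
    n = similarity.shape[0]
    targets = torch.arange(n, device=similarity.device)
    # For each caption (row), is the best-scoring image the paired one?
    img_acc = (similarity.argmax(dim=1) == targets).float().mean()
    # For each image (column), is the best-scoring caption the paired one?
    cap_acc = (similarity.argmax(dim=0) == targets).float().mean()
    return img_acc, cap_acc

# Caption 1 prefers image 0 (wrong); both images prefer their own caption
similarity = torch.tensor([[0.9, 0.1],
                           [0.8, 0.2]])
img_acc, cap_acc = metrics(similarity)
print(img_acc.item(), cap_acc.item())
```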


This example demonstrates the process of creating image caption datasets and training a custom CLIP model. The aim is to train a vision encoder and a text encoder jointly to project the representation of images and their captions into the same embedding space, such that the caption embeddings are located near the embeddings of the images they describe. The code for this project is in my GitHub repository.

Dataset and Dataloader

Our custom CLIP model will be trained using the flickr30k dataset. This dataset comprises more than 31,000 images, each with a minimum of 5 independent human-generated captions. We will use two captions for each image in this example to have a total of 62,000 image and text pairs for training. Although the dataset is traditionally employed for image captioning tasks, we adapt the image-caption pairs to train our dual encoder model specifically for image search purposes. The GitHub repository also includes the code to train the model on the MS-COCO dataset with 164,000 image and text pairs.

import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from torchvision import transforms
from PIL import Image

# Define a custom dataset class for Flickr30k
class Flickr30kDataset(torch.utils.data.Dataset):
    def __init__(self):
        self.dataset = load_dataset("nlphuji/flickr30k", cache_dir="./huggingface_data")
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        self.cap_per_image = 2

    def __len__(self):
        return self.dataset.num_rows["test"] * self.cap_per_image

    def __getitem__(self, idx):
        # Each image appears cap_per_image times, once per caption
        original_idx = idx // self.cap_per_image
        image = self.dataset["test"][original_idx]["image"].convert("RGB")
        image = self.transform(image)

        # labels
        caption = self.dataset["test"][original_idx]["caption"][idx % self.cap_per_image]

        return {"image": image, "caption": caption}

# Create an instance of the custom dataset
flickr30k_custom_dataset = Flickr30kDataset()
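
The index arithmetic in __getitem__ is what turns one image with several captions into several (image, caption) samples. A tiny standalone sketch of the mapping, using a hypothetical helper named locate:

```python
cap_per_image = 2

def locate(idx: int, cap_per_image: int = cap_per_image):
    """Map a flat sample index to (image index, caption index) (hypothetical helper)."""
    return idx // cap_per_image, idx % cap_per_image

pairs = [locate(i) for i in range(6)]
print(pairs)  # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

Consecutive indices therefore share an image and differ only in the caption, which is one reason shuffling in the DataLoader matters.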

Key model constants include embed_dim for learned representations, transformer_embed_dim for transformer layer features, and max_len for text input length. The chosen text_model is “distilbert-base-multilingual-cased.” Training spans 3 epochs with a batch_size of 128. These constants feed into the model building and training.

from dataclasses import dataclass

@dataclass
class Config:
    """
    Configuration class for the CLIP training script.
    """

    embed_dim: int = 512  # Embedding dimension
    transformer_embed_dim: int = 768  # Transformer embedding dimension
    max_len: int = 32  # Maximum text length
    text_model: str = "distilbert-base-multilingual-cased"  # Text model name
    epochs: int = 3  # Number of training epochs
    batch_size: int = 128  # Batch size

The DataLoader is set up for efficient iteration during training, providing organized access to image-caption pairs.

# Create the DataLoader
clip_dataloader = DataLoader(flickr30k_custom_dataset, batch_size=Config.batch_size, shuffle=True, num_workers=4)

Here is an example of an image caption pair in one of the batches in the dataset.

import numpy as np
import matplotlib.pyplot as plt
# Create an iterator from the dataloader
data_iter = iter(clip_dataloader)

# Get one batch
batch = next(data_iter)

image = batch["image"][0]  # get one image from the batch
caption = batch["caption"][0]  # get one text from the batch

# Convert the image tensor to a NumPy array and permute dimensions
# from [c, h, w] to [h, w, c], as matplotlib expects
image_np = np.transpose(image.numpy(), (1, 2, 0))

# Display the image and caption
plt.imshow(image_np)
plt.title(f"Caption: {caption}")
plt.show()

Here, we instantiate our CustomModel and send it to the device (CPU or GPU). Additionally, we specify the parameters to be optimized throughout the training process. Given that we have frozen the base layer for both text and image encoders, only the parameters associated with the projection layer will be trained on the new dataset.

# Create an instance of your model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CustomModel().to(device)

# Define optimizer
optimizer = torch.optim.Adam([
    {'params': model.vision_encoder.parameters()},
    {'params': model.caption_encoder.parameters()}
], lr=model.lr)
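
Frozen parameters that reach the optimizer are simply skipped because they never receive gradients. An alternative is to hand the optimizer only the trainable ones, which also makes it easy to count them. The sketch below uses a toy stand-in (TinyEncoder, a hypothetical class with a frozen base and a trainable projection) rather than the real encoders, so it runs without downloading any weights.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for an encoder: frozen base, trainable projection (hypothetical)."""

    def __init__(self) -> None:
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.projection = nn.Linear(8, 4)
        for p in self.base.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projection(self.base(x))

torch.manual_seed(0)
model = TinyEncoder()

trainable = [p for p in model.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# One optimization step: the frozen base must stay untouched
base_before = model.base.weight.clone()
proj_before = model.projection.weight.clone()
loss = model(torch.randn(2, 8)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(n_trainable, n_frozen)
```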

Model training

The training was carried out on a Tesla T4 (g4dn-xlarge) GPU machine for 3 training epochs. The Jupyter Notebook is available in the project's GitHub repository and contains the code for the training loop.

# Assumed values; the notebook defines these before the loop
start_epoch, num_epochs = 0, Config.epochs

batch_zero = True
for epoch in range(start_epoch, num_epochs):
    for batch in clip_dataloader:
        image = batch["image"].to(device)
        text = batch["caption"]
        # images, text = batch
        # Calls CustomModel.forward as defined above
        loss, img_acc, cap_acc = model(image, text)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_zero:
            print(f"Epoch [{0}/{num_epochs}], Batch Loss: {loss.item()}")
            batch_zero = False

    # Print training statistics
    print(f"Epoch [{epoch+1}/{num_epochs}], Batch Loss: {loss.item()}")

print("Training complete.")

The following are the results of training loops for each epoch using the flickr30k dataset. For more details, please refer to this notebook.

Epoch [0/3], Batch Loss: 4.854558944702148
Epoch [1/3], Batch Loss: 3.187166690826416
Epoch [2/3], Batch Loss: 3.0981950759887695
Epoch [3/3], Batch Loss: 3.164858818054199
Training complete.
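
A quick way to read the first number in this log: when a model cannot yet tell matching pairs from non-matching ones, all logits are roughly equal and the symmetric cross-entropy is approximately ln(batch size). The short calculation below, assuming the batch size of 128 used in this post, gives that reference value.

```python
import math

batch_size = 128
chance_loss = math.log(batch_size)  # loss when every pairing looks equally likely
print(f"{chance_loss:.4f}")  # 4.8520
```

The loss of about 4.85 reported before any training sits right at this level, and the later values show how far the projection layers have moved away from chance.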

Here are the results from the training loops for each epoch using the COCO2017 dataset. The model exhibits faster convergence on the COCO dataset, attributed to the availability of over 160,000 image-text pairs, in contrast to the 62,000 image-text pairs in the flickr30k dataset. For more details, please refer to this notebook.

Epoch [0/3], Batch Loss: 4.852224349975586
Epoch [1/3], Batch Loss: 2.7819151878356934
Epoch [2/3], Batch Loss: 2.727229118347168
Epoch [3/3], Batch Loss: 2.717097759246826
Training complete.


In conclusion, this blog post has explored the CLIP model, uncovering its potential for wide-ranging applications. As we understand the applications of CLIP, it becomes evident that its impact spans far beyond initial expectations, paving the way for innovative solutions across diverse fields. CLIP was the first successful model that bridged the gap between different modalities and opened avenues for cross-disciplinary innovations.
