
Add support for computing CLIP image and text embeddings separately (Closes #148) #227

Merged
xenova merged 17 commits into main on Aug 1, 2023

Conversation

xenova (Collaborator) commented Jul 29, 2023

This PR adds support for computing CLIP text and vision embeddings separately. It uses a custom ONNX config (based on this) and requires models to be exported with the --split_modalities flag set in the conversion script. For example:

python -m scripts.convert --quantize --model_id openai/clip-vit-base-patch16 --split_modalities

Usage:

Example: Compute text embeddings with CLIPTextModelWithProjection.

import { AutoTokenizer, CLIPTextModelWithProjection } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
const text_model = await CLIPTextModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');

// Run tokenization
let texts = ['a photo of a car', 'a photo of a football match'];
let text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute embeddings
const { text_embeds } = await text_model(text_inputs);
// Tensor {
//   dims: [ 2, 512 ],
//   type: 'float32',
//   data: Float32Array(1024) [ ... ],
//   size: 1024
// }

Example: Compute vision embeddings with CLIPVisionModelWithProjection.

import { AutoProcessor, CLIPVisionModelWithProjection, RawImage } from '@xenova/transformers';

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');

// Read image and run processor
let image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
let image_inputs = await processor(image);

// Compute embeddings
const { image_embeds } = await vision_model(image_inputs);
// Tensor {
//   dims: [ 1, 512 ],
//   type: 'float32',
//   data: Float32Array(512) [ ... ],
//   size: 512
// }

xenova (Collaborator, Author) commented Jul 29, 2023

cc @josephrocca

xenova linked an issue Jul 29, 2023 that may be closed by this pull request
HuggingFaceDocBuilderDev commented Jul 29, 2023

The documentation is not available anymore as the PR was closed or merged.

josephrocca (Contributor) commented:

🥹 what did we (the web/js community) do to deserve you xenova

🧎

xenova (Collaborator, Author) commented Jul 29, 2023

I will merge into main soon 🚀 ... I'm just testing everything by creating a simple Next.js semantic image search application. Results seem pretty good so far:

[screenshot: ranked results from the semantic image search demo]
As you can see, it correctly ranks the first image, a woman exploring a forest, much higher (relatively) than the others. (Here, similarity is the cosine similarity between the query text embedding and each image embedding.)
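
For reference, here is a minimal sketch of that similarity computation, reusing the text_embeds and image_embeds tensors from the snippets above. The cosineSimilarity helper is illustrative, not code taken from the demo app:

// Cosine similarity between two equal-length vectors (plain arrays or Float32Arrays).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; ++i) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// text_embeds has dims [2, 512] (row-major), so the first 512 values are the
// embedding for 'a photo of a car'; image_embeds.data is a single 512-d vector.
const [, dim] = text_embeds.dims;
const query = text_embeds.data.slice(0, dim);
const similarity = cosineSimilarity(query, image_embeds.data);
console.log(similarity);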

josephrocca (Contributor) commented Jul 30, 2023

The little web app examples you've been putting together are great! Being able to tweet "this is running completely in the browser!" with a little video showing some magic AI thing (that most devs might assume you'd need a big GPU server for) is, I think, opening people's eyes to the possibilities here. And your design skills are 🔥 (looking sadly at my clip-image-sorter web design lol)

(Aside: I've mentioned this before, but there's another subset of users [which includes myself], who really just want little code snippets for various tasks - i.e. where a full-blown application isn't really that useful, because I end up having to dig through code to "extract" out the simple few-lines-of-code that I wanted. If these examples could be linked in the "Supported Tasks" table, that would be perfect I think. I know you're getting around to this eventually - but I know it's sometimes useful to hear pain-points from users a few times so you know that users aren't just requesting some niche "nice-to-have" thing that they happened to ponder for a few moments.)

xenova merged commit 2fde656 into main on Aug 1, 2023
brettshepherd commented:

> I will merge into main soon 🚀 ... I'm just testing everything by creating a simple Next.js semantic image search application. Results seem pretty good so far:
>
> [screenshot] As you can see, it correctly ranks the first image of a woman exploring a forest much higher (relatively) than the others. (similarity is the cosine similarity between query text embedding and image embeds)

This may not be the best place to ask, but how did you generate the ai_description for those images?

Successfully merging this pull request may close these issues.

[Feature request] Separate text & image embeddings