
Add support for computing CLIP image and text embeddings separately (Closes #148) #227

Merged
xenova merged 17 commits into main on Aug 1, 2023

Conversation

xenova (Collaborator) commented Jul 29, 2023

This PR adds support for computing CLIP text and vision embeddings separately. It uses a custom ONNX config (based on this) and requires models to be exported with the --split_modalities flag set in the conversion script. For example:

python -m scripts.convert --quantize --model_id openai/clip-vit-base-patch16 --split_modalities

Usage:

Example: Compute text embeddings with CLIPTextModelWithProjection.

import { AutoTokenizer, CLIPTextModelWithProjection } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
const text_model = await CLIPTextModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');

// Run tokenization
let texts = ['a photo of a car', 'a photo of a football match'];
let text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute embeddings
const { text_embeds } = await text_model(text_inputs);
// Tensor {
//   dims: [ 2, 512 ],
//   type: 'float32',
//   data: Float32Array(1024) [ ... ],
//   size: 1024
// }

Example: Compute vision embeddings with CLIPVisionModelWithProjection.

import { AutoProcessor, CLIPVisionModelWithProjection, RawImage } from '@xenova/transformers';

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');

// Read image and run processor
let image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
let image_inputs = await processor(image);

// Compute embeddings
const { image_embeds } = await vision_model(image_inputs);
// Tensor {
//   dims: [ 1, 512 ],
//   type: 'float32',
//   data: Float32Array(512) [ ... ],
//   size: 512
// }

xenova (Collaborator, Author) commented Jul 29, 2023

cc @josephrocca

xenova linked an issue Jul 29, 2023 that may be closed by this pull request
HuggingFaceDocBuilderDev commented Jul 29, 2023

The documentation is not available anymore as the PR was closed or merged.

josephrocca (Contributor) commented:

🥹 what did we (the web/js community) do to deserve you xenova

🧎

xenova (Collaborator, Author) commented Jul 29, 2023

I will merge into main soon 🚀 ... I'm just testing everything by creating a simple Next.js semantic image search application. Results seem pretty good so far:

[screenshot: ranked results from the semantic image search demo]
As you can see, it correctly ranks the first image, a woman exploring a forest, much higher (relatively) than the others. (Here, similarity is the cosine similarity between the query text embedding and each image embedding.)
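
For reference, here is a minimal sketch of that similarity computation, reusing the text_embeds and image_embeds tensors from the snippets above. The cosineSimilarity helper is illustrative, not code taken from the demo app:

// Cosine similarity between two equal-length vectors (plain arrays or Float32Arrays).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; ++i) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// text_embeds has dims [2, 512] (row-major), so the first 512 values are the
// embedding for 'a photo of a car'; image_embeds.data is a single 512-d vector.
const [, dim] = text_embeds.dims;
const query = text_embeds.data.slice(0, dim);
const similarity = cosineSimilarity(query, image_embeds.data);
console.log(similarity);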

josephrocca (Contributor) commented Jul 30, 2023

The little web app examples you've been putting together are great! Being able to tweet "this is running completely in the browser!" with a little video showing some magic AI thing (that most devs might assume you'd need a big GPU server for) is, I think, opening people's eyes to the possibilities here. And your design skills are 🔥 (looking sadly at my clip-image-sorter web design lol)

(Aside: I've mentioned this before, but there's another subset of users [which includes myself], who really just want little code snippets for various tasks - i.e. where a full-blown application isn't really that useful, because I end up having to dig through code to "extract" out the simple few-lines-of-code that I wanted. If these examples could be linked in the "Supported Tasks" table, that would be perfect I think. I know you're getting around to this eventually - but I know it's sometimes useful to hear pain-points from users a few times so you know that users aren't just requesting some niche "nice-to-have" thing that they happened to ponder for a few moments.)

xenova merged commit 2fde656 into main on Aug 1, 2023
brettshepherd commented:

> I will merge into main soon 🚀 ... I'm just testing everything by creating a simple Next.js semantic image search application. Results seem pretty good so far:
>
> [screenshot] As you can see, it correctly ranks the first image of a woman exploring a forest much higher (relatively) than the others. (similarity is the cosine similarity between query text embedding and image embeds)

This may not be the best place to ask, but how did you generate the ai_description for those images?

Successfully merging this pull request may close these issues.

[Feature request] Separate text & image embeddings