
[Question] Using CLIP for simple image-text similarity #136

Closed
@josephrocca

Description

I'm trying to get a simple image-text similarity thing working with CLIP, and I'm not sure how to do it, or whether it's currently supported with Transformers.js outside of the zero-shot image classification pipeline.

Is there a code example somewhere to get me started? Here's what I have so far:

import { AutoModel, AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.1.1';
let tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
let model = await AutoModel.from_pretrained('Xenova/clip-vit-base-patch16');
let inputIds = await tokenizer(["cat", "astronaut"]);
let image = await fetch("https://i.imgur.com/fYhUGoY.jpg").then(r => r.blob());
// how to process the image, and how to pass the image and inputIds to `model`?

Here's what I see if I inspect the model function in DevTools:

[screenshot: the model function inspected in DevTools]

I also tried this:

import { AutoModel, AutoTokenizer, AutoProcessor } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.1.1';
let model = await AutoModel.from_pretrained('Xenova/clip-vit-base-patch16');
let processor = await AutoProcessor.from_pretrained("Xenova/clip-vit-base-patch16");
let inputs = await processor({text:["a photo of a cat", "a photo of an astronaut"], images:["https://i.imgur.com/fYhUGoY.jpg"]});
let outputs = await model(inputs);

But it seems the processor expects an array of decoded image objects rather than URL strings, or something along those lines? The above code throws an error saying that an .rgb() method should exist on the input.
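
In case it helps, here's my best guess at what the intended flow might be, pieced together from how the zero-shot-image-classification pipeline appears to work. I'm not sure whether RawImage is exported in 2.1.1, and the padding/truncation tokenizer options and the output names (logits_per_image etc.) are assumptions on my part, so please treat this as a sketch rather than working code:

import { AutoModel, AutoTokenizer, AutoProcessor, RawImage } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.1.1';

// Load the tokenizer, the image processor, and the combined CLIP model
let tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
let processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
let model = await AutoModel.from_pretrained('Xenova/clip-vit-base-patch16');

// Tokenize the candidate texts (the padding/truncation options are a guess)
let textInputs = await tokenizer(
  ['a photo of a cat', 'a photo of an astronaut'],
  { padding: true, truncation: true },
);

// The processor seems to want decoded image objects rather than URL strings
// (hence the .rgb() error above) — RawImage.read() should fetch and decode
let image = await RawImage.read('https://i.imgur.com/fYhUGoY.jpg');
let imageInputs = await processor(image);

// Pass the text and pixel inputs to the model together
let output = await model({ ...textInputs, ...imageInputs });

// If the ONNX export mirrors the Python CLIPModel, `output` should contain
// logits_per_image / logits_per_text (and maybe text_embeds / image_embeds)
// that can be softmax-ed into similarity scores
console.log(output);

If this is roughly the intended approach, a confirmation (or a pointer to the right exports for 2.1.1) would be much appreciated.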
