I'm trying to get a simple image-text similarity example working with CLIP, and I'm not sure how to do it, or whether it's currently supported in Transformers.js outside of the zero-shot image classification pipeline.
Is there a code example somewhere to get me started? Here's what I have so far:
```js
import { AutoModel, AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.1.1';

let tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
let model = await AutoModel.from_pretrained('Xenova/clip-vit-base-patch16');

let inputIds = await tokenizer(["cat", "astronaut"]);
let image = await fetch("https://i.imgur.com/fYhUGoY.jpg").then(r => r.blob());

// how to process the image, and how to pass the image and inputIds to `model`?
```
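Based on the Python API, my guess at the intended pattern is the sketch below: read the image with `RawImage`, let an `AutoProcessor` turn it into `pixel_values`, tokenize the text separately, and spread both sets of inputs into a single model call. I haven't confirmed any of this against 2.1.1 — in particular, I'm assuming `CLIPModel` and `RawImage` are exported at the package root and that the tokenizer accepts `{ padding: true, truncation: true }`:

```js
import { AutoTokenizer, AutoProcessor, CLIPModel, RawImage } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.1.1';

// Assumption: CLIPModel and RawImage are exported at the package root in this version.
let tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
let processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
let model = await CLIPModel.from_pretrained('Xenova/clip-vit-base-patch16');

// Text side: tokenize both prompts together (padding so they batch).
let text_inputs = await tokenizer(['a photo of a cat', 'a photo of an astronaut'], { padding: true, truncation: true });

// Image side: read the image into a RawImage and let the processor compute pixel_values.
let image = await RawImage.read('https://i.imgur.com/fYhUGoY.jpg');
let image_inputs = await processor(image);

// Run CLIP once with both the text tensors and the pixel values.
let output = await model({ ...text_inputs, ...image_inputs });
```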
I also tried this:
```js
import { AutoModel, AutoTokenizer, AutoProcessor } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.1.1';

let model = await AutoModel.from_pretrained('Xenova/clip-vit-base-patch16');
let processor = await AutoProcessor.from_pretrained("Xenova/clip-vit-base-patch16");

let inputs = await processor({ text: ["a photo of a cat", "a photo of an astronaut"], images: ["https://i.imgur.com/fYhUGoY.jpg"] });
let outputs = await model(inputs);
```
But it seems that `processor` expects an array of images, or something? The above code throws an error saying that an `.rgb()` method should exist on the input.
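My guess is that, unlike the Python `CLIPProcessor`, the JS processor only handles the image side (the `.rgb()` method it's looking for seems to be something a `RawImage` would have), and the text is meant to go through the tokenizer separately, with both sets of inputs spread into the model call as in the sketch above. If that's right, here's how I'd expect to read off the similarity scores — again unverified, with the output field names assumed to mirror the Python `CLIPOutput`:

```js
// Continuing from the sketch above (hypothetical field names mirroring the Python CLIPOutput):
// logits_per_image should have shape [num_images, num_texts].
let logits = Array.from(output.logits_per_image.data);

// Plain-JS softmax over the prompts to turn the logits into probabilities.
let max = Math.max(...logits);
let exps = logits.map(x => Math.exp(x - max));
let sum = exps.reduce((a, b) => a + b, 0);
let probs = exps.map(x => x / sum);

console.log(probs); // how strongly the image matches "a photo of a cat" vs "a photo of an astronaut"
```

Alternatively, if the output also exposes `text_embeds` and `image_embeds`, I assume cosine similarity between those would work for plain embedding-based retrieval. Is either of these the intended usage, or is there a supported way to do this outside the zero-shot pipeline?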