In an attempt to create a provider that leverages local models (i.e., models that run in the browser), I've gone in a few circles. Documenting here for my future self.
Transformers.js
Ideally, Hugging Face's Transformers.js library would meet my needs, but it falls short in the one area where I need it to work.
My assumption is that I need to leverage the image-text-to-text pipeline. The JS implementation of the library does not support image-text-to-text (there is a PR to add it), and there is no AutoModel* for it either.
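For reference, roughly what I was hoping to write. This is purely hypothetical: the image-text-to-text task does not exist in Transformers.js today, and both the call signature and the model id below are placeholders modeled on the Python pipeline.

```ts
import { pipeline } from "@huggingface/transformers";

// HYPOTHETICAL: the "image-text-to-text" task is not implemented in
// Transformers.js yet (see the PR above); the task name, model id, and
// call signature are assumptions based on the Python library.
const vlm = await pipeline("image-text-to-text", "onnx-community/<some-vlm>");

const out = await vlm("https://example.com/photo.jpg", "Describe this image.");
console.log(out);
```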
web-llm
The web-llm project looks very promising, but it seems to have quite a few bugs.
For vision, I tried using the recommended Phi 3.5 model, but encountered this error.
It seems that Gemma 3 is not supported yet, as I get the same error as in this issue.
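For context, a minimal sketch of what I attempted, assuming web-llm's OpenAI-compatible chat API; the model id is taken from the prebuilt model list and may have changed since.

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Model id is an assumption based on web-llm's prebuilt model list.
const engine = await CreateMLCEngine("Phi-3.5-vision-instruct-q4f16_1-MLC");

// web-llm exposes an OpenAI-compatible chat API, so vision prompts use
// the familiar content-array format with an image_url part.
const reply = await engine.chat.completions.create({
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe this image." },
        { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
      ],
    },
  ],
});
console.log(reply.choices[0].message.content);
```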
MediaPipe
MediaPipe has been the most successful endeavor so far.
They support multimodal prompting out of the box.
The biggest hangup is how the model is hosted.
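A minimal text-only setup sketch, assuming the `@mediapipe/tasks-genai` package and a self-hosted `.task` model file (the model path is a placeholder). The `modelAssetPath` option is exactly where the hosting problem comes in.

```ts
import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

// Load the WASM runtime, then the model. The modelAssetPath below is a
// placeholder; this is the part that requires hosting the gated file.
const genai = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
);
const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: "/models/gemma.task" }, // placeholder path
  maxTokens: 1024,
});

console.log(await llm.generateResponse("Hello!"));
```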
Self serving the model file
They recommend downloading the model file and serving it yourself. The model is gated, though, and you have to be logged in and granted access to download it.
That is just not reasonable for a plugin. There's no way I'm publishing a 4GB model file to npm.
User download
The provider could offer a way for users to download the file themselves.
Hugging Face's OAuth login requires a clientId when not signing in via an HF Space, so I can't just ask users to sign in and then use the model.
Another option is to ask users to supply their own HF token, but that requires that they have a token and have been granted access to the model; a rough sketch of that flow is below.
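A minimal sketch of the token-based download, assuming a user-supplied HF token and a placeholder model URL. Gated Hugging Face files can be fetched with a bearer token.

```ts
// Placeholder URL for the gated model file; users would still need access.
const MODEL_URL =
  "https://huggingface.co/<org>/<model>/resolve/main/<file>.task";

// Fetch the gated file with the user's Hugging Face token.
async function downloadModel(hfToken: string): Promise<Blob> {
  const res = await fetch(MODEL_URL, {
    headers: { Authorization: `Bearer ${hfToken}` },
  });
  if (!res.ok) throw new Error(`Model download failed: ${res.status}`);
  return res.blob();
}
```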
Ideally, web-llm gets fixed; that would be the best option.