Conversation

nArn0 commented Jan 16, 2026

This PR implements JoyCaption as a specific llava model. All information and the quantized JoyCaption model are available here: https://huggingface.co/n-Arno/joycaption-mlx-mxfp4

I know this PR may not fit perfectly in mlx-vlm, but just in case, I'd rather propose it even if it ends up being rejected.

Thanks for your work on mlx-vlm!

Blaizzy (Owner) commented Jan 17, 2026

Hey @nArn0
This is actually a perfect fit, thanks for the contribution!

A couple of nits:

  1. If the model uses the llava arch, then in this PR you can just add a model remapping key, mapping llava_joycaption to the existing llava (see the sketch after this list): https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/utils.py#L26-L33
  2. The example (examples/test_joycaption.py) could be a notebook with more details; you can check the existing notebooks for reference.
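
For 1., the change would be roughly the following. This is a minimal sketch assuming the table at the linked lines is a plain dict named MODEL_REMAPPING keyed by the checkpoint's model_type; the names are illustrative, so adjust to whatever is actually there:

```python
# mlx_vlm/utils.py (sketch only)
MODEL_REMAPPING = {
    # ... existing entries ...
    "llava_joycaption": "llava",  # route JoyCaption checkpoints to the existing llava implementation
}

# Illustrative only: a typical way such a table is consumed when picking the model module.
def resolve_model_type(model_type: str) -> str:
    return MODEL_REMAPPING.get(model_type, model_type)
```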

nArn0 (Author) commented Jan 18, 2026

Oh! Thanks for your feedback! I'll look into it. JoyCaption looks like a classic llava, but since it uses SigLIP2, I had to take your implementation from mlx-embeddings for the vision part, which didn't fit the classic CLIP that llava uses.

Indeed, a notebook would help for the example.

The only issue I really couldn't figure out how to fix is that, if torchvision is present, torch.nn.functional.interpolate fails with a cryptic error.

Blaizzy (Owner) commented Jan 18, 2026

My pleasure!

In that case, just import all the common components from llava (inherit from them) and add the new ones. For instance, inherit from Model and override vision_tower.

Check mistral3 for reference:

https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/mistral3/mistral3.py#L6
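
Roughly like this. A sketch only, not the mistral3 code itself; the import paths and the SigLip2VisionModel name are illustrative, assuming the SigLIP2 tower from mlx-embeddings gets ported into the new module:

```python
# mlx_vlm/models/llava_joycaption/llava_joycaption.py (sketch only)
from ..llava.llava import Model as LlavaModel, ModelConfig

# Hypothetical SigLIP2 vision tower ported from mlx-embeddings into this package.
from .vision import SigLip2VisionModel


class Model(LlavaModel):
    def __init__(self, config: ModelConfig):
        super().__init__(config)
        # Swap the CLIP-style llava vision tower for the SigLIP2 one.
        self.vision_tower = SigLip2VisionModel(config.vision_config)
```

Building the base model first and then replacing the tower keeps the sketch short; in practice you would probably construct the SigLIP2 tower directly instead of allocating the CLIP one and discarding it.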
