Feat: add multimodal - poc #141
Conversation
Nice, this looks correct overall; I only have some small comments.
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Hey there, I'm interested in taking a look at this over the weekend. One concern I have is about tokenization: is it possible to consistently get the token count of an image/audio input?
Perhaps we need to use this approach for reliable input token management.
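For what it's worth, here is a rough sketch of how such a count could be obtained by tokenizing the prompt through mtmd and summing per-chunk token counts. The function names are assumptions based on my reading of upstream `tools/mtmd/mtmd.h`, not code from this PR:

```cpp
// Hypothetical helper: tokenize a text + image prompt into chunks and sum the
// per-chunk token counts. Names/signatures are assumptions from llama.cpp's
// tools/mtmd/mtmd.h; verify against the header before relying on them.
#include "mtmd.h"

static size_t count_prompt_tokens(mtmd_context * mctx,
                                  const char * prompt,           // text containing media markers
                                  const mtmd_bitmap ** bitmaps,  // decoded image/audio inputs
                                  size_t n_bitmaps) {
    mtmd_input_text text;
    text.text          = prompt;
    text.add_special   = true;  // first message in the context: add BOS
    text.parse_special = true;  // resolve markers such as <__image__>

    mtmd_input_chunks * chunks = mtmd_input_chunks_init();
    size_t n_tokens = 0;
    if (mtmd_tokenize(mctx, chunks, &text, bitmaps, n_bitmaps) == 0) {
        // every chunk (text or image) reports how many tokens it contributes
        for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
            n_tokens += mtmd_input_chunk_get_n_tokens(mtmd_input_chunks_get(chunks, i));
        }
    }
    mtmd_input_chunks_free(chunks);
    return n_tokens;
}
```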
On iPhone 13 Pro: vision-pal0.mp4
Bravo!
Awesome! Thanks for trying to integrate this.
I may make some changes later.
input_text.add_special = n_past == 0;  // Add BOS token if this is the first message
input_text.parse_special = true;       // Parse special tokens like <__image__>
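As a side note, a tiny sketch of the rule these two lines encode (my own illustration, not code from this PR; `fill_input_text` is a hypothetical helper):

```cpp
#include "llama.h"
#include "mtmd.h"

// BOS is only added when the context is empty (n_past == 0), and parse_special
// must stay on so <__image__> is recognized as a media marker instead of being
// tokenized as plain text.
static void fill_input_text(mtmd_input_text & input_text,
                            const char * prompt,
                            llama_pos n_past) {
    input_text.text          = prompt;
    input_text.add_special   = n_past == 0; // BOS only for the very first message
    input_text.parse_special = true;        // keep <__image__> as a special marker
}
```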
Hey @jhen0409 I thought this might be helpful to add, but if it feels like it's cluttering the code, feel free to remove it.
// Track the total number of tokens (both text and image)
size_t total_token_count = 0;
@jhen0409 same with this comment.
// Track the total number of tokens (both text and image)
size_t total_token_count = 0;
@ngxson I am wondering if this comment here is accurate. Thanks!
Yes, this whole comment block seems correct to me. I'm glad you figured out that you need a separate `size_t total_token_count` to keep track of the actual number of tokens; otherwise `all_tokens.size()` will give you the total number of positions rather than the total number of "context" tokens.
You can also take a look at ggml-org/llama.cpp#13576 to see how I address this issue. It should be much the same as what you are doing here.
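To make the tokens-versus-positions distinction concrete, here is a small sketch of the bookkeeping (my own illustration, assuming the chunk accessors in upstream `mtmd.h`; `account_chunks` is hypothetical):

```cpp
#include "llama.h"
#include "mtmd.h"

// A chunk's token count and its position count can differ (e.g. image chunks
// under M-RoPE), so the number of "context" tokens has to be tracked
// separately from the position counter (n_past).
static void account_chunks(const mtmd_input_chunks * chunks,
                           size_t & total_token_count, // actual tokens, text + image
                           llama_pos & n_past) {       // positions consumed in the KV cache
    for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
        const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
        total_token_count += mtmd_input_chunk_get_n_tokens(chunk); // the count all_tokens.size() does not give
        n_past            += mtmd_input_chunk_get_n_pos(chunk);    // what all_tokens.size() effectively tracks
    }
}
```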
input_text.add_special = n_past == 0;  // Add BOS token if this is the first message
input_text.parse_special = true;       // Parse special tokens like <__image__>
@ngxson I am wondering if this comment here is accurate.
This PR is a proof of concept (PoC) for adding multimodal capabilities, based on https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd.
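For readers unfamiliar with mtmd, the overall flow the PoC builds on looks roughly like this. This is a sketch from my reading of `tools/mtmd`; the helper names, signatures, prompt, and file path are assumptions, not code from this PR:

```cpp
#include "llama.h"
#include "mtmd.h"

// Sketch: load the multimodal projector next to the text model, tokenize a
// prompt containing an image marker into chunks, evaluate the chunks, then
// continue with the usual llama_decode() sampling loop.
static bool eval_multimodal_prompt(llama_model * model,
                                   llama_context * lctx,
                                   const char * mmproj_path,      // e.g. "mmproj.gguf" (assumed)
                                   const mtmd_bitmap ** bitmaps,  // decoded images
                                   size_t n_bitmaps) {
    mtmd_context_params mparams = mtmd_context_params_default();
    mtmd_context * mctx = mtmd_init_from_file(mmproj_path, model, mparams);
    if (!mctx) return false;

    mtmd_input_text text;
    text.text          = "Describe this image: <__image__>";
    text.add_special   = true;   // fresh context: add BOS
    text.parse_special = true;   // turn <__image__> into an image chunk

    mtmd_input_chunks * chunks = mtmd_input_chunks_init();
    bool ok = mtmd_tokenize(mctx, chunks, &text, bitmaps, n_bitmaps) == 0;

    llama_pos new_n_past = 0;
    if (ok) {
        // encodes image chunks and decodes text chunks in one pass
        ok = mtmd_helper_eval_chunks(mctx, lctx, chunks,
                                     /*n_past*/ 0, /*seq_id*/ 0, /*n_batch*/ 512,
                                     /*logits_last*/ true, &new_n_past) == 0;
    }

    mtmd_input_chunks_free(chunks);
    mtmd_free(mctx);
    // generation would continue from new_n_past with llama_decode()
    return ok;
}
```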