Feat: add multimodal - poc #141

Merged

jhen0409 merged 29 commits into mybigday:main from feat/add-multimodal on May 19, 2025

Conversation

a-ghorbani (Collaborator):

This PR is a POC for adding multimodal capabilities, based on https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd.
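
For context, here is a rough sketch of how the upstream mtmd API is typically driven, based on tools/mtmd/mtmd.h around the time of this PR. The helper names and signatures are approximations rather than the code in this PR, and "model" and "lctx" stand in for an already-loaded llama_model and llama_context.

// Approximate sketch of the upstream mtmd flow; not code from this PR.
mtmd_context_params mparams = mtmd_context_params_default();
mtmd_context * mctx = mtmd_init_from_file("mmproj.gguf", model, mparams);

// Text containing an image marker, plus the image itself as a bitmap.
mtmd_input_text input_text;
input_text.text          = "<__image__>\nDescribe this image.";
input_text.add_special   = true;  // first message: add BOS
input_text.parse_special = true;  // parse markers like <__image__>

mtmd_bitmap * bmp = mtmd_helper_bitmap_init_from_file("photo.jpg"); // helper; exact signature may differ
const mtmd_bitmap * bitmaps[] = { bmp };

// Split the input into chunks (text chunks and image chunks).
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
mtmd_tokenize(mctx, chunks, &input_text, bitmaps, 1);

// Evaluate all chunks: text is decoded as usual, image chunks go through the projector.
llama_pos new_n_past = 0;
mtmd_helper_eval_chunks(mctx, lctx, chunks, /*n_past*/ 0, /*seq_id*/ 0,
                        /*n_batch*/ 512, /*logits_last*/ true, &new_n_past);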

ngxson (Contributor) left a comment:

Nice, this looks correct overall; I only have some small comments.

a-ghorbani and others added 2 commits May 12, 2025 21:44 (both co-authored by Xuan-Son Nguyen <thichthat@gmail.com>)
jhen0409 linked an issue May 13, 2025 that may be closed by this pull request
Vali-98 (Contributor) commented May 14, 2025:

Hey there, I'm interested in taking a look into this on the weekend. One concern I have is about tokenization - is it possible to consistently get the token count of an image/audio input?

a-ghorbani (Collaborator, Author):

> Hey there, I'm interested in taking a look into this on the weekend. One concern I have is about tokenization - is it possible to consistently get the token count of an image/audio input?

mtmd_image_tokens_get_n_tokens will get you the number of tokens.
But my current implementation is not using that correctly, I guess.

Perhaps we need to use this approach for reliable input token management.
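
For illustration, a minimal sketch of how the token count of a tokenized chunk can be read back, assuming the mtmd.h chunk accessors of that period (chunk_n_tokens is a hypothetical helper; exact accessor names may differ):

// Sketch only: after mtmd_tokenize(), image chunks report their own token count.
static size_t chunk_n_tokens(const mtmd_input_chunk * chunk) {
    if (mtmd_input_chunk_get_type(chunk) == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        return mtmd_image_tokens_get_n_tokens(mtmd_input_chunk_get_tokens_image(chunk));
    }
    size_t n_text_tokens = 0;
    mtmd_input_chunk_get_tokens_text(chunk, &n_text_tokens);
    return n_text_tokens;
}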

a-ghorbani (Collaborator, Author):

On iPhone 13 Pro:

vision-pal0.mp4 (video attachment)

hmmhmmhm:

> On iPhone 13 Pro:
> vision-pal0.mp4

Bravo!

jhen0409 (Member) left a comment:

Awesome! Thanks for trying to integrate this.

I may make some changes later.

Review thread on:

input_text.add_special = n_past == 0; // Add BOS token if this is the first message
input_text.parse_special = true; // Parse special tokens like <__image__>

a-ghorbani (Collaborator, Author):

Hey @jhen0409 I thought this might be helpful to add, but if it feels like it's cluttering the code, feel free to remove it.

Review thread on:

// Track the total number of tokens (both text and image)
size_t total_token_count = 0;

a-ghorbani (Collaborator, Author):

@jhen0409 same with this comment.

Review thread on:

// Track the total number of tokens (both text and image)
size_t total_token_count = 0;

a-ghorbani (Collaborator, Author) commented May 18, 2025:

@ngxson I am wondering if this comment here is accurate. thanks!

ngxson (Contributor) commented May 18, 2025:

Yes, this whole comment block seems correct to me. I'm glad you figured out that you need a separate size_t total_token_count to keep track of the actual number of tokens; otherwise all_tokens.size() will give the total position count and not the total number of "context" tokens.

You can also take a look at ggml-org/llama.cpp#13576 to see how I address this issue. It should be quite similar to what you are doing here.
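
To make that distinction concrete, here is a sketch of the bookkeeping described above, assuming the same mtmd.h accessors (mtmd_image_tokens_get_n_pos in particular is an assumption about the API of that period, and this is not the PR's exact code):

// Keep the real token count separate from the number of positions consumed.
size_t    total_token_count = 0;  // what context-size checks should use
llama_pos n_past            = 0;  // decoding position

for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
    const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
    if (mtmd_input_chunk_get_type(chunk) == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        const mtmd_image_tokens * img = mtmd_input_chunk_get_tokens_image(chunk);
        total_token_count += mtmd_image_tokens_get_n_tokens(img); // actual tokens
        n_past            += mtmd_image_tokens_get_n_pos(img);    // positions (can differ, e.g. with M-RoPE)
    } else {
        size_t n_text_tokens = 0;
        mtmd_input_chunk_get_tokens_text(chunk, &n_text_tokens);
        total_token_count += n_text_tokens;
        n_past            += (llama_pos) n_text_tokens;           // text: one position per token
    }
}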

Review thread on:

input_text.add_special = n_past == 0; // Add BOS token if this is the first message
input_text.parse_special = true; // Parse special tokens like <__image__>

a-ghorbani (Collaborator, Author):

@ngxson I am wondering if this comment here is accurate.

a-ghorbani marked this pull request as ready for review May 18, 2025 21:08
jhen0409 linked an issue May 19, 2025 that may be closed by this pull request
jhen0409 merged commit 615ecaa into mybigday:main May 19, 2025
6 checks passed
a-ghorbani deleted the feat/add-multimodal branch June 30, 2025 05:50

Successfully merging this pull request may close these issues:

Vision support
LLaVa support
5 participants