Feat: add multimodal - poc #141
Conversation
Nice, this looks correct overall; I only have some small comments.
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Hey there, I'm interested in taking a look at this over the weekend. One concern I have is about tokenization: is it possible to consistently get the token count of an image/audio input?
Perhaps we need to use this approach for reliable input token management.
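For what it's worth, here is a rough sketch of how such a count could be obtained by tokenizing the prompt through mtmd and summing per-chunk token counts. The function names are assumptions based on my reading of upstream `tools/mtmd/mtmd.h`, not code from this PR:

```cpp
// Hypothetical helper: tokenize a text + image prompt into chunks and sum the
// per-chunk token counts. Names/signatures are assumptions from llama.cpp's
// tools/mtmd/mtmd.h; verify against the header before relying on them.
#include "mtmd.h"

static size_t count_prompt_tokens(mtmd_context * mctx,
                                  const char * prompt,           // text containing media markers
                                  const mtmd_bitmap ** bitmaps,  // decoded image/audio inputs
                                  size_t n_bitmaps) {
    mtmd_input_text text;
    text.text          = prompt;
    text.add_special   = true;  // first message in the context: add BOS
    text.parse_special = true;  // resolve markers such as <__image__>

    mtmd_input_chunks * chunks = mtmd_input_chunks_init();
    size_t n_tokens = 0;
    if (mtmd_tokenize(mctx, chunks, &text, bitmaps, n_bitmaps) == 0) {
        // every chunk (text or image) reports how many tokens it contributes
        for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
            n_tokens += mtmd_input_chunk_get_n_tokens(mtmd_input_chunks_get(chunks, i));
        }
    }
    mtmd_input_chunks_free(chunks);
    return n_tokens;
}
```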
On iPhone 13 Pro: vision-pal0.mp4
Bravo!
Awesome! Thanks for trying to integrate this.
I may make some changes later.
input_text.add_special = n_past == 0;  // Add BOS token if this is the first message
input_text.parse_special = true;       // Parse special tokens like <__image__>
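As a side note, a tiny sketch of the rule these two lines encode (my own illustration, not code from this PR; `fill_input_text` is a hypothetical helper):

```cpp
#include "llama.h"
#include "mtmd.h"

// BOS is only added when the context is empty (n_past == 0), and parse_special
// must stay on so <__image__> is recognized as a media marker instead of being
// tokenized as plain text.
static void fill_input_text(mtmd_input_text & input_text,
                            const char * prompt,
                            llama_pos n_past) {
    input_text.text          = prompt;
    input_text.add_special   = n_past == 0; // BOS only for the very first message
    input_text.parse_special = true;        // keep <__image__> as a special marker
}
```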
Hey @jhen0409 I thought this might be helpful to add, but if it feels like it's cluttering the code, feel free to remove it.
// Track the total number of tokens (both text and image)
size_t total_token_count = 0;
@jhen0409 same with this comment.
// Track the total number of tokens (both text and image)
size_t total_token_count = 0;
@ngxson I am wondering if this comment here is accurate. Thanks!
Yes, this whole comment block seems correct to me. I'm glad you figured out that you need a separate `size_t total_token_count` to keep track of the actual number of tokens; otherwise `all_tokens.size()` will give you the total number of positions rather than the total number of "context" tokens.
You can also take a look at ggml-org/llama.cpp#13576 to see how I address this issue. It should be much the same as what you are doing here.
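To make the tokens-versus-positions distinction concrete, here is a small sketch of the bookkeeping (my own illustration, assuming the chunk accessors in upstream `mtmd.h`; `account_chunks` is hypothetical):

```cpp
#include "llama.h"
#include "mtmd.h"

// A chunk's token count and its position count can differ (e.g. image chunks
// under M-RoPE), so the number of "context" tokens has to be tracked
// separately from the position counter (n_past).
static void account_chunks(const mtmd_input_chunks * chunks,
                           size_t & total_token_count, // actual tokens, text + image
                           llama_pos & n_past) {       // positions consumed in the KV cache
    for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
        const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
        total_token_count += mtmd_input_chunk_get_n_tokens(chunk); // the count all_tokens.size() does not give
        n_past            += mtmd_input_chunk_get_n_pos(chunk);    // what all_tokens.size() effectively tracks
    }
}
```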
input_text.add_special = n_past == 0;  // Add BOS token if this is the first message
input_text.parse_special = true;       // Parse special tokens like <__image__>
@ngxson I am wondering if this comment here is accurate.
This PR is a proof of concept (PoC) for adding multimodal capabilities, based on https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd.
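For readers unfamiliar with mtmd, the overall flow the PoC builds on looks roughly like this. This is a sketch from my reading of `tools/mtmd`; the helper names, signatures, prompt, and file path are assumptions, not code from this PR:

```cpp
#include "llama.h"
#include "mtmd.h"

// Sketch: load the multimodal projector next to the text model, tokenize a
// prompt containing an image marker into chunks, evaluate the chunks, then
// continue with the usual llama_decode() sampling loop.
static bool eval_multimodal_prompt(llama_model * model,
                                   llama_context * lctx,
                                   const char * mmproj_path,      // e.g. "mmproj.gguf" (assumed)
                                   const mtmd_bitmap ** bitmaps,  // decoded images
                                   size_t n_bitmaps) {
    mtmd_context_params mparams = mtmd_context_params_default();
    mtmd_context * mctx = mtmd_init_from_file(mmproj_path, model, mparams);
    if (!mctx) return false;

    mtmd_input_text text;
    text.text          = "Describe this image: <__image__>";
    text.add_special   = true;   // fresh context: add BOS
    text.parse_special = true;   // turn <__image__> into an image chunk

    mtmd_input_chunks * chunks = mtmd_input_chunks_init();
    bool ok = mtmd_tokenize(mctx, chunks, &text, bitmaps, n_bitmaps) == 0;

    llama_pos new_n_past = 0;
    if (ok) {
        // encodes image chunks and decodes text chunks in one pass
        ok = mtmd_helper_eval_chunks(mctx, lctx, chunks,
                                     /*n_past*/ 0, /*seq_id*/ 0, /*n_batch*/ 512,
                                     /*logits_last*/ true, &new_n_past) == 0;
    }

    mtmd_input_chunks_free(chunks);
    mtmd_free(mctx);
    // generation would continue from new_n_past with llama_decode()
    return ok;
}
```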