Multimodality #967
Conversation
Removed the `convert_to_Langchain_message` and `convert_to_Cat_message` functions, as they are no longer used in the codebase and their functionality is included in the `StrayCat` class.
Modified `CatMessage` and `UserMessage` classes to accept `images` and `audio` as either a single string or a list of strings.
- Marked the `content` field in `CatMessage` as deprecated and mapped it to the `text` field.
- Introduced a `deprecation_warning` to notify users of the deprecation.
- Added a `computed_field` and property for `content` to return the `text` value while maintaining backward compatibility.
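A minimal sketch of the backward-compatible `content` alias, using a plain Python property instead of the project's actual pydantic `computed_field` (the class shape and warning text here are assumptions):

```python
import warnings
from dataclasses import dataclass


@dataclass
class CatMessage:
    # Simplified stand-in for the real pydantic model
    text: str

    @property
    def content(self) -> str:
        # Deprecated alias: warn, then delegate to `text`
        warnings.warn(
            "`content` is deprecated, use `text` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.text


msg = CatMessage(text="Hello")
assert msg.content == "Hello"  # old code keeps working
```

Old callers reading `content` still get the text, but now see a `DeprecationWarning` nudging them toward `text`.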
- Added detailed docstrings to `Role`, `ModelInteraction`, `LLMModelInteraction`, `EmbedderModelInteraction`, `MessageWhy`, `CatMessage`, and `UserMessage` classes.
- Introduced `HistoryEntry` class for structured conversation history, encapsulating role, timestamp, and message content.
- Deprecated direct `message`, `why`, and `who` attributes in favor of accessing message details through the `content` attribute in `HistoryEntry`.
- Added `update_history` method, enhancing compatibility with `UserMessage` and `CatMessage` objects.
- Deprecated `update_conversation_history` in favor of `update_history`.
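The `HistoryEntry` idea can be sketched as follows, with plain dataclasses standing in for the project's pydantic models (class names follow the PR; the exact fields and role labels are assumptions):

```python
from dataclasses import dataclass, field
from time import time
from typing import List, Union


@dataclass
class UserMessage:  # stand-in for the real model
    text: str


@dataclass
class CatMessage:   # stand-in for the real model
    text: str


@dataclass
class HistoryEntry:
    # Encapsulates conversation metadata around the message content
    content: Union[UserMessage, CatMessage]
    when: float = field(default_factory=time)

    @property
    def role(self) -> str:
        # Derived from the content type, so callers need not check it
        return "Human" if isinstance(self.content, UserMessage) else "AI"


class WorkingMemory:
    def __init__(self) -> None:
        self.history: List[HistoryEntry] = []

    def update_history(self, message: Union[UserMessage, CatMessage]) -> None:
        # Replacement for the deprecated `update_conversation_history`
        self.history.append(HistoryEntry(content=message))


memory = WorkingMemory()
memory.update_history(UserMessage(text="Hi"))
memory.update_history(CatMessage(text="Hello!"))
assert [entry.role for entry in memory.history] == ["Human", "AI"]
```

Metadata like `when` and `role` lives on the entry, so the message classes stay focused on content.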
- Refactored LLM initialization: extracted the LLM initialization logic into a private method `_initialize_llm` to streamline the main method, making the initialization process more modular and easier to maintain.
- Added `_test_llm_multimodality` to check if the selected LLM supports image input, using a black pixel in base64 as a test input.
- Refactored message formatting, introducing helper functions `format_human_message`, `format_ai_message`, and `format_images` to modularize the formatting of `HumanMessage` and `AIMessage` objects for Langchain compatibility.
- Added `has_image_modality` to check if the LLM supports image inputs, leveraging this in `format_human_message` to handle image attachments when available.
If the selected LLM supports only image URIs but a URL is provided, the image is downloaded and converted to a base64 data URI.
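A sketch of that download-and-convert step (the helper name and the MIME-type fallback are assumptions, not the PR's exact code):

```python
import base64
import mimetypes
from urllib.request import urlopen


def image_url_to_data_uri(url: str) -> str:
    """Download an image and re-encode it as a base64 data URI."""
    with urlopen(url) as response:
        data = response.read()
        mime = response.headers.get_content_type()
    if not mime.startswith("image/"):
        # Fall back to guessing from the URL extension
        mime = mimetypes.guess_type(url)[0] or "image/png"
    encoded = base64.b64encode(data).decode()
    return f"data:{mime};base64,{encoded}"
```

The resulting `data:image/...;base64,...` string can be sent to models that only accept inline image URIs.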
The audio attribute was removed because it is not currently used.
The image used to verify LLM support for LLM image URLs is now downloaded and converted to base64 to verify support for URIs.
Great stuff! P.S.: was the PR description created by an LLM?
In part, it was one o'clock and I wanted to sleep ahahaha
`def _check_image_support(llm, image_type: str, image_value: str) -> None:`
I'd prefer having good exception handling at the LLM invocation layer.
If an admin selects an LLM that does not support images and then sends it an image, can't we just handle that during LLM invocation with a try/except, covering any exception that happens (images not supported or something else)?
I would simplify this part a lot, take less responsibility on our shoulders, and fail gracefully if something strange happens.
What do you think?
With the new changes, a call to a non-multimodal LLM will fail in the agent, returning the error like any other LLM request error. A little less robust, but simpler, as you suggested.
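The agreed behavior, failing at invocation time instead of pre-checking modalities, can be sketched like this (the wrapper name and error formatting are assumptions):

```python
def invoke_llm(llm, messages):
    """Invoke the LLM and let any provider error, including unsupported
    image input, surface like a normal request error."""
    try:
        return llm.invoke(messages)
    except Exception as exc:
        # Fail gracefully: report the error as the agent's output
        return f"LLM error: {exc}"
```

This keeps the happy path untouched while turning an unsupported-image call into an ordinary error message.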
The role of the speaker (AI or Human).
when : float
The timestamp of the message.
content : Union[UserMessage, CatMessage]
Why encapsulate `UserMessage` and `CatMessage` in `content`?
Can't we just have a list of instances of those two classes, with `role` and `when` defined inside?
I suggest a class `Message`, parent for both `CatMessage` and `HumanMessage`, having those attributes.
If this is too complicated and breaks compatibility, we'll wait for multimodality in v2.
I believe encapsulation offers several advantages over inheritance in this case.
First, it provides a clear separation of responsibilities: `UserMessage` and `CatMessage` represent the message content, while `HistoryEntry` encapsulates metadata like `when` and `role`, allowing us to add future conversation metadata without modifying the message classes.
Composition also helps us avoid potential inheritance issues if `UserMessage` and `CatMessage` evolve to present significant differences, and it makes it easier to add new message types without affecting the history structure.
For me, inheritance is not the answer in this case.
Modified the `CatMessage` and `UserMessage` classes to use singular `image` and `audio` attributes.
- Removed `LLMSupportedModalities` for a cleaner architecture.
- Simplified image handling: image URLs are now always converted to base64.
- If an LLM that does not support images is selected, an error is shown as usual.
- `role` is an immutable property. This attribute is useful to avoid checking the type of `content` to determine whether it is a user message or a cat message.
- `when` uses a default factory to set the current time.
`# If the image is a URL, download it and encode it as a data URI`
If the image is a URL, it is now always downloaded and encoded to base64.
Thanks a lot for your amazing PR and patience, kudos @Pingdred .
Description
This pull request introduces several enhancements to the Cheshire Cat project, improving the integration of language models with multimodal capabilities, with a focus on image support. These changes aim to provide a more standardized interface for developers to work with Cheshire Cat.
Multimodal Message Handling
The `CatMessage` and `UserMessage` classes have been enhanced to support the inclusion of images and audio, facilitating richer interactions. This update enables users to effortlessly send and receive multimedia content during their conversations with the Cheshire Cat.
Deprecation of content Field and update_conversation_history
The `content` field in the `CatMessage` class has been marked as deprecated, and it has been mapped to the `text` field to maintain backward compatibility. This change ensures a more intuitive and standardized approach to message handling.
Additionally, the `update_conversation_history` method has been deprecated in favor of the new `update_history` method, which enhances compatibility with `UserMessage` and `CatMessage` objects and ensures that the conversation history is properly updated and maintained.
Conversation History Management
The `HistoryEntry` class has been introduced to provide a structured representation of the conversation history. This class encapsulates the role, timestamp, and message content, allowing for more efficient and organized access to the conversation details.
The direct `message`, `why`, and `who` attributes have been deprecated in favor of accessing message details through the `content` attribute in `HistoryEntry`, which can be a `UserMessage` or a `CatMessage`. This change promotes a more consistent and intuitive approach to working with the conversation history.
The `update_history` method has been added in `WorkingMemory` to enhance compatibility with `UserMessage` and `CatMessage` objects.
LLM Initialization and Multimodal Support
During the initialization of the selected LLM in the `CheshireCat` class, the `_check_image_support` method is used to verify support for multimodal inputs. This method checks whether the LLM can process image inputs by testing both an image URL and a base64-encoded data URI of an image.
Documentation Improvements
Detailed docstrings have been added to several key classes and methods, including `Role`, `ModelInteraction`, `LLMModelInteraction`, `EmbedderModelInteraction`, `MessageWhy`, and `CatMessage`.
Add Images to Langchain Conversation History
The process of converting messages to Langchain format in the `StrayCat` method `langchainfy_chat_history` has been refactored to introduce new helper functions: `format_human_message`, `format_ai_message`, and `format_images`. The `format_images` function specifically handles the inclusion of image content in the conversation history. It formats a list of image URLs or base64-encoded data URIs into the appropriate structure required by Langchain, ensuring that multimedia elements are properly formatted alongside `HumanMessage` and `AIMessage` objects.
In particular, if the LLM supports data URIs but not image URLs, the `format_images` function automatically downloads the image, encodes it into base64, and wraps it in a data URI. Conversely, if the model supports image URLs, the function will include the image directly as a URL. This ensures that images are processed and included correctly, depending on the capabilities of the selected model.
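Based on the description above, `format_images` and its use in `format_human_message` might look like this sketch; the content-block shape follows Langchain's multimodal convention, while the exact signatures are assumptions:

```python
from typing import Dict, List, Optional, Union


def format_images(images: Union[str, List[str]]) -> List[Dict]:
    """Shape image URLs or base64 data URIs into Langchain-style blocks."""
    if isinstance(images, str):
        images = [images]
    return [{"type": "image_url", "image_url": {"url": img}} for img in images]


def format_human_message(
    text: str, images: Optional[Union[str, List[str]]] = None
) -> List[Dict]:
    """Combine text with optional image blocks for a HumanMessage payload."""
    content: List[Dict] = [{"type": "text", "text": text}]
    if images:
        content.extend(format_images(images))
    return content


blocks = format_human_message(
    "What is in this picture?", "https://example.com/cat.png"
)
assert blocks[0] == {"type": "text", "text": "What is in this picture?"}
assert blocks[1]["image_url"]["url"] == "https://example.com/cat.png"
```

Text-only messages produce a single text block, so the helper is safe to use for every history entry.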
Table of tested providers
Legend:
✅ Supported
❌ Compatibility Error
🟠 Not supported by APIs
Related to issue #564
Type of change
Checklist: