Multimodality #967

Merged
merged 19 commits into from
Dec 20, 2024

Pingdred
Member

@Pingdred Pingdred commented Nov 7, 2024

Description

This pull request introduces several enhancements to the Cheshire Cat project, focusing on improving the integration of language models with multimodal capabilities, with a focus on image support. These changes aim to provide a more standardized interface for developers to work with Cheshire Cat.

Multimodal Message Handling

The CatMessage and UserMessage classes now support the inclusion of images and audio, so users can send and receive multimedia content during their conversations with the Cheshire Cat.

Deprecation of content Field and update_conversation_history

The content field in the CatMessage class has been marked as deprecated, and it has been mapped to the text field to maintain backward compatibility. This change ensures a more intuitive and standardized approach to message handling.

Additionally, the update_conversation_history method has been deprecated in favor of the new update_history method, which enhances compatibility with UserMessage and CatMessage objects and ensures that the conversation history is properly updated and maintained.
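As a rough illustration of the deprecation pattern described above, here is a dependency-free sketch. It is not the code from this PR: the real class is a pydantic model that uses a `computed_field`, while `CatMessageSketch` and its `image` field are hypothetical names that use a plain property to show the same idea.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatMessageSketch:
    """Hypothetical stand-in for CatMessage, not the class from this PR."""

    text: str
    image: Optional[str] = None  # assumed field: an image URL or data URI

    @property
    def content(self) -> str:
        """Deprecated alias of `text`, kept for backward compatibility."""
        warnings.warn(
            "`content` is deprecated, use `text` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.text


# Old code that reads `content` keeps working but triggers a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    message = CatMessageSketch(text="Meow")
    legacy_value = message.content

print(legacy_value)
print(caught[0].category.__name__)
```

Plugins that still read `content` get the same value as `text`, and the warning points them to the new field.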

Conversation History Management

The HistoryEntry class has been introduced to provide a structured representation of the conversation history. This class encapsulates the role, timestamp, and message content, allowing for more efficient and organized access to the conversation details.

The direct message, why, and who attributes have been deprecated in favor of accessing message details through the content attribute in HistoryEntry, which can be either a UserMessage or a CatMessage. This change promotes a more consistent and intuitive approach to working with the conversation history.

The update_history method has been added in WorkingMemory to enhance compatibility with UserMessage and CatMessage objects.
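The structure described in this section can be sketched as follows. This is a simplified illustration using dataclasses, not the classes from this PR: `WorkingMemorySketch`, the way `role` is derived from the message type, and the `(who, message)` signature of the deprecated method are all assumptions made for the example.

```python
import time
import warnings
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Union


class Role(str, Enum):
    AI = "AI"
    Human = "Human"


@dataclass
class UserMessage:
    text: str


@dataclass
class CatMessage:
    text: str


@dataclass
class HistoryEntry:
    """Wraps a message together with conversation metadata."""

    content: Union[UserMessage, CatMessage]
    when: float = field(default_factory=time.time)  # timestamp set at creation

    @property
    def role(self) -> Role:
        # Assumption: the role is derived from the type of the wrapped message.
        return Role.Human if isinstance(self.content, UserMessage) else Role.AI


class WorkingMemorySketch:
    """Hypothetical stand-in for WorkingMemory."""

    def __init__(self) -> None:
        self.history: List[HistoryEntry] = []

    def update_history(self, message: Union[UserMessage, CatMessage]) -> None:
        self.history.append(HistoryEntry(content=message))

    def update_conversation_history(self, who: str, message: str) -> None:
        # Deprecated path: build a message object and delegate to update_history.
        warnings.warn(
            "update_conversation_history is deprecated, use update_history.",
            DeprecationWarning,
            stacklevel=2,
        )
        wrapped = UserMessage(text=message) if who == "Human" else CatMessage(text=message)
        self.update_history(wrapped)


memory = WorkingMemorySketch()
memory.update_history(UserMessage(text="Hi"))
memory.update_history(CatMessage(text="Hello"))
print([entry.role.value for entry in memory.history])
```

Code that walks the history can then read `entry.role`, `entry.when` and `entry.content` without caring which message class is stored.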

LLM Initialization and Multimodal Support

During the initialization of the selected LLM in the CheshireCat class, the _check_image_support method is used to verify support for multimodal inputs. This method checks if the LLM can process image inputs by testing both an image URL and a base64-encoded data URI of an image.
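A minimal sketch of such a probe is shown below. It is an illustration under assumptions, not the implementation in this PR: the `invoke` callable, the returned dictionary, and the fake model are hypothetical, and the content blocks follow the `image_url` message format commonly used for multimodal chat models.

```python
from typing import Callable, Dict, List

# Placeholder payload for the example, not a real image.
DATA_URI = "data:image/png;base64,AAAA"
IMAGE_URL = "https://example.com/cat.png"


def check_image_support(
    invoke: Callable[[List[dict]], object],
    image_url: str,
    data_uri: str,
) -> Dict[str, bool]:
    """Send one tiny request per image kind and record whether it succeeds."""
    support: Dict[str, bool] = {}
    for kind, value in (("image_url", image_url), ("data_uri", data_uri)):
        content = [
            {"type": "text", "text": "Respond with `ok`."},
            {"type": "image_url", "image_url": {"url": value}},
        ]
        try:
            invoke(content)
            support[kind] = True
        except Exception:
            # Any failure is treated as "this input kind is not supported".
            support[kind] = False
    return support


def data_uri_only_llm(content: List[dict]) -> str:
    """Fake model that accepts data URIs but rejects remote URLs."""
    for block in content:
        if block["type"] == "image_url" and block["image_url"]["url"].startswith("http"):
            raise ValueError("remote image URLs are not supported")
    return "ok"


support = check_image_support(data_uri_only_llm, IMAGE_URL, DATA_URI)
print(support)
```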

Documentation Improvements

Detailed docstrings have been added to several key classes and methods, including Role, ModelInteraction, LLMModelInteraction, EmbedderModelInteraction, MessageWhy, and CatMessage.

Add Images to Langchain Conversation History

The process of converting messages to Langchain format in the StrayCat method langchainfy_chat_history has been refactored to introduce new helper functions: format_human_message, format_ai_message, and format_images. The format_images function specifically handles the inclusion of image content in the conversation history. It formats a list of image URLs or base64-encoded data URIs into the appropriate structure required by Langchain, ensuring that multimedia elements are properly formatted alongside HumanMessage and AIMessage objects.

In particular, if the LLM supports data URIs but not image URLs, the format_images function automatically downloads the image, encodes it into base64, and wraps it in a data URI. Conversely, if the model supports image URLs, the function will include the image directly as a URL. This ensures that images are processed and included correctly, depending on the capabilities of the selected model.
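The URL-to-data-URI fallback can be sketched roughly like this. It is a simplified illustration rather than the function from this PR: the `supports_url` flag, the injectable `download` parameter, and the helper names `_download` and `to_data_uri` are assumptions made so the example runs without network access.

```python
import base64
from typing import Callable, List, Tuple
from urllib.request import urlopen


def _download(url: str) -> Tuple[bytes, str]:
    """Fetch the image bytes and the MIME type reported by the server."""
    with urlopen(url, timeout=10) as response:
        return response.read(), response.headers.get_content_type()


def to_data_uri(image: str, download: Callable[[str], Tuple[bytes, str]] = _download) -> str:
    """Return `image` unchanged if it is already a data URI, else download and encode it."""
    if image.startswith("data:"):
        return image
    raw, mime = download(image)
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def format_images(
    images: List[str],
    supports_url: bool,
    download: Callable[[str], Tuple[bytes, str]] = _download,
) -> List[dict]:
    """Build `image_url` content blocks, converting URLs only when needed."""
    blocks = []
    for image in images:
        url = image if supports_url else to_data_uri(image, download)
        blocks.append({"type": "image_url", "image_url": {"url": url}})
    return blocks


def fake_download(url: str) -> Tuple[bytes, str]:
    """Stand-in for a real HTTP request."""
    return b"abc", "image/png"


blocks = format_images(["https://example.com/cat.png"], supports_url=False, download=fake_download)
print(blocks[0]["image_url"]["url"])
```

With `supports_url=True` the same call leaves the URL untouched, which mirrors the two branches described above.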

Table of tested providers

Provider Image URL Image URI
OpenAI API
Together AI
Google API
Anthropic API 🟠
Ollama 🟠

Legend:
✅ Supported
❌ Compatibility Error
🟠 Not supported by APIs

Related to issue #564

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas

Removed the `convert_to_Langchain_message` and `convert_to_Cat_message` functions, as they are no longer used in the code base and are now included in the `StrayCat` class.
Modified `CatMessage` and `UserMessage` classes to accept `images` and `audio` as either a single string or a list of strings.
- Marked the `content` field in `CatMessage` as deprecated and mapped it to the `text` field.
- Introduced a `deprecation_warning` to notify users of the deprecation.
- Added a `computed_field` and property for `content` to return the `text` value while maintaining backward compatibility.
- Added detailed docstrings to `Role`, `ModelInteraction`, `LLMModelInteraction`, `EmbedderModelInteraction`, `MessageWhy`, `CatMessage`, and `UserMessage` classes.
- Introduced `HistoryEntry` class for structured conversation history, encapsulating role, timestamp, and message content.

- Deprecated direct `message`, `why`, and `who` attributes in favor of accessing message details through the `content` attribute in `HistoryEntry`.

- Added `update_history` method, enhancing compatibility with `UserMessage` and `CatMessage` objects.

- Deprecated `update_conversation_history` in favor of `update_history`.
- Refactoring LLM Initialization: Extracted LLM initialization logic into a private method `_initialize_llm` to streamline the main method, making the initialization process more modular and easier to maintain.

- Added `_test_llm_mulimodality` to check if the selected LLM supports image input, using a black pixel in base64 as a test input.
- Refactored message formatting, introducing helper functions `format_human_message`, `format_ai_message`, and `format_images` to modularize the formatting of `HumanMessage` and `AIMessage` objects for Langchain compatibility.
- Added `has_image_modality` to check if the LLM supports image inputs, leveraging this in `format_human_message` to handle image attachments when available.
If the selected LLM supports only image URIs, but a URL is provided, the image is downloaded and converted to a base64 data URI.
The audio attribute was removed because it is not currently used.
The image used to verify LLM support for image URLs is now downloaded and converted to base64 to verify support for URIs.
@pieroit
Member

pieroit commented Nov 7, 2024

Great stuff!
Please be careful not to over-engineer ;)
Less is more

P.S.: was the PR description created by an LLM?

@Pingdred
Member Author

Pingdred commented Nov 7, 2024

> Great stuff!
> Please be careful not to over-engineer ;)
> Less is more
>
> P.S.: was the PR description created by an LLM?

In part, it was one o'clock and I wanted to sleep ahahaha

@pieroit
Member

pieroit commented Nov 7, 2024

> In part, it was one o'clock and I wanted to sleep ahahaha

[image attachment]

@valentimarco valentimarco added enhancement New feature or request LLM Related to language model / embedder labels Nov 7, 2024
})
return content

def _check_image_support(llm, image_type: str, image_value: str) -> None:
Member

I'd prefer having good exception handling at the LLM invocation layer.

If an admin selects an LLM that does not support images and then sends it an image, can't we just handle that during LLM invocation with a try/except, whatever the exception is (images not supported or something else)?

I would simplify this part a lot, take less responsibility on our shoulders, and fail gracefully if something strange happens.

What do you think?

Member Author

With the new changes, a call to a non-multimodal LLM will fail in the agent, and the error is returned like any other failed request to the LLM. A little less robust but simpler, as you suggested.
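The approach agreed on in this thread can be sketched as below. This is only an illustration, assuming a hypothetical `invoke` callable and error message format; it is not the agent code from this PR.

```python
import logging
from typing import Callable

log = logging.getLogger("llm_sketch")


def invoke_with_fallback(invoke: Callable[[str], str], prompt: str) -> str:
    """Call the LLM and turn any failure into a readable error reply."""
    try:
        return invoke(prompt)
    except Exception as error:
        # No capability check up front: unsupported images fail here
        # like any other request error.
        log.error("LLM invocation failed: %s", error)
        return f"The language model returned an error: {error}"


def text_only_llm(prompt: str) -> str:
    """Fake model that rejects the request, as a text-only LLM might."""
    raise ValueError("image input is not supported by this model")


reply = invoke_with_fallback(text_only_llm, "Describe this image")
print(reply)
```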

core/cat/looking_glass/cheshire_cat.py (outdated)
The role of the speaker (AI or Human).
when : float
The timestamp of the message.
content : Union[UserMessage, CatMessage]
Member

Why encapsulate UserMessage and CatMessage in content?
Can't we just have a list of instances of those two classes, and have role and when defined inside?
I suggest a Message class, parent of both CatMessage and HumanMessage, having those attributes.

If this is too complicated and breaks compatibility, we'll wait for multimodality in v2

Member Author

@Pingdred Pingdred Dec 19, 2024

I believe encapsulation offers several advantages over inheritance in this case.

First, it provides a clear separation of responsibilities, where UserMessage and CatMessage represent the message content, while HistoryEntry encapsulates metadata like when and role, allowing us to add future conversation metadata without modifying the message classes.

Composition also helps us avoid potential inheritance issues if UserMessage and CatMessage evolve with significant differences, and it makes it easier to add new message types without affecting the history structure.

For me, inheritance is not the answer in this case.

Modified `CatMessage` and `UserMessage` classes to use singular `image` and `audio` fields
- Removed LLMSupportedModalities for cleaner architecture.
- Simplified image handling: image URLs are now always converted to base64.
- If an LLM that does not support images is selected, an error is shown as usual.
- `role` is an immutable property. This attribute is useful to avoid checking the type of content to determine if it is a user message or a cat message.
- `when` uses a default factory to set the current time.

# If the image is a URL, download it and encode it as a data URI
Member Author

If the image is a URL, it is now always downloaded and encoded to base64.

@pieroit
Member

pieroit commented Dec 20, 2024

Thanks a lot for your amazing PR and patience, kudos @Pingdred .

@pieroit pieroit merged commit 10b4c2a into cheshire-cat-ai:develop Dec 20, 2024
2 checks passed
Labels
enhancement New feature or request LLM Related to language model / embedder
4 participants