Multimodality #967

Pingdred · 2024-11-07T00:02:51Z

Description

This pull request introduces several enhancements to the Cheshire Cat project, focusing on improving the integration of language models with multimodal capabilities, with a focus on image support. These changes aim to provide a more standardized interface for developers to work with Cheshire Cat.

Multimodal Message Handling

The CatMessage and UserMessage classes have been enhanced to support the inclusion of images and audio, facilitating richer interactions. This update enables users to effortlessly send and receive multimedia content during their conversations with the Cheshire Cat.

Deprecation of content Field and update_conversation_history

The content field in the CatMessage class has been marked as deprecated, and it has been mapped to the text field to maintain backward compatibility. This change ensures a more intuitive and standardized approach to message handling.
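As a rough illustration of the mapping described above, here is a minimal sketch in plain Python. It is not the PR's code: the real `CatMessage` is a pydantic model (the commit notes mention a `computed_field`), and the constructor signature and warning text below are assumptions made for the example.

```python
import warnings
from typing import Optional


class CatMessage:
    """Simplified stand-in for the real (pydantic-based) CatMessage."""

    def __init__(self, text: str = "", content: Optional[str] = None):
        # Accept the legacy `content` keyword, but store it as `text`.
        if content is not None:
            warnings.warn(
                "`content` is deprecated, use `text` instead",
                DeprecationWarning,
                stacklevel=2,
            )
            text = content
        self.text = text

    @property
    def content(self) -> str:
        # Legacy read access keeps working and returns the `text` value.
        warnings.warn(
            "`content` is deprecated, use `text` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.text
```

Old callers that pass or read `content` keep working and get a `DeprecationWarning`; new callers use `text` directly.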

Additionally, the update_conversation_history method has been deprecated in favor of the new update_history method, which enhances compatibility with UserMessage and CatMessage objects and ensures that the conversation history is properly updated and maintained.

Conversation History Management

The HistoryEntry class has been introduced to provide a structured representation of the conversation history. This class encapsulates the role, timestamp, and message content, allowing for more efficient and organized access to the conversation details.

The direct message, why, and who attributes have been deprecated. Message details are now accessed through the content attribute of HistoryEntry, which can be either a UserMessage or a CatMessage. This change promotes a more consistent and intuitive approach to working with the conversation history.

The update_history method has been added in WorkingMemory to enhance compatibility with UserMessage and CatMessage objects.
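The relationship between the new and the deprecated method could look roughly like the sketch below. The old `(who, message, why)` signature, the `"AI"` marker, and the simplified message classes are assumptions for illustration; for brevity the history stores bare message objects rather than `HistoryEntry` instances.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class UserMessage:
    text: str


@dataclass
class CatMessage:
    text: str
    why: Dict[str, Any] = field(default_factory=dict)


class WorkingMemory:
    def __init__(self) -> None:
        self.history: List[Any] = []

    def update_history(self, message) -> None:
        """New entry point: takes a whole UserMessage or CatMessage."""
        self.history.append(message)

    def update_conversation_history(
        self, who: str, message: str, why: Optional[Dict[str, Any]] = None
    ) -> None:
        """Deprecated wrapper: builds a message object and delegates."""
        warnings.warn(
            "`update_conversation_history` is deprecated, use `update_history`",
            DeprecationWarning,
            stacklevel=2,
        )
        if who == "AI":
            obj = CatMessage(text=message, why=why or {})
        else:
            obj = UserMessage(text=message)
        self.update_history(obj)
```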

LLM Initialization and Multimodal Support

During the initialization of the selected LLM in the CheshireCat class, the _check_image_support method is used to verify support for multimodal inputs. This method checks if the LLM can process image inputs by testing both an image URL and a base64-encoded data URI of an image.
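A probe of this kind might be sketched as follows. This is only an approximation of the idea, not the code of `_check_image_support`: the `invoke` callable stands in for the LLM call, the prompt text is made up, and treating every exception as "not supported" is an assumption. The 1x1 black PNG mirrors the "black pixel in base64" test input mentioned in the commit notes.

```python
import base64
import struct
import zlib
from typing import Callable, Dict, List


def _png_chunk(tag: bytes, data: bytes) -> bytes:
    # A PNG chunk is: length, tag, data, CRC32 over tag + data.
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def black_pixel_data_uri() -> str:
    """Build a 1x1 black PNG and wrap it in a base64 data URI."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)  # 1x1, 8-bit RGB
    scanline = b"\x00" + b"\x00\x00\x00"  # filter type 0 + one black pixel
    png = (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(scanline))
        + _png_chunk(b"IEND", b"")
    )
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")


def supports_image(invoke: Callable[[List[Dict]], object], image_value: str) -> bool:
    """Return True if `invoke` accepts a message carrying `image_value`."""
    content = [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": image_value}},
    ]
    try:
        invoke(content)
        return True
    except Exception:
        # Any failure is treated as "not supported" in this sketch.
        return False
```

Calling `supports_image` once with an image URL and once with `black_pixel_data_uri()` gives the two capability flags the description refers to.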

Documentation Improvements

Detailed docstrings have been added to several key classes and methods, including Role, ModelInteraction, LLMModelInteraction, EmbedderModelInteraction, MessageWhy, and CatMessage.

Add Images to Langchain Conversation History

The process of converting messages to Langchain format in the StrayCat method langchainfy_chat_history has been refactored to introduce new helper functions: format_human_message, format_ai_message, and format_images. The format_images function specifically handles the inclusion of image content in the conversation history. It formats a list of image URLs or base64-encoded data URIs into the appropriate structure required by Langchain, ensuring that multimedia elements are properly formatted alongside HumanMessage and AIMessage objects.

In particular, if the LLM supports data URIs but not image URLs, the format_images function automatically downloads the image, encodes it in base64, and wraps it in a data URI. Conversely, if the model supports image URLs, the function includes the image directly as a URL. This ensures that images are processed and included correctly, depending on the capabilities of the selected model.
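The URL-versus-data-URI decision can be sketched like this. The function name `format_images` comes from the PR, but the signature, the `supports_url` flag, and the helper names below are assumptions; the `image_url` content-part shape is the OpenAI-style structure that Langchain chat models accept.

```python
import base64
import urllib.request
from typing import Callable, Dict, List


def to_data_uri(raw: bytes, mime_type: str) -> str:
    """Wrap raw image bytes in a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def download_as_data_uri(url: str) -> str:
    # Hypothetical helper: fetches the image and reuses the served MIME type.
    with urllib.request.urlopen(url, timeout=10) as response:
        mime_type = response.headers.get_content_type()
        return to_data_uri(response.read(), mime_type)


def format_images(
    images: List[str],
    supports_url: bool,
    fetch: Callable[[str], str] = download_as_data_uri,
) -> List[Dict]:
    """Turn image URLs / data URIs into message content parts."""
    parts = []
    for image in images:
        # Only plain URLs need converting; data URIs pass through untouched.
        if image.startswith("http") and not supports_url:
            image = fetch(image)
        parts.append({"type": "image_url", "image_url": {"url": image}})
    return parts
```

Passing `fetch` as a parameter keeps the network call replaceable, which makes the branching easy to exercise without downloading anything.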

Table of tested providers

| Provider | Image URL | Data URI (base64) |
| --- | --- | --- |
| OpenAI API | ✅ | ✅ |
| Together AI | ✅ | ✅ |
| Google API | ✅ | ✅ |
| Anthropic API | 🟠 | ✅ |
| Ollama | 🟠 | ✅ |

Legend:
✅ Supported
❌ Compatibility Error
🟠 Not supported by APIs

Related to issue #564

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas

Removed the `convert_to_Langchain_message` and `convert_to_Cat_message` functions, as they are no longer used in the code base and their logic is now included in the `StrayCat` class.

Modified `CatMessage` and `UserMessage` classes to accept `images` and `audio` as either a single string or a list of strings.
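Accepting "a single string or a list of strings" usually implies some normalization step. A minimal sketch of one way to do it is below; the helper name `as_list` is hypothetical and the PR may handle this differently (for instance inside a pydantic validator).

```python
from typing import List, Union


def as_list(value: Union[str, List[str], None]) -> List[str]:
    """Normalize a single string, a list of strings, or None to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)
```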

- Marked the `content` field in `CatMessage` as deprecated and mapped it to the `text` field.
- Introduced a `deprecation_warning` to notify users of the deprecation.
- Added a `computed_field` and property for `content` to return the `text` value while maintaining backward compatibility.

- Added detailed docstrings to `Role`, `ModelInteraction`, `LLMModelInteraction`, `EmbedderModelInteraction`, `MessageWhy`, `CatMessage`, and `UserMessage` classes.

- Introduced `HistoryEntry` class for structured conversation history, encapsulating role, timestamp, and message content.
- Deprecated direct `message`, `why`, and `who` attributes in favor of accessing message details through the `content` attribute in `HistoryEntry`.
- Added `update_history` method, enhancing compatibility with `UserMessage` and `CatMessage` objects.
- Deprecated `update_conversation_history` in favor of `update_history`.

- Refactoring LLM Initialization: Extracted LLM initialization logic into a private method `_initialize_llm` to streamline the main method, making the initialization process more modular and easier to maintain.
- Added `_test_llm_mulimodality` to check if the selected LLM supports image input, using a black pixel in base64 as a test input.

- Refactored message formatting, introducing helper functions `format_human_message`, `format_ai_message`, and `format_images` to modularize the formatting of `HumanMessage` and `AIMessage` objects for Langchain compatibility.
- Added `has_image_modality` to check if the LLM supports image inputs, leveraging this in `format_human_message` to handle image attachments when available.

If the selected LLM supports only image URIs, but a URL is provided, the image is downloaded and converted to base64 URI.

core/cat/looking_glass/cheshire_cat.py

core/cat/looking_glass/stray_cat.py

core/pyproject.toml

core/cat/looking_glass/stray_cat.py

The audio attribute was removed because it is not currently used.

The image used to verify LLM support for image URLs is now also downloaded and converted to base64 to verify support for data URIs.

pieroit · 2024-11-07T14:14:18Z

Great stuff!
Please be careful not to over-engineer ;)
Less is more

P.S.: was the PR description created by an LLM?

Pingdred · 2024-11-07T14:48:41Z

> Great stuff!
> Please be careful not to over-engineer ;)
> Less is more
>
> P.S.: was the PR description created by an LLM?

In part, it was one o'clock and I wanted to sleep ahahaha

pieroit · 2024-11-07T14:56:19Z

> In part, it was one o'clock and I wanted to sleep ahahaha

core/cat/looking_glass/cheshire_cat.py

pieroit · 2024-12-09T15:13:29Z

core/cat/looking_glass/cheshire_cat.py

+            })
+            return content
+
+        def _check_image_support(llm, image_type: str, image_value: str) -> None:


I'd prefer having a good exception handling at the llm invocation layer.

If an admin selects an LLM that does not support images and then sends it an image, can't we just handle that during LLM invocation with a try/except that catches any exception (images not supported or something else)?

I would simplify this part a lot, take less responsibility on our shoulders and fail gracefully if something strange happens.

What do you think?

With the new changes, a call to a non-multimodal LLM will fail in the agent, returning the error like any other request error to the LLM. A little less robust but simpler, as you suggested.
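The "fail at invocation" approach discussed in this thread amounts to something like the sketch below. The names `invoke_llm` and `llm_call`, and the exact error string returned, are assumptions for illustration and not the agent's actual code.

```python
import logging
from typing import Callable, Dict, List

log = logging.getLogger(__name__)


def invoke_llm(llm_call: Callable[[List[Dict]], str], messages: List[Dict]) -> str:
    """Call the LLM and turn any failure into an ordinary error reply."""
    try:
        return llm_call(messages)
    except Exception as exc:
        # Unsupported images and any other provider error land here alike.
        log.error("LLM invocation failed: %s", exc)
        return f"LLM error: {exc}"
```

No capability probing is needed up front: a model that rejects image input simply surfaces its own error message through the same path as every other request failure.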

core/cat/looking_glass/cheshire_cat.py

pieroit · 2024-12-09T15:18:55Z

core/cat/memory/working_memory.py

+        The role of the speaker (AI or Human).
+    when : float
+        The timestamp of the message.
+    content : Union[UserMessage, CatMessage]


Why encapsulate UserMessage and CatMessage in content?
Can't we just have a list of instances of those two classes, and have role and when defined inside?
I suggest a class Message, parent of both CatMessage and UserMessage, having those attributes.

If this is too complicated and breaks compatibility, we'll wait for multimodality in v2

I believe encapsulation offers several advantages over inheritance in this case.

First, it provides a clear separation of responsibilities, where UserMessage and CatMessage represent the message content, while HistoryEntry encapsulates metadata like when and role, allowing us to add future conversation metadata without modifying the message classes.

Composition also helps us avoid potential inheritance issues if UserMessage and CatMessage diverge significantly as they evolve, and it makes it easier to add new message types without affecting the history structure.

For me, inheritance is not the answer in this case.

Modified `CatMessage` and `UserMessage` classes to use singular image and audio

- Removed LLMSupportedModalities for cleaner architecture.
- Simplified image handling: image URLs are now always converted to base64.
- If an LLM that does not support images is selected, an error is shown as usual.

- `role` is an immutable property. This attribute is useful to avoid checking the type of content to determine if it is a user message or a cat message.
- `when` uses a default factory to set the current time.
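The composition-based `HistoryEntry` described in these notes might be sketched as follows. This uses plain dataclasses rather than the project's pydantic models, and deriving `role` from the type of `content` is an assumption about how the read-only property is implemented.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Union


class Role(Enum):
    AI = "AI"
    Human = "Human"


@dataclass
class UserMessage:
    text: str


@dataclass
class CatMessage:
    text: str


@dataclass
class HistoryEntry:
    # The message itself stays a separate object (composition).
    content: Union[UserMessage, CatMessage]
    # Default factory: each entry gets the time at which it was created.
    when: float = field(default_factory=time.time)

    @property
    def role(self) -> Role:
        # Read-only: there is no setter, so assignment raises AttributeError.
        return Role.AI if isinstance(self.content, CatMessage) else Role.Human
```

Callers can branch on `entry.role` without inspecting the type of `entry.content`, and metadata such as `when` can grow without touching the message classes.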

Pingdred · 2024-12-19T12:36:49Z

core/cat/looking_glass/stray_cat.py


+                # If the image is a URL, download it and encode it as a data URI


If the image is a URL, it is now always downloaded and encoded to base64.

pieroit · 2024-12-20T00:04:28Z

Thanks a lot for your amazing PR and patience, kudos @Pingdred .

Pingdred added 11 commits November 5, 2024 21:33

Add: Util function for deprecation warning

ad85c2a

Del: convert_to_Langchain_message and convert_to_Cat_message

e21703a

Removed the `convert_to_Langchain_message` and `convert_to_Cat_message` functions, as they are no longer used in the code base and their logic is now included in the `StrayCat` class.

Add: image and audio to CatMessage and UserMessage

6cc8c04

Modified `CatMessage` and `UserMessage` classes to accept `images` and `audio` as either a single string or a list of strings.

Docs: documented messages.py

6e02977

- Added detailed docstrings to `Role`, `ModelInteraction`, `LLMModelInteraction`, `EmbedderModelInteraction`, `MessageWhy`, `CatMessage`, and `UserMessage` classes.

Fix: check image support only if there is a selected LLM

ce93029

Add: Check support for image uri and url

e1bf0dd

Add: Automatically download image

f5a4997

If the selected LLM supports only image URIs, but a URL is provided, the image is downloaded and converted to base64 URI.