Problem in the current LLM caching implementation (cache cannot be set per instance, only globally) #17176
-
We'd probably want a caching solution that wraps existing primitives rather than being passed into them. We don't want the primitives' code to know that a caching layer exists; otherwise that will lead to design issues down the road (much like the current global cache is problematic). I'd prefer to see a generic caching layer built on top of any runnable object, or, if it needs to be specialized to a chat model, the caching layer can inherit from BaseChatModel and accept the wrapped chat model as an instance attribute.
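To make the idea concrete, here is a rough, hedged sketch of what such a wrapper could look like. The class name `CachedChatModel` and its fields `inner` / `cache_backend` are illustrative, not part of LangChain's API, and keying the cache on `get_buffer_string(messages)` plus the wrapped model's `_llm_type` is just one possible choice:

```python
from typing import Any, List, Optional

from langchain_core.caches import BaseCache
from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import BaseMessage, get_buffer_string
from langchain_core.outputs import ChatResult


class CachedChatModel(BaseChatModel):
    """Illustrative wrapper: checks a per-instance cache before delegating."""

    inner: BaseChatModel      # the wrapped chat model; it knows nothing about caching
    cache_backend: BaseCache  # e.g. RedisSemanticCache, InMemoryCache, ...

    @property
    def _llm_type(self) -> str:
        return f"cached-{self.inner._llm_type}"

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> ChatResult:
        # Key the cache on the rendered conversation and the wrapped model's type.
        prompt = get_buffer_string(messages)
        llm_string = self.inner._llm_type
        cached = self.cache_backend.lookup(prompt, llm_string)
        if cached:
            # Assumes the backend hands back the ChatGeneration objects stored
            # below; serializing backends may need extra handling here.
            return ChatResult(generations=list(cached))
        result = self.inner._generate(
            messages, stop=stop, run_manager=run_manager, **kwargs
        )
        self.cache_backend.update(prompt, llm_string, result.generations)
        return result
```

With something along these lines, the wrapped model never sees the cache, and two wrappers in the same process can point at two different cache backends.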
-
I see you want a generic, abstract solution for the cache, but I am not sure you are referring to my solution mentioned in point 1.
This is the code from langchain; please see the comments in the code to better understand my solution (point 1). BUG: I have also found that the current global cache implementation does not work with streaming responses. @eyurtsev I see you are a core member of LangChain. Could you please share your plans on this? Will these features be added later? When? It would be nice if you created an issue from this conversation (only if you have a plan to rewrite the cache strategy) so that other developers know about it. This is important because LLM calls take a long time to respond, and you also agreed that the current cache implementation is problematic and not developer friendly.
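For reference, a minimal way to reproduce the streaming gap described above might look like the following (a hedged sketch; `InMemoryCache` is used only to keep the example self-contained, and the exact behaviour depends on the installed 0.1.x versions):

```python
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache
from langchain_openai import ChatOpenAI

set_llm_cache(InMemoryCache())
llm = ChatOpenAI()

llm.invoke("ping")  # a second identical invoke() is served from the global cache
llm.invoke("ping")

for chunk in llm.stream("ping"):  # reported issue: stream() bypasses the cache entirely
    print(chunk.content, end="", flush=True)
```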
-
Added an issue here: #17242. Given that the change you propose is very minimal and leverages existing code, I think it makes sense to extend the accepted values for the cache.
-
Feature request
I want to implement `RedisSemanticCache` for LLM calls, but the current implementation sets/gets the cache instance in/from a global variable, so it is not possible to make the cache specific to a call. I have an API where embedding-related information (api_key, model, etc.) is passed in the request body by the user, so I cannot use that embedding information in the cache, because the cache is set globally (more like a singleton).

1. The `cache` property of `BaseChatModel` is currently a bool. Instead of a bool, we can make it a nullable `BaseCache`; i.e., when we create an LLM (ChatOpenAI) we pass the cache instance (see the sketch below).
2. `RedisSemanticCache` indexing happens on the `llm_string`. There should be a way for the developer to pass the indexing value based on their requirements.

These problems persist in all the cache strategies provided by langchain. @baskaryan @hwchase17 do you have any plan for this?
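To illustrate the difference, here is a hedged sketch of today's global setup versus the per-instance `cache` argument proposed in point 1. Passing a `BaseCache` to the constructor is the requested change, not something the current release supports, and `user_api_key` is a hypothetical per-request value:

```python
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisSemanticCache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Today: one global cache shared by every model in the process.
set_llm_cache(
    RedisSemanticCache(redis_url="redis://localhost:6379", embedding=OpenAIEmbeddings())
)
global_llm = ChatOpenAI()  # silently uses the global cache

# Proposed: a nullable BaseCache on the model itself, so each request
# can bring its own embedding credentials.
user_api_key = "sk-..."  # hypothetical: supplied in the request body
per_request_llm = ChatOpenAI(
    cache=RedisSemanticCache(
        redis_url="redis://localhost:6379",
        embedding=OpenAIEmbeddings(openai_api_key=user_api_key),
    )
)
```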
Motivation
I have described the problem above. The current cache strategy does not give the developer enough control; the developer cannot even override anything, since the cache is set as a global instance.
Proposal (If applicable)
Mentioned Above