[inference] Retrieve and expose LLM capabilities #199087

Description

Even with a normalized API, the capabilities of the underlying LLMs vary from one model to another, and often from one provider to another.

For that reason, it would be very useful to define a set of "capabilities", and to have a way to evaluate which capabilities a given LLM supports.
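As a rough sketch, the exposed capabilities (the two detailed below) could take a shape like this; the names are purely illustrative and not an existing API:

```ts
/**
 * Hypothetical shape describing what a given LLM / connector supports.
 * None of these names exist today; this only illustrates the kind of
 * information we would want to expose.
 */
export interface ModelCapabilities {
  /** whether the model supports native function calling */
  nativeFunctionCalling: boolean;
  /** maximum context window, in tokens, if known */
  contextWindowTokens?: number;
}

export interface InferenceCapabilitiesApi {
  /** resolve the capabilities for a given connector */
  getCapabilities(connectorId: string): Promise<ModelCapabilities>;
}
```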

Identified capabilities

Right now, it would mostly be about two things:

native function calling

We do have a simulated function calling option for LLMs that don't natively support function calling, but we're not able to set that option automatically, as we don't know whether the LLM we're using supports native FC or not.

Knowing whether the model supports it would allow us to automatically fall back to SFC when the model has no native support. (We could then introduce an automatic value for the simulated function calling option.)
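A minimal sketch of that automatic resolution, assuming an illustrative auto option value (none of these names are the actual configuration surface):

```ts
// Illustrative only: the option names below are assumptions, not the
// actual options exposed by the inference APIs.
type FunctionCallingMode = 'native' | 'simulated' | 'auto';

function resolveFunctionCallingMode(
  requested: FunctionCallingMode,
  capabilities: { nativeFunctionCalling: boolean }
): 'native' | 'simulated' {
  if (requested === 'auto') {
    // fall back to simulated function calling when the model
    // has no native support
    return capabilities.nativeFunctionCalling ? 'native' : 'simulated';
  }
  return requested;
}
```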

context window length

RAG tasks often retrieve large amounts of data, either because of the number of documents or the length of each document. Without knowing the LLM's max context length, it's difficult for those RAG tasks to process that data properly.

For example, if a RAG task retrieves a 15k-token document, it might need to truncate / summarize / reduce it before sending it back to the LLM, as there is a risk of exceeding the context window. On the other hand, if we know that the LLM has a much larger context window (say 512k), we can send the document raw to make sure all the information is preserved.

Knowing the LLM's context window size is therefore fairly critical for RAG workflows to adapt.
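To illustrate, a RAG task could adapt roughly like this; the helper and token budget below are stand-ins, not existing code:

```ts
// Hypothetical helper: decide whether a retrieved document fits in the
// model's context window, keeping a margin for the prompt and the
// expected completion.
function fitsInContext(
  documentTokens: number,
  contextWindowTokens: number | undefined,
  reservedTokens = 4_000
): boolean {
  if (contextWindowTokens === undefined) {
    // capability unknown: stay conservative and assume a small window
    return documentTokens + reservedTokens <= 8_000;
  }
  return documentTokens + reservedTokens <= contextWindowTokens;
}

// A 15k-token document does not fit in a 16k window but fits in a 512k one:
// fitsInContext(15_000, 16_000)  === false -> truncate / summarize / reduce
// fitsInContext(15_000, 512_000) === true  -> send it raw
```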

How to do it

As configuration on the connectors

We could handle that via configuration, asking the user to specify these values when creating the connector.

Note that this would be more of a stopgap than a real solution: it moves the burden onto the end user, which also makes it error prone.
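For the record, such configuration could look something like the following, using @kbn/config-schema; the field names are made up for illustration and do not exist on the connectors today:

```ts
import { schema } from '@kbn/config-schema';

// Illustrative only: hypothetical optional fields the user would fill in
// when creating the connector.
const capabilitiesConfigSchema = schema.object({
  supportsNativeFunctionCalling: schema.maybe(schema.boolean()),
  contextWindowTokens: schema.maybe(schema.number({ min: 1 })),
});
```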

Retrieving the information from the LLM provider

Ideally, we would be able to retrieve (or deduce) this information from the provider.

For that, we would need to check whether each of our providers exposes a way to retrieve that info. This might be complicated, given that the openAI connector accepts virtually any openAI-compatible LLM, and there is, AFAIK, no openAI API to retrieve that information.

This might get even more complicated later, when we get to supporting the ES inference API / inference connector, as we would have an additional abstraction layer hiding that information.
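One possible way to "deduce" the information when the provider doesn't expose it would be a best-effort static registry keyed on model id; the entries below are illustrative examples that would need to be verified and maintained:

```ts
// Best-effort fallback when the provider exposes no capability metadata.
// Unknown models simply return undefined.
interface KnownModelInfo {
  nativeFunctionCalling: boolean;
  contextWindowTokens: number;
}

const KNOWN_MODELS: Record<string, KnownModelInfo> = {
  // example entries only
  'gpt-4o': { nativeFunctionCalling: true, contextWindowTokens: 128_000 },
  'claude-3-5-sonnet': { nativeFunctionCalling: true, contextWindowTokens: 200_000 },
};

function deduceCapabilities(modelId: string): KnownModelInfo | undefined {
  // exact match first, then a loose prefix match for versioned model ids
  return (
    KNOWN_MODELS[modelId] ??
    Object.entries(KNOWN_MODELS).find(([prefix]) => modelId.startsWith(prefix))?.[1]
  );
}
```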

Just check?

We could also think of just "checking" for capabilities, and storing the info somewhere.

For FC it could be relatively easy: we just try a call with FC, and if we get an error, we know the provider does not support it.
For context length it's trickier: we could send a very large input to make sure we exceed any LLM's context, and even though the max context length is usually exposed in context-length errors, extracting it might be hard given that the error message varies between providers.
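A rough sketch of the FC probe; chatComplete stands in for whatever client abstraction ends up being used, so its exact signature here is an assumption:

```ts
// Hypothetical probe: declare a dummy tool and see whether the provider
// accepts the request.
async function probeNativeFunctionCalling(
  chatComplete: (options: {
    messages: Array<{ role: 'user'; content: string }>;
    tools?: Record<string, { description: string }>;
  }) => Promise<unknown>
): Promise<boolean> {
  try {
    await chatComplete({
      messages: [{ role: 'user', content: 'ping' }],
      tools: { noop: { description: 'does nothing, used only as a probe' } },
    });
    return true;
  } catch (error) {
    // treated as "no native function calling support"; a real implementation
    // would want to distinguish FC-related errors from transient failures
    // (timeouts, auth, etc.)
    return false;
  }
}
```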

It's also unclear whether that approach would be viable for other capabilities we might identify later.
