Adds custom inference service API docs #4852
/**
 * Specifies the JSON parser that is used to parse the response from the custom service.
 * Different task types require different json_parser parameters.
 * For example:
@jonathan-buttner Do you think we should specify a JsonParser class for each task type, or is this list sufficient?
Hmm I think it might be better if we give an example of the response structure for each task type and explain how to create the parser from that.
We should also say that the format is a less featured version of JSONPath: https://en.wikipedia.org/wiki/JSONPath
Here are some examples:
Text Embeddings
For a response that looks like:
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [
0.014539449,
-0.015288644
]
}
],
"model": "text-embedding-ada-002-v2",
"usage": {
"prompt_tokens": 8,
"total_tokens": 8
}
}
We'd need this definition:
"response": {
"json_parser": {
"text_embeddings": "$.data[*].embedding[*]"
}
}
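The extraction a path like `$.data[*].embedding[*]` performs can be sketched in Python. This is a hypothetical illustration only: Elasticsearch's actual parser is an internal, reduced JSONPath implementation, and the `extract` helper below is an assumption for demonstration, not its API.

```python
def extract(doc, path):
    """Follow a dotted path; a '[*]' suffix iterates over the array at that key."""
    results = [doc]
    for part in path.lstrip("$.").split("."):
        iterate = part.endswith("[*]")
        key = part[:-3] if iterate else part
        next_results = []
        for node in results:
            value = node[key]
            if iterate:
                # '[*]' flattens every element of the array into the result set
                next_results.extend(value)
            else:
                next_results.append(value)
        results = next_results
    return results

# Sample response shaped like the one above
response = {
    "object": "list",
    "data": [{"object": "embedding", "index": 0,
              "embedding": [0.014539449, -0.015288644]}],
}
print(extract(response, "$.data[*].embedding[*]"))
# [0.014539449, -0.015288644]
```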
Rerank
For a response that looks like:
{
"results": [
{
"index": 3,
"relevance_score": 0.999071,
"document": "abc"
},
{
"index": 4,
"relevance_score": 0.7867867,
"document": "123"
},
{
"index": 0,
"relevance_score": 0.32713068,
"document": "super"
}
]
}
We'd need this definition:
"response": {
"json_parser": {
"reranked_index":"$.results[*].index",
"relevance_score":"$.results[*].relevance_score",
"document_text":"$.results[*].document"
}
}
`reranked_index` and `document_text` are optional.
Completion
For a response that looks like:
{
"id": "chatcmpl-B9MBs8CjcvOU2jLn4n570S5qMJKcT",
"object": "chat.completion",
"created": 1741569952,
"model": "gpt-4.1-2025-04-14",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?",
"refusal": null,
"annotations": []
},
"logprobs": null,
"finish_reason": "stop"
}
]
}
We'd need this definition:
"response": {
"json_parser": {
"completion_result":"$.choices[*].message.content"
}
}
Sparse embedding
For a response that looks like:
{
"request_id": "75C50B5B-E79E-4930-****-F48DBB392231",
"latency": 22,
"usage": {
"token_count": 11
},
"result": {
"sparse_embeddings": [
{
"index": 0,
"embedding": [
{
"token_id": 6,
"weight": 0.101
},
{
"token_id": 163040,
"weight": 0.28417
}
]
}
]
}
}
We'd need this definition:
"response": {
"json_parser": {
"token_path": "$.result.sparse_embeddings[*].embedding[*].token_id",
"weight_path": "$.result.sparse_embeddings[*].embedding[*].weight"
}
}
If the `token_path` resulting value (`token_id` in this example) refers to a non-string (an integer in this example), it'll be converted to a string using Java's `.toString()` method. Not sure how we want to articulate that though 🤔
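The pairing of `token_path` and `weight_path` results, including the string conversion of non-string token ids, can be sketched like this (a hypothetical Python illustration of the behavior described, not the actual Java implementation):

```python
# Sample payload shaped like the sparse embedding response above
result = {
    "sparse_embeddings": [
        {"index": 0,
         "embedding": [{"token_id": 6, "weight": 0.101},
                       {"token_id": 163040, "weight": 0.28417}]}
    ]
}

# Values selected by token_path and weight_path respectively
tokens = [e["token_id"] for se in result["sparse_embeddings"] for e in se["embedding"]]
weights = [e["weight"] for se in result["sparse_embeddings"] for e in se["embedding"]]

# Non-string token ids are stringified, mirroring Java's .toString()
sparse_vector = {str(t): w for t, w in zip(tokens, weights)}
print(sparse_vector)
# {'6': 0.101, '163040': 0.28417}
```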
}

export enum CustomServiceType {
  custom
@jonathan-buttner Should the ServiceType be `custom` whenever it's specified for this service type? Or can it be anything, for example `custom-model`?
Yeah it must be `custom`, just like to use openai it must be `openai`.
/**
 * Create a custom inference endpoint.
 *
 * You can create an inference endpoint to perform an inference task with a custom model that supports the HTTP format.
@jonathan-buttner Please suggest an alternative description if you think this is not sufficient. I tried to come up with something that is meaningful to me based on my limited knowledge.
Hmm maybe something like this:
The custom service gives more control over how to interact with external inference services that aren't explicitly supported through dedicated integrations. The custom service gives users the ability to define the headers, url, query parameters, request body, and secrets.
 * The chunking configuration object.
 * @ext_doc_id inference-chunking
 */
chunking_settings?: InferenceChunkingSettings
Are chunking settings relevant for this service?
Yep!
…ticsearch-specification into szabosteve/infer-put-custom
Following you can find the validation changes against the target branch for the APIs. No changes detected. You can validate these APIs yourself by using the
WIP (I'll update this comment with a bunch of examples). Here are some examples: OpenAI Text Embedding
Cohere APIv2 Rerank
Cohere APIv2 Text Embedding
Jina AI Rerank
Hugging Face Text Embedding for model Qwen/Qwen3-Embedding-8B (other will be very similar)
TODO
Great work! We'll want to add a blurb about how the custom service performs template replacement.
The template replacement functionality allows templates (portions of a string that start with `${` and end with `}`) to be replaced with the contents of a value that defines that key.
We look in `secret_parameters` and `task_settings` for keys to do template replacement.
We replace templates in the fields `request`, `headers`, `url`, and `query_parameters`.
If we fail to find the definition (key) for a template we emit an error.
So for example if we had the endpoint definition like this:
PUT _inference/text_embedding/test-text-embedding
{
"service": "custom",
"service_settings": {
"secret_parameters": {
"api_key": "<some api key>"
},
"url": "...endpoints.huggingface.cloud/v1/embeddings",
"headers": {
"Authorization": "Bearer ${api_key}",
"Content-Type": "application/json"
},
"request": "{\"input\": ${input}}",
"response": {
"json_parser": {
"text_embeddings":"$.data[*].embedding[*]"
}
}
}
}
We'll look to replace `${api_key}` from `secret_parameters` and `task_settings`. We should also make a note explicitly that the templates should not be surrounded by quotes (we add the quotes internally).
There are a few "special" templates:

- `${input}`: refers to the array of input strings that comes from the `input` field of the subsequent inference requests
- `${input_type}`: refers to the input type translation values (I explain this below)
- `${query}`: refers to the `query` field used specifically for rerank
- `${top_n}`: refers to the `top_n` field available when performing rerank requests
- `${return_documents}`: refers to the `return_documents` field available when performing rerank requests
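The lookup-and-replace behavior described above can be sketched in Python. This is a hypothetical illustration of the rules as stated (check `secret_parameters` first, then `task_settings`, error on an unknown key); the `replace_templates` function name and signature are assumptions, not the service's internals.

```python
import re

def replace_templates(text, secret_parameters, task_settings):
    """Replace ${key} templates from secret_parameters, then task_settings."""
    def lookup(match):
        key = match.group(1)
        if key in secret_parameters:
            return str(secret_parameters[key])
        if key in task_settings:
            return str(task_settings[key])
        # Missing definitions are an error rather than a silent no-op
        raise ValueError(f"No definition found for template ${{{key}}}")
    return re.sub(r"\$\{(\w+)\}", lookup, text)

header = replace_templates(
    "Bearer ${api_key}",
    secret_parameters={"api_key": "my-secret"},
    task_settings={},
)
print(header)
# Bearer my-secret
```

Note that the special templates like `${input}` are filled from the inference request itself rather than from these two maps.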
@@ -758,6 +758,136 @@ export class CohereTaskSettings {
  truncate?: CohereTruncateType
}

export class CustomServiceSettings {
We also support query parameters:
"query_parameters": [
["test_key", "test_value"]
]
It's a list of tuples (each inner array must contain exactly 2 items).
This would be invalid:
"query_parameters": [
["test_key", "test_value", "some_other_value"]
]
This is valid:
"query_parameters": [
["test_key", "test_value"],
["test_key", "some_other_value"]
]
 * For example:
 * ```
 * "request":{
 * "content":"{\"input\":${input}}"
We flattened this so it's `"request": "{\"input\":${input}}"` now. We removed `content`.
 * }
 * ```
 */
error_parser: UserDefinedValue
We can remove this field, we simplified the error handling logic and removed this.
 * The URL endpoint to use for the requests.
 */
url?: string
}
We also parse an optional `input_type` field. Here's an example:
PUT _inference/text_embedding/test-text-embedding
{
"service": "custom",
"service_settings": {
"secret_parameters": {
"api_key": "<api key>"
},
"url": "https://api.cohere.com/v2/embed",
"headers": {
"Authorization": "bearer ${api_key}",
"Content-Type": "application/json"
},
"request": "{\"texts\": ${input}, \"model\": \"embed-v4.0\", \"input_type\": ${input_type}}",
"response": {
"json_parser": {
"text_embeddings":"$.embeddings.float[*]"
}
},
"input_type": {
"translation": {
"ingest": "search_document",
"search": "search_query"
},
"default": "search_document"
}
}
}
`${input_type}` refers to the input type translation values:
"input_type": {
"translation": {
"ingest": "do_ingest",
"search": "do_search"
},
"default": "a_default"
},
If the subsequent inference requests come from a search context, we'll use the `search` key here and replace the template with `do_search`. If it comes from the `ingest` context, we'll use `do_ingest`. If it's a different context that we haven't specified, we'll fall back to the `default` provided. If no default is specified, we use an empty string.
The keys we allow in `translation` are:

- `classification`
- `clustering`
- `ingest`
- `search`
This is particularly useful for integrations like Cohere that allow an input type field in their API: https://docs.cohere.com/reference/embed#request.body.input_type
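The fallback logic for the translation lookup can be sketched like this (a hypothetical Python illustration of the rules as stated; the `translate_input_type` helper is an assumption, not the service's implementation):

```python
# The only context keys allowed in the translation map
ALLOWED_KEYS = {"classification", "clustering", "ingest", "search"}

def translate_input_type(input_type_config, context):
    """Return the translated value for a context, else the default, else ''."""
    translation = input_type_config.get("translation", {})
    unknown = set(translation) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"Unsupported translation keys: {sorted(unknown)}")
    if context in translation:
        return translation[context]
    # Unlisted contexts fall back to the default; empty string if none given
    return input_type_config.get("default", "")

config = {
    "translation": {"ingest": "search_document", "search": "search_query"},
    "default": "search_document",
}
print(translate_input_type(config, "search"))      # search_query
print(translate_input_type(config, "clustering"))  # search_document (default)
```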
Overview
Related issue: https://github.com/elastic/developer-docs-team/issues/307
This PR adds documentation about the custom inference service.
@jonathan-buttner Could you please provide an example request that I can add to the docs?