feat: add token completions endpoint #1565
base: main
Conversation
Walkthrough

Support for a new Dynamo Token Completion API has been added. This includes a new protocol, engine types, model management, and an HTTP endpoint for token-level completions. The implementation integrates with the existing service and model watcher, exposes new configuration options, and makes several traits and functions public for external use.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP_Service
    participant ModelManager
    participant TokenCompletionEngine
    Client->>HTTP_Service: POST /tokens/completion (DynamoTokenCompletionRequest)
    HTTP_Service->>ModelManager: get_token_completion_engine(model)
    ModelManager-->>HTTP_Service: TokenCompletionStreamingEngine
    HTTP_Service->>TokenCompletionEngine: generate(request)
    TokenCompletionEngine-->>HTTP_Service: Stream<CompletionResponse>
    HTTP_Service-->>Client: JSON CompletionResponse
```
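To make the flow above concrete, here is a hedged client-side sketch (not taken from the PR): it assumes `reqwest` (with the `json` feature), `tokio`, and `serde_json` as dependencies, takes the route from the diagram, and limits the body to the `model`/`token_ids` fields discussed in the review below plus an assumed `max_tokens`.

```rust
// Hedged client-side sketch of the round trip in the diagram above.
// The exact route prefix and any extra request fields should be checked
// against the PR; this only illustrates the shape of the call.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // `model` and `token_ids` come from the review discussion below;
    // `max_tokens` mirrors the curl example later in the thread and is assumed.
    let body = json!({
        "model": "Qwen/Qwen3-0.6B",
        "token_ids": [785, 1850, 1616, 311, 8180, 264, 7795, 28041, 374],
        "max_tokens": 30
    });

    let response = reqwest::Client::new()
        .post("http://localhost:8080/tokens/completion")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;

    // Per the diagram, the service answers with a JSON CompletionResponse.
    println!("{response}");
    Ok(())
}
```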
Actionable comments posted: 3
🔭 Outside diff range comments (1)
lib/llm/src/discovery/model_manager.rs (1)
68-72: `has_model_any` forgets token-completion models

`has_model_any` returns `true` for chat/completions engines only. Calls that rely on it (e.g. health probes, routing decisions) will incorrectly report that a token-completion model is missing.

```diff
- self.chat_completion_engines.read().unwrap().contains(model)
-     || self.completion_engines.read().unwrap().contains(model)
+ self.chat_completion_engines.read().unwrap().contains(model)
+     || self.completion_engines.read().unwrap().contains(model)
+     || self.token_completion_engines.read().unwrap().contains(model)
```
🧹 Nitpick comments (8)
lib/llm/src/protocols/openai.rs (1)

135-155: Visibility bump – document in the public changelog

Making these traits `pub` is fine, but it's a semver-visible surface change. Consider adding release notes explaining the new public API so external users know it's supported.

lib/llm/src/http/service/service_v2.rs (1)

97-99: Missing doc comment for new config flag

`enable_dynamo_endpoints` lacks a short doc comment describing what "Dynamo endpoints" are and why one might disable them.
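For illustration, one possible wording (a sketch only; `HttpServiceConfig` is a placeholder name for whatever struct actually holds the flag):

```rust
pub struct HttpServiceConfig {
    /// When true, expose the Dynamo-specific routes (such as the token-completion
    /// endpoint) alongside the OpenAI-compatible ones; set to false to serve only
    /// the standard OpenAI-compatible surface.
    pub enable_dynamo_endpoints: bool,
    // ... other service options elided
}
```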
lib/llm/src/discovery/watcher.rs (2)

284-327: Pipeline construction duplicated – extract helper

The token-completion engine setup repeats ~40 lines already present for chat & text completions. Factoring out a helper like `build_backend_pipeline<F, R>(...)` would:

- remove copy-paste risk
- centralise future fixes (router-mode branches, KV chooser, etc.)

No functional bug, but maintainability will suffer as each pipeline evolves; a rough sketch of the idea is shown below.
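This is purely illustrative (the real client, router, and engine types live in the watcher module and are not reproduced here); it only shows the shape of a shared constructor that each request family could call with its own router closure:

```rust
use std::sync::Arc;

// Stand-ins for the watcher's real types, used only to make the sketch compile.
struct Client;
struct Pipeline<Req, Resp> {
    _marker: std::marker::PhantomData<(Req, Resp)>,
}

/// Shared wiring (frontend, preprocessor, backend hookup) would live here, so
/// only the per-request-type router construction differs between callers.
fn build_backend_pipeline<Req, Resp, F>(client: Arc<Client>, make_router: F) -> Pipeline<Req, Resp>
where
    F: FnOnce(Arc<Client>) -> Pipeline<Req, Resp>,
{
    make_router(client)
}
```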
250-252: Unnecessary clone?

`client.clone()` is used for the completions pipeline, but the original `client` is later moved into the token-completion router, making the first clone redundant. Dropping the clone avoids one cheap but unnecessary Arc refcount bump.
lib/llm/src/types.rs (1)

20-34: Avoid `super::*` – keep the public surface explicit

Using `use super::*;` inside the nested module brings every item from the parent scope into the module, making it easy to shadow names accidentally and hiding the real dependencies. Prefer importing only what the module actually needs (e.g. `Annotated`, `TokenIdType`, `protocols`) for clearer intent and faster compile times.
lib/llm/src/http/service/openai.rs (2)

389-392: Restrict visibility to `pub(crate)`

`check_ready` is now `pub`, but it is still consumed only inside the crate. Exposing it publicly enlarges the API surface and makes breaking changes harder. Change to `pub(crate)` (or keep it private) unless there is a concrete cross-crate consumer.
512-515: Cache the env-var lookup

`process_event_converter` calls `std::env::var("DYN_RICH_EVENT_STREAM")` for every chunk. This shows up on the hot path when streaming many tokens. Store the boolean once:

```rust
static RICH_STREAM: once_cell::sync::Lazy<bool> =
    once_cell::sync::Lazy::new(|| std::env::var("DYN_RICH_EVENT_STREAM").is_ok());
```

and replace the runtime check with:

```rust
if !*RICH_STREAM && annotated.event.as_deref() == Some(ANNOTATION_LLM_METRICS) { ... }
```
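If pulling in `once_cell` is undesirable, a minimal sketch using only the standard library (Rust 1.70+) achieves the same one-time lookup:

```rust
use std::sync::OnceLock;

/// Reads DYN_RICH_EVENT_STREAM once and caches the result for every later chunk.
fn rich_stream_enabled() -> bool {
    static RICH_STREAM: OnceLock<bool> = OnceLock::new();
    *RICH_STREAM.get_or_init(|| std::env::var("DYN_RICH_EVENT_STREAM").is_ok())
}
```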
lib/llm/src/protocols/token_completions.rs (1)

140-147: Validate `model` field as well

While `token_ids` is checked for emptiness, an empty `model` string would propagate downstream and surface as a less actionable backend error. Add an early guard:

```rust
if request.model.trim().is_empty() {
    return Err(anyhow::anyhow!("model cannot be empty"));
}
```
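Putting the two checks side by side, a hedged sketch of the early validation (the field names follow the review comments above; the real request type, token id type, and error handling may differ):

```rust
/// Hypothetical helper: rejects requests with no token IDs or a blank model name.
fn validate_token_completion_request(model: &str, token_ids: &[u32]) -> anyhow::Result<()> {
    if token_ids.is_empty() {
        anyhow::bail!("token_ids cannot be empty");
    }
    if model.trim().is_empty() {
        anyhow::bail!("model cannot be empty");
    }
    Ok(())
}
```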
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- lib/llm/src/discovery/model_manager.rs (7 hunks)
- lib/llm/src/discovery/watcher.rs (4 hunks)
- lib/llm/src/http/service.rs (1 hunks)
- lib/llm/src/http/service/openai.rs (4 hunks)
- lib/llm/src/http/service/service_v2.rs (2 hunks)
- lib/llm/src/http/service/token_completions.rs (1 hunks)
- lib/llm/src/protocols.rs (1 hunks)
- lib/llm/src/protocols/openai.rs (2 hunks)
- lib/llm/src/protocols/token_completions.rs (1 hunks)
- lib/llm/src/types.rs (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/src/protocols/token_completions.rs
[error] 193-193: Clippy lint `useless_conversion`: useless conversion to the same type `anyhow::Error`. Consider removing `Error::from()` and use `e` directly.
🪛 GitHub Actions: NVIDIA Test Github Validation
lib/llm/src/protocols/token_completions.rs
[error] 193-193: Clippy lint `useless_conversion`: useless conversion to the same type `anyhow::Error`. Consider removing `Error::from()` and use `e` directly. (clippy::useless-conversion)
🔇 Additional comments (4)
lib/llm/src/protocols.rs (1)
29-29: Module exposed correctly – no further action required

The public declaration neatly wires the new protocol into the crate.
lib/llm/src/http/service.rs (1)
22-22: Router module added – looks good

`mod token_completions;` cleanly registers the new HTTP routes.

lib/llm/src/http/service/service_v2.rs (1)

196-201: Route injection mirrors existing pattern – good

Conditional merge follows the same pattern as other endpoint families; no concerns.
lib/llm/src/discovery/watcher.rs (1)
157-158: Ensure `remove_token_completion_model` is implemented thread-safe

Confirm that the new `ModelManager` removal method acquires the same write-lock strategy as other removals to avoid dead-reader races.
```rust
let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
    "model".to_string(),
    Default::default(),
);
```
🛠️ Refactor suggestion
Preserve the requested model name in the response stream
`DeltaGenerator::new("model".to_string(), …)` hard-codes the model id, losing the real `request.model`. Downstream consumers (metrics, client libraries) will mis-identify the source model.
```diff
- let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
-     "model".to_string(),
-     Default::default(),
- );
+ let model_name = token_request.model.clone();
+ let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
+     model_name,
+     Default::default(),
+ );
```
Clone before the move so `token_request` ownership is not lost.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let model_name = token_request.model.clone();
let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
    model_name,
    Default::default(),
);
```
🤖 Prompt for AI Agents
In lib/llm/src/protocols/token_completions.rs around lines 200 to 203, the model
name is hard-coded as "model" when creating DeltaGenerator, which causes loss of
the actual requested model name from token_request. Fix this by cloning the real
model name from token_request before passing it to DeltaGenerator::new,
preserving the original ownership and ensuring downstream consumers receive the
correct model identifier.
```diff
@@ -114,6 +122,15 @@ impl ModelManager {
         clients.add(model, engine)
     }
 
+    pub fn add_token_completion_model(
```
Not specific to this MR, but all these functions seem brutally repetitive, especially if we plan on adding more.
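One way that repetition could be collapsed is a single generic helper shared by every `add_*_model` variant. This is a hypothetical sketch only; the real `ModelManager` fields, registry type, and error type may differ:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Stand-in for the per-engine-kind registry held by ModelManager.
struct ModelEngines<E> {
    engines: HashMap<String, E>,
}

impl<E> ModelEngines<E> {
    fn add(&mut self, model: &str, engine: E) -> Result<(), String> {
        if self.engines.contains_key(model) {
            return Err(format!("model '{model}' is already registered"));
        }
        self.engines.insert(model.to_string(), engine);
        Ok(())
    }
}

// One generic entry point instead of a hand-written add_*_model per engine kind.
fn add_model<E>(registry: &RwLock<ModelEngines<E>>, model: &str, engine: E) -> Result<(), String> {
    registry.write().unwrap().add(model, engine)
}
```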
Left some comments on possibly re-using completions endpoint
High level question - can we just re-use the existing completions endpoint for this functionality?
Rough understanding so far:
- `CreateCompletionRequest` already accepts the prompt being an array of token IDs, and so does the Rust type we're using
- We're already returning `CompletionResponse`s from this new endpoint
- So why not re-use `CreateCompletionRequest`?
Thoughts around using existing completions endpoint:
- Can we just skip tokenization if we detect the prompt is a vec and not a string?
- Similar idea to how `use_raw_prompt` is used in the code to skip the chat template
- We may just be able to detect if `Prompt` is `Prompt::IntegerArray` or `Prompt::ArrayOfIntegerArray` and skip the `tokenizer.encode()` step if so
- Then we can just directly set `builder.token_ids` in the preprocessor from `request.prompt` here
Other relevant spots in the code:
- Currently it seems like the tokens input might get converted to a string with `prompt_to_string` around here

Backup idea if the above won't work for some reason:
- If there are issues with the above and we actually need/want a separate engine, what about creating both engines and dynamically selecting the "string" engine or the "tokens" engine in the completions endpoint here?
ex: Rough idea if re-using the same engine and just dynamically skipping tokenization / template if the prompt is already a token ID array:

```rust
// preprocessor.rs (pseudocode)
if is_token_ids_prompt(request.prompt) {
    token_ids = request.prompt;
} else {
    token_ids = tokenizer.encode(request.prompt)
}
// ...
builder.token_ids = token_ids;
```
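Expanding that pseudocode slightly, a hedged sketch of how the branch might look if it matched directly on the `Prompt` variants mentioned above; `tokenizer`, `builder`, the token id type, and the error handling are all placeholders rather than the real preprocessor API:

```rust
// Sketch only: skip tokenization when the prompt already carries token IDs.
let token_ids = match &request.prompt {
    Prompt::IntegerArray(ids) => ids.clone(),
    // A batched prompt would normally fan out into one request per inner array;
    // only the first is shown here for brevity.
    Prompt::ArrayOfIntegerArray(batches) => batches.first().cloned().unwrap_or_default(),
    Prompt::String(text) => tokenizer.encode(text)?,
    Prompt::StringArray(texts) => tokenizer.encode(&texts.join("\n"))?,
};
builder.token_ids = token_ids;
```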
We can try this on the client side today, but it doesn't work as expected because it seems to cast the token IDs to a string:
```console
root@ced35d0-lcedt:/workspace# curl localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "prompt": [785, 1850, 1616, 311, 8180, 264, 7795, 28041, 374],
  "stream": false,
  "max_tokens": 30
}' | jq
{
  "id": "cmpl-ec552f06-914d-4e3f-ae0e-01c7e94708f6",
  "choices": [
    {
      "text": "<think>\nOkay, let's see. I need to figure out the pattern in the sequence: 785 1850 ",
      "index": 0,
      "finish_reason": null
    }
  ],
  "created": 1750270685,
  "model": "Qwen/Qwen3-0.6B",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 29,
    "prompt_tokens": 49,
    "total_tokens": 0
  }
}
```
Hopefully this would reduce the necessary changes from ~500 lines of code to <50 and "fix" support for accepting a prompt of type `IntegerArray`/`ArrayOfIntegerArray` as defined in the OpenAI completions spec.
Discussed this offline. I think this is the way to go. Will work on a refactor
Summary by CodeRabbit

New Features
- Dynamo Token Completion API: new protocol, engine types, model management, and an HTTP endpoint for token-level completions.

Improvements
- New configuration options for the HTTP service; several traits and functions made public for external use.