feat: add token completions endpoint #1565

Open · wants to merge 49 commits into base: main

Conversation

@ishandhanani (Contributor) commented Jun 17, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a new token-level completion API, allowing direct token-based LLM inference with support for streaming and advanced generation parameters.
    • Added new HTTP endpoints for token completions, enabled by default in the service configuration.
    • Expanded model management to support token completion engines, including listing, adding, and removing models.
  • Improvements

    • Enhanced protocol and type definitions to support token completion requests and responses.
    • Made additional traits and functions public for broader integration and extensibility.

coderabbitai bot commented Jun 17, 2025

Walkthrough

Support for a new Dynamo Token Completion API has been added. This includes a new protocol, engine types, model management, and an HTTP endpoint for token-level completions. The implementation integrates with the existing service and model watcher, exposes new configuration options, and makes several traits and functions public for external use.

Changes

  • lib/llm/src/discovery/model_manager.rs: Extended ModelManager to manage token completion engines; added related methods.
  • lib/llm/src/discovery/watcher.rs: Added logic to handle token completion models in model lifecycle management.
  • lib/llm/src/http/service.rs: Added module declaration for token_completions.
  • lib/llm/src/http/service/openai.rs: Made several functions and a struct public; adjusted event filtering based on an environment variable.
  • lib/llm/src/http/service/service_v2.rs: Added enable_dynamo_endpoints config; conditionally registers the token completions endpoint.
  • lib/llm/src/http/service/token_completions.rs: New module: implements the token completions HTTP endpoint and router.
  • lib/llm/src/protocols.rs: Added public module declaration for token_completions.
  • lib/llm/src/protocols/openai.rs: Made traits OpenAISamplingOptionsProvider and OpenAIStopConditionsProvider public.
  • lib/llm/src/protocols/token_completions.rs: New module: defines the Dynamo token completions protocol, request struct, operator, and conversions.
  • lib/llm/src/types.rs: Added token_completions module; exports request/response types and engine aliases.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HTTP_Service
    participant ModelManager
    participant TokenCompletionEngine

    Client->>HTTP_Service: POST /tokens/completion (DynamoTokenCompletionRequest)
    HTTP_Service->>ModelManager: get_token_completion_engine(model)
    ModelManager-->>HTTP_Service: TokenCompletionStreamingEngine
    HTTP_Service->>TokenCompletionEngine: generate(request)
    TokenCompletionEngine-->>HTTP_Service: Stream<CompletionResponse>
    HTTP_Service-->>Client: JSON CompletionResponse
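For orientation, here is a minimal sketch of what the request body posted to the token completions route might look like. Only model and token_ids are confirmed by the review comments below; the remaining fields, the derive set, and the struct shape are illustrative assumptions, not the PR's actual definition.

use serde::{Deserialize, Serialize};

// Hypothetical sketch of the token completion request; field names beyond
// `model` and `token_ids` are assumptions for illustration only.
#[derive(Debug, Serialize, Deserialize)]
pub struct DynamoTokenCompletionRequest {
    /// Model to run the token-level completion against.
    pub model: String,
    /// Prompt supplied directly as token IDs, bypassing tokenization.
    pub token_ids: Vec<u32>,
    /// Stream responses instead of returning a single JSON body (assumed).
    #[serde(default)]
    pub stream: bool,
    /// Cap on the number of generated tokens (assumed).
    pub max_tokens: Option<u32>,
}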

Poem

In the warren, code hops anew,
Token completions now stream right through!
Engines managed, requests fly—
Dynamo rabbits reaching high.
With endpoints fresh and traits set free,
The meadow grows for all to see.
🐇✨


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
lib/llm/src/discovery/model_manager.rs (1)

68-72: has_model_any forgets token-completion models

has_model_any returns true for chat/completions engines only. Calls that rely on it (e.g. health probes, routing decisions) will incorrectly report that a token-completion model is missing.

-        self.chat_completion_engines.read().unwrap().contains(model)
-            || self.completion_engines.read().unwrap().contains(model)
+        self.chat_completion_engines.read().unwrap().contains(model)
+            || self.completion_engines.read().unwrap().contains(model)
+            || self.token_completion_engines.read().unwrap().contains(model)
🧹 Nitpick comments (8)
lib/llm/src/protocols/openai.rs (1)

135-155: Visibility bump – document in the public changelog

Making these traits pub is fine, but it’s a semver-visible surface change.
Consider adding release notes explaining the new public API so external users know it is supported.

lib/llm/src/http/service/service_v2.rs (1)

97-99: Missing doc comment for new config flag

enable_dynamo_endpoints lacks a short doc comment describing what “Dynamo endpoints” are and why one might disable them.
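A sketch of the kind of doc comment meant here; HttpServiceConfig is a placeholder name, not necessarily the real struct:

pub struct HttpServiceConfig {
    /// Register Dynamo-specific routes (e.g. the token completions endpoint)
    /// in addition to the OpenAI-compatible ones. Defaults to true; disable
    /// to expose only the standard OpenAI-style API surface.
    pub enable_dynamo_endpoints: bool,
}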

lib/llm/src/discovery/watcher.rs (2)

284-327: Pipeline construction duplicated – extract helper

The token-completion engine setup repeats ~40 lines already present for chat & text completions.
Factoring a helper like build_backend_pipeline<F, R>(...) would:

• remove copy-paste risk
• centralise future fixes (router-mode branches, KV chooser, etc.)

No functional bug, but maintainability will suffer as each pipeline evolves.
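One possible shape for such a helper, with every crate-specific type elided; this is a sketch of the factoring idea, not the crate's actual pipeline API:

// Shared pipeline scaffolding lives here once; each engine family only
// supplies the closure that builds its specific operator.
fn build_backend_pipeline<C, E, F>(client: C, make_engine: F) -> E
where
    F: FnOnce(C) -> E,
{
    // Router-mode branches, KV chooser wiring, etc. would be handled here
    // so the chat, text, and token pipelines stop duplicating them.
    make_engine(client)
}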


250-252: Unnecessary clone?

client.clone() is used for the completions pipeline, but the original client is later moved into the token-completion router, making the first clone redundant. Dropping it avoids one cheap but unnecessary Arc refcount bump.

lib/llm/src/types.rs (1)

20-34: Avoid super::* – keep the public surface explicit

Using use super::*; inside the nested module brings every item from the parent scope into the module, making it easy to shadow names accidentally and hiding the real dependencies. Prefer importing only what the module actually needs (e.g. Annotated, TokenIdType, protocols) for clearer intent and faster compile times.
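A small self-contained illustration of the point; the module layout and items are invented for the example, with names borrowed from the ones mentioned above:

mod parent {
    pub struct Annotated;
    pub type TokenIdType = u32;
    pub struct Unrelated; // a glob import would silently pull this in too

    pub mod token_completions {
        // Explicit imports document the module's real dependencies and avoid
        // accidental shadowing, unlike `use super::*;`.
        use super::{Annotated, TokenIdType};

        pub fn first_token(_marker: &Annotated, ids: &[TokenIdType]) -> Option<TokenIdType> {
            ids.first().copied()
        }
    }
}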

lib/llm/src/http/service/openai.rs (2)

389-392: Restrict visibility to pub(crate)

check_ready is now pub, but it is still consumed only inside the crate. Exposing it publicly enlarges the API surface and makes breaking-changes harder. Change to pub(crate) (or keep it private) unless there is a concrete cross-crate consumer.


512-515: Cache the env-var lookup

process_event_converter calls std::env::var("DYN_RICH_EVENT_STREAM") for every chunk. This shows up on the hot path when streaming many tokens.

Store the boolean once:

static RICH_STREAM: once_cell::sync::Lazy<bool> =
    once_cell::sync::Lazy::new(|| std::env::var("DYN_RICH_EVENT_STREAM").is_ok());

and replace the runtime check with:

if !*RICH_STREAM && annotated.event.as_deref() == Some(ANNOTATION_LLM_METRICS) { ... }

lib/llm/src/protocols/token_completions.rs (1)

140-147: Validate model field as well

While token_ids is checked for emptiness, an empty model string would propagate downstream and surface as a less actionable backend error. Add an early guard:

if request.model.trim().is_empty() {
    return Err(anyhow::anyhow!("model cannot be empty"));
}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbf1ffd and a7bb860.

📒 Files selected for processing (10)
  • lib/llm/src/discovery/model_manager.rs (7 hunks)
  • lib/llm/src/discovery/watcher.rs (4 hunks)
  • lib/llm/src/http/service.rs (1 hunks)
  • lib/llm/src/http/service/openai.rs (4 hunks)
  • lib/llm/src/http/service/service_v2.rs (2 hunks)
  • lib/llm/src/http/service/token_completions.rs (1 hunks)
  • lib/llm/src/protocols.rs (1 hunks)
  • lib/llm/src/protocols/openai.rs (2 hunks)
  • lib/llm/src/protocols/token_completions.rs (1 hunks)
  • lib/llm/src/types.rs (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Rust pre-merge checks
lib/llm/src/protocols/token_completions.rs

[error] 193-193: Clippy lint 'useless_conversion': useless conversion to the same type anyhow::Error. Consider removing Error::from() and use e directly.

🪛 GitHub Actions: NVIDIA Test Github Validation
lib/llm/src/protocols/token_completions.rs

[error] 193-193: Clippy lint 'useless_conversion': useless conversion to the same type anyhow::Error. Consider removing Error::from() and use e directly. (clippy::useless-conversion)

🔇 Additional comments (4)
lib/llm/src/protocols.rs (1)

29-29: Module exposed correctly – no further action required

The public declaration neatly wires the new protocol into the crate.

lib/llm/src/http/service.rs (1)

22-22: Router module added – looks good

mod token_completions; cleanly registers the new HTTP routes.

lib/llm/src/http/service/service_v2.rs (1)

196-201: Route injection mirrors existing pattern – good

Conditional merge follows the same pattern as other endpoint families; no concerns.

lib/llm/src/discovery/watcher.rs (1)

157-158: Ensure remove_token_completion_model is thread-safe

Confirm that the new ModelManager removal method uses the same write-lock strategy as the other removals, so concurrent readers never observe a half-removed entry.
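For reference, a minimal sketch of the locking pattern in question; the field name and engine type are placeholders, not the crate's real definitions:

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct ModelManager {
    // Placeholder value type; the real map stores the engine handles.
    token_completion_engines: RwLock<HashMap<String, Arc<()>>>,
}

impl ModelManager {
    fn remove_token_completion_model(&self, model: &str) -> bool {
        // Hold the write lock for the whole mutation, mirroring the other
        // remove_* methods, so readers never see a half-updated map.
        self.token_completion_engines
            .write()
            .unwrap()
            .remove(model)
            .is_some()
    }
}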

Comment on lines +200 to +203
let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
"model".to_string(),
Default::default(),
);

🛠️ Refactor suggestion

Preserve the requested model name in the response stream

DeltaGenerator::new("model".to_string(), …) hard-codes the model id, losing the real request.model. Downstream consumers (metrics, client libraries) will misidentify the source model.

-        let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
-            "model".to_string(),
-            Default::default(),
-        );
+        let model_name = token_request.model.clone();
+        let mut delta_generator = crate::protocols::openai::completions::DeltaGenerator::new(
+            model_name,
+            Default::default(),
+        );

Clone before the move so token_request ownership is not lost.

🤖 Prompt for AI Agents
In lib/llm/src/protocols/token_completions.rs around lines 200 to 203, the model
name is hard-coded as "model" when creating DeltaGenerator, which causes loss of
the actual requested model name from token_request. Fix this by cloning the real
model name from token_request before passing it to DeltaGenerator::new,
preserving the original ownership and ensuring downstream consumers receive the
correct model identifier.

@@ -114,6 +122,15 @@ impl ModelManager {
clients.add(model, engine)
}

pub fn add_token_completion_model(
A contributor commented:

Not specific to this MR, but all these functions seem brutally repetitive, especially if we plan on adding more.
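One way that repetition could eventually be collapsed, sketched with placeholder types; the real ModelManager methods differ in signatures and error handling:

use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical generic registry: add/get/remove/list written once and reused
// for chat, completion, and token-completion engines alike.
struct EngineRegistry<E> {
    engines: RwLock<HashMap<String, E>>,
}

impl<E: Clone> EngineRegistry<E> {
    fn add(&self, model: &str, engine: E) -> bool {
        self.engines
            .write()
            .unwrap()
            .insert(model.to_string(), engine)
            .is_none()
    }

    fn get(&self, model: &str) -> Option<E> {
        self.engines.read().unwrap().get(model).cloned()
    }

    fn remove(&self, model: &str) -> bool {
        self.engines.write().unwrap().remove(model).is_some()
    }

    fn list(&self) -> Vec<String> {
        self.engines.read().unwrap().keys().cloned().collect()
    }
}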

@rmccorm4 (Contributor) left a comment

Left some comments on possibly re-using completions endpoint

Review comment (Contributor):

High-level question: can we just re-use the existing completions endpoint for this functionality?

Rough understanding so far:

Thoughts around using existing completions endpoint:

  • Can we just skip tokenization if we detect prompt is vec and not string?
    • Similar idea to how use_raw_prompt is used in the code to skip chat template
    • We may just be able to detect if Prompt is Prompt::IntegerArray or Prompt::ArrayOfIntegerArray and skip tokenizer.encode() step if so
    • Then we can just directly set builder.token_ids in preprocessor from request.prompt here

Other relevant spots in the code:

Backup Idea if above won't work for some reason:

  • If there are issues with above and we actually need/want a separate engine, what about creating both engines, and dynamically selecting "string" engine or "tokens" engine in completions endpoint here?

Review comment (Contributor):

ex: Rough idea if re-using same engine and just dynamically skipping tokenization / template if prompt is already token ID array:

// preprocessor.rs
// Skip tokenization when the prompt already arrives as token IDs.
let token_ids = if is_token_ids_prompt(&request.prompt) {
    request.prompt
} else {
    tokenizer.encode(&request.prompt)
};

...
builder.token_ids = token_ids;
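And a sketch of the is_token_ids_prompt check used above; the enum here is a local stand-in for the Prompt type from the OpenAI completions spec types, which has the IntegerArray / ArrayOfIntegerArray variants named earlier:

#[allow(dead_code)]
enum Prompt {
    String(String),
    StringArray(Vec<String>),
    IntegerArray(Vec<u32>),
    ArrayOfIntegerArray(Vec<Vec<u32>>),
}

fn is_token_ids_prompt(prompt: &Prompt) -> bool {
    matches!(
        prompt,
        Prompt::IntegerArray(_) | Prompt::ArrayOfIntegerArray(_)
    )
}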

Can try this on client-side today, but it doesn't work as expected because it seems like it casts token ids to a string:

root@ced35d0-lcedt:/workspace# curl localhost:8080/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": [785, 1850, 1616, 311, 8180, 264, 7795, 28041, 374],
    "stream": false,
    "max_tokens": 30 
  }' | jq

{
  "id": "cmpl-ec552f06-914d-4e3f-ae0e-01c7e94708f6",
  "choices": [
    {
      "text": "<think>\nOkay, let's see. I need to figure out the pattern in the sequence: 785 1850 ",
      "index": 0,
      "finish_reason": null
    }
  ],
  "created": 1750270685,
  "model": "Qwen/Qwen3-0.6B",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 29,
    "prompt_tokens": 49,
    "total_tokens": 0
  }
}

@rmccorm4 (Contributor) commented Jun 18, 2025

Hopefully this would reduce necessary changes from ~500 lines of code to <50 and "fix" support for accepting prompt of type IntegerArray/ArrayOfIntegerArray as defined in openai completions spec.

@ishandhanani (Contributor, Author) commented

Discussed this offline. I think this is the way to go. Will work on a refactor
