Conversation
Signed-off-by: Haoyu Sun <hasun@redhat.com>
Walkthrough

Adds token-counting instrumentation for LLM turns. Introduces update_llm_token_count_from_turn in metrics.utils. Wires token metrics into the query and streaming endpoints. Extends retrieve_response with a provider_id parameter and forwards it. Updates tests to mock metrics and adds a unit test for the new utility.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Client
participant API as QueryEndpoint
participant Core as retrieve_response
participant LLM as Provider
participant Metrics as metrics.utils
Client->>API: POST /query
API->>Core: retrieve_response(..., provider_id)
Core->>LLM: invoke(model_id, input)
LLM-->>Core: assistant output
Note over Core,Metrics: New: token counting for turn
Core->>Metrics: update_llm_token_count_from_turn(turn, model_label, provider_id, system_prompt)
Core-->>API: TurnSummary, conversation_id
API-->>Client: 200 OK
sequenceDiagram
autonumber
actor Client
participant API as StreamingEndpoint
participant Stream as StreamLoop
participant Metrics as metrics.utils
Client->>API: POST /streaming_query (SSE)
API->>Stream: start stream loop
loop tokens/events
Stream-->>Client: events...
end
Note over Stream,Metrics: New: on turn_complete
Stream->>Metrics: update_llm_token_count_from_turn(turn, model_id, provider_id, system_prompt)
Stream-->>Client: stream end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🧹 Nitpick comments (5)
tests/unit/app/endpoints/test_streaming_query.py (1)
357-358: Reduce repetition by auto-mocking metrics for this module.
Consider a module-scoped autouse fixture so future tests don’t forget to patch metrics. Apply:

+@pytest.fixture(autouse=True)
+def _mock_metrics_autouse(mocker):
+    mocker.patch(
+        "app.endpoints.streaming_query.update_llm_token_count_from_turn",
+        return_value=None,
+    )

Also applies to: 364-365
tests/unit/app/endpoints/test_query.py (1)
454-454: Avoid repetitive calls to mock_metrics across tests.
Use an autouse fixture to patch once per module and prevent omissions. Apply:

@@
-def mock_metrics(mocker):
-    """Helper function to mock metrics operations for query endpoints."""
-    mocker.patch(
-        "app.endpoints.query.update_llm_token_count_from_turn",
-        return_value=None,
-    )
+@pytest.fixture(autouse=True)
+def _mock_metrics_autouse(mocker):
+    """Auto-mock metrics for all tests in this module."""
+    mocker.patch("app.endpoints.query.update_llm_token_count_from_turn", return_value=None)

Then remove individual mock_metrics(...) calls in tests.
Also applies to: 486-486, 519-519, 559-559, 609-609, 662-662, 716-716, 774-774, 829-829, 885-885, 955-955, 1016-1016, 1351-1351, 1402-1402
src/app/endpoints/streaming_query.py (1)
625-631: Token metrics on turn_complete: OK; minor nit on recomputing system_prompt.
You recompute system_prompt inside the loop; it’s constant per request. Compute once before the async for and close over it.

Example (outside the generator’s loop):

    system_prompt = get_system_prompt(query_request, configuration)
    async for chunk in turn_response:
        ...

tests/unit/metrics/test_utis.py (1)
1-1: Nit: filename typo.
Consider renaming test_utis.py → test_utils.py for clarity.

src/app/endpoints/query.py (1)
392-400: Make provider_id default safer and explicit.
Avoid empty-string labels in metrics; prefer a sentinel and validate early.
Apply:
-async def retrieve_response(  # pylint: disable=too-many-locals,too-many-branches,too-many-arguments
+async def retrieve_response(  # pylint: disable=too-many-locals,too-many-branches,too-many-arguments
     client: AsyncLlamaStackClient,
     model_id: str,
     query_request: QueryRequest,
     token: str,
     mcp_headers: dict[str, dict[str, str]] | None = None,
     *,
-    provider_id: str = "",
+    provider_id: str = "unknown",
 ) -> tuple[TurnSummary, str]:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- src/app/endpoints/query.py (5 hunks)
- src/app/endpoints/streaming_query.py (2 hunks)
- src/metrics/utils.py (2 hunks)
- tests/unit/app/endpoints/test_query.py (16 hunks)
- tests/unit/app/endpoints/test_streaming_query.py (2 hunks)
- tests/unit/metrics/test_utis.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/metrics/utils.py (2)
- src/client.py (1): AsyncLlamaStackClientHolder (18-50)
- src/configuration.py (1): configuration (60-64)

tests/unit/app/endpoints/test_streaming_query.py (1)
- tests/unit/app/endpoints/test_query.py (1): mock_metrics (49-54)

src/app/endpoints/streaming_query.py (3)
- src/metrics/utils.py (1): update_llm_token_count_from_turn (60-77)
- src/utils/endpoints.py (1): get_system_prompt (70-101)
- src/configuration.py (1): configuration (60-64)

tests/unit/app/endpoints/test_query.py (1)
- tests/unit/app/endpoints/test_streaming_query.py (1): mock_metrics (61-66)

tests/unit/metrics/test_utis.py (1)
- src/metrics/utils.py (2): setup_model_metrics (20-57), update_llm_token_count_from_turn (60-77)

src/app/endpoints/query.py (3)
- src/metrics/utils.py (1): update_llm_token_count_from_turn (60-77)
- src/app/endpoints/streaming_query.py (1): retrieve_response (685-796)
- src/models/requests.py (1): QueryRequest (72-222)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (10)
tests/unit/app/endpoints/test_streaming_query.py (1)
61-67: Good isolation of metrics side effects in streaming tests.
Mock cleanly patches update_llm_token_count_from_turn, keeping tests deterministic.

tests/unit/app/endpoints/test_query.py (1)
49-55: Good helper to null metrics during unit tests.
Keeps tests focused on behavior without counting tokens.

src/app/endpoints/streaming_query.py (1)
29-29: Import of token-counting utility is appropriate.
Matches usage in the generator; import location is fine.

src/metrics/utils.py (2)
3-10: New imports align with the added functionality.
Scope and ordering look good.
60-77: I’ve run the checks and confirmed that turn.output_message and turn.input_messages are already RawMessage instances and that ChatFormat.encode_dialog_prompt accepts them directly, so there is no runtime mismatch. The system prompt should indeed be role="system", not "user". Drop the proposed defensive normalization and simply update the system prompt’s role:

-    input_messages = [RawMessage(role="user", content=system_prompt)] + cast(
-        list[RawMessage], turn.input_messages
-    )
+    input_messages = []
+    if system_prompt:
+        input_messages.append(RawMessage(role="system", content=system_prompt))
+    input_messages += cast(list[RawMessage], turn.input_messages)

No provider-specific mapping needed here.
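For orientation, here is a minimal sketch of how the corrected assembly could sit inside update_llm_token_count_from_turn. The Tokenizer/ChatFormat/RawMessage usage follows the identifiers mentioned in this review, but the import paths, the counter names (llm_token_sent_total, llm_token_received_total), and the exact function body are assumptions rather than the file's actual contents:

```python
from typing import cast

# Import paths are assumptions and may differ by llama-models version.
from llama_models.llama3.api.chat_format import ChatFormat
from llama_models.llama3.api.datatypes import RawMessage
from llama_models.llama3.api.tokenizer import Tokenizer

import metrics  # assumed to expose the Prometheus counters used below


def update_llm_token_count_from_turn(
    turn, model: str, provider: str, system_prompt: str = ""
) -> None:
    """Count prompt/completion tokens for a completed turn (sketch only)."""
    formatter = ChatFormat(Tokenizer.get_instance())

    # Completion tokens: encode the assistant message produced by the turn.
    output_tokens = len(
        formatter.encode_dialog_prompt([cast(RawMessage, turn.output_message)]).tokens
    )

    # Prompt tokens: system prompt (role="system") plus the turn's input messages.
    input_messages: list[RawMessage] = []
    if system_prompt:
        input_messages.append(RawMessage(role="system", content=system_prompt))
    input_messages += cast(list[RawMessage], turn.input_messages)
    input_tokens = len(formatter.encode_dialog_prompt(input_messages).tokens)

    # Hypothetical counter names; the real metric objects live in the metrics package.
    metrics.llm_token_received_total.labels(provider, model).inc(output_tokens)
    metrics.llm_token_sent_total.labels(provider, model).inc(input_tokens)
```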
tests/unit/metrics/test_utis.py (2)
3-3: Import coverage looks good.
You’re testing both setup and the new token-count function.
79-124: Solid unit test for token-count updates with clear expectations.
Mocks are sufficient and assertions target the right metric labels.

src/app/endpoints/query.py (3)
27-27: Import for token metrics is correct.
Import location and symbol name match usage below. No issues.

222-223: Good: provider_id forwarded explicitly.
Keyword-only usage avoids positional breakage and keeps call sites clear.

418-419: Docstring updated correctly.
Parameter docs align with signature and intent.
In src/app/endpoints/query.py:

    # Update token count metrics for the LLM call
    model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
    update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
🛠️ Refactor suggestion
Harden metrics update to never break the endpoint; use last path segment.
Instrumentation must be best-effort. If Tokenizer/formatter or turn fields are missing, this can raise and 500 the request. Also, rsplit is safer for nested identifiers.
Apply:
-# Update token count metrics for the LLM call
-model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
-update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)
+# Update token count metrics for the LLM call (best-effort; do not disrupt response)
+try:
+    model_label = model_id.rsplit("/", 1)[-1]
+    update_llm_token_count_from_turn(
+        response, model_label, provider_id or "unknown", system_prompt
+    )
+except Exception as err:  # pylint: disable=broad-except
+    logger.warning("Failed to update token count metrics: %s", err)

If you prefer, we can centralize this “safe update” in a tiny helper to reuse here and in streaming_query.
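If that route is taken, a minimal sketch of such a shared helper; the function name, module location, and logger wiring are suggestions, not existing code, and the body simply mirrors the guarded update proposed above:

```python
# Hypothetical shared helper, e.g. in a small endpoint utility module.
import logging

from metrics.utils import update_llm_token_count_from_turn

logger = logging.getLogger(__name__)


def safe_update_token_metrics(turn, model_id: str, provider_id: str, system_prompt: str) -> None:
    """Best-effort token accounting: never let metrics failures reach the endpoint."""
    try:
        # Use the last path segment so nested identifiers still yield a clean model label.
        model_label = model_id.rsplit("/", 1)[-1]
        update_llm_token_count_from_turn(
            turn, model_label, provider_id or "unknown", system_prompt
        )
    except Exception as err:  # pylint: disable=broad-except
        logger.warning("Failed to update token count metrics: %s", err)
```

Both query.py and streaming_query.py could then call this helper in place of their inline metric updates.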
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Replace:

    # Update token count metrics for the LLM call
    model_label = model_id.split("/", 1)[1] if "/" in model_id else model_id
    update_llm_token_count_from_turn(response, model_label, provider_id, system_prompt)

with:

    # Update token count metrics for the LLM call (best-effort; do not disrupt response)
    try:
        model_label = model_id.rsplit("/", 1)[-1]
        update_llm_token_count_from_turn(
            response, model_label, provider_id or "unknown", system_prompt
        )
    except Exception as err:  # pylint: disable=broad-except
        logger.warning("Failed to update token count metrics: %s", err)
🤖 Prompt for AI Agents
In src/app/endpoints/query.py around lines 518-521, the metrics update can raise
if tokenizer/formatter or expected fields on the turn/response are missing and
using split("/") can pick the wrong segment for nested identifiers; change to
use rsplit("/", 1) to get the last path segment for model_label and wrap the
call to update_llm_token_count_from_turn in a try/except that swallows
exceptions (and logs a debug/warn) so instrumentation is best-effort and cannot
500 the endpoint; optionally factor this logic into a small helper (used here
and in streaming_query) that extracts the safe model_label and invokes the
update inside a non-throwing guard.
do not merge
Summary by CodeRabbit

- New Features
- Tests
- Chores