
Refine model providers metrics #416

Merged
KaQuMiQ merged 1 commit into main from feature/metrics
Sep 16, 2025

Conversation

@KaQuMiQ (Collaborator) commented Sep 16, 2025

No description provided.

coderabbitai bot commented Sep 16, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This PR standardizes observability across multiple model providers (Anthropic, Bedrock, Gemini, Mistral, Ollama, vLLM). Verbose fields (model.instructions, model.tools, model.context) move out of INFO records into adjacent DEBUG records, and INFO records now carry concise metrics: model.tools.count, model.tools.selection (replacing model.tool_selection), and model.stream (replacing model.streaming). Some providers also surface model.temperature and model.output at INFO. Separately, Stage.result_evaluation and Stage.context_evaluation gain a keyword-only raises: bool parameter; when raises=True, a failed evaluation raises StageException (the default remains non-raising).
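The INFO/DEBUG split described above can be sketched as follows. This is a minimal illustration with a stubbed recorder, not the actual haiway API; the record signature, level names, and attribute keys are assumed from the diffs shown later in this review:

```python
from enum import Enum
from typing import Any

class ObservabilityLevel(Enum):  # stub mirroring the level names used in the diffs
    INFO = "info"
    DEBUG = "debug"

records: list[tuple[ObservabilityLevel, dict[str, Any]]] = []

def record(level: ObservabilityLevel, *, attributes: dict[str, Any]) -> None:
    # Stand-in for ctx.record; collects records so the split is visible.
    records.append((level, attributes))

tools = ["search", "calculator"]

# Concise INFO record: counts and flags only, no verbose payloads.
record(
    ObservabilityLevel.INFO,
    attributes={
        "model.provider": "example",
        "model.name": "example-model",
        "model.tools.count": len(tools),
        "model.tools.selection": "auto",
        "model.stream": False,
    },
)
# Adjacent DEBUG record: verbose fields (instructions, tool names, context).
record(
    ObservabilityLevel.DEBUG,
    attributes={
        "model.instructions": "You are a helpful assistant.",
        "model.tools": tools,
        "model.context": ["user: hello"],
    },
)

info_attrs = records[0][1]
assert "model.instructions" not in info_attrs  # verbose data stays out of INFO
```

The point of the split is that dashboards aggregating INFO stay cheap and stable, while DEBUG sinks still receive the full payloads when enabled.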

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Add evaluation stage #345 — Directly related: modifies Stage.result_evaluation/context_evaluation APIs that this PR also updates by adding raises and changing failure behavior.
  • Switch to non-strict OpenAI tools #412 — Related: makes similar observability/logging changes in provider completion paths (tool-selection keys and logging level adjustments).
  • Add unit tests #382 — Related tests: updates test coverage for Stage evaluation behavior and overlaps with the new raises behavior exercised in tests.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check — ⚠️ Warning: The pull request has no author-provided description; an empty PR body does not describe any part of the changeset and therefore fails this check. While the raw_summary shows substantial observability/metrics changes and a small API change to Stage.result_evaluation, none of that context is present in the PR description to help reviewers. Resolution: add a concise PR description summarizing the key changes (the observability/metrics refactor across provider modules and the addition or renaming of fields such as model.tools.count, model.tools.selection, and model.stream), note the Stage.result_evaluation signature change (raises: bool) and its test adjustment, and include the rationale and any reviewer action or migration guidance needed.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 21.43%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title "Refine model providers metrics" accurately captures the primary intent of the changeset, which restructures observability and metrics across multiple model provider modules (e.g., Anthropic, Bedrock, Gemini, Mistral, Ollama, vLLM) by moving verbose fields to DEBUG and adding standardized fields like model.tools.count, model.tools.selection, and model.stream. It is concise, clear, and focused on the main change a reviewer would care about; it does not need to mention every minor change (such as the new raises parameter on Stage.result_evaluation), so the title is appropriate as-is.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b83fb6 and 60e2dae.

📒 Files selected for processing (8)
  • src/draive/anthropic/messages.py (1 hunks)
  • src/draive/bedrock/converse.py (1 hunks)
  • src/draive/gemini/generating.py (2 hunks)
  • src/draive/mistral/completions.py (2 hunks)
  • src/draive/ollama/chat.py (1 hunks)
  • src/draive/stages/stage.py (6 hunks)
  • src/draive/vllm/messages.py (2 hunks)
  • tests/test_stage.py (1 hunks)


@coderabbitai bot left a comment
Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/draive/bedrock/converse.py (1)

249-273: Add INFO/DEBUG records for streaming fallback to keep dashboards accurate.

Currently, streaming attempts only log a warning and then call the non-streaming path, which records model.stream=False, making stream attempts indistinguishable in metrics. Emit a lightweight INFO/DEBUG pair here with model.stream=True before falling back.

Apply this diff near the start of _completion_stream:

 async def _completion_stream(
@@
-        async with ctx.scope("model.completion.stream"):
-            ctx.log_warning(
+        async with ctx.scope("model.completion.stream"):
+            ctx.record(
+                ObservabilityLevel.INFO,
+                attributes={
+                    "model.provider": "bedrock",
+                    "model.name": config.model,
+                    "model.temperature": config.temperature,
+                    "model.output": str(output),
+                    "model.tools.count": len(tools.specifications),
+                    "model.tools.selection": tools.selection,
+                    "model.stream": True,
+                },
+            )
+            ctx.record(
+                ObservabilityLevel.DEBUG,
+                attributes={
+                    "model.instructions": instructions,
+                    "model.tools": [tool.name for tool in tools.specifications],
+                    "model.context": [element.to_str() for element in context],
+                },
+            )
+            ctx.log_warning(
                 "Bedrock completion streaming is not supported yet, using regular response instead."
             )
src/draive/ollama/chat.py (1)

201-214: Consider emitting INFO/DEBUG for streaming fallback like other providers.

Add a small INFO/DEBUG pair with model.stream=True before warning/fallback to preserve accurate stream attempt metrics.

 async def _completion_stream(
@@
-        async with ctx.scope("model.completion.stream"):
-            ctx.log_warning(
+        async with ctx.scope("model.completion.stream"):
+            ctx.record(
+                ObservabilityLevel.INFO,
+                attributes={
+                    "model.provider": "ollama",
+                    "model.name": config.model,
+                    "model.temperature": config.temperature,
+                    "model.output": str(output),
+                    "model.tools.count": len(tools.specifications),
+                    "model.tools.selection": tools.selection,
+                    "model.stream": True,
+                },
+            )
+            ctx.record(
+                ObservabilityLevel.DEBUG,
+                attributes={
+                    "model.instructions": instructions,
+                    "model.tools": [tool.name for tool in tools.specifications],
+                    "model.context": [element.to_str() for element in context],
+                },
+            )
+            ctx.log_warning(
                 "ollama completion streaming is not supported yet, using regular response instead."
             )
src/draive/mistral/completions.py (2)

236-239: Typo in comment (“dafault”).

Minor nit: fix spelling.

-                        # "null" is a dafault value...
+                        # "null" is a default value...

183-199: Unify streaming usage metrics with the global naming (model.*).

The only code occurrences of lmm. are the two metrics in src/draive/mistral/completions.py; the other matches are docstrings in client files.

-                ctx.record(
-                    ObservabilityLevel.INFO,
-                    metric="lmm.input_tokens",
-                    value=usage.prompt_tokens or 0,
-                    unit="tokens",
-                    kind="counter",
-                    attributes={"lmm.model": event.data.model},
-                )
-                ctx.record(
-                    ObservabilityLevel.INFO,
-                    metric="lmm.output_tokens",
-                    value=usage.completion_tokens or 0,
-                    unit="tokens",
-                    kind="counter",
-                    attributes={"lmm.model": event.data.model},
-                )
+                ctx.record(
+                    ObservabilityLevel.INFO,
+                    metric="model.input_tokens",
+                    value=usage.prompt_tokens or 0,
+                    unit="tokens",
+                    kind="counter",
+                    attributes={"model.provider": "mistral", "model.name": event.data.model},
+                )
+                ctx.record(
+                    ObservabilityLevel.INFO,
+                    metric="model.output_tokens",
+                    value=usage.completion_tokens or 0,
+                    unit="tokens",
+                    kind="counter",
+                    attributes={"model.provider": "mistral", "model.name": event.data.model},
+                )
src/draive/vllm/messages.py (1)

637-642: Consider not sending tools when selection is “none”.

To reduce payload and avoid provider-side ambiguity, return NOT_GIVEN for tools in this case (similar to non‑stream path logic elsewhere).

-    if tool_selection == "none":
-        return ("none", tools_list)
+    if tool_selection == "none":
+        return ("none", NOT_GIVEN)

Confirm vLLM/OpenAI compatibility: some backends accept tool_choice="none" with tools present, others prefer omitting tools.
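The two payload shapes at issue can be illustrated with a small sketch. This is not the draive implementation; NOT_GIVEN is modeled with a local sentinel, and the (selection, tools) tuple shape is assumed from the suggested diff above:

```python
from typing import Any

NOT_GIVEN = object()  # local sentinel standing in for the SDK's NOT_GIVEN marker

def resolve_tools(
    tool_selection: str,
    tools_list: list[dict[str, Any]],
) -> tuple[str, Any]:
    # When no tools should be used, omit the tools list entirely rather than
    # sending tool_choice="none" alongside a populated tools payload, which
    # some backends reject or interpret ambiguously.
    if tool_selection == "none":
        return ("none", NOT_GIVEN)
    return (tool_selection, tools_list)

selection, tools = resolve_tools("none", [{"name": "search"}])
assert tools is NOT_GIVEN  # payload omits tools when selection is "none"
selection, tools = resolve_tools("auto", [{"name": "search"}])
assert tools == [{"name": "search"}]
```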

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9596b60 and 1b83fb6.

📒 Files selected for processing (8)
  • src/draive/anthropic/messages.py (1 hunks)
  • src/draive/bedrock/converse.py (1 hunks)
  • src/draive/gemini/generating.py (2 hunks)
  • src/draive/mistral/completions.py (2 hunks)
  • src/draive/ollama/chat.py (1 hunks)
  • src/draive/stages/stage.py (6 hunks)
  • src/draive/vllm/messages.py (2 hunks)
  • tests/test_stage.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use absolute imports from the draive package (e.g., from draive import ...), not relative imports
Follow Ruff import ordering: standard library, third-party, then local imports
Use Python 3.12+ typing features (e.g., | unions, PEP 695 generics)
Prefer abstract/base collection types (Sequence, Mapping, Set, Iterable) over concrete list, dict, set in type hints
Define and raise custom exception types for specific error cases
Use NumPy docstring convention for all functions, classes, and methods
Skip module-level docstrings unless explicitly requested
Docstrings should include Parameters, Returns, Raises, and Notes (if needed) sections

Format all Python code with Ruff formatter (run via make format); do not use any other formatter

Files:

  • src/draive/bedrock/converse.py
  • src/draive/mistral/completions.py
  • src/draive/ollama/chat.py
  • src/draive/vllm/messages.py
  • tests/test_stage.py
  • src/draive/anthropic/messages.py
  • src/draive/gemini/generating.py
  • src/draive/stages/stage.py
src/draive/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/draive/**/*.py: Import Haiway symbols directly as from haiway import State, ctx
Use ctx.scope(...) to bind active State instances; avoid global state
Route all logging through ctx.log_* and never use print
Enforce strict typing on public APIs; avoid untyped definitions and Any except at verified third‑party boundaries
Prefer explicit attribute access with static types; avoid dynamic getattr except at narrow boundaries
Use abstract immutable protocols (Mapping, Sequence, Iterable) instead of concrete dict/list/set in public types
Use final where applicable; prefer composition over complex inheritance
Use precise unions (|), narrow with match/isinstance, and avoid cast unless provably safe and localized
Model immutable configuration/data as haiway.State; avoid in‑place mutation and prefer State.updated(...)/functional builders
Access active state via haiway.ctx within async scopes (ctx.scope(...))
Public state methods that dispatch on the active instance must use @statemethod
Add structured, concise logs around generation calls, tool dispatch, and provider requests/responses (without secrets); add metrics via ctx.record where applicable
All I/O is async; keep boundaries async, use ctx.spawn for detached tasks, and avoid custom threading
Translate provider/SDK errors into typed exceptions; never raise bare Exception; preserve context
Compose multimodal content with MultimodalContent.of(...); use ResourceContent/Reference for blobs; wrap custom data in ArtifactContent (use hidden when needed)
Public symbols require NumPy-style docstrings with Parameters/Returns/Raises and rationale when non-obvious
Avoid docstrings for internal helpers; prefer self-explanatory names; skip module-level docstrings
Never log secrets or full request bodies containing keys/tokens
Read credentials from environment variables; resolve via helpers like getenv_str

Files:

  • src/draive/bedrock/converse.py
  • src/draive/mistral/completions.py
  • src/draive/ollama/chat.py
  • src/draive/vllm/messages.py
  • src/draive/anthropic/messages.py
  • src/draive/gemini/generating.py
  • src/draive/stages/stage.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Place tests in the tests/ directory and use pytest (with async support)
Mark asynchronous tests with pytest.mark.asyncio
Mock dependencies within test scope using stubbed functionality state (e.g., via ctx.scope)

tests/**/*.py: Unit tests must not perform network I/O; mock providers/HTTP
Use pytest-asyncio (@pytest.mark.asyncio) for coroutine tests
Prefer scoping with ctx.scope(...) in tests and bind required State instances explicitly
Avoid real I/O in tests; stub provider calls and HTTP

Files:

  • tests/test_stage.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
PR: miquido/draive#0
File: AGENTS.md:0-0
Timestamp: 2025-09-11T15:22:15.967Z
Learning: Applies to src/draive/**/*.py : Add structured, concise logs around generation calls, tool dispatch, and provider requests/responses (without secrets); add metrics via `ctx.record` where applicable
📚 Learning: 2025-09-11T15:22:15.967Z
Learnt from: CR
PR: miquido/draive#0
File: AGENTS.md:0-0
Timestamp: 2025-09-11T15:22:15.967Z
Learning: Applies to src/draive/**/*.py : Add structured, concise logs around generation calls, tool dispatch, and provider requests/responses (without secrets); add metrics via `ctx.record` where applicable

Applied to files:

  • src/draive/bedrock/converse.py
  • src/draive/mistral/completions.py
  • src/draive/anthropic/messages.py
  • src/draive/gemini/generating.py
🧬 Code graph analysis (6)
src/draive/bedrock/converse.py (5)
src/draive/models/types.py (2)
  • tools (465-467)
  • tools (633-635)
src/draive/multimodal/artifact.py (1)
  • to_str (82-96)
src/draive/multimodal/content.py (2)
  • to_str (288-297)
  • to_str (651-664)
src/draive/multimodal/text.py (1)
  • to_str (61-70)
src/draive/resources/types.py (1)
  • to_str (181-203)
src/draive/mistral/completions.py (6)
src/draive/models/types.py (2)
  • tools (465-467)
  • tools (633-635)
src/draive/models/tools/types.py (1)
  • name (73-73)
src/draive/multimodal/artifact.py (1)
  • to_str (82-96)
src/draive/multimodal/content.py (2)
  • to_str (288-297)
  • to_str (651-664)
src/draive/multimodal/text.py (1)
  • to_str (61-70)
src/draive/resources/types.py (1)
  • to_str (181-203)
tests/test_stage.py (2)
src/draive/stages/stage.py (2)
  • Stage (76-2040)
  • result_evaluation (929-990)
src/draive/stages/types.py (1)
  • StageException (272-303)
src/draive/anthropic/messages.py (5)
src/draive/models/types.py (2)
  • tools (465-467)
  • tools (633-635)
src/draive/multimodal/artifact.py (1)
  • to_str (82-96)
src/draive/multimodal/content.py (2)
  • to_str (288-297)
  • to_str (651-664)
src/draive/multimodal/text.py (1)
  • to_str (61-70)
src/draive/resources/types.py (1)
  • to_str (181-203)
src/draive/gemini/generating.py (4)
src/draive/multimodal/artifact.py (1)
  • to_str (82-96)
src/draive/multimodal/content.py (2)
  • to_str (288-297)
  • to_str (651-664)
src/draive/multimodal/text.py (1)
  • to_str (61-70)
src/draive/resources/types.py (1)
  • to_str (181-203)
src/draive/stages/stage.py (3)
src/draive/evaluation/scenario.py (1)
  • passed (48-59)
src/draive/evaluation/suite.py (2)
  • passed (76-87)
  • passed (188-189)
src/draive/evaluation/evaluator.py (1)
  • passed (98-107)
🔇 Additional comments (14)
src/draive/bedrock/converse.py (1)

131-143: INFO/DEBUG split looks good; concise INFO, verbose DEBUG.

The new attributes (tools.count/selection, stream=False) at INFO and moving instructions/tools/context to DEBUG align with the PR intent and logging guidelines.

src/draive/ollama/chat.py (2)

118-124: Good INFO payload: adds temperature/output/tools/stream.

This matches the new observability schema and keeps INFO concise.


125-132: DEBUG payload is appropriate.

Verbose fields moved to DEBUG; tool names only, and context via safe to_str().

src/draive/anthropic/messages.py (1)

154-166: Observability rework is correct and consistent.

  • INFO: tools.count/selection and stream flag included.
  • DEBUG: instructions/tools/context only.
    This adheres to the new schema and avoids logging bodies or secrets at INFO.

Please confirm Anthropic instructions may include sensitive prompts; keeping them at DEBUG is acceptable, but ensure any deployment that ships DEBUG to external sinks is gated appropriately.

src/draive/gemini/generating.py (2)

141-153: Non‑stream INFO/DEBUG instrumentation looks good.

Concise INFO with tool metrics and stream flag; detailed fields in DEBUG. Matches PR intent.


291-303: Stream path instrumentation is consistent.

Accurate model.stream=True plus verbose DEBUG fields. Good.

src/draive/mistral/completions.py (2)

130-150: Stream path adds correct INFO/DEBUG attributes.

Nice parity with non‑stream path and other providers.


318-330: Non‑stream INFO/DEBUG instrumentation is correct.

Matches the PR’s schema changes and is aligned with streaming path.

src/draive/stages/stage.py (2)

935-936: API addition LGTM: raises: bool = False.

Clear opt‑in failure behavior without breaking defaults.


973-985: Failure path logic is correct and preserves useful meta.

Early return when passed or non‑raising; otherwise raises with performance/report in meta.

tests/test_stage.py (1)

973-977: Test updated correctly for raises=True behavior.

Asserts StageException and verifies evaluation_performance meta; matches the new API.

src/draive/vllm/messages.py (3)

129-136: INFO attributes look good and consistent with schema.

Includes temperature/max tokens/output/tools metrics and model.stream=False.


137-144: DEBUG attributes are appropriate.

Verbose fields moved to DEBUG; safe serialization with to_str().


282-289: Stream path instrumentation matches the pattern.

Accurate model.stream=True and verbose DEBUG fields.

Also applies to: 291-297

Comment on lines 949 to 952
raises: bool = False
Defaines if exception raises on failing evaluation.
meta: Meta | MetaValues | None = None
🧹 Nitpick

Docstring wording/typos; add explicit Raises section.

Fix “Defaines” and document when StageException is raised per guidelines.

-        raises: bool = False
-            Defaines if exception raises on failing evaluation.
+        raises : bool, default False
+            Determines whether to raise ``StageException`` when the evaluation fails.
+            When ``False``, the stage returns the input state unchanged on failure.
@@
-        raises: bool = False
-            Defaines if exception raises on failing evaluation.
+        raises : bool, default False
+            Determines whether to raise ``StageException`` when the evaluation fails.
+            When ``False``, the stage returns the input state unchanged on failure.

Also consider appending a “Raises” section to both docstrings:

         Returns
         -------
         Self
             A new Stage instance that evaluates the result.
 
+        Raises
+        ------
+        StageException
+            When ``raises`` is True and evaluation fails.

Also applies to: 1012-1013

🤖 Prompt for AI Agents
In src/draive/stages/stage.py around lines 949-951 (and similarly at 1012-1013),
fix the docstring typo "Defaines" to "Defines" for the raises parameter and
reword the parameter description to be clearer (e.g., "raises: bool = False —
defines whether a StageException is raised when evaluation fails"). Add an
explicit "Raises" section to the docstring that states: "Raises StageException
when evaluation fails and raises is True" (include any relevant context such as
which method or condition triggers the exception). Ensure the wording matches
project docstring style and apply the same changes to the other docstring at
lines 1012-1013.

@KaQuMiQ KaQuMiQ merged commit bf7ecbf into main Sep 16, 2025
4 of 5 checks passed
@KaQuMiQ KaQuMiQ deleted the feature/metrics branch September 16, 2025 13:13