Refactor LLM chats: separate streaming logic and enforce strict typing #12
base: ci
Conversation
Thanks for the refactoring, which is really helpful and much clearer! I left some questions/comments to understand more. We can keep iterating on it.
```python
"""
Defines a chat agent that interacts with a Large Language Model (LLM).
Design Note:
```
nit: shall we update here too?
Good idea, I will get back to that once the internal API is more stable.
```python
@dataclasses.dataclass
class FunctionCall:
```
Shall we also consider thought signatures for reasoning models? Context here. Or is it already covered by `LLMMessage.thinking`?
I would love to have it as well, perhaps in a separate PR, as this one is already complicated enough.
```python
    temperature: float | None = 0,
    seed: int = 0,
    tools: list[Any] | None = None,
) -> LLMMessage[str]:
```
Being explicit is nice! I wonder if we also want to keep the flexibility for other parameters, e.g., the config.
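One way to keep that escape hatch while staying explicit might look like the sketch below; `GenerationConfig` and the `config` parameter are hypothetical names for illustration, not part of this PR:

```python
import dataclasses
from typing import Any


@dataclasses.dataclass
class GenerationConfig:
    # Hypothetical bag for provider-specific options that do not deserve
    # their own keyword argument.
    top_p: float | None = None
    max_output_tokens: int | None = None
    extra: dict[str, Any] | None = None


def invoke(
    prompt: str,
    *,
    temperature: float | None = 0,
    seed: int = 0,
    tools: list[Any] | None = None,
    config: GenerationConfig | None = None,  # everything else goes here
) -> str:
    raise NotImplementedError
```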
```python
    return result


class ModelProxyOpenAI(OpenAI):
```
Do we still support streaming output for Model Proxy?
We should, but I would rather wait until they fully support the Responses API so we don't have to implement it twice.
Major refactor of the LLM chat architecture to improve code organization, maintainability, and type safety.

Key Changes:
- Split `LLMChat` subclasses into distinct Non-Streaming and Streaming implementations. Streaming logic (primarily for notebooks) was complicating the core classes; this split makes primary actors more concise and less error-prone.
- Moved provider-specific implementations into separate files: `openai.py` and `genai.py`.
- Replaced the generic `LLMResponse` with a strictly typed version, specifically enforcing types for `tool_usage` and `token_usage`.
- Updated the `invoke` method to accept explicit arguments.
- Migrated the OpenAI integration from the `completion` API to the more user-friendly `responses` API.

Testing:
- Added coverage for common use cases using real APIs (tests run conditionally if environment keys are present).
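For orientation, here is a minimal sketch of the shapes this description refers to, pieced together from the diff snippets quoted in the review below. Class names such as `NonStreamingLLMChat`, the `Usage` fields, and the exact `FunctionCall` layout are assumptions, not a copy of the implementation:

```python
import dataclasses
from typing import Any, Generic, TypeVar

T = TypeVar("T")


@dataclasses.dataclass
class FunctionCall:
    # A single tool call requested by the model, with its parsed arguments
    # and (optionally) the tool's output once executed.
    name: str
    arguments: dict[str, Any]
    output: Any | None = None


@dataclasses.dataclass
class Usage:
    # Token accounting; the exact field names here are an assumption.
    input_tokens: int = 0
    output_tokens: int = 0


@dataclasses.dataclass
class LLMMessage(Generic[T]):
    # Strictly typed reply: typed content plus optional reasoning text,
    # tool calls, and token usage. (The PR's LLMMessage derives from
    # messages.Message[T]; a plain dataclass is used here to stay
    # self-contained.)
    content: T
    thinking: str | None = None
    tool_calls: list[FunctionCall] | None = None
    usage: Usage | None = None


class NonStreamingLLMChat:
    # Non-streaming actor; a Streaming sibling would emit chunks instead.
    def invoke(
        self,
        prompt: str,
        *,
        temperature: float | None = 0,
        seed: int = 0,
        tools: list[Any] | None = None,
    ) -> LLMMessage[str]:
        raise NotImplementedError
```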
```python
answer = response.content
response._meta.update(chat=chat, schema=schema, raw_content=answer, **kwargs)
response._meta.update(
```
Do we still use _meta?
develra left a comment:
LGTM. It's a bit hard for me to tell, as someone fairly unfamiliar with this code, whether the test coverage is sufficient to be confident that these changes are safe. I think it would be good to reason through what might break as a result of these changes and make sure we have test coverage for it, especially given the somewhat sensitive timing of a new launch.
```yaml
args: ["run", "--group", "test", "pytest", "tests"]
waitFor: ["push-image"]
env:
  - "LLM_DEFAULT=x"
```
It really doesn't matter, but I'm wondering why the MODEL_PROXY_* ones dropped the letter while LLM_DEFAULT didn't?
```python
    arguments=json.loads(item.arguments),
    output=output,
)
# {"name": item.name, "arguments": item.arguments, "output": output}
```
nit - clean up if this is unused
```python
{
    "role": message.sender.role
    if message.sender.role != "tool"
    else "system",  # TODO: Remove this renaming once ModelProxy supports tools
```
Looking at this TODO - do we know if that is on the roadmap for ModelProxy?
```python
if not self.support_structured_outputs and schema_instructions:
    raw_messages.append(
        {
            "role": "system",
```
Some models don't support the system role; shall we use user instead?
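For example, a tiny sketch of that fallback (the `supports_system_role` flag is a hypothetical name; the quoted diff only has `support_structured_outputs`):

```python
def schema_instruction_message(schema_instructions: str, supports_system_role: bool) -> dict[str, str]:
    # Use the "system" role where the model accepts it, otherwise fall back
    # to a "user" message carrying the same instructions.
    role = "system" if supports_system_role else "user"
    return {"role": role, "content": schema_instructions}
```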
```python
    temperature: float | None = 0,
    seed: int = 0,
    tools: list[Any] | None = None,
) -> LLMMessage[str]:
```
For invoke, shall we return LLMMessage[T] for image output etc?
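For illustration, one hypothetical way to parameterize the return type (not from this PR; `output_type` is an invented parameter, and `LLMMessage` refers to the sketch earlier in this thread):

```python
from typing import TypeVar

T = TypeVar("T")


def invoke(prompt: str, output_type: type[T] = str) -> "LLMMessage[T]":
    # The requested output type would drive the return parameter, so an
    # image request could yield LLMMessage[Image] instead of LLMMessage[str].
    raise NotImplementedError
```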
```python
    tool_calls: list[FunctionCall] | None = None
    usage: Usage | None = None

    def add_chunk(self, chunk: str):
```
Curious what the different uses of this new method and Message.stream are?
```python
@dataclasses.dataclass
class LLMMessage(messages.Message[T]):
    content: T
```
Looking at the add_chunk method, do we essentially assume content is always a string? If so, shall we just rename it to text or similar?
We could add another field, e.g. image, for image output in the future.
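Something along these lines, purely as a hypothetical shape (neither `text` nor `image` exists in the PR):

```python
import dataclasses


@dataclasses.dataclass
class LLMMessage:
    # Hypothetical rename: "text" for the string payload that add_chunk
    # appends to, plus a separate slot for future image output.
    text: str = ""
    image: bytes | None = None

    def add_chunk(self, chunk: str) -> None:
        # Streaming chunks are always text, which is what motivates the rename.
        self.text += chunk
```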
```python
    content: T
    _status: utils.Status = utils.Status.RUNNING
    thinking: str | None = None
    tool_calls: list[FunctionCall] | None = None
```
I think these fields are also useful to Message, do you think so?
Discussed with @s-alexey that it would be great for us to test more existing examples to avoid backward incompatibility.