Fix: add sources after web search in stream mode #156
Conversation
Pull Request Overview
This pull request fixes the streaming chat completion endpoint to properly include web search sources in the final streamed payload when web search is enabled. The implementation ensures sources are added only to the last chunk, the one with finish_reason "stop".
Key changes:
- Refactored streaming logic to buffer payloads and append sources only on the final chunk
- Added comprehensive unit test to verify sources are included correctly in streamed responses
- Improved error handling and token usage tracking in the streaming generator
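For illustration, a minimal sketch of that buffering pattern (hypothetical names; the actual generator in nilai-api/src/nilai_api/routers/private.py differs in detail):

```python
import json
from typing import Any, AsyncIterator


async def stream_with_sources(
    chunks: AsyncIterator[Any],  # OpenAI-style ChatCompletionChunk objects
    sources: list[dict],
    web_search_enabled: bool,
) -> AsyncIterator[str]:
    """Yield SSE events, attaching sources to the final 'stop' chunk."""
    async for chunk in chunks:
        payload = chunk.model_dump(exclude_unset=True)
        choices = payload.get("choices") or []
        finish_reason = choices[0].get("finish_reason") if choices else None
        # Only the chunk that ends generation carries the web search sources.
        if web_search_enabled and finish_reason == "stop":
            payload["sources"] = sources
        yield f"data: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"
```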
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| nilai-api/src/nilai_api/routers/private.py | Refactored streaming logic to buffer payloads, add sources to final chunk, and improve error handling |
| tests/unit/nilai_api/routers/test_private.py | Added unit test to verify sources are properly included in final streamed payload |
```python
try:
    stop_condition = payload["choices"][0].get("finish_reason")
except Exception:
    stop_condition = None
```
This is an important issue.
Entering a try block is cheap in Python, but raising and handling an exception is comparatively expensive and can slow a program down noticeably (see https://paltman.com/try-except-performance-in-python-a-simple-test/ for a simple benchmark).
That cost is negligible when the block runs once and normally takes the "try" branch, but this code runs for every single token the stream routine produces, and it takes the except branch even more often. If this snippet slowed the program down by 1 ms per call, a 1000-token generation would already cost that user an extra second.
The easy option that works most of the time: use get("abc", None) and check for None.
An even easier option for this case: I see you are updating to if chunk.usage is not None. In the current setup, chunk.usage is only reported on the last streamed chunk, so a non-None chunk.usage already tells you this is the last chunk, and at that point you can reply with the sources. A sketch follows below.
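A sketch of that suggestion, assuming (as described above) that usage is only populated on the final streamed chunk:

```python
import json


async def stream_events(chunks, sources, web_search_enabled):
    async for chunk in chunks:
        # chunk is a typed Pydantic model, so no try/except per token.
        # With usage reported only on the final chunk, a non-None usage
        # doubles as the end-of-stream signal.
        if chunk.usage is not None and web_search_enabled:
            payload = chunk.model_dump(exclude_unset=True)
            payload["sources"] = sources
            yield f"data: {json.dumps(payload)}\n\n"
        else:
            yield f"data: {chunk.model_dump_json(exclude_unset=True)}\n\n"
```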
```python
payload = chunk.model_dump(exclude_unset=True)

if chunk.usage is not None:
    last_seen_usage = payload.get("usage", last_seen_usage)
    if last_seen_usage:
        prompt_token_usage = last_seen_usage.get(
            "prompt_tokens", prompt_token_usage
        )
        completion_token_usage = last_seen_usage.get(
            "completion_tokens", completion_token_usage
        )
```
This aligns with what I told you on Friday: you should not dump the model. The response is a Pydantic model from the OpenAI client, and dumping it to a dict is what forces you to check whether fields such as "choices" or "finish_reason" exist. On the typed model they are always there; they may be None when optional, which is easy to check.
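A sketch of the direct-access style suggested here, using the typed fields of the OpenAI client's ChatCompletionChunk (hypothetical helper name):

```python
from openai.types.chat import ChatCompletionChunk


def extract_stream_state(
    chunk: ChatCompletionChunk,
) -> tuple[bool, int | None, int | None]:
    """Read the stop condition and token usage straight off the typed model."""
    # choices and finish_reason are typed attributes; optional fields
    # are simply None rather than missing dict keys.
    finish_reason = chunk.choices[0].finish_reason if chunk.choices else None
    is_final = finish_reason == "stop"
    if chunk.usage is not None:
        return is_final, chunk.usage.prompt_tokens, chunk.usage.completion_tokens
    return is_final, None, None
```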
The usage parameter is used for the stop condition now. That is possible because I updated continuous_usage_stats so that usage only appears in the last chunk.
The Pydantic object is now used directly instead of a dictionary. A sketch of the corresponding client-side request follows below.
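For context, a hedged sketch of how a client would request that behavior against a vLLM-style OpenAI-compatible backend (continuous_usage_stats is a vLLM extension to stream_options; the endpoint and model name here are placeholders, and exact option support depends on the backend version):

```python
from openai import OpenAI

# Hypothetical endpoint and model; the point is the stream_options payload.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    # include_usage=True attaches usage to the final chunk only;
    # continuous_usage_stats=True would instead report usage on every
    # chunk, so it stays off for the usage-as-stop-signal trick to work.
    stream_options={"include_usage": True, "continuous_usage_stats": False},
)

for chunk in stream:
    if chunk.usage is not None:  # true exactly once, on the last chunk
        print(chunk.usage.prompt_tokens, chunk.usage.completion_tokens)
```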
jcabrero
left a comment
Please have a look at the comments.
b694367 to 5425fcc
jcabrero
left a comment
👍 LGTM
This pull request fixes the streaming chat completion endpoint to include sources when web search is on. It also adds a unit test to verify that sources are properly included in the final streamed payload.
Streaming response improvements:
- Refactored `chat_completion_stream_generator` to buffer payloads, append sources only on the final chunk (when `finish_reason == "stop"`), and yield JSON-encoded SSE events; also improved token usage tracking and error handling in the stream. [1] [2] [3]
Testing enhancements:
- Added `test_chat_completion_stream_includes_sources` to verify that web search sources are only included in the final streamed payload, ensuring correct SSE formatting and payload structure.
Code cleanup:
- Removed unused imports (`asyncio`) and added necessary ones (`json`) in both the main router and test files. [1] [2]
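For reference, a minimal sketch of the kind of assertion such a test can make (hypothetical helper and hard-coded stand-in data; the real test in tests/unit/nilai_api/routers/test_private.py differs):

```python
import json


def parse_sse(body: str) -> list[dict]:
    """Split an SSE body into decoded JSON payloads, skipping [DONE]."""
    return [
        json.loads(line[len("data: "):])
        for line in body.splitlines()
        if line.startswith("data: ") and line != "data: [DONE]"
    ]


def test_sources_only_on_final_chunk():
    # Stand-in for the body returned by the streaming endpoint under test.
    body = (
        'data: {"choices": [{"delta": {"content": "Hi"}, "finish_reason": null}]}\n\n'
        'data: {"choices": [{"delta": {}, "finish_reason": "stop"}], '
        '"sources": [{"source": "https://example.com"}]}\n\n'
        "data: [DONE]\n\n"
    )
    payloads = parse_sse(body)
    assert all("sources" not in p for p in payloads[:-1])
    assert payloads[-1]["choices"][0]["finish_reason"] == "stop"
    assert payloads[-1]["sources"]
```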