feat: Web Search V2 with context extraction + multi-model adjustment #154
Conversation
…ed query handling
Pull Request Overview
This PR introduces a significantly enhanced web search capability with topic-based multi-query support, content extraction from web pages, and improved LLM integration for better search query generation and result processing.
- Adds topic analysis agent to detect relevant information in user prompts and determine web search necessity
- Implements multi-query web search with parallel execution across different topics
- Enhances result quality by extracting main content from web pages using trafilatura
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Summary per file:
| File | Description |
|---|---|
| tests/unit/nilai_api/test_web_search.py | Updates test to reflect new message merging behavior (2 messages instead of 3) |
| packages/nilai-common/src/nilai_common/api_model.py | Adds new data models for topic analysis, result content extraction, and message merging utilities |
| packages/nilai-common/src/nilai_common/__init__.py | Exports new model classes for external use |
| nilai-api/src/nilai_api/handlers/web_search.py | Major refactor implementing topic-based multi-search with content extraction and improved query generation |
| nilai-api/pyproject.toml | Adds trafilatura dependency for web content extraction |
```python
    return results


def _sanitize_query(query: str) -> str:
```
Regexes are hard to debug. This function needs a docstring describing what it does.
```python
async def perform_web_search_async(query: str) -> WebSearchContext:
    """Perform an asynchronous web search using the Brave Search API.
```

```python
def _strip_code_fences(text: str) -> str:
```
Same here. This would benefit greatly from a good docstring.
| """Handle web search enhancement for a conversation. | ||
| Extracts the most recent user message, generates an optimized search query | ||
| using an LLM, and enhances the conversation with web search results. | ||
| Args: | ||
| req_messages: List of conversation messages to process | ||
| model_name: Name of the LLM model to use for query generation | ||
| client: LLM client instance for making API calls | ||
| Returns: | ||
| WebSearchEnhancedMessages with web search context added, or original | ||
| messages if no user query is found or search fails | ||
| """ |
Why remove the docstrings?
```python
    sections: List[str] = []
    all_sources: List[Source] = []

    for idx, (tq, ctx) in enumerate(zip(topic_queries, contexts), start=1):
```
These variables should be named topic_query and context; short names lose their meaning in longer code.
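Applied to the loop above, the rename would read as follows (the data here is stand-in, since the real `topic_queries` and `contexts` come from the PR's search pipeline):

```python
# Stand-in data for illustration; in the PR these come from the search pipeline.
topic_queries = ["rust borrow checker", "python asyncio"]
contexts = ["ctx-a", "ctx-b"]

sections = []
for idx, (topic_query, context) in enumerate(zip(topic_queries, contexts), start=1):
    # Descriptive names make the pairing of query and result explicit.
    sections.append(f"[{idx}] {topic_query}: {context}")
```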
```python
# Common source-type identifier for recording the original query used in web search
WEB_SEARCH_QUERY_SOURCE = "web_search_query"
```
This constant belongs with the web_search handler.
```python
def _format_search_results(results: List[SearchResult]) -> str:
    lines = [
```
Please make sure every function that lacks a docstring gets one.
```python
    @staticmethod
    def merge_system_content(
        messages: List[Message], system_content: str
    ) -> List[Message]:
        """Prepend or merge a system message with the given content.

        - If the first message is a system message, append the new content after a blank line.
        - Otherwise, create a new system message and prepend it.
        """
        if not messages:
            return [MessageAdapter.new_message(role="system", content=system_content)]

        first = MessageAdapter(raw=messages[0])
        if first.role == "system":
            existing = first.extract_text() or ""
            merged = (
                (existing + "\n\n" + system_content) if existing else system_content
            )
            new_first = MessageAdapter.new_message(role="system", content=merged)
            return [new_first] + [
                MessageAdapter(raw=m).to_openai_param() for m in messages[1:]
            ]

        return [MessageAdapter.new_message(role="system", content=system_content)] + [
            MessageAdapter(raw=m).to_openai_param() for m in messages
        ]
```
This is a static method that copies the list of messages to apply a change. Copying the list is not very productive here; the change should be applied to the messages in place, and the method should live on ChatCompletionRequest. As written, I don't think this makes total sense.
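One way to read this suggestion is as a mutating method on the request object. The `ChatCompletionRequest` shape and plain-dict message format below are assumptions for illustration, not the project's real classes:

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical sketch of an in-place merge on the request itself.
# The real ChatCompletionRequest and Message types in nilai_common differ.
@dataclass
class ChatCompletionRequest:
    messages: List[Dict[str, str]] = field(default_factory=list)

    def merge_system_content(self, system_content: str) -> None:
        """Merge system_content into this request's messages in place."""
        if self.messages and self.messages[0].get("role") == "system":
            existing = self.messages[0].get("content") or ""
            self.messages[0]["content"] = (
                f"{existing}\n\n{system_content}" if existing else system_content
            )
        else:
            self.messages.insert(0, {"role": "system", "content": system_content})
```

This avoids rebuilding the whole list and keeps the mutation next to the data it changes.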
```diff
 async def generate_search_query_from_llm(
-    user_message: str, model_name: str, client
+    user_message: str, model_name: str, client, *, topic: str | None = None
```
`client` has no type annotation in any of the function signatures.
```python
    logger.info("Analyze web search topics start model=%s", model_name)
    try:
        response = await client.chat.completions.create(**req)
```
When we do chat completions we use QueryLogManager.log_query to track completion-token usage; you should track the usage of this call somehow and pass it to the response object we then use. The new PR introduces the ability to track the number of web searches, and we would also need to track the number of queries made.
```diff
     if not content:
-        logger.error("LLM returned empty search query")
-        raise RuntimeError("LLM returned an empty search query")
+        logger.warning("LLM returned empty search query; falling back to user input")
+        content = _sanitize_query(user_message)
```
I think this comes from the fact that GPT OSS sometimes fails to answer. I would include `reasoning: low` in the system prompt to avoid this as much as possible.
jcabrero left a comment
In the interest of time, it's better for you to have the ability to merge. I approve the PR, but please make sure to apply all the changes.
…ystem content management
… type annotations
This pull request significantly enhances the web search capabilities by introducing topic-based multi-query web search, improving result enrichment with main page content extraction, and refactoring message handling for better integration with LLM-driven workflows.
- New agent that detects key information in the user prompt and determines whether a web search is necessary
- Launches X web searches and compiles them to obtain the final result