backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_3_reduce.py

@@ -5,12 +5,12 @@

from onyx.agents.agent_search.dr.sub_agents.states import SubAgentMainState
from onyx.agents.agent_search.dr.sub_agents.states import SubAgentUpdate
-from onyx.agents.agent_search.dr.utils import chunks_or_sections_to_search_docs
from onyx.agents.agent_search.shared_graph_utils.utils import (
    get_langgraph_node_log_string,
)
from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
from onyx.context.search.models import SavedSearchDoc
+from onyx.context.search.models import SearchDoc
@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Group imports from the same module into a single statement to reduce duplication and improve readability.
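
As a sketch, the grouped form the comment asks for would look like this (illustrative only; the exact import list depends on what the module actually uses):

# Before: one import per name
# from onyx.context.search.models import SavedSearchDoc
# from onyx.context.search.models import SearchDoc
# After: a single grouped import
from onyx.context.search.models import SavedSearchDoc, SearchDoc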

DEV MODE: This violation would have been filtered out by screening filters. Failing filters: commentPurpose, functionalImpact, objectivity.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Stylistic-only import grouping suggestion with no functional or maintainability impact; per criteria, filter out low-importance style issues.

Prompt for AI agents: Address the comment above on backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_3_reduce.py at line 13.

<file context>
@@ -5,12 +5,12 @@
 )
 from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
 from onyx.context.search.models import SavedSearchDoc
+from onyx.context.search.models import SearchDoc
 from onyx.server.query_and_chat.streaming_models import SectionEnd
 from onyx.utils.logger import setup_logger
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Group imports from the same module into a single statement to reduce duplication and improve readability.

DEV MODE: This violation would have been filtered out by screening filters. Failing filters: commentPurpose, functionalImpact, objectivity.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Technically correct but purely stylistic (import grouping). No functional, performance, or maintainability impact in this context; not worth reporting per criteria.

Prompt for AI agents: Address the comment above on backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_3_reduce.py at line 13.

<file context>
@@ -5,12 +5,12 @@
 )
 from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
 from onyx.context.search.models import SavedSearchDoc
+from onyx.context.search.models import SearchDoc
 from onyx.server.query_and_chat.streaming_models import SectionEnd
 from onyx.utils.logger import setup_logger
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Duplicate imports from the same module on separate lines. Combine into a single import for clarity and style consistency.

DEV MODE: This violation would have been filtered out by screening filters. Failing filters: functionalImpact, objectivity.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Purely stylistic duplicate import from same module; no functional or maintainability impact. Filter out per selective criteria.

Prompt for AI agents: Address the comment above on backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_3_reduce.py at line 13.

<file context>
@@ -5,12 +5,12 @@
 )
 from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
 from onyx.context.search.models import SavedSearchDoc
+from onyx.context.search.models import SearchDoc
 from onyx.server.query_and_chat.streaming_models import SectionEnd
 from onyx.utils.logger import setup_logger
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Duplicate imports from the same module on separate lines. Combine into a single import for clarity and style consistency.

DEV MODE: This violation would have been filtered out by screening filters. Failing filters: functionalImpact, objectivity.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Separate imports of different names from the same module are valid and common in Python. This is purely stylistic with no functional or maintainability impact; per guidelines, such low-impact style issues should be filtered out.

Prompt for AI agents: Address the comment above on backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_3_reduce.py at line 13.

<file context>
@@ -5,12 +5,12 @@
 )
 from onyx.agents.agent_search.shared_graph_utils.utils import write_custom_event
 from onyx.context.search.models import SavedSearchDoc
+from onyx.context.search.models import SearchDoc
 from onyx.server.query_and_chat.streaming_models import SectionEnd
 from onyx.utils.logger import setup_logger
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent


from onyx.server.query_and_chat.streaming_models import SectionEnd
from onyx.utils.logger import setup_logger

@@ -47,7 +47,7 @@ def is_reducer(
doc_list.append(x)

# Convert InferenceSections to SavedSearchDocs
-search_docs = chunks_or_sections_to_search_docs(doc_list)
+search_docs = SearchDoc.chunks_or_sections_to_search_docs(doc_list)
retrieved_saved_search_docs = [
SavedSearchDoc.from_search_doc(search_doc, db_doc_id=0)
for search_doc in search_docs
4 changes: 2 additions & 2 deletions backend/onyx/agents/agent_search/dr/utils.py
@@ -13,7 +13,7 @@
)
from onyx.context.search.models import InferenceSection
from onyx.context.search.models import SavedSearchDoc
-from onyx.context.search.utils import chunks_or_sections_to_search_docs
+from onyx.context.search.models import SearchDoc
from onyx.tools.tool_implementations.web_search.web_search_tool import (
WebSearchTool,
)
@@ -266,7 +266,7 @@ def convert_inference_sections_to_search_docs(
is_internet: bool = False,
) -> list[SavedSearchDoc]:
# Convert InferenceSections to SavedSearchDocs
-    search_docs = chunks_or_sections_to_search_docs(inference_sections)
+    search_docs = SearchDoc.chunks_or_sections_to_search_docs(inference_sections)
@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Conversion logic from InferenceSection to SavedSearchDoc with default db_doc_id=0 duplicates functionality in backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_3_reduce.py and backend/onyx/chat/chat_utils.py. This logic should be centralized into a shared utility function.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Low-impact and partially inaccurate. Code already uses centralized classmethods (SearchDoc.chunks_or_sections_to_search_docs and SavedSearchDoc.from_search_doc). Remaining repetition is a trivial list comprehension with db_doc_id=0; not substantive duplication to justify a shared helper.
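
For what it's worth, the shared utility the comment asks for would be small — a sketch under the assumption that it simply wraps the two existing classmethods (the helper name is hypothetical):

from onyx.context.search.models import InferenceSection
from onyx.context.search.models import SavedSearchDoc
from onyx.context.search.models import SearchDoc


def inference_sections_to_saved_search_docs(
    sections: list[InferenceSection],
    db_doc_id: int = 0,
) -> list[SavedSearchDoc]:
    # Convert to SearchDocs, then wrap each one as a SavedSearchDoc
    search_docs = SearchDoc.chunks_or_sections_to_search_docs(sections)
    return [
        SavedSearchDoc.from_search_doc(doc, db_doc_id=db_doc_id)
        for doc in search_docs
    ]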

Prompt for AI agents: Address the comment above on backend/onyx/agents/agent_search/dr/utils.py at line 269.

<file context>
@@ -266,7 +266,7 @@ def convert_inference_sections_to_search_docs(
 ) -&gt; list[SavedSearchDoc]:
     # Convert InferenceSections to SavedSearchDocs
-    search_docs = chunks_or_sections_to_search_docs(inference_sections)
+    search_docs = SearchDoc.chunks_or_sections_to_search_docs(inference_sections)
     for search_doc in search_docs:
         search_doc.is_internet = is_internet
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Unmapped Agent in getAgentNameFromViolationSource (undefined)

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Conversion logic from InferenceSection to SavedSearchDoc with default db_doc_id=0 duplicates functionality in backend/onyx/agents/agent_search/dr/sub_agents/basic_search/dr_basic_search_3_reduce.py and backend/onyx/chat/chat_utils.py. This logic should be centralized into a shared utility function.

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/agents/agent_search/dr/utils.py at line 269.

<file context>
@@ -266,7 +266,7 @@ def convert_inference_sections_to_search_docs(
 ) -&gt; list[SavedSearchDoc]:
     # Convert InferenceSections to SavedSearchDocs
-    search_docs = chunks_or_sections_to_search_docs(inference_sections)
+    search_docs = SearchDoc.chunks_or_sections_to_search_docs(inference_sections)
     for search_doc in search_docs:
         search_doc.is_internet = is_internet
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Unmapped Agent in getAgentNameFromViolationSource (undefined)


for search_doc in search_docs:
search_doc.is_internet = is_internet

2 changes: 1 addition & 1 deletion backend/onyx/agents/agent_search/orchestration/states.py
@@ -1,6 +1,6 @@
from pydantic import BaseModel

-from onyx.chat.prompt_builder.answer_prompt_builder import PromptSnapshot
+from onyx.chat.prompt_builder.schemas import PromptSnapshot
from onyx.tools.message import ToolCallSummary
from onyx.tools.models import SearchToolOverrideKwargs
from onyx.tools.models import ToolCallFinalResult
3 changes: 2 additions & 1 deletion backend/onyx/background/celery/tasks/docprocessing/tasks.py
@@ -86,7 +86,6 @@
from onyx.file_store.document_batch_storage import get_document_batch_storage
from onyx.httpx.httpx_pool import HttpxPool
from onyx.indexing.embedder import DefaultIndexingEmbedder
-from onyx.indexing.indexing_pipeline import run_indexing_pipeline
from onyx.natural_language_processing.search_nlp_models import EmbeddingModel
from onyx.natural_language_processing.search_nlp_models import (
InformationContentClassificationModel,
@@ -1268,6 +1267,8 @@ def _docprocessing_task(
tenant_id: str,
batch_num: int,
) -> None:
+    from onyx.indexing.indexing_pipeline import run_indexing_pipeline
@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Heartbeat timeout is mutated per-loop and persists across attempts, delaying failure detection for others.

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/background/celery/tasks/docprocessing/tasks.py at line 1270.

<file context>
@@ -1268,6 +1267,8 @@ def _docprocessing_task(
     tenant_id: str,
     batch_num: int,
 ) -&gt; None:
+    from onyx.indexing.indexing_pipeline import run_indexing_pipeline
+
     start_time = time.monotonic()
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Heartbeat timeout is mutated per-loop and persists across attempts, delaying failure detection for others.

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/background/celery/tasks/docprocessing/tasks.py at line 1270.

<file context>
@@ -1268,6 +1267,8 @@ def _docprocessing_task(
     tenant_id: str,
     batch_num: int,
 ) -&gt; None:
+    from onyx.indexing.indexing_pipeline import run_indexing_pipeline
+
     start_time = time.monotonic()
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent



start_time = time.monotonic()

if tenant_id:
6 changes: 4 additions & 2 deletions backend/onyx/background/indexing/run_docfetching.py
@@ -28,7 +28,6 @@
from onyx.connectors.connector_runner import ConnectorRunner
from onyx.connectors.exceptions import ConnectorValidationError
from onyx.connectors.exceptions import UnexpectedValidationError
-from onyx.connectors.factory import instantiate_connector
from onyx.connectors.interfaces import CheckpointedConnector
from onyx.connectors.models import ConnectorFailure
from onyx.connectors.models import ConnectorStopSignal
@@ -66,7 +65,6 @@
from onyx.httpx.httpx_pool import HttpxPool
from onyx.indexing.embedder import DefaultIndexingEmbedder
from onyx.indexing.indexing_heartbeat import IndexingHeartbeatInterface
-from onyx.indexing.indexing_pipeline import run_indexing_pipeline
from onyx.natural_language_processing.search_nlp_models import (
InformationContentClassificationModel,
)
@@ -100,6 +98,8 @@ def _get_connector_runner(
are the complete list of existing documents of the connector. If the task
of type LOAD_STATE, the list will be considered complete and otherwise incomplete.
"""
+    from onyx.connectors.factory import instantiate_connector
@cubic-dev-ai cubic-dev-ai bot Sep 24, 2025

Local import placed outside try in _get_connector_runner allows ImportError to escape and leaves index attempt stuck (not marked failed).
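
A minimal sketch of the fix implied here — the function body below is paraphrased, not the real one, but the point is that the lazy import moves inside the existing try so an ImportError takes the same failure path as any other connector-setup error:

from onyx.utils.logger import setup_logger

logger = setup_logger()


def _get_connector_runner(db_session, attempt):  # signature abbreviated
    task = attempt.connector_credential_pair.connector.input_type
    try:
        # Inside the try: an ImportError is now caught here, so the
        # attempt can be marked failed instead of the error escaping.
        from onyx.connectors.factory import instantiate_connector

        connector = instantiate_connector(db_session, attempt)  # illustrative call
    except Exception:
        logger.exception("Unable to instantiate connector.")
        raise
    return connector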

Prompt for AI agents: Address the comment above on backend/onyx/background/indexing/run_docfetching.py at line 101.

<file context>
@@ -100,6 +98,8 @@ def _get_connector_runner(
     are the complete list of existing documents of the connector. If the task
     of type LOAD_STATE, the list will be considered complete and otherwise incomplete.
     &quot;&quot;&quot;
+    from onyx.connectors.factory import instantiate_connector
+
     task = attempt.connector_credential_pair.connector.input_type
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Early exceptions before try/except in connector_document_extraction cause stuck IN_PROGRESS attempts and MemoryTracer not stopped due to lazy import in _get_connector_runner.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Technically valid but redundant; subsumed by the broader issue that connector_document_extraction’s outer try/finally starts too late. Addressing that covers this and other early exceptions.

Prompt for AI agents: Address the comment above on backend/onyx/background/indexing/run_docfetching.py at line 101.

<file context>
@@ -100,6 +98,8 @@ def _get_connector_runner(
     are the complete list of existing documents of the connector. If the task
     of type LOAD_STATE, the list will be considered complete and otherwise incomplete.
     &quot;&quot;&quot;
+    from onyx.connectors.factory import instantiate_connector
+
     task = attempt.connector_credential_pair.connector.input_type
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Early exceptions before try/except in connector_document_extraction cause stuck IN_PROGRESS attempts and MemoryTracer not stopped due to lazy import in _get_connector_runner.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Filter as overlapping with #1 and contains a minor inaccuracy: MemoryTracer starts after get_document_batch_storage, so a failure there wouldn’t leak the tracer. The core issue (early exceptions before the try/finally) is better captured by #1.

Prompt for AI agents: Address the comment above on backend/onyx/background/indexing/run_docfetching.py at line 101.

<file context>
@@ -100,6 +98,8 @@ def _get_connector_runner(
     are the complete list of existing documents of the connector. If the task
     of type LOAD_STATE, the list will be considered complete and otherwise incomplete.
     &quot;&quot;&quot;
+    from onyx.connectors.factory import instantiate_connector
+
     task = attempt.connector_credential_pair.connector.input_type
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-staging cubic-staging bot Sep 26, 2025

Lazy import placed outside error handling; ImportError would bypass the existing try/except, leading to unhandled failure instead of graceful connector pause/logging.

Prompt for AI agents: Address the comment above on backend/onyx/background/indexing/run_docfetching.py at line 101.

<file context>
@@ -100,6 +98,8 @@ def _get_connector_runner(
     are the complete list of existing documents of the connector. If the task
     of type LOAD_STATE, the list will be considered complete and otherwise incomplete.
     &quot;&quot;&quot;
+    from onyx.connectors.factory import instantiate_connector
+
     task = attempt.connector_credential_pair.connector.input_type
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent

@cubic-staging cubic-staging bot Sep 26, 2025

Import of instantiate_connector outside try/except causes unhandled setup exceptions, leaving attempts stuck IN_PROGRESS and skipping CCPair pause.

Prompt for AI agents: Address the comment above on backend/onyx/background/indexing/run_docfetching.py at line 101.

<file context>
@@ -100,6 +98,8 @@ def _get_connector_runner(
     are the complete list of existing documents of the connector. If the task
     of type LOAD_STATE, the list will be considered complete and otherwise incomplete.
     &quot;&quot;&quot;
+    from onyx.connectors.factory import instantiate_connector
+
     task = attempt.connector_credential_pair.connector.input_type
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-staging cubic-staging bot Sep 26, 2025

Import inside _get_connector_runner occurs outside its try/except, so import-time errors bypass pause logic and, since caller’s try is later, leave attempt IN_PROGRESS.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: The import sits outside the local try/except, but the stated impact is wrong. The watchdog (docfetching_proxy_task/process_job_result) marks attempts FAILED on early errors, so attempts won’t be stuck IN_PROGRESS. Pausing on import errors (dependency/code issues) is also undesirable. Low impact; filter.

Prompt for AI agents: Address the comment above on backend/onyx/background/indexing/run_docfetching.py at line 101.

<file context>
@@ -100,6 +98,8 @@ def _get_connector_runner(
     are the complete list of existing documents of the connector. If the task
     of type LOAD_STATE, the list will be considered complete and otherwise incomplete.
     &quot;&quot;&quot;
+    from onyx.connectors.factory import instantiate_connector
+
     task = attempt.connector_credential_pair.connector.input_type
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent



task = attempt.connector_credential_pair.connector.input_type

try:
@@ -283,6 +283,8 @@ def _run_indexing(
2. Embed and index these documents into the chosen datastore (vespa)
3. Updates Postgres to record the indexed documents + the outcome of this run
"""
+    from onyx.indexing.indexing_pipeline import run_indexing_pipeline
@cubic-staging cubic-staging bot Sep 26, 2025

Import is outside the function’s exception handling; an ImportError would escape the try/except that manages indexing failures.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Although the import precedes the try/except in _run_indexing, this function is explicitly marked as legacy/for comparison and has no call sites in this file. Impact is likely negligible; no MemoryTracer is started before the import. Given medium sensitivity and desire to avoid false positives, this is too low-impact/uncertain to report.

Prompt for AI agents: Address the comment above on backend/onyx/background/indexing/run_docfetching.py at line 286.

<file context>
@@ -283,6 +283,8 @@ def _run_indexing(
     2. Embed and index these documents into the chosen datastore (vespa)
     3. Updates Postgres to record the indexed documents + the outcome of this run
     &quot;&quot;&quot;
+    from onyx.indexing.indexing_pipeline import run_indexing_pipeline
+
     start_time = time.monotonic()  # jsut used for logging
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent



start_time = time.monotonic() # jsut used for logging

with get_session_with_current_tenant() as db_session_temp:
5 changes: 0 additions & 5 deletions backend/onyx/chat/prompt_builder/answer_prompt_builder.py
@@ -4,7 +4,6 @@
from langchain_core.messages import BaseMessage
from langchain_core.messages import HumanMessage
from langchain_core.messages import SystemMessage
-from pydantic import BaseModel
from pydantic.v1 import BaseModel as BaseModel__v1

from onyx.chat.models import PromptConfig
@@ -196,10 +195,6 @@ def build(self) -> list[BaseMessage]:


# Stores some parts of a prompt builder as needed for tool calls
-class PromptSnapshot(BaseModel):
-    raw_message_history: list[PreviousMessage]
-    raw_user_query: str
-    built_prompt: list[BaseMessage]


# TODO: rename this? AnswerConfig maybe?
10 changes: 10 additions & 0 deletions backend/onyx/chat/prompt_builder/schemas.py
@@ -0,0 +1,10 @@
+from langchain_core.messages import BaseMessage
+from pydantic import BaseModel
+
+from onyx.llm.models import PreviousMessage
@cubic-staging cubic-staging bot Sep 26, 2025

Module-level import of onyx.llm.models triggers heavy LangChain/LLM dependencies at import time, undermining lazy-loading and increasing memory footprint. Use TYPE_CHECKING and forward references to avoid runtime import, or move this type to a lighter module.
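
A sketch of the TYPE_CHECKING pattern the comment suggests. One caveat worth noting: Pydantic resolves field annotations at runtime, so a TYPE_CHECKING-only import leaves the model "not fully defined" until model_rebuild() is called with the real type in scope — it defers the heavy import but adds a rebuild step (the helper below, including the underscore-prefixed _types_namespace argument, is an assumption-laden sketch, not the PR's code):

from __future__ import annotations

from typing import TYPE_CHECKING

from langchain_core.messages import BaseMessage
from pydantic import BaseModel

if TYPE_CHECKING:
    # Only imported for type checking; not loaded at runtime
    from onyx.llm.models import PreviousMessage


class PromptSnapshot(BaseModel):
    raw_message_history: list[PreviousMessage]
    raw_user_query: str
    built_prompt: list[BaseMessage]


def _rebuild_prompt_snapshot() -> None:
    # Deferred: perform the heavy import only when the model is first
    # needed, then resolve the forward reference.
    from onyx.llm.models import PreviousMessage

    PromptSnapshot.model_rebuild(
        _types_namespace={"PreviousMessage": PreviousMessage}
    )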

Prompt for AI agents: Address the comment above on backend/onyx/chat/prompt_builder/schemas.py at line 4.

<file context>
@@ -0,0 +1,10 @@
+from langchain_core.messages import BaseMessage
+from pydantic import BaseModel
+
+from onyx.llm.models import PreviousMessage
+
+
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent




+class PromptSnapshot(BaseModel):
`PromptSnapshot` includes LangChain `BaseMessage` instances but does not enable pydantic's `arbitrary_types_allowed`, risking validation/runtime errors when constructing the model. (Based on your team's feedback about LangChain using pydantic v1 and the need to allow arbitrary types when nesting LangChain models inside pydantic v2 models, as seen in existing patterns.)

Prompt for AI agents: Address the comment above on backend/onyx/chat/prompt_builder/schemas.py at line 7.

<file context>
@@ -0,0 +1,10 @@
+from onyx.llm.models import PreviousMessage
+
+
+class PromptSnapshot(BaseModel):
+    raw_message_history: list[PreviousMessage]
+    raw_user_query: str
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent

+    raw_message_history: list[PreviousMessage]
+    raw_user_query: str
+    built_prompt: list[BaseMessage]
@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Pydantic v2 model includes list[BaseMessage] (external type) without enabling arbitrary_types_allowed; this can raise validation errors when instantiating PromptSnapshot with langchain messages.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: LangChain BaseMessage is a Pydantic model; list[BaseMessage] fields validate without arbitrary_types_allowed in Pydantic v2. No concrete evidence of instantiation errors, and similar config in other files likely targets non-Pydantic types. High risk of false positive.

Libraries consulted: Pydantic v2 arbitrary_types_allowed, LangChain BaseMessage messages, Pydantic, Langchain, Python_langchain

Prompt for AI agents: Address the comment above on backend/onyx/chat/prompt_builder/schemas.py at line 10.

<file context>
@@ -0,0 +1,10 @@
+class PromptSnapshot(BaseModel):
+    raw_message_history: list[PreviousMessage]
+    raw_user_query: str
+    built_prompt: list[BaseMessage]
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Pydantic v2 model includes list[BaseMessage] (external type) without enabling arbitrary_types_allowed; this can raise validation errors when instantiating PromptSnapshot with langchain messages.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Uncertain. While the repo pins Pydantic v2 and uses arbitrary_types_allowed in several models, v2 generally supports arbitrary external types without explicit config, and documentation evidence here doesn’t conclusively show that a BaseModel field of list[BaseMessage] will fail validation. Without a reproducible error or usage context showing failure, this is too speculative to report.

Libraries consulted: Pydantic v2 arbitrary types allowed, Pydantic

Prompt for AI agents: Address the comment above on backend/onyx/chat/prompt_builder/schemas.py at line 10.

<file context>
@@ -0,0 +1,10 @@
+class PromptSnapshot(BaseModel):
+    raw_message_history: list[PreviousMessage]
+    raw_user_query: str
+    built_prompt: list[BaseMessage]
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent

@cubic-staging cubic-staging bot Sep 26, 2025

Pydantic v2 model lacks arbitrary_types_allowed while using non-Pydantic type BaseMessage, causing schema/instantiation errors.
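
If that diagnosis held, the conventional fix is a one-line config, sketched below — though note that the GPT-5 filter verdicts elsewhere in this thread conclude BaseMessage is itself a Pydantic model in recent LangChain versions, which would make the setting unnecessary (it is harmless either way):

from langchain_core.messages import BaseMessage
from pydantic import BaseModel, ConfigDict

from onyx.llm.models import PreviousMessage


class PromptSnapshot(BaseModel):
    # Permit field types Pydantic cannot build a schema for
    model_config = ConfigDict(arbitrary_types_allowed=True)

    raw_message_history: list[PreviousMessage]
    raw_user_query: str
    built_prompt: list[BaseMessage]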

Prompt for AI agents: Address the comment above on backend/onyx/chat/prompt_builder/schemas.py at line 10.

<file context>
@@ -0,0 +1,10 @@
+class PromptSnapshot(BaseModel):
+    raw_message_history: list[PreviousMessage]
+    raw_user_query: str
+    built_prompt: list[BaseMessage]
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-staging cubic-staging bot Sep 26, 2025

Missing arbitrary_types_allowed for field using external type BaseMessage; PromptSnapshot will fail validation on instantiation.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Likely false positive. LangChain’s BaseMessage and subclasses are Pydantic models in recent versions, so list[BaseMessage] validates without arbitrary_types_allowed. Other files enabling arbitrary types don’t prove it’s needed here, and no failing instantiation is shown. High risk of false positive; filter it out.

Libraries consulted: pydantic arbitrary_types_allowed v2, langchain_core BaseMessage, Pydantic, Langchain

Prompt for AI agents: Address the comment above on backend/onyx/chat/prompt_builder/schemas.py at line 10.

<file context>
@@ -0,0 +1,10 @@
+class PromptSnapshot(BaseModel):
+    raw_message_history: list[PreviousMessage]
+    raw_user_query: str
+    built_prompt: list[BaseMessage]
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-staging cubic-staging bot Sep 26, 2025

Pydantic model lacks arbitrary_types_allowed for list[BaseMessage], causing schema/validation error

Prompt for AI agents: Address the comment above on backend/onyx/chat/prompt_builder/schemas.py at line 10.

<file context>
@@ -0,0 +1,10 @@
+class PromptSnapshot(BaseModel):
+    raw_message_history: list[PreviousMessage]
+    raw_user_query: str
+    built_prompt: list[BaseMessage]
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent


2 changes: 1 addition & 1 deletion backend/onyx/chat/tool_handling/tool_response_handler.py
@@ -7,7 +7,7 @@
from onyx.chat.models import ResponsePart
from onyx.chat.prompt_builder.answer_prompt_builder import AnswerPromptBuilder
from onyx.chat.prompt_builder.answer_prompt_builder import LLMCall
-from onyx.chat.prompt_builder.answer_prompt_builder import PromptSnapshot
+from onyx.chat.prompt_builder.schemas import PromptSnapshot
from onyx.llm.interfaces import LLM
from onyx.tools.force import ForceUseTool
from onyx.tools.message import build_tool_message
39 changes: 39 additions & 0 deletions backend/onyx/context/search/models.py
@@ -1,3 +1,4 @@
+from collections.abc import Sequence
from datetime import datetime
from typing import Any

@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
secondary_owners: list[str] | None = None
is_internet: bool = False

+    @classmethod
+    def chunks_or_sections_to_search_docs(
+        cls,
+        items: "Sequence[InferenceChunk | InferenceSection] | None",
@cubic-dev-ai cubic-dev-ai bot Sep 24, 2025

Invalid type annotation: mixing a string literal with | None will raise a TypeError at import. Use a real Sequence[...] | None type or quote the entire annotation.

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 362.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+    @classmethod
+    def chunks_or_sections_to_search_docs(
+        cls,
+        items: &quot;Sequence[InferenceChunk | InferenceSection] | None&quot;,
+    ) -&gt; list[&quot;SearchDoc&quot;]:
+        &quot;&quot;&quot;Convert a sequence of InferenceChunk or InferenceSection objects to SearchDoc objects.&quot;&quot;&quot;
</file context>

[internal] Confidence score: 10/10

[internal] Posted by: General AI Review Agent

Suggested change:
-        items: "Sequence[InferenceChunk | InferenceSection] | None",
+        items: Sequence[InferenceChunk | InferenceSection] | None,

@cubic-staging cubic-staging bot Sep 26, 2025

Stringified type annotation is unnecessary here; use a real annotation so the imported Sequence is actually used and annotations remain introspectable, preventing potential unused-import warnings.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: String annotations here are acceptable and consistent with the forward reference in the return type. No functional or maintainability issue is demonstrated; the unused-import warning concern is speculative. This is a stylistic nit, so it should be filtered out.

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 362.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+    @classmethod
+    def chunks_or_sections_to_search_docs(
+        cls,
+        items: &quot;Sequence[InferenceChunk | InferenceSection] | None&quot;,
+    ) -&gt; list[&quot;SearchDoc&quot;]:
+        &quot;&quot;&quot;Convert a sequence of InferenceChunk or InferenceSection objects to SearchDoc objects.&quot;&quot;&quot;
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent

Suggested change:
-        items: "Sequence[InferenceChunk | InferenceSection] | None",
+        items: Sequence[InferenceChunk | InferenceSection] | None,

) -> list["SearchDoc"]:
"""Convert a sequence of InferenceChunk or InferenceSection objects to SearchDoc objects."""
if not items:
return []

search_docs = [
cls(
document_id=(
chunk := (
item.center_chunk
if isinstance(item, InferenceSection)
else item
)
).document_id,
chunk_ind=chunk.chunk_id,
semantic_identifier=chunk.semantic_identifier or "Unknown",
link=chunk.source_links[0] if chunk.source_links else None,
@cubic-dev-ai cubic-dev-ai bot Sep 24, 2025

Potential KeyError: source_links is a dict; accessing [0] assumes key 0 exists. Use .get(0) to safely retrieve the first link by that key or return None.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Access pattern source_links[0] appears to be a project-wide invariant when source_links is set; low likelihood of KeyError and widely used elsewhere. Not worth reporting.
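
For reference, the defensive variants proposed across these comments look like this (a sketch; whether either is needed turns on the project-wide invariant that key 0 exists whenever source_links is non-empty):

# Option 1: keep the key-0 convention but fail soft
link = chunk.source_links.get(0) if chunk.source_links else None

# Option 2: take the first value regardless of its key
link = next(iter(chunk.source_links.values()), None) if chunk.source_links else None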

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or &quot;Unknown&quot;,
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent

two-agent-filter: Directly accessing chunk.source_links[0] assumes key 0 exists on a dict[int, str]; if source_links is present but lacks key 0, this raises KeyError. Use a safe first-value retrieval or get(0). (Based on your team's feedback about consolidating conversion into SearchDoc to avoid brittle assumptions, this influenced checking for safer link extraction.)

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or "Unknown",
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent

single-agent-filter: Directly accessing chunk.source_links[0] assumes key 0 exists on a dict[int, str]; if source_links is present but lacks key 0, this raises KeyError. Use a safe first-value retrieval or get(0). (Based on your team's feedback about consolidating conversion into SearchDoc to avoid brittle assumptions, this influenced checking for safer link extraction.)

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or "Unknown",
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: General AI Review Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Directly indexing source_links[0] can raise KeyError when the dict exists but lacks key 0; use .get(0) for safety.

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or &quot;Unknown&quot;,
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Directly indexing source_links[0] can raise KeyError when the dict exists but lacks key 0; use .get(0) for safety.

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or &quot;Unknown&quot;,
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 8/10

[internal] Posted by: General AI Review Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

two-agent-filter: Dict index by fixed key 0 can raise KeyError when source_links lacks key 0.

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or &quot;Unknown&quot;,
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-dev-local cubic-dev-local bot Sep 25, 2025

single-agent-filter: Dict index by fixed key 0 can raise KeyError when source_links lacks key 0.

Libraries consulted:

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or &quot;Unknown&quot;,
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-staging cubic-staging bot Sep 26, 2025

Direct indexing chunk.source_links[0] can raise KeyError when dict lacks key 0, causing runtime failure.

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or &quot;Unknown&quot;,
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

@cubic-staging cubic-staging bot Sep 26, 2025

KeyError risk: source_links is dict, code assumes key 0 exists and indexes like a list

Prompt for AI agents: Address the comment above on backend/onyx/context/search/models.py at line 379.

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or &quot;Unknown&quot;,
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

Fix with Cubic

Copy link

@cubic-staging cubic-staging bot Sep 26, 2025

Indexing source_links by key 0 can raise KeyError; source_links is a dict[int, str], not guaranteed to contain 0.

Prompt for AI agents
Address the following comment on backend/onyx/context/search/models.py at line 379:

<comment>Indexing source_links by key 0 can raise KeyError; source_links is a dict[int, str], not guaranteed to contain 0.</comment>

<file context>
@@ -355,6 +356,44 @@ class SearchDoc(BaseModel):
+                ).document_id,
+                chunk_ind=chunk.chunk_id,
+                semantic_identifier=chunk.semantic_identifier or "Unknown",
+                link=chunk.source_links[0] if chunk.source_links else None,
+                blurb=chunk.blurb,
+                source_type=chunk.source_type,
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

Suggested change
-                link=chunk.source_links[0] if chunk.source_links else None,
+                link=(next(iter(chunk.source_links.values())) if chunk.source_links else None),
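For reference, here is a minimal sketch of a defensive accessor (a hypothetical helper, not part of this PR). It assumes source_links maps int character offsets to URLs and picks the link at the smallest offset:

def first_source_link(source_links: dict[int, str] | None) -> str | None:
    # No links recorded for this chunk.
    if not source_links:
        return None
    # Keys are offsets into the chunk text, so the smallest offset is the
    # natural "first" link; this avoids assuming that key 0 exists, which
    # is exactly what raises the KeyError flagged above.
    return source_links[min(source_links)]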

blurb=chunk.blurb,
source_type=chunk.source_type,
boost=chunk.boost,
hidden=chunk.hidden,
metadata=chunk.metadata,
score=chunk.score,
match_highlights=chunk.match_highlights,
updated_at=chunk.updated_at,
primary_owners=chunk.primary_owners,
secondary_owners=chunk.secondary_owners,
is_internet=False,
)
for item in items
]

return search_docs

def model_dump(self, *args: list, **kwargs: dict[str, Any]) -> dict[str, Any]: # type: ignore
initial_dict = super().model_dump(*args, **kwargs) # type: ignore
initial_dict["updated_at"] = (
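Assuming the new classmethod keeps the signature of the deleted utils helper (see the backend/onyx/context/search/utils.py removal below), call sites would migrate roughly as in this sketch, where items stands in for any sequence of chunks or sections:

from onyx.context.search.models import SearchDoc

# Before: free function in onyx.context.search.utils
# search_docs = chunks_or_sections_to_search_docs(items)

# After: classmethod on the model itself, no separate utils import needed
search_docs = SearchDoc.chunks_or_sections_to_search_docs(items)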
34 changes: 0 additions & 34 deletions backend/onyx/context/search/utils.py
@@ -118,40 +118,6 @@ def inference_section_from_chunks(
)


def chunks_or_sections_to_search_docs(
items: Sequence[InferenceChunk | InferenceSection] | None,
) -> list[SearchDoc]:
if not items:
return []

search_docs = [
SearchDoc(
document_id=(
chunk := (
item.center_chunk if isinstance(item, InferenceSection) else item
)
).document_id,
chunk_ind=chunk.chunk_id,
semantic_identifier=chunk.semantic_identifier or "Unknown",
link=chunk.source_links[0] if chunk.source_links else None,
blurb=chunk.blurb,
source_type=chunk.source_type,
boost=chunk.boost,
hidden=chunk.hidden,
metadata=chunk.metadata,
score=chunk.score,
match_highlights=chunk.match_highlights,
updated_at=chunk.updated_at,
primary_owners=chunk.primary_owners,
secondary_owners=chunk.secondary_owners,
is_internet=False,
)
for item in items
]

return search_docs


def remove_stop_words_and_punctuation(keywords: list[str]) -> list[str]:
try:
# Re-tokenize using the NLTK tokenizer for better matching
5 changes: 2 additions & 3 deletions backend/onyx/db/chat.py
@@ -34,7 +34,6 @@
from onyx.context.search.models import RetrievalDocs
from onyx.context.search.models import SavedSearchDoc
from onyx.context.search.models import SearchDoc as ServerSearchDoc
from onyx.context.search.utils import chunks_or_sections_to_search_docs
from onyx.db.models import AgentSearchMetrics
from onyx.db.models import AgentSubQuery
from onyx.db.models import AgentSubQuestion
@@ -57,7 +56,7 @@
from onyx.server.query_and_chat.models import ChatMessageDetail
from onyx.server.query_and_chat.models import SubQueryDetail
from onyx.server.query_and_chat.models import SubQuestionDetail
from onyx.tools.tool_runner import ToolCallFinalResult
from onyx.tools.models import ToolCallFinalResult
from onyx.utils.logger import setup_logger
from onyx.utils.special_types import JSON_ro

@@ -1147,7 +1146,7 @@ def log_agent_sub_question_results(
db_session.add(sub_query_object)
db_session.commit()

search_docs = chunks_or_sections_to_search_docs(
search_docs = ServerSearchDoc.chunks_or_sections_to_search_docs(
sub_query.retrieved_documents
)
for doc in search_docs:
26 changes: 20 additions & 6 deletions backend/onyx/file_processing/extract_file_text.py
@@ -15,14 +15,12 @@
from typing import Any
from typing import IO
from typing import NamedTuple
from typing import Optional
from typing import TYPE_CHECKING
from zipfile import BadZipFile

import chardet
import openpyxl
from markitdown import FileConversionException
from markitdown import MarkItDown
from markitdown import StreamInfo
from markitdown import UnsupportedFormatException
from PIL import Image
from pypdf import PdfReader
from pypdf.errors import PdfStreamError
@@ -37,6 +35,8 @@
from onyx.utils.file_types import WORD_PROCESSING_MIME_TYPE
from onyx.utils.logger import setup_logger

if TYPE_CHECKING:
from markitdown import MarkItDown

@cubic-dev-ai cubic-dev-ai bot Sep 24, 2025

Lazy import of markitdown lacks ImportError handling; docx/pptx processing will crash if the dependency is missing. Add a graceful fallback.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: markitdown is a declared dependency; ImportError is unlikely. Outer try/except in _extract_text_and_images prevents crashes; extract_file_text’s raising is intentional. Not high-impact.

Prompt for AI agents
Address the following comment on backend/onyx/file_processing/extract_file_text.py at line 39:

<comment>Lazy import of `markitdown` lacks ImportError handling; docx/pptx processing will crash if the dependency is missing. Add a graceful fallback.

        DEV MODE: This violation would have been filtered out by GPT-5.
Reasoning:
• **GPT-5**: markitdown is a declared dependency; ImportError is unlikely. Outer try/except in _extract_text_and_images prevents crashes; extract_file_text’s raising is intentional. Not high-impact.</comment>

<file context>
@@ -37,6 +35,8 @@
 from onyx.utils.logger import setup_logger
 
+if TYPE_CHECKING:
+    from markitdown import MarkItDown
 logger = setup_logger()
 
</file context>

[internal] Confidence score: 7/10

[internal] Posted by: General AI Review Agent


@cubic-staging cubic-staging bot Sep 26, 2025

Unhandled ImportError in get_markitdown_converter causes docx/pptx processing to raise RuntimeError via extract_file_text instead of graceful fallback.

Prompt for AI agents
Address the following comment on backend/onyx/file_processing/extract_file_text.py at line 39:

<comment>Unhandled ImportError in get_markitdown_converter causes docx/pptx processing to raise RuntimeError via extract_file_text instead of graceful fallback.</comment>

<file context>
@@ -37,6 +35,8 @@
 from onyx.utils.logger import setup_logger
 
+if TYPE_CHECKING:
+    from markitdown import MarkItDown
 logger = setup_logger()
 
</file context>

[internal] Confidence score: 9/10

[internal] Posted by: Functional Bugs Agent

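A sketch of the graceful fallback these comments ask for. It is hedged: it changes the function's contract so callers must handle None, and the name and log message are illustrative rather than part of this PR:

def get_markitdown_converter_or_none() -> Optional["MarkItDown"]:
    global _MARKITDOWN_CONVERTER
    try:
        from markitdown import MarkItDown
    except ImportError:
        # Dependency missing: let the docx/pptx paths fall back to
        # plain-text extraction instead of crashing.
        logger.warning("markitdown is not installed; skipping rich conversion")
        return None
    if _MARKITDOWN_CONVERTER is None:
        _MARKITDOWN_CONVERTER = MarkItDown(enable_plugins=False)
    return _MARKITDOWN_CONVERTER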

logger = setup_logger()

# NOTE(rkuo): Unify this with upload_files_for_chat and file_valiation.py
@@ -85,16 +85,18 @@
"image/webp",
]

_MARKITDOWN_CONVERTER: MarkItDown | None = None
_MARKITDOWN_CONVERTER: Optional["MarkItDown"] = None

KNOWN_OPENPYXL_BUGS = [
"Value must be either numerical or a string containing a wildcard",
"File contains no valid workbook part",
]


def get_markitdown_converter() -> MarkItDown:
def get_markitdown_converter() -> "MarkItDown":
global _MARKITDOWN_CONVERTER
from markitdown import MarkItDown

if _MARKITDOWN_CONVERTER is None:
_MARKITDOWN_CONVERTER = MarkItDown(enable_plugins=False)
return _MARKITDOWN_CONVERTER
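The hunk above lands on a common pattern: quote the annotation, import the real class only under TYPE_CHECKING, and defer the runtime import to first call. A generic sketch of that pattern, where heavy_library and HeavyClass are hypothetical names:

from typing import Optional, TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; costs nothing at runtime.
    from heavy_library import HeavyClass

_INSTANCE: Optional["HeavyClass"] = None

def get_instance() -> "HeavyClass":
    global _INSTANCE
    from heavy_library import HeavyClass  # deferred until first use
    if _INSTANCE is None:
        _INSTANCE = HeavyClass()
    return _INSTANCE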
@@ -358,6 +360,12 @@ def docx_to_text_and_images(
The images list returned is empty in this case.
"""
md = get_markitdown_converter()
from markitdown import (

@cubic-dev-ai cubic-dev-ai bot Sep 24, 2025

Runtime import of markitdown exceptions/classes lacks ImportError handling; docx extraction will error instead of falling back.

    DEV MODE: This violation would have been filtered out by GPT-5.

Reasoning:
GPT-5: Lazy import in docx_to_text_and_images would be caught by outer handler; with required dependency, ImportError concerns aren’t actionable.

Prompt for AI agents
Address the following comment on backend/onyx/file_processing/extract_file_text.py at line 363:

<comment>Runtime import of `markitdown` exceptions/classes lacks ImportError handling; docx extraction will error instead of falling back.

        DEV MODE: This violation would have been filtered out by GPT-5.
Reasoning:
• **GPT-5**: Lazy import in docx_to_text_and_images would be caught by outer handler; with required dependency, ImportError concerns aren’t actionable.</comment>

<file context>
@@ -358,6 +360,12 @@ def docx_to_text_and_images(
     The images list returned is empty in this case.
    """
     md = get_markitdown_converter()
+    from markitdown import (
+        StreamInfo,
+        FileConversionException,
</file context>

[internal] Confidence score: 7/10

[internal] Posted by: General AI Review Agent


StreamInfo,
FileConversionException,
UnsupportedFormatException,
)

try:
doc = md.convert(
to_bytesio(file), stream_info=StreamInfo(mimetype=WORD_PROCESSING_MIME_TYPE)
@@ -394,6 +402,12 @@ def docx_to_text_and_images(

def pptx_to_text(file: IO[Any], file_name: str = "") -> str:
md = get_markitdown_converter()
from markitdown import (
StreamInfo,
FileConversionException,
UnsupportedFormatException,
)

stream_info = StreamInfo(
mimetype=PRESENTATION_MIME_TYPE, filename=file_name or None, extension=".pptx"
)
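The pptx body is collapsed past this point, but given the imports above it presumably mirrors the docx path. A hedged sketch of that shape, assuming markitdown's convert result exposes text_content and that degrading to an empty string is the desired fallback:

def pptx_to_text_sketch(file: IO[Any], file_name: str = "") -> str:
    md = get_markitdown_converter()
    stream_info = StreamInfo(
        mimetype=PRESENTATION_MIME_TYPE, filename=file_name or None, extension=".pptx"
    )
    try:
        doc = md.convert(to_bytesio(file), stream_info=stream_info)
        return doc.text_content
    except (FileConversionException, UnsupportedFormatException) as e:
        # Degrade instead of crashing on unsupported or corrupt files.
        logger.warning(f"markitdown failed to convert pptx {file_name}: {e}")
        return ""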