-
-
Notifications
You must be signed in to change notification settings - Fork 743
Fix: Add comprehensive telemetry cleanup to prevent agent termination hang #1002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1300,19 +1300,27 @@ def chat(self, prompt, temperature=0.2, tools=None, output_json=None, output_pyd | |
| # Apply guardrail validation for custom LLM response | ||
| try: | ||
| validated_response = self._apply_guardrail_with_retry(response_text, prompt, temperature, tools, task_name, task_description, task_id) | ||
| # Ensure proper cleanup of telemetry system to prevent hanging | ||
| self._cleanup_telemetry() | ||
| return validated_response | ||
| except Exception as e: | ||
| logging.error(f"Agent {self.name}: Guardrail validation failed for custom LLM: {e}") | ||
| # Rollback chat history on guardrail failure | ||
| self.chat_history = self.chat_history[:chat_history_length] | ||
| # Ensure proper cleanup of telemetry system to prevent hanging | ||
| self._cleanup_telemetry() | ||
| return None | ||
| except Exception as e: | ||
| # Rollback chat history if LLM call fails | ||
| self.chat_history = self.chat_history[:chat_history_length] | ||
| display_error(f"Error in LLM chat: {e}") | ||
| # Ensure proper cleanup of telemetry system to prevent hanging | ||
| self._cleanup_telemetry() | ||
| return None | ||
| except Exception as e: | ||
| display_error(f"Error in LLM chat: {e}") | ||
| # Ensure proper cleanup of telemetry system to prevent hanging | ||
| self._cleanup_telemetry() | ||
| return None | ||
| else: | ||
| # Use the new _build_messages helper method | ||
|
|
@@ -1396,11 +1404,15 @@ def chat(self, prompt, temperature=0.2, tools=None, output_json=None, output_pyd | |
| validated_reasoning = self._apply_guardrail_with_retry(response.choices[0].message.reasoning_content, original_prompt, temperature, tools, task_name, task_description, task_id) | ||
| # Execute callback after validation | ||
| self._execute_callback_and_display(original_prompt, validated_reasoning, time.time() - start_time, task_name, task_description, task_id) | ||
| # Ensure proper cleanup of telemetry system to prevent hanging | ||
| self._cleanup_telemetry() | ||
| return validated_reasoning | ||
| except Exception as e: | ||
| logging.error(f"Agent {self.name}: Guardrail validation failed for reasoning content: {e}") | ||
| # Rollback chat history on guardrail failure | ||
| self.chat_history = self.chat_history[:chat_history_length] | ||
| # Ensure proper cleanup of telemetry system to prevent hanging | ||
| self._cleanup_telemetry() | ||
| return None | ||
| # Apply guardrail to regular response | ||
| try: | ||
|
|
@@ -1849,6 +1861,8 @@ async def achat(self, prompt: str, temperature=0.2, tools=None, output_json=None | |
| if logging.getLogger().getEffectiveLevel() == logging.DEBUG: | ||
| total_time = time.time() - start_time | ||
| logging.debug(f"Agent.achat failed in {total_time:.2f} seconds: {str(e)}") | ||
| # Ensure proper cleanup of telemetry system to prevent hanging | ||
| self._cleanup_telemetry() | ||
| return None | ||
|
Comment on lines
+1864
to
1866
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change adds the cleanup call to the final exception handler, but the fix for Examples of missed paths that don't call
Similar to my comment on the Recommended Solution (using an async decorator): # Add this decorator at the top of the file or in a utils module
from functools import wraps
def async_with_telemetry_cleanup(func):
@wraps(func)
async def wrapper(self, *args, **kwargs):
try:
return await func(self, *args, **kwargs)
finally:
self._cleanup_telemetry()
return wrapper
# Then, apply it to the method and remove all manual _cleanup_telemetry() calls
@async_with_telemetry_cleanup
async def achat(self, ...):
# ... existing logic ...This will make the fix comprehensive and prevent future regressions. |
||
|
|
||
| async def _achat_completion(self, response, tools, reasoning_steps=False): | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While adding
_cleanup_telemetry()here fixes the hang for this specific path, this approach has two issues:Incompleteness: The fix for
chat()is not comprehensive. There are still return paths in theself_reflectblock that are missing the cleanup call (e.g., lines 1494, 1499, 1512, 1517), which means the agent can still hang in those scenarios.Maintainability: Repeating
_cleanup_telemetry()before everyreturnis error-prone. It's easy for a developer to add a new return path in the future and forget this call, re-introducing the bug.A more robust solution is to ensure cleanup happens regardless of how the method exits. You can achieve this using a
try...finallyblock around the entire method's body, or even more cleanly with a decorator.Recommended Solution (using a decorator):
This would fix all missed paths and make the code much cleaner and safer.