Skip to content

Comments

Fix Neo4j stale connection recovery in search path#18

Merged
manana2520 merged 1 commit intomainfrom
fix/neo4j-stale-connection-retry
Feb 22, 2026
Merged

Fix Neo4j stale connection recovery in search path#18
manana2520 merged 1 commit intomainfrom
fix/neo4j-stale-connection-retry

Conversation

@manana2520
Copy link
Contributor

Summary

  • Staging Slack bot returns "I couldn't find relevant information" when the Neo4j TCP connection goes stale after idle periods
  • The Neo4j driver's liveness check catches OSError/ServiceUnavailable/SessionExpired but RuntimeError from asyncio's transport layer (TCPTransport closed) escapes, silently returning zero search results
  • Added retry-with-reset at the GraphitiRetriever layer: on connection error, reset the singleton GraphitiClient and retry once with a fresh Neo4j connection

Changes

  • graphiti_retriever.py: Added _is_connection_error() helper + retry loop in search_chunks() and _lookup_episodes()
  • graphiti_client.py: Added reset_and_reconnect() method to close stale connection and clear singleton
  • config.py: Added NEO4J_SEARCH_MAX_RETRIES setting (default: 1)
  • tests/test_connection_retry.py: 16 unit tests covering error detection, retry behavior, and edge cases

Test plan

  • 16 unit tests pass (error detection, retry on connection error, no retry on regular error, exhausted retries)
  • 296 existing tests pass (no regressions)
  • CI deploys to staging, e2e tests pass
  • Manual: ask staging bot a question after idle period, verify response

The Slack bot returns "I couldn't find relevant information" when the
Neo4j TCP connection goes stale after idle periods. The Neo4j driver's
liveness check catches OSError/ServiceUnavailable/SessionExpired but
RuntimeError from asyncio's transport layer escapes, silently returning
zero search results.

Add retry-with-reset at the GraphitiRetriever layer:
- _is_connection_error() detects RuntimeError(TCPTransport), OSError,
  ServiceUnavailable, and SessionExpired
- search_chunks() and _lookup_episodes() retry once on connection error
  after resetting the GraphitiClient singleton (fresh Neo4j connection)
- NEO4J_SEARCH_MAX_RETRIES config setting (default: 1)
- GraphitiClient.reset_and_reconnect() closes stale connection and
  clears singleton state
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This Pull Request effectively addresses the critical issue of stale Neo4j connections, which was causing the Slack bot to fail with "I couldn't find relevant information" after idle periods. The solution is well-thought-out and robust.

Key strengths of this PR:

  • Clear Intent: The PR directly tackles the problem described in the title and description by implementing a retry-with-reset mechanism for connection errors.
  • Comprehensive Error Detection: The _is_connection_error helper correctly identifies various connection-related exceptions, including the specific RuntimeError related to TCPTransport closed that was observed in production.
  • Robust Retry Logic: The retry loops in both _lookup_episodes and search_chunks are correctly implemented, leveraging a configurable NEO4J_SEARCH_MAX_RETRIES setting. It correctly distinguishes between connection errors (which trigger a retry) and other application errors (which do not).
  • Effective Connection Reset: The reset_and_reconnect method in GraphitiClient correctly forces a re-initialization of the Graphiti instance and its underlying Neo4j driver, ensuring a fresh connection. Crucially, self._graphiti = None in the retriever's retry block ensures the cached Graphiti instance is cleared, forcing _get_graphiti to obtain a new one.
  • Thorough Unit Tests: The new unit tests cover various scenarios, including successful retries, exhausted retries, no retries for non-connection errors, and successful first attempts. This provides high confidence in the fix.
  • Architectural Alignment: The changes integrate seamlessly with the existing Graphiti client singleton pattern, configuration management, and async structure.

No significant issues or security concerns were found. The solution is performant, resilient, and well-tested.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security Review

Security Score: A | 2 INFO

Security Review Summary: Fix Neo4j Stale Connection Recovery

Decision: APPROVE ✅

This PR implements a robust retry mechanism for Neo4j connection failures in the search path. The implementation is security-sound with no credential exposure, authentication bypass risks, or injection vulnerabilities.

Key Security Strengths

  1. No Credential Exposure: The _is_connection_error() helper does not log, expose, or handle any credentials. NEO4J_PASSWORD remains protected.

  2. Proper Error Classification: The helper correctly distinguishes connection errors (which warrant retry) from application errors (which should not retry). This prevents masking real bugs.

  3. Slack Token & PII Protection: No changes to Slack bot integration, no new logging of SLACK_BOT_TOKEN or user PII (reporter_id, reporter_name).

  4. No Injection Risks: The Cypher query in _lookup_episodes() uses parameterized queries (WHERE ep.uuid IN $uuids) — no SQL/Cypher injection risk.

  5. Configuration Management: New NEO4J_SEARCH_MAX_RETRIES setting properly defined in config.py as a pydantic-settings field with reasonable default (1 retry).

  6. Connection Reset Handled: The reset_and_reconnect() method in GraphitiClient properly closes stale connections and clears the singleton, enabling clean reconnection.

Detailed Findings

OWASP A02 (Cryptographic Failures): No credentials are hardcoded, logged, or exposed in error paths. NEO4J_PASSWORD handling is unchanged.

OWASP A09 (Logging & Monitoring): Warning logs include exception messages but do NOT include credentials. The specific error cases (TCPTransport, ServiceUnavailable, SessionExpired) are logged at WARNING level, which is appropriate for infrastructure issues.

Pre-Scan Finding Analysis:

  • Flagged: api_key="vertex-ai-mode" — This is a placeholder string, not a real credential. It's explicitly documented as unused with custom Vertex AI client. This is ACCEPTED per ADR-003. NOT A FINDING.
  • Flagged: NEO4J_PASSWORD is required — This is an error message during initialization, not logging a credential value. PASS.
  • Flagged: Injection risks in graphiti_retriever.py — The code uses Graphiti's search() API, not raw Cypher injection. User queries are passed to Graphiti's safe search interface, not interpolated into Cypher. PASS.

Test Coverage

16 unit tests cover:

  • RuntimeError detection with TCPTransport
  • OSError and connection errors
  • Neo4j-specific exceptions (ServiceUnavailable, SessionExpired)
  • Retry behavior on connection errors
  • No retry on non-connection errors
  • Exhausted retries returning empty results
  • Successful first-attempt execution

All tests pass, no regressions in existing 296 tests.

Minor Notes

  • The _is_connection_error() helper uses string matching ("TCPTransport" in str(exc)). This is pragmatic given asyncio's generic RuntimeError. If asyncio changes error message format in future versions, this could fail silently. Document as a known limitation.
  • Exception messages in logs are appropriate for operational debugging without exposing secrets.

Acceptance Criteria Met

✅ No credential exposure in logs or error handling
✅ No auth bypass or privilege escalation
✅ No injection vulnerabilities
✅ Configuration properly externalized
✅ Comprehensive unit tests with retry logic verification
✅ No changes to Slack security or Neo4j access controls
✅ Follows established security patterns (pydantic-settings, error handling)

Recommendation: APPROVE for merge. Implementation is security-sound and addresses a real production issue (stale Neo4j connections after idle periods).

Findings

🔵 [INFO] Connection error detection uses string matching on exception message ('TCPTransport' in str(exc)). While this approach works, it's fragile if error message format changes.

File: src/knowledge_base/graph/graphiti_retriever.py:line 37
Category: Code Quality
Impact: If Neo4j driver or asyncio changes error message format, TCPTransport errors may not be detected and retries won't trigger.
Recommendation: Consider checking exception type hierarchy more robustly, though current approach is reasonable given asyncio's RuntimeError is generic. Document this as a known limitation.

🔵 [INFO] Warning log message includes exception object which may contain connection details. Generally safe but worth noting.

File: src/knowledge_base/graph/graphiti_retriever.py:line 246
Category: Logging & Monitoring
Impact: Exception details logged could theoretically reveal internal network topology in connection error cases.
Recommendation: Current implementation is acceptable. The warning is specific to connection retry scenarios and doesn't include credentials.

OWASP Top 10 Checklist

Category Status
A01 Access Control ✅ PASS
A02 Crypto Failures ✅ PASS
A03 Injection ✅ PASS
A04 Insecure Design ✅ PASS
A05 Misconfiguration ✅ PASS
A06 Vulnerable Components ✅ PASS
A07 Auth Failures ✅ PASS
A08 Integrity Failures ✅ PASS
A09 Logging Monitoring ✅ PASS
A10 Ssrf ✅ PASS

🤖 Security review powered by Claude

@manana2520 manana2520 merged commit f1dcd0d into main Feb 22, 2026
9 checks passed
@manana2520 manana2520 deleted the fix/neo4j-stale-connection-retry branch February 22, 2026 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant