Skip to content

crt/crt_db error handling#3124

Open
liquidsec wants to merge 3 commits into
devfrom
crt-resilience
Open

crt/crt_db error handling#3124
liquidsec wants to merge 3 commits into
devfrom
crt-resilience

Conversation

@liquidsec
Copy link
Copy Markdown
Collaborator

Supersedes #3077 — bundles the fix @TrebledJ proposed there with an adjacent fix in crt_db that came out of the same audit.

crt: don't short-circuit 404 as "no data"

The default _api_response_is_success in BaseModule treats HTTP 404 as success because most APIs use 404 to mean "no records for this query." crt.sh's API returns 404 (and 503) transiently when overloaded — it's a real failure, not an empty-set signal. Today those responses pass through _api_response_is_success, skip the api_request retry loop, hit parse_results, and produce JSONDecodeError noise.

This overrides the success check in the crt module to drop the 404 special case:

def _api_response_is_success(self, r):
    return getattr(r, "is_success", False)

So 404 → counts as a failure → api_request retries with backoff → if still failing after api_failure_abort_threshold, the module enters error state (same path 503 already takes).

Credit to @TrebledJ for spotting this in #3077; this PR uses the corrected logic from the review there (#3077's diff inverted the membership check and unintentionally promoted 400/500/502/etc. to success — sending the abort threshold the wrong way).

crt_db: reconnect on dropped connections, bail on Postgres OOM

A real bbot scan log shows two flavors of repeating failure from crt_db:

Count Issue
asyncpg.InterfaceError: connection is closed 4483 The module opens one asyncpg connection on first request and reuses it for the rest of the scan. When the upstream (crt.sh's Postgres) drops the connection — restart, idle timeout, OOM — every subsequent db_conn.fetch() raises and the module is silently broken for the rest of the scan with no recovery path.
asyncpg.OutOfMemoryError 343 crt.sh's Postgres reports "out of shared memory." The module keeps hammering instead of backing off.

The fix:

  • Extract _connect() so the same parameters can be reused.
  • Before each query, check db_conn is None or db_conn.is_closed() and reopen.
  • Wrap fetch() so InterfaceError / ConnectionError clears self.db_conn (next call reconnects) and re-raises.
  • On OutOfMemoryError, call set_error_state(...) so the module backs off cleanly for the rest of the scan.

Added a regression test TestCRT_DB_Reconnect that drops the first connection mid-scan and asserts the module reconnects and continues producing results.

liquidsec added 2 commits May 22, 2026 09:06
- crt: override _api_response_is_success so 404 falls back to api_request's
  retry path instead of being treated as "no data" (crt.sh returns 404/503
  transiently when overloaded). Co-authored with @TrebledJ's #3077.
- crt_db: reconnect when the asyncpg connection is dropped mid-scan, and
  bail out via set_error_state on Postgres OutOfMemoryError instead of
  hammering the upstream.
@github-actions
Copy link
Copy Markdown
Contributor

🚀 Performance Benchmark Report

⚠️ No current benchmark data available

This might be because:

  • Benchmarks failed to run
  • No benchmark tests found
  • Dependencies missing

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 91.83673% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 90%. Comparing base (feacaba) to head (21688a5).

Files with missing lines Patch % Lines
bbot/modules/crt_db.py 75% 3 Missing ⚠️
...est/test_step_2/module_tests/test_module_crt_db.py 98% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #3124   +/-   ##
=====================================
- Coverage     90%     90%   -0%     
=====================================
  Files        441     441           
  Lines      38743   38781   +38     
=====================================
+ Hits       34663   34695   +32     
- Misses      4080    4086    +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@liquidsec liquidsec changed the title crt/crt_db: stop treating transient failures as terminal crt/crt_db error handling May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant