fix(audit): defer row fetch in audit logs list query to avoid full-row scan#28851
yan-3005 wants to merge 2 commits into
✅ PR checks passed: the linked issue has a description and all required Shipping project fields set. Thanks!
Pull request overview
This PR optimizes the GET /api/v1/audit/logs list query to avoid full-row scans of audit_log_event (notably the large event_json) when paging over large time windows, by switching to a deferred-join (late row lookup) pattern that first selects the top-N ids and then joins back to fetch full columns only for those rows.
Changes:
- Rewrote `CollectionDAO.AuditLogDAO.list` SQL to `JOIN` against a limited inner subquery of `id`s, deferring full-row reads until after `LIMIT`.
- Updated `AuditLogRepository` to pass a qualified outer `ORDER BY` clause (`a.event_ts`, `a.id`) to avoid ambiguity after the join for both forward and backward pagination paths.
- Added integration tests ensuring forward/backward cursor pagination preserves strict `(eventTs DESC, id DESC)` ordering and page consistency with the deferred-join query.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Rewrites audit log list SQL to a deferred-join query and adds a new @Define for the qualified outer ORDER BY. |
| openmetadata-service/src/main/java/org/openmetadata/service/audit/AuditLogRepository.java | Introduces qualified order constants and wires them into DAO calls for both forward and backward pagination. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/AuditLogResourceIT.java | Adds regression tests validating ordering and cursor pagination equivalence under the deferred-join rewrite. |
🟡 Playwright Results: all passed (12 flaky)
✅ 4272 passed · ❌ 0 failed · 🟡 12 flaky · ⏭️ 88 skipped
🟡 12 flaky test(s) (passed on retry)
How to debug locally:
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
+ "actor_type, impersonated_by, service_name, "
+ "entity_type, entity_id, entity_fqn, entity_fqn_hash, event_json, search_text, created_at "
+ "FROM audit_log_event <condition> <orderClause> LIMIT :limit")
"SELECT a.id, a.change_event_id, a.event_ts, a.event_type, a.user_name, "
Code Review ✅ Approved
Optimizes the audit log list query using a deferred join to bypass full-row scans, reducing response time from 90+ seconds to milliseconds. No issues found.



Fixes #28850
Problem
`GET /api/v1/audit/logs` is extremely slow on large `audit_log_event` tables (millions of rows). A single page (`limit=25`) over a 24-hour window can take 90+ seconds, ~99% of it in the database, even though the request passes a bounded `limit` and the `event_ts` index exists and is used.
Root cause
The list query selects all columns, including the large `event_json` (LONGTEXT, ~16 KB/row), with `ORDER BY event_ts DESC, id DESC LIMIT n`. To satisfy `ORDER BY` + `LIMIT`, the engine does a non-covering index range scan: it reads the full row (including `event_json`) for every row in the time window, then trims to `limit`. For a ~112k-row window that is ~112k full-row reads per page, which is the entire cost.
`EXPLAIN ANALYZE` over a ~111k-row window:
`COUNT` over the same rows (index-only)
Fix
Deferred join / late row lookup: resolve the page of `id`s from the index first (index-only, no `event_json` read), then join back for the full columns of only the final page.
- The inner subquery keeps `<condition>` and all `@Bind` params unchanged; it is an index-only scan over `idx_audit_log_event_ts` that picks the top-N `id`s.
- The outer `ORDER BY` must be qualified (`a.event_ts`, `a.id`) because `id` is ambiguous after the join, so `ORDER_DESC_QUALIFIED` / `ORDER_ASC_QUALIFIED` were added alongside the existing inner-scope `ORDER_DESC` / `ORDER_ASC`. Both directions are needed (backward pagination sorts ASC then reverses in Java).
- A single `@SqlQuery`, valid on both MySQL and PostgreSQL (standard SQL).
Why results are identical (not just faster)
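The equivalence claimed in this section can be checked on a toy table. The following is a minimal sketch, not the PR's actual SQL: it uses Python's built-in `sqlite3` as a stand-in for MySQL/PostgreSQL, a cut-down `audit_log_event` schema with only four columns, and made-up data and window bounds. It runs the old-style direct query and a deferred-join query side by side and compares the pages.

```python
import sqlite3

# In-memory SQLite as a stand-in for MySQL/PostgreSQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_log_event (
        id         INTEGER PRIMARY KEY,   -- unique, non-null: makes the join 1:1
        event_ts   INTEGER NOT NULL,
        event_type TEXT    NOT NULL,
        event_json TEXT    NOT NULL       -- stands in for the large LONGTEXT payload
    )
    """
)
conn.execute("CREATE INDEX idx_audit_log_event_ts ON audit_log_event (event_ts)")

# 200 made-up rows; event_ts repeats (i // 4) so the id tie-breaker actually matters.
rows = [(i, 1_000 + i // 4, "entityUpdated", "x" * 64) for i in range(1, 201)]
conn.executemany("INSERT INTO audit_log_event VALUES (?, ?, ?, ?)", rows)

WINDOW = "WHERE event_ts BETWEEN :start AND :end"

# Old shape: full rows are selected, sorted, then trimmed to the limit.
DIRECT = f"""
    SELECT id, event_ts, event_type, event_json
    FROM audit_log_event {WINDOW}
    ORDER BY event_ts DESC, id DESC LIMIT :limit
"""

# Deferred join: pick the top-N ids first, then fetch full rows for only those ids.
# The outer ORDER BY is qualified with the alias because id exists on both sides.
DEFERRED = f"""
    SELECT a.id, a.event_ts, a.event_type, a.event_json
    FROM audit_log_event a
    JOIN (
        SELECT id FROM audit_log_event {WINDOW}
        ORDER BY event_ts DESC, id DESC LIMIT :limit
    ) p ON p.id = a.id
    ORDER BY a.event_ts DESC, a.id DESC
"""

params = {"start": 1_010, "end": 1_040, "limit": 25}
direct_page = conn.execute(DIRECT, params).fetchall()
deferred_page = conn.execute(DEFERRED, params).fetchall()

print("pages identical:", direct_page == deferred_page)
```

Dropping the outer `ORDER BY` from `DEFERRED` would still return the same set of rows, but the order would no longer be guaranteed, which is why the outer re-sort is part of the pattern.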
- `id` is the primary key (unique, non-null), so the join is strictly 1:1: no rows dropped, none duplicated.
- The inner subquery applies the same `WHERE` / `ORDER BY` / `LIMIT` as before, and `(event_ts, id)` is a total order, so the top-N is deterministic and identical to the old query.
- `JOIN` does not preserve subquery order, so the outer query re-sorts by the same keys, reproducing the original order exactly.
This is purely a performance change; the result set and ordering are unchanged. The only load-bearing assumption is `id` being a unique PK, which the schema guarantees.
Changes
- `CollectionDAO.AuditLogDAO.list`: deferred-join SQL + new `@Define("orderClauseQualified")`.
- `AuditLogRepository`: added `ORDER_DESC_QUALIFIED` / `ORDER_ASC_QUALIFIED` and passed them at both `list(...)` call sites (forward + backward pagination).
- `exportInBatches` delegates to `list`, so it is fixed automatically.
Tests
Added two integration tests in `AuditLogResourceIT` that exercise the deferred join with real multi-page data:
- `test_listAuditLogs_deferredJoin_forwardPaginationOrderingIsConsistent`: seeds an audit-event burst, pages forward through a fixed-`endTs` window, and asserts strict `(eventTs DESC, id DESC)` ordering and no duplicate ids across pages (covers `ORDER_DESC_QUALIFIED` + the join).
- `test_listAuditLogs_deferredJoin_backwardPaginationMatchesForward`: pages forward to obtain a `before` cursor, then backward, and asserts the backward page reproduces the forward page in identical order (covers `ORDER_ASC_QUALIFIED`).
These are correctness/regression guards. A "fails-without-the-fix" test is intentionally not included: the old query is correct, just slow, and the slowness only reproduces at multi-million-row scale that CI does not have, so a timing assertion would pass on the un-optimized query too and would be meaningless. The tests instead guard the real risk introduced by the rewrite (dropped/duplicated/reordered rows, ambiguous-`id` SQL), on both MySQL and PostgreSQL.
Follow-ups (not in this PR)
- Filters (`entity_type`, `event_type`) still scan within the window; add `(entity_type, event_ts)` / `(event_type, event_ts)` composite indexes if used at scale.
- `COUNT` could be made opt-in for cursor pagination.
- […] (`DataRetention` + `OPTIMIZE TABLE`).
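As a rough illustration of the first follow-up, the sketch below creates a `(entity_type, event_ts)` composite index on a cut-down table and asks the planner how it would run the inner id-selecting subquery with an `entity_type` filter. Everything here is an assumption for illustration: the index name `idx_audit_entity_type_ts` is hypothetical, the schema is reduced, and SQLite's planner (used via Python's `sqlite3` so the example is self-contained) is not the MySQL or PostgreSQL planner, so real plans should be checked with `EXPLAIN` on the target database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_log_event (
        id          INTEGER PRIMARY KEY,
        event_ts    INTEGER NOT NULL,
        entity_type TEXT,
        event_type  TEXT,
        event_json  TEXT
    )
    """
)
# Hypothetical composite index: equality column first, then the range/sort column.
conn.execute(
    "CREATE INDEX idx_audit_entity_type_ts ON audit_log_event (entity_type, event_ts)"
)

# Inner subquery of the deferred join, with an extra entity_type filter.
INNER = """
    SELECT id FROM audit_log_event
    WHERE entity_type = :entity_type AND event_ts BETWEEN :start AND :end
    ORDER BY event_ts DESC, id DESC LIMIT :limit
"""

params = {"entity_type": "table", "start": 1_000, "end": 2_000, "limit": 25}
plan = conn.execute("EXPLAIN QUERY PLAN " + INNER, params).fetchall()

# The last column of each plan row is the human-readable step description.
plan_text = " | ".join(str(row[-1]) for row in plan)
print(plan_text)
```

Putting the equality column ahead of `event_ts` lets one index narrow by type and by time window together, rather than scanning every row in the window and filtering afterwards.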