fix(audit): defer row fetch in audit logs list query to avoid full-row scan #28851

Open: yan-3005 wants to merge 2 commits into main from audit-logs-handoff-doc

Conversation

yan-3005 commented Jun 9, 2026 (Contributor)

Fixes #28850

Problem

GET /api/v1/audit/logs is extremely slow on large audit_log_event tables (millions of rows). A single page (limit=25) over a 24-hour window can take 90+ seconds, ~99% of it in the database — even though the request passes a bounded limit and the event_ts index exists and is used.

Root cause

The list query selects all columns, including the large event_json (LONGTEXT, ~16 KB/row), with ORDER BY event_ts DESC, id DESC LIMIT n. To satisfy ORDER BY + LIMIT, the engine does a non-covering index range scan — it reads the full row (including event_json) for every row in the time window, then trims to limit. For a ~112k-row window that is ~112k full-row reads per page, which is the entire cost.

EXPLAIN ANALYZE over the same ~112k-row window:

| Query | Full rows read | Time |
|---|---|---|
| Current list query | ~111,747 | ~153,000 ms |
| COUNT over the same rows (index-only) | 0 | ~21 ms |
| Deferred-join rewrite | 26 | ~180 ms |

Fix

Deferred join / late row lookup — resolve the page of ids from the index first (index-only, no event_json read), then join back for the full columns of only the final page:

SELECT a.<cols>
FROM audit_log_event a
JOIN (SELECT id FROM audit_log_event <condition> <orderClause> LIMIT :limit) k
  ON a.id = k.id
<orderClauseQualified>
  • Inner subquery keeps the existing <condition> and all @Bind params unchanged; it is an index-only scan over idx_audit_log_event_ts that picks the top-N ids.
  • Outer ORDER BY must be qualified (a.event_ts, a.id) because id is ambiguous after the join — added ORDER_DESC_QUALIFIED / ORDER_ASC_QUALIFIED alongside the existing inner-scope ORDER_DESC / ORDER_ASC. Both directions are needed (backward pagination sorts ASC then reverses in Java).
  • Single shared @SqlQuery, valid on both MySQL and PostgreSQL (standard SQL).
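
To make the inner/outer split concrete, here is a minimal sketch of how the four order-clause constants might slot into the template. The constant values, the shortened column list, and the `render` helper are assumptions for illustration; the PR names the constants but does not show their text, and the real substitution is done by the `@Define` mechanism rather than by string replacement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: plain string substitution standing in for the DAO's template handling. */
final class DeferredJoinSqlSketch {
  // Assumed values: the PR names these constants but does not show their contents.
  static final String ORDER_DESC = "ORDER BY event_ts DESC, id DESC";
  static final String ORDER_ASC = "ORDER BY event_ts ASC, id ASC";
  static final String ORDER_DESC_QUALIFIED = "ORDER BY a.event_ts DESC, a.id DESC";
  static final String ORDER_ASC_QUALIFIED = "ORDER BY a.event_ts ASC, a.id ASC";

  // Column list shortened for illustration; the real query selects every column.
  static final String TEMPLATE =
      "SELECT a.id, a.event_ts, a.event_json "
          + "FROM audit_log_event a "
          + "JOIN (SELECT id FROM audit_log_event <condition> <orderClause> LIMIT :limit) k "
          + "ON a.id = k.id "
          + "<orderClauseQualified>";

  /** Hypothetical helper: fills the three placeholders for one pagination direction. */
  static String render(String condition, boolean descending) {
    Map<String, String> defines = new LinkedHashMap<>();
    defines.put("condition", condition);
    // Inner scope: unqualified columns, because only one table is visible there.
    defines.put("orderClause", descending ? ORDER_DESC : ORDER_ASC);
    // Outer scope: qualified with alias a, because id exists on both sides of the join.
    defines.put("orderClauseQualified", descending ? ORDER_DESC_QUALIFIED : ORDER_ASC_QUALIFIED);

    String sql = TEMPLATE;
    for (Map.Entry<String, String> e : defines.entrySet()) {
      sql = sql.replace("<" + e.getKey() + ">", e.getValue());
    }
    return sql;
  }

  public static void main(String[] args) {
    // :startTs and :endTs are illustrative bind names, not taken from the PR.
    System.out.println(render("WHERE event_ts >= :startTs AND event_ts < :endTs", true));
  }
}
```

The inner and outer clauses always travel as a pair for a given direction, which is why the repository passes both at each call site.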

Why results are identical (not just faster)

  • id is the primary key (unique, non-null) → the join is strictly 1:1: no rows dropped, none duplicated.
  • The inner applies the same WHERE / ORDER BY / LIMIT as before, and (event_ts, id) is a total order, so the top-N is deterministic and identical to the old query.
  • A JOIN does not preserve subquery order, so the outer re-sorts by the same keys — reproducing the original order exactly.

This is purely a performance change; the result set and ordering are unchanged. The only load-bearing assumption is id being a unique PK, which the schema guarantees.
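
The equivalence argument can be checked on a toy in-memory model. The sketch below is not the DAO code: `Row`, `Key`, and both query methods are hypothetical stand-ins, with a list of rows playing the table and a projection to `(eventTs, id)` playing the index. It deliberately scrambles the joined rows before the final sort to mimic a join that does not preserve subquery order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DeferredJoinEquivalenceSketch {
  /** Stand-in for an audit_log_event row; eventJson plays the large column. */
  record Row(long id, long eventTs, String eventJson) {}

  /** Stand-in for an index entry: no large column involved. */
  record Key(long eventTs, long id) {}

  // (event_ts DESC, id DESC): a total order because id is unique.
  static final Comparator<Row> ROW_DESC =
      Comparator.comparingLong(Row::eventTs).thenComparingLong(Row::id).reversed();
  static final Comparator<Key> KEY_DESC =
      Comparator.comparingLong(Key::eventTs).thenComparingLong(Key::id).reversed();

  /** Old shape: filter, sort, and limit over full rows. */
  static List<Row> oldQuery(List<Row> table, long startTs, long endTs, int limit) {
    return table.stream()
        .filter(r -> r.eventTs() >= startTs && r.eventTs() < endTs)
        .sorted(ROW_DESC)
        .limit(limit)
        .toList();
  }

  /** New shape: pick the top-N ids from keys only, join back by id, then re-sort. */
  static List<Row> deferredJoin(List<Row> table, long startTs, long endTs, int limit) {
    List<Long> ids = table.stream()
        .map(r -> new Key(r.eventTs(), r.id()))
        .filter(k -> k.eventTs() >= startTs && k.eventTs() < endTs)
        .sorted(KEY_DESC)
        .limit(limit)
        .map(Key::id)
        .toList();
    // toMap throws on a duplicate key, mirroring the unique-PK assumption.
    Map<Long, Row> byId = table.stream().collect(Collectors.toMap(Row::id, r -> r));
    List<Row> joined = new ArrayList<>();
    for (Long id : new HashSet<>(ids)) { // HashSet iteration: order is not guaranteed
      joined.add(byId.get(id));
    }
    joined.sort(ROW_DESC); // the outer ORDER BY restores the original order
    return joined;
  }

  /** Ten rows with tied timestamps so the id tiebreaker matters. */
  static List<Row> sampleTable() {
    List<Row> rows = new ArrayList<>();
    for (long id = 1; id <= 10; id++) {
      rows.add(new Row(id, 1000 + id / 3, "{\"payload\":" + id + "}"));
    }
    return rows;
  }

  public static void main(String[] args) {
    List<Row> table = sampleTable();
    System.out.println(oldQuery(table, 1001, 1004, 4).equals(deferredJoin(table, 1001, 1004, 4)));
  }
}
```

If `id` were not unique, the `toMap` call would fail, which is the in-memory analogue of the one load-bearing assumption stated above.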

Changes

  • CollectionDAO.AuditLogDAO.list — deferred-join SQL + new @Define("orderClauseQualified").
  • AuditLogRepository — added ORDER_DESC_QUALIFIED / ORDER_ASC_QUALIFIED and passed them at both list(...) call sites (forward + backward pagination). exportInBatches delegates to list, so it is fixed automatically.

Tests

Added two integration tests in AuditLogResourceIT that exercise the deferred join with real multi-page data:

  • test_listAuditLogs_deferredJoin_forwardPaginationOrderingIsConsistent — seeds an audit-event burst, pages forward through a fixed-endTs window, and asserts strict (eventTs DESC, id DESC) ordering and no duplicate ids across pages (covers ORDER_DESC_QUALIFIED + the join).
  • test_listAuditLogs_deferredJoin_backwardPaginationMatchesForward — pages forward to obtain a before cursor, then backward, and asserts the backward page reproduces the forward page in identical order (covers ORDER_ASC_QUALIFIED).

These are correctness/regression guards. A "fails-without-the-fix" test is intentionally not included: the old query is correct, just slow, and the slowness only reproduces at multi-million-row scale that CI does not have — so a timing assertion would pass on the un-optimized query too and would be meaningless. The tests instead guard the real risk introduced by the rewrite (dropped/duplicated/reordered rows, ambiguous-id SQL), on both MySQL and PostgreSQL.
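
As a rough illustration of what those guards check, the sketch below expresses the three properties (strict ordering, no duplicate ids across pages, and backward pages matching forward pages after reversal) as plain Java helpers. The `Event` record and helper names are hypothetical and are not the code in AuditLogResourceIT.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PaginationChecksSketch {
  /** Hypothetical minimal view of a returned audit log entry. */
  record Event(long eventTs, long id) {}

  /** True when each entry sorts strictly before its successor under (eventTs DESC, id DESC). */
  static boolean isStrictlyDescending(List<Event> events) {
    for (int i = 1; i < events.size(); i++) {
      Event prev = events.get(i - 1);
      Event cur = events.get(i);
      boolean ordered =
          prev.eventTs() > cur.eventTs()
              || (prev.eventTs() == cur.eventTs() && prev.id() > cur.id());
      if (!ordered) {
        return false;
      }
    }
    return true;
  }

  /** True when no id appears twice anywhere across the given pages. */
  static boolean hasNoDuplicateIds(List<List<Event>> pages) {
    Set<Long> seen = new HashSet<>();
    for (List<Event> page : pages) {
      for (Event e : page) {
        if (!seen.add(e.id())) {
          return false;
        }
      }
    }
    return true;
  }

  /** Backward pagination fetches in ASC order; reversing yields the DESC page shown to callers. */
  static List<Event> reverseToDescending(List<Event> ascendingPage) {
    List<Event> copy = new ArrayList<>(ascendingPage);
    Collections.reverse(copy);
    return copy;
  }

  public static void main(String[] args) {
    List<Event> forward = List.of(new Event(5, 2), new Event(5, 1), new Event(4, 9));
    List<Event> backwardAsc = List.of(new Event(4, 9), new Event(5, 1), new Event(5, 2));
    System.out.println(isStrictlyDescending(forward));
    System.out.println(reverseToDescending(backwardAsc).equals(forward));
  }
}
```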

Follow-ups (not in this PR)

  • Unindexed filters (entity_type, event_type) still scan within the window — add (entity_type, event_ts) / (event_type, event_ts) composite indexes if used at scale.
  • Per-request total COUNT could be made opt-in for cursor pagination.
  • Table size / disk reclamation is a separate retention concern (DataRetention + OPTIMIZE TABLE).

Copilot AI review requested due to automatic review settings June 9, 2026 06:45
yan-3005 added the bug and governance labels Jun 9, 2026
yan-3005 self-assigned this Jun 9, 2026
github-actions Bot commented Jun 9, 2026 (Contributor)

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

yan-3005 added the safe to test and To release labels Jun 9, 2026

Copilot AI left a comment (Contributor)

Pull request overview

This PR optimizes the GET /api/v1/audit/logs list query to avoid full-row scans of audit_log_event (notably the large event_json) when paging over large time windows, by switching to a deferred-join (late row lookup) pattern that first selects the top-N ids and then joins back to fetch full columns only for those rows.

Changes:

  • Rewrote CollectionDAO.AuditLogDAO.list SQL to JOIN against a limited inner subquery of ids, deferring full-row reads until after LIMIT.
  • Updated AuditLogRepository to pass a qualified outer ORDER BY clause (a.event_ts, a.id) to avoid ambiguity after the join for both forward and backward pagination paths.
  • Added integration tests ensuring forward/backward cursor pagination preserves strict (eventTs DESC, id DESC) ordering and page consistency with the deferred-join query.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Rewrites audit log list SQL to a deferred-join query and adds a new @Define for the qualified outer ORDER BY. |
| openmetadata-service/src/main/java/org/openmetadata/service/audit/AuditLogRepository.java | Introduces qualified order constants and wires them into DAO calls for both forward and backward pagination. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/AuditLogResourceIT.java | Adds regression tests validating ordering and cursor pagination equivalence under the deferred-join rewrite. |

github-actions Bot commented Jun 9, 2026 (Contributor)

🟡 Playwright Results — all passed (12 flaky)

✅ 4272 passed · ❌ 0 failed · 🟡 12 flaky · ⏭️ 88 skipped

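| Shard | Passed | Failed | Flaky | Skipped |
|---|---|---|---|---|
| 🟡 Shard 1 | 300 | 0 | 1 | 4 |
| 🟡 Shard 2 | 802 | 0 | 4 | 9 |
| ✅ Shard 3 | 808 | 0 | 0 | 8 |
| ✅ Shard 4 | 843 | 0 | 0 | 12 |
| 🟡 Shard 5 | 719 | 0 | 2 | 47 |
| 🟡 Shard 6 | 800 | 0 | 5 | 8 |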
🟡 12 flaky test(s) (passed on retry)
  • Features/DataAssetRulesDisabled.spec.ts › Verify the Chart entity item action after rules disabled (shard 1, 1 retry)
  • Features/BulkImport.spec.ts › Table (shard 2, 1 retry)
  • Features/DataQuality/ColumnLevelTests.spec.ts › Column Values To Be Not In Set (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with only VIEW cannot PATCH results (shard 2, 1 retry)
  • Pages/Entity.spec.ts › Announcement create, edit & delete (shard 5, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

+ "actor_type, impersonated_by, service_name, "
+ "entity_type, entity_id, entity_fqn, entity_fqn_hash, event_json, search_text, created_at "
+ "FROM audit_log_event <condition> <orderClause> LIMIT :limit")
"SELECT a.id, a.change_event_id, a.event_ts, a.event_type, a.user_name, "

Collaborator

Why do we need this join, @yan-3005?

gitar-bot Bot commented Jun 10, 2026

Code Review ✅ Approved

Optimizes the audit log list query using a deferred join to bypass full-row scans, reducing response time from 90+ seconds to milliseconds. No issues found.



Labels

bug, safe to test, To release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Audit Logs list API (GET /api/v1/audit/logs) is extremely slow on large audit_log_event tables

3 participants