Coalesce DMLs across concurrent queueables to reduce row-lock contention #35

@Mateusz7410

Problem

Today many independent queueables fire concurrently, and each ends in its own DML against children of a shared parent (lookup or master-detail). Salesforce locks the parent record while a child is written, so concurrent writers collide and fail with UNABLE_TO_LOCK_ROW. Lookup and ownership skew amplify the problem. Retry (Issue #N) is a safety net but doesn't address the root cause: N small transactions racing for the same parent lock instead of one batched commit.

Proposal

Treat this as a design spike — not a green-lit implementation. Explore whether the Async-lib can intercept DMLs at the end of independent queueables and coalesce them into a single batched commit within a time window (or other grouping signal).

The producer side should require zero changes to existing queueables: devs have already written jobs like MyJob that end in `update records;`. The framework should be able to opt such a job into "deferred DML" mode without rewriting it.

```apex
// Sketch: devs don't redesign their jobs, they just opt in
Async.queueable(new MyJob(args))
    .deferDmlVia(AccountDmlCoalescer.class)
    .enqueue();
```

The coalescer collects pending DML requests from N producer jobs within a configurable window, sorts them by parent ID, deduplicates them, and commits once.
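
To make the merge step concrete, here is a minimal sketch of what "deduplicate, sort by parent ID, commit once" could look like for updates. Everything in it is hypothetical: `DmlCoalescerSketch` and `coalesce` are illustration names, not part of Async-lib, and it assumes all pending records share one SObject type, already have an Id, and have the parent field populated. "Last request wins" is just one possible dedupe policy.

```apex
// Hypothetical sketch; none of these names exist in Async-lib today.
public with sharing class DmlCoalescerSketch {

    // Merges pending update requests from many producers into one ordered list.
    // parentField is the lookup or master-detail field to group by, e.g. Contact.AccountId.
    public static List<SObject> coalesce(List<SObject> pending, Schema.SObjectField parentField) {
        // Deduplicate by record Id. The last request for a record wins;
        // field-level merging is a design choice this sketch does not make.
        Map<Id, SObject> latestById = new Map<Id, SObject>();
        for (SObject rec : pending) {
            latestById.put(rec.Id, rec);
        }

        // Group by parent so all children of one parent sit next to each other.
        Map<Id, List<SObject>> byParent = new Map<Id, List<SObject>>();
        for (SObject rec : latestById.values()) {
            Id parentId = (Id) rec.get(parentField);
            if (!byParent.containsKey(parentId)) {
                byParent.put(parentId, new List<SObject>());
            }
            byParent.get(parentId).add(rec);
        }

        // Emit the groups in ascending parent Id order.
        List<Id> parentIds = new List<Id>(byParent.keySet());
        parentIds.sort();
        List<SObject> ordered = new List<SObject>();
        for (Id parentId : parentIds) {
            ordered.addAll(byParent.get(parentId));
        }
        return ordered;
    }
}
```

A flush would then be a single statement such as `update DmlCoalescerSketch.coalesce(pendingContacts, Contact.AccountId);`, where `pendingContacts` is whatever the cross-transaction queue hands back.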

Mitigations to evaluate FIRST (cheaper, ship before spike)

Before building coalescing infra, prove these aren't enough:

  1. Sort by parent ID before DML — kills most lock contention in practice. Could be a .sortByParent(Field) helper in DML-lib.
  2. Partial commit + retry failed subset: Database.update(records, false), then retry only the locked rows. Pairs naturally with Issue #N.
  3. .dedupe() option on individual queueables — drop duplicate work in same job.

If 1+2+3 still leaves measurable contention, the coalescer is justified.
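
For reference, mitigation 2 can be sketched roughly as below. `PartialCommitSketch` and `updateAndCollectLocked` are hypothetical names; the only platform pieces used are `Database.update(records, false)`, `Database.SaveResult`, and the `StatusCode.UNABLE_TO_LOCK_ROW` enum value. The sketch relies on the save results coming back in the same order as the input list, and it assumes lock failures surface as per-row errors rather than as an exception.

```apex
// Hypothetical sketch of "partial commit + retry the locked rows".
public with sharing class PartialCommitSketch {

    // Updates with allOrNone = false and returns only the records that
    // failed with UNABLE_TO_LOCK_ROW, so the caller can retry just those.
    public static List<SObject> updateAndCollectLocked(List<SObject> records) {
        List<SObject> lockedRecords = new List<SObject>();
        List<Database.SaveResult> results = Database.update(records, false);

        // results[i] corresponds to records[i].
        for (Integer i = 0; i < results.size(); i++) {
            if (results[i].isSuccess()) {
                continue;
            }
            for (Database.Error err : results[i].getErrors()) {
                if (err.getStatusCode() == StatusCode.UNABLE_TO_LOCK_ROW) {
                    lockedRecords.add(records[i]);
                    break;
                }
            }
        }
        return lockedRecords;
    }
}
```

Rows that fail for other reasons (validation rules, missing access) are deliberately not returned here, since retrying them would not help. How the locked subset is retried (same transaction, re-enqueue, backoff) is left to the retry work in Issue #N.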

Open Questions

  • Cross-transaction queue mechanism — Async-lib internal state, Platform Event, custom object queue, Platform Cache? Open to options; PE was one idea, not a requirement.
  • Grouping key — per SObjectType, per parent ID, per coalescer class, per business window?
  • Time/size window — fixed delay, max batch size, both?
  • Single-runner enforcement — how do we guarantee only one coalescer flushes a given group at a time?
  • Failure semantics — if coalesced DML partially fails, how do we attribute per-producer logs back?
  • Interplay with retry framework — does retry happen at producer level or coalescer level?
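
On the failure-semantics question, one possible starting point is to carry the producer's job Id next to each record and map save results back by index. This is only a sketch under assumptions: `AttributionSketch` and `PendingDml` are hypothetical names, `producerJobId` is assumed to be the Id returned when the producer was enqueued, and no deduplication has been applied (if it had, one record could belong to several producers and the mapping would need to be one-to-many).

```apex
// Hypothetical sketch: attribute per-row failures back to the producing job.
public with sharing class AttributionSketch {

    // One pending request: the record to update plus the job that asked for it.
    public class PendingDml {
        public SObject rec;
        public Id producerJobId;
        public PendingDml(SObject rec, Id producerJobId) {
            this.rec = rec;
            this.producerJobId = producerJobId;
        }
    }

    // Commits all requests in one partial-success update and returns
    // error messages keyed by producer job Id.
    public static Map<Id, List<String>> commitAndAttribute(List<PendingDml> requests) {
        List<SObject> records = new List<SObject>();
        for (PendingDml req : requests) {
            records.add(req.rec);
        }

        List<Database.SaveResult> results = Database.update(records, false);
        Map<Id, List<String>> errorsByProducer = new Map<Id, List<String>>();

        // results[i] corresponds to requests[i].
        for (Integer i = 0; i < results.size(); i++) {
            if (results[i].isSuccess()) {
                continue;
            }
            Id jobId = requests[i].producerJobId;
            if (!errorsByProducer.containsKey(jobId)) {
                errorsByProducer.put(jobId, new List<String>());
            }
            for (Database.Error err : results[i].getErrors()) {
                errorsByProducer.get(jobId).add(
                    String.valueOf(err.getStatusCode()) + ': ' + err.getMessage()
                );
            }
        }
        return errorsByProducer;
    }
}
```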

Acceptance Criteria (for the spike)

  • Measurement: real production lock-row incident profile (frequency, hot objects, parent skew)
  • Quantified test of mitigations 1+2+3 against representative load
  • Decision doc: ship cheap mitigations only, or proceed to coalescer impl?
  • If coalescer proceeds: design doc covering all open questions above

Labels: enhancement (New feature or request)
