[FEATURE]: Modified After Filter for Historical Snapshot Ingestion #8

@daves-mantel

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

When a CDC Snapshot pipeline runs for the first time against a source with a large backlog of historical snapshots (where versions are timestamp-based), the framework attempts to process every available snapshot, starting from the earliest one. For sources with years of accumulated history, this results in unnecessarily long initial loads, excessive compute costs, and potential failures. This is particularly wasteful when the business requirement is only to establish a baseline from a recent point in time rather than to replay the full history.

Proposed Solution

Add a modifiedAfter: str | None = None configuration option to CDCSnapshotFileSource. When set to an ISO-formatted timestamp, the framework filters out any timestamp-type snapshot versions that fall before the specified datetime during the first run (i.e. when latest_snapshot_version is None). This allows operators to define a cutoff point, skipping old historical data and beginning CDC processing from a known recent baseline. On subsequent runs the filter has no effect, as the pipeline tracks versions normally from the point of initial ingestion. The JSON schema for cdcSnapshotSettings is updated to include the new field across all supported dataflow spec definitions.
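
The first-run filtering described above could look roughly like the sketch below. It is only an illustration of the proposed behaviour, not the framework's code: the helper name filter_snapshot_versions and its parameters are hypothetical, and it assumes snapshot versions are already available as datetime objects that are comparable with the parsed cutoff (both naive or both timezone-aware).

```python
from __future__ import annotations

from datetime import datetime


def filter_snapshot_versions(
    versions: list[datetime],
    modified_after: str | None,
    latest_snapshot_version: datetime | None,
) -> list[datetime]:
    """Hypothetical helper: apply the modifiedAfter cutoff on the first run only."""
    # On subsequent runs a version is already tracked, so the filter has no effect.
    # The same applies when no cutoff is configured.
    if latest_snapshot_version is not None or modified_after is None:
        return list(versions)

    # Parse the ISO-formatted cutoff supplied in the configuration.
    cutoff = datetime.fromisoformat(modified_after)

    # Drop versions that fall before the cutoff; keep the cutoff itself and later.
    return [version for version in versions if version >= cutoff]
```

For example, on a first run with modified_after set to "2024-01-01T00:00:00" and versions dated 2023-01-01, 2024-01-01, and 2024-06-01, only the two 2024 versions would be kept. Whether a version exactly equal to the cutoff should be kept is a design choice this issue does not specify; the sketch keeps it.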

Additional Context

No response

Labels: enhancement (New feature or request)
