Description
Is there an existing issue for this?
- I have searched the existing issues
Problem statement
When a CDC Snapshot pipeline runs for the first time against a source with a large backlog of historical snapshots (where versions are timestamp-based), the framework attempts to process every available snapshot, starting from the oldest. For sources with years of accumulated history, this results in unnecessarily long initial loads, excessive compute costs, and potential failures, particularly when the business requirement is only to establish a baseline from a recent point in time rather than replay the full history.
Proposed Solution
Add a modifiedAfter: str | None = None configuration option to CDCSnapshotFileSource. When set to an ISO-formatted timestamp, the framework filters out any timestamp-type snapshot versions that fall before the specified datetime during the first run (i.e. when latest_snapshot_version is None). This allows operators to define a cutoff point, skipping old historical data and beginning CDC processing from a known recent baseline. On subsequent runs the filter has no effect, as the pipeline tracks versions normally from the point of initial ingestion. The JSON schema for cdcSnapshotSettings is updated to include the new field across all supported dataflow spec definitions.
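To make the intended behaviour concrete, here is a minimal sketch of the first-run filter. It is not the framework's actual code: the helper name filter_snapshot_versions and its signature are hypothetical, versions are assumed to already be parsed into naive datetime objects, and keeping a version that falls exactly on the cutoff is an assumption about boundary handling.

```python
from __future__ import annotations

from datetime import datetime


def filter_snapshot_versions(
    versions: list[datetime],
    modified_after: str | None,
    latest_snapshot_version: datetime | None,
) -> list[datetime]:
    """Drop versions older than the cutoff, but only on the first run.

    Hypothetical helper for illustration; not part of the framework's API.
    """
    # No cutoff configured, or a snapshot has already been ingested:
    # the filter has no effect and every version is returned in order.
    if modified_after is None or latest_snapshot_version is not None:
        return sorted(versions)

    # Parse the ISO-formatted cutoff, e.g. "2023-01-01T00:00:00".
    cutoff = datetime.fromisoformat(modified_after)

    # Keep versions at or after the cutoff (boundary handling is an assumption).
    return sorted(v for v in versions if v >= cutoff)


versions = [datetime(2019, 1, 1), datetime(2023, 6, 1), datetime(2024, 3, 1)]

# First run: latest_snapshot_version is None, so the 2019 snapshot is skipped.
first_run = filter_snapshot_versions(versions, "2023-01-01T00:00:00", None)

# Subsequent run: a version has been recorded, so the filter is bypassed.
later_run = filter_snapshot_versions(
    versions, "2023-01-01T00:00:00", datetime(2024, 3, 1)
)
```

The sketch only covers the cutoff logic. Selecting the next version after latest_snapshot_version on subsequent runs is left to the existing version tracking described above. If snapshot versions carry timezone information, the cutoff would need to be timezone-aware as well, since Python refuses to compare naive and aware datetimes.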
Additional Context
No response