Allow for "inferred deletes" on incrementally synced connections without CDC #6383
Labels
area/connectors
Connector related issues
type/enhancement
New feature or request
wontfix
This will not be worked on
Tell us about the problem you're trying to solve
It would be convenient for Airbyte to support "inferred deletes" on "Incremental Sync" database connections that do not utilize CDC (e.g. not setup to use a replication slot on a Postgres database).
Describe the solution you’d like
Disclaimer - After thinking through this for a while I realize that this feature might be Dead on Arrival, but I hadn't seen any mention of it and wanted to pickle my thoughts and get Airbyte developer feedback.
Pre-emptive TLDR
"Inferred deletes" can be constructed by a simple set operation: the set of destination primary keys
minus
the set of source primary keys for a given table.Implementation - Create a new deletion propagation worker that:
Despite being the best I could come up with, this implementation is quite computationally expensive on both the worker node and the source system.
The "How"
In general, inferring deletes on a table is a simple set operation between the destination and source table. It can be achieved by taking the set of destination primary keys
minus
the set of source primary keys yielding the set of primary keys that need to be deleted (sent to the destination system as tombstones).The "Where"
The problem here isn't "how" to infer deletes, but "where". Specifically, for the set operation to be performed, both sets need to exist in the same place (the source set in its' entirety and the destination set potentially partitioned or sent in batches). This place could be one of 3 basic locations: the source system, completely in memory on the worker node, or the destination system.
Destination System
This seems like a bad idea idea for 2 reasons:
A Worker Node (both sets in memory)
While this might work in some cases, storing the whole of a table's primary keys as a set in memory will likely become untenable or at least very difficult to manage. The worker node's memory requirements would need to scale with the size of the source table that it is processing, which adds - at the very least - additional deployment concerns.
It did occurred to me, since the memory requirement of the set was the main issue, why not potentially employ a Bloom filter here to handle the majority of set operation... But alas, this would not be an appropriate use case. That is, Bloom filters provide the response "definitely not in the set" or "possibly in the set" and the vast majority of the results would return "possibly in the set" which is essentially a cache miss requiring access to the actual set. I don't know of any "reverse" bloom filters ("definitely in the set" or "possibly not in the set") that provide the same memory advantages. So this would be a non solution.
Source System
Pushing the destination set into the source system to perform the set operation is likely a non-starter because the source system could be read-only and not something to which Airbyte could write.
However, it is possible to "send" pks to the source system to perform a set operation (in number of batches), without actually storing the data on the source system, namely using a CTE/SQL query:
All that said, the most feasible solution - for most databases - would be to create a new Worker (
DeletionPropagationWorker
) that periodically:This would likely also require a new optionally implementable interface to be created by each connector so that it could determine how the CTE/SQL query is constructed and run on the source system.
Describe the alternative you’ve considered or used
FULL_REFRESH
of the table will by its nature propagate deletesAny way I think about this, even my "feasible" solution, amounts to a significant computational burden and substantial data flowing back from the destination system (something that I assume doesn't happen extensively, currently). In a zero sum game, doing "inferred deletes" could take up substantial processing resources that could otherwise be devoted to moving data from source to destination, or to a lesser extent compete with operational queries on the source system.
The alternatives of doing a
FULL_REFRESH
or leveraging CDC seem far more appealing.Maybe I missed something, which would be good to hear, but I wanted to get confirmation of likelihood of this feature.
Are you willing to submit a PR?
Yes, though I suspect this a larger architectural issue that would have many downstream impacts, which Airbyte's code governance would like more control over.
The text was updated successfully, but these errors were encountered: