Effectively-once delivery guarantees when publishing to Pulsar topic from external source. #24605
Unanswered
artur-jablonski asked this question in Q&A
Hi,
I am trying to understand under what conditions it is possible to achieve EFFECTIVELY-once delivery guarantees towards a Pulsar topic from some external data source.
Based on these reads:
https://github.com/apache/pulsar/wiki/PIP-6:-Guaranteed-Message-Deduplication
https://streamnative.io/blog/exactly-once-semantics-transactions-pulsar
My understanding is that there are two options:
Option 1:
Each discrete data item to be published from the external source to the Pulsar topic must be associated with a monotonically increasing sequence number, and it must be possible for the consumer of the data source to resume consuming from a data item given its sequence number.
In this scenario, server-side topic deduplication must be enabled on the target topic, and the Pulsar Producer must use the data items' sequence numbers when publishing to the Pulsar topic. When the Pulsar Producer connects to the topic, it asks the Pulsar Broker for the latest sequence number it has published on that topic. It must use that number to position the data source reader on the next data item to be published and resume from there.
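To make the mechanics concrete, here is a minimal in-memory sketch of this strategy. It is NOT Pulsar code: `DedupBroker` is a hypothetical stand-in that models the broker-side behaviour from PIP-6 (track the highest sequence id per producer name, drop anything at or below it), and `publish_from` models a producer resuming from the broker-reported last sequence id.

```python
# Hypothetical model of broker-side deduplication (per PIP-6):
# the broker tracks the highest sequence id it has persisted per
# producer name and discards anything at or below it.
class DedupBroker:
    def __init__(self):
        self.messages = []   # persisted (producer_name, seq_id, payload)
        self.last_seq = {}   # producer name -> highest persisted seq id

    def publish(self, producer_name, seq_id, payload):
        # Duplicate: already persisted, acked but not stored again.
        if seq_id <= self.last_seq.get(producer_name, -1):
            return False
        self.messages.append((producer_name, seq_id, payload))
        self.last_seq[producer_name] = seq_id
        return True

    def get_last_sequence_id(self, producer_name):
        # What a reconnecting producer asks the broker for.
        return self.last_seq.get(producer_name, -1)


def publish_from(broker, producer_name, items):
    """Resume publishing: position the source reader one past the
    last sequence id the broker reports, then continue from there.
    Here the item's index in the list serves as its sequence number."""
    start = broker.get_last_sequence_id(producer_name) + 1
    for seq_id in range(start, len(items)):
        broker.publish(producer_name, seq_id, items[seq_id])
```

After a crash, a restarted producer that re-runs `publish_from` with the same producer name and the same source data continues exactly where it left off, with no duplicates persisted.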
Limitations:
Generally you can't go above parallelism=1 on the external source unless there is a scheme where the data source can be split into disjoint datasets, each consumed by a separate Pulsar Producer with a unique name, since Pulsar Broker-side deduplication works per producer name.
Pulsar Broker-side deduplication is per partition, so this needs to be taken into consideration if the target topic is partitioned.
Notes:
Pulsar Broker-side deduplication also works without associating sequence numbers with data items. In this case the Pulsar Producer uses an internal sequence number that is incremented with each published data item. When the Pulsar Producer times out waiting for the Broker's ACK for a given data item, it can attempt to republish the data using the same sequence number. This allows the Broker to deduplicate in case the data item was already successfully published. However, when the Pulsar Producer itself suffers a failure and restarts, then, since the sequence number is not associated in any way with the data items, the same data item may be published to the topic again and the EFFECTIVELY-once guarantee is broken.
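The retry-on-timeout case above can be sketched as follows. Again this is a toy model, not client internals: `Broker.receive` can simulate a lost ACK, and `send_with_retry` retries with the SAME sequence id, which the broker's dedup state makes harmless.

```python
# Toy model of the ACK-timeout retry: a resend with the same
# sequence id is deduplicated, so the message is stored only once.
class Broker:
    def __init__(self):
        self.stored = []
        self.last_seq = -1

    def receive(self, seq_id, payload, drop_ack=False):
        """Persist unless duplicate; return the ack, or None to
        simulate an ack lost in transit (producer sees a timeout)."""
        if seq_id > self.last_seq:
            self.stored.append(payload)
            self.last_seq = seq_id
        return None if drop_ack else seq_id


def send_with_retry(broker, seq_id, payload, ack_lost_once):
    ack = broker.receive(seq_id, payload, drop_ack=ack_lost_once)
    if ack is None:
        # Timeout: republish with the SAME sequence id; the broker
        # recognises it as a duplicate and only re-acks.
        ack = broker.receive(seq_id, payload)
    return ack
```

The failure mode described above corresponds to a restarted producer whose internal counter resets: it would publish the same payload under a NEW sequence id, which this dedup scheme cannot catch.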
Examples of data sources that can be used in this strategy:
A flat text file where each line is considered a data item. The line number is used as the sequence number; after a restart, having obtained the last sequence number from the Pulsar Broker, it's possible to position the consumer on the appropriate line. This assumes the file doesn't change in between.
A Kafka topic. The Pulsar Producer can use the Kafka topic offset as the sequence number. After a restart, having obtained the last sequence number from the Pulsar Broker, the Pulsar Producer can position itself on the appropriate record in the Kafka topic. Note: Kafka maintains offsets per topic partition, therefore if the data source is a Kafka topic with more than one partition, each partition would need to be published by a separate, uniquely named Pulsar Producer.
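The resume logic for the Kafka example reduces to a few lines. This is a hypothetical sketch of the bridge's core, with the Kafka partition modelled as an offset-indexed list and `publish` standing in for a `Producer.send` that carries the offset as the sequence id:

```python
# Sketch of a Kafka-to-Pulsar bridge for ONE Kafka partition:
# the Kafka offset doubles as the Pulsar sequence id, and each
# partition gets its own uniquely named Pulsar producer.
def bridge_partition(kafka_partition, pulsar_last_seq, publish):
    """Resume from one past the last sequence id the Pulsar broker
    reports for this partition's producer (-1 if nothing published).

    kafka_partition: list of records, indexed by offset.
    publish: callable(seq_id, record), stand-in for Producer.send.
    """
    for offset in range(pulsar_last_seq + 1, len(kafka_partition)):
        publish(offset, kafka_partition[offset])
```

In a real bridge the `range` start would be passed to the Kafka consumer's seek call instead of indexing a list, but the offset arithmetic is the same.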
Option 2:
For this strategy the source needs to:
have the ability to acknowledge the processing of a data item
have the ability to publish a data item to the Pulsar topic and acknowledge its processing ATOMICALLY
In this strategy, publishing to the Pulsar topic and acknowledging back to the source are bound by transactional all-or-nothing semantics.
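A toy model of that all-or-nothing coupling, purely to illustrate the semantics (this is NOT Pulsar's actual Transaction API): the publish and the source acknowledgement are staged inside one transaction object and either both take effect on commit or neither does on abort.

```python
# Conceptual sketch: one transaction stages both the publish to the
# topic and the ack back to the source; commit applies both, abort
# applies neither, so a crash before commit leaves the item
# unacknowledged at the source and unpublished on the topic.
class Txn:
    def __init__(self, topic, source):
        self.topic, self.source = topic, source   # topic: list, source: set of pending item ids
        self.staged_msgs, self.staged_acks = [], []

    def publish(self, msg):
        self.staged_msgs.append(msg)      # invisible on the topic until commit

    def ack_source(self, item_id):
        self.staged_acks.append(item_id)  # source still considers it pending

    def commit(self):
        self.topic.extend(self.staged_msgs)
        self.source.difference_update(self.staged_acks)

    def abort(self):
        self.staged_msgs.clear()          # nothing published,
        self.staged_acks.clear()          # nothing acknowledged
```

An aborted transaction leaves the source item pending, so a restarted pipeline re-reads and republishes it exactly once overall.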
Examples of data sources that can be used in this strategy:
Apache Flink's Pulsar Sink uses Flink's two-phase commit capabilities together with Pulsar's Transaction API, which becomes a participant in the 2PC, to make publishing to Pulsar and the ACK back to Flink atomic, and thereby achieves EFFECTIVELY-ONCE semantics.
An example of a source that cannot have effectively-once semantics:
If we take a JMS queue as the external data source we can say that:
there is no sequence number that can be assigned to data items published on a JMS queue, and, more importantly, a JMS client cannot position itself on a specific data item to consume.
there is no way to tie acknowledging the consumption of a JMS queue message to publishing it to a Pulsar topic in an atomic transaction.
Therefore no strategy exists to achieve EFFECTIVELY-ONCE delivery semantics from a JMS queue to a Pulsar topic.
Please correct me if I got this wrong.
Thanks!