Description
Use case
I want to guarantee exactly-once delivery to ClickHouse, without requiring separate persistence, for the replicated MergeTree engine family, when uploading from some message broker (Kafka-like).
Most message brokers have a streaming API. We read a buffer of data, push it to ClickHouse, and commit the offsets once that is done. So in case of failures such client code may experience double reads, and if the worker pushes that data to ClickHouse again it may lead to double writes as well. ClickHouse already has a well-defined deduplication mechanism: it will deduplicate data if you push the exact same batch multiple times. So in our worker (assuming we write to ClickHouse from a single thread) we must know the offsets of the last request. To achieve this I have to use a separate storage and do two writes into it (before the insert, to record the right offset; and after the ack from ClickHouse, to confirm the left offset).
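For illustration, a minimal sketch of that existing block-level deduplication, assuming a hypothetical `events` table on a Replicated*MergeTree engine with default insert deduplication, accessed through the Python clickhouse-driver client:

```python
from clickhouse_driver import Client

client = Client("localhost")
rows = [(1, "a"), (2, "b")]  # the exact same batch, byte for byte

# The first insert writes the block; ClickHouse also stores the block hash in ZooKeeper.
client.execute("INSERT INTO events (id, payload) VALUES", rows)

# Retrying the identical batch matches the stored hash and is silently dropped,
# so the table ends up containing the rows only once.
client.execute("INSERT INTO events (id, payload) VALUES", rows)
```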
The current user flow looks something like this (a code sketch follows the list):
- Read data from the broker (say 1000 records, from offset N to M)
- Set the right offset in the offset store to M
- Start uploading the data to ClickHouse
- Die in agony
- Restart the worker, look at the offset store, and see that last time offsets N to M were sent
- Read the exact same offsets from the broker
- Push them to ClickHouse again
- Set the left offset in the offset store to M.
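A minimal sketch of this two-write pattern, assuming a single-threaded worker; `broker`, `offset_store` and `clickhouse` are hypothetical client objects, not any real API:

```python
def run_once(broker, offset_store, clickhouse):
    # 1. Read a batch from the broker (offsets N..M).
    batch, n, m = broker.read_batch()

    # 2. First write to the separate store: record the right offset M
    #    *before* the insert, so a crash during the insert is detectable.
    offset_store.save_pending(left=n, right=m)

    # 3. Push the batch to ClickHouse. If the worker dies here, the restarted
    #    worker re-reads exactly N..M and re-inserts the identical batch,
    #    and ClickHouse block deduplication drops the duplicate.
    clickhouse.insert(batch)

    # 4. Second write to the separate store: confirm the batch is in
    #    ClickHouse, so the next iteration starts from M.
    offset_store.commit(left=m)

    # 5. Commit the offsets back to the broker.
    broker.commit(m)
```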
Describe the solution you'd like
The main idea is to eliminate the need for separate storage. ClickHouse already relies on persistence via ZooKeeper, so we could use it for this need as well. If the user provides some string that is meaningful to them with each write request, ClickHouse can store it in ZooKeeper consistently with the insert. Then, on the client side, we can look at that string and decide which offsets can be skipped.
The ideal user flow looks something like this (a code sketch follows the list):
- Read data from the broker from offset N to M.
- Start the upload to ClickHouse with insert id N_M
- Die with a happy smile
- Restart the worker. Ask ClickHouse for the latest insert id (N_M is returned)
- Skip offsets up to M
- Go back to step 1
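A minimal client-side sketch of this flow; the `insert_id` parameter and the `last_insert_id()` lookup are hypothetical stand-ins for whatever mechanism ClickHouse would expose to store and read back the user-provided string:

```python
def run_once(broker, clickhouse):
    # On (re)start: ask ClickHouse for the last insert id it stored, e.g. "N_M",
    # and skip everything up to and including offset M.
    last_id = clickhouse.last_insert_id()           # hypothetical lookup
    if last_id is not None:
        _, right = last_id.split("_")
        broker.seek(int(right) + 1)                 # skip already-inserted offsets

    # Read a batch from the broker (offsets N..M).
    batch, n, m = broker.read_batch()

    # A single insert, tagged with a client-meaningful id; ClickHouse would store
    # the id in ZooKeeper alongside the block hashes it already keeps there.
    clickhouse.insert(batch, insert_id=f"{n}_{m}")  # hypothetical parameter

    # No separate offset store and no pre-insert write are needed: a crash
    # anywhere above is recovered purely from the id stored in ClickHouse.
    broker.commit(m)
```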
Describe alternatives you've considered
We could continue to use a separate offset store.
Additional context
It may also be useful for the internal Kafka storage engine (which has the same problem).