[fix] Fix deadlock when closing the partitioned producer#187
[fix] Fix deadlock when closing the partitioned producer#187shibd merged 1 commit intoapache:mainfrom
Conversation
f98cfd5 to
0394302
Compare
|
This fix LGTM. But I still think the lock for |
|
Oh, this deadlock is not related to the lock in |
|
Although this modification avoids deadlocks, I would like to discuss some implementation details.
pulsar-client-cpp/lib/RoundRobinMessageRouter.cc Lines 53 to 57 in 872f8ab |
I agree with you. The bahavior of the Java client is to remove the producers for all new added partitions. I think we can fix it in another PR.
Yes. It's an expected behavior of the partitions change. Messages with the same key are sent to the same partition only if the partitions does not change. |
Agree. Actually, the first solution I think for this issue is to release the lock in the callback. But it will raise another problem because it's not the root cause. So I changed the solution. But I still think that avoid locking the lock in the use callback is a better practice.
Yes. We can fix it later. I think we should not let the newly added producer affect the existing Partitioned Producer. In other way, the auto update partitions operation should not affect the existing producer. The java client skips this error.
This is an expected behavior. I think the RoundRobinMode is mostly used as load-balancing under the topic scope. If users want to ensure the order and this requirement, they should use |
@RobertIndie I just wrote another solution based on this idea in my local env. But this PR is more simple and solves the deadlock directly so I prefer this PR. We can do more enhancements in future. |
Fixes #186
Motivation
This PR fixes the deadlock issue mentioned in #186
The case is that when we create a Partitioned Producer with 2 partitions.
And then we expand the topic to 3 partitions. The PP(Partitioned Producer) will create a new internal producer(Let's called it P3)
But if we close the PP before P3 starts completed. The P3.closeAsync will be called. And it will failed the creation for itself here:
pulsar-client-cpp/lib/ProducerImpl.cc
Line 938 in 63c4245
The PP then knows the P3 has failed to create and then close PP.closeAsync again:
pulsar-client-cpp/lib/PartitionedProducerImpl.cc
Line 164 in 63c4245
The internal producers will be closed again can cause the deadlock here:
pulsar-client-cpp/lib/ProducerImpl.cc
Line 718 in 63c4245
Here is the sequence diagram for the issue:

And here is the stack trace in #186
We should not call
PartitionedProudcer.closeAsyncinhandleSinglePartitionProducerCreatedwhen the partitioned producer is already in the closing state.Modifications
Verifying this change
Documentation
doc-required(Your PR needs to update docs and you will update later)
doc-not-needed(Please explain why)
doc(Your PR contains doc changes)
doc-complete(Docs have been already added)