[fix] Fix deadlock when closing the partitioned producer by RobertIndie · Pull Request #187 · apache/pulsar-client-cpp

RobertIndie · 2023-02-03T10:20:40Z

Fixes #186

Motivation

This PR fixes the deadlock issue described in #186.

The scenario: we create a Partitioned Producer (PP) with 2 partitions and then expand the topic to 3 partitions. The PP creates a new internal producer (let's call it P3).

If we close the PP before P3 has finished starting, P3.closeAsync is called, and it fails its own creation here:

pulsar-client-cpp/lib/ProducerImpl.cc

Line 938 in 63c4245

producerCreatedPromise_.setFailed(ResultAlreadyClosed);

The PP then sees that P3 failed to be created and calls PP.closeAsync again:

pulsar-client-cpp/lib/PartitionedProducerImpl.cc

Line 164 in 63c4245

closeAsync(nullptr);

The internal producers are then closed again, which can cause the deadlock here:

pulsar-client-cpp/lib/ProducerImpl.cc

Line 718 in 63c4245

Lock lock(mutex_);

Here is the sequence diagram for the issue:

And here is the stack trace from #186:

    frame #6: 0x000000010c5d7672 pulsar-tests`pulsar::ProducerImpl::closeAsync(this=0x00007fb19e012c20, originalCallback=<unavailable>)>) at ProducerImpl.cc:725:10
    frame #7: 0x000000010c5768a1 pulsar-tests`pulsar::PartitionedProducerImpl::closeAsync(this=0x00007fb19ef04098, originalCallback=<unavailable>)>) at PartitionedProducerImpl.cc:287:23
    frame #8: 0x000000010c57518f pulsar-tests`pulsar::PartitionedProducerImpl::handleSinglePartitionProducerCreated(this=0x00007fb19ef04098, result=ResultAlreadyClosed, producerWeakPtr=<unavailable>, partitionIndex=2) at PartitionedProducerImpl.cc:166:13
    frame #9: 0x000000010c582c9c pulsar-tests`decltype(__f=0x0000600002699868, __a0=std::__1::shared_ptr<pulsar::PartitionedProducerImpl>::element_type @ 0x00007fb19ef04098 strong=8 weak=4, __args=0x00007ff7b4127fa4, __args=nullptr, __args=0x0000600002699888).*fp(static_cast<pulsar::Result>(fp1), static_cast<std::__1::weak_ptr<pulsar::ProducerImplBase> const&>(fp1), static_cast<unsigned int&>(fp1))) std::__1::__invoke<void (pulsar::PartitionedProducerImpl::*&)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>&, pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&, unsigned int&, void>(void (pulsar::PartitionedProducerImpl::*&)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>&, pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&, unsigned int&) at type_traits:3859:1
    frame #10: 0x000000010c582bb4 pulsar-tests`std::__1::__bind_return<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::tuple<std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1>, std::__1::placeholders::__ph<2>, unsigned int>, std::__1::tuple<pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>, __is_valid_bind_return<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::tuple<std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1>, std::__1::placeholders::__ph<2>, unsigned int>, std::__1::tuple<pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>>::value>::type std::__1::__apply_functor<void (__f=0x0000600002699868, __bound_args=size=4, (null)=__tuple_indices<0, 1, 2, 3> @ 0x00007ff7b4127dd8, __args=size=2)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::tuple<std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1>, std::__1::placeholders::__ph<2>, unsigned int>, 0ul, 1ul, 2ul, 3ul, std::__1::tuple<pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>>(void (pulsar::PartitionedProducerImpl::*&)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::tuple<std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1>, std::__1::placeholders::__ph<2>, unsigned int>&, std::__1::__tuple_indices<0ul, 1ul, 2ul, 3ul>, std::__1::tuple<pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>&&) at bind.h:257:12
    frame #11: 0x000000010c582b0b pulsar-tests`std::__1::__bind_return<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::tuple<std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1>, std::__1::placeholders::__ph<2>, unsigned int>, std::__1::tuple<pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>, __is_valid_bind_return<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::tuple<std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1>, std::__1::placeholders::__ph<2>, unsigned int>, std::__1::tuple<pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>>::value>::type std::__1::__bind<void (this=0x0000600002699868, __args=0x00007ff7b4127fa4, __args=nullptr)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>::operator()<pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>(pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&) at bind.h:292:20
    frame #12: 0x000000010c582a95 pulsar-tests`decltype(__f=0x0000600002699868, __args=0x00007ff7b4127fa4, __args=nullptr)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>&>(fp)(static_cast<pulsar::Result>(fp0), static_cast<std::__1::weak_ptr<pulsar::ProducerImplBase> const&>(fp0))) std::__1::__invoke<std::__1::__bind<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>&, pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>(std::__1::__bind<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>&, pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&) at type_traits:3918:1
    frame #13: 0x000000010c582a47 pulsar-tests`void std::__1::__invoke_void_return_wrapper<void, true>::__call<std::__1::__bind<void (__args=0x0000600002699868, __args=0x00007ff7b4127fa4, __args=nullptr)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>&, pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&>(std::__1::__bind<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>&, pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&) at invoke.h:61:9
    frame #14: 0x000000010c5829f7 pulsar-tests`std::__1::__function::__alloc_func<std::__1::__bind<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>, std::__1::allocator<std::__1::__bind<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>>, void (pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&)>::operator(this=0x0000600002699868, __arg=0x00007ff7b4127fa4, __arg=nullptr)(pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&) at function.h:178:16
    frame #15: 0x000000010c5815d6 pulsar-tests`std::__1::__function::__func<std::__1::__bind<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>, std::__1::allocator<std::__1::__bind<void (pulsar::PartitionedProducerImpl::*)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int), std::__1::shared_ptr<pulsar::PartitionedProducerImpl>, std::__1::placeholders::__ph<1> const&, std::__1::placeholders::__ph<2> const&, unsigned int&>>, void (pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&)>::operator(this=0x0000600002699860, __arg=0x00007ff7b4127fa4, __arg=nullptr)(pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&) at function.h:352:12
    frame #16: 0x000000010c5edf2f pulsar-tests`std::__1::__function::__value_func<void (pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&)>::operator(this=0x0000600003d999d0, __args=0x00007ff7b4127fa4, __args=nullptr)(pulsar::Result&&, std::__1::weak_ptr<pulsar::ProducerImplBase> const&) const at function.h:505:16
    frame #17: 0x000000010c5edbf1 pulsar-tests`std::__1::function<void (pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&)>::operator(this= Function = pulsar::PartitionedProducerImpl::handleSinglePartitionProducerCreated(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>, unsigned int) , __arg=ResultAlreadyClosed, __arg=nullptr)(pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase> const&) const at function.h:1182:12
    frame #18: 0x000000010c5d26b0 pulsar-tests`pulsar::Promise<pulsar::Result, std::__1::weak_ptr<pulsar::ProducerImplBase>>::setFailed(this=0x00007fb19e013988, result=ResultAlreadyClosed) const at Future.h:156:13
    frame #19: 0x000000010c5daece pulsar-tests`pulsar::ProducerImpl::shutdown(this=0x00007fb19e012c20) at ProducerImpl.cc:945:29
    frame #20: 0x000000010c5d7ed3 pulsar-tests`pulsar::ProducerImpl::closeAsync(this=0x00007ff7b4128670, result=ResultOk)>)::$_6::operator()(pulsar::Result) const at ProducerImpl.cc:716:13

We should not call PartitionedProducer.closeAsync in handleSinglePartitionProducerCreated when the partitioned producer is already in the closing state.

Modifications

Skip handling the single-partition producer creation result when the partitioned producer is already in the closing state.
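
To illustrate the idea, here is a minimal sketch of such a state guard. It is not the code in this PR: `SketchPartitionedProducer`, its `State` and `Result` enums, and `closeCallsWhenClosedDuringStartup` are hypothetical stand-ins, and the re-entrant call in `closeAsync` only imitates P3 failing its own creation while the PP is closing.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins; not the real pulsar-client-cpp types.
enum class Result { Ok, AlreadyClosed };
enum class State { Ready, Closing, Closed };

class SketchPartitionedProducer {
   public:
    // Marks the producer as closing and "closes" the internal producers.
    void closeAsync() {
        state_ = State::Closing;
        ++closeCalls_;
        // Closing a producer that is still starting fails its creation,
        // which re-enters the creation handler on the same thread.
        handleSinglePartitionProducerCreated(Result::AlreadyClosed);
        state_ = State::Closed;
    }

    void handleSinglePartitionProducerCreated(Result result) {
        if (state_ == State::Closing) {
            return;  // a close is already in progress; do not start another
        }
        if (result != Result::Ok) {
            closeAsync();  // clean up after a creation failure
        }
    }

    int closeCalls() const { return closeCalls_; }

   private:
    std::atomic<State> state_{State::Ready};
    int closeCalls_{0};
};

// Closes a producer whose internal producer is still starting and
// reports how many times closeAsync ran.
int closeCallsWhenClosedDuringStartup() {
    SketchPartitionedProducer producer;
    producer.closeAsync();
    return producer.closeCalls();
}
```

Calling `closeCallsWhenClosedDuringStartup()` from a `main` shows a single close. Without the early return, the handler in this sketch would call `closeAsync` again from inside `closeAsync`, which is the re-entrancy that ends at the mutex in the real client.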

Verifying this change

Make sure that the change passes the CI checks.

Documentation

doc-required
(Your PR needs to update docs and you will update later)
doc-not-needed
(Please explain why)
doc
(Your PR contains doc changes)
doc-complete
(Docs have been already added)

BewareMyPower · 2023-02-09T04:24:26Z

This fix LGTM. But I still think the lock for ProducerImpl::handleCreateProducer introduced in #131 is dangerous. We should avoid acquiring the mutex in any callback, which could be called in the caller's thread and might lead to a deadlock.

BewareMyPower · 2023-02-09T04:31:41Z

Oh, this deadlock is not related to the lock in handleCreateProducer. I just found from the thread stacks that handleCreateProducer is also stuck.

shibd · 2023-02-09T05:49:44Z

Although this modification avoids deadlocks, I would like to discuss some implementation details.

When P3 fails to start, why do we need to close all producers? Wouldn't it be better for other functioning producers to send messages? (Suppose P3 is temporarily unavailable due to load balancing.)
Another question: when the partitions change, we can't guarantee that the same key is sent to the same partition-topic, right? (RoundRobinMode)

Once topicMetadata.getNumPartitions() has changed, the result is different.

pulsar-client-cpp/lib/RoundRobinMessageRouter.cc

Lines 53 to 57 in 872f8ab

    
    // if message has a key, hash the key and return the partition
    if (msg.hasPartitionKey()) {
        return hash->makeHash(msg.getPartitionKey()) % topicMetadata.getNumPartitions();
    }

BewareMyPower · 2023-02-09T07:22:18Z

> Wouldn't it be better for other functioning producers to send messages?

I agree with you. The behavior of the Java client is to remove the producers for all newly added partitions. I think we can fix it in another PR.

> Another question: when the partitions change, we can't guarantee that the same key is sent to the same partition-topic, right? (RoundRobinMode)

Yes. It's the expected behavior when the partitions change. Messages with the same key are sent to the same partition only if the number of partitions does not change.
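
A small sketch of why this happens, assuming only the `hash % numPartitions` step from the router snippet quoted earlier. `choosePartition` and `keepsPartition` are hypothetical helpers, and the hash values are made-up numbers rather than output of the client's hash functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the `hash % numPartitions` step.
int choosePartition(std::uint32_t keyHash, int numPartitions) {
    return static_cast<int>(keyHash % static_cast<std::uint32_t>(numPartitions));
}

// True if a key with this hash keeps its partition after the topic
// grows from `before` to `after` partitions.
bool keepsPartition(std::uint32_t keyHash, int before, int after) {
    return choosePartition(keyHash, before) == choosePartition(keyHash, after);
}
```

For a key whose hash is 8, `choosePartition(8, 2)` is 0 but `choosePartition(8, 3)` is 2, so the key moves when the topic expands from 2 to 3 partitions. Some keys happen to stay put (a hash of 6 maps to partition 0 in both cases), which is why the guarantee holds only while the partition count is unchanged.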

RobertIndie · 2023-02-09T07:35:43Z

> We should avoid acquiring the mutex in any callback, which could be called in the caller's thread and might lead to a deadlock.

Agreed. Actually, the first solution I thought of for this issue was to release the lock in the callback. But that would raise another problem, because it does not address the root cause, so I changed the solution. I still think that avoiding acquiring the lock in the user callback is a better practice.
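
For reference, one common way to keep user callbacks out of a critical section is to collect them under the lock and invoke them only after the lock is released. The sketch below assumes a hypothetical `SketchPromise` class; it is not the `Promise` in the client's Future.h and not the alternative solution mentioned in this thread.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical promise-like class used only for illustration.
class SketchPromise {
   public:
    using Listener = std::function<void(int)>;

    void addListener(Listener listener) {
        std::lock_guard<std::mutex> lock(mutex_);
        listeners_.push_back(std::move(listener));
    }

    bool isComplete() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return complete_;
    }

    void setValue(int value) {
        std::vector<Listener> toRun;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (complete_) return;
            complete_ = true;
            toRun.swap(listeners_);
        }  // mutex released here, before any user code runs
        for (auto& listener : toRun) {
            listener(value);  // may safely call back into this object
        }
    }

   private:
    mutable std::mutex mutex_;
    bool complete_{false};
    std::vector<Listener> listeners_;
};

// Returns the value seen by a listener that re-enters the promise.
int reentrantListenerDemo() {
    SketchPromise promise;
    int seen = -1;
    promise.addListener([&](int value) {
        if (promise.isComplete()) seen = value;  // takes the mutex again
    });
    promise.setValue(42);
    return seen;
}
```

Because the listener runs after the mutex is released, its call to `isComplete()` can take the same non-recursive mutex without blocking. If `setValue` invoked listeners while still holding the lock, that call would never return.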

> When P3 fails to start, why do we need to close all producers? Wouldn't it be better for other functioning producers to send messages? (Suppose P3 is temporarily unavailable due to load balancing.)

Yes. We can fix it later. I think we should not let the newly added producer affect the existing Partitioned Producer. In other words, the automatic partition update should not affect the existing producers. The Java client skips this error.

> Another question: when the partitions change, we can't guarantee that the same key is sent to the same partition-topic, right? (RoundRobinMode)

This is expected behavior. I think RoundRobinMode is mostly used for load balancing within the topic scope. If users need ordering and this same-key guarantee, they should use a Key_Shared subscription.

BewareMyPower · 2023-02-09T08:33:08Z

> Actually, the first solution I thought of for this issue was to release the lock in the callback.

@RobertIndie I just wrote another solution based on this idea in my local env. But this PR is simpler and solves the deadlock directly, so I prefer this PR. We can make more enhancements in the future.

RobertIndie marked this pull request as draft February 3, 2023 10:22

RobertIndie self-assigned this Feb 3, 2023

Fix deadlock when closing the partitioned producer

0394302

RobertIndie force-pushed the fix-deadlock branch from f98cfd5 to 0394302 Compare February 9, 2023 02:50

RobertIndie added this to the 3.2.0 milestone Feb 9, 2023

RobertIndie marked this pull request as ready for review February 9, 2023 03:12

RobertIndie requested review from BewareMyPower and shibd and removed request for BewareMyPower February 9, 2023 03:27

BewareMyPower approved these changes Feb 9, 2023

shibd approved these changes Feb 9, 2023

shibd merged commit f69d0ce into apache:main Feb 9, 2023
