Skip to content

[TRACKING ISSUE] Fix ClientInvocation Retry #375

@hz-devops-test

Description

@hz-devops-test

The tracking issue for the Java side PR.

See hazelcast/hazelcast#18309 for details.


Bug Description:
There seems to be multiple problems which is caused by single problem
we allow multiple threads to retry/change state of single invocation.
Example:
A connection is closed. CleanResourcesTask checks periodically and
handles invocation with connection closed.
https://github.com/hazelcast/hazelcast/blob/04e9df4c2f406238267a8d45fbc9ae476b40c7d5/hazelcast/src/main/java/com/hazelcast/client/impl/spi/impl/ClientInvocationServiceImpl.java#L290-L301

This continues and schedules a retry here
https://github.com/hazelcast/hazelcast/blob/04e9df4c2f406238267a8d45fbc9ae476b40c7d5/hazelcast/src/main/java/com/hazelcast/client/impl/spi/impl/ClientInvocation.java#L305-L315

Before this task gets a chance to run, CleanResoursTask can wakeup
check the same connection and schedule a new retry.

As a solution I first, try to sync them on deRegisterInvocation(callid)
,but this does not work. The idea was to only one that can
deRegisterInvocation will retry. Also a response/exception from
remote will notify if they can deregister the invocation.
One problem is correlation id is changed by the retried task, and
other threads are trying to deregister by reading that correlationId.
In response path, we have correlation id coming from remote but in
connection closed case, we don't have one to use(instead of reading
from invocation itself)

Another problem is how we handle connection close. Lets say a member
is about the close and we send an invocation to it. It can send,
ClusterPassive, InstanceNotActive like exceptions. And it closes the
connection after that. We have two cases to handle on the client side.
Lets say we have detected that connection is closed again in
CleanResoursesTask, and we are about to notify the invocation.
In the mean time we got InstanceNotActiveException as well, and we
retried it on another connection/with new call id.
Notify for connection close deregisters the new retry which should
not happen.

I have noticed we had similar problems on the core side.
#7270

I have investigate the solutions and core codebase
#9296
#9303

They are not suitable for the client. Especially connection close
case is different. On the member, we have MemberLeftException
which is guarded by MemberListVersion which are not applicable
to the client. So for the client, solution grew around connection
naturally.

Solution:
We coordinate all threads that wants to notify the invocation.
We achieve synchronization of different threads via sentConnection
field sentConnection starts as null.
It is set to a non-null when connection to be sent is determined.
It is set to null when a response/exception is going to be notified.
It is trying to be compared and set to null on connection close case
with dead connection, to prevent invocation to be notified which is
already retried on another connection.
Only one thread that can set this to null can be actively notify
a response/exception.

This way we make sure that only one thread changes the states
(changing correlation id/deregistration of invication).

fixes #18062

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions