HDDS-2107. Datanodes should retry forever to connect to SCM in an… #1424
…ecure environment
/label ozone
@xiaoyuyao @hanishakoneru @anuengineer @elek @bharatviswa504 Please review
💔 -1 overall
This message was automatically generated.
Hi @vivekratnavel,
As far as I see, DataNode already tries forever due to the main loop in the state machine:
Lines 181 to 201 in c255333
while (context.getState() != DatanodeStates.SHUTDOWN) {
  try {
    LOG.debug("Executing cycle Number : {}", context.getExecutionCount());
    long heartbeatFrequency = context.getHeartbeatFrequency();
    nextHB.set(Time.monotonicNow() + heartbeatFrequency);
    context.execute(executorService, heartbeatFrequency,
        TimeUnit.MILLISECONDS);
    now = Time.monotonicNow();
    if (now < nextHB.get()) {
      if(!Thread.interrupted()) {
        Thread.sleep(nextHB.get() - now);
      }
    }
  } catch (InterruptedException e) {
    // Some one has sent interrupt signal, this could be because
    // 1. Trigger heartbeat immediately
    // 2. Shutdown has be initiated.
  } catch (Exception e) {
    LOG.error("Unable to finish the execution.", e);
  }
}
You can verify this by starting a DataNode without SCM and pointing the scm hostname
at the DataNode's own address:
cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozone
docker-compose up -d datanode
docker-compose exec datanode bash -c "tail -1 /etc/hosts | sed 's/\t\+[a-z0-9]*$/ scm/' | sudo tee -a /etc/hosts"
docker-compose logs -f --tail=10 datanode
Result:
...
datanode_1 | 2019-09-11 12:29:39 INFO Client:948 - Retrying connect to server: scm/192.168.0.2:9861. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
datanode_1 | 2019-09-11 12:29:39 ERROR EndpointStateMachine:204 - Unable to communicate to SCM server at scm:9861 for past 300 seconds.
...
datanode_1 | 2019-09-11 12:29:40 INFO Client:948 - Retrying connect to server: scm/192.168.0.2:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
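The counter in the log output above goes from "Already tried 9 time(s)" back to "Already tried 0 time(s)", which is what one would expect when a bounded retry policy is restarted by an outer loop. The following is a minimal sketch of that interaction, not the Ozone code: the class `RetryCycleSketch`, the always-failing `tryConnect()`, and the fixed number of outer cycles are all hypothetical stand-ins, and the sleeps are omitted so the example finishes instantly.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryCycleSketch {

    /** Hypothetical stand-in for a connect call; always fails, as if no SCM were running. */
    public static boolean tryConnect() {
        return false;
    }

    /**
     * Runs `cycles` iterations of an outer loop. Each iteration makes up to
     * `maxRetries` connection attempts. Returns the "already tried" counter
     * recorded after each failed attempt, in order.
     */
    public static List<Integer> run(int cycles, int maxRetries) {
        List<Integer> tried = new ArrayList<>();
        // Outer loop: stands in for while (state != SHUTDOWN) in the state machine.
        for (int cycle = 0; cycle < cycles; cycle++) {
            // Inner loop: stands in for a bounded, fixed-sleep retry policy.
            for (int attempt = 0; attempt < maxRetries; attempt++) {
                if (tryConnect()) {
                    return tried;
                }
                tried.add(attempt); // mirrors "Already tried N time(s)"
            }
            // The bounded policy gave up; the outer loop just starts the next cycle,
            // so the attempt counter begins again at 0.
        }
        return tried;
    }

    public static void main(String[] args) {
        // Two outer cycles with maxRetries=10: the counter runs 0..9 twice.
        System.out.println(run(2, 10));
    }
}
```

In this sketch the outer loop never gives up on its own, so the bounded inner policy only determines how often the counter resets (and, in the real DataNode, when the error is logged).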
@adoroszlai You are right. With this change, we don't get the error from
Thank you @vivekratnavel for working on this.
@hanishakoneru Sure.
Thank you @vivekratnavel. +1. I will commit it.
… unsecure environment
In an unsecure environment, the datanodes try up to 10 times, waiting 1000 milliseconds between attempts, before throwing this error:
This PR fixes that issue by having datanodes try forever to connect with SCM and not throw an error from the state machine.
I have also increased timeouts on a unit test to improve its stability.
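To illustrate the idea of retrying forever with a fixed sleep, here is a self-contained sketch that does not depend on Hadoop. The class `RetryForeverSketch`, its `retryForever` helper, and the simulated "SCM becomes reachable on the 13th attempt" scenario are hypothetical and are not taken from this PR's diff. Hadoop's `org.apache.hadoop.io.retry.RetryPolicies` does provide `retryForeverWithFixedSleep`, but whether the change relies on it is not shown in this conversation.

```java
import java.util.function.BooleanSupplier;

public class RetryForeverSketch {

    /**
     * Calls `connect` until it returns true, sleeping `sleepMillis` between
     * attempts. Returns the number of failed attempts before success, or -1
     * if the thread is interrupted while sleeping (for example on shutdown).
     */
    public static int retryForever(BooleanSupplier connect, long sleepMillis) {
        int failures = 0;
        while (!connect.getAsBoolean()) {
            failures++;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical SCM that becomes reachable on the 13th attempt,
        // i.e. after a bounded policy with maxRetries=10 would have given up.
        int failures = retryForever(() -> ++calls[0] > 12, 1);
        System.out.println("connected after " + failures + " failed attempts");
    }
}
```

Unlike a bounded policy, this loop has no give-up path other than interruption, so no error needs to be surfaced just because a retry budget ran out.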