HDDS-2107. Datanodes should retry forever to connect to SCM in an… #1424

vivekratnavel · 2019-09-11T01:27:46Z

… unsecure environment

In an unsecure environment, the datanodes try up to 10 times, waiting 1000 milliseconds between attempts, before throwing this error:

Unable to communicate to SCM server at scm:9861 for past 0 seconds.
java.net.ConnectException: Call From scm:9861 failed on connection exception: java.net.ConnectException: Connection refused;

This PR fixes that issue by having datanodes retry forever to connect to SCM, so that the state machine does not throw this error.

I have also increased timeouts on a unit test to improve its stability.

…ecure environment

vivekratnavel · 2019-09-11T01:28:16Z

/label ozone

vivekratnavel · 2019-09-11T01:28:44Z

@xiaoyuyao @hanishakoneru @anuengineer @elek @bharatviswa504 Please review

hadoop-yetus · 2019-09-11T03:54:04Z

💔 -1 overall

Vote	Subsystem	Runtime	Comment
0	reexec	41	Docker mode activated.
		_ Prechecks _
+1	dupname	0	No case conflicting files found.
+1	@author	0	The patch does not contain any @author tags.
+1	test4tests	0	The patch appears to include 1 new or modified test files.
		_ trunk Compile Tests _
0	mvndep	67	Maven dependency ordering for branch
+1	mvninstall	589	trunk passed
+1	compile	381	trunk passed
+1	checkstyle	83	trunk passed
+1	mvnsite	0	trunk passed
+1	shadedclient	868	branch has no errors when building and testing our client artifacts.
+1	javadoc	178	trunk passed
0	spotbugs	417	Used deprecated FindBugs config; considering switching to SpotBugs.
+1	findbugs	615	trunk passed
		_ Patch Compile Tests _
0	mvndep	41	Maven dependency ordering for patch
+1	mvninstall	536	the patch passed
+1	compile	387	the patch passed
+1	javac	387	the patch passed
+1	checkstyle	90	the patch passed
+1	mvnsite	0	the patch passed
+1	whitespace	0	The patch has no whitespace issues.
+1	shadedclient	678	patch has no errors when building and testing our client artifacts.
+1	javadoc	175	the patch passed
+1	findbugs	631	the patch passed
		_ Other Tests _
-1	unit	280	hadoop-hdds in the patch failed.
-1	unit	2824	hadoop-ozone in the patch failed.
+1	asflicense	55	The patch does not generate ASF License warnings.
		8715

Reason	Tests
Failed junit tests	hadoop.hdds.scm.container.placement.algorithms.TestSCMContainerPlacementRackAware
	hadoop.ozone.container.TestContainerReplication
	hadoop.ozone.client.rpc.TestCloseContainerHandlingByClient
	hadoop.ozone.container.common.statemachine.commandhandler.TestBlockDeletion
	hadoop.ozone.client.rpc.TestContainerStateMachineFailures
	hadoop.ozone.client.rpc.Test2WayCommitInRatis
	hadoop.ozone.TestSecureOzoneCluster
	hadoop.ozone.scm.TestContainerSmallFile
	hadoop.ozone.client.rpc.TestBlockOutputStream
	hadoop.ozone.client.rpc.TestBlockOutputStreamWithFailures
	hadoop.ozone.om.TestOzoneManagerHA

Subsystem	Report/Notes
Docker	Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1424/1/artifact/out/Dockerfile
GITHUB PR	#1424
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle
uname	Linux 2f98f8163e51 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	personality/hadoop.sh
git revision	trunk / `f8f8598`
Default Java	1.8.0_222
unit	https://builds.apache.org/job/hadoop-multibranch/job/PR-1424/1/artifact/out/patch-unit-hadoop-hdds.txt
unit	https://builds.apache.org/job/hadoop-multibranch/job/PR-1424/1/artifact/out/patch-unit-hadoop-ozone.txt
Test Results	https://builds.apache.org/job/hadoop-multibranch/job/PR-1424/1/testReport/
Max. process+thread count	5408 (vs. ulimit of 5500)
modules	C: hadoop-hdds/container-service hadoop-ozone/ozone-manager U: .
Console output	https://builds.apache.org/job/hadoop-multibranch/job/PR-1424/1/console
versions	git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by	Apache Yetus 0.10.0 http://yetus.apache.org

This message was automatically generated.

adoroszlai

Hi @vivekratnavel,

As far as I can see, the DataNode already retries forever because of the main loop in the state machine:

hadoop/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java

Lines 181 to 201 in c255333

    
    while (context.getState() != DatanodeStates.SHUTDOWN) {
      try {
        LOG.debug("Executing cycle Number : {}", context.getExecutionCount());
        long heartbeatFrequency = context.getHeartbeatFrequency();
        nextHB.set(Time.monotonicNow() + heartbeatFrequency);
        context.execute(executorService, heartbeatFrequency,
            TimeUnit.MILLISECONDS);
        now = Time.monotonicNow();
        if (now < nextHB.get()) {
          if(!Thread.interrupted()) {
            Thread.sleep(nextHB.get() - now);
          }
        }
      } catch (InterruptedException e) {
        // Some one has sent interrupt signal, this could be because
        // 1. Trigger heartbeat immediately
        // 2. Shutdown has be initiated.
      } catch (Exception e) {
        LOG.error("Unable to finish the execution.", e);
      }
    }

You can verify this by starting DataNode without SCM, and setting the IP for scm to the DataNode's own address:

cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozone
docker-compose up -d datanode
docker-compose exec datanode bash -c "tail -1 /etc/hosts | sed 's/\t\+[a-z0-9]*$/ scm/' | sudo tee -a /etc/hosts"
docker-compose logs -f --tail=10 datanode

Result:

...
datanode_1  | 2019-09-11 12:29:39 INFO  Client:948 - Retrying connect to server: scm/192.168.0.2:9861. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 12:29:39 ERROR EndpointStateMachine:204 - Unable to communicate to SCM server at scm:9861 for past 300 seconds.
...
datanode_1  | 2019-09-11 12:29:40 INFO  Client:948 - Retrying connect to server: scm/192.168.0.2:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
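
To illustrate the point above, here is a minimal, self-contained sketch of how an outer loop turns a bounded retry into an effectively unbounded one. It does not use the Hadoop IPC client or DatanodeStateMachine; the class and method names (OuterLoopRetrySketch, callWithBoundedRetry, runUntilSuccess, flakyEndpoint) are hypothetical and exist only for this example, and the attempt counting is a simplification rather than the exact semantics of the Hadoop retry policy.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names come from Hadoop or Ozone.
class OuterLoopRetrySketch {

  // Inner, bounded retry: rethrows the last failure after maxAttempts calls.
  static <T> T callWithBoundedRetry(Callable<T> call, int maxAttempts,
      long sleepMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        last = e;
        Thread.sleep(sleepMillis); // fixed sleep between attempts
      }
    }
    throw last;
  }

  // Outer loop: a failed cycle is only counted (the real loop logs it),
  // and the next cycle starts the bounded retry again from zero.
  static <T> T runUntilSuccess(Callable<T> call, int maxAttemptsPerCycle,
      long sleepMillis, AtomicInteger failedCycles) {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        return callWithBoundedRetry(call, maxAttemptsPerCycle, sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the flag, leave the loop
      } catch (Exception e) {
        failedCycles.incrementAndGet();
      }
    }
    throw new IllegalStateException("interrupted before the call succeeded");
  }

  // Stand-in for an unreachable server: refuses the first `failures` calls.
  static Callable<String> flakyEndpoint(int failures, AtomicInteger calls) {
    return () -> {
      if (calls.incrementAndGet() <= failures) {
        throw new ConnectException("Connection refused");
      }
      return "connected";
    };
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    AtomicInteger failedCycles = new AtomicInteger();
    String result =
        runUntilSuccess(flakyEndpoint(25, calls), 10, 1L, failedCycles);
    // prints "connected after 26 calls and 2 failed cycles"
    System.out.println(result + " after " + calls.get() + " calls and "
        + failedCycles.get() + " failed cycles");
  }
}
```

With 10 attempts per cycle and an endpoint that refuses the first 25 calls, two full cycles fail and the third one succeeds, which mirrors the "Already tried 9 time(s)" line followed by "Already tried 0 time(s)" in the log above.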

vivekratnavel · 2019-09-11T18:20:17Z

@adoroszlai You are right. With this change, we no longer get the error from EndpointStateMachine, and the result now looks like this:

datanode_1  | 2019-09-11 18:16:55 INFO  InitDatanodeState:140 - DatanodeDetails is persisted to /data/datanode.id
datanode_1  | 2019-09-11 18:16:57 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:16:58 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:16:59 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:00 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:01 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:02 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:03 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:04 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:05 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:06 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:07 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:08 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:09 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 12 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:10 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 13 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:11 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 14 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:12 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 15 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:13 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 16 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:14 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 17 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:15 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 18 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:16 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 19 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:17 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 20 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:18 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 21 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:19 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 22 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:20 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 23 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:21 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 24 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:22 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 25 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:23 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 26 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:24 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 27 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:25 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 28 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:26 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 29 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:27 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 30 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:28 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 31 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:29 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 32 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:30 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 33 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:31 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 34 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:32 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 35 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:33 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 36 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:34 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 37 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:35 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 38 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:36 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 39 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:37 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 40 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:38 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 41 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:39 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 42 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:40 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 43 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:41 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 44 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:43 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 45 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:44 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 46 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:45 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 47 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:45 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:46 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:47 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 50 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:48 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 51 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:49 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 52 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:50 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 53 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:51 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 54 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:52 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 55 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:53 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 56 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:54 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 57 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:55 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 58 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:56 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 59 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:58 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 60 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:17:59 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 61 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:00 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 62 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:01 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 63 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:02 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 64 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:03 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 65 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:04 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 66 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:05 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 67 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:06 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 68 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:07 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 69 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:08 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 70 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:09 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 71 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:10 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 72 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:11 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 73 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:12 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 74 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:13 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 75 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:14 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 76 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:15 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 77 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
datanode_1  | 2019-09-11 18:18:16 INFO  Client:948 - Retrying connect to server: datanode/172.19.0.2:9861. Already tried 78 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=2147483647, sleepTime=1000 MILLISECONDS)
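
For context, the maxRetries=2147483647 in the output above equals Java's Integer.MAX_VALUE. The sketch below, which is not taken from the patch, estimates how long such a retry budget lasts with a fixed 1000 ms sleep. The class and method names (RetryBudgetSketch, shouldRetry, timeToExhaust) are hypothetical, and the estimate ignores the time spent on each connection attempt.

```java
import java.time.Duration;

// Hypothetical sketch: a simplified count-based retry budget, not Hadoop code.
class RetryBudgetSketch {

  // Simplified decision rule: keep retrying while the count is below the cap.
  static boolean shouldRetry(int retriesSoFar, int maxRetries) {
    return retriesSoFar < maxRetries;
  }

  // Lower bound on the time needed to use up all retries (sleep time only).
  static Duration timeToExhaust(int maxRetries, long sleepMillis) {
    // Widen to long before multiplying to avoid int overflow.
    return Duration.ofMillis((long) maxRetries * sleepMillis);
  }

  public static void main(String[] args) {
    // Old setting from the log: 10 retries, 1000 ms apart.
    System.out.println(timeToExhaust(10, 1000L).getSeconds() + " seconds");
    // New setting from the log: Integer.MAX_VALUE retries, 1000 ms apart.
    System.out.println(
        timeToExhaust(Integer.MAX_VALUE, 1000L).toDays() + " days");
  }
}
```

Under these assumptions the old budget is used up after 10 seconds of sleeping, while the new one lasts 24855 days (roughly 68 years), so the policy never gives up in practice.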

hanishakoneru · 2019-09-13T17:00:53Z

Thank you @vivekratnavel for working on this.
LGTM. +1.
Can you please update the description? It states that DNs fail after 10 retries, which is not the case.

vivekratnavel · 2019-09-13T18:28:22Z

@hanishakoneru Sure.

hanishakoneru · 2019-09-16T19:58:01Z

Thank you @vivekratnavel. +1. I will commit it.

…ecure environment (apache#1424)

HDDS-2107. Datanodes should retry forever to connect to SCM in an uns…

f440bad

…ecure environment

elek added the ozone label Sep 11, 2019

adoroszlai reviewed Sep 11, 2019

View reviewed changes

hanishakoneru merged commit 66bd168 into apache:trunk Sep 16, 2019

amahussein pushed a commit to amahussein/hadoop that referenced this pull request Oct 29, 2019

HDDS-2107. Datanodes should retry forever to connect to SCM in an uns…

cc23a44

…ecure environment (apache#1424)

RogPodge pushed a commit to RogPodge/hadoop that referenced this pull request Mar 25, 2020

HDDS-2107. Datanodes should retry forever to connect to SCM in an uns…

8cb9097

…ecure environment (apache#1424)
