HDFS-16070. DataTransfer block storm when datanode's io is busy. #3105
Conversation
Random rand = new Random(SEED);
rand.nextBytes(buffer);
stm.write(buffer);
LOG.info("Created file " + name + " with " + repl + " replicas.");
Use the logger's parameterized format ({}) instead of string concatenation.
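A minimal sketch of the suggested change, reusing the names from the snippet above:

```java
// SLF4J parameterized logging: the message is only formatted when INFO
// is enabled, so no string concatenation happens up front.
LOG.info("Created file {} with {} replicas.", name, repl);
```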
  writeFile(fileSys, name, repl, 2);
}

static protected void writeFile(FileSystem fileSys, Path name, int repl,
Prefer `protected static void writeFile()`, and the same for the other methods.
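For reference, a sketch with the conventional modifier order; the body is elided and the fourth parameter name is a guess, since the original signature is truncated above:

```java
// Conventional Java modifier order: access modifier first, then static.
protected static void writeFile(FileSystem fileSys, Path name, int repl,
    int numOfBlocks) throws IOException {
  // ... body unchanged from the snippet above ...
}
```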
stm.flush();
// Do not close stream, return it
// so that it is not garbage collected
return stm;
You never capture the returned stream when calling writeFile(), so it will be garbage collected in this current test.
Sure, thank you for checking so carefully. In fact, I copied this from another test and didn't rework it carefully. I will restructure the unit test.
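A minimal sketch of one way to restructure it, assuming writeFile returns the FSDataOutputStream it created, as the snippet above suggests:

```java
// Capture the returned stream so the open file is not garbage collected
// while the test runs; close it explicitly once the test is done with it.
FSDataOutputStream stm = writeFile(fileSys, name, repl, 2);
try {
  // ... assertions that need the file to stay open ...
} finally {
  stm.close();
}
```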
Assert.assertEquals(replicas.liveReplicas(), 2);

// 5. sleep and verify the unnecessary transfer.
Thread.sleep(3000);
Can this be made to wait for something? Otherwise, explain why 3 seconds?
There is no need to sleep, because we have already made sure the block meets the expected replication. I will remove it.
public boolean add(ExtendedBlock block) {
  boolean ret = super.add(block);
  try {
    Thread.sleep(30000);
Add a comment saying why this waits 30 seconds.
We need to sleep for more than DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY + DFS_HEARTBEAT_INTERVAL_KEY + DFS_NAMENODE_REDUNDANCY_INTERVAL_SECONDS_KEY seconds; that guarantees the pending reconstruction block times out, which triggers the next DataTransfer.
In this unit test, I want to simulate a duplicated DataTransfer for the same block.
30 seconds is too long; I will reset this value.
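A minimal sketch of deriving the stall time from the test configuration instead of hard-coding 30 seconds; conf is assumed to be the mini cluster's Configuration, and the default values shown are illustrative:

```java
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hdfs.DFSConfigKeys;

// Stall just longer than pending-reconstruction timeout + heartbeat
// interval + redundancy monitor interval, so the pending reconstruction
// is guaranteed to time out while the transfer is "stuck".
long stallMillis = TimeUnit.SECONDS.toMillis(
    conf.getInt(DFSConfigKeys.DFS_NAMENODE_RECONSTRUCTION_PENDING_TIMEOUT_SEC_KEY, 5)
    + conf.getLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3)
    + conf.getLong(DFSConfigKeys.DFS_NAMENODE_REDUNDANCY_INTERVAL_SECONDS_KEY, 3)
    + 1); // one extra second of margin
```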
    return ret;
  }
});
Field transferringBlock = DataNode.class
You are doing some reflection here; can you pull it into its own block with a comment explaining why?
We set DataNode.transferringBlock to sleepSet so that we can simulate a DataNode whose DataTransfer gets stuck.
When a node is being decommissioned, some nodes get stuck on I/O (network or disk), so DataTransfer can run for a long time.
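A minimal sketch of the injection, assuming transferringBlock is a private instance field on DataNode (the field name follows the snippet above; dataNode and sleepSet are the test's variables):

```java
import java.lang.reflect.Field;

// Replace DataNode's private transferringBlock set with the blocking
// sleepSet via reflection, so every scheduled transfer stalls.
Field transferringBlock = DataNode.class.getDeclaredField("transferringBlock");
transferringBlock.setAccessible(true);
transferringBlock.set(dataNode, sleepSet);
```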
private void waitBlockMeetReplication(BlockInfo blockInfo, Short repl) {
  boolean done = repl == blockInfo.numNodes();
  while (!done) {
    LOG.info("Waiting for repl change to " + repl + " current repl: "
Same here: use the logger's parameterized format ({}).
/*
 * Wait till node is fully decommissioned.
 */
private void waitBlockMeetReplication(BlockInfo blockInfo, Short repl) {
This could be done with LambdaTestUtils.waitFor().
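A minimal sketch using GenericTestUtils.waitFor from org.apache.hadoop.test, a closely related helper; the polling interval and timeout are illustrative:

```java
import org.apache.hadoop.test.GenericTestUtils;

// Poll every 100 ms until the block reaches the expected replication,
// failing the test after 30 s instead of hand-rolling a while loop.
GenericTestUtils.waitFor(() -> blockInfo.numNodes() == repl, 100, 30_000);
```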
LOG.info("{} Starting thread to transfer {} to {}", bpReg, block, | ||
xferTargetsString); | ||
if (transferringBlock.contains(block)) { | ||
LOG.warn( |
Can you explain why we are avoiding the block storm?
How big can this set get?
Just like I said: when a datanode's I/O is busy, DataTransfer can get stuck for a long time and the pending reconstruction block times out. Then a duplicated DataTransfer thread for the same block starts, and that duplicate makes the I/O even busier. In fact, we found our datanode's I/O stayed busy until we restarted the datanode. So I think we should avoid the duplicated DataTransfer.
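A minimal sketch of the guard as described; the early return is illustrative of skipping just this block, and per the discussion below, DataTransfer.run itself adds the block to the set and removes it in its finally clause:

```java
// Skip scheduling a second DataTransfer for a block that already has an
// in-flight transfer thread on this DataNode.
if (transferringBlock.contains(block)) {
  LOG.warn("Thread to transfer block {} is already running, ignore this request.",
      block);
  return;
}
```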
@@ -2646,6 +2655,7 @@ public void run() {
      } catch (Throwable t) {
        LOG.error("Failed to transfer block {}", b, t);
      } finally {
        transferringBlock.remove(b);
Are we sure we are cleaning this up and won't leave garbage?
I don't quite follow your advice: transferringBlock.add and transferringBlock.remove can only be called from DataTransfer.run, so I think it won't leave garbage.
When I sped up decommissioning, I found that some datanodes' I/O was busy: the hosts' load was very high, and tens of thousands of DataTransfer threads were running.
Then I found logs like the ones below.
You will see the last DataTransfer thread finished at 13:54:08, but the next DataTransfer had already started at 13:52:36.
If a DataTransfer does not finish within 10 minutes (pending timeout + check interval), the next DataTransfer for the same block starts running, and the disk and network load get even heavier.