[SPARK-36384][CORE][DOC] Add doc for shuffle checksum
### What changes were proposed in this pull request?

Add doc for the shuffle checksum configs in `configuration.md`.

### Why are the changes needed?

Documentation for the new shuffle checksum configs.

### Does this PR introduce _any_ user-facing change?

No, since Spark 3.2 hasn't been released.

### How was this patch tested?

Passes existing tests.

Closes apache#33637 from Ngone51/SPARK-36384.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Ngone51 authored and HyukjinKwon committed Aug 5, 2021
1 parent 0f5c3a4 commit 3b92c72
Showing 2 changed files with 26 additions and 5 deletions.
@@ -1370,21 +1370,24 @@ package object config {

   private[spark] val SHUFFLE_CHECKSUM_ENABLED =
     ConfigBuilder("spark.shuffle.checksum.enabled")
-      .doc("Whether to calculate the checksum of shuffle output. If enabled, Spark will try " +
-        "its best to tell if shuffle data corruption is caused by network or disk or others.")
+      .doc("Whether to calculate the checksum of shuffle data. If enabled, Spark will calculate " +
+        "the checksum values for each partition data within the map output file and store the " +
+        "values in a checksum file on the disk. When there's shuffle data corruption detected, " +
+        "Spark will try to diagnose the cause (e.g., network issue, disk issue, etc.) of the " +
+        "corruption by using the checksum file.")
       .version("3.2.0")
       .booleanConf
       .createWithDefault(true)

   private[spark] val SHUFFLE_CHECKSUM_ALGORITHM =
     ConfigBuilder("spark.shuffle.checksum.algorithm")
-      .doc("The algorithm used to calculate the checksum. Currently, it only supports" +
-        " built-in algorithms of JDK.")
+      .doc("The algorithm is used to calculate the shuffle checksum. Currently, it only supports " +
+        "built-in algorithms of JDK.")
       .version("3.2.0")
       .stringConf
       .transform(_.toUpperCase(Locale.ROOT))
       .checkValue(Set("ADLER32", "CRC32").contains, "Shuffle checksum algorithm " +
-        "should be either Adler32 or CRC32.")
+        "should be either ADLER32 or CRC32.")
       .createWithDefault("ADLER32")

   private[spark] val SHUFFLE_COMPRESS =
18 changes: 18 additions & 0 deletions docs/configuration.md
@@ -1032,6 +1032,24 @@ Apart from these, the following properties are also available, and may be useful
</td>
<td>1.6.0</td>
</tr>
<tr>
<td><code>spark.shuffle.checksum.enabled</code></td>
<td>true</td>
<td>
Whether to calculate the checksum of shuffle data. If enabled, Spark will calculate the checksum values for each partition
data within the map output file and store the values in a checksum file on the disk. When there's shuffle data corruption
detected, Spark will try to diagnose the cause (e.g., network issue, disk issue, etc.) of the corruption by using the checksum file.
</td>
<td>3.2.0</td>
</tr>
<tr>
<td><code>spark.shuffle.checksum.algorithm</code></td>
<td>ADLER32</td>
<td>
The algorithm is used to calculate the shuffle checksum. Currently, it only supports built-in algorithms of JDK, e.g., ADLER32, CRC32.
</td>
<td>3.2.0</td>
</tr>
</table>
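The "built-in algorithms of JDK" that `spark.shuffle.checksum.algorithm` accepts correspond to the `Adler32` and `CRC32` classes in `java.util.zip`. A minimal standalone sketch of computing both checksums over the same bytes (the class name and sample data are illustrative, not from the Spark sources):

```java
// Demonstrates the two JDK built-in checksum algorithms Spark accepts
// for spark.shuffle.checksum.algorithm: ADLER32 and CRC32.
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class ShuffleChecksumDemo {
    public static void main(String[] args) {
        // Stand-in for the bytes of one shuffle partition (illustrative data).
        byte[] partition = "shuffle partition bytes".getBytes(StandardCharsets.UTF_8);

        Adler32 adler = new Adler32();
        adler.update(partition, 0, partition.length);

        CRC32 crc = new CRC32();
        crc.update(partition, 0, partition.length);

        // Both return the checksum as an unsigned 32-bit value in a long.
        System.out.println("ADLER32=" + adler.getValue());
        System.out.println("CRC32=" + crc.getValue());
    }
}
```

Adler-32 is generally cheaper to compute than CRC-32, which is presumably why it is the default.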

### Spark UI
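Like other Spark properties, the two new configs can be set per application at submit time; a hypothetical invocation (the application class and jar names are placeholders):

```shell
spark-submit \
  --class com.example.MyApp \
  --conf spark.shuffle.checksum.enabled=true \
  --conf spark.shuffle.checksum.algorithm=CRC32 \
  my-app.jar
```

Note that the algorithm value is case-insensitive: the config applies `toUpperCase(Locale.ROOT)` before validation, so `crc32` is accepted as well.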
