
[SPARK-19347] ReceiverSupervisorImpl can add block to ReceiverTracker multiple times because of askWithRetry. #16690


Closed

Conversation

jinxing64

@jinxing64 jinxing64 commented Jan 24, 2017

What changes were proposed in this pull request?

ReceiverSupervisorImpl on the executor side reports a block's metadata back to ReceiverTracker on the driver side. In the current code, askWithRetry is used. However, the handling of AddBlock in ReceiverTracker is not idempotent, so a retried message may be processed multiple times.
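
For illustration, here is a minimal sketch of the non-idempotency (the blocks buffer and handleAddBlock below are hypothetical stand-ins, not the actual ReceiverTracker code):

```
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the driver-side AddBlock handler: it simply
// appends, so the same message delivered twice is recorded twice.
case class AddBlock(streamId: Int, blockId: String)

object NonIdempotentHandlerDemo {
  private val blocks = ArrayBuffer.empty[AddBlock]

  def handleAddBlock(msg: AddBlock): Boolean = {
    blocks += msg // no de-duplication
    true
  }

  def main(args: Array[String]): Unit = {
    val msg = AddBlock(streamId = 0, blockId = "input-0-1485212400000")
    handleAddBlock(msg) // first attempt: reply is lost or times out on the caller side
    handleAddBlock(msg) // askWithRetry resends, and the same block is added again
    println(s"recorded ${blocks.size} entries for one block") // prints 2
  }
}
```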

To reproduce:

  1. In ReceiverTracker, check whether it is the first time AddBlock is received; if so, sleep long enough (say 200 seconds) so that the first RPC call times out in askWithRetry and AddBlock is resent.
  2. Rebuild Spark and run the following job:
```
  def streamProcessing(): Unit = {
    val conf = new SparkConf()
      .setAppName("StreamingTest")
      .setMaster(masterUrl)
    val ssc = new StreamingContext(conf, Seconds(200))
    val stream = ssc.socketTextStream("localhost", 1234)
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
```

To fix:

It makes sense to provide a blocking version of ask in RpcEndpointRef, as mentioned in SPARK-18113 (#16503 (comment)), because the Netty RPC layer will not drop messages. askWithRetry is a leftover from the Akka days. It imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay much attention to when using it.
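
Roughly, the blocking call wraps the asynchronous ask and awaits the reply once, with no resend; a minimal sketch follows (class and method names here are illustrative, the final API shape is discussed in the review below):

```
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.reflect.ClassTag

// Sketch of a blocking ask built on top of the asynchronous one: the message
// is sent exactly once, and any timeout or failure propagates to the caller
// instead of triggering a resend (which is what made askWithRetry unsafe here).
abstract class SketchEndpointRef {
  def ask[T: ClassTag](message: Any): Future[T]

  def askSync[T: ClassTag](message: Any, timeout: Duration): T = {
    val future = ask[T](message)
    Await.result(future, timeout)
  }
}
```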

How was this patch tested?

Tested manually. The scenario described above does not happen with this patch.

@squito
Contributor

squito commented Jan 24, 2017

Jenkins, ok to test

@squito
Contributor

squito commented Jan 24, 2017

cc @vanzin @zsxwing

(I was going to make a longer comment about removing askWithRetry and whether we need another method, but then saw the comments on the referenced PR -- I'll defer to the experts here)

@SparkQA

SparkQA commented Jan 24, 2017

Test build #71938 has finished for PR 16690 at commit ce5216e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 26, 2017

Test build #72027 has finished for PR 16690 at commit fe86511.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2017

Test build #72057 has finished for PR 16690 at commit 3b7e17b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@vanzin @zsxwing
ping for review~

Contributor

@vanzin vanzin left a comment


Also, since you're adding the blocking API, please add a deprecation annotation to askWithRetry.


```
import org.apache.spark.{SparkConf, SparkException}
import org.apache.spark.internal.Logging
import org.apache.spark.util.RpcUtils
```


Contributor


nit: don't add.

```
* @tparam T type of the reply message
* @return the reply message from the corresponding [[RpcEndpoint]]
*/
def askWithBlocking[T: ClassTag](message: Any): T = askWithBlocking(message, defaultAskTimeout)
```
Contributor


askWithBlocking is a weird name. I'd use blockingAsk, or askSync.

```
try {
  val future = ask[T](message, timeout)
  val result = timeout.awaitResult(future)
  if (result == null) {
```
Contributor


This is not an error. It's perfectly legitimate to return null.

```
return result
} catch {
  case NonFatal(e) =>
    throw new SparkException(
```
Contributor


Isn't it better to just propagate the original exception? You can get the context from the stack trace.

@SparkQA

SparkQA commented Jan 28, 2017

Test build #72107 has finished for PR 16690 at commit e7002c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@vanzin
ping for review

@vanzin
Contributor

vanzin commented Jan 30, 2017

ping for review

Please be a little more patient, especially during weekends.

@jinxing64
Author

I feel very sorry if this is disturbing : )
@vanzin Thanks a lot for continuing to review this, and I'll be more patient : )
Sorry again~~

Contributor

@vanzin vanzin left a comment


Looks good, just need to fix the versions.

```
@@ -91,6 +123,7 @@ private[spark] abstract class RpcEndpointRef(conf: SparkConf)
* @tparam T type of the reply message
* @return the reply message from the corresponding [[RpcEndpoint]]
*/
@deprecated("use 'askSync' instead.", "2.1.0")
```
Contributor


2.2.0

```
@@ -75,10 +106,11 @@ private[spark] abstract class RpcEndpointRef(conf: SparkConf)
* @tparam T type of the reply message
* @return the reply message from the corresponding [[RpcEndpoint]]
*/
@deprecated("use 'askSync' instead.", "2.1.0")
```
Contributor


2.2.0

```
@@ -19,6 +19,7 @@ package org.apache.spark.rpc

import scala.concurrent.Future
import scala.reflect.ClassTag
import scala.util.control.NonFatal
```
Contributor


not used

@SparkQA

SparkQA commented Feb 1, 2017

Test build #72228 has finished for PR 16690 at commit 42eb540.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@vanzin
Thanks a lot for helping with this PR~ I've refined it~
Please take another look~

@vanzin
Contributor

vanzin commented Feb 1, 2017

LGTM, merging to master.

@asfgit asfgit closed this in c5fcb7f Feb 1, 2017
@jinxing64
Author

Thanks a lot for reviewing this PR~

asfgit pushed a commit that referenced this pull request Feb 3, 2017
## What changes were proposed in this pull request?

In the current code in `HeartbeatReceiverSuite`, the executorId is set as below:
```
  private val executorId1 = "executor-1"
  private val executorId2 = "executor-2"
```

The executorId is sent to the driver during registration, as below:

```
test("expire dead hosts should kill executors with replacement (SPARK-8119)")  {
  ...
  fakeSchedulerBackend.driverEndpoint.askSync[Boolean](
      RegisterExecutor(executorId1, dummyExecutorEndpointRef1, "1.2.3.4", 0, Map.empty))
  ...
}
```

When `CoarseGrainedSchedulerBackend` receives `RegisterExecutor`, the executorId is compared with `currentExecutorIdCounter` as below:
```
case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls)  =>
  if (executorDataMap.contains(executorId)) {
    executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
    context.reply(true)
  } else {
  ...
  executorDataMap.put(executorId, data)
  if (currentExecutorIdCounter < executorId.toInt) {
    currentExecutorIdCounter = executorId.toInt
  }
  ...
```

`executorId.toInt` will throw a `NumberFormatException`.

This unit test currently passes only because of `askWithRetry`: after the exception, the RPC call is retried, and on the retry the executorId is already in `executorDataMap`, so it takes the `if` branch and replies true.
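
For reference, a quick way to see the failure mode in plain Scala (independent of the suite):

```
// "executor-1".toInt throws because the string is not a plain number,
// while a purely numeric id parses fine.
scala.util.Try("executor-1".toInt) // Failure(java.lang.NumberFormatException: For input string: "executor-1")
scala.util.Try("1".toInt)          // Success(1)
```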

**To fix**
Rectify the executorId and replace `askWithRetry` with `askSync`; refer to #16690.
## How was this patch tested?
This fix is for a unit test; no new test is needed.

Author: jinxing <jinxing@meituan.com>

Closes #16779 from jinxing64/SPARK-19437.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
```
@@ -75,10 +105,11 @@ private[spark] abstract class RpcEndpointRef(conf: SparkConf)
* @tparam T type of the reply message
* @return the reply message from the corresponding [[RpcEndpoint]]
*/
@deprecated("use 'askSync' instead.", "2.2.0")
```
Member


It seems like this has caused the build to produce a lot of deprecation warnings. @jinxing64 could the callers of this method be changed, in Spark, to use the new alternative?

@jinxing64
Author

@srowen
What do you think about #16790?

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 19, 2017
## What changes were proposed in this pull request?

`askSync` has already been added to `RpcEndpointRef` (see SPARK-19347 and apache#16690 (comment)) and `askWithRetry` is marked as deprecated.
As mentioned in SPARK-18113 (apache#16503 (comment)):

>askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.

Since `askWithRetry` is only used inside Spark and not in user logic, it makes sense to replace all of its usages with `askSync`.
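
The call-site change is mechanical; roughly, using the `HeartbeatReceiverSuite` call quoted earlier in this thread as an illustration (not a verbatim hunk from the patch):

```
- fakeSchedulerBackend.driverEndpoint.askWithRetry[Boolean](
+ fakeSchedulerBackend.driverEndpoint.askSync[Boolean](
      RegisterExecutor(executorId1, dummyExecutorEndpointRef1, "1.2.3.4", 0, Map.empty))
```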

## How was this patch tested?
This PR doesn't change code logic; existing unit tests cover it.

Author: jinxing <jinxing@meituan.com>

Closes apache#16790 from jinxing64/SPARK-19450.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017