
[SPARK-17637][Scheduler]Packed scheduling for Spark tasks across executors #15218


Closed
wants to merge 8 commits into apache:master from zhzhan:packed-scheduler

Conversation

zhzhan
Contributor

@zhzhan zhzhan commented Sep 23, 2016

What changes were proposed in this pull request?

Restructure the code and implement two new task assigners.

PackedAssigner: tries to allocate tasks to the executors with the fewest available cores, so that Spark can release reserved executors when dynamic allocation is enabled.

BalancedAssigner: tries to allocate tasks to the executors with the most available cores, in order to balance the workload across all executors.

By default, the original round-robin assigner is used.

We tested a pipeline, and the new PackedAssigner saves around 45% of reserved CPU and memory with dynamic allocation enabled.
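
To make the two policies concrete, here is a minimal sketch of the ordering each assigner applies to the executors' free cores; the Offer class and method names below are illustrative stand-ins, not the exact code in this patch:

```scala
object AssignerSketch {
  // Illustrative stand-in; the patch's OfferState/WorkerOffer carry more detail.
  class Offer(val executorId: String, var coresAvailable: Int)

  // PackedAssigner idea: visit executors with the fewest free cores first, so
  // lightly loaded executors drain and can be released by dynamic allocation.
  def packedOrder(offers: Seq[Offer]): Seq[Offer] =
    offers.filter(_.coresAvailable > 0).sortBy(_.coresAvailable)

  // BalancedAssigner idea: visit executors with the most free cores first,
  // spreading tasks evenly across the cluster.
  def balancedOrder(offers: Seq[Offer]): Seq[Offer] =
    offers.filter(_.coresAvailable > 0).sortBy(o => -o.coresAvailable)
}
```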

How was this patch tested?

Both unit tests in TaskSchedulerImplSuite and manual tests in a production pipeline.

@SparkQA

SparkQA commented Sep 23, 2016

Test build #65830 has finished for PR 15218 at commit c3ebf9c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhzhan zhzhan changed the title [Spark-17637][Scheduler]Packed scheduling for Spark tasks across executors [SPARK-17637][Scheduler]Packed scheduling for Spark tasks across executors Sep 23, 2016
@SparkQA

SparkQA commented Sep 23, 2016

Test build #65832 has finished for PR 15218 at commit d5f76ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 23, 2016

Test build #65831 has finished for PR 15218 at commit ffe9800.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhzhan
Contributor Author

zhzhan commented Sep 23, 2016

Failed in DirectKafkaStreamSuite. It should have nothing to do with the patch.

@zhzhan
Contributor Author

zhzhan commented Sep 23, 2016

retest please

@gatorsmile
Member

gatorsmile commented Sep 23, 2016

@zhzhan
Contributor Author

zhzhan commented Sep 23, 2016

@gatorsmile Thanks. #65832 is the latest one which does not have the same failure.

@SparkQA

SparkQA commented Sep 24, 2016

Test build #65856 has finished for PR 15218 at commit f71f1c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

class BalancedAssigner(conf: SparkConf) extends TaskAssigner(conf) {
Contributor

It would be good to shuffle the workOffers for this class too.
Practically, this ensures that the initial heap will be randomized when cores are the same.

This will also mean that the Ordering below will need to handle the case of x.cores == y.cores but x != y.
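
A hedged sketch of this suggestion (the OfferState shape here is assumed, not the patch's exact class): shuffle the offers before seeding the max-heap so that executors with equal free cores are not always drawn in the same order.

```scala
import scala.collection.mutable
import scala.util.Random

object BalancedHeapSketch {
  // Assumed shape of the offer holder; field names are illustrative.
  class OfferState(val executorId: String, var coresAvailable: Int)

  def buildHeap(offers: Seq[OfferState]): mutable.PriorityQueue[OfferState] = {
    // Shuffle first so executors with equal free cores are drawn in random order.
    val shuffled = Random.shuffle(offers)
    // Max-heap on free cores: the executor with the most free cores is offered first.
    val heap = mutable.PriorityQueue.empty[OfferState](Ordering.by(_.coresAvailable))
    heap ++= shuffled
    heap
  }
}
```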

Contributor Author

BTW, I don't think we need to handle the case of x.cores == y.cores; it means they are equal, and the behavior then depends on the priority queue's algorithm.

Contributor

Returning 0 implies equality - which is not the case here (x != y but x.cores == y.cores).

Contributor Author

@mridulm Thanks for the comments. But I am lost here. My understanding is that, Ordering-wise, x is equal to y if x.cores == y.cores. This ordering is used by the priority queue to construct the data structure. The following is an example from the Ordering trait: PersonA will be equal to PersonB if they are the same age. Am I missing anything?

```scala
import scala.util.Sorting

case class Person(name: String, age: Int)
val people = Array(Person("bob", 30), Person("ann", 32), Person("carl", 19))

// sort by age
object AgeOrdering extends Ordering[Person] {
  def compare(a: Person, b: Person) = a.age compare b.age
}
Sorting.quickSort(people)(AgeOrdering)
```

Contributor

You are right, my bad. I was thinking of Ordered

Contributor

In class PackedAssigner, you added spacing between functions. Do you want to be consistent with the style?

@SparkQA

SparkQA commented Oct 4, 2016

Test build #66328 has finished for PR 15218 at commit c7a0ce2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhzhan
Contributor Author

zhzhan commented Oct 4, 2016

@mridulm Thanks for reviewing this. I will wait for a while in case there are more comments before addressing them.

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66465 has finished for PR 15218 at commit ed8dd69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Oct 7, 2016

Btw, taking a step back, I am not sure this will work as you expect it to.
Other than a few tasksets - those without locality information - the schedule is going to be highly biased towards the locality information supplied.

This will typically mean PROCESS_LOCAL (almost always) and then NODE_LOCAL - which means exactly matching the executor or host (irrespective of the order we traverse the offers).

The randomization of offers we do is for a specific set of purposes - spread the load if there is no locality information (not very common imo), or spread it across the cluster when the locality information is of lower quality - like from an InputFormat, or for shuffle when we are using heuristics that might not be optimal.

But since I have not looked at this in a while, I will CC Kay. +CC @kayousterhout please do take a look in case I am missing something.

@zhzhan
Contributor Author

zhzhan commented Oct 7, 2016

@mridulm Thanks for the comments. Your concern regarding locality is right. The patch does not change this behavior, which gives priority to the locality preference. But if multiple executors satisfy the locality restriction, the policy is applied among them. In our production pipeline, we do see a big gain with respect to reserved CPU resources when dynamic allocation is enabled.

@kayousterhout Would you like to take a look and provide your comments?

@mridulm
Contributor

mridulm commented Oct 9, 2016

@zhzhan I am curious why this is the case for the jobs being mentioned.
This PR should have an impact if the locality preference of the taskset being run is fairly suboptimal to begin with, no?

If the tasks have a PROCESS_LOCAL or NODE_LOCAL locality preference - that will take precedence, and attempts to spread the load or reduce the spread across nodes as envisioned here will not work.

So the target here seems to be the RACK_LOCAL or ANY locality preference - which should be fairly uncommon; unless I am missing something here w.r.t. the jobs being run.

EDIT: I can see one case where it will help, which is when you have shuffle tasks being run where the number of partitions is large (greater than the hardcoded thresholds in the code).
In this case, we end up without a locality preference; and if none of the RDDs run after the shuffle RDD in the shuffle task declare a locality preference, then you end up with no locality preference.
Is that the case you are observing? IIRC, if there are more than 1k map tasks or reduce tasks, then this behavior might be observed.

@zhzhan
Contributor Author

zhzhan commented Oct 9, 2016

@mridulm You are right. This patch is mainly for jobs that have multiple stages, which is very common in production pipelines. As you mentioned, if a shuffle is involved, getLocationsWithLargestOutputs in MapOutputTracker typically returns None for ShuffledRowRDD and ShuffledRDD because of the threshold REDUCER_PREF_LOCS_FRACTION (20%).

A ShuffledRowRDD/ShuffledRDD can easily have more than 10 partitions (even hundreds) in a real production pipeline, so the patch does help a lot with CPU reservation time.
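
For readers following along, a hedged sketch of the threshold logic being described; this is only the shape of the decision, not the real MapOutputTracker code:

```scala
object ReducerLocalitySketch {
  // The 20% threshold mentioned above; the real constant lives in MapOutputTracker.
  val REDUCER_PREF_LOCS_FRACTION = 0.2

  // bytesByLocation: map-output bytes feeding one reducer, keyed by executor/host.
  def preferredLocations(bytesByLocation: Map[String, Long]): Option[Seq[String]] = {
    val total = bytesByLocation.values.sum.toDouble
    if (total == 0) return None
    // Report a preference only if some location holds at least 20% of the output;
    // otherwise the reducer has no locality preference, and the task assigner
    // (round robin, packed, or balanced) decides where it runs.
    val prefs = bytesByLocation.collect {
      case (loc, bytes) if bytes / total >= REDUCER_PREF_LOCS_FRACTION => loc
    }.toSeq
    if (prefs.nonEmpty) Some(prefs) else None
  }
}
```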

@mridulm
Contributor

mridulm commented Oct 15, 2016

I am assuming @kayousterhout does not have comments on this.
Can you please fix the conflict, @zhzhan? I will merge it to master after that.

@SparkQA

SparkQA commented Oct 15, 2016

Test build #67021 has finished for PR 15218 at commit 98a9747.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in ed14633 Oct 16, 2016
@mridulm
Contributor

mridulm commented Oct 16, 2016

Merged to master, thanks @zhzhan !

@zhzhan
Contributor Author

zhzhan commented Oct 16, 2016

@mridulm Thanks for reviewing this.

@rxin
Contributor

rxin commented Oct 16, 2016

@zhzhan and @mridulm all the classes need to be private[scheduler], shouldn't they?

}
}

class PackedAssigner(conf: SparkConf) extends TaskAssigner(conf) {
Contributor

We need more documentation here to explain what this class does.


class PackedAssigner(conf: SparkConf) extends TaskAssigner(conf) {

var sorted: Seq[OfferState] = _
Contributor

all these variables should be private


// Release internally maintained resources. Subclass is responsible to
// release its own private resources.
def reset: Unit = {
Contributor

this should have parentheses since it has a side effect
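
A tiny illustration of the Scala convention being pointed out (not the patch's actual method body):

```scala
class Counter {
  private var n = 0
  // Side-effecting method: keep the parentheses to signal the mutation.
  def reset(): Unit = { n = 0 }
  // Pure, parameterless accessor: parentheses can be omitted.
  def value: Int = n
}
```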


import org.apache.spark.SparkConf

case class OfferState(workOffer: WorkerOffer, var cores: Int) {
Contributor

we need documentation explaining what this class does

Contributor

Also case classes are supposed to have mostly immutable state -- if you want cores to be mutable, I'd just make this a normal class.

Contributor

I read more code. Shouldn't cores be coresRemaining, or coresAvailable?
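
Combining the two suggestions above, a hedged sketch of what the class could look like; the constructor shape and the coresAvailable name are suggestions, not the merged code:

```scala
import scala.collection.mutable.ArrayBuffer

// A plain class instead of a case class, since coresAvailable is mutable state.
// WorkerOffer and TaskDescription are the scheduler's existing types.
class OfferState(val workOffer: WorkerOffer, var coresAvailable: Int) {
  // Tasks chosen for this worker offer during the current scheduling round.
  val tasks = new ArrayBuffer[TaskDescription](coresAvailable)
}
```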


case class OfferState(workOffer: WorkerOffer, var cores: Int) {
// Build a list of tasks to assign to each worker.
val tasks = new ArrayBuffer[TaskDescription](cores)
Contributor

Again I think you need to document what this actually does. My guess (without having looked at the rest of the code) is that the index indicates some worker id, but I'm not sure and I might be wrong. We need to explain it here.

Contributor

@rxin rxin Oct 16, 2016

Ah ok - my guess was wrong. It would be great to actually say what this list means, e.g. is this a queue?

@rxin
Contributor

rxin commented Oct 16, 2016

@mridulm @zhzhan I liked the idea here, but unfortunately I think it was merged prematurely. There is insufficient documentation, and some basic style that doesn't align with the rest of Spark. I'm going to revert this. It would be good to get this in, and I think with very little work we can get it to a shape that looks a lot better.

@@ -1334,6 +1334,17 @@ Apart from these, the following properties are also available, and may be useful
Should be greater than or equal to 1. Number of allowed retries = this value - 1.
</td>
</tr>
<tr>
<td><code>spark.task.assigner</code></td>
<td>org.apache.spark.scheduler.RoundRobinAssigner</td>
Contributor

rather than asking for the full class name, I'd just have "roundrobin" and "packed" (case insensitive) as the options and internally maintain the mapping.

Member

On the Spark SQL side, we did a similar thing for data sources. You can check the code in the function lookupDataSource.

Contributor

Yea in this case I wouldn't even support external assigners. Just have strings to use the built-in ones.
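
A possible shape for that string-keyed lookup, loosely modeled on the lookupDataSource idea mentioned above; the object name, config key, and fallback behavior here are assumptions:

```scala
import java.util.Locale
import org.apache.spark.SparkConf

// Map short, case-insensitive option values to the built-in assigners.
object TaskAssignerRegistry {
  def create(conf: SparkConf): TaskAssigner = {
    conf.get("spark.scheduler.taskAssigner", "roundrobin").toLowerCase(Locale.ROOT) match {
      case "roundrobin" => new RoundRobinAssigner(conf)
      case "packed"     => new PackedAssigner(conf)
      case "balanced"   => new BalancedAssigner(conf)
      // Unknown values fall back to the default instead of failing the app.
      case _            => new RoundRobinAssigner(conf)
    }
  }
}
```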

@@ -109,6 +109,72 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B
assert(!failedTaskSet)
}

test("Scheduler balance the assignment to the worker with more free cores") {
Contributor

thanks a lot for creating the test cases

@@ -1334,6 +1334,17 @@ Apart from these, the following properties are also available, and may be useful
Should be greater than or equal to 1. Number of allowed retries = this value - 1.
</td>
</tr>
<tr>
<td><code>spark.task.assigner</code></td>
Contributor

I'd add "scheduler" to the option, e.g. "spark.scheduler.taskAssigner"

val tasks = new ArrayBuffer[TaskDescription](cores)
}

abstract class TaskAssigner(conf: SparkConf) {
Contributor

instead of taking in a generic SparkConf, I'd just take in the CPUs per task for now, until we see a clear need to be more generic. This simplifies the dependencies of the class.
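
A minimal sketch of that narrower constructor; the signature is only the suggestion being made here, not what the patch ships:

```scala
// Depend only on what the assigner actually needs: how many CPUs each task takes.
abstract class TaskAssigner(cpusPerTask: Int) {
  // An offer can take another task only if it still has enough free cores left.
  protected def hasEnoughCores(coresAvailable: Int): Boolean =
    coresAvailable >= cpusPerTask
}

class RoundRobinAssigner(cpusPerTask: Int) extends TaskAssigner(cpusPerTask)
```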

@rxin
Contributor

rxin commented Oct 16, 2016

@zhzhan in general it'd be great to have proper documentation on the classes. For example, it is important to document the behavior of the various assigners, and even more importantly, to document the contract for TaskAssigner. The control flow is fairly confusing right now -- I'm not very smart, and things that are complicated take me a long time to understand, and when I try changing them in the future, there's a very good chance I will make a mistake and mess it up. It would be great if we could simplify the control flow. If we can't, then we should document it more clearly. For example, when init/reset should be called is part of the contract, and none of this is really documented.
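
To illustrate the kind of contract documentation being asked for, a hedged sketch of a documented TaskAssigner lifecycle; the method names echo those in this PR (init, getNext, accept, reset), but hasNext, the doc text, and the exact signatures are assumptions:

```scala
import org.apache.spark.SparkConf

/**
 * Strategy for choosing which worker offer receives the next task.
 *
 * Expected call sequence per scheduling round (the contract):
 *   1. init(offers): called once with the round's offers.
 *   2. hasNext / getNext(): the scheduler repeatedly asks for the next candidate offer.
 *   3. accept(assigned): after each getNext(), the scheduler reports whether a task
 *      was actually launched on that offer, so the assigner can update its state.
 *   4. reset(): called at the end of the round to release internal state.
 */
abstract class TaskAssigner(conf: SparkConf) {
  /** Prepare internal state for a new round of offers. */
  def init(offers: Seq[OfferState]): Unit
  /** Whether another offer is worth trying in this round. */
  def hasNext: Boolean
  /** The next candidate offer, chosen according to the concrete policy. */
  def getNext(): OfferState
  /** Feedback from the scheduler: was a task launched on the last offer returned? */
  def accept(assigned: Boolean): Unit
  /** Release internal state at the end of the round. */
  def reset(): Unit
}
```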

@zhzhan
Contributor Author

zhzhan commented Oct 16, 2016

@rxin Thanks a lot for the detailed review. I will update the patch.

@@ -61,6 +59,21 @@ private[spark] class TaskSchedulerImpl(

val conf = sc.conf

val DEFAULT_TASK_ASSIGNER = classOf[RoundRobinAssigner].getName
lazy val taskAssigner: TaskAssigner = {
val className = conf.get("spark.task.assigner", DEFAULT_TASK_ASSIGNER)
Member

Like MAX_TASK_FAILURES above, we can also add spark.task.assigner to the internal config object.
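
A hedged sketch of such an entry, following the ConfigBuilder pattern used elsewhere in Spark; the doc string and default shown here are assumptions:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Sketch of an entry inside the internal `config` object, next to MAX_TASK_FAILURES.
private[spark] val TASK_ASSIGNER = ConfigBuilder("spark.task.assigner")
  .doc("Class name of the TaskAssigner used to assign tasks to worker offers.")
  .stringConf
  .createWithDefault("org.apache.spark.scheduler.RoundRobinAssigner")
```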

By default, round robin with randomness is used.
org.apache.spark.scheduler.BalancedAssigner tries to balance the task across all workers (allocating tasks to
workers with most free cores). org.apache.spark.scheduler.PackedAssigner tries to allocate tasks to workers
with the least free cores, which may help releasing the resources when dynamic allocation is enabled.
Member

when dynamic allocation is enabled. ->
when dynamic allocation (spark.dynamicAllocation.enabled) is enabled.

assert(4 === taskDescriptions.length)
taskDescriptions.map(_.executorId)
}

Member

Nit: remove this empty line.

assert(!failedTaskSet)
}


Member

Nit: remove this empty line.

@@ -408,4 +474,5 @@ class TaskSchedulerImplSuite extends SparkFunSuite with LocalSparkContext with B
assert(thirdTaskDescs.size === 0)
assert(taskScheduler.getExecutorsAliveOnHost("host1") === Some(Set("executor1", "executor3")))
}

Member

Nit: remove this empty line.

@gatorsmile
Member

The test case design is pretty good. It covers all the scenarios.

  • Could you add a check for the negative case? That is, when users do not provide a valid TaskAssigner name, we fall back to the default round-robin one (see the sketch after this list).
  • For the existing unchanged test cases in TaskSchedulerImplSuite.scala, please add a check to verify that the default one is picked.
  • If possible, please change one of the existing test cases in TaskSchedulerImplSuite.scala to ensure that users are allowed to specify round robin as the task assigner.
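
A hedged sketch of the negative-case check from the first bullet; the setupSchedulerWithConf helper and the taskAssigner accessor are assumed for illustration, not taken from the actual suite:

```scala
test("fall back to round robin when an invalid task assigner is configured") {
  val conf = new SparkConf()
    .set("spark.task.assigner", "org.apache.spark.scheduler.DoesNotExist")
  // setupSchedulerWithConf is a hypothetical helper; the real suite wires the
  // scheduler up differently.
  val taskScheduler = setupSchedulerWithConf(conf)
  // An unknown class name should not fail the application; the scheduler should
  // fall back to the default RoundRobinAssigner.
  assert(taskScheduler.taskAssigner.isInstanceOf[RoundRobinAssigner])
}
```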

}

class RoundRobinAssigner(conf: SparkConf) extends TaskAssigner(conf) {
var i = 0
Member

Any better variable name?

return Ordering[Int].compare(x.cores, y.cores)
}
}
def init(): Unit = {
Member

override

def getNext(): OfferState

// Called by the TaskScheduler to indicate whether the current offer is accepted
// In order to decide whether the current is valid for the next offering.
Contributor

"In" should be "in"

val tid = task.taskId
taskIdToTaskSetManager(tid) = taskSet
taskIdToExecutorId(tid) = execId
executorIdToTaskCount(execId) += 1
availableCpus(i) -= CPUS_PER_TASK
assert(availableCpus(i) >= 0)
current.cores = current.cores - CPUS_PER_TASK
Contributor

Do you want to follow the previous style, current.cores -= CPUS_PER_TASK?

@zhzhan
Contributor Author

zhzhan commented Oct 19, 2016

@wangmiao1981 Thanks for reviewing this. I will open another PR addressing these comments soon.

@zhzhan zhzhan deleted the packed-scheduler branch October 19, 2016 02:44
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
…cutors

Author: Zhan Zhang <zhanzhang@fb.com>

Closes apache#15218 from zhzhan/packed-scheduler.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…cutors

Author: Zhan Zhang <zhanzhang@fb.com>

Closes apache#15218 from zhzhan/packed-scheduler.