[SPARK-10618] [Mesos] Refactoring scheduling condition and adding test for same #8771

Closed

wants to merge 1,144 commits

Conversation

SleepyThread
Contributor

This commit refactors a scheduling condition into a method and adds a test for the same.
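
For context, a minimal sketch of the shape of this refactoring (the method name and the condition fields match the snippets quoted later in this thread; the exact surrounding class is an assumption, not code from the PR):

```scala
// Hypothetical sketch: the inline scheduling condition from the
// resourceOffers path is pulled into a named, testable predicate.
// Field names such as totalCoresAcquired and failuresBySlaveId are
// taken from the diff quoted later in this conversation.
private def isOfferValidForScheduling(
    meetsConstraints: Boolean,
    slaveId: String,
    mem: Double,
    cpus: Int,
    sc: SparkContext): Boolean = {
  meetsConstraints &&
    totalCoresAcquired < maxCores &&
    mem >= calculateTotalMemory(sc) &&
    cpus >= 1 &&
    failuresBySlaveId.getOrElse(slaveId, 0) < MAX_SLAVE_FAILURES &&
    !slaveIdsWithExecutors.contains(slaveId)
}
```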

@SleepyThread
Contributor Author

@andrewor14 @dragos Can you please take a look at it?

```scala
val taskScheduler = mock[TaskSchedulerImpl]
when(taskScheduler.sc).thenReturn(sc)

```

kill space

@tnachen
Contributor

tnachen commented Sep 16, 2015

I think this is a good idea; can you do this for fine-grained mode too?

@dragos
Contributor

dragos commented Sep 16, 2015

Not a bad idea, though I don't think this should go in before #8639, which it conflicts with (and which adds real functionality, and is reviewed).

@SleepyThread
Contributor Author

I agree with @dragos. After #8639 is merged into master, I will update this pull request.

@tnachen I am a fan of small changes, so would it be OK if I open another pull request for fine-grained mode?

@SleepyThread
Contributor Author

Refactored and added a test for fine-grained mode. Waiting for #8639 to be merged into master.

@tnachen
Contributor

tnachen commented Sep 23, 2015

Actually, I'm thinking it would be better if we could log the exact condition that didn't pass and made us skip the offer. It becomes a bit hard for users to guess why no jobs are running.
How about making each condition a rule, and whenever one of them isn't matched, logging a corresponding message saying exactly why the offer was rejected?
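
A minimal sketch of what such rule-based validation could look like (all names below are hypothetical, not code from this PR):

```scala
// Hypothetical sketch: each scheduling condition becomes a named rule, and
// every rule that fails logs why the offer was skipped (at debug level, as
// agreed later in this thread).
case class OfferRule(reason: String, passes: Boolean)

def validateOffer(
    slaveId: String,
    meetsConstraints: Boolean,
    mem: Double,
    requiredMem: Double,
    cpus: Int): Boolean = {
  val rules = Seq(
    OfferRule("offer does not meet attribute constraints", meetsConstraints),
    OfferRule(s"insufficient memory ($mem < $requiredMem)", mem >= requiredMem),
    OfferRule(s"insufficient CPUs ($cpus < 1)", cpus >= 1))
  val failed = rules.filterNot(_.passes)
  // logDebug comes from Spark's Logging trait.
  failed.foreach(r => logDebug(s"Declining offer from slave $slaveId: ${r.reason}"))
  failed.isEmpty
}
```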

@SleepyThread
Contributor Author

@tnachen It's really a great idea, but we should log these things at debug level rather than info; otherwise it will clutter the logs.

I will add a new patch with rule-based validation and logging once #8639 is merged.

@tnachen
Contributor

tnachen commented Sep 24, 2015

Yes, please log at debug level :) But at least the information will be available.

@AmplabJenkins

Can one of the admins verify this patch?

@andrewor14
Contributor

@SleepyThread #8639 is merged into master now. Can you update this?

```diff
-      cpus >= 1 &&
-      failuresBySlaveId.getOrElse(slaveId, 0) < MAX_SLAVE_FAILURES &&
-      !slaveIdsWithExecutors.contains(slaveId)) {
+      if (isOfferValidForScheduling(meetsConstraints, slaveId, mem, cpus, sc)) {
```

I think the wording is a bit odd: it's not that the offer is not valid, it usually just doesn't meet the job requirements.
How about naming this isOfferSatisfiesRequirements?

@tnachen
Contributor

tnachen commented Nov 12, 2015

Just took another pass over this review; besides the naming suggestions and style fixes, overall it LGTM.
Also, please update the title of this PR since it's updating both coarse-grained and fine-grained mode.

HeartSaVioR and others added 8 commits November 24, 2015 09:20
…parent class loader

Without the patch, two additional tests of ExecutorClassLoaderSuite fail.

- "resource from parent"
- "resources from parent"

Detailed explanation is here, https://issues.apache.org/jira/browse/SPARK-11818?focusedCommentId=15011202&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15011202
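
A minimal sketch of the likely shape of the fix (reconstructed from the commit description; the `parentLoader` field is an assumption):

```scala
// Hypothetical sketch: ExecutorClassLoader fetches classes over the network,
// so it must explicitly delegate resource lookups to the parent loader.
override def getResource(name: String): java.net.URL =
  parentLoader.getResource(name)

override def getResources(name: String): java.util.Enumeration[java.net.URL] =
  parentLoader.getResources(name)
```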

Author: Jungtaek Lim <kabhwan@gmail.com>

Closes #9812 from HeartSaVioR/SPARK-11818.
We should pass resolved encoders to the logical `CoGroup` and bind them in the physical `CoGroup`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9928 from cloud-fan/cogroup.
Remove duplicate ml examples (only for ml).  mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9933 from yanboliang/SPARK-11685.
…aries ignore weight col

Document for 1.6 that the summaries mostly ignore the weight column.
To be corrected for 1.7.

CC: mengxr thunterdb

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9927 from jkbradley/linregsummary-doc.
Add read/write support to LDA, similar to ALS.

save/load for ml.LocalLDAModel is done.
For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send an update after some testing.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9894 from hhbyyh/ldaMLsave.
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9909 from cloud-fan/get-struct.
… bus's thread

This is continuation of SPARK-11761

Andrew suggested adding this protection. See tail of #9741

Author: tedyu <yuzhihong@gmail.com>

Closes #9852 from tedyu/master.
Currently, pivot's signatures look like:

```scala
@scala.annotation.varargs
def pivot(pivotColumn: Column, values: Column*): GroupedData

@scala.annotation.varargs
def pivot(pivotColumn: String, values: Any*): GroupedData
```

I think we can remove the one that takes `Column` types, since callers should always be passing in literals. It'd also be clearer if the values were not varargs, but rather a Seq or java.util.List.

I also made similar changes for Python.
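
For illustration, a call with the remaining overload might look like this (the DataFrame and column names are made up):

```scala
// Hypothetical usage after the change: pivot values passed as a Seq of
// literals rather than Column varargs.
df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
```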

Author: Reynold Xin <rxin@databricks.com>

Closes #9929 from rxin/SPARK-11946.
jbonofre and others added 10 commits December 12, 2015 08:51
…rait in order to avoid ClassCastException due to KryoSerializer in KinesisReceiver

Author: Jean-Baptiste Onofré <jbonofre@apache.org>

Closes #10203 from jbonofre/SPARK-11193.
https://issues.apache.org/jira/browse/SPARK-12199

Follow-up PR of SPARK-11551. Fix some errors in ml-features.md

mengxr

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10193 from yinxusen/SPARK-12199.
…ct disconnection message

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10261 from zsxwing/SPARK-12267.
… in the shutdown hook

1. Make sure workers and masters exit so that no worker or master will still be running when triggering the shutdown hook.
2. Set ExecutorState to FAILED if it's still RUNNING when executing the shutdown hook.

This should fix the potential exceptions when exiting a local cluster:
```
java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
	at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
	at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10269 from zsxwing/executor-state.
Currently, we can generate different plans for a query with a single distinct aggregation (depending on spark.sql.specializeSingleDistinctAggPlanning): one works better on low-cardinality columns, the other (the default) works better on high-cardinality columns.

This PR changes it to generate a single plan (three aggregations and two exchanges) that works better in both cases, so we can safely remove the flag `spark.sql.specializeSingleDistinctAggPlanning` (introduced in 1.6).

For a query like `SELECT COUNT(DISTINCT a) FROM table`, the plan will be:
```
AGG-4 (count distinct)
  Shuffle to a single reducer
    Partial-AGG-3 (count distinct, no grouping)
      Partial-AGG-2 (grouping on a)
        Shuffle by a
          Partial-AGG-1 (grouping on a)
```

This PR also includes a large refactoring of aggregation (removing 500+ lines of code).

cc yhuai nongli marmbrus

Author: Davies Liu <davies@databricks.com>

Closes #10228 from davies/single_distinct.
When SparkStrategies.BasicOperators's `case BroadcastHint(child) => apply(child)` is hit, it only recursively invokes BasicOperators.apply on this child. That gives the other strategies no chance to process the plan, which probably leads to the "No plan" issue, so we use planLater to go through all strategies.
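
A minimal sketch of the change as described (reconstructed from this description, not copied from the diff):

```scala
// Before: only BasicOperators itself plans the hinted child.
case BroadcastHint(child) => apply(child)

// After (hypothetical reconstruction): planLater defers planning of the
// child so that every strategy gets a chance to match it.
case BroadcastHint(child) => planLater(child) :: Nil
```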

https://issues.apache.org/jira/browse/SPARK-12275

Author: yucai <yucai.yu@intel.com>

Closes #10265 from yucai/broadcast_hint.
Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10282 from BenFradet/SPARK-12199.
… pyspark

JIRA: https://issues.apache.org/jira/browse/SPARK-12016

We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #10100 from viirya/fix-load-py-wordvecmodel.
cc yhuai felixcheung shaneknapp

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #10300 from shivaram/comment-lintr-disable.
I think it was a mistake, and we had not caught it until #10260, which began to check whether the `fromRowExpression` is resolved.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10263 from cloud-fan/encoder.
@andrewor14
Contributor

@SleepyThread have you had the chance to address the comments? Are you still working on this?

gatorsmile and others added 4 commits December 14, 2015 18:33
… in Dataset APIs

marmbrus This PR is to address your comment. Thanks for your review!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10214 from gatorsmile/followup12188.
Support UnsafeRow for the Coalesce/Except/Intersect.

Could you review if my code changes are ok? davies Thank you!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10285 from gatorsmile/unsafeSupportCIE.
Fix a minor typo (unbalanced bracket) in ResetSystemProperties.

Author: Holden Karau <holden@us.ibm.com>

Closes #10303 from holdenk/SPARK-12332-trivial-typo-in-ResetSystemProperties-comment.
cc tdas zsxwing, please review. Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #10305 from jerryshao/fix-typo-state-impl.
SleepyThread changed the title from "[SPARK-10618] [Mesos] Refactoring coarsed-grained scheduling conditio…" to "[SPARK-10618] [Mesos] Refactoring scheduling condition and adding test for same" on Dec 15, 2015
@SleepyThread
Contributor Author

@andrewor14 I have addressed all the comments. Please review the commit.

@andrewor14
Contributor

Can you rebase to master? Then @dragos or @tnachen can do another pass.

@SleepyThread
Contributor Author

@andrewor14 @dragos @tnachen I screwed up the merge. I will open a new pull request that is in sync with master.

@SleepyThread
Contributor Author

Please refer new pull request #10326. Closing this pull request.

SleepyThread deleted the upstream-local branch February 12, 2016 13:58