SPY-350 Don't let Master remove RUNNING Applications #21


Closed
wants to merge 3 commits

Conversation

markhamstra

SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition and prevent Master from removing Application with RUNNING Executors


Conflicts:
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
…exited executors

There seem to be two issues.

1. When a job is done, the driver asks the executors to shut down. However, this clean exit was assigned the FAILED executor state by the Worker. I introduced an EXITED executor state for executors that exit voluntarily (covering both normal and abnormal exits, depending on the exit code).

2. When the Master is notified that an executor has exited, it launches another one to replace it, regardless of the reason the executor exited. When the reason is that the job has finished, the unnecessary replacement is killed shortly afterwards when the App disassociates. This launching and killing of unnecessary executors shows up in the logs and confuses users. I added a check of the executor's exit status so that unnecessary replacements are not launched (and subsequently killed) when executors exit cleanly.

One could ask the scheduler to tell the Master that the job is done so the Master wouldn't launch a replacement executor. However, there is a race condition between the App telling the Master the job is done and the Worker telling the Master that an executor has exited; there is no guarantee the former happens before the latter. Instead, I chose to check the exit code when an executor exits: if the exit code is 0, I assume the executor was asked to shut down by the driver and the Master will not launch a replacement.

Because of the race condition, it could also happen (although it didn't on my local cluster) that the Master detects the App disassociation event before the executor exits on its own. In that case, the executor is rightfully killed and labeled KILLED, while the App state shows FINISHED.
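
As a rough illustration of the behavior described above, here is a minimal sketch (not the actual Spark source; the object and method names are made up for this example):

```scala
// Minimal sketch of the proposed handling; names are illustrative.
object ExecutorLifecycleSketch {

  object ExecutorState extends Enumeration {
    // EXITED marks executors that shut down on their own (e.g. after the driver asked them to).
    val LAUNCHING, LOADING, RUNNING, KILLED, FAILED, LOST, EXITED = Value
  }

  // Worker/ExecutorRunner side: pick the state from the process exit code
  // instead of always reporting FAILED.
  def stateForExitCode(exitCode: Int): ExecutorState.Value =
    if (exitCode == 0) ExecutorState.EXITED else ExecutorState.FAILED

  // Master side: launch a replacement only when the exit was not a clean shutdown.
  def shouldRelaunch(exitStatus: Option[Int]): Boolean =
    exitStatus != Some(0)
}
```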

Author: Kan Zhang <kzhang@apache.org>

Closes apache#306 from kanzhang/SPARK-1118 and squashes the following commits:

cb0cc86 [Kan Zhang] [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
if (appInfo.incrementRetryCount() < ApplicationState.MAX_NUM_RETRY) {
  schedule()
} else {
  logError("Application %s with ID %s failed %d times, removing it".format(


Is the convention in Spark to use "format" rather than string interpolation?

Author

Good point. This section of code is old enough that there is potential to backport the change all the way to pre-string-interpolation Spark 0.8, but realistically that's not going to happen, so I'll update to string interpolation.
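
For reference, the two styles being discussed look roughly like this (the values below are illustrative, not taken from the diff):

```scala
val appName = "example-app"                  // illustrative values only
val appId = "app-20140101000000-0000"
val retryCount = 10

// .format style, compatible with the pre-string-interpolation Spark 0.8 era:
val viaFormat =
  "Application %s with ID %s failed %d times, removing it".format(appName, appId, retryCount)

// Scala 2.10+ string interpolation, as suggested in the review:
val viaInterpolation =
  s"Application $appName with ID $appId failed $retryCount times, removing it"
```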

@jhartlaub

Would it be possible to create a unit test to exercise this situation?

val exitCode = process.waitFor()
- state = ExecutorState.FAILED
+ state = ExecutorState.EXITED


The state needs to take the exit code into account, because ExecutorExitCode defines other non-zero exit codes, and those should still map to ExecutorState.FAILED. In other words: if the exit code is non-zero, set state = ExecutorState.FAILED, otherwise set state = ExecutorState.EXITED, as sketched below.
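
A sketch of what that check might look like in ExecutorRunner (simplified; the specific non-zero codes in ExecutorExitCode are not enumerated here):

```scala
// Treat only a zero exit code as a voluntary, clean shutdown; every non-zero code
// (including the special codes defined in ExecutorExitCode) still maps to FAILED.
val exitCode = process.waitFor()
state = if (exitCode != 0) ExecutorState.FAILED else ExecutorState.EXITED
```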

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 17, 2014
…either from BlockManager or WAL in HDFS

As part of the initiative to prevent data loss on streaming driver failure, this sub-task implements a BlockRDD that is backed by HDFS. This BlockRDD can either read data from Spark's BlockManager or read the data from file segments in the write-ahead log in HDFS.

Most of this code has been written by @harishreedharan

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Hari Shreedharan <hshreedharan@apache.org>

Closes apache#2931 from tdas/driver-ha-rdd and squashes the following commits:

209e49c [Tathagata Das] Better fix to style issue.
4a5866f [Tathagata Das] Addressed one more comment.
ed5fbf0 [Tathagata Das] Minor updates.
b0a18b1 [Tathagata Das] Fixed import order.
20aa7c6 [Tathagata Das] Fixed more line length issues.
29aa099 [Tathagata Das] Fixed line length issues.
9e47b5b [Tathagata Das] Renamed class, simplified+added unit tests.
6e1bfb8 [Tathagata Das] Tweaks testuite to create spark contxt lazily to prevent contxt leaks.
9c86a61 [Tathagata Das] Merge pull request alteryx#22 from harishreedharan/driver-ha-rdd
2878c38 [Hari Shreedharan] Shutdown spark context after tests. Formatting/minor fixes
c709f2f [Tathagata Das] Merge pull request alteryx#21 from harishreedharan/driver-ha-rdd
5cce16f [Hari Shreedharan] Make sure getBlockLocations uses offset and length to find the blocks on HDFS
eadde56 [Tathagata Das] Transferred HDFSBackedBlockRDD for the driver-ha-working branch
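
As a rough illustration of the read path described in that commit message, here is a hypothetical, simplified sketch with stand-in types (not the actual HDFSBackedBlockRDD code):

```scala
// Hypothetical sketch: try the BlockManager first, then fall back to reading the
// corresponding write-ahead-log segment from HDFS.
object WalBackedReadSketch {

  case class WalSegment(path: String, offset: Long, length: Int)

  trait SimpleBlockStore {
    def get(blockId: String): Option[Iterator[Array[Byte]]]
  }

  def readPartition(
      blockId: String,
      segment: WalSegment,
      blockStore: SimpleBlockStore,
      readFromWal: WalSegment => Iterator[Array[Byte]]): Iterator[Array[Byte]] =
    blockStore.get(blockId) match {
      case Some(data) => data                  // block still available in the BlockManager
      case None       => readFromWal(segment)  // recover the data from the WAL in HDFS
    }
}
```
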
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jan 12, 2015
…ommands.

Adding support for defining a schema in foreign DDL commands. Now foreign DDL supports commands like:
```
CREATE TEMPORARY TABLE avroTable
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```
With this PR the user can define a schema instead of inferring it from the file, so DDL commands like the following are supported:
```
CREATE TEMPORARY TABLE avroTable(a int, b string)
USING org.apache.spark.sql.avro
OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro")
```

Author: scwf <wangfei1@huawei.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Fei Wang <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes apache#3431 from scwf/ddl and squashes the following commits:

7e79ce5 [Fei Wang] Merge pull request alteryx#22 from yhuai/pr3431yin
38f634e [Yin Huai] Remove Option from createRelation.
65e9c73 [Yin Huai] Revert all changes since applying a given schema has not been testd.
a852b10 [scwf] remove cleanIdentifier
f336a16 [Fei Wang] Merge pull request alteryx#21 from yhuai/pr3431yin
baf79b5 [Yin Huai] Test special characters quoted by backticks.
50a03b0 [Yin Huai] Use JsonRDD.nullTypeToStringType to convert NullType to StringType.
1eeb769 [Fei Wang] Merge pull request alteryx#20 from yhuai/pr3431yin
f5c22b0 [Yin Huai] Refactor code and update test cases.
f1cffe4 [Yin Huai] Revert "minor refactory"
b621c8f [scwf] minor refactory
d02547f [scwf] fix HiveCompatibilitySuite test failure
8dfbf7a [scwf] more tests for complex data type
ddab984 [Fei Wang] Merge pull request alteryx#19 from yhuai/pr3431yin
91ad91b [Yin Huai] Parse data types in DDLParser.
cf982d2 [scwf] fixed test failure
445b57b [scwf] address comments
02a662c [scwf] style issue
44eb70c [scwf] fix decimal parser issue
83b6fc3 [scwf] minor fix
9bf12f8 [wangfei] adding test case
7787ec7 [wangfei] added SchemaRelationProvider
0ba70df [wangfei] draft version
@mbautin

mbautin commented Jan 30, 2015

@markhamstra: do we still need this PR?

@markhamstra
Author

No, this has already been fully integrated and gone through another generation of development upstream, and the fix is already present in csd-1.1 and csd-1.2.

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017
* Add kubernetes profile to travis yml file

* Fix long lines in CompressionUtils.scala
5 participants