[SPARK-14475] Propagate user-defined context from driver to executors #12248
Conversation
Test build #55276 has finished for PR 12248 at commit
The MiMa failure is because
```diff
@@ -206,6 +210,11 @@ private[spark] object Task {
     dataOut.writeLong(timestamp)
   }

+    // Write the task properties separately so it is available before full task deserialization.
```
Since the properties aren't transient in `Task`, I guess this means that we'll write them out twice. If we want to avoid this, we can make `localProperties` into a `@transient var` which is `private[spark]`, then re-set the field after deserializing the task. Tasks are sent to executors using broadcast variables, so the extra space only makes a difference for the first task from a stage that's run on an executor. As a result, if we think that these serialized properties will typically be small, then the extra space savings probably aren't a huge deal, but if we want to heavily optimize then we can do the `var` trick.
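For illustration, a minimal sketch of the `@transient var` pattern being described, with illustrative names (this is not the actual `Task` class):

```scala
package org.apache.spark.example // placed under org.apache.spark only so private[spark] compiles

import java.util.Properties

// The field is skipped by Java serialization of the task body, so the properties are
// not written twice; the deserializing side (the executor) re-sets it from the
// separately written properties.
private[spark] class ExampleTask(val stageId: Int, val partitionId: Int) extends Serializable {
  @transient private[spark] var localProperties: Properties = new Properties
}

// On the executor, after deserializing the task body and the properties separately:
//   task.localProperties = separatelyDeserializedProps
```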
Done
To fix MiMa, add the ignores to spark/project/MimaExcludes.scala (line 613 at 49fb237).
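For reference, entries in that file look roughly like the sketch below; the exact problem type and signature needed for this PR may differ, so treat this as illustrative only:

```scala
// project/MimaExcludes.scala -- illustrative exclusion entry; the concrete rule
// required for this change may use a different problem type or signature.
import com.typesafe.tools.mima.core._

ProblemFilters.exclude[MissingMethodProblem](
  "org.apache.spark.TaskContext.getLocalProperty")
```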
Test build #55287 has finished for PR 12248 at commit
Test build #55285 has finished for PR 12248 at commit
Test build #55290 has finished for PR 12248 at commit
Test build #55291 has finished for PR 12248 at commit
Test build #55320 has finished for PR 12248 at commit
Backing up, how does this differ from using broadcast variables for data, or from simply sending small properties objects in a closure? Does it need all this complexity of yet another mechanism?
The main difference is that propagation is transparent to user code.
Propagating references from a closure is already transparent; you just reference whatever you want, like a Properties object, and it goes with the task. What's the use case for something more than this?
It's not, though: if you want to propagate something new without manually passing it through all your closures, this cannot be done today. For example, consider a Spark library that wants to implement a per-job
What you have to do now is something more like:
which is more verbose and hard to maintain.
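For illustration, the kind of manual plumbing being described looks roughly like the sketch below; the property key `myapp.logLevel` and the job logic are assumptions, not code from this PR. Every closure (and every library call made from inside it) has to capture or be handed the settings object explicitly.

```scala
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}

object ManualThreadingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    // Per-job settings the caller wants visible inside its tasks.
    val jobProps = new Properties()
    jobProps.setProperty("myapp.logLevel", "DEBUG")

    // Without automatic propagation, every closure must reference jobProps directly,
    // and any library code called from it must accept the settings as a parameter.
    val labeled = sc.parallelize(1 to 10).map { x =>
      val level = jobProps.getProperty("myapp.logLevel")
      s"[$level] processed $x"
    }.collect()

    labeled.foreach(println)
    sc.stop()
  }
}
```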
Your change is about passing around a … Your example, however, seems to be about configuring some global per-function behavior, not sending props. In this example, why would the library not call …?
That's not true; if you access a static …
It's more about configuring behavior based on some property set by some upstream caller of the function. The idea is that the user wants to configure the log level just for this job, without impacting any other jobs potentially running on the cluster.
Sorry, I should have made the example more explicit. setLogLevel would be implemented on the driver side as … I think, more generally, that this adds a mechanism for passing values implicitly without requiring the user (who is writing Spark code) to manually reference it in each of their closures. You are right that this can be achieved via other mechanisms, but those may not be convenient or practical for the use case, e.g. if you want to integrate with something like X-trace (which is out of the scope of this PR, but would be easy to add once we have the mechanism).
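A minimal sketch of the propagation path being argued for, assuming an illustrative property key: the library's driver-side `setLogLevel` helper reduces to a `setLocalProperty` call, and executor-side code reads the value through the new `TaskContext.getLocalProperty` without any closure plumbing.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object PerJobLogLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))

    // Driver side: what a library's per-job setLogLevel helper could do -- set a
    // local property scoped to jobs submitted from this thread.
    sc.setLocalProperty("myapp.logLevel", "DEBUG")

    // Executor side: task code (including library code the user never modified) can
    // read the property without it being threaded through every closure.
    val out = sc.parallelize(1 to 10).map { x =>
      val level = Option(TaskContext.get().getLocalProperty("myapp.logLevel")).getOrElse("INFO")
      s"[$level] processed $x"
    }.collect()

    out.foreach(println)
    sc.stop()
  }
}
```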
@srowen, I think that the main use-case for this feature is associating metadata with a Spark action / execution and making that metadata accessible in that action's tasks. For instance, let's say that I run a Spark SQL query and want to propagate some metadata related to that query execution from the driver to the executors for use in tracing / debugging / instrumentation. Maybe I want to propagate a label associated with all tasks launched from the job, such as a job group name, and read that label in a custom log appender so that my log messages from those tasks contain that metadata. In this case, the actual RDD code isn't controlled by the user and they don't really have a place to interpose broadcast variables or other custom code for propagating this metadata. Even if the user's library code were to use broadcast variables and define thread-local variables, etc., they'd have to worry about some subtleties related to Spark's internal threading model: for example, thread-locals need to be handled carefully to make sure that they're correctly propagated across thread boundaries in PythonRDD, RRDD, ScriptTransformation, PipedRDD, etc., and the set of places where you'd need to do that propagation corresponds exactly to the set of places where we already happen to be propagating the TaskContext thread-local. Given that …
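As a hedged illustration of the log-appender use case (the appender class and property key below are hypothetical, not part of this PR), an executor-side log4j 1.x appender could pull the job label out of the running task's local properties:

```scala
import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.spi.LoggingEvent
import org.apache.spark.TaskContext

// Hypothetical appender: prefixes each executor-side log line with a label read
// from the task's local properties.
class JobLabelAppender extends AppenderSkeleton {
  override def append(event: LoggingEvent): Unit = {
    // TaskContext.get() is null when logging happens outside a running task.
    val label = Option(TaskContext.get())
      .flatMap(tc => Option(tc.getLocalProperty("myapp.jobGroupLabel")))
      .getOrElse("no-label")
    System.err.println(s"[$label] ${event.getLevel} - ${event.getRenderedMessage}")
  }

  override def close(): Unit = {}

  override def requiresLayout(): Boolean = false
}
```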
I mean making a … I understand Josh's use case more. There are certainly tasks and RDDs entirely internal to some Spark process. But those also won't know anything about what to do with some custom user properties. Maybe eventually they invoke a UDF that could use these properties. In many cases that UDF could still just refer to whatever config you like directly (right?), but I'm probably not thinking of some case where this fails to work. I take the point about this already being an API for the caller anyway.
```diff
   with Logging {

   /** A constructor used only in test suites. This does not require passing in an RDD. */
   def this(partitionId: Int) {
-    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null)
+    this(0, 0, null, new Partition { override def index: Int = 0 }, null, null, new Properties)
```
I wonder if we can avoid making empty `Properties` all over ... an `Option[Properties]`? A setter that is called only where needed?
`Properties` objects are kind of analogous to `Map`s, and I think that `Option[Map]` would be kind of a weird type, in the same sense that `Option[Set]` (or any other collection type) is usually kind of a code smell. So this is fine with me as-is.
It seemed safer to make it required. I can change this to an option if you think creating a Properties each time is too much overhead.
Fair enough, I suppose allocating the empty map/properties object isn't that expensive.
@srowen, suppose you have an existing service running Spark jobs that read from a custom datasource. You want to add log4j trace annotations in order to attribute datasource logs back to the original caller of the service. However, you want to avoid invasive changes to the existing code. This is a two-line change with the proposed API.
The alternative is to explicitly reference …
This can as easily be ...
I get that if … I think the fact that it's already an API reduces the cost of a change like this in comparison, so I can see the argument for it.
Yes, exactly: this is for implementing functionality such as tracing, where any existing-code modification may be too burdensome for users due to, e.g., too much plumbing or libraries they cannot modify. It's the same argument as for thread-locals, but in this case spanning driver -> worker interactions.
It seems like we agree that this API is easy to support in Spark and hard/impossible to implement as cleanly in client code. As a result, I think this is okay to merge, so I'm going to run this one more time and will merge a bit after Jenkins passes. If anyone thinks that we need more discussion before accepting this API, let me know.
Jenkins, retest this please.
Test build #55545 has finished for PR 12248 at commit
Going to merge this into master. Thanks.
@ericl @JoshRosen @srowen |
What changes were proposed in this pull request?
This adds a new API call, `TaskContext.getLocalProperty`, for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.
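For illustration, a minimal streaming usage sketch under assumed names (the property key and socket source are not part of this patch): the local property is set before `ssc.start()`, so it is part of the driver context propagated to the streaming tasks.

```scala
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLocalPropertySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Set before ssc.start(), so it is captured in the initial driver context.
    ssc.sparkContext.setLocalProperty("myapp.jobLabel", "nightly-ingest")

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      rdd.foreach { line =>
        // Runs on executors; reads the property propagated from the driver.
        val label = TaskContext.get().getLocalProperty("myapp.jobLabel")
        println(s"[$label] $line")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```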
How was this patch tested?
Unit tests.
cc @JoshRosen