[SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI #15971
Conversation
This commit adds a new property called `spark.secret.redactionPattern` that allows users to specify a regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches the property or environment variable name, its value is redacted from the environment UI and various logs like YARN and event logs.

This change uses this property to redact information from event logs and YARN logs. It also updates the UI code to adhere to this property instead of hardcoding the logic to decipher which properties are sensitive.

For testing:
1. Unit tests were added to ensure that redaction works.
2. A YARN job reading data off S3 was run, with confidential information (a Hadoop credential provider password) provided in the environment variables of the driver and executor. Afterwards, the logs were grepped to make sure that no mention of the secret password was present. It was also verified that the job was able to read the data off S3 correctly, confirming that the sensitive information still trickled down to the right places.
3. The event logs were checked to make sure no mention of the secret password was present.
4. The UI Environment tab was checked to make sure no secret information was being displayed.
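A minimal, hypothetical usage sketch of the proposed property (the name used here is the one originally proposed; it gets renamed to `spark.redaction.regex` later in this review, and the extra `TOKEN` alternative is purely illustrative):

```scala
import org.apache.spark.SparkConf

// Any property or env-var *name* matching the pattern has its value replaced
// with "*********(redacted)" in the environment UI and in YARN/event logs.
val conf = new SparkConf()
  .set("spark.secret.redactionPattern", "secret|password|SECRET|PASSWORD|TOKEN")
  .set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "hunter2") // redacted downstream
  .set("spark.app.name", "redaction-demo")                       // left intact
```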
Does this protect against doing …?
If a user has access to spark-shell (say, in a cluster secured through Kerberos), it may be reasonable for them to see the sensitive information. This, however, prevents anyone from going to the unauthenticated Spark UI and having all the creds to S3.
Test build #68970 has finished for PR 15971 at commit
This is a good upgrade over the last fix. I made a couple notes but overall I think this looks pretty good.
```
@@ -223,4 +223,13 @@ package object config {
      " bigger files.")
    .longConf
    .createWithDefault(4 * 1024 * 1024)

  private[spark] val SECRET_REDACTION_PATTERN =
    ConfigBuilder("spark.secret.redactionPattern")
```
Wouldn't this mean this pattern would get redacted, since it contains "secret"?
Good point! I think it's good for the actual regex to not be redacted. So, I will rename the property to be clearer anyway (and not have "secret" in the name) to `spark.redaction.regex`.

```
      " When this regex matches the property or environment variable name, its value is " +
      "redacted from the environment UI and various logs like YARN and event logs")
    .stringConf
    .createWithDefault("secret|password|SECRET|PASSWORD")
```
would a case-insensitive version be better?
+1
Fixed.
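For reference, an embedded `(?i)` flag is one way to get the case-insensitive behaviour without listing both cases; a sketch (the exact final default isn't quoted in this thread):

```scala
// "(?i)" enables case-insensitive matching for the rest of the pattern,
// so one alternation covers secret/SECRET, password/PASSWORD, etc.
val pattern = "(?i)secret|password".r

assert(pattern.findFirstIn("HADOOP_CREDSTORE_PASSWORD").isDefined)
assert(pattern.findFirstIn("spark.app.name").isEmpty)
```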
```
  private[spark] val SECRET_REDACTION_PATTERN =
    ConfigBuilder("spark.secret.redactionPattern")
      .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
```
nit: space after "regex"
Just call it a "regex", since the regex syntax is actually defined by the JDK libraries and not by Scala.
Fixed.
```
    ConfigBuilder("spark.secret.redactionPattern")
      .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
        "environment variables in driver and executor environments contain sensitive information." +
        " When this regex matches the property or environment variable name, its value is " +
```
"... matches a property ..."
Fixed.
```
@@ -2555,6 +2555,15 @@ private[spark] object Utils extends Logging {
      sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
    }
  }

  private[util] val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
  def redact(conf: SparkConf)(kv: (String, String)): (String, String) = {
```
nit: add empty line
Also, not sure currying is buying you anything. In fact the caller syntax becomes clearer if you don't use it.
Fixed the empty line. So, have the redact method take in a conf as a parameter and return a method?
I see them as equivalent, but don't feel strongly about it, so I will change it.
I mean the `redact` method should take 2 parameters, the config and the list of things to be redacted, instead of using currying for the second parameter.
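A sketch of the call-site difference being pointed at (collection name `kvs` is illustrative, not from the diff):

```scala
// Curried form under review: callers partially apply redact(conf) to get a
// (String, String) => (String, String) and map it over each pair.
val redactedCurried = kvs.map(Utils.redact(conf))

// Suggested uncurried form: one call over the whole list, clearer at the call site.
val redacted = Utils.redact(conf, kvs)
```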
```
  private[util] val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
  def redact(conf: SparkConf)(kv: (String, String)): (String, String) = {
    val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
```
This is very expensive. How about a version that takes a list of tuples and redacts them?
What part do you think is expensive? Going through all the configuration properties and matching them with the regex?
If so, I agree. However, that has to be done somewhere. All the callers of this function have a `SparkConf` that they want stuff redacted from. So, if this function accepts a list of tuples, they have to run the regex check to find that list first before sending it to `redact()`. So, overall, unless I am missing something, I don't think we can avoid the expense.
Compiling the regex once for every item in the list being redacted, instead of doing it once for the whole list.
Ah, good point. Let me fix this.
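A sketch of the resulting shape (two parameters, pattern compiled once for the whole list; `SECRET_REDACTION_PATTERN` and `REDACTION_REPLACEMENT_TEXT` come from the diffs above):

```scala
def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
  // Compile the regex once, instead of once per key/value pair.
  val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
  kvs.map { case (k, v) =>
    if (redactionPattern.findFirstIn(k).isDefined) (k, REDACTION_REPLACEMENT_TEXT) else (k, v)
  }
}
```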
```
@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
    }
  }


  private def redactEvent(event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
```
Any reason why you chose this instead of redacting at the source of `SparkListenerEnvironmentUpdate`?
Good question. I thought about this quite a bit when making this change. I should have posted that decision in the PR description; apologies for not providing that context. The way I see it, there are three places where redaction can be done:
- Right at the source of `SparkListenerEnvironmentUpdate` (here, in SparkContext.scala).
- In JsonProtocol.scala, when converting the event to JSON.
- In EventLoggingListener.scala (where it is right now), when the event is being persisted to disk.

A user could write a custom listener that listened to the environment updates and did something useful with them, and I didn't want to impose redaction on them. They could be using the updates to create a clone of their environment, for example, and may need the same sensitive properties. So, I ruled out 1.
And JsonProtocol seemed like a generic utility to convert events to JSON. While I could be selective about only redacting events of `SparkListenerEnvironmentUpdate` type, I didn't want to assume that everyone translating the environment update to JSON should only be seeing redacted configuration. That made me rule out 2.
I decided that it was best to redact "closest to disk", which made me put the redaction code where I did, in EventLoggingListener. Hope that makes sense; happy to hear your thoughts if you think otherwise.
The downside is that with this choice you have more complicated tests, and you have to do redaction at every place where this information might be written (which at the moment is just two: the event logger and the UI).
JsonProtocol is not really a choice because it doesn't cover the UI.

> A user could write a custom listener that listened to the environment updates and did something useful with them. And, I didn't want to impose redaction on them.

That's the only argument I can buy, and said user has different ways to get that information.
> That's the only argument I can buy, and said user has different ways to get that information.

True, `sparkConf.getAll` is always available.
I still think it's better to put the redaction closer to the "sinks" for now. The good thing, though, is that if we see the number of "sinks" increasing, and everyone wanting redaction, we can move redaction further upstream. For now, two sinks seem manageable, and it's hard to guess whether future sinks will want redaction or not. So, unless you strongly object, I'd like to keep the redaction closer to the "sinks" for now.
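For concreteness, the sink-side shape being defended here, as a sketch assembled from the diffs later in this thread (`sparkConf` is the listener's constructor parameter):

```scala
private[spark] def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
  // "Spark Properties" entry will always exist because the map is always populated with it.
  val redactedProps = Utils.redact(sparkConf, event.environmentDetails("Spark Properties"))
  val redactedEnvironmentDetails = event.environmentDetails +
    ("Spark Properties" -> redactedProps)
  SparkListenerEnvironmentUpdate(redactedEnvironmentDetails)
}
```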
```
  test("redact sensitive information") {
    val sparkConf = new SparkConf
    sparkConf.set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "secret_password")
```
Much cleaner if you do something like:

```scala
val keys = Seq("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "spark.my.password", ...)
keys.foreach { key =>
  // test redaction for that key
}
```
Fixed.
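The reshaped UtilsSuite test ends up close to the fragments quoted below; as one self-contained sketch (the third key name is illustrative):

```scala
test("redact sensitive information") {
  val secretKeys = Seq(
    "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
    "spark.my.password",
    "spark.my.secret")
  val sparkConf = new SparkConf
  secretKeys.foreach { key => sparkConf.set(key, "sensitive_value") }
  sparkConf.set("spark.regular.property", "not_a_secret")

  val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap

  // Assert that secret information got redacted while the regular property remained the same
  secretKeys.foreach { key => assert(redactedConf(key) === Utils.REDACTION_REPLACEMENT_TEXT) }
  assert(redactedConf("spark.regular.property") === "not_a_secret")
}
```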
```
    // Make sure nothing secret shows up anywhere
    assert(!eventLog.contains(secretPassword), s"Secret password ($secretPassword) not redacted " +
      s"from event logs:\n $eventLog")
    val expected = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)""""
```
This is pretty hacky. It makes assumptions about how things are formatted in the event log.
The previous assert should be enough (ignoring my previous comment about changing this test).
```
    val regex = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"([^"]*)"""".r
    val matches = regex.findAllIn(eventLog)
    assert(matches.nonEmpty)
    matches.foreach(matched => assert(matched.equals(expected)))
```
Ignoring my previous comments, this should be `.foreach { matched => ... }`.
Fixed.

```
      " When this regex matches the property or environment variable name, its value is " +
      "redacted from the environment UI and various logs like YARN and event logs")
    .stringConf
    .createWithDefault("secret|password|SECRET|PASSWORD")
```
+1
```
    val secretPassword = "secret_password"
    val conf = getLoggingConf(testDirPath, None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
      secretPassword)
    sc = new SparkContext("local-cluster[2,2,1024]", "test", conf)
```
This is a pretty expensive way of testing this. Why not just call the redaction method and make sure it's doing the right thing?
Yeah, I wanted a little more than just a "unit" test. This was broader and checked for actual redaction taking place in the event logs, so I have it here.
I think you have a valid point though; if you think this is too expensive, the method in UtilsSuite.scala does a pretty good job at "unit testing" `redact()`, so I'd rather take this out completely. Thoughts?
Both are not doing the exact same thing. There's logic in the `EventLoggingListener` code that also needs to be tested. But you can have a more targeted unit test instead of running a distributed Spark application.
Ok, I have updated this test to be less bloated. Thanks!
1. Renaming the property to not have the word "secret" in it. 2. Making the regex case-insensitive. 3. Other minor changes
…s to update. The only pending comment left is making the test in EventLoggingListenerSuite better.
Thanks for your feedback, @vanzin and @ajbozarth.
Test build #69042 has finished for PR 15971 at commit
```
    if (redactionPattern.findFirstIn(kv._1).isDefined) {
      (kv._1, REDACTION_REPLACEMENT_TEXT)
    }
    else kv
```
Style here is wrong. Better to use the `Option` API:

```scala
regex.findFirstIn(...).map(...).getOrElse(...)
```
That's much better, thanks.
```
    val sparkConf = new SparkConf

    // Set some secret keys
    val secretKeys = Seq("" +
```
What is `"" +` accomplishing here?
Me trying to format things in my editor led to this, apologies. Fixed it.
```
    // Assert that secret information got redacted while the regular property remained the same
    secretKeys.foreach { key =>
      assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
```
`assert(conf(key) === ...)`
Fixed.
```
    secretKeys.foreach { key =>
      assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
    }
    assert(redactedConf.get("spark.regular.property").get == "not_a_secret")
```
Same as above.
Fixed.
```
    val props = event
      .environmentDetails
      .get("Spark Properties")
      .get
```
Instead of using `map.get(key).get`, just use `map(key)`.
Fixed.
```
    ConfigBuilder("spark.redaction.regex")
      .doc("Regex to decide which Spark configuration properties and environment variables in " +
        "driver and executor environments contain sensitive information. When this regex matches " +
        "a property , its value is redacted from the environment UI and various logs like YARN " +
```
nit: no space before comma.
Fixed.
```
      .doc("Regex to decide which Spark configuration properties and environment variables in " +
        "driver and executor environments contain sensitive information. When this regex matches " +
        "a property , its value is redacted from the environment UI and various logs like YARN " +
        "and event logs")
```
nit: missing period.
Fixed.
```
@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
    }
  }


```
nit: don't add this blank line.
Fixed.
Test build #69085 has finished for PR 15971 at commit
Test build #69086 has finished for PR 15971 at commit
Thanks for your review, @vanzin. I have incorporated all the suggestions; I'd appreciate it if you could take another look. And the test failures in the last run seem unrelated.
A few minor things to take care of.
```
@@ -231,6 +233,17 @@ private[spark] class EventLoggingListener(
    }
  }

  private[spark] def redactEvent(event: SparkListenerEnvironmentUpdate):
  SparkListenerEnvironmentUpdate = {
```
This is kind of an edge case, but this line needs indentation. I recommend:

```scala
private[spark] def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
```
```
      SparkListenerEnvironmentUpdate = {
    // "Spark Properties" entry will always exist because the map is always populated with it.
    val props = event
      .environmentDetails("Spark Properties")
```
nit: fits in previous line. You can even merge it with the next statement.
```
    val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
    kvs.map { kv =>
      redactionPattern.findFirstIn(kv._1)
        .map{ ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
```
nit: space after `map`
```
      redactionPattern.findFirstIn(kv._1)
          .map{ ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
          .getOrElse(kv)
    }
```
nit: indented too far
```
    val eventLogger = new EventLoggingListener("test", None, testDirPath.toUri(), conf)
    val envDetails = SparkEnv.environmentDetails(conf, "FIFO", Seq.empty, Seq.empty)
    val event = SparkListenerEnvironmentUpdate(envDetails)
    val redactedProps = eventLogger.redactEvent(event).environmentDetails("Spark Properties").toMap
```
"Spark Properties" is begging to be turned into a constant somewhere...
```
    val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap

    // Assert that secret information got redacted while the regular property remained the same
    secretKeys.foreach { key => assert(redactedConf(key) == Utils.REDACTION_REPLACEMENT_TEXT) }
```
===
```
    // Assert that secret information got redacted while the regular property remained the same
    secretKeys.foreach { key => assert(redactedConf(key) == Utils.REDACTION_REPLACEMENT_TEXT) }
    assert(redactedConf("spark.regular.property") == "not_a_secret")
```
===
```
@@ -231,6 +233,15 @@ private[spark] class EventLoggingListener(
    }
  }

  private[spark] def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
```
nit: this needs to be indented one more level...
Thanks, I'll go through the Spark style guide so I don't cause as much trouble next time.
Thanks for reviewing.
LGTM pending tests.
Test build #69093 has finished for PR 15971 at commit
Thanks for reviewing, Marcelo. Hmm, the failures are still unrelated :-(
I won't be available for the next few days. If you deem it fit, I'd appreciate it if you could commit this after the test runs stabilize. Thanks!
Jenkins, retest this please.
Test build #69091 has finished for PR 15971 at commit
Ok, looks like all is good now!
Test build #69100 has finished for PR 15971 at commit
Merging to master.
What changes were proposed in this pull request?

This patch adds a new property called `spark.secret.redactionPattern` that allows users to specify a regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches the property or environment variable name, its value is redacted from the environment UI and various logs like YARN and event logs.

This change uses this property to redact information from event logs and YARN logs. It also updates the UI code to adhere to this property instead of hardcoding the logic to decipher which properties are sensitive.

Here's an image of the UI post-redaction:

![image](https://cloud.githubusercontent.com/assets/1709451/20506215/4cc30654-b007-11e6-8aee-4cde253fba2f.png)

Here's the text in the YARN logs, post-redaction:

`HADOOP_CREDSTORE_PASSWORD -> *********(redacted)`

Here's the text in the event logs, post-redaction:

`...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)",...`

How was this patch tested?

1. Unit tests were added to ensure that redaction works.
2. A YARN job reading data off S3 was run, with confidential information (a Hadoop credential provider password) provided in the environment variables of the driver and executor. Afterwards, the logs were grepped to make sure that no mention of the secret password was present. It was also verified that the job was able to read the data off S3 correctly, confirming that the sensitive information still trickled down to the right places.
3. The event logs were checked to make sure no mention of the secret password was present.
4. The UI Environment tab was checked to make sure no secret information was being displayed.

Author: Mark Grover <mark@apache.org>

Closes apache#15971 from markgrover/master_redaction.
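A small self-check sketch in the same spirit as the tests above (the event-log line is abbreviated; the pattern mirrors the one used in EventLoggingListenerSuite):

```scala
// One (abbreviated) line of a post-redaction event log, from the example above.
val eventLogLine =
  """{"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)"}"""

// Extract every value logged for the sensitive key and confirm it is the placeholder.
val valueRegex = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"([^"]*)"""".r
for (m <- valueRegex.findAllMatchIn(eventLogLine)) {
  assert(m.group(1) == "*********(redacted)")
}
```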