
[SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI #15971

Closed
wants to merge 8 commits into master from markgrover/master_redaction

Conversation

@markgrover (Member)

What changes were proposed in this pull request?

This patch adds a new property, `spark.secret.redactionPattern`, that allows
users to specify a regex to decide which Spark configuration properties and
environment variables in the driver and executor environments contain sensitive
information. When this regex matches a property or environment variable name,
its value is redacted from the environment UI and from various logs, such as
the YARN and event logs.

This change uses the property to redact information from event logs and YARN
logs. It also updates the UI code to honor this property instead of hardcoding
the logic that decides which properties are sensitive.

Here's an image of the UI post-redaction:
![image](https://cloud.githubusercontent.com/assets/1709451/20506215/4cc30654-b007-11e6-8aee-4cde253fba2f.png)

Here's the text in the YARN logs, post-redaction:
``HADOOP_CREDSTORE_PASSWORD -> *********(redacted)``

Here's the text in the event logs, post-redaction:
``...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)",...``

How was this patch tested?

  1. Unit tests were added to ensure that redaction works.
  2. A YARN job was run that read data off of S3, with confidential information
    (a Hadoop credential provider password) provided in the environment
    variables of the driver and executor. Afterwards, the logs were grepped to
    make sure that no mention of the secret password was present. It was also
    verified that the job was able to read the data off of S3 correctly,
    confirming that the sensitive information still reached the places that
    needed it to read the data.
  3. The event logs were checked to make sure no mention of the secret password
    was present.
  4. The UI environment tab was checked to make sure no secret information was
    being displayed.

@tejasapatil (Contributor)

Does this protect against doing `sc.getConf.getAll.foreach(println)` in spark-shell?

@markgrover (Member Author)

> Does this protect against doing `sc.getConf.getAll.foreach(println)` in spark-shell?

No, it doesn't. The goal of this JIRA is the first step: to stop printing sensitive information all over the place.

If a user has access to spark-shell (say, in a cluster secured through Kerberos), it may be reasonable for them to see the sensitive information.

This, however, prevents anyone from going to the unauthenticated Spark UI and walking away with all the credentials to S3.

@SparkQA commented Nov 22, 2016

Test build #68970 has finished for PR 15971 at commit 5dd3630.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajbozarth left a comment (Member)

This is a good upgrade over the last fix. I made a couple of notes, but overall I think this looks pretty good.

@@ -223,4 +223,13 @@ package object config {
        " bigger files.")
      .longConf
      .createWithDefault(4 * 1024 * 1024)

  private[spark] val SECRET_REDACTION_PATTERN =
    ConfigBuilder("spark.secret.redactionPattern")
Member:

Wouldn't this mean this pattern would get redacted, since it contains "secret"?

Member Author:

Good point! I think it's good for the actual regex not to be redacted. So, I will rename the property to be clearer anyway (and not have "secret" in the name), to `spark.redaction.regex`.

" When this regex matches the property or environment variable name, its value is " +
"redacted from the environment UI and various logs like YARN and event logs")
.stringConf
.createWithDefault("secret|password|SECRET|PASSWORD")
Member:

Would a case-insensitive version be better?

Contributor:

+1

Member Author:

Fixed.
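
As an aside, the case-insensitive default can be expressed with the JDK regex inline flag rather than by enumerating both cases; a small self-contained sketch:

```scala
// "(?i)" turns on case-insensitive matching for the rest of the pattern,
// so a single alternation covers secret, SECRET, Password, PASSWORD, ...
val pattern = "(?i)secret|password".r

assert(pattern.findFirstIn("HADOOP_CREDSTORE_PASSWORD").isDefined)
assert(pattern.findFirstIn("spark.my.secret").isDefined)
assert(pattern.findFirstIn("spark.app.name").isEmpty)
```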


private[spark] val SECRET_REDACTION_PATTERN =
ConfigBuilder("spark.secret.redactionPattern")
.doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
Contributor:

nit: space after "regex".

Just call it a "regex", since the regex syntax is actually defined by the JDK libraries and not by Scala.

Member Author:

Fixed.

ConfigBuilder("spark.secret.redactionPattern")
.doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
"environment variables in driver and executor environments contain sensitive information." +
" When this regex matches the property or environment variable name, its value is " +
Contributor:

"... matches a property ..."

Member Author:

Fixed.

@@ -2555,6 +2555,15 @@ private[spark] object Utils extends Logging {
      sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
    }
  }

  private[util] val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
  def redact(conf: SparkConf)(kv: (String, String)): (String, String) = {
Contributor:

nit: add an empty line.

Also, not sure currying is buying you anything. In fact, the caller syntax becomes clearer if you don't use it.

Member Author:

Fixed the empty line. So, have the redact method take in a conf as a parameter and return a method? I see them as equivalent, but I don't feel strongly about it, so I will change it.

Contributor:

I mean the redact method should take two parameters, the config and the list of things to be redacted, instead of using currying for the second parameter.
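
Roughly, the call-site difference looks like this (a sketch assuming the two signatures under discussion, not the final code):

```scala
// Curried form: the per-pair function is partially applied at each call site.
val redacted1 = conf.getAll.map(Utils.redact(conf))

// Two-parameter form: one plain call over the whole list.
val redacted2 = Utils.redact(conf, conf.getAll)
```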


private[util] val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
def redact(conf: SparkConf)(kv: (String, String)): (String, String) = {
val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
Contributor:

This is very expensive. How about a version that takes a list of tuples and redacts them?

Member Author:

What part do you think is expensive? Going through all the configuration properties and matching them against the regex?

If so, I agree. However, that has to be done somewhere. All the callers of this function have a SparkConf that they want stuff redacted from, so if this function accepts a list of tuples, they have to run the regex check to find that list before sending it to redact(). So overall, unless I am missing something, I don't think we can avoid the expense.

Contributor:

Compiling the regex once for every item in the list being redacted, instead of doing it once for the whole list.
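
In code terms, the fix is to hoist the compilation out of the loop; a sketch reusing the identifiers from the hunk above:

```scala
// Before: the regex is recompiled for every pair being checked.
def redactSlow(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] =
  kvs.map { kv =>
    val pattern = conf.get(SECRET_REDACTION_PATTERN).r  // compiled on each iteration
    if (pattern.findFirstIn(kv._1).isDefined) (kv._1, REDACTION_REPLACEMENT_TEXT) else kv
  }

// After: compile once, reuse across the whole list.
def redactFast(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
  val pattern = conf.get(SECRET_REDACTION_PATTERN).r
  kvs.map { kv =>
    if (pattern.findFirstIn(kv._1).isDefined) (kv._1, REDACTION_REPLACEMENT_TEXT) else kv
  }
}
```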

Member Author:

Ah, good point. Let me fix this.

@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
}
}


private def redactEvent(event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
Contributor:

Any reason why you chose this instead of redacting at the source of SparkListenerEnvironmentUpdate?

Member Author:

Good question. I thought about this quite a bit when making this change; I should have included the reasoning in the PR description, apologies for not providing that context. The way I see it, there are three places where redaction can be done:

  1. Right at the source of SparkListenerEnvironmentUpdate (here, in SparkContext.scala).
  2. In JsonProtocol.scala, when converting the event to JSON.
  3. In EventLoggingListener.scala (where it is right now), when the event is persisted to disk.

A user could write a custom listener that listened to the environment updates and did something useful with them, and I didn't want to impose redaction on them. They could be using the updates to create a clone of their environment, for example, and may need those same sensitive properties. So I ruled out 1.

And JsonProtocol seemed like a generic utility to convert events to JSON. While I could be selective about only redacting events of the SparkListenerEnvironmentUpdate type, I didn't want to assume that everyone translating the environment update to JSON should see only redacted configuration. That made me rule out 2.

I decided it was best to redact "closest to disk", which is why I put the redaction code where I did, in EventLoggingListener. Hope that makes sense; happy to hear your thoughts if you think otherwise.

Contributor:

The downside is that with this choice you have more complicated tests, and you have to do redaction at every place where this information might be written (which at the moment is just two: the event logger and the UI).

JsonProtocol is not really a choice because it doesn't cover the UI.

> A user could write a custom listener that listened to the environment updates and did something useful with them. And, I didn't want to impose redaction on them.

That's the only argument I can buy, and said user has different ways to get that information.

Member Author:

> That's the only argument I can buy, and said user has different ways to get that information.

True, `sparkConf.getAll` is always available.

I still think it's better to put the redaction closer to the "sinks" for now. The good thing, though, is that if we see the number of sinks increasing, and everyone wanting redaction, we can move redaction further upstream. For now, two sinks seem manageable, and it's hard to guess whether future sinks will want redaction or not. So, unless you strongly object, I'd like to keep the redaction closer to the sinks for now.


test("redact sensitive information") {
val sparkConf = new SparkConf
sparkConf.set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "secret_password")
Contributor:

Much cleaner if you do something like:

val keys = Seq("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "spark.my.password", ...)
keys.foreach { key =>
  // test redaction for that key
}

Member Author:

Fixed.

// Make sure nothing secret shows up anywhere
assert(!eventLog.contains(secretPassword), s"Secret password ($secretPassword) not redacted " +
s"from event logs:\n $eventLog")
val expected = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)""""
@vanzin commented Nov 22, 2016 (Contributor)

This is pretty hacky. It makes assumptions about how things are formatted in the event log.

The previous assert should be enough (ignoring my previous comment about changing this test).

val regex = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"([^"]*)"""".r
val matches = regex.findAllIn(eventLog)
assert(matches.nonEmpty)
matches.foreach(matched => assert(matched.equals(expected)))
Contributor:

Ignoring my previous comments, this should be `.foreach { matched => ... }`.

Member Author:

Fixed.

" When this regex matches the property or environment variable name, its value is " +
"redacted from the environment UI and various logs like YARN and event logs")
.stringConf
.createWithDefault("secret|password|SECRET|PASSWORD")
Contributor:

+1

val secretPassword = "secret_password"
val conf = getLoggingConf(testDirPath, None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
secretPassword)
sc = new SparkContext("local-cluster[2,2,1024]", "test", conf)
Contributor:

This is a pretty expensive way of testing this. Why not just call the redaction method and make sure it's doing the right thing?

Member Author:

Yeah, I wanted a little more than just a "unit" test. This was broader and checked for actual redaction taking place in the event logs, so I have it here.

I think you have a valid point, though. If you think this is too expensive, the method in UtilsSuite.scala does a pretty good job of "unit testing" redact(), so I'd rather take this out completely. Thoughts?

Contributor:

The two tests are not doing exactly the same thing. There's logic in the EventLoggingListener code that also needs to be tested. But you can have a more targeted unit test instead of running a distributed Spark application.

Member Author:

Ok, I have updated this test to be less bloated. Thanks!

1. Renaming the property to not have the word "secret" in it.
2. Making the regex case-insensitive.
3. Other minor changes
…s to update.

The only pending comment left is making the test in EventLoggingListenerSuite better.
@markgrover (Member Author)

Thanks for your feedback, @vanzin and @ajbozarth.
I have updated the PR with your suggestions. The only one that's pending is @vanzin's suggestion of improving the EventLoggingListenerSuite test. Let me think about that one for a bit before I update it.

@SparkQA commented Nov 23, 2016

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (redactionPattern.findFirstIn(kv._1).isDefined) {
(kv._1, REDACTION_REPLACEMENT_TEXT)
}
else kv
Contributor:

The style here is wrong. Better to use the Option API:

regex.findFirstIn(...).map(...).getOrElse(...)

Member Author:

That's much better, thanks.

val sparkConf = new SparkConf

// Set some secret keys
val secretKeys = Seq("" +
Contributor:

What is `"" +` accomplishing here?

Member Author:

Me trying to format things in my editor led to this, apologies. Fixed it.


// Assert that secret information got redacted while the regular property remained the same
secretKeys.foreach { key =>
assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
Contributor:

assert(conf(key) === ...)

Member Author:

Fixed.

secretKeys.foreach { key =>
assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
}
assert(redactedConf.get("spark.regular.property").get == "not_a_secret")
Contributor:

Same as above.

Member Author:

Fixed.

val props = event
.environmentDetails
.get("Spark Properties")
.get
Contributor:

Instead of using map.get(key).get, just use map(key).
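
A tiny illustration of the difference; both forms throw NoSuchElementException on a missing key, but `apply` is the idiomatic spelling:

```scala
val details = Map("Spark Properties" -> Seq("spark.app.name" -> "test"))

val props1 = details.get("Spark Properties").get  // Option round-trip, discouraged
val props2 = details("Spark Properties")          // Map.apply, preferred
```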

Member Author:

Fixed.

ConfigBuilder("spark.redaction.regex")
.doc("Regex to decide which Spark configuration properties and environment variables in " +
"driver and executor environments contain sensitive information. When this regex matches " +
"a property , its value is redacted from the environment UI and various logs like YARN " +
Contributor:

nit: no space before the comma.

Member Author:

Fixed.

.doc("Regex to decide which Spark configuration properties and environment variables in " +
"driver and executor environments contain sensitive information. When this regex matches " +
"a property , its value is redacted from the environment UI and various logs like YARN " +
"and event logs")
Contributor:

nit: missing period.

Member Author:

Fixed.

@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
}
}


Contributor:

nit: don't add this blank line.

Member Author:

Fixed.

@SparkQA commented Nov 23, 2016

Test build #69085 has finished for PR 15971 at commit 84a7ef3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 23, 2016

Test build #69086 has finished for PR 15971 at commit 549881b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markgrover (Member Author)

Thanks for your review, @vanzin. I have incorporated all the suggestions; I'd appreciate it if you could take another look.

And the test failures in the last run seem unrelated.

@vanzin left a comment (Contributor)

A few minor things to take care of.

@@ -231,6 +233,17 @@ private[spark] class EventLoggingListener(
}
}

private[spark] def redactEvent(event: SparkListenerEnvironmentUpdate):
SparkListenerEnvironmentUpdate = {
Contributor:

This is kind of an edge case, but this line needs indentation. I recommend:

private[spark] def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {

SparkListenerEnvironmentUpdate = {
// "Spark Properties" entry will always exist because the map is always populated with it.
val props = event
.environmentDetails("Spark Properties")
Contributor:

nit: this fits in the previous line. You can even merge it with the next statement.

val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
kvs.map { kv =>
redactionPattern.findFirstIn(kv._1)
.map{ ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
Contributor:

nit: space after map.

redactionPattern.findFirstIn(kv._1)
.map{ ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
.getOrElse(kv)
}
Contributor:

nit: indented too far.

val eventLogger = new EventLoggingListener("test", None, testDirPath.toUri(), conf)
val envDetails = SparkEnv.environmentDetails(conf, "FIFO", Seq.empty, Seq.empty)
val event = SparkListenerEnvironmentUpdate(envDetails)
val redactedProps = eventLogger.redactEvent(event).environmentDetails("Spark Properties").toMap
Contributor:

"Spark Properties" is begging to be turned into a constant somewhere...

val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap

// Assert that secret information got redacted while the regular property remained the same
secretKeys.foreach { key => assert(redactedConf(key) == Utils.REDACTION_REPLACEMENT_TEXT) }
Contributor:

===


// Assert that secret information got redacted while the regular property remained the same
secretKeys.foreach { key => assert(redactedConf(key) == Utils.REDACTION_REPLACEMENT_TEXT) }
assert(redactedConf("spark.regular.property") == "not_a_secret")
Contributor:

===

@@ -231,6 +233,15 @@ private[spark] class EventLoggingListener(
}
}

private[spark] def redactEvent(
event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
Contributor:

nit: this needs to be indented one more level...

Member Author:

Thanks, I'll go through the Spark style guide so I don't cause as much trouble next time.

Thanks for reviewing.

@vanzin (Contributor) commented Nov 23, 2016

LGTM pending tests.

@SparkQA commented Nov 24, 2016

Test build #69093 has finished for PR 15971 at commit 49015ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markgrover (Member Author)

Thanks for reviewing, Marcelo.

Hmm, the failures are still unrelated :-(

[info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.jdbc.JDBCWriteSuite *** ABORTED *** (1 second, 791 milliseconds)
[info]   org.h2.jdbc.JdbcSQLException: Schema "TEST" already exists; SQL statement:
[info] create schema test [90078-183]
[info]   at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
[info]   at org.h2.message.DbException.get(DbException.java:179)
[info]   at org.h2.message.DbException.get(DbException.java:155)
[info]   at org.h2.command.ddl.CreateSchema.update(CreateSchema.java:48)
[info]   at org.h2.command.CommandContainer.update(CommandContainer.java:78)
[info]   at org.h2.command.Command.executeUpdate(Command.java:254)
[info]   at org.h2.jdbc.JdbcPreparedStatement.executeUpdateInternal(JdbcPreparedStatement.java:157)
[info]   at org.h2.jdbc.JdbcPreparedStatement.executeUpdate(JdbcPreparedStatement.java:143)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite$$anonfun$29.apply(JDBCWriteSuite.scala:56)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite$$anonfun$29.apply(JDBCWriteSuite.scala:53)
[info]   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.runTest(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:381)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
[info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
[info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
[info]   at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[info]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.org$scalatest$BeforeAndAfter$$super$run(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.run(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]   at java.lang.Thread.run(Thread.java:745)

I won't be available for the next few days. If you deem it fit, I'd appreciate it if you could commit this after the test runs stabilize. Thanks!

@markgrover (Member Author)

Jenkins, retest this please.

@SparkQA commented Nov 24, 2016

Test build #69091 has finished for PR 15971 at commit 61a961c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markgrover (Member Author)

Ok, looks like all is good now!

@SparkQA commented Nov 24, 2016

Test build #69100 has finished for PR 15971 at commit 49015ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Nov 28, 2016

Merging to master.

@asfgit closed this in 237c3b9 on Nov 28, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request on Dec 2, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request on Jan 27, 2017
@markgrover deleted the master_redaction branch on February 21, 2017
dmvieira pushed a commit to dmvieira/spark that referenced this pull request on Aug 1, 2017