[SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI #15971
Conversation
This commit adds a new property called `spark.secret.redactionPattern` that allows users to specify a regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches the property or environment variable name, its value is redacted from the environment UI and various logs like YARN and event logs.

This change uses this property to redact information from event logs and YARN logs. It also updates the UI code to adhere to this property instead of hardcoding the logic to decipher which properties are sensitive.

For testing:
1. Unit tests were added to ensure that redaction works.
2. A YARN job reading data off S3 was run, with confidential information (a Hadoop credential provider password) provided in the environment variables of the driver and executor. Afterwards, the logs were grepped to make sure that no mention of the secret password was present. It was also verified that the job was able to read the data off S3 correctly, confirming that the sensitive information still trickled down to the right places.
3. The event logs were checked to make sure no mention of the secret password was present.
4. The UI Environment tab was checked to make sure no secret information was being displayed.
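A minimal, hypothetical usage sketch of the proposed property (the name used here is the one originally proposed; it gets renamed to `spark.redaction.regex` later in this review, and the extra `TOKEN` alternative is purely illustrative):

```scala
import org.apache.spark.SparkConf

// Any property or env-var *name* matching the pattern has its value replaced
// with "*********(redacted)" in the environment UI and in YARN/event logs.
val conf = new SparkConf()
  .set("spark.secret.redactionPattern", "secret|password|SECRET|PASSWORD|TOKEN")
  .set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "hunter2") // redacted downstream
  .set("spark.app.name", "redaction-demo")                       // left intact
```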
Does this protect against doing …?
If a user has access to spark-shell (say, in a cluster secured through Kerberos), it may be reasonable for them to see the sensitive information. This, however, prevents anyone from going to the unauthenticated Spark UI and having all the creds to S3.
Test build #68970 has finished for PR 15971 at commit
This is a good upgrade over the last fix. I made a couple notes but overall I think this looks pretty good.
```
@@ -223,4 +223,13 @@ package object config {
      " bigger files.")
    .longConf
    .createWithDefault(4 * 1024 * 1024)

  private[spark] val SECRET_REDACTION_PATTERN =
    ConfigBuilder("spark.secret.redactionPattern")
```
Wouldn't this mean this pattern would get redacted, since it contains "secret"?
Good point! I think it's good for the actual regex to not be redacted. So, I will rename the property to be clearer anyway (and not have "secret" in the name) to `spark.redaction.regex`.

```
      " When this regex matches the property or environment variable name, its value is " +
      "redacted from the environment UI and various logs like YARN and event logs")
    .stringConf
    .createWithDefault("secret|password|SECRET|PASSWORD")
```
would a case-insensitive version be better?
+1
Fixed.
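For reference, an embedded `(?i)` flag is one way to get the case-insensitive behaviour without listing both cases; a sketch (the exact final default isn't quoted in this thread):

```scala
// "(?i)" enables case-insensitive matching for the rest of the pattern,
// so one alternation covers secret/SECRET, password/PASSWORD, etc.
val pattern = "(?i)secret|password".r

assert(pattern.findFirstIn("HADOOP_CREDSTORE_PASSWORD").isDefined)
assert(pattern.findFirstIn("spark.app.name").isEmpty)
```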
```
  private[spark] val SECRET_REDACTION_PATTERN =
    ConfigBuilder("spark.secret.redactionPattern")
      .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
```
nit: space after "regex"
Just call it a "regex", since the regex syntax is actually defined by the JDK libraries and not by Scala.
Fixed.
```
    ConfigBuilder("spark.secret.redactionPattern")
      .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
        "environment variables in driver and executor environments contain sensitive information." +
        " When this regex matches the property or environment variable name, its value is " +
```
"... matches a property ..."
Fixed.
```
@@ -2555,6 +2555,15 @@ private[spark] object Utils extends Logging {
      sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
    }
  }

  private[util] val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
  def redact(conf: SparkConf)(kv: (String, String)): (String, String) = {
```
nit: add empty line
Also, not sure currying is buying you anything. In fact the caller syntax becomes clearer if you don't use it.
Fixed the empty line. So, have the redact method take in a conf as a parameter and return a method?
I see them as equivalent, but don't feel strongly about it, so I will change it.
I mean the `redact` method should take 2 parameters, the config and the list of things to be redacted, instead of using currying for the second parameter.
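A sketch of the call-site difference being pointed at (collection name `kvs` is illustrative, not from the diff):

```scala
// Curried form under review: callers partially apply redact(conf) to get a
// (String, String) => (String, String) and map it over each pair.
val redactedCurried = kvs.map(Utils.redact(conf))

// Suggested uncurried form: one call over the whole list, clearer at the call site.
val redacted = Utils.redact(conf, kvs)
```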
```
  private[util] val REDACTION_REPLACEMENT_TEXT = "*********(redacted)"
  def redact(conf: SparkConf)(kv: (String, String)): (String, String) = {
    val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
```
This is very expensive. How about a version that takes a list of tuples and redacts them?
What part do you think is expensive? Going through all the configuration properties and matching them with the regex?
If so, I agree. However, that has to be done somewhere. All the callers of this function have a `SparkConf` that they want stuff redacted from. So, if this function accepts a list of tuples, they have to run the regex check to find that list first before sending it to `redact()`. So, overall, unless I am missing something, I don't think we can avoid the expense.
Compiling the regex once for every item in the list being redacted, instead of doing it once for the whole list.
Ah, good point. Let me fix this.
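A sketch of the resulting shape (two parameters, pattern compiled once for the whole list; `SECRET_REDACTION_PATTERN` and `REDACTION_REPLACEMENT_TEXT` come from the diffs above):

```scala
def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
  // Compile the regex once, instead of once per key/value pair.
  val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
  kvs.map { case (k, v) =>
    if (redactionPattern.findFirstIn(k).isDefined) (k, REDACTION_REPLACEMENT_TEXT) else (k, v)
  }
}
```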
```
@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
    }
  }


  private def redactEvent(event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
```
Any reason why you chose this instead of redacting at the source of `SparkListenerEnvironmentUpdate`?
Good question. I thought about this quite a bit when making this change. I should have posted that decision in the PR description; apologies for not providing that context. The way I see it, there are three places where redaction can be done:
- Right at the source of `SparkListenerEnvironmentUpdate` (here, in SparkContext.scala).
- In JsonProtocol.scala, when converting the event to JSON.
- In EventLoggingListener.scala (where it is right now), when the event is being persisted to disk.

A user could write a custom listener that listened to the environment updates and did something useful with them, and I didn't want to impose redaction on them. They could be using the updates to create a clone of their environment, for example, and may need the same sensitive properties. So, I ruled out 1.
And JsonProtocol seemed like a generic utility to convert events to JSON. While I could be selective about only redacting events of `SparkListenerEnvironmentUpdate` type, I didn't want to assume that everyone translating the environment update to JSON should only be seeing redacted configuration. That made me rule out 2.
I decided that it was best to redact "closest to disk", which made me put the redaction code where I did, in EventLoggingListener. Hope that makes sense; happy to hear your thoughts if you think otherwise.
The downside is that with this choice you have more complicated tests, and you have to do redaction at every place where this information might be written (which at the moment is just two: the event logger and the UI).
JsonProtocol is not really a choice because it doesn't cover the UI.

> A user could write a custom listener that listened to the environment updates and did something useful with them. And, I didn't want to impose redaction on them.

That's the only argument I can buy, and said user has different ways to get that information.
> That's the only argument I can buy, and said user has different ways to get that information.

True, `sparkConf.getAll` is always available.
I still think it's better to put the redaction closer to the "sinks" for now. The good thing, though, is that if we see the number of "sinks" increasing, and everyone wanting redaction, we can move redaction further upstream. For now, two sinks seem manageable, and it's hard to guess whether future sinks will want redaction or not. So, unless you strongly object, I'd like to keep the redaction closer to the "sinks" for now.
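For concreteness, the sink-side shape being defended here, as a sketch assembled from the diffs later in this thread (`sparkConf` is the listener's constructor parameter):

```scala
private[spark] def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
  // "Spark Properties" entry will always exist because the map is always populated with it.
  val redactedProps = Utils.redact(sparkConf, event.environmentDetails("Spark Properties"))
  val redactedEnvironmentDetails = event.environmentDetails +
    ("Spark Properties" -> redactedProps)
  SparkListenerEnvironmentUpdate(redactedEnvironmentDetails)
}
```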
```
  test("redact sensitive information") {
    val sparkConf = new SparkConf
    sparkConf.set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "secret_password")
```
Much cleaner if you do something like:

```scala
val keys = Seq("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "spark.my.password", ...)
keys.foreach { key =>
  // test redaction for that key
}
```
Fixed.
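The reshaped UtilsSuite test ends up close to the fragments quoted below; as one self-contained sketch (the third key name is illustrative):

```scala
test("redact sensitive information") {
  val secretKeys = Seq(
    "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
    "spark.my.password",
    "spark.my.secret")
  val sparkConf = new SparkConf
  secretKeys.foreach { key => sparkConf.set(key, "sensitive_value") }
  sparkConf.set("spark.regular.property", "not_a_secret")

  val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap

  // Assert that secret information got redacted while the regular property remained the same
  secretKeys.foreach { key => assert(redactedConf(key) === Utils.REDACTION_REPLACEMENT_TEXT) }
  assert(redactedConf("spark.regular.property") === "not_a_secret")
}
```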
```
    // Make sure nothing secret shows up anywhere
    assert(!eventLog.contains(secretPassword), s"Secret password ($secretPassword) not redacted " +
      s"from event logs:\n $eventLog")
    val expected = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)""""
```
This is pretty hacky. It makes assumptions about how things are formatted in the event log.
The previous assert should be enough (ignoring my previous comment about changing this test).
```
    val regex = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"([^"]*)"""".r
    val matches = regex.findAllIn(eventLog)
    assert(matches.nonEmpty)
    matches.foreach(matched => assert(matched.equals(expected)))
```
Ignoring my previous comments, this should be `.foreach { matched => ... }`.
Fixed.

```
      " When this regex matches the property or environment variable name, its value is " +
      "redacted from the environment UI and various logs like YARN and event logs")
    .stringConf
    .createWithDefault("secret|password|SECRET|PASSWORD")
```
+1
```
    val secretPassword = "secret_password"
    val conf = getLoggingConf(testDirPath, None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
      secretPassword)
    sc = new SparkContext("local-cluster[2,2,1024]", "test", conf)
```
This is a pretty expensive way of testing this. Why not just call the redaction method and make sure it's doing the right thing?
Yeah, I wanted a little more than just a "unit" test. This was broader and checked for actual redaction taking place in the event logs, so I have it here.
I think you have a valid point though; if you think this is too expensive, the method in UtilsSuite.scala does a pretty good job at "unit testing" `redact()`, so I'd rather take this out completely. Thoughts?
Both are not doing the exact same thing. There's logic in the `EventLoggingListener` code that also needs to be tested. But you can have a more targeted unit test instead of running a distributed Spark application.
Ok, I have updated this test to be less bloated. Thanks!
1. Renaming the property to not have the word "secret" in it. 2. Making the regex case-insensitive. 3. Other minor changes
…s to update. The only pending comment left is making the test in EventLoggingListenerSuite better.
Thanks for your feedback, @vanzin and @ajbozarth.
Test build #69042 has finished for PR 15971 at commit
```
    if (redactionPattern.findFirstIn(kv._1).isDefined) {
      (kv._1, REDACTION_REPLACEMENT_TEXT)
    }
    else kv
```
Style here is wrong. Better to use the `Option` API:

```scala
regex.findFirstIn(...).map(...).getOrElse(...)
```
That's much better, thanks.
```
    val sparkConf = new SparkConf

    // Set some secret keys
    val secretKeys = Seq("" +
```
What is `"" +` accomplishing here?
Me trying to format things in my editor led to this, apologies. Fixed it.
```
    // Assert that secret information got redacted while the regular property remained the same
    secretKeys.foreach { key =>
      assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
```
`assert(conf(key) === ...)`
Fixed.
```
    secretKeys.foreach { key =>
      assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
    }
    assert(redactedConf.get("spark.regular.property").get == "not_a_secret")
```
Same as above.
Fixed.
```
    val props = event
      .environmentDetails
      .get("Spark Properties")
      .get
```
Instead of using `map.get(key).get`, just use `map(key)`.
Fixed.
```
    ConfigBuilder("spark.redaction.regex")
      .doc("Regex to decide which Spark configuration properties and environment variables in " +
        "driver and executor environments contain sensitive information. When this regex matches " +
        "a property , its value is redacted from the environment UI and various logs like YARN " +
```
nit: no space before comma.
Fixed.
```
      .doc("Regex to decide which Spark configuration properties and environment variables in " +
        "driver and executor environments contain sensitive information. When this regex matches " +
        "a property , its value is redacted from the environment UI and various logs like YARN " +
        "and event logs")
```
nit: missing period.
Fixed.
```
@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
    }
  }


```
nit: don't add this blank line.
Fixed.
Test build #69085 has finished for PR 15971 at commit
Test build #69086 has finished for PR 15971 at commit
Thanks for your review, @vanzin. I have incorporated all the suggestions; I'd appreciate it if you could take another look. And the test failures in the last run seem unrelated.
A few minor things to take care of.
```
@@ -231,6 +233,17 @@ private[spark] class EventLoggingListener(
    }
  }

  private[spark] def redactEvent(event: SparkListenerEnvironmentUpdate):
  SparkListenerEnvironmentUpdate = {
```
This is kind of an edge case, but this line needs indentation. I recommend:

```scala
private[spark] def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
```
```
      SparkListenerEnvironmentUpdate = {
    // "Spark Properties" entry will always exist because the map is always populated with it.
    val props = event
      .environmentDetails("Spark Properties")
```
nit: fits in previous line. You can even merge it with the next statement.
```
    val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
    kvs.map { kv =>
      redactionPattern.findFirstIn(kv._1)
        .map{ ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
```
nit: space after `map`
```
      redactionPattern.findFirstIn(kv._1)
          .map{ ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
          .getOrElse(kv)
    }
```
nit: indented too far
```
    val eventLogger = new EventLoggingListener("test", None, testDirPath.toUri(), conf)
    val envDetails = SparkEnv.environmentDetails(conf, "FIFO", Seq.empty, Seq.empty)
    val event = SparkListenerEnvironmentUpdate(envDetails)
    val redactedProps = eventLogger.redactEvent(event).environmentDetails("Spark Properties").toMap
```
"Spark Properties" is begging to be turned into a constant somewhere...
```
    val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap

    // Assert that secret information got redacted while the regular property remained the same
    secretKeys.foreach { key => assert(redactedConf(key) == Utils.REDACTION_REPLACEMENT_TEXT) }
```
===
```
    // Assert that secret information got redacted while the regular property remained the same
    secretKeys.foreach { key => assert(redactedConf(key) == Utils.REDACTION_REPLACEMENT_TEXT) }
    assert(redactedConf("spark.regular.property") == "not_a_secret")
```
===
```
@@ -231,6 +233,15 @@ private[spark] class EventLoggingListener(
    }
  }

  private[spark] def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
```
nit: this needs to be indented one more level...
Thanks, I'll go through the Spark style guide so I don't cause as much trouble next time.
Thanks for reviewing.
LGTM pending tests.
Test build #69093 has finished for PR 15971 at commit
Thanks for reviewing, Marcelo. Hmm, the failures are still unrelated :-(
I won't be available for the next few days. If you deem it fit, I'd appreciate it if you could commit this after the test runs stabilize. Thanks!
Jenkins, retest this please.
Test build #69091 has finished for PR 15971 at commit
Ok, looks like all is good now!
Test build #69100 has finished for PR 15971 at commit
Merging to master.
What changes were proposed in this pull request?

This patch adds a new property called `spark.secret.redactionPattern` that allows users to specify a regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches the property or environment variable name, its value is redacted from the environment UI and various logs like YARN and event logs.

This change uses this property to redact information from event logs and YARN logs. It also updates the UI code to adhere to this property instead of hardcoding the logic to decipher which properties are sensitive.

Here's an image of the UI post-redaction:

![image](https://cloud.githubusercontent.com/assets/1709451/20506215/4cc30654-b007-11e6-8aee-4cde253fba2f.png)

Here's the text in the YARN logs, post-redaction:

`HADOOP_CREDSTORE_PASSWORD -> *********(redacted)`

Here's the text in the event logs, post-redaction:

`...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)",...`

How was this patch tested?

1. Unit tests were added to ensure that redaction works.
2. A YARN job reading data off S3 was run, with confidential information (a Hadoop credential provider password) provided in the environment variables of the driver and executor. Afterwards, the logs were grepped to make sure that no mention of the secret password was present. It was also verified that the job was able to read the data off S3 correctly, confirming that the sensitive information still trickled down to the right places.
3. The event logs were checked to make sure no mention of the secret password was present.
4. The UI Environment tab was checked to make sure no secret information was being displayed.

Author: Mark Grover <mark@apache.org>

Closes apache#15971 from markgrover/master_redaction.
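A small self-check sketch in the same spirit as the tests above (the event-log line is abbreviated; the pattern mirrors the one used in EventLoggingListenerSuite):

```scala
// One (abbreviated) line of a post-redaction event log, from the example above.
val eventLogLine =
  """{"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*********(redacted)"}"""

// Extract every value logged for the sensitive key and confirm it is the placeholder.
val valueRegex = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"([^"]*)"""".r
for (m <- valueRegex.findAllMatchIn(eventLogLine)) {
  assert(m.group(1) == "*********(redacted)")
}
```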