[SPARK-23850][sql] Add separate config for SQL options redaction. #21158

vanzin · 2018-04-25T17:36:24Z

The old code was relying on a core configuration and extended its
default value to include things that redact desired things in the
app's environment. Instead, add a SQL-specific option for which
options to redact, and apply both the core and SQL-specific rules
when redacting the options in the save command.

This is a little sub-optimal since it adds another config, but it
retains the current default behavior.

While there I also fixed a typo and a couple of minor config API
usage issues in the related redaction option that SQL already had.

Tested with existing unit tests, plus checking the env page on
a shell UI.

The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the "sistes" redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI.

gatorsmile · 2018-04-25T17:53:45Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

@@ -342,7 +342,7 @@ package object config {
        "a property key or value, the value is redacted from the environment UI and various logs " +
        "like YARN and event logs.")
      .regexConf
-      .createWithDefault("(?i)secret|password|url|user|username".r)
+      .createWithDefault("(?i)secret|password".r)


If the users want to see the values, they can change the default by themselves, right?

Not sure what you're trying to imply? The default is redacting too much. That needs to be fixed.

Normally, making it stricter is not a big deal. However, if we want to relax it, this could be a big deal since users might expect these variables have been hided by default. cc the original reviewers and author @onursatici @ash211 @hvanhovell @jiangxb1987

URIs could contain the access keys and usernames could have potentially personal identifiable information. Could we do not change the default?

gatorsmile · 2018-04-25T17:55:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala

@@ -225,7 +225,7 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) {
   * Redact the sensitive information in the given string.
   */
  private def withRedaction(message: String): String = {
-    Utils.redact(sparkSession.sessionState.conf.stringRedationPattern, message)
+    Utils.redact(sparkSession.sessionState.conf.stringRedactionPattern, message)


Thanks for fixing this typo

SparkQA · 2018-04-25T22:01:25Z

Test build #89855 has finished for PR 21158 at commit deb299b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tgravescs · 2018-04-26T13:48:46Z

Looks good. I didn't have a jdbc connection handy to test with.

I was looking for documentation on these sql confs and I don't see them in the SET -v output, did we decide not to document these? I haven't looked to see how that works.

vanzin · 2018-04-26T15:33:38Z

I was looking for documentation on these sql confs

That's part of this fix, if you're talking about the old config. Since it was using ConfigBuilder directly instead of buildConf, it did not make it to the output of "SET -v".

If that was intentional then probably that part should be reverted - don't see why it should be hidden, but maybe @gatorsmile had a reason.

vanzin · 2018-04-30T19:31:14Z

Ping @gatorsmile

gatorsmile · 2018-04-30T23:57:16Z

The confs in the core modules are not part of the outputs of SQL statement SET -v, which only outputs the confs in Spark SQL.

We are making a behavior change here. I am not confident about the changes in this PR. I would like to hear more options from the others.

vanzin · 2018-05-01T02:18:18Z

We are making a behavior change here.

We are not. That's the whole reason why I'm adding the SQL-specific option to extend the behavior of the core options.

If I just wanted to revert the original behavior change, which added more unwanted default redactions to the core options, I'd just have changed the default value of the core option back to what it was.

gatorsmile · 2018-05-01T04:05:44Z

Is the default (?i)secret|password|url|user|username".r merged to the released branches? If the default is not in the previously release, I am fine to change it back.

tgravescs · 2018-05-01T13:05:59Z

the original change was merged into 2.3.0, see https://issues.apache.org/jira/browse/SPARK-22479. We are keeping the behavior of that fix and that is to not show the credentials of the jdbc connection. The url and user name were added for some reason in that jira as well and they shouldn't have been. There is nothing sensitive about these. I don't see this being a breaking compatibility issue.

vanzin · 2018-05-01T15:00:22Z

But the main thing is that the change modified the previous behavior which was not to redact these things in the previous places where redaction occurred. So, in a way, this change is fixing a regression in 2.3.0 (which was also added to 2.2.1 according to the bug), and still providing the same feature from SPARK-22479.

vanzin · 2018-05-01T16:01:54Z

Also note that Tom's comment above is not 100% accurate - this change does redact the JDBC URL on the SQL side, for reasons I described in my comments on the bug. That's the whole reason why the SQL changes are even needed.

vanzin · 2018-05-03T18:38:11Z

@gatorsmile is there anything else you're looking for here?

gatorsmile · 2018-05-04T04:11:51Z

The default value introduced by SPARK-22479 was already part of Apache Spark 2.3. It is hard to say this is a regression, since URIs could contain the access keys and usernames could have potentially personal identifiable information. Thus, it also makes sense to block them. Although the original JIRA is for JDBC, we also make it more secure. If users do not want to redact them, they can change the default. I am not feeling comfortable to not redact them in the current situation, since we already did it in Spark 2.3. All the security issues could be serious. Thus, I think we should make it unchanged.

tgravescs · 2018-05-04T13:42:07Z

I disagree on this. I find it a regression that you blocked url's and user names in the first place. Showing these things have been a feature in spark for a long time. The only security issue here is with jdbc driver, block that instance. Are there any other places spark itself prints or uses a url with a password? If so we should fix those.

If not then it would be specified in a user config and if they are doing that, they should change the config to add it to be redacted. I think the current behavior is purely worse from a user perspective. There are so many urls that have no security implications and us blocking them by default I think is wrong.

tgravescs · 2018-05-04T14:02:32Z

Also, Spark by default is shipped with not secure settings. Meaning spark.acls.enable is false and spark.authenticate is false. I see no reason to make the redact configs more strict then our defaults for those (Note I'm not arguing we shouldn't redact credentials where Spark itself is showing). If a user turns on the acls, then by default only the user who submitted the job can see the UI.

Going back to https://issues.apache.org/jira/browse/SPARK-22479 it looks like the user can see the url via the console, logs and I assume in the UI. Is there somewhere else someone can see this information? I want to make sure I understand the vulnerability here.

If you enable security properly on spark no user should have access to those without being given permission. If you are not running with acls and authentication on then I would argue there are a lot of attack vectors to where I could run things as another user anyway.

vanzin · 2018-05-04T16:49:07Z

@gatorsmile seriously, why are you focusing on the fact that I'm changing that default value when I wrote a bunch of code to actually maintain the URL redaction on the SQL plans, which was the original intent? Did you even read the PR at all?

vanzin · 2018-05-04T18:32:38Z

usernames could have potentially personal identifiable information

User names are not sensitive information. There's a ton of places where you can see them outside of the Spark UI (have you tried running ls on you machine lately?).

gatorsmile · 2018-05-04T19:44:44Z

@vanzin That really depends on how the users deploy Spark. Not all the Spark end users are allowed to access the command lines and run ls, but they are allowed to read Spark UI in almost all the cases.

@tgravescs @vanzin This PR tries to not affect the original scenario explained in the JIRA SPARK-22479. I would say thank you @vanzin for your efforts.

The PR to fix SPARK-22479 also impacts more cases (e.g., in the environment page) too. I can understand your arguments. In your scenarios, it would be nice to expose all these info to the external users by default, as what we did in the past. Since the new behaviors have already been introduced in the last release, it would be better to keep them unchanged, if possible, based on the principle of least surprise.

The lessons we learned is to document the behavior change in the migration guide. When users try the RC branches, the users could easily catch the changes. Unfortunately, we did not do it in that PR. Thus, it becomes a surprise to @tgravescs after we already released Spark 2.3. We need to avoid similar cases. We need to enforce every PR to document the behavior changes, no matter whether it is a regression fix or a bug fix.

I am also fine to merge this if the other reviewers of the previous PR think we should change the behavior again. cc @onursatici @ash211 @hvanhovell @jiangxb1987

In the PR, please document the behavior changes. Hopefully, we will not see another PR to change the default value again after we release Spark 2.4. cc @vanzin @tgravescs WDYT?

vanzin · 2018-05-04T19:56:36Z

Not all the Spark end users are allowed to access the command lines and run ls

Come on. That is just an example. You surely are smart enough to think of another example where you can see other people's user names outside of Spark.

it would be better to keep them unchanged, if possible

I disagree. Can you list all the Spark configuration that would be redacted by the conf in 2.3? I could only find two:

spark.driver.userClassPathFirsrt
spark.executor.userClassPathFirst

And those are a great example of why the original change is a regression. It's wrong to redact those. It makes debugging things harder.

And yet there are a ton of config keys where you can provide URLs, the thing you seem really worried about, which are not caught by that regex. Let's see... spark.master, spark.jars, spark.files, spark.yarn.archives, spark.yarn.jars, spark.jars.repositories... I could go on.

Since the new behaviors have already been introduced in the last release,

Every change introduces a behavior change (unless you're like changing a comment string or something like that). Saying a change is not wanted because it changes behavior is really quite silly.

The question is always whether the change is desired or not. And in this case it is. The change that was introduced in 2.3.0 is a regression in behavior from 2.2, or, in other words, it makes things worse. So we should fix it.

please document the behavior changes.

As I've stated in previous comments, I don't think there is a user-visible behavior change - at least one users would care about. As I mentioned above, the default value in 2.3.0 only causes problems, and doesn't apply to anything useful other than the JDBC URL, a behavior that this change keeps.

onursatici · 2018-05-04T23:49:17Z

I think this PR is fine given it preserves the redaction of URL's. That was indeed the main reason why url's were in the default redaction pattern in the previous PR.
To keep all the previous behaviour, SQL_OPTIONS_REDACTION_PATTERN can include user, as there might be environments considering user name information on sql connection options as db credentials. Showing that information could be opt-in if we redact by default.

vanzin · 2018-05-05T02:57:24Z

To keep all the previous behaviour, SQL_OPTIONS_REDACTION_PATTERN can include user

User names, unlike passwords, are useful for debugging. And they're not meant to be secret. They're meant to identify an entity, and by that, it means it's not generally hard to guess them. Which is why you need a password.

(Think it in a different way: if you access a table you shouldn't, wouldn't you get an exception saying "user blah cannot access table foo"? And are you redacting that in the places where that stuff shows up?)

If you have an environment where even user names are considered secret, it's easy enough to change the configuration. But at that time you really should think hard about following Tom's advice above and just enable authentication for your web UIs. Otherwise you're not really taking security seriously.

I really disliked even keeping the URL redacted, since that's even more useful than the user for debugging. But some vendors still support and even document putting passwords in those URLs, so that's why I kept it.

If you guys really feel strongly about redacting user names, I'll add it back in the SQL config. I don't really care about that part that much, even if I don't agree with the premise. But I strongly disagree with keeping the current default value in the core option.

vanzin · 2018-05-09T18:45:38Z

ping

vanzin · 2018-05-11T19:57:28Z

ping again

vanzin · 2018-05-14T20:08:30Z

Last ping before I assume everybody is ok with the patch.

gatorsmile · 2018-05-15T17:16:47Z

I would like to hear more inputs from the reviewers of the original PR. cc @onursatici @ash211 @hvanhovell @jiangxb1987

vanzin · 2018-05-15T17:22:25Z

That's fine, but I'm not going to wait indefinitely for their feedback. They've already been pinged multiple times here and on jira to comment on this issue.

vanzin · 2018-05-16T20:16:15Z

I plan to push this tomorrow by EOD unless I hear strong opposition, and merge it to 2.3 and 2.2 (where the original change was also backported). I want to avoid having another release out with the undesired behavior.

onursatici · 2018-05-16T21:23:13Z

I don't have any strong opinions about usernames, so merging this as it stays is fine by me

vanzin · 2018-05-17T19:46:16Z

retest this please

SparkQA · 2018-05-18T00:09:04Z

Test build #90750 has finished for PR 21158 at commit 6ca25e3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2018-05-18T18:13:53Z

Merging to master / 2.3 / 2.2.

The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the related redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21158 from vanzin/SPARK-23850. (cherry picked from commit ed7ba7d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

vanzin · 2018-05-18T18:32:43Z

(I'll open a separate PR for 2.2 to run tests before pushing, since there are conflicts.)

…daction. The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the related redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#21158 from vanzin/SPARK-23850. (cherry picked from commit ed7ba7d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

gatorsmile reviewed Apr 25, 2018

View reviewed changes

Use SQLConf.get.

6ca25e3

asfgit closed this in ed7ba7d May 18, 2018

vanzin deleted the SPARK-23850 branch May 18, 2018 21:27

[SPARK-23850][sql] Add separate config for SQL options redaction. #21158

[SPARK-23850][sql] Add separate config for SQL options redaction. #21158

Uh oh!

Conversation

vanzin commented Apr 25, 2018

Uh oh!

gatorsmile Apr 25, 2018

Choose a reason for hiding this comment

Uh oh!

vanzin Apr 25, 2018

Choose a reason for hiding this comment

Uh oh!

gatorsmile May 1, 2018

Choose a reason for hiding this comment

Uh oh!

gatorsmile May 1, 2018

Choose a reason for hiding this comment

Uh oh!

gatorsmile Apr 25, 2018

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Apr 25, 2018

Uh oh!

tgravescs commented Apr 26, 2018

Uh oh!

vanzin commented Apr 26, 2018

Uh oh!

vanzin commented Apr 30, 2018

Uh oh!

gatorsmile commented Apr 30, 2018

Uh oh!

vanzin commented May 1, 2018

Uh oh!

gatorsmile commented May 1, 2018

Uh oh!

tgravescs commented May 1, 2018

Uh oh!

vanzin commented May 1, 2018

Uh oh!

vanzin commented May 1, 2018

Uh oh!

vanzin commented May 3, 2018

Uh oh!

gatorsmile commented May 4, 2018

Uh oh!

tgravescs commented May 4, 2018

Uh oh!

tgravescs commented May 4, 2018

Uh oh!

vanzin commented May 4, 2018

Uh oh!

vanzin commented May 4, 2018

Uh oh!

gatorsmile commented May 4, 2018

Uh oh!

vanzin commented May 4, 2018

Uh oh!

onursatici commented May 4, 2018

Uh oh!

vanzin commented May 5, 2018

Uh oh!

vanzin commented May 9, 2018

Uh oh!

vanzin commented May 11, 2018

Uh oh!

vanzin commented May 14, 2018

Uh oh!

gatorsmile commented May 15, 2018

Uh oh!

vanzin commented May 15, 2018

Uh oh!

vanzin commented May 16, 2018

Uh oh!

onursatici commented May 16, 2018

Uh oh!

vanzin commented May 17, 2018

Uh oh!

SparkQA commented May 18, 2018

Uh oh!

vanzin commented May 18, 2018