Skip to content

[SPARK-23850][sql] Add separate config for SQL options redaction. #21158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

vanzin
Copy link
Contributor

@vanzin vanzin commented Apr 25, 2018

The old code was relying on a core configuration and extended its
default value to include things that redact desired things in the
app's environment. Instead, add a SQL-specific option for which
options to redact, and apply both the core and SQL-specific rules
when redacting the options in the save command.

This is a little sub-optimal since it adds another config, but it
retains the current default behavior.

While there I also fixed a typo and a couple of minor config API
usage issues in the related redaction option that SQL already had.

Tested with existing unit tests, plus checking the env page on
a shell UI.

The old code was relying on a core configuration and extended its
default value to include things that redact desired things in the
app's environment. Instead, add a SQL-specific option for which
options to redact, and apply both the core and SQL-specific rules
when redacting the options in the save command.

This is a little sub-optimal since it adds another config, but it
retains the current default behavior.

While there I also fixed a typo and a couple of minor config API
usage issues in the "sistes" redaction option that SQL already had.

Tested with existing unit tests, plus checking the env page on
a shell UI.
@@ -342,7 +342,7 @@ package object config {
"a property key or value, the value is redacted from the environment UI and various logs " +
"like YARN and event logs.")
.regexConf
.createWithDefault("(?i)secret|password|url|user|username".r)
.createWithDefault("(?i)secret|password".r)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the users want to see the values, they can change the default by themselves, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you're trying to imply? The default is redacting too much. That needs to be fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, making it stricter is not a big deal. However, if we want to relax it, this could be a big deal since users might expect these variables have been hided by default. cc the original reviewers and author @onursatici @ash211 @hvanhovell @jiangxb1987

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URIs could contain the access keys and usernames could have potentially personal identifiable information. Could we do not change the default?

@@ -225,7 +225,7 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) {
* Redact the sensitive information in the given string.
*/
private def withRedaction(message: String): String = {
Utils.redact(sparkSession.sessionState.conf.stringRedationPattern, message)
Utils.redact(sparkSession.sessionState.conf.stringRedactionPattern, message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this typo

@SparkQA
Copy link

SparkQA commented Apr 25, 2018

Test build #89855 has finished for PR 21158 at commit deb299b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Copy link
Contributor

Looks good. I didn't have a jdbc connection handy to test with.

I was looking for documentation on these sql confs and I don't see them in the SET -v output, did we decide not to document these? I haven't looked to see how that works.

@vanzin
Copy link
Contributor Author

vanzin commented Apr 26, 2018

I was looking for documentation on these sql confs

That's part of this fix, if you're talking about the old config. Since it was using ConfigBuilder directly instead of buildConf, it did not make it to the output of "SET -v".

If that was intentional then probably that part should be reverted - don't see why it should be hidden, but maybe @gatorsmile had a reason.

@vanzin
Copy link
Contributor Author

vanzin commented Apr 30, 2018

Ping @gatorsmile

@gatorsmile
Copy link
Member

The confs in the core modules are not part of the outputs of SQL statement SET -v, which only outputs the confs in Spark SQL.

We are making a behavior change here. I am not confident about the changes in this PR. I would like to hear more options from the others.

@vanzin
Copy link
Contributor Author

vanzin commented May 1, 2018

We are making a behavior change here.

We are not. That's the whole reason why I'm adding the SQL-specific option to extend the behavior of the core options.

If I just wanted to revert the original behavior change, which added more unwanted default redactions to the core options, I'd just have changed the default value of the core option back to what it was.

@gatorsmile
Copy link
Member

Is the default (?i)secret|password|url|user|username".r merged to the released branches? If the default is not in the previously release, I am fine to change it back.

@tgravescs
Copy link
Contributor

the original change was merged into 2.3.0, see https://issues.apache.org/jira/browse/SPARK-22479. We are keeping the behavior of that fix and that is to not show the credentials of the jdbc connection. The url and user name were added for some reason in that jira as well and they shouldn't have been. There is nothing sensitive about these. I don't see this being a breaking compatibility issue.

@vanzin
Copy link
Contributor Author

vanzin commented May 1, 2018

But the main thing is that the change modified the previous behavior which was not to redact these things in the previous places where redaction occurred. So, in a way, this change is fixing a regression in 2.3.0 (which was also added to 2.2.1 according to the bug), and still providing the same feature from SPARK-22479.

@vanzin
Copy link
Contributor Author

vanzin commented May 1, 2018

Also note that Tom's comment above is not 100% accurate - this change does redact the JDBC URL on the SQL side, for reasons I described in my comments on the bug. That's the whole reason why the SQL changes are even needed.

@vanzin
Copy link
Contributor Author

vanzin commented May 3, 2018

@gatorsmile is there anything else you're looking for here?

@gatorsmile
Copy link
Member

The default value introduced by SPARK-22479 was already part of Apache Spark 2.3. It is hard to say this is a regression, since URIs could contain the access keys and usernames could have potentially personal identifiable information. Thus, it also makes sense to block them. Although the original JIRA is for JDBC, we also make it more secure. If users do not want to redact them, they can change the default. I am not feeling comfortable to not redact them in the current situation, since we already did it in Spark 2.3. All the security issues could be serious. Thus, I think we should make it unchanged.

@tgravescs
Copy link
Contributor

I disagree on this. I find it a regression that you blocked url's and user names in the first place. Showing these things have been a feature in spark for a long time. The only security issue here is with jdbc driver, block that instance. Are there any other places spark itself prints or uses a url with a password? If so we should fix those.

If not then it would be specified in a user config and if they are doing that, they should change the config to add it to be redacted. I think the current behavior is purely worse from a user perspective. There are so many urls that have no security implications and us blocking them by default I think is wrong.

@tgravescs
Copy link
Contributor

Also, Spark by default is shipped with not secure settings. Meaning spark.acls.enable is false and spark.authenticate is false. I see no reason to make the redact configs more strict then our defaults for those (Note I'm not arguing we shouldn't redact credentials where Spark itself is showing). If a user turns on the acls, then by default only the user who submitted the job can see the UI.

Going back to https://issues.apache.org/jira/browse/SPARK-22479 it looks like the user can see the url via the console, logs and I assume in the UI. Is there somewhere else someone can see this information? I want to make sure I understand the vulnerability here.

If you enable security properly on spark no user should have access to those without being given permission. If you are not running with acls and authentication on then I would argue there are a lot of attack vectors to where I could run things as another user anyway.

@vanzin
Copy link
Contributor Author

vanzin commented May 4, 2018

@gatorsmile seriously, why are you focusing on the fact that I'm changing that default value when I wrote a bunch of code to actually maintain the URL redaction on the SQL plans, which was the original intent? Did you even read the PR at all?

@vanzin
Copy link
Contributor Author

vanzin commented May 4, 2018

usernames could have potentially personal identifiable information

User names are not sensitive information. There's a ton of places where you can see them outside of the Spark UI (have you tried running ls on you machine lately?).

@gatorsmile
Copy link
Member

@vanzin That really depends on how the users deploy Spark. Not all the Spark end users are allowed to access the command lines and run ls, but they are allowed to read Spark UI in almost all the cases.

@tgravescs @vanzin This PR tries to not affect the original scenario explained in the JIRA SPARK-22479. I would say thank you @vanzin for your efforts.

The PR to fix SPARK-22479 also impacts more cases (e.g., in the environment page) too. I can understand your arguments. In your scenarios, it would be nice to expose all these info to the external users by default, as what we did in the past. Since the new behaviors have already been introduced in the last release, it would be better to keep them unchanged, if possible, based on the principle of least surprise.

The lessons we learned is to document the behavior change in the migration guide. When users try the RC branches, the users could easily catch the changes. Unfortunately, we did not do it in that PR. Thus, it becomes a surprise to @tgravescs after we already released Spark 2.3. We need to avoid similar cases. We need to enforce every PR to document the behavior changes, no matter whether it is a regression fix or a bug fix.

I am also fine to merge this if the other reviewers of the previous PR think we should change the behavior again. cc @onursatici @ash211 @hvanhovell @jiangxb1987

In the PR, please document the behavior changes. Hopefully, we will not see another PR to change the default value again after we release Spark 2.4. cc @vanzin @tgravescs WDYT?

@vanzin
Copy link
Contributor Author

vanzin commented May 4, 2018

Not all the Spark end users are allowed to access the command lines and run ls

Come on. That is just an example. You surely are smart enough to think of another example where you can see other people's user names outside of Spark.

it would be better to keep them unchanged, if possible

I disagree. Can you list all the Spark configuration that would be redacted by the conf in 2.3? I could only find two:

  • spark.driver.userClassPathFirsrt
  • spark.executor.userClassPathFirst

And those are a great example of why the original change is a regression. It's wrong to redact those. It makes debugging things harder.

And yet there are a ton of config keys where you can provide URLs, the thing you seem really worried about, which are not caught by that regex. Let's see... spark.master, spark.jars, spark.files, spark.yarn.archives, spark.yarn.jars, spark.jars.repositories... I could go on.

Since the new behaviors have already been introduced in the last release,

Every change introduces a behavior change (unless you're like changing a comment string or something like that). Saying a change is not wanted because it changes behavior is really quite silly.

The question is always whether the change is desired or not. And in this case it is. The change that was introduced in 2.3.0 is a regression in behavior from 2.2, or, in other words, it makes things worse. So we should fix it.

please document the behavior changes.

As I've stated in previous comments, I don't think there is a user-visible behavior change - at least one users would care about. As I mentioned above, the default value in 2.3.0 only causes problems, and doesn't apply to anything useful other than the JDBC URL, a behavior that this change keeps.

@onursatici
Copy link
Contributor

I think this PR is fine given it preserves the redaction of URL's. That was indeed the main reason why url's were in the default redaction pattern in the previous PR.
To keep all the previous behaviour, SQL_OPTIONS_REDACTION_PATTERN can include user, as there might be environments considering user name information on sql connection options as db credentials. Showing that information could be opt-in if we redact by default.

@vanzin
Copy link
Contributor Author

vanzin commented May 5, 2018

To keep all the previous behaviour, SQL_OPTIONS_REDACTION_PATTERN can include user

User names, unlike passwords, are useful for debugging. And they're not meant to be secret. They're meant to identify an entity, and by that, it means it's not generally hard to guess them. Which is why you need a password.

(Think it in a different way: if you access a table you shouldn't, wouldn't you get an exception saying "user blah cannot access table foo"? And are you redacting that in the places where that stuff shows up?)

If you have an environment where even user names are considered secret, it's easy enough to change the configuration. But at that time you really should think hard about following Tom's advice above and just enable authentication for your web UIs. Otherwise you're not really taking security seriously.

I really disliked even keeping the URL redacted, since that's even more useful than the user for debugging. But some vendors still support and even document putting passwords in those URLs, so that's why I kept it.

If you guys really feel strongly about redacting user names, I'll add it back in the SQL config. I don't really care about that part that much, even if I don't agree with the premise. But I strongly disagree with keeping the current default value in the core option.

@vanzin
Copy link
Contributor Author

vanzin commented May 9, 2018

ping

@vanzin
Copy link
Contributor Author

vanzin commented May 11, 2018

ping again

@vanzin
Copy link
Contributor Author

vanzin commented May 14, 2018

Last ping before I assume everybody is ok with the patch.

@gatorsmile
Copy link
Member

I would like to hear more inputs from the reviewers of the original PR. cc @onursatici @ash211 @hvanhovell @jiangxb1987

@vanzin
Copy link
Contributor Author

vanzin commented May 15, 2018

That's fine, but I'm not going to wait indefinitely for their feedback. They've already been pinged multiple times here and on jira to comment on this issue.

@vanzin
Copy link
Contributor Author

vanzin commented May 16, 2018

I plan to push this tomorrow by EOD unless I hear strong opposition, and merge it to 2.3 and 2.2 (where the original change was also backported). I want to avoid having another release out with the undesired behavior.

@onursatici
Copy link
Contributor

I don't have any strong opinions about usernames, so merging this as it stays is fine by me

@vanzin
Copy link
Contributor Author

vanzin commented May 17, 2018

retest this please

@SparkQA
Copy link

SparkQA commented May 18, 2018

Test build #90750 has finished for PR 21158 at commit 6ca25e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor Author

vanzin commented May 18, 2018

Merging to master / 2.3 / 2.2.

@asfgit asfgit closed this in ed7ba7d May 18, 2018
asfgit pushed a commit that referenced this pull request May 18, 2018
The old code was relying on a core configuration and extended its
default value to include things that redact desired things in the
app's environment. Instead, add a SQL-specific option for which
options to redact, and apply both the core and SQL-specific rules
when redacting the options in the save command.

This is a little sub-optimal since it adds another config, but it
retains the current default behavior.

While there I also fixed a typo and a couple of minor config API
usage issues in the related redaction option that SQL already had.

Tested with existing unit tests, plus checking the env page on
a shell UI.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21158 from vanzin/SPARK-23850.

(cherry picked from commit ed7ba7d)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@vanzin
Copy link
Contributor Author

vanzin commented May 18, 2018

(I'll open a separate PR for 2.2 to run tests before pushing, since there are conflicts.)

vanzin pushed a commit to vanzin/spark that referenced this pull request May 18, 2018
…daction.

The old code was relying on a core configuration and extended its
default value to include things that redact desired things in the
app's environment. Instead, add a SQL-specific option for which
options to redact, and apply both the core and SQL-specific rules
when redacting the options in the save command.

This is a little sub-optimal since it adds another config, but it
retains the current default behavior.

While there I also fixed a typo and a couple of minor config API
usage issues in the related redaction option that SQL already had.

Tested with existing unit tests, plus checking the env page on
a shell UI.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#21158 from vanzin/SPARK-23850.

(cherry picked from commit ed7ba7d)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
@vanzin vanzin deleted the SPARK-23850 branch May 18, 2018 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants