-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-23850][sql] Add separate config for SQL options redaction. #21158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the "sistes" redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI.
@@ -342,7 +342,7 @@ package object config { | |||
"a property key or value, the value is redacted from the environment UI and various logs " + | |||
"like YARN and event logs.") | |||
.regexConf | |||
.createWithDefault("(?i)secret|password|url|user|username".r) | |||
.createWithDefault("(?i)secret|password".r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the users want to see the values, they can change the default by themselves, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you're trying to imply? The default is redacting too much. That needs to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, making it stricter is not a big deal. However, if we want to relax it, this could be a big deal since users might expect these variables have been hided by default. cc the original reviewers and author @onursatici @ash211 @hvanhovell @jiangxb1987
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URIs could contain the access keys and usernames could have potentially personal identifiable information. Could we do not change the default?
@@ -225,7 +225,7 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) { | |||
* Redact the sensitive information in the given string. | |||
*/ | |||
private def withRedaction(message: String): String = { | |||
Utils.redact(sparkSession.sessionState.conf.stringRedationPattern, message) | |||
Utils.redact(sparkSession.sessionState.conf.stringRedactionPattern, message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this typo
Test build #89855 has finished for PR 21158 at commit
|
Looks good. I didn't have a jdbc connection handy to test with. I was looking for documentation on these sql confs and I don't see them in the SET -v output, did we decide not to document these? I haven't looked to see how that works. |
That's part of this fix, if you're talking about the old config. Since it was using If that was intentional then probably that part should be reverted - don't see why it should be hidden, but maybe @gatorsmile had a reason. |
Ping @gatorsmile |
The confs in the core modules are not part of the outputs of SQL statement We are making a behavior change here. I am not confident about the changes in this PR. I would like to hear more options from the others. |
We are not. That's the whole reason why I'm adding the SQL-specific option to extend the behavior of the core options. If I just wanted to revert the original behavior change, which added more unwanted default redactions to the core options, I'd just have changed the default value of the core option back to what it was. |
Is the default |
the original change was merged into 2.3.0, see https://issues.apache.org/jira/browse/SPARK-22479. We are keeping the behavior of that fix and that is to not show the credentials of the jdbc connection. The url and user name were added for some reason in that jira as well and they shouldn't have been. There is nothing sensitive about these. I don't see this being a breaking compatibility issue. |
But the main thing is that the change modified the previous behavior which was not to redact these things in the previous places where redaction occurred. So, in a way, this change is fixing a regression in 2.3.0 (which was also added to 2.2.1 according to the bug), and still providing the same feature from SPARK-22479. |
Also note that Tom's comment above is not 100% accurate - this change does redact the JDBC URL on the SQL side, for reasons I described in my comments on the bug. That's the whole reason why the SQL changes are even needed. |
@gatorsmile is there anything else you're looking for here? |
The default value introduced by SPARK-22479 was already part of Apache Spark 2.3. It is hard to say this is a regression, since URIs could contain the access keys and usernames could have potentially personal identifiable information. Thus, it also makes sense to block them. Although the original JIRA is for JDBC, we also make it more secure. If users do not want to redact them, they can change the default. I am not feeling comfortable to not redact them in the current situation, since we already did it in Spark 2.3. All the security issues could be serious. Thus, I think we should make it unchanged. |
I disagree on this. I find it a regression that you blocked url's and user names in the first place. Showing these things have been a feature in spark for a long time. The only security issue here is with jdbc driver, block that instance. Are there any other places spark itself prints or uses a url with a password? If so we should fix those. If not then it would be specified in a user config and if they are doing that, they should change the config to add it to be redacted. I think the current behavior is purely worse from a user perspective. There are so many urls that have no security implications and us blocking them by default I think is wrong. |
Also, Spark by default is shipped with not secure settings. Meaning spark.acls.enable is false and spark.authenticate is false. I see no reason to make the redact configs more strict then our defaults for those (Note I'm not arguing we shouldn't redact credentials where Spark itself is showing). If a user turns on the acls, then by default only the user who submitted the job can see the UI. Going back to https://issues.apache.org/jira/browse/SPARK-22479 it looks like the user can see the url via the console, logs and I assume in the UI. Is there somewhere else someone can see this information? I want to make sure I understand the vulnerability here. If you enable security properly on spark no user should have access to those without being given permission. If you are not running with acls and authentication on then I would argue there are a lot of attack vectors to where I could run things as another user anyway. |
@gatorsmile seriously, why are you focusing on the fact that I'm changing that default value when I wrote a bunch of code to actually maintain the URL redaction on the SQL plans, which was the original intent? Did you even read the PR at all? |
User names are not sensitive information. There's a ton of places where you can see them outside of the Spark UI (have you tried running |
@vanzin That really depends on how the users deploy Spark. Not all the Spark end users are allowed to access the command lines and run @tgravescs @vanzin This PR tries to not affect the original scenario explained in the JIRA SPARK-22479. I would say thank you @vanzin for your efforts. The PR to fix SPARK-22479 also impacts more cases (e.g., in the environment page) too. I can understand your arguments. In your scenarios, it would be nice to expose all these info to the external users by default, as what we did in the past. Since the new behaviors have already been introduced in the last release, it would be better to keep them unchanged, if possible, based on the principle of least surprise. The lessons we learned is to document the behavior change in the migration guide. When users try the RC branches, the users could easily catch the changes. Unfortunately, we did not do it in that PR. Thus, it becomes a surprise to @tgravescs after we already released Spark 2.3. We need to avoid similar cases. We need to enforce every PR to document the behavior changes, no matter whether it is a regression fix or a bug fix. I am also fine to merge this if the other reviewers of the previous PR think we should change the behavior again. cc @onursatici @ash211 @hvanhovell @jiangxb1987 In the PR, please document the behavior changes. Hopefully, we will not see another PR to change the default value again after we release Spark 2.4. cc @vanzin @tgravescs WDYT? |
Come on. That is just an example. You surely are smart enough to think of another example where you can see other people's user names outside of Spark.
I disagree. Can you list all the Spark configuration that would be redacted by the conf in 2.3? I could only find two:
And those are a great example of why the original change is a regression. It's wrong to redact those. It makes debugging things harder. And yet there are a ton of config keys where you can provide URLs, the thing you seem really worried about, which are not caught by that regex. Let's see... spark.master, spark.jars, spark.files, spark.yarn.archives, spark.yarn.jars, spark.jars.repositories... I could go on.
Every change introduces a behavior change (unless you're like changing a comment string or something like that). Saying a change is not wanted because it changes behavior is really quite silly. The question is always whether the change is desired or not. And in this case it is. The change that was introduced in 2.3.0 is a regression in behavior from 2.2, or, in other words, it makes things worse. So we should fix it.
As I've stated in previous comments, I don't think there is a user-visible behavior change - at least one users would care about. As I mentioned above, the default value in 2.3.0 only causes problems, and doesn't apply to anything useful other than the JDBC URL, a behavior that this change keeps. |
I think this PR is fine given it preserves the redaction of URL's. That was indeed the main reason why url's were in the default redaction pattern in the previous PR. |
User names, unlike passwords, are useful for debugging. And they're not meant to be secret. They're meant to identify an entity, and by that, it means it's not generally hard to guess them. Which is why you need a password. (Think it in a different way: if you access a table you shouldn't, wouldn't you get an exception saying "user blah cannot access table foo"? And are you redacting that in the places where that stuff shows up?) If you have an environment where even user names are considered secret, it's easy enough to change the configuration. But at that time you really should think hard about following Tom's advice above and just enable authentication for your web UIs. Otherwise you're not really taking security seriously. I really disliked even keeping the URL redacted, since that's even more useful than the user for debugging. But some vendors still support and even document putting passwords in those URLs, so that's why I kept it. If you guys really feel strongly about redacting user names, I'll add it back in the SQL config. I don't really care about that part that much, even if I don't agree with the premise. But I strongly disagree with keeping the current default value in the core option. |
ping |
ping again |
Last ping before I assume everybody is ok with the patch. |
I would like to hear more inputs from the reviewers of the original PR. cc @onursatici @ash211 @hvanhovell @jiangxb1987 |
That's fine, but I'm not going to wait indefinitely for their feedback. They've already been pinged multiple times here and on jira to comment on this issue. |
I plan to push this tomorrow by EOD unless I hear strong opposition, and merge it to 2.3 and 2.2 (where the original change was also backported). I want to avoid having another release out with the undesired behavior. |
I don't have any strong opinions about usernames, so merging this as it stays is fine by me |
retest this please |
Test build #90750 has finished for PR 21158 at commit
|
Merging to master / 2.3 / 2.2. |
The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the related redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21158 from vanzin/SPARK-23850. (cherry picked from commit ed7ba7d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(I'll open a separate PR for 2.2 to run tests before pushing, since there are conflicts.) |
…daction. The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the related redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#21158 from vanzin/SPARK-23850. (cherry picked from commit ed7ba7d) Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
The old code was relying on a core configuration and extended its
default value to include things that redact desired things in the
app's environment. Instead, add a SQL-specific option for which
options to redact, and apply both the core and SQL-specific rules
when redacting the options in the save command.
This is a little sub-optimal since it adds another config, but it
retains the current default behavior.
While there I also fixed a typo and a couple of minor config API
usage issues in the related redaction option that SQL already had.
Tested with existing unit tests, plus checking the env page on
a shell UI.