[SPARK-48392][CORE] Also load spark-defaults.conf when provided --properties-file #46709

Closed
wants to merge 3 commits

Conversation

@sunchao
Member

sunchao commented May 22, 2024

What changes were proposed in this pull request?

Currently, if a properties file is provided as an argument to spark-submit, spark-defaults.conf is ignored. This PR changes the behavior so that spark-defaults.conf is still loaded in this scenario, and any Spark configurations that are in that file but not in the input properties file will be loaded.

Why are the changes needed?

Currently, if a properties file is provided as an argument to spark-submit, spark-defaults.conf is ignored. This causes inconvenience for users who want to split Spark configurations between the two. For example, in Spark on K8s, users may want to store system-wide default settings in spark-defaults.conf, while keeping more dynamic, user-specified configurations in a separate properties file passed to Spark applications via the --properties-file parameter. Currently this is not possible. See also kubeflow/spark-operator#1321.

Does this PR introduce any user-facing change?

Yes. Now, when a properties file is provided via --properties-file, spark-defaults.conf will also be loaded. However, configurations specified in the former take precedence over the same configurations in the latter.
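
For illustration, a minimal Scala sketch of the resulting precedence; the configuration keys and values below are made up for the example, not taken from this PR:

  object PrecedenceExample extends App {
    // Hypothetical contents of conf/spark-defaults.conf (site-wide defaults).
    val sparkDefaults = Map(
      "spark.eventLog.enabled" -> "true",
      "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer")

    // Hypothetical contents of the file passed via --properties-file.
    val userProperties = Map(
      "spark.eventLog.enabled" -> "false",
      "spark.executor.memory" -> "4g")

    // Keys in the user file win; keys only present in spark-defaults.conf are still loaded.
    val effective = sparkDefaults ++ userProperties
    effective.foreach { case (k, v) => println(s"$k=$v") }
    // spark.eventLog.enabled=false, spark.executor.memory=4g,
    // spark.serializer=org.apache.spark.serializer.KryoSerializer
  }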

How was this patch tested?

Existing tests and a new test.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the CORE label May 22, 2024
sunchao changed the title from [WIP][SPARK-48392][CORE] Also load spark-defaults.conf when provided --properties-file to [SPARK-48392][CORE] Also load spark-defaults.conf when provided --properties-file May 28, 2024
sunchao marked this pull request as ready for review May 28, 2024 15:05
}

if (propertiesFile == null) {
propertiesFile = defaultSparkConf
Member Author

This is mostly to make the test happy

Member

I think this follows the original behavior of overwriting the propertiesFile value, although it looks like strange behavior.

@@ -1623,6 +1640,22 @@ class SparkSubmitSuite
}
}

private def withPropertyFile(fileName: String, conf: Map[String, String])(f: String => Unit) = {
Member Author

Extract some common logic of creating a property file into a util function.
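
A minimal sketch of what such a helper could look like in SparkSubmitSuite; the shape below is an assumption for illustration, not necessarily the exact code added in this PR:

  private def withPropertyFile(fileName: String, conf: Map[String, String])(f: String => Unit): Unit = {
    val dir = java.nio.file.Files.createTempDirectory("spark-submit-test").toFile
    val file = new java.io.File(dir, fileName)
    val writer = new java.io.PrintWriter(file, "UTF-8")
    try {
      // spark-defaults.conf style: whitespace-separated "key value" pairs
      conf.foreach { case (k, v) => writer.println(s"$k $v") }
    } finally {
      writer.close()
    }
    try f(file.getAbsolutePath) finally {
      file.delete()
      dir.delete()
    }
  }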

@sunchao
Member Author

sunchao commented May 28, 2024

cc @viirya @dongjoon-hyun @cloud-fan let me know what you think

Member

viirya left a comment

I think this proposed behavior looks reasonable to me. If there is any documentation on this, we probably need to update it too.

Do you also think we should add an item to the migration guide for this?

@sunchao
Member Author

sunchao commented May 28, 2024

I think this proposed behavior looks reasonable to me. If there is any documentation on this, we probably need to update it too.

Thanks @viirya. There is related documentation here: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties. It looks like the behavior change in this PR is consistent with the description there:

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

Do you also think we should add an item to the migration guide for this?

Yes sure. I can add this to the migration guide as a separate PR afterwards.

@viirya
Member

viirya commented May 28, 2024

Thanks @viirya. There is related documentation here: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties. It looks like the behavior change in this PR is consistent with the description there:

Yea, but it seems it doesn't mention how --properties-file works with spark-defaults.conf. Maybe we can add a few words there too.

@@ -125,14 +125,27 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
* When this is called, `sparkProperties` is already filled with configs from the latter.
*/
private def mergeDefaultSparkProperties(): Unit = {
// Use common defaults file, if not specified by user
propertiesFile = Option(propertiesFile).getOrElse(Utils.getDefaultPropertiesFile(env))
// Honor --conf before the defaults file
Member

Suggested change
// Honor --conf before the defaults file
// Honor --conf before the specified properties file and defaults file

Member Author

Done
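
To make the truncated hunk above easier to follow, here is a hedged sketch of the overall merge order this PR aims for. The shape is illustrative rather than the exact diff; sparkProperties and propertiesFile are the existing fields visible in the hunk, and Utils.getDefaultPropertiesFile / Utils.getPropertiesFromFile are existing Spark utilities:

  // Illustrative shape only, not the exact change. sparkProperties already holds
  // everything passed via --conf, which takes precedence over any file.
  private def mergeDefaultSparkProperties(): Unit = {
    def mergeFile(path: String): Unit = {
      Utils.getPropertiesFromFile(path).foreach { case (k, v) =>
        if (!sparkProperties.contains(k)) {
          sparkProperties(k) = v
        }
      }
    }
    // The file given via --properties-file, if any, is merged first among the files.
    Option(propertiesFile).foreach(mergeFile)
    // New in this PR: spark-defaults.conf is loaded as well, filling in only keys
    // that are still unset, so site-wide defaults are no longer silently dropped.
    Option(Utils.getDefaultPropertiesFile(env)).foreach(mergeFile)
  }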

github-actions bot added the DOCS label May 28, 2024
@sunchao
Member Author

sunchao commented May 28, 2024

Yea makes sense. Just updated the doc too.

sunchao closed this in eac413a May 29, 2024
@sunchao
Member Author

sunchao commented May 29, 2024

Thanks! merged to master/4.0

@yaooqinn
Member

You need to update all the usage information for commands with --help

@sunchao
Member Author

sunchao commented May 29, 2024

@yaooqinn do you mean the --help from spark-submit / spark-shell? Currently the usage for --properties-file is as follows:

  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

which I think is also consistent with the behavior here, although not complete. I can add more info there.

@yaooqinn
Member

--properties-file FILE   Path to a custom Spark properties file. Default is conf/spark-defaults.conf.

Do the commands with the above message also get affected?

@sunchao
Member Author

sunchao commented May 29, 2024

No, they are not affected. This PR only changes SparkSubmitArguments. Classes such as HistoryServerArguments call Utils.loadDefaultSparkProperties.

@yaooqinn
Member

I'm not sure if such an implicit discrepancy is user-friendly or not.

Contributor

advancedxy left a comment

This is a behavior change and it seems there's no easy way for users to restore the original behavior. Do you think it would be safer to add a new CLI argument, such as --extra-properties-file, instead? By adding --extra-properties-files, I think the affected scope could be kept to a minimum.

A potential use case for the original --properties-file: there might be multiple Spark clusters (or YARN clusters) to submit Spark jobs to, and multiple Spark config files under $SPARK_CONF_DIR, such as:

  1. spark-defaults.conf --> which stores the default cluster settings
  2. spark-cluster-a.conf --> which stores the cluster settings for cluster a
  3. spark-cluster-b.conf --> which stores the cluster settings for cluster b

For most cases, the default cluster is used. For some specific jobs, cluster-a might be used. After this change, both spark-defaults.conf and spark-cluster-a.conf are loaded, which might be undesired, as users already assume that only one properties file will be loaded.

// Use common defaults file, if not specified by user
propertiesFile = Option(propertiesFile).getOrElse(Utils.getDefaultPropertiesFile(env))
// Honor --conf before the defaults file
// Honor --conf before the specified properties file and defaults file
defaultSparkProperties.foreach { case (k, v) =>
Contributor

When running with --verbose, the defaultSparkProperties evaluation will log "Using properties file: null" if --properties-file is not provided, since propertiesFile is not set yet, which I think is not expected.

Member Author

Oops that's right. I'll try to address it in the follow-up PR.
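
For illustration only (this is not the actual SparkSubmitArguments code), a self-contained Scala sketch of the ordering issue described above, where a lazy val logs a var that has not been assigned yet at the time the lazy val is first forced:

  object LazyEvalOrder extends App {
    var propertiesFile: String = null   // stands in for the --properties-file value; still null here

    // Stands in for defaultSparkProperties: the body runs on first access only.
    lazy val defaultSparkProperties: Map[String, String] = {
      println(s"Using properties file: $propertiesFile")   // verbose-style log
      Map.empty
    }

    println(defaultSparkProperties.size)          // forcing the lazy val logs "Using properties file: null"
    propertiesFile = "conf/spark-defaults.conf"   // assigned only after the lazy val was already forced
  }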

@viirya
Member

viirya commented May 29, 2024

  • spark-defaults.conf --> which stores the default cluster settings
  • spark-cluster-a.conf --> which stores the cluster settings for cluster a
  • spark-cluster-b.conf --> which stores the cluster settings for cluster b

The properties defined in spark-cluster-a.conf will still be effective, so any duplicate properties defined as defaults in spark-defaults.conf will be overwritten. I think this still follows the original behavior.

What is described actually seems to give this change more support, as users can put common defaults in spark-defaults.conf and apply them to all clusters. Only the differing properties need to be defined in separate files.

Actually, I feel this is more aligned with the meaning of "default" configs.

@sunchao
Member Author

sunchao commented May 29, 2024

+1 on what @viirya said.

@advancedxy to be clear the configurations from spark-cluster-a.conf will still take higher precedence than those from spark-defaults.conf, and I do think this is more aligned with the purpose of spark-defaults.conf. This also addresses an inconsistency in the current implementation: for configurations passed through --conf flags, Spark will still load spark-defaults.conf, but it will not for those passed from --properties-file, which is a bit confusing.

Since we are preparing a new major release, I'm just hoping to take the opportunity to improve this. This is also important for kubeflow Spark k8s operator users, since currently there is no way to separate system-wide default configurations from user- and job-specific configurations.

@advancedxy
Contributor

The properties defined in spark-cluster-a.conf will still be effective, so any duplicate properties defined as defaults in spark-defaults.conf will be overwritten. I think this still follows the original behavior.

@advancedxy to be clear the configurations from spark-cluster-a.conf will still take higher precedence than those from spark-defaults.conf, and I do think this is more aligned with the purpose of spark-defaults.conf. This also addresses an inconsistency in the current implementation: for configurations passed through --conf flags, Spark will still load spark-defaults.conf, but it will not for those passed from --properties-file, which is a bit confusing.

@sunchao and @viirya, thanks for your replies and clarifications. Yeah, I'm aware of the effectiveness and precedence between spark-defaults.conf and the --properties-file files after this PR, and I would prefer this new behavior if we were implementing this feature from day one.

The biggest problem is that potentially unwanted configurations defined in spark-defaults.conf are also loaded if they are not overridden by --properties-file. Take the multi-cluster setup above, for example: the cluster admin might put in more settings that suit the default cluster, such as spark.sql.shuffle.partitions, spark.yarn.maxAppAttempts, etc., which are not overridden by spark-cluster-a.conf. Users might expect the default values defined in Spark's source code, rather than the values in spark-defaults.conf. That's why I think it might be a bit unsafe.

Since we are preparing a new major release, I'm just hoping to take the opportunity to improve this.

This sounds legit, and it might be worth taking the risk.

This is also important for kubeflow Spark k8s operator users, since currently there is no way to separate system-wide default configurations and user and job specific configurations.

This is a legit use case and we should definitely support it. It could be addressed by another CLI argument, though.

Anyway, I would prefer an extra CLI argument such as --extra-properties-files, due to the principle of least surprise. However, it would also be worthwhile to take the leap for Spark 4.0 and refine the current behavior. I think it's up to you guys and I trust your judgment on this one; I just want to make sure the implications are clear after this PR.

@cloud-fan
Contributor

cloud-fan commented May 30, 2024

IIUC, we want to have a way to load the default config file, and it's cleaner if --properties-file can do it. However, I do share the same concern as @advancedxy. Adding --extra-properties-files won't hurt, and you can use this new option to load the default config file. Then we can leave --properties-file unchanged and avoid the risk of breaking Spark users during version upgrades.

@cloud-fan
Contributor

BTW, I agree that it's better behavior to just let --properties-file load the default config file. We should have used this behavior from day one. But now I think it's a bit overkill to make a breaking change just for a cleaner API. --extra-properties-files should be sufficient functionality-wise.

@yaooqinn
Member

+1 for --extra-properties-files.

As a maintainer of Apache Kyuubi, which supports multi-tenant and multi-runtime environments, I think this change might pollute an isolated custom Spark conf.

@sunchao
Member Author

sunchao commented May 30, 2024

Thanks for all the feedback! Let me repurpose the follow-up PR #46782 to add this extra flag then (and we no longer need to update the migration guide). One minor inconvenience is that we'd also need to update the kubeflow k8s operator to consume this flag, and users of the operator would need to upgrade first in order to use this. I'll make a PR there too.

sunchao added a commit that referenced this pull request Jun 2, 2024
…ide whether to load `spark-defaults.conf`

### What changes were proposed in this pull request?

Following the discussions in #46709, this PR adds a flag `--load-spark-defaults` to control whether `spark-defaults.conf` should be loaded when `--properties-file` is specified. By default, the flag is turned off.

### Why are the changes needed?

Ideally we should avoid behavior change and reduce user disruptions. For this reason, a flag is preferred.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46782 from sunchao/SPARK-48392-followup.

Authored-by: Chao Sun <chao@openai.com>
Signed-off-by: Chao Sun <chao@openai.com>
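
For readers following along, here is a self-contained Scala sketch of how a flag-gated merge along the lines of --load-spark-defaults could behave. The object and method names are illustrative, and it uses plain java.util.Properties instead of Spark's internal helpers:

  import java.io.FileInputStream
  import java.util.Properties
  import scala.collection.mutable
  import scala.jdk.CollectionConverters._

  object DefaultsMergeSketch {
    // Reads a "key value" / "key=value" style properties file into a Scala Map.
    private def load(path: String): Map[String, String] = {
      val props = new Properties()
      val in = new FileInputStream(path)
      try props.load(in) finally in.close()
      props.asScala.toMap
    }

    // confFlags come from --conf and always win; the user file is next; spark-defaults.conf
    // is folded in only when the flag is set, or when no user file was given at all.
    def merge(
        confFlags: Map[String, String],
        propertiesFile: Option[String],
        loadSparkDefaults: Boolean,
        defaultsFile: String = "conf/spark-defaults.conf"): Map[String, String] = {
      val merged = mutable.Map[String, String]() ++= confFlags
      propertiesFile.foreach { f =>
        load(f).foreach { case (k, v) => merged.getOrElseUpdate(k, v) }
      }
      if (loadSparkDefaults || propertiesFile.isEmpty) {
        load(defaultsFile).foreach { case (k, v) => merged.getOrElseUpdate(k, v) }
      }
      merged.toMap
    }
  }

With loadSparkDefaults = false and a properties file given, only that file and the --conf flags contribute, which matches the pre-existing behavior.
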
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
…properties-file`


Closes apache#46709 from sunchao/SPARK-48392.

Authored-by: Chao Sun <chao@openai.com>
Signed-off-by: Chao Sun <chao@openai.com>
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
…ide whether to load `spark-defaults.conf`


Closes apache#46782 from sunchao/SPARK-48392-followup.

Authored-by: Chao Sun <chao@openai.com>
Signed-off-by: Chao Sun <chao@openai.com>