
[SPARK-25880][CORE] user set's hadoop conf should not overwrite by sparkcontext's conf #22887


Closed

Conversation


@gjhkael gjhkael commented Oct 30, 2018

What changes were proposed in this pull request?

A Hadoop conf value that the user sets through Spark SQL's SET command should not be overwritten by the SparkContext's conf, which is read from spark-defaults.conf.

How was this patch tested?

Manually verified with Spark 2.2.0.

@gjhkael gjhkael changed the title user set's hadoop conf should not overwrite by sparkcontext's conf [SPARK-25880] user set's hadoop conf should not overwrite by sparkcontext's conf Oct 30, 2018
@gjhkael gjhkael changed the title [SPARK-25880] user set's hadoop conf should not overwrite by sparkcontext's conf [SPARK-25880][Core] user set's hadoop conf should not overwrite by sparkcontext's conf Oct 30, 2018
@gjhkael gjhkael changed the title [SPARK-25880][Core] user set's hadoop conf should not overwrite by sparkcontext's conf [SPARK-25880][CORE] user set's hadoop conf should not overwrite by sparkcontext's conf Oct 30, 2018
@gjhkael
Author

gjhkael commented Oct 30, 2018

test this please

@gengliangwang
Member

Hi @gjhkael,
can you explain more about why you made this change?
Did you try spark.sessionState.newHadoopConf()?

@gjhkael
Author

gjhkael commented Oct 31, 2018

can you explain more about why you made this change?
Some Hadoop configurations are set in spark-defaults.conf because we want them to be global, but in some cases a user needs to override one of them and the override does not take effect, because the SparkContext's conf fills the hadoopConf again right before the Hadoop conf is broadcast.
Did you try spark.sessionState.newHadoopConf()?
We hit this problem in Spark SQL, not when using the DataFrame API.
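A rough sketch of the sequence being described (the class and method names are the ones discussed later in this thread; "foo" is just a placeholder key):

// 1. spark-defaults.conf contains:  spark.hadoop.foo = global-value
// 2. SQL "SET foo=session-value" stores the override in the session conf
// 3. sessionState.newHadoopConf() applies it, so hadoopConf has foo = session-value
// 4. appendS3AndSparkHadoopConfigurations(sparkContext.conf, hadoopConf)
//    re-applies spark.hadoop.foo, so hadoopConf goes back to foo = global-value
// 5. hadoopConf is broadcast, and tasks read global-value instead of the SET value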

@cxzl25
Contributor

cxzl25 commented Nov 9, 2018

A Hadoop conf value set by the user can't override the one from spark-defaults.conf:

SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations overwrites the user-set value with the spark.hadoop.* default from sparkSession.sparkContext.conf.

@gengliangwang @cloud-fan @gatorsmile
Could you please give some comments when you have time?
Thanks so much.

SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations(
  sparkSession.sparkContext.conf, hadoopConf)

private val _broadcastedHadoopConf =
  sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
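For context, the spark.hadoop.* copy performed inside appendS3AndSparkHadoopConfigurations amounts to roughly the following (a simplified sketch, not the actual Spark source; the real method also handles S3 credentials from the environment, among other things):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Copy every spark.hadoop.* entry from the SparkConf onto the Hadoop conf,
// replacing whatever value is already there.
def copySparkHadoopProperties(sparkConf: SparkConf, hadoopConf: Configuration): Unit = {
  sparkConf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    }
  }
}

Because this copy runs after sessionState.newHadoopConf() has applied the session-level SET values, the value from spark-defaults.conf silently wins.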

test:

spark-defaults.conf

spark.hadoop.mapreduce.input.fileinputformat.split.maxsize  2

spark-shell

val hadoopConfKey = "mapreduce.input.fileinputformat.split.maxsize"
spark.conf.get("spark.hadoop." + hadoopConfKey) // "2", from spark-defaults.conf
var hadoopConf = spark.sessionState.newHadoopConf()
hadoopConf.get(hadoopConfKey) // "2"

spark.conf.set(hadoopConfKey, 1) // user overrides the value to 1
hadoopConf = spark.sessionState.newHadoopConf()
hadoopConf.get(hadoopConfKey) // "1", the override is visible

// org.apache.spark.sql.hive.HadoopTableReader then appends the SparkContext conf
org.apache.spark.deploy.SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations(spark.sparkContext.getConf, hadoopConf)

// org.apache.spark.sql.hive.HadoopTableReader broadcasts this as _broadcastedHadoopConf
hadoopConf.get(hadoopConfKey) // "2" again: the user-set value is lost

@cloud-fan
Contributor

looks reasonable, cc @gatorsmile

@vanzin
Contributor

vanzin commented Nov 26, 2018

Sorry, this is a breaking change. It changes the behavior from "I can currently override any Hadoop configs, even final ones, using spark.hadoop.*" to "I can never do that".

If there's an issue with the SQL "set" command that needs to be addressed, this is the wrong place to do it.

Basically, if my "core-site.xml" says "mapreduce.input.fileinputformat.split.maxsize" is 2, and my Spark conf says "spark.hadoop.mapreduce.input.fileinputformat.split.maxsize" is 3, then the value from the config generated by the method you're changing must be 3.
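To spell out the expected precedence with the values from this example (a hedged illustration assuming exactly those two settings are in place; in spark-shell, sc.hadoopConfiguration is derived from the SparkContext conf through the same code path):

// core-site.xml:        mapreduce.input.fileinputformat.split.maxsize = 2
// spark-defaults.conf:  spark.hadoop.mapreduce.input.fileinputformat.split.maxsize = 3
sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.maxsize") // must be "3"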

@gjhkael
Author

gjhkael commented Nov 27, 2018

@vanzin Thanks for your review. I added a new commit to let the user's "set" command take effect. Let me know if you have an easier way. Thanks.

@cloud-fan
Contributor

Basically, if my "core-site.xml" says "mapreduce.input.fileinputformat.split.maxsize" is 2, and my Spark conf says "spark.hadoop.mapreduce.input.fileinputformat.split.maxsize" is 3, then the value from the config generated by the method you're changing must be 3.

I think this is what this PR tries to fix? The hadoopConf parameter of appendS3AndSparkHadoopConfigurations is either an empty one, or one from spark.sessionState.newHadoopConf(), which contains the user-provided Hadoop conf.

@vanzin
Contributor

vanzin commented Nov 27, 2018

I think this is what this PR tries to fix?

To be fair I'm not sure I fully understand the PR description. But I know that the previous patch (which I commented on) broke the functionality I described - not in the context of SQL, but in the context of everything else in Spark that calls that code.

@gjhkael
Author

gjhkael commented Nov 28, 2018

@vanzin @cloud-fan
The simplest description: a 'spark.hadoop.xxx' value that the user sets through the SET command does not override the same configuration set in the spark-defaults.conf file.
I don't know if that description makes sense?

@vanzin
Contributor

vanzin commented Nov 28, 2018

OK, that makes sense, in that I understand what you're saying, but I'm not sure it's what you actually want?

Why shouldn't "set spark.hadoop.*" override spark-defaults.conf?

But, in any case, it seems like the patch for SPARK-26060 (#23031) should take care of this (by raising an error).

@cloud-fan
Contributor

The Spark SQL SET command can't update any static config or Spark core configs, but I think Hadoop configs are different. They are not static, as users can update them via SparkContext.hadoopConfiguration. SparkSession.sessionState.newHadoopConf() is a mechanism that allows users to set Hadoop configs per session in Spark SQL.

So it's reasonable for users to expect that, if they set a Hadoop config via the SQL SET command, it should override the one in spark-defaults.conf.

Looking back at appendS3AndSparkHadoopConfigurations, it has 2 parameters: a Spark conf and a Hadoop conf. The Spark conf comes from spark-defaults.conf plus any user-provided configs supplied when building the SparkContext; the user-provided configs override spark-defaults.conf. The Hadoop conf is either an empty config (if appendS3AndSparkHadoopConfigurations is called from SparkHadoopUtil.newHadoopConfiguration), or one from SparkSession.sessionState.newHadoopConf() (if appendS3AndSparkHadoopConfigurations is called from HadoopTableReader).

For the first case, there is nothing we need to worry about. For the second case, I think the Hadoop conf should take priority, as it contains the configs specified by users at runtime.
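As a hedged sketch of that priority (for illustration only; this is not the change proposed in this PR, and it assumes the same Spark 2.x build as in the reproduction above, where SparkHadoopUtil is reachable from user code), one could re-apply the session-level entries after the spark.hadoop.* defaults have been copied in:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.SparkSession

// Build a Hadoop conf in which the session's SET values win over the
// spark.hadoop.* entries coming from spark-defaults.conf.
def hadoopConfWithSessionPriority(spark: SparkSession): Configuration = {
  val hadoopConf = spark.sessionState.newHadoopConf()   // carries the SET overrides
  SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations(
    spark.sparkContext.getConf, hadoopConf)              // may clobber them with defaults
  // re-apply the session conf so the runtime SET values win again
  spark.conf.getAll.foreach { case (k, v) => hadoopConf.set(k, v) }
  hadoopConf
}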

@vanzin
Contributor

vanzin commented Nov 28, 2018

So it's reasonable for users to expect that, if they set a Hadoop config via the SQL SET command, it should override the one in spark-defaults.conf.

I agree with that. But the previous explanation seemed to be saying that's the undesired behavior. Maybe I'm just having trouble understanding what @gjhkael wrote.

@srowen
Member

srowen commented Dec 4, 2018

@gjhkael can you clarify further what the undesirable behavior is, and what behavior you are looking for?

@AmplabJenkins

Can one of the admins verify this patch?

@tgravescs
Contributor

@gjhkael can you please clarify this or close it if it is no longer relevant?

@srowen srowen closed this Nov 6, 2019