
[SPARK-25880][CORE] user set's hadoop conf should not overwrite by sparkcontext's conf #22887


Closed

Conversation


@gjhkael gjhkael commented Oct 30, 2018

What changes were proposed in this pull request?

A Hadoop conf value that the user sets through Spark SQL's SET command should not be overwritten by the SparkContext's conf, which is read from spark-defaults.conf.

How was this patch tested?

Manually verified with Spark 2.2.0.

@gjhkael gjhkael changed the title user set's hadoop conf should not overwrite by sparkcontext's conf [SPARK-25880] user set's hadoop conf should not overwrite by sparkcontext's conf Oct 30, 2018
@gjhkael gjhkael changed the title [SPARK-25880] user set's hadoop conf should not overwrite by sparkcontext's conf [SPARK-25880][Core] user set's hadoop conf should not overwrite by sparkcontext's conf Oct 30, 2018
@gjhkael gjhkael changed the title [SPARK-25880][Core] user set's hadoop conf should not overwrite by sparkcontext's conf [SPARK-25880][CORE] user set's hadoop conf should not overwrite by sparkcontext's conf Oct 30, 2018
@gjhkael
Author

gjhkael commented Oct 30, 2018

test this please

@gengliangwang
Member

Hi @gjhkael,
can you explain more about why you made this change?
Did you try spark.sessionState.newHadoopConf()?

@gjhkael
Author

gjhkael commented Oct 31, 2018

can you explain more about why you made this change?
Some Hadoop configurations are set in spark-defaults.conf because we want them to be global, but in some cases a user needs to override one of them and the override does not take effect, because the SparkContext's conf fills the hadoopConf again right before the Hadoop conf is broadcast.
Did you try spark.sessionState.newHadoopConf()?
We hit this problem in Spark SQL, not when using the DataFrame API.
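A rough sketch of the sequence being described (the class and method names are the ones discussed later in this thread; "foo" is just a placeholder key):

// 1. spark-defaults.conf contains:  spark.hadoop.foo = global-value
// 2. SQL "SET foo=session-value" stores the override in the session conf
// 3. sessionState.newHadoopConf() applies it, so hadoopConf has foo = session-value
// 4. appendS3AndSparkHadoopConfigurations(sparkContext.conf, hadoopConf)
//    re-applies spark.hadoop.foo, so hadoopConf goes back to foo = global-value
// 5. hadoopConf is broadcast, and tasks read global-value instead of the SET value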

@cxzl25
Contributor

cxzl25 commented Nov 9, 2018

A Hadoop conf value set by the user can't override the one from spark-defaults.conf:

SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations overwrites the user-set value with the spark.hadoop.* default from sparkSession.sparkContext.conf.

@gengliangwang @cloud-fan @gatorsmile
Could you please give some comments when you have time?
Thanks so much.

SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations(
  sparkSession.sparkContext.conf, hadoopConf)

private val _broadcastedHadoopConf =
  sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
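For context, the spark.hadoop.* copy performed inside appendS3AndSparkHadoopConfigurations amounts to roughly the following (a simplified sketch, not the actual Spark source; the real method also handles S3 credentials from the environment, among other things):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Copy every spark.hadoop.* entry from the SparkConf onto the Hadoop conf,
// replacing whatever value is already there.
def copySparkHadoopProperties(sparkConf: SparkConf, hadoopConf: Configuration): Unit = {
  sparkConf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    }
  }
}

Because this copy runs after sessionState.newHadoopConf() has applied the session-level SET values, the value from spark-defaults.conf silently wins.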

test:

spark-defaults.conf

spark.hadoop.mapreduce.input.fileinputformat.split.maxsize  2

spark-shell

val hadoopConfKey = "mapreduce.input.fileinputformat.split.maxsize"
spark.conf.get("spark.hadoop." + hadoopConfKey) // "2", from spark-defaults.conf
var hadoopConf = spark.sessionState.newHadoopConf()
hadoopConf.get(hadoopConfKey) // "2"

spark.conf.set(hadoopConfKey, 1) // user overrides the value to 1
hadoopConf = spark.sessionState.newHadoopConf()
hadoopConf.get(hadoopConfKey) // "1", the override is visible

// org.apache.spark.sql.hive.HadoopTableReader then appends the SparkContext conf
org.apache.spark.deploy.SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations(spark.sparkContext.getConf, hadoopConf)

// org.apache.spark.sql.hive.HadoopTableReader broadcasts this as _broadcastedHadoopConf
hadoopConf.get(hadoopConfKey) // "2" again: the user-set value is lost

@cloud-fan
Contributor

looks reasonable, cc @gatorsmile

@vanzin
Contributor

vanzin commented Nov 26, 2018

Sorry, this is a breaking change. It changes the behavior from "I can currently override any Hadoop configs, even final ones, using spark.hadoop.*" to "I can never do that".

If there's an issue with the SQL "set" command that needs to be addressed, this is the wrong place to do it.

Basically, if my "core-site.xml" says "mapreduce.input.fileinputformat.split.maxsize" is 2, and my Spark conf says "spark.hadoop.mapreduce.input.fileinputformat.split.maxsize" is 3, then the value from the config generated by the method you're changing must be 3.
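To spell out the expected precedence with the values from this example (a hedged illustration assuming exactly those two settings are in place; in spark-shell, sc.hadoopConfiguration is derived from the SparkContext conf through the same code path):

// core-site.xml:        mapreduce.input.fileinputformat.split.maxsize = 2
// spark-defaults.conf:  spark.hadoop.mapreduce.input.fileinputformat.split.maxsize = 3
sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.maxsize") // must be "3"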

@gjhkael
Author

gjhkael commented Nov 27, 2018

@vanzin Thanks for your review. I added a new commit to let the user's "set" command take effect. Let me know if you have an easier way. Thanks.

@cloud-fan
Contributor

Basically, if my "core-site.xml" says "mapreduce.input.fileinputformat.split.maxsize" is 2, and my Spark conf says "spark.hadoop.mapreduce.input.fileinputformat.split.maxsize" is 3, then the value from the config generated by the method you're changing must be 3.

I think this is what this PR tries to fix? The hadoopConf parameter of appendS3AndSparkHadoopConfigurations is either an empty one, or one from spark.sessionState.newHadoopConf(), which contains the user-provided Hadoop conf.

@vanzin
Contributor

vanzin commented Nov 27, 2018

I think this is what this PR tries to fix?

To be fair I'm not sure I fully understand the PR description. But I know that the previous patch (which I commented on) broke the functionality I described - not in the context of SQL, but in the context of everything else in Spark that calls that code.

@gjhkael
Author

gjhkael commented Nov 28, 2018

@vanzin @cloud-fan
The simplest description: a 'spark.hadoop.xxx' value that the user sets through the SET command does not override the same configuration set in the spark-defaults.conf file.
I don't know if that description makes sense?

@vanzin
Contributor

vanzin commented Nov 28, 2018

OK, that makes sense, in that I understand what you're saying, but I'm not sure it's what you actually want?

Why shouldn't "set spark.hadoop.*" override spark-defaults.conf?

But, in any case, it seems like the patch for SPARK-26060 (#23031) should take care of this (by raising an error).

@cloud-fan
Contributor

The Spark SQL SET command can't update any static config or Spark core configs, but I think Hadoop configs are different. They are not static, as users can update them via SparkContext.hadoopConfiguration. SparkSession.sessionState.newHadoopConf() is a mechanism that allows users to set Hadoop configs per session in Spark SQL.

So it's reasonable for users to expect that, if they set a Hadoop config via the SQL SET command, it should override the one in spark-defaults.conf.

Looking back at appendS3AndSparkHadoopConfigurations, it has 2 parameters: a Spark conf and a Hadoop conf. The Spark conf comes from spark-defaults.conf plus any user-provided configs supplied when building the SparkContext; the user-provided configs override spark-defaults.conf. The Hadoop conf is either an empty config (if appendS3AndSparkHadoopConfigurations is called from SparkHadoopUtil.newHadoopConfiguration), or one from SparkSession.sessionState.newHadoopConf() (if appendS3AndSparkHadoopConfigurations is called from HadoopTableReader).

For the first case, there is nothing we need to worry about. For the second case, I think the Hadoop conf should take priority, as it contains the configs specified by users at runtime.
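As a hedged sketch of that priority (for illustration only; this is not the change proposed in this PR, and it assumes the same Spark 2.x build as in the reproduction above, where SparkHadoopUtil is reachable from user code), one could re-apply the session-level entries after the spark.hadoop.* defaults have been copied in:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.SparkSession

// Build a Hadoop conf in which the session's SET values win over the
// spark.hadoop.* entries coming from spark-defaults.conf.
def hadoopConfWithSessionPriority(spark: SparkSession): Configuration = {
  val hadoopConf = spark.sessionState.newHadoopConf()   // carries the SET overrides
  SparkHadoopUtil.get.appendS3AndSparkHadoopConfigurations(
    spark.sparkContext.getConf, hadoopConf)              // may clobber them with defaults
  // re-apply the session conf so the runtime SET values win again
  spark.conf.getAll.foreach { case (k, v) => hadoopConf.set(k, v) }
  hadoopConf
}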

@vanzin
Contributor

vanzin commented Nov 28, 2018

So it's reasonable for users to expect that, if they set a Hadoop config via the SQL SET command, it should override the one in spark-defaults.conf.

I agree with that. But the previous explanation seemed to be saying that's the undesired behavior. Maybe I'm just having trouble understanding what @gjhkael wrote.

@srowen
Member

srowen commented Dec 4, 2018

@gjhkael can you clarify further what the undesirable behavior is, and what behavior you are looking for?

@AmplabJenkins

Can one of the admins verify this patch?

@tgravescs
Contributor

@gjhkael can you please clarify this or close it if it is no longer relevant?

@srowen srowen closed this Nov 6, 2019