Skip to content

[SPARK-31692][SQL][2.4] Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory #28529

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

karuppayya
Copy link
Contributor

@karuppayya karuppayya commented May 14, 2020

What changes were proposed in this pull request?

Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory

Why are the changes needed?

BEFORE

➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5793cd84

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@22846025

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem@5a930c03

AFTER

➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5c24a636

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream@2ba8f528

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = LocalFS

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass
res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem

The type of FileSystem object created(you can check the last statement in the above snippets) in the above two cases are different, which should not have been the case

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the jenkins with newly added test cases.

…treamHandlerfactory

Pass hadoop confs  specifed via Spark confs to URLStreamHandlerfactory

**BEFORE**
```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5793cd84

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream22846025

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem5a930c03
```

**AFTER**
```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState5c24a636

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream2ba8f528

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = LocalFS

scala>  FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass
res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem
```
The type of FileSystem object created(you can check the last statement in the above snippets) in the above two cases are different, which should not have been the case

No

Tested locally.
Added Unit test

Closes apache#28516 from karuppayya/SPARK-31692.

Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7260146)
@karuppayya
Copy link
Contributor Author

Backport commit for SPARK-31692 @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

ok to test

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 14, 2020

Thank you, @karuppayya ! I updated the PR description like the original PR.

@@ -157,7 +157,13 @@ private[sql] class SharedState(val sparkContext: SparkContext) extends Logging {

object SharedState extends Logging {
try {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
SparkSession.getActiveSession match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, is it the same with branch-3.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because we don't have a configuration patch in branch-2.4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we dont have the configuration patch

@SparkQA
Copy link

SparkQA commented May 14, 2020

Test build #122625 has finished for PR 28529 at commit e7b378f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 14, 2020

Test build #122627 has started for PR 28529 at commit c96d6b0.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 14, 2020

Hmm. In this case, I believe we had better have the conf patch because we can disable it when we have a regression on this PR. Let me take a look at that.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31692][SQL] Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory [SPARK-31692][SQL][2.4] Pass hadoop confs specifed via Spark confs to URLStreamHandlerfactory May 15, 2020
@dongjoon-hyun
Copy link
Member

Retest this please

@SparkQA
Copy link

SparkQA commented May 17, 2020

Test build #122748 has finished for PR 28529 at commit c96d6b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Thank you for your patience, @karuppayya .
For this PR, I backported two commits additionally as a preparation.

@dongjoon-hyun
Copy link
Member

@karuppayya . I cherry-picked your original commit from branch-3.0 to branch-2.4 and resolved the conflicts. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants