
[SPARK-31692][SQL] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory #28516


Closed
wants to merge 6 commits

Conversation

@karuppayya (Contributor) commented May 12, 2020

What changes were proposed in this pull request?

Pass the Hadoop confs specified via Spark confs (spark.hadoop.*) to the URLStreamHandlerFactory.
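
For context, a minimal sketch of the idea in spark-shell terms. This is illustrative only: the registration appears to live in SharedState (note the spark.sharedState calls in the repro below), and the actual patch may differ in detail; `spark` is assumed to be an active spark-shell session.

```scala
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// Illustrative: in spark-shell, this Configuration already contains the
// spark.hadoop.* entries from the Spark conf.
val hadoopConf: Configuration = spark.sparkContext.hadoopConfiguration

// Before the fix (sketch): the factory fell back to a default Configuration,
// so overrides such as spark.hadoop.fs.file.impl were invisible to it.
// URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())

// After the fix (sketch): the factory resolves FileSystems with the same
// Hadoop confs as the rest of Spark. Note that the JVM allows this
// registration to happen at most once per process.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(hadoopConf))
```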

Why are the changes needed?

BEFORE

```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5793cd84

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream@22846025

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = org.apache.hadoop.fs.LocalFileSystem@5a930c03
```

AFTER

```
➜  spark git:(SPARK-31692) ✗ ./bin/spark-shell --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

scala> spark.sharedState
res0: org.apache.spark.sql.internal.SharedState = org.apache.spark.sql.internal.SharedState@5c24a636

scala> new java.net.URL("file:///tmp/1.txt").openConnection.getInputStream
res1: java.io.InputStream = org.apache.hadoop.fs.FSDataInputStream@2ba8f528

scala> import org.apache.hadoop.fs._
import org.apache.hadoop.fs._

scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration)
res2: org.apache.hadoop.fs.FileSystem = LocalFS

scala> FileSystem.get(new Path("file:///tmp/1.txt").toUri, spark.sparkContext.hadoopConfiguration).getClass
res3: Class[_ <: org.apache.hadoop.fs.FileSystem] = class org.apache.hadoop.fs.RawLocalFileSystem
```

The type of the FileSystem object created (see the last statement in each snippet) differs between the two cases, which should not be the case.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested locally.
Added a unit test.
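
For illustration, a hypothetical sketch of what such a unit test could look like (the test actually added by the patch may differ):

```scala
// Hypothetical test sketch; not the exact test from the patch.
// Assumes a fresh JVM (the URL handler factory can be set only once per
// process) and that /tmp/1.txt exists, as in the repro above.
import java.net.URL
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem")
  .getOrCreate()

// Touching sharedState triggers registration of the URLStreamHandlerFactory.
spark.sharedState

val stream = new URL("file:///tmp/1.txt").openConnection().getInputStream()
// RawLocalFileSystem does no checksumming, so the stream must not be
// ChecksumFileSystem's wrapper (which is what the pre-fix behavior produced).
assert(!stream.getClass.getName.contains("ChecksumFileSystem"))
stream.close()
```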

@dongjoon-hyun (Member)

BTW, thank you for your first contribution, @karuppayya.

@dongjoon-hyun (Member)

ok to test

@dongjoon-hyun changed the title from [SPARK-31692] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory to [SPARK-31692][SQL] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory on May 13, 2020
@SparkQA commented May 13, 2020

Test build #122584 has finished for PR 28516 at commit 86f121b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented May 13, 2020

Thank you for updating, but the following is insufficient.

```
new URL("s3://<s3_path>/").openConnection().getInputStream
```

For example, people can try the following in Apache Spark 3.0.0 RC1. Could you elaborate a little more, in a reproducible way that other people can follow?

```
scala> spark.version
res3: String = 3.0.0

scala> new java.net.URL("s3://1.txt").openConnection().getInputStream()
java.net.MalformedURLException: unknown protocol: s3
```
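
Some background on the error above (an inference, not stated in the thread): java.net.URL only understands the JVM's built-in protocols (http, file, jar, ...) until a URLStreamHandlerFactory is registered, and Spark registers Hadoop's factory only when SharedState is initialized. A minimal sketch, assuming a FileSystem implementation for the s3 scheme is on the classpath and configured:

```scala
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// One-time, JVM-wide registration; Spark performs it when SharedState is
// initialized. Without it, "s3" is an unknown protocol and new URL(...)
// throws MalformedURLException, as shown above.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())

// This resolves only if a FileSystem implementation for the "s3" scheme
// (e.g. an fs.s3.impl entry) is available in the Hadoop configuration.
val url = new URL("s3://bucket/1.txt")
```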

@SparkQA commented May 14, 2020

Test build #122598 has finished for PR 28516 at commit 2e00254.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@karuppayya (Contributor, Author) commented May 14, 2020

I have come up with a repro using the local filesystem, which will be easier for testing; the BEFORE/AFTER spark-shell snippets are the ones now shown in the PR description above.
@dongjoon-hyun

@dongjoon-hyun (Member)

Thanks. Please put the description in the PR description. It will be perfect.

@dongjoon-hyun (Member)

I updated the PR description with yours, @karuppayya.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @karuppayya and @HyukjinKwon.
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request May 14, 2020
[SPARK-31692][SQL] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory

(The full commit message repeats the PR description above.)

Closes #28516 from karuppayya/SPARK-31692.

Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7260146)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@karuppayya (Contributor, Author)

Thanks @dongjoon-hyun for updating the PR description

@dongjoon-hyun (Member)

BTW, there is a conflict on branch-2.4.
Could you make a backport PR against branch-2.4, please?

@dongjoon-hyun (Member)

I added you to the Apache Spark contributor group and assigned you SPARK-31692. Thank you so much again.

@karuppayya (Contributor, Author)

Thanks @dongjoon-hyun, I will create a PR against 2.4 as well.

karuppayya added a commit to karuppayya/spark that referenced this pull request May 14, 2020
[SPARK-31692][SQL] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory

(The full commit message repeats the PR description above.)

Closes apache#28516 from karuppayya/SPARK-31692.

Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7260146)
dongjoon-hyun pushed a commit that referenced this pull request May 19, 2020
[SPARK-31692][SQL] Pass hadoop confs specified via Spark confs to URLStreamHandlerFactory

(The full commit message repeats the PR description above.)

Closes #28516 from karuppayya/SPARK-31692.

Authored-by: Karuppayya Rajendran <karuppayya1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7260146)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d639a12)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>