[SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL #8909
Conversation
Test build #42982 has finished for PR 8909 at commit
Test build #42988 has finished for PR 8909 at commit
should this be synchronized?
SimpleFunctionRegistry is not designed to be used by multiple threads (reading and writing at the same time). I think it's fine now, because the builtin registry is constructed at the beginning.
OK, this is technically not part of this PR, but since you are modifying it ... shouldn't all of these catalogs be thread-safe, since they can be called by multiple threads?
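A minimal sketch of the kind of thread safety the comment asks about; the class and its members are illustrative, not Spark's actual registry code:

```scala
import scala.collection.mutable

// Hypothetical sketch: serializing reads and writes on the internal map,
// so concurrent sessions can register and look up functions safely.
class ThreadSafeFunctionRegistry {
  private val functions = mutable.Map.empty[String, Seq[Any] => Any]

  def register(name: String, fn: Seq[Any] => Any): Unit = synchronized {
    functions(name) = fn
  }

  def lookup(name: String): Option[Seq[Any] => Any] = synchronized {
    functions.get(name)
  }
}
```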
Test build #1814 has finished for PR 8909 at commit
Test build #42991 has finished for PR 8909 at commit
Test build #42997 has finished for PR 8909 at commit
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
Test build #43135 has finished for PR 8909 at commit
Test build #43137 has finished for PR 8909 at commit
Test build #43141 has finished for PR 8909 at commit
Test build #43154 has finished for PR 8909 at commit
Test build #43158 has finished for PR 8909 at commit
Test build #43155 has finished for PR 8909 at commit
Test build #43160 has finished for PR 8909 at commit
Test build #43162 has finished for PR 8909 at commit
Test build #43356 has finished for PR 8909 at commit
Test build #43357 has finished for PR 8909 at commit
This isn't used anymore except in tests, is it? Do we still need it?
Test build #43350 has finished for PR 8909 at commit
Test build #1856 has finished for PR 8909 at commit
Test build #43407 has finished for PR 8909 at commit
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
@marmbrus @yhuai @andrewor14 I should have addressed all of your comments (except adding more docs); could you take another pass? The failed test is not related to this change.
Test build #43428 has finished for PR 8909 at commit
Non-Hive parts LGTM!
Merged into master. New comments will be addressed in a follow-up PR. Thank you all!
Nice work! As it is a blocker, could we merge this into branch-1.5 too?
@WangTaoTheTonic Since this is not a bug fix (instead, a major new feature), we will not merge it to branch-1.5. It will be released with Spark 1.6.0.
Should this be annotated @volatile?
Isn't AtomicReference itself just a wrapper around a volatile field?
Yeah agreed. In fact @volatile should never be needed on a val.
Alright. I was looking at getOrCreate() at line 1219; the access to instantiatedContext seems proper.
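To illustrate the point being agreed on: an AtomicReference held in a val already provides volatile read/write semantics for the value it wraps, so @volatile on the val would be redundant. A minimal sketch of a getOrCreate-style pattern (names are illustrative, not Spark's exact code):

```scala
import java.util.concurrent.atomic.AtomicReference

object ContextHolder {
  // The val itself never changes; the AtomicReference supplies the
  // volatile semantics for the value it wraps, so no @volatile is needed.
  private val instantiatedContext = new AtomicReference[AnyRef](null)

  def getOrCreate(create: => AnyRef): AnyRef = synchronized {
    Option(instantiatedContext.get()).getOrElse {
      val ctx = create
      instantiatedContext.set(ctx)
      ctx
    }
  }
}
```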
Since upgrading to Spark 1.6 I can't query temp tables created from Spark Streaming using Apache Toad. [The relevant Spark Streaming snippet and the SQLContext creation snippet were included here.] Doing it without Spark Streaming (loading a parquet file and registering a temporary table) doesn't help with the problem either. I think this issue may be related to this PR (maybe I am wrong and this is not the correct place to ask). Any hints?
@Neuw84 Since 1.6, each SQLContext has its own separate session; you should share the same SQLContext in this case (using SQLContext.getOrCreate()).
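A minimal sketch of that suggestion, assuming a running SparkContext `sc`; the table name is illustrative. Both the streaming job and the query side should obtain the context this way so they share one session:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

def querySide(sc: SparkContext): Unit = {
  // Reuse the singleton instead of `new SQLContext(sc)`, which as of 1.6
  // would create a fresh session with its own temp tables.
  val sqlContext = SQLContext.getOrCreate(sc)
  sqlContext.sql("SELECT * FROM streamed_table").show() // `streamed_table` is hypothetical
}
```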
@Neuw84, I presume you are connecting Toad to Spark SQL via the Thrift server. You might need to set spark.sql.hive.thriftServer.singleSession=true to preserve the pre-1.6 behaviour.
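One way to apply that setting when starting the Thrift server from code (a sketch, assuming Spark 1.6; the app name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val conf = new SparkConf()
  .setAppName("thriftserver-single-session")
  // Restore the pre-1.6 behaviour: all JDBC connections share one session,
  // so temp tables registered on hiveContext stay visible to external clients.
  .set("spark.sql.hive.thriftServer.singleSession", "true")

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Expose the context over JDBC so tools like Toad/beeline can query it.
HiveThriftServer2.startWithContext(hiveContext)
```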
@deenar, I saw it after many hours of reading the code and the web docs. Still, I think that HiveContext should implement the same logic as SQLContext, where you can get the same session programmatically. Thanks by the way.
Has anyone tried using --hivevar or --hiveconf with beeline after this change? Variables from the JDBC client are no longer available in the Hive session. Is there any test case for --hivevar and --hiveconf in the test suites?
```scala
// Add the jar to both the execution and metadata Hive clients, then make
// the current thread use the execution client's classloader before
// delegating to the default behavior.
executionHive.addJar(path)
metadataHive.addJar(path)
Thread.currentThread().setContextClassLoader(executionHive.clientLoader.classLoader)
super.addJar(path)
```
…s whitespaces in the path

### What changes were proposed in this pull request?

This PR fixes an issue where the `ADD JAR` command can't add jar files whose paths contain whitespace, even though `ADD FILE` and `ADD ARCHIVE` work with such files. If we have `/some/path/test file.jar` and execute the following command:

```
ADD JAR "/some/path/test file.jar";
```

the following exception is thrown:

```
21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
  at java.net.URI.create(URI.java:852)
  at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
  at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```

This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't check whether the given path is in URI form or plain-path form; they always treat it as URI form. Whitespace must be encoded as `%20` in a URI, so `/some/path/test file.jar` is rejected. We can resolve this part by checking whether the given path is in URI form or not.

Unfortunately, if we fix this part, another problem occurs. When we execute the `ADD JAR` command, Hive's `ADD JAR` command is executed in `HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively invoked. In `AddResourceProcessor.run`, the command line is simply split by `\s+`, so the path is split into `/some/path/test` and `file.jar` before being passed to `ss.add_resources`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75

So the command still fails. Even if we convert the path to URI form like `file:/some/path/test%20file.jar` and execute the following command:

```
ADD JAR "file:/some/path/test%20file.jar";
```

the following exception is thrown:

```
21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
  at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
  at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
  at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
  at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
  at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
  at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
```

The reason is that `Utilities.realFile`, invoked in `SessionState.validateFiles`, returns `null` because `fs.exists(path)` is `false`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064

`fs.exists` checks the existence of the given path by comparing the string representation of Hadoop's `Path`. The string representation of `Path` is similar to a URI but is actually different: `Path` doesn't encode the given path. For example, the URI form of `/some/path/jar file.jar` is `file:/some/path/jar%20file.jar`, but the `Path` form of it is `file:/some/path/jar file.jar`, so `fs.exists` returns `false`.

So the solution I came up with is removing Hive's `ADD JAR` from `HiveClientImpl.addJar`. I think Hive's `ADD JAR` was used to add jar files to the classloader for metadata and to isolate that classloader from the one for execution.

https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But since SPARK-10810 and SPARK-10902 (#8909) were resolved, the classloaders for metadata and execution seem to be isolated in a different way.

https://github.com/apache/spark/pull/8909/files#diff-8ef7cabf145d3fe7081da799fa415189d9708892ed76d4d13dd20fa27021d149R635-R641

In the current implementation, those classloaders seem to be isolated by `SharedState.jarClassLoader` and `IsolatedClientLoader.classLoader`.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L173-L188
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L956-L967

So I wonder if we can remove Hive's `ADD JAR` from `HiveClientImpl.addJar`.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #32052 from sarutak/add-jar-whitespace.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
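A small sketch of the URI-vs-Path mismatch described above (requires hadoop-common on the classpath; the path is the example from the commit message):

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

object UriVsPath {
  def main(args: Array[String]): Unit = {
    val raw = "/some/path/jar file.jar"

    // URI.create rejects the unencoded space outright:
    // java.lang.IllegalArgumentException: Illegal character in path ...
    // URI.create(raw)

    // The multi-argument URI constructor encodes the space as %20.
    val uri = new URI("file", null, raw, null)
    println(uri)        // file:/some/path/jar%20file.jar

    // Hadoop's Path keeps the space verbatim in its string form, so string
    // comparisons against the encoded URI form fail.
    val p = new Path(raw)
    println(p.toString) // /some/path/jar file.jar
  }
}
```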
### What changes were proposed in this pull request?

This PR refactors three parts of the comments in `HiveClientImpl.withHiveState`.

One is the following comment:

```
// The classloader in clientLoader could be changed after addJar, always use the latest
// classloader.
```

The comment was added in SPARK-10810 (#8909) because `IsolatedClientLoader.classLoader` was declared as a `var`. But the field is now a `val` and cannot be changed after instantiation, so the comment can confuse developers.

Another is the following code and comment:

```
// classloader. We explicitly set the context class loader since "conf.setClassLoader" does
// not do that, and the Hive client libraries may need to load classes defined by the client's
// class loader.
Thread.currentThread().setContextClassLoader(clientLoader.classLoader)
```

It's not obvious why this part is necessary, and it's difficult to tell when we could remove this code in the future. So I revised the comment by adding a reference to the related JIRA.

The last one is the following code and comment:

```
// Replace conf in the thread local Hive with current conf
Hive.get(conf)
```

It's also not obvious why this part is necessary. I revised the comment by adding a reference to the related discussion.

### Why are the changes needed?

To make the code more readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It's just a comment refactoring, so no new tests were added.

Closes #32162 from sarutak/refactor-HiveClientImpl.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
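For context, the pattern the quoted comments surround looks roughly like the following save-and-restore of the thread's context classloader (a simplified sketch, not Spark's exact withHiveState code; the method name is illustrative):

```scala
// Simplified sketch of the set-and-restore pattern: Hive client libraries
// may load classes via the thread's context classloader, which
// conf.setClassLoader alone does not switch.
def withClientClassLoader[A](clientClassLoader: ClassLoader)(body: => A): A = {
  val original = Thread.currentThread().getContextClassLoader
  Thread.currentThread().setContextClassLoader(clientClassLoader)
  try body
  finally Thread.currentThread().setContextClassLoader(original)
}
```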
This PR improves session management by replacing the thread-local-based approach with one SQLContext per session, introducing separate temporary tables and UDFs/UDAFs for each session.
A new session of SQLContext can be created by calling newSession() on an existing SQLContext, as sketched below.
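A minimal sketch, assuming an existing SparkContext `sc`; the table name is illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
val session2 = sqlContext.newSession()

// Temporary tables are per-session after this change:
sqlContext.range(10).registerTempTable("t")
sqlContext.sql("SELECT count(*) FROM t").show() // works
// session2.sql("SELECT count(*) FROM t")       // fails: `t` is not visible in session2
```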
For HiveContext, in order to reduce the per-session cost, the classloader and the Hive client are shared across multiple sessions (created by newSession).
The CacheManager is also shared by multiple sessions, so caching a table multiple times in different sessions will not create multiple copies of the in-memory cache.
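A sketch of what that sharing means in practice (assumes `logs` is a metastore table visible to both sessions; the name is hypothetical):

```scala
val session1 = sqlContext
val session2 = sqlContext.newSession()

session1.cacheTable("logs")        // populates the shared CacheManager
println(session2.isCached("logs")) // true: same cache entry, no second copy
```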
Added jars are still shared by all the sessions, because SparkContext does not support sessions.
cc @marmbrus @yhuai @rxin