[SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL #8909
Conversation
Test build #42982 has finished for PR 8909 at commit
Test build #42988 has finished for PR 8909 at commit
should this be synchronized?
SimpleFunctionRegistry is not designed to be used by multiple threads (reading and writing at the same time). I think it's fine now, because the builtin registry is constructed at the beginning.
OK, this is technically not part of this PR, but since you are modifying it ... shouldn't all of these catalogs be thread-safe, since they can be called by multiple threads?
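A minimal sketch of the kind of thread safety the comment asks about; the class and its members are illustrative, not Spark's actual registry code:

```scala
import scala.collection.mutable

// Hypothetical sketch: serializing reads and writes on the internal map,
// so concurrent sessions can register and look up functions safely.
class ThreadSafeFunctionRegistry {
  private val functions = mutable.Map.empty[String, Seq[Any] => Any]

  def register(name: String, fn: Seq[Any] => Any): Unit = synchronized {
    functions(name) = fn
  }

  def lookup(name: String): Option[Seq[Any] => Any] = synchronized {
    functions.get(name)
  }
}
```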
Test build #1814 has finished for PR 8909 at commit
Test build #42991 has finished for PR 8909 at commit
Test build #42997 has finished for PR 8909 at commit
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
Test build #43135 has finished for PR 8909 at commit
Test build #43137 has finished for PR 8909 at commit
Test build #43141 has finished for PR 8909 at commit
Test build #43154 has finished for PR 8909 at commit
Test build #43158 has finished for PR 8909 at commit
Test build #43155 has finished for PR 8909 at commit
Test build #43160 has finished for PR 8909 at commit
Test build #43162 has finished for PR 8909 at commit
Test build #43356 has finished for PR 8909 at commit
Test build #43357 has finished for PR 8909 at commit
This isn't used anymore except in tests, is it? Do we still need it?
Test build #43350 has finished for PR 8909 at commit
Test build #1856 has finished for PR 8909 at commit
Test build #43407 has finished for PR 8909 at commit
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
@marmbrus @yhuai @andrewor14 I should have addressed all of your comments (except adding more docs); could you take another pass? The failed test is not related to this change.
Test build #43428 has finished for PR 8909 at commit
Non-Hive parts LGTM!
Merged into master. New comments will be addressed in a follow-up PR. Thank you all!
Nice work! As it is a blocker, could we merge this into branch-1.5 too?
@WangTaoTheTonic Since this is not a bug fix (instead, a major new feature), we will not merge it to branch-1.5. It will be released with Spark 1.6.0.
Should this be annotated @volatile?
Isn't AtomicReference itself just a wrapper around a volatile field?
Yeah agreed. In fact @volatile should never be needed on a val.
Alright. I was looking at getOrCreate() at line 1219; the access to instantiatedContext seems proper.
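To illustrate the point being agreed on: an AtomicReference held in a val already provides volatile read/write semantics for the value it wraps, so @volatile on the val would be redundant. A minimal sketch of a getOrCreate-style pattern (names are illustrative, not Spark's exact code):

```scala
import java.util.concurrent.atomic.AtomicReference

object ContextHolder {
  // The val itself never changes; the AtomicReference supplies the
  // volatile semantics for the value it wraps, so no @volatile is needed.
  private val instantiatedContext = new AtomicReference[AnyRef](null)

  def getOrCreate(create: => AnyRef): AnyRef = synchronized {
    Option(instantiatedContext.get()).getOrElse {
      val ctx = create
      instantiatedContext.set(ctx)
      ctx
    }
  }
}
```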
Since upgrading to Spark 1.6 I can't query temp tables created from Spark Streaming using Apache Toad. [The relevant Spark Streaming snippet and the SQLContext creation snippet were included here.] Doing it without Spark Streaming (loading a parquet file and registering a temporary table) doesn't help with the problem either. I think this issue may be related to this PR (maybe I am wrong and this is not the correct place to ask). Any hints?
@Neuw84 Since 1.6, each SQLContext has its own separate session; you should share the same SQLContext in this case (using SQLContext.getOrCreate()).
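A minimal sketch of that suggestion, assuming a running SparkContext `sc`; the table name is illustrative. Both the streaming job and the query side should obtain the context this way so they share one session:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

def querySide(sc: SparkContext): Unit = {
  // Reuse the singleton instead of `new SQLContext(sc)`, which as of 1.6
  // would create a fresh session with its own temp tables.
  val sqlContext = SQLContext.getOrCreate(sc)
  sqlContext.sql("SELECT * FROM streamed_table").show() // `streamed_table` is hypothetical
}
```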
@Neuw84, I presume you are connecting Toad to Spark SQL via the Thrift server. You might need to set spark.sql.hive.thriftServer.singleSession=true to preserve the pre-1.6 behaviour.
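One way to apply that setting when starting the Thrift server from code (a sketch, assuming Spark 1.6; the app name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val conf = new SparkConf()
  .setAppName("thriftserver-single-session")
  // Restore the pre-1.6 behaviour: all JDBC connections share one session,
  // so temp tables registered on hiveContext stay visible to external clients.
  .set("spark.sql.hive.thriftServer.singleSession", "true")

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

// Expose the context over JDBC so tools like Toad/beeline can query it.
HiveThriftServer2.startWithContext(hiveContext)
```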
@deenar, I saw it after many hours of reading the code and the web docs. Still, I think that HiveContext should implement the same logic as SQLContext, where you can get the same session programmatically. Thanks by the way.
Has anyone tried using --hivevar or --hiveconf with beeline after this change? Variables from the JDBC client are no longer available in the Hive session. Is there any test case for --hivevar and --hiveconf in the test suites?
```scala
// Add the jar to both the execution and metadata Hive clients, then make
// the current thread use the execution client's classloader before
// delegating to the default behavior.
executionHive.addJar(path)
metadataHive.addJar(path)
Thread.currentThread().setContextClassLoader(executionHive.clientLoader.classLoader)
super.addJar(path)
```
…s whitespaces in the path

### What changes were proposed in this pull request?

This PR fixes an issue where the `ADD JAR` command can't add jar files whose paths contain whitespace, even though `ADD FILE` and `ADD ARCHIVE` work with such files. If we have `/some/path/test file.jar` and execute the following command:

```
ADD JAR "/some/path/test file.jar";
```

the following exception is thrown:

```
21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
  at java.net.URI.create(URI.java:852)
  at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
  at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```

This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't check whether the given path is in URI form or plain-path form; they always treat it as URI form. Whitespace must be encoded as `%20` in a URI, so `/some/path/test file.jar` is rejected. We can resolve this part by checking whether the given path is in URI form or not.

Unfortunately, if we fix this part, another problem occurs. When we execute the `ADD JAR` command, Hive's `ADD JAR` command is executed in `HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively invoked. In `AddResourceProcessor.run`, the command line is simply split by `\s+`, so the path is split into `/some/path/test` and `file.jar` before being passed to `ss.add_resources`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75

So the command still fails. Even if we convert the path to URI form like `file:/some/path/test%20file.jar` and execute the following command:

```
ADD JAR "file:/some/path/test%20file.jar";
```

the following exception is thrown:

```
21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
  at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
  at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
  at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
  at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
  at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
  at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
```

The reason is that `Utilities.realFile`, invoked in `SessionState.validateFiles`, returns `null` because `fs.exists(path)` is `false`.

https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064

`fs.exists` checks the existence of the given path by comparing the string representation of Hadoop's `Path`. The string representation of `Path` is similar to a URI but is actually different: `Path` doesn't encode the given path. For example, the URI form of `/some/path/jar file.jar` is `file:/some/path/jar%20file.jar`, but the `Path` form of it is `file:/some/path/jar file.jar`, so `fs.exists` returns `false`.

So the solution I came up with is removing Hive's `ADD JAR` from `HiveClientImpl.addJar`. I think Hive's `ADD JAR` was used to add jar files to the classloader for metadata and to isolate that classloader from the one for execution.

https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But since SPARK-10810 and SPARK-10902 (#8909) were resolved, the classloaders for metadata and execution seem to be isolated in a different way.

https://github.com/apache/spark/pull/8909/files#diff-8ef7cabf145d3fe7081da799fa415189d9708892ed76d4d13dd20fa27021d149R635-R641

In the current implementation, those classloaders seem to be isolated by `SharedState.jarClassLoader` and `IsolatedClientLoader.classLoader`.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L173-L188
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L956-L967

So I wonder if we can remove Hive's `ADD JAR` from `HiveClientImpl.addJar`.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #32052 from sarutak/add-jar-whitespace.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
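A small sketch of the URI-vs-Path mismatch described above (requires hadoop-common on the classpath; the path is the example from the commit message):

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

object UriVsPath {
  def main(args: Array[String]): Unit = {
    val raw = "/some/path/jar file.jar"

    // URI.create rejects the unencoded space outright:
    // java.lang.IllegalArgumentException: Illegal character in path ...
    // URI.create(raw)

    // The multi-argument URI constructor encodes the space as %20.
    val uri = new URI("file", null, raw, null)
    println(uri)        // file:/some/path/jar%20file.jar

    // Hadoop's Path keeps the space verbatim in its string form, so string
    // comparisons against the encoded URI form fail.
    val p = new Path(raw)
    println(p.toString) // /some/path/jar file.jar
  }
}
```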
### What changes were proposed in this pull request?

This PR refactors three parts of the comments in `HiveClientImpl.withHiveState`.

One is the following comment:

```
// The classloader in clientLoader could be changed after addJar, always use the latest
// classloader.
```

The comment was added in SPARK-10810 (#8909) because `IsolatedClientLoader.classLoader` was declared as a `var`. But the field is now a `val` and cannot be changed after instantiation, so the comment can confuse developers.

Another is the following code and comment:

```
// classloader. We explicitly set the context class loader since "conf.setClassLoader" does
// not do that, and the Hive client libraries may need to load classes defined by the client's
// class loader.
Thread.currentThread().setContextClassLoader(clientLoader.classLoader)
```

It's not obvious why this part is necessary, and it's difficult to tell when we could remove this code in the future. So I revised the comment by adding a reference to the related JIRA.

The last one is the following code and comment:

```
// Replace conf in the thread local Hive with current conf
Hive.get(conf)
```

It's also not obvious why this part is necessary. I revised the comment by adding a reference to the related discussion.

### Why are the changes needed?

To make the code more readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It's just a comment refactoring, so no new tests were added.

Closes #32162 from sarutak/refactor-HiveClientImpl.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
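For context, the pattern the quoted comments surround looks roughly like the following save-and-restore of the thread's context classloader (a simplified sketch, not Spark's exact withHiveState code; the method name is illustrative):

```scala
// Simplified sketch of the set-and-restore pattern: Hive client libraries
// may load classes via the thread's context classloader, which
// conf.setClassLoader alone does not switch.
def withClientClassLoader[A](clientClassLoader: ClassLoader)(body: => A): A = {
  val original = Thread.currentThread().getContextClassLoader
  Thread.currentThread().setContextClassLoader(clientClassLoader)
  try body
  finally Thread.currentThread().setContextClassLoader(original)
}
```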
This PR improves session management by replacing the thread-local-based approach with one SQLContext per session, introducing separate temporary tables and UDFs/UDAFs for each session.
A new session of SQLContext can be created by calling newSession() on an existing SQLContext, as sketched below.
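A minimal sketch, assuming an existing SparkContext `sc`; the table name is illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
val session2 = sqlContext.newSession()

// Temporary tables are per-session after this change:
sqlContext.range(10).registerTempTable("t")
sqlContext.sql("SELECT count(*) FROM t").show() // works
// session2.sql("SELECT count(*) FROM t")       // fails: `t` is not visible in session2
```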
For HiveContext, in order to reduce the per-session cost, the classloader and the Hive client are shared across multiple sessions (created by newSession).
The CacheManager is also shared by multiple sessions, so caching a table multiple times in different sessions will not create multiple copies of the in-memory cache.
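A sketch of what that sharing means in practice (assumes `logs` is a metastore table visible to both sessions; the name is hypothetical):

```scala
val session1 = sqlContext
val session2 = sqlContext.newSession()

session1.cacheTable("logs")        // populates the shared CacheManager
println(session2.isCached("logs")) // true: same cache entry, no second copy
```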
Added jars are still shared by all the sessions, because SparkContext does not support sessions.
cc @marmbrus @yhuai @rxin