
Contributor

@davies davies commented Sep 24, 2015

This PR improves session management by replacing the thread-local-based approach with one SQLContext per session, and introduces separate temporary tables and UDFs/UDAFs for each session.

A new SQLContext session can be created by:

  1. creating a new SQLContext
  2. calling newSession() on an existing SQLContext
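
A minimal Scala sketch of the two options (assuming a SparkContext named `sc` is already available):

```
import org.apache.spark.sql.SQLContext

// Option 1: a brand-new SQLContext, which now carries its own session.
val sqlContext = new SQLContext(sc)

// Option 2: a second session sharing the same SparkContext, cached data and jars,
// but with its own temporary tables, UDFs/UDAFs and SQL configuration.
val other = sqlContext.newSession()

sqlContext.range(10).registerTempTable("t")
// "t" is visible only in the session that registered it.
```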

For HiveContext, in order to reduce the per-session cost, the classloader and Hive client are shared across the sessions created by newSession().

The CacheManager is also shared by multiple sessions, so caching a table in different sessions will not create multiple copies of the in-memory cache.

Added jars are still shared by all the sessions, because SparkContext does not support sessions.

cc @marmbrus @yhuai @rxin


SparkQA commented Sep 24, 2015

Test build #42982 has finished for PR 8909 at commit bc9f064.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 24, 2015

Test build #42988 has finished for PR 8909 at commit 0cf26a5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

should this be synchronized?

Contributor Author

SimpleFunctionRegistry is not designed to be used by multiple threads (reading and writing at the same time). I think it's fine for now, because the builtin registry is constructed at the beginning.

Contributor

OK, this is technically not part of this PR, but since you are modifying it... shouldn't all of these catalogs be thread-safe, since they can be called by multiple threads?
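
For illustration, a hedged Scala sketch (not the PR's code) of the kind of guard being discussed here: serializing reads and writes on the mutable map behind a registry.

```
// Illustrative only: synchronize registration and lookup so that concurrent callers
// from different sessions cannot observe or corrupt a partially-updated map.
class SynchronizedFunctionRegistry[T] {
  private val functions = scala.collection.mutable.HashMap.empty[String, T]

  def registerFunction(name: String, builder: T): Unit = synchronized {
    functions.put(name.toLowerCase, builder)
  }

  def lookupFunction(name: String): Option[T] = synchronized {
    functions.get(name.toLowerCase)
  }
}
```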


SparkQA commented Sep 24, 2015

Test build #1814 has finished for PR 8909 at commit 0cf26a5.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 24, 2015

Test build #42991 has finished for PR 8909 at commit 3723768.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 25, 2015

Test build #42997 has finished for PR 8909 at commit 9c01c27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala

SparkQA commented Sep 30, 2015

Test build #43135 has finished for PR 8909 at commit 5160fea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05)


SparkQA commented Sep 30, 2015

Test build #43137 has finished for PR 8909 at commit e942e93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05)


SparkQA commented Oct 1, 2015

Test build #43141 has finished for PR 8909 at commit bcaddb3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05)


SparkQA commented Oct 1, 2015

Test build #43154 has finished for PR 8909 at commit 2e365ad.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HyperLogLogPlusPlus(child: Expression, relativeSD: Double = 0.05)

@davies davies changed the title from "[SPARK-10810] [WIP] [SQL] Improve session management in SQL" to "[SPARK-10810] [SQL] Improve session management in SQL" on Oct 1, 2015

SparkQA commented Oct 1, 2015

Test build #43158 has finished for PR 8909 at commit 69f87cb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */


SparkQA commented Oct 1, 2015

Test build #43155 has finished for PR 8909 at commit 9fe30cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 1, 2015

Test build #43160 has finished for PR 8909 at commit d10a97a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */


SparkQA commented Oct 1, 2015

Test build #43162 has finished for PR 8909 at commit 9b82c55.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */


SparkQA commented Oct 7, 2015

Test build #43356 has finished for PR 8909 at commit e16737c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */


SparkQA commented Oct 8, 2015

Test build #43357 has finished for PR 8909 at commit 3aeae47.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */

Contributor

This isn't used anymore except in tests, is it? Do we still need it?


SparkQA commented Oct 8, 2015

Test build #43350 has finished for PR 8909 at commit b70720f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */


SparkQA commented Oct 8, 2015

Test build #1856 has finished for PR 8909 at commit 3aeae47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */


SparkQA commented Oct 8, 2015

Test build #43407 has finished for PR 8909 at commit 19b91df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • /** Run a function within Hive state (SessionState, HiveConf, Hive client and class loader) */

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
Contributor Author

davies commented Oct 8, 2015

@marmbrus @yhuai @andrewor14 I should have addressed all of your comments (except adding more docs); could you take another pass? The failed test is not related to this change.


SparkQA commented Oct 8, 2015

Test build #43428 has finished for PR 8909 at commit ab5e46c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Non-hive parts LGTM!

Contributor Author

davies commented Oct 9, 2015

Merged into master; new comments will be addressed in a follow-up PR. Thank you all!

@asfgit asfgit closed this in 3390b40 Oct 9, 2015
@WangTaoTheTonic
Contributor

Nice work! As it is a blocker, could we merge this into branch-1.5 too?

Contributor

yhuai commented Oct 9, 2015

@WangTaoTheTonic Since this is not a bug fix (rather, a major new feature), we will not merge it into branch-1.5. It will be released with Spark 1.6.0.

Contributor

Should this be annotated @volatile ?

Contributor

Isn't AtomicReference itself just a wrapper around a volatile field?

Contributor

Yeah agreed. In fact @volatile should never be needed on a val.

Contributor

Alright. I was looking at getOrCreate() at line 1219; the access to instantiatedContext seems proper.


Neuw84 commented Jan 5, 2016

Since upgrading to Spark 1.6 I can't query temp tables created from Spark Streaming using Apache Toad.

The relevant Spark Streaming part:

```
window.foreachRDD((JavaRDD<Row> rdd, Time time) -> {
    if (!rdd.isEmpty()) {
        HiveContext sqlContext1 = JavaSQLContextSingleton.getInstance(rdd.context());
        if (sqlContext1.isCached("issues_recent")) {
            sqlContext1.uncacheTable("issues_recent");
        }
        DataFrame recentDataFrame = sqlContext1.createDataFrame(rdd, schema).coalesce(20);
        recentDataFrame.registerTempTable("issues_recent");
        recentDataFrame.cache();
    }
});
```

The SQLContext creation part:

```
class JavaSQLContextSingleton {

    private static transient HiveContext instance = null;

    public static HiveContext getInstance(SparkContext sparkContext) {
        if (instance == null) {
            instance = new HiveContext(sparkContext);
        }
        return instance;
    }
}
```

Not using Spark Streaming doesn't help with the problem either (loading a Parquet file and registering a temporary table).

I think that this "issue" may be related to this PR (maybe I am wrong and this is not the correct place to put this).

Any hints?

Contributor Author

davies commented Jan 5, 2016

@Neuw84 Since 1.6, each SQLContext has a separate session; you should share the same SQLContext in this case (using SQLContext.getOrCreate()).
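
A minimal Scala sketch of that suggestion, assuming the singleton helper above is replaced with getOrCreate() (names are illustrative):

```
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Reuse the already-created SQLContext instead of constructing a new one per caller,
// so the streaming job and the JDBC clients see the same session and temp tables.
def sqlContextFor(sc: SparkContext): SQLContext = SQLContext.getOrCreate(sc)
```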


deenar commented Jan 27, 2016

@Neuw84, I presume you are connecting TOAD to Spark SQL via the Thrift server. You might need to set spark.sql.hive.thriftServer.singleSession=true to preserve the pre-1.6 behaviour.
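
For example (a sketch only, assuming the flag is picked up from the Spark configuration when the Thrift server starts):

```
import org.apache.spark.SparkConf

// With this set before the Thrift server starts, all JDBC connections share one
// session, restoring the pre-1.6 behaviour for temp tables and UDFs.
val conf = new SparkConf()
  .set("spark.sql.hive.thriftServer.singleSession", "true")
```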


Neuw84 commented Jan 27, 2016

@deenar, I saw it after many hours of reading the code and the web docs. Still, I think HiveContext should implement the same logic as SQLContext, where you can get the same session programmatically. Thanks by the way!

Contributor

chutium commented Mar 17, 2016

Has anyone tried using --hivevar or --hiveconf with beeline after this change? Variables from the JDBC client are no longer available in the Hive session.
See https://issues.apache.org/jira/browse/SPARK-13983.

Is there any test case for --hivevar and --hiveconf in the test suites?

```
executionHive.addJar(path)
metadataHive.addJar(path)
Thread.currentThread().setContextClassLoader(executionHive.clientLoader.classLoader)
super.addJar(path)
```
Contributor

@davies @yhuai Do we need to restore the original classloader after calling super.addJar() or is it intentional that calling HiveContext.addJar will permanently mutate the calling thread's context classloader?
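
For reference, a rough Scala sketch (not the PR's code) of what restoring the caller's loader could look like, reusing the names from the snippet above:

```
override def addJar(path: String): Unit = {
  val original = Thread.currentThread().getContextClassLoader
  executionHive.addJar(path)
  metadataHive.addJar(path)
  Thread.currentThread().setContextClassLoader(executionHive.clientLoader.classLoader)
  try {
    super.addJar(path)
  } finally {
    // Put the original loader back so the calling thread is not permanently mutated.
    Thread.currentThread().setContextClassLoader(original)
  }
}
```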

dongjoon-hyun pushed a commit that referenced this pull request Apr 7, 2021
…s whitespaces in the path

### What changes were proposed in this pull request?

This PR fixes an issue where the `ADD JAR` command can't add jar files whose paths contain whitespace, though `ADD FILE` and `ADD ARCHIVE` work with such files.

If we have `/some/path/test file.jar` and execute the following command:

```
ADD JAR "/some/path/test file.jar";
```

The following exception is thrown.

```
21/04/05 10:40:38 ERROR SparkSQLDriver: Failed in [add jar "/some/path/test file.jar"]
java.lang.IllegalArgumentException: Illegal character in path at index 9: /some/path/test file.jar
	at java.net.URI.create(URI.java:852)
	at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:129)
	at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:34)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
```

This is because `HiveSessionStateBuilder` and `SessionStateBuilder` don't check whether the path is in URI form or a plain path; they always treat it as URI form.
Whitespace would have to be encoded as `%20`, so `/some/path/test file.jar` is rejected.
We can resolve this part by checking whether the given path is in URI form or not.

Unfortunately, if we fix this part, another problem occurs.
When we execute the `ADD JAR` command, Hive's `ADD JAR` command is executed in `HiveClientImpl.addJar` and `AddResourceProcessor.run` is transitively invoked.
In `AddResourceProcessor.run`, the command line is simply split by `\s+`, so the path is also split into `/some/path/test` and `file.jar` before being passed to `ss.add_resources`.
https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/processors/AddResourceProcessor.java#L56-L75
So, the command still fails.

Even if we convert the path to URI form like `file:/some/path/test%20file.jar` and execute the following command:

```
ADD JAR "file:/some/path/test%20file.jar";
```

The following exception is thrown.

```
21/04/05 10:40:53 ERROR SessionState: file:/some/path/test%20file.jar does not exist
java.lang.IllegalArgumentException: file:/some/path/test%20file.jar does not exist
	at org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:1168)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1289)
	at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1278)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1378)
	at org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1336)
	at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:74)
```

The reason is that `Utilities.realFile`, invoked in `SessionState.validateFiles`, returns `null` because `fs.exists(path)` is `false`.
https://github.com/apache/hive/blob/f1e87137034e4ecbe39a859d4ef44319800016d7/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1052-L1064

`fs.exists` checks the existence of the given path by comparing the string representation of Hadoop's `Path`.
The string representation of `Path` is similar to a URI, but it's actually different: `Path` doesn't encode the given path.
For example, the URI form of `/some/path/jar file.jar` is `file:/some/path/jar%20file.jar`, but the `Path` form is `file:/some/path/jar file.jar`, so `fs.exists` returns `false`.
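
A small Scala illustration of the mismatch (assumed, not from the patch):

```
import org.apache.hadoop.fs.Path

// java.io.File.toURI encodes the space, while Hadoop's Path keeps it verbatim,
// so the two string forms of the same file never match.
val uri  = new java.io.File("/some/path/jar file.jar").toURI  // file:/some/path/jar%20file.jar
val path = new Path("file:/some/path/jar file.jar")           // file:/some/path/jar file.jar

// java.net.URI.create("/some/path/jar file.jar") fails outright ("Illegal character in path")
// because of the unencoded space, which is the first exception shown above.
```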

So the solution I came up with is removing Hive's `ADD JAR` from `HiveClientImpl.addJar`.
I think Hive's `ADD JAR` was used to add jar files to the class loader for metadata and isolate that class loader from the one for execution.
https://github.com/apache/spark/pull/6758/files#diff-cdb07de713c84779a5308f65be47964af865e15f00eb9897ccf8a74908d581bbR94-R103

But now that SPARK-10810 and SPARK-10902 (#8909) are resolved, the class loaders for metadata and execution seem to be isolated in a different way.
https://github.com/apache/spark/pull/8909/files#diff-8ef7cabf145d3fe7081da799fa415189d9708892ed76d4d13dd20fa27021d149R635-R641

In the current implementation, such class loaders seem to be isolated by `SharedState.jarClassLoader` and `IsolatedClientLoader.classLoader`.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala#L173-L188
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L956-L967

So I wonder whether we can remove Hive's `ADD JAR` from `HiveClientImpl.addJar`.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #32052 from sarutak/add-jar-whitespace.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Apr 15, 2021
### What changes were proposed in this pull request?

This PR refactors three parts of the comments in `HiveClientImpl.withHiveState`.

The first is about the following comment.
```
// The classloader in clientLoader could be changed after addJar, always use the latest
// classloader.
```
The comment was added in SPARK-10810 (#8909) because `IsolatedClientLoader.classLoader` was declared as `var`.
But the field is now `val` and cannot be changed after instantiation.
So, the comment can confuse developers.

The second is about the following code and comment.
```
// classloader. We explicitly set the context class loader since "conf.setClassLoader" does
// not do that, and the Hive client libraries may need to load classes defined by the client's
// class loader.
Thread.currentThread().setContextClassLoader(clientLoader.classLoader)
```
It's not obvious why this part is necessary, and it's difficult to know when we can remove this code in the future.
So, I revised the comment by adding a reference to the related JIRA.

And the last one is about the following code and comment.
```
// Replace conf in the thread local Hive with current conf
Hive.get(conf)
```
It's also not obvious why this part is necessary.
I revised the comment by adding a reference to the related discussion.

### Why are the changes needed?

To make code more readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It's just a comment refactoring, so I added no new test.

Closes #32162 from sarutak/refactor-HiveClientImpl.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>