[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark #21990
Conversation
|
Test build #94149 has finished for PR 21990 at commit
|
Previously PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession created without checking the spark.sql.extensions parameter for additional session extensions. To fix this we instead use the SparkSession.builder() path, as SparkR does, which loads the extensions and allows their use in PySpark.
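For context, the class that spark.sql.extensions points to is a Scala function from SparkSessionExtensions to Unit. A minimal sketch of such a configurator (the package, class, and injected rule below are hypothetical, purely for illustration):

package com.example.sql

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical configurator; spark.sql.extensions would be set to
// "com.example.sql.MyExtensions" so the session builder can load and apply it.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Inject a no-op optimizer rule, just to show the injection API.
    extensions.injectOptimizerRule { session =>
      new Rule[LogicalPlan] {
        override def apply(plan: LogicalPlan): LogicalPlan = plan
      }
    }
  }
}

With this change, a PySpark session built with that configuration picks up the injected rule, just as a Scala or R session would.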
|
Test build #94151 has finished for PR 21990 at commit
|
|
@RussellSpitzer, let's close the other ones except for this one and name it |
python/pyspark/sql/session.py
Outdated
jsparkSession = self._jvm.SparkSession(self._jsc.sc())
jsparkSession = self._jvm.SparkSession.builder() \
    .sparkContext(self._jsc.sc()) \
    .getOrCreate()
@RussellSpitzer, mind checking the getOrCreate logic on the Scala side and deduplicating it here while we are at it? Some logic, for instance setting the default session, is duplicated here on the Python side and there on the Scala side.
It would be nicer if we had some tests as well. spark.sql.extensions is a static configuration, right? In that case, we could add a test; for example, please refer to #21007, where I added a test with a static configuration before.
Yeah, let me add in the test, and then I'll clear out all the Python duplication of Scala code. I can make it more of a wrapper and less of a reimplementation.
|
@RussellSpitzer, also please ping me here if you face any difficulties. I am willing to help or push some changes to your branch. |
|
Test build #94866 has finished for PR 21990 at commit
|
|
Test build #94870 has finished for PR 21990 at commit
|
python/pyspark/sql/tests.py
Outdated
This wouldn't be needed, since I added it for testing whether the callback is called in the PR pointed out above.
Sounds good to me, I'll take that out.
|
Test build #94873 has finished for PR 21990 at commit
|
|
Test build #94975 has finished for PR 21990 at commit
|
|
@HyukjinKwon So I've been staring at this for a while today, and I guess the big issue is that we always need to make a Python SparkContext to get a handle on the JavaGateway, so everything that happens before the context is made cannot just be a wrapper around the SQL methods and must be reimplemented. Unless we decide to refactor the code so that the JVM is more generally available (probably not possible), we will be stuck with redoing code in Python ... |
HyukjinKwon left a comment
@RussellSpitzer, does that mean it's difficult to deduplicate the logic between here and getOrCreate on the Scala side? I will take a closer look soon anyway.
This had better be addressed, since the current change actually executes quite duplicated code paths there ..
|
What I wanted was to just call the Scala methods, instead of having half the code in Scala and half in Python, but we create the JVM in the SparkContext creation code, so this ends up not being a good approach, I think. We could just translate the rest of getOrCreate into Python, but then every time the Scala code is patched it will need a Python modification as well. |
python/pyspark/sql/tests.py
Outdated
nit: SparkSessionExtensionSuite. -> SparkSessionExtensionSuite
@RussellSpitzer, I actually think we already duplicate some code between the Python side and the Scala side at this code path now (see ...). If that requires a bit of duplicated code on the Python side to avoid executing duplicated code paths, then I think that can be a workaround to get through for now. |
Adds a test which sets spark.sql.extensions to a custom extension class. This is the same as the SparkExtensionsSuite which does the same thing in Scala.
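For reference, the Scala-side pattern that suite follows is roughly the sketch below, reusing the hypothetical com.example.sql.MyExtensions class from earlier; this is only an illustration, not the actual suite code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Build a session with the static spark.sql.extensions conf set and verify
// the configuration reached the session; a real test would go further and
// assert that the injected rule or strategy actually takes effect.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("extensions-test")
  .set("spark.sql.extensions", "com.example.sql.MyExtensions")

val spark = SparkSession.builder().config(conf).getOrCreate()
try {
  assert(spark.conf.get("spark.sql.extensions") == "com.example.sql.MyExtensions")
} finally {
  spark.stop()
}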
|
@HyukjinKwon so you want me to rewrite the code in Python? I will note that SparkR is doing this exact same thing. |
|
Test build #96199 has finished for PR 21990 at commit
|
|
Added a new method of injecting extensions; this way the getOrCreate code from the Scala method is not needed. @HyukjinKwon |
|
Test build #96253 has finished for PR 21990 at commit
|
Previously the only way to add extensions to the session was via the getOrCreate method of the SparkSession Builder. To facilitate non-Scala session creation we add a new constructor which takes in just the context and extensions. Then we also add a new Extensions constructor which, given a SparkConf, generates an Extensions object with the user config already applied.
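Read literally, the session-side half of that commit amounts to something like the following. This is only a sketch of the shape described, not the exact diff from this PR:

// Sketch: a SparkSession auxiliary constructor that takes just the context and an
// already-configured extensions object, so non-Scala callers (py4j, SparkR) can
// create a session without going through Builder.getOrCreate().
private[sql] def this(sc: SparkContext, extensions: SparkSessionExtensions) = {
  this(sc, None, None, extensions)
}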
|
Test build #96257 has finished for PR 21990 at commit
|
|
I'm +1 on switching to the builder and not using the private interface. |
|
I'm fine with anything, really. I still think the ideal solution is probably not to tie the creation of the py4j gateway to the SparkContext, but that's probably a much bigger refactor. |
|
Test build #96825 has finished for PR 21990 at commit
|
|
Test build #96828 has finished for PR 21990 at commit
|
|
Test build #96827 has finished for PR 21990 at commit
|
python/pyspark/sql/session.py
Outdated
    jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
else:
    jsparkSession = self._jvm.SparkSession(self._jsc.sc())
|
|
Oh haha, let's get rid of this change
I'm addicted to whitespace apparently
 * Initialize extensions if the user has defined a configurator class in their SparkConf.
 * This class will be applied to the extensions passed into this function.
 */
private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
Let's make it private
|
|
private[sql] def this(sc: SparkContext) {
  this(sc, None, None, new SparkSessionExtensions)
  SparkSession.applyExtensionsFromConf(sc.getConf, this.extensions)
Let's add some comments on why this is only here in this constructor. It might look odd that this constructor specifically is the only one that needs to run applyExtensionsFromConf.
python/pyspark/sql/tests.py
Outdated
| "The callback from the query execution listener should be called after 'toPandas'") | ||
|
|
||
|
|
||
| class SparkExtensionsTest(unittest.TestCase, SQLTestUtils): |
I think SQLTestUtils is not needed.
 * This class will be applied to the extensions passed into this function.
 */
private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
  val extensionConfOption = conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
I think we can even pass just conf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS) as its argument instead of the SparkConf, and name it applyExtensions.
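A rough sketch of that suggested shape (illustrative only; the class loading and error handling mirror what Builder.getOrCreate already does, and the final merged code may differ):

// Apply the extensions class named by spark.sql.extensions, if any, to the
// given SparkSessionExtensions instance.
private[sql] def applyExtensions(
    extensionOption: Option[String],
    extensions: SparkSessionExtensions): Unit = {
  extensionOption.foreach { extensionConfClassName =>
    try {
      val extensionConfClass = Utils.classForName(extensionConfClassName)
      val extensionConf = extensionConfClass.getConstructor().newInstance()
        .asInstanceOf[SparkSessionExtensions => Unit]
      extensionConf(extensions)
    } catch {
      case _: ClassCastException | _: ClassNotFoundException | _: NoClassDefFoundError =>
        // Ignore and continue without the configured extensions;
        // the existing getOrCreate code logs a warning here.
    }
  }
}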
HyukjinKwon left a comment
LGTM otherwise
 * Initialize extensions if the user has defined a configurator class in their SparkConf.
 * This class will be applied to the extensions passed into this function.
 */
private[sql] def applyExtensionsFromConf(conf: SparkConf, extensions: SparkSessionExtensions) {
How about returning SparkSessionExtensions from this method, and modifying the secondary constructor of SparkSession as:
private[sql] def this(sc: SparkContext) {
  this(sc, None, None,
    SparkSession.applyExtensionsFromConf(sc.getConf, new SparkSessionExtensions))
}
I'm a little worried whether the order in which we apply the extensions might matter.
On second thought, we could move the method call to the top of the default constructor?
The default constructor of SparkSession?
I thought about this and was worried about multiple invocations of the extensions: once every time the SparkSession is cloned.
It's difficult here since I'm attempting to cause the least change in behavior for the old code paths :(
I am always a little nervous about having functions return objects they take in as parameters and then modify; it gives the impression that they are stateless when they actually mutate their input. If you think this is clearer, I can make the change.
I see, but in that case, we need to ensure that no injection of extensions is used in the default constructor to avoid initializing without injections from the conf.
Eh .. I think it's okay to have a function that returns the updated extensions.
Actually either way looks okay.
Updated with replacement then :)
|
Test build #97471 has finished for PR 21990 at commit
|
|
Addressed comments from @HyukjinKwon. I'm interested in @ueshin's suggestions, but I can't figure out how to do that unless we bake it into the Extensions constructor. If we place it in the Session's constructor, then invocations of "newSession" will reapply existing extensions. I added a note in the code. |
Removes SparkConf from applyExtensions; it now only accepts an Optional string which can contain a class name for extensions. Removed errant whitespace.
|
Test build #97472 has finished for PR 21990 at commit
|
It now returns the extensions it modifies.
|
Test build #97497 has finished for PR 21990 at commit
|
|
retest this please |
|
LGTM. |
|
Test build #97510 has finished for PR 21990 at commit
|
|
Merged to master. |
Master
Closes apache#21990 from RussellSpitzer/SPARK-25003-master.
Authored-by: Russell Spitzer <Russell.Spitzer@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
|
Why not port it to Spark < 3? |
Master
What changes were proposed in this pull request?
Previously PySpark used the private constructor for SparkSession when
building that object. This resulted in a SparkSession created without checking
the spark.sql.extensions parameter for additional session extensions. To fix
this we instead use the SparkSession.builder() path, as SparkR does, which
loads the extensions and allows their use in PySpark.
How was this patch tested?
An integration test was added which mimics the Scala test for the same feature.
Please review http://spark.apache.org/contributing.html before opening a pull request.