[SPARK-25255][PYTHON] Add getActiveSession to SparkSession in PySpark #22295
Conversation
Test build #95507 has finished for PR 22295 at commit
Test build #95509 has finished for PR 22295 at commit
python/pyspark/sql/session.py
Outdated
Does this return a JVM instance?
@HyukjinKwon Sorry for the late reply. Yes, this returns a JVM instance.
In the Scala code, SparkSession.getActiveSession returns an Option[SparkSession].
I am not sure how to write a Python equivalent of Scala's Option. In the following code, is there a way to wrap the Python session in the else path in something equivalent to Scala's Option? If not, can I just return the Python session?
if self._jsparkSession.getActiveSession() is None:
    return None
else:
    return self.__class__(self._sc, self._jsparkSession.getActiveSession().get())
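For illustration, the usual Pythonic stand-in for Scala's Option is simply the value or None. A minimal sketch, assuming the py4j proxy for the Scala Option exposes isDefined()/get() (as it is used later in this thread):

# Sketch only: Python has no Option type; returning the session or None
# is the idiomatic equivalent.
j_opt = self._jsparkSession.getActiveSession()  # py4j proxy of a Scala Option
if j_opt.isDefined():
    return self.__class__(self._sc, j_opt.get())
return None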
Yeah, I think we should return the Python session. The JVM instance should not be exposed. I assume returning None is fine. The thing is, we lack session support in PySpark. It's partially implemented but not very well tested, as far as I can tell.
Can you add a set of tests for it, and manually test them as well? Actually, my gut says this is quite a big deal.
@HyukjinKwon I added a set of tests. Some of them are borrowed from SparkSessionBuilderSuite.scala.
python/pyspark/sql/session.py
Outdated
So normally we try to have doctests like these be examples of how the user should use this. So I would consider getting the active session and then doing something a normal user would with it (like parallelizing a collection).
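For instance, a sketch of such a doctest (illustrative only; it assumes a session is already active, as in the shared doctest globals):

>>> s = SparkSession.getActiveSession()
>>> l = [('Alice', 1)]
>>> rdd = s.sparkContext.parallelize(l)
>>> df = s.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name='Alice', age=1)]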
...and it probably shouldn't access _jsparkSession.
@holdenk @felixcheung Thanks for the review. I will change this.
Test build #95815 has finished for PR 22295 at commit
Test build #95889 has finished for PR 22295 at commit
python/pyspark/sql/session.py
Outdated
@huaxingao, let's target this for 3.0.
Test build #95953 has finished for PR 22295 at commit
holdenk left a comment
This looks really close. The one thing which I'd like to see added is a test for getActiveSession when there is no active session.
python/pyspark/sql/session.py
Outdated
Thanks for catching this! Filed a follow-up: https://issues.apache.org/jira/browse/SPARK-25432
Thank you very much for your comments.
I have a question here. In the stop() method, shall we clear the active session too? Currently, it has:
def stop(self):
    """Stop the underlying :class:`SparkContext`.
    """
    self._jvm.SparkSession.clearDefaultSession()
    SparkSession._instantiatedSession = None
Do I need to add the following?
self._jvm.SparkSession.clearActiveSession()
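A sketch of what stop() might then look like (assuming the Python-side _activeSession cache this PR introduces, and the self._sc.stop() call from the existing method body):

def stop(self):
    """Stop the underlying :class:`SparkContext`."""
    self._sc.stop()
    # clear the JVM-side default and active sessions...
    self._jvm.SparkSession.clearDefaultSession()
    self._jvm.SparkSession.clearActiveSession()
    # ...and the Python-side caches
    SparkSession._instantiatedSession = None
    SparkSession._activeSession = None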
To test for getActiveSession when there is no active session, I am thinking of adding:
def test_get_active_session_when_no_active_session(self):
    spark = SparkSession.builder \
        .master("local") \
        .getOrCreate()
    spark.stop()
    active = spark.getActiveSession()
    self.assertEqual(active, None)
The test didn't pass because in stop(), the active session is not cleared.
Yes, that sounds like the right approach and I think we need that.
python/pyspark/sql/session.py
Outdated
cc @ueshin
python/pyspark/sql/tests.py
Outdated
nit: let's just name it spark_context and spark_session
I don't strongly agree here. I think, given that the method names are camel case in SparkSession & SparkContext in Python, this naming is perfectly reasonable.
python/pyspark/sql/tests.py
Outdated
Do we need to extend ReusedSQLTestCase? Looks like we can just use unittest.TestCase.
@HyukjinKwon there's no strong need for it; however, it does mean the first getOrCreate will already have a session it can use. But given that we set up and tear down the session, this may be less than ideal.
python/pyspark/sql/tests.py
Outdated
Ditto for naming. Let's just follow Python's convention for those names.
holdenk left a comment
Left some small comments, looking forward to seeing the fix on the stop side as well :)
Test build #96451 has finished for PR 22295 at commit
python/pyspark/sql/session.py
Outdated
Let's change this to 2.5
@HyukjinKwon are you OK to mark this comment as resolved since we're now targeting 3.0?
Yes, at that time 2.5 was targeted. Now 3.0 is targeted, per 9bf397c.
LGTM except the 3.0 to 2.5; I'll change that during the merge.
nvm, the merge script only triggers the edits if we have conflicts. If you can update 3.0 to 2.5 I'd be happy to merge.
python/pyspark/sql/session.py
Outdated
@huaxingao, can you check if the active session is set? For instance, when we createDataFrame? From a cursory look, we are not setting it.
@HyukjinKwon It seems to me that the active session is set OK in __init__. When we createDataFrame, we already have a session, and the active session was already set in __init__.
"When we createDataFrame, we already have a session"
But wouldn't we fail to set the active session properly if session A sets an active session in __init__, then session B sets an active session in __init__, and then session A calls createDataFrame?
@HyukjinKwon Do you mean something like this:
def test_two_spark_session(self):
    session1 = None
    session2 = None
    try:
        session1 = SparkSession.builder.config("key1", "value1").getOrCreate()
        session2 = SparkSession.builder.config("key2", "value2").getOrCreate()
        self.assertEqual(session1, session2)
        df = session1.createDataFrame([(1, 'Alice')], ['age', 'name'])
        self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
        activeSession1 = session1.getActiveSession()
        activeSession2 = session2.getActiveSession()
        self.assertEqual(activeSession1, activeSession2)
    finally:
        if session1 is not None:
            session1.stop()
        if session2 is not None:
            session2.stop()
Similar. I was expecting something like:
session1 = SparkSession.builder.config("key1", "value1").getOrCreate()
session2 = SparkSession.builder.config("key2", "value2").getOrCreate()
assert(session2 == SparkSession.getActiveSession())
session1.createDataFrame([(1, 'Alice')], ['age', 'name'])
assert(session1 == SparkSession.getActiveSession())
Does this work?
So @HyukjinKwon in this code session1 and session2 are already equal:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.6.5 (default, Apr 29 2018 16:14:56)
SparkSession available as 'spark'.
>>> session1 = SparkSession.builder.config("key1", "value1").getOrCreate()
>>> session2 = SparkSession.builder.config("key2", "value2").getOrCreate()
>>> session1
<pyspark.sql.session.SparkSession object at 0x7ff6d4843b00>
>>> session2
<pyspark.sql.session.SparkSession object at 0x7ff6d4843b00>
>>> session1 == session2
True
That being said, having multiple Spark sessions in Python is doable; you manually have to call the init, e.g.:
>>> session3 = SparkSession(sc)
>>> session3
<pyspark.sql.session.SparkSession object at 0x7ff6d3dbd160>
And supporting that is reasonable.
If we're going to support this, we should have a test for it; or if we aren't going to support this right now, we should document the behaviour.
Oh, okay. I should have been more explicit. I meant:
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> SparkSession.getActiveSession
res0: Option[org.apache.spark.sql.SparkSession] = Some(org.apache.spark.sql.SparkSession@3ef4a8fb)

scala> val session1 = spark
session1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@3ef4a8fb

scala> val session2 = spark.newSession()
session2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4b74a4d

scala> SparkSession.getActiveSession
res1: Option[org.apache.spark.sql.SparkSession] = Some(org.apache.spark.sql.SparkSession@3ef4a8fb)

scala> session2.createDataFrame(Seq(Tuple1(1)))
res2: org.apache.spark.sql.DataFrame = [_1: int]

scala> SparkSession.getActiveSession
res3: Option[org.apache.spark.sql.SparkSession] = Some(org.apache.spark.sql.SparkSession@4b74a4d)
@holdenk @HyukjinKwon
Thanks for the comments. I looked at the Scala code; it calls setActiveSession in createDataFrame:
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame = {
  SparkSession.setActiveSession(this)
  ...
}
I will do the same for Python:
def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
    SparkSession._activeSession = self
    self._jvm.SparkSession.setActiveSession(self._jsparkSession)
Will also add a test.
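A sketch of such a test (hypothetical name; newSession() gives a second Python session over the same context):

def test_active_session_follows_create_dataframe(self):
    session1 = SparkSession.builder.master("local").getOrCreate()
    try:
        session2 = session1.newSession()
        session2.createDataFrame([(1, 'Alice')], ['age', 'name'])
        # createDataFrame should have re-pointed the active session at session2
        self.assertEqual(session2, SparkSession.getActiveSession())
    finally:
        session1.stop()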
Test build #96707 has finished for PR 22295 at commit
I just saw this fix: [SPARK-25525][SQL][PYSPARK] Do not update conf for existing SparkContext in SparkSession.getOrCreate. #22545
Test build #96708 has finished for PR 22295 at commit
python/pyspark/sql/session.py
Outdated
Wait... this should be a class method, since the Scala usage is SparkSession.getActiveSession.
I think the class method should initialize the JVM if it doesn't exist yet (see functions.py). Probably the SparkContext too. If one exists, it should use the existing one.
Also, let's define this as a property, since that's closer to Scala's usage.
I know it's difficult to define a static property. You can refer to https://github.com/graphframes/graphframes/pull/169/files#diff-e81e6b169c0aa35012a3263b2f31b330R381, or we should consider adding this as a function.
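For reference, a minimal sketch of the descriptor pattern that link uses to fake a static property (names here are hypothetical, not Spark's API):

class classproperty(property):
    """A read-only property accessible on the class itself."""
    def __get__(self, obj, objtype=None):
        # delegate to the wrapped function, passing the class rather than an instance
        return self.fget(objtype)

class Demo(object):
    _active = "demo-session"

    @classproperty
    def activeSession(cls):
        return cls._active

print(Demo.activeSession)  # prints 'demo-session' without an instance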
Test build #96831 has finished for PR 22295 at commit
Test build #96833 has finished for PR 22295 at commit
python/pyspark/sql/session.py
Outdated
The problem here is when we share a single JVM, as in Zeppelin. It should get the session from the JVM.
Do you mean in a multi-language notebook environment?
Yup.
@HyukjinKwon I am not sure if I follow your suggestion correctly. Does the following look right to you?
session.py:
@classmethod
@since(3.0)
def getActiveSession(cls):
    from pyspark.sql import functions
    return functions.getActiveSession()
functions.py:
@since(3.0)
def getActiveSession():
    from pyspark.sql import SparkSession
    sc = SparkContext._active_spark_context
    if sc is None:
        sc = SparkContext()
    if sc._jvm.SparkSession.getActiveSession().isDefined():
        SparkSession(sc, sc._jvm.SparkSession.getActiveSession().get())
        return SparkSession._activeSession
    else:
        return None
Yeah, it should look like that.
python/pyspark/sql/session.py
Outdated
Let's target this for 3.0. Per 9bf397c, it looks like we are going ahead with 3.0 now.
Test build #97124 has finished for PR 22295 at commit
holdenk left a comment
Thanks for working on this. I have some questions but I think we're getting really close :)
python/pyspark/sql/functions.py
Outdated
If this is being done to simplify the implementation and we don't expect people to call it directly, we should mention that in the docstring and also use an _ prefix.
I disagree with @HyukjinKwon about this behaviour being what people would expect -- it doesn't match the Scala behaviour, and one of the reasons to have something like getActiveSession() instead of getOrCreate() is to allow folks to do one thing if we have an active session and something else if we don't.
What about: if sc is None, we just return None, since we can't have an active session without an active SparkContext -- does that sound reasonable?
That being said, if folks feel strongly about this, I'm OK with us setting up a SparkContext, but we need to document that if that's the path we go.
Yeah, we should match the behaviour with the Scala side -- that was my point, essentially. The problem with the previous approach was that the session was being handled within Python -- I believe we should basically reuse the JVM's session implementation rather than reimplementing separate Python session support on the PySpark side.
"What about: if sc is None, we just return None, since we can't have an active session without an active SparkContext -- does that sound reasonable?"
In that case, I think we should follow Scala's behaviour.
@holdenk @HyukjinKwon
Thanks for the comments.
I checked Scala's behavior:
test("my test") {
val cx = SparkContext.getActive
val session = SparkSession.getActiveSession
println(cx)
println(session)
}
The result is:
None
None
So it returns None if sc is None. Actually, my current code returns None if sc is None, but I will change the code a bit to make it more obvious. I will also add an _ prefix to the function name and mention in the docstring that this function is not supposed to be called directly.
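The guard can be as simple as this sketch (assuming SparkContext._active_spark_context is the Python-side handle, as used earlier in this thread):

sc = SparkContext._active_spark_context
if sc is None:
    # no active SparkContext, hence no active session -- matches Scala
    return None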
python/pyspark/sql/tests.py
Outdated
Given the change to how we construct the SparkSession, can we add a test that makes sure we do whatever we decide to with the SparkContext?
Thanks @holdenk
I will add a test for the above comment and also add a test for your comment regarding
self._jvm.SparkSession.setActiveSession(self._jsparkSession)
Test build #97466 has finished for PR 22295 at commit
Test build #97502 has finished for PR 22295 at commit
Test build #97503 has finished for PR 22295 at commit
python/pyspark/sql/functions.py
Outdated
Eh... why is it in functions.py? I thought it should be in getActiveSession in session.py.
Do you mean the _ prefix or the function itself?
I mean the function itself...
python/pyspark/sql/tests.py
Outdated
I think you can put this in a try-finally.
Will change. Thanks!
python/pyspark/sql/tests.py
Outdated
Let's just rename SparkSession -> session above.
Will change. Thanks!
Looks close to go.
Test build #97577 has finished for PR 22295 at commit
holdenk left a comment
LGTM
I'll leave this for if @HyukjinKwon has any final comments, otherwise I'm happy to merge.
Merged to master for 3.0. Thanks for fixing this @huaxingao :)
Thank you very much for your help!! @holdenk @HyukjinKwon
What changes were proposed in this pull request?
Add getActiveSession in session.py.
How was this patch tested?
Added a doctest.
Closes apache#22295 from huaxingao/spark25255.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
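For reference, a sketch of roughly what the added classmethod and doctest look like, built from the drafts discussed above (illustrative, not the verbatim merged code):

@classmethod
@since(3.0)
def getActiveSession(cls):
    """Returns the active SparkSession for the current thread, returned by
    the builder, or None if no session is active.

    >>> s = SparkSession.getActiveSession()
    >>> df = s.createDataFrame([('Alice', 1)], ['name', 'age'])
    >>> df.select("age").collect()
    [Row(age=1)]
    """
    from pyspark import SparkContext
    sc = SparkContext._active_spark_context
    if sc is None:
        # no active SparkContext, hence no active session (matches Scala)
        return None
    if sc._jvm.SparkSession.getActiveSession().isDefined():
        SparkSession(sc, sc._jvm.SparkSession.getActiveSession().get())
        return SparkSession._activeSession
    return None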