[SPARK-41034][CONNECT][PYTHON] Connect DataFrame should require a RemoteSparkSession #38541
Conversation
Can one of the admins verify this patch?

Merged to master.
### What changes were proposed in this pull request?
Fix Python mypy errors.

### Why are the changes needed?
#38541 breaks the linter:

```
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/sql/connect/plan.py:683: error: Argument 1 to "plan" of "LogicalPlan" has incompatible type "Optional[RemoteSparkSession]"; expected "RemoteSparkSession"  [arg-type]
python/pyspark/sql/connect/plan.py:815: error: Argument 1 to "plan" of "LogicalPlan" has incompatible type "Optional[RemoteSparkSession]"; expected "RemoteSparkSession"  [arg-type]
Found 2 errors in 1 file (checked 369 source files)
1
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test.

Closes #38597 from zhengruifeng/fix_py_lint_1.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
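For context, a minimal sketch of the kind of mismatch mypy reported and the usual fix. The classes here are hypothetical stand-ins, not the real Connect implementation: passing an `Optional[RemoteSparkSession]` where the bare type is expected triggers the `[arg-type]` error quoted above, and an explicit `None` check narrows the type before the call.

```python
from typing import Optional

class RemoteSparkSession:  # stand-in for the Connect session class
    pass

class LogicalPlan:  # stand-in: requires a non-optional session, as in plan.py
    def plan(self, session: RemoteSparkSession) -> str:
        return f"built with {type(session).__name__}"

def to_proto(session: Optional[RemoteSparkSession]) -> str:
    # Without this narrowing, mypy flags the call below with the [arg-type]
    # error shown in the lint output; the check narrows Optional[...] to the
    # bare RemoteSparkSession type.
    if session is None:
        raise ValueError("a RemoteSparkSession is required")
    return LogicalPlan().plan(session)

print(to_proto(RemoteSparkSession()))  # built with RemoteSparkSession
```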
```diff
@@ -91,6 +92,10 @@ def _session_range(
     def _session_sql(cls, query: str) -> "DataFrame":
         return DataFrame.withPlan(SQL(query), cls.connect)  # type: ignore

+    @classmethod
+    def _with_plan(cls, plan: LogicalPlan) -> "DataFrame":
```
Hi, @amaliujia. Unfortunately, this seems to break downstream CIs again by reintroducing the `pandas` dependency. In this PR, line 27 seems to be insufficient because `_with_plan` is a `classmethod`:

```python
if have_pandas:
    ...
    from pyspark.sql.connect.plan import LogicalPlan
```

The following error happens:

```
NameError: name 'LogicalPlan' is not defined
```
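The failure described above can be reproduced in isolation. This sketch uses hypothetical stand-in names and assumes a Python version where function annotations are evaluated eagerly (i.e., without `from __future__ import annotations`): the `def` inside the class body evaluates `LogicalPlan` at class-definition time, so the guarded import leaves the name undefined.

```python
have_pandas = False  # simulate an environment without the pandas dependency

if have_pandas:
    from pyspark.sql.connect.plan import LogicalPlan  # never runs here

error_msg = ""
try:
    class DataFrame:
        @classmethod
        def _with_plan(cls, plan: LogicalPlan) -> "DataFrame":
            # The annotation above is evaluated when the `def` executes,
            # which raises NameError because LogicalPlan was never imported.
            return cls()
except NameError as exc:
    error_msg = str(exc)

print(error_msg)  # name 'LogicalPlan' is not defined
```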
I think we might be able to replace it with `"LogicalPlan"` in quotes and use the `typing.TYPE_CHECKING` conditional import.
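A sketch of that suggestion, with simplified stand-in names: the import runs only under type checkers (where `TYPE_CHECKING` is true), and the quoted annotation is never evaluated at runtime, so no `NameError` can occur even without the dependency installed.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by mypy; never executed at runtime, so the pandas-dependent
    # module is not imported when the dependency is absent.
    from pyspark.sql.connect.plan import LogicalPlan

class DataFrame:
    @classmethod
    def _with_plan(cls, plan: "LogicalPlan") -> "DataFrame":
        # The string annotation defers evaluation; mypy still checks the type.
        df = cls()
        df._plan = plan
        return df

plan = object()  # any object at runtime; mypy would require a LogicalPlan
df = DataFrame._with_plan(plan)
```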
Ya, it's also possible.
I made a PR with a different approach.
…oteSparkSession

### What changes were proposed in this pull request?
Connect has marked the session parameter everywhere as `Optional`. This seems to be useful only for testing, where a test case can ignore the session parameter when it is not applicable. However:

1. There are PySpark DataFrame APIs that return a SparkSession, which is not optional. If Connect keeps it optional, then such APIs will diverge.
2. `Optional` suggests a `None` check, which seems unnecessary in many places (or the check may be forgotten, etc.).

This PR proposes to remove the `Optional` on the session from the API interface.

### Why are the changes needed?
Maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

Closes apache#38541 from amaliujia/dataframe_must_have_a_session.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

Connect has marked the session parameter everywhere as `Optional`. This seems to be useful only for testing, where a test case can ignore the session parameter when it is not applicable. However:

1. There are PySpark DataFrame APIs that return a SparkSession, which is not optional. If Connect keeps it optional, then such APIs will diverge.
2. `Optional` suggests a `None` check, which seems unnecessary in many places (or the check may be forgotten, etc.).

This PR proposes to remove the `Optional` on the session from the API interface.

Why are the changes needed?

Maintainability.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UT
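The API shape after this change can be sketched as follows. The names here are simplified stand-ins, not the actual Connect classes: by making the session a required constructor argument, the `sparkSession` accessor can return a non-optional value, and neither callers nor mypy need `None` checks.

```python
class RemoteSparkSession:  # stand-in for the Connect session class
    pass

class DataFrame:
    def __init__(self, session: RemoteSparkSession) -> None:
        self._session = session  # required, never None by construction

    @property
    def sparkSession(self) -> RemoteSparkSession:
        # Non-optional return type, matching the corresponding PySpark API.
        return self._session

session = RemoteSparkSession()
df = DataFrame(session)
```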