-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-23081][PYTHON]Add colRegex API to PySpark #20390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
python/pyspark/sql/dataframe.py
Outdated
@@ -1881,6 +1881,15 @@ def toDF(self, *cols): | |||
jdf = self._jdf.toDF(self._jseq(cols)) | |||
return DataFrame(jdf, self.sql_ctx) | |||
|
|||
@since(2.3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be 2.4.
python/pyspark/sql/dataframe.py
Outdated
def colRegex(self, colName): | ||
""" | ||
Selects column based on the column name specified as a regex and return it | ||
as :class:`Column`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a doctest and :param
too while we are here?
python/pyspark/sql/dataframe.py
Outdated
Selects column based on the column name specified as a regex and return it | ||
as :class:`Column`. | ||
""" | ||
jc = self._jdf.colRegex(colName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a type check here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Thank you very much for your comments. I will submit changes soon.
Test build #86611 has finished for PR 20390 at commit
|
Test build #86617 has finished for PR 20390 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise.
python/pyspark/sql/dataframe.py
Outdated
@@ -1881,6 +1881,28 @@ def toDF(self, *cols): | |||
jdf = self._jdf.toDF(self._jseq(cols)) | |||
return DataFrame(jdf, self.sql_ctx) | |||
|
|||
@since(2.4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we put this API between def columns(self):
and def alias(self, alias):
?
python/pyspark/sql/tests.py
Outdated
@@ -2855,6 +2855,10 @@ def test_create_dataframe_from_old_pandas(self): | |||
with self.assertRaisesRegexp(ImportError, 'Pandas >= .* must be installed'): | |||
self.spark.createDataFrame(pdf) | |||
|
|||
def test_colRegex(self): | |||
df = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)]) | |||
self.assertEqual(df.select(df.colRegex("`(_1)?+.+`")).collect(), df.select("_2").collect()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is actually being tested in doctest. Seems we can remove out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon Thanks! I will make the changes.
python/pyspark/sql/dataframe.py
Outdated
| 3| | ||
+---+ | ||
""" | ||
assert isinstance(colName, basestring), "colName should be a string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think TypeError
with an if could be more correct.
Test build #86626 has finished for PR 20390 at commit
|
python/pyspark/sql/dataframe.py
Outdated
:param colName: string, column name specified as a regex. | ||
|
||
>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)]) | ||
>>> df.select(df.colRegex("`(_1)?+.+`")).show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps a bit obscure to pick the default column name of _1
?
how about we name the columns in the line above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung Thanks for your comment! I will make changes.
python/pyspark/sql/dataframe.py
Outdated
@since(2.4) | ||
def colRegex(self, colName): | ||
""" | ||
Selects column based on the column name specified as a regex and return it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: -> returns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we have the same issue in Dataset.colRegex. Please also correct that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Thanks for your comments. I will make the changes.
python/pyspark/sql/dataframe.py
Outdated
@@ -819,6 +819,29 @@ def columns(self): | |||
""" | |||
return [f.name for f in self.schema.fields] | |||
|
|||
@since(2.4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> 2.3
Since our Spark 2.3 RC2 will fail, we can target it to 2.3 |
LGTM except the above two comments. |
Test build #86649 has finished for PR 20390 at commit
|
Awesome
LGTM pending test passes
|
Test build #86653 has finished for PR 20390 at commit
|
Merged to master and branch-2.3. |
## What changes were proposed in this pull request? Add colRegex API to PySpark ## How was this patch tested? add a test in sql/tests.py Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20390 from huaxingao/spark-23081. (cherry picked from commit 8480c0c) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Thank you all for your help! @HyukjinKwon @gatorsmile @felixcheung |
What changes were proposed in this pull request?
Add colRegex API to PySpark
How was this patch tested?
add a test in sql/tests.py