
[SPARK-23081][PYTHON] Add colRegex API to PySpark #20390


Closed
wants to merge 5 commits

Conversation

huaxingao
Contributor

What changes were proposed in this pull request?

Add colRegex API to PySpark

How was this patch tested?

add a test in sql/tests.py
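The new method mirrors Scala's `Dataset.colRegex`: it selects a column whose name matches a regex. As a rough, Spark-free sketch of the selection semantics (a hypothetical `match_columns` helper; the real API delegates to the JVM side and uses Java regex syntax, not Python's `re`):

```python
import re

def match_columns(pattern, columns):
    """Return the column names fully matched by the regex.

    Hypothetical helper illustrating colRegex-style selection over a
    plain list of column names; PySpark's colRegex instead calls the
    JVM Dataset.colRegex and returns a Column object.
    """
    compiled = re.compile(pattern)
    return [c for c in columns if compiled.fullmatch(c)]

print(match_columns(r"_\d+", ["_1", "_2", "name"]))  # ['_1', '_2']
```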

@@ -1881,6 +1881,15 @@ def toDF(self, *cols):
jdf = self._jdf.toDF(self._jseq(cols))
return DataFrame(jdf, self.sql_ctx)

@since(2.3)
Member

I think this should be 2.4.

def colRegex(self, colName):
"""
Selects column based on the column name specified as a regex and return it
as :class:`Column`.
Member

Shall we add a doctest and :param too while we are here?

Selects column based on the column name specified as a regex and return it
as :class:`Column`.
"""
jc = self._jdf.colRegex(colName)
Member

Could we add a type check here too?

Contributor Author

@HyukjinKwon Thank you very much for your comments. I will submit changes soon.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86611 has finished for PR 20390 at commit d08ed6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86617 has finished for PR 20390 at commit d1b4761.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

HyukjinKwon left a comment

LGTM otherwise.

@@ -1881,6 +1881,28 @@ def toDF(self, *cols):
jdf = self._jdf.toDF(self._jseq(cols))
return DataFrame(jdf, self.sql_ctx)

@since(2.4)
Member

Could we put this API between def columns(self): and def alias(self, alias):?

@@ -2855,6 +2855,10 @@ def test_create_dataframe_from_old_pandas(self):
with self.assertRaisesRegexp(ImportError, 'Pandas >= .* must be installed'):
self.spark.createDataFrame(pdf)

def test_colRegex(self):
df = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
self.assertEqual(df.select(df.colRegex("`(_1)?+.+`")).collect(), df.select("_2").collect())
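The `` `(_1)?+.+` `` pattern in this test uses a Java possessive quantifier: `(_1)?+` consumes a literal `_1` without ever backtracking, so when the column name is exactly `_1` the trailing `.+` has nothing left to match and the whole pattern fails; for every other name it matches. The net effect is "every column except `_1`". Python's `re` module lacks possessive quantifiers (before 3.11), but for these names a negative lookahead gives an equivalent filter (illustrative only, not how PySpark evaluates it):

```python
import re

# Approximate Python equivalent of Java's possessive "(_1)?+.+":
# match every column name except the exact name "_1".
pattern = re.compile(r"(?!_1$).+")

columns = ["_1", "_2"]
selected = [c for c in columns if pattern.fullmatch(c)]
print(selected)  # ['_2']
```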
Member

I think this is actually being tested in the doctest. Seems we can remove it.

Contributor Author

@HyukjinKwon Thanks! I will make the changes.

| 3|
+---+
"""
assert isinstance(colName, basestring), "colName should be a string"
Member

I think TypeError with an if could be more correct.
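The suggestion replaces the `assert` with an explicit check, since `assert` statements are skipped entirely when Python runs with the `-O` flag. A minimal sketch of the suggested pattern (hypothetical standalone function; the PR at the time used `basestring` for Python 2/3 compatibility, while this sketch assumes Python 3 `str`):

```python
def col_regex_check(colName):
    # Raise TypeError for non-string input instead of using assert,
    # so the check survives python -O (which strips asserts).
    if not isinstance(colName, str):
        raise TypeError("colName should be provided as string")
    return colName
```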

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86626 has finished for PR 20390 at commit 92ee53a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

:param colName: string, column name specified as a regex.

>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
>>> df.select(df.colRegex("`(_1)?+.+`")).show()
Member

nit: perhaps a bit obscure to pick the default column name of _1?
how about we name the columns in the line above?

Contributor Author

@felixcheung Thanks for your comment! I will make changes.

@since(2.4)
def colRegex(self, colName):
"""
Selects column based on the column name specified as a regex and return it
Member

Nit: -> returns

Member

Unfortunately, we have the same issue in Dataset.colRegex. Please correct that too.

Contributor Author

@gatorsmile Thanks for your comments. I will make the changes.

@@ -819,6 +819,29 @@ def columns(self):
"""
return [f.name for f in self.schema.fields]

@since(2.4)
Member

-> 2.3

@gatorsmile
Member

Since our Spark 2.3 RC2 will fail, we can target it to 2.3

@gatorsmile
Member

LGTM except the above two comments.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86649 has finished for PR 20390 at commit 54a26ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Jan 25, 2018 via email

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86653 has finished for PR 20390 at commit 4a58e95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master and branch-2.3.

asfgit pushed a commit that referenced this pull request Jan 25, 2018
## What changes were proposed in this pull request?

Add colRegex API to PySpark

## How was this patch tested?

add a test in sql/tests.py

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #20390 from huaxingao/spark-23081.

(cherry picked from commit 8480c0c)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
asfgit closed this in 8480c0c Jan 25, 2018
@huaxingao
Contributor Author

Thank you all for your help! @HyukjinKwon @gatorsmile @felixcheung
