
[SPARK-23081][PYTHON] Add colRegex API to PySpark #20390


Closed
wants to merge 5 commits

Conversation

huaxingao
Contributor

What changes were proposed in this pull request?

Add colRegex API to PySpark

How was this patch tested?

add a test in sql/tests.py
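The new method mirrors Scala's `Dataset.colRegex`: it selects a column whose name matches a regex. As a rough, Spark-free sketch of the selection semantics (a hypothetical `match_columns` helper; the real API delegates to the JVM side and uses Java regex syntax, not Python's `re`):

```python
import re

def match_columns(pattern, columns):
    """Return the column names fully matched by the regex.

    Hypothetical helper illustrating colRegex-style selection over a
    plain list of column names; PySpark's colRegex instead calls the
    JVM Dataset.colRegex and returns a Column object.
    """
    compiled = re.compile(pattern)
    return [c for c in columns if compiled.fullmatch(c)]

print(match_columns(r"_\d+", ["_1", "_2", "name"]))  # ['_1', '_2']
```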

@@ -1881,6 +1881,15 @@ def toDF(self, *cols):
jdf = self._jdf.toDF(self._jseq(cols))
return DataFrame(jdf, self.sql_ctx)

@since(2.3)
Member

I think this should be 2.4.

def colRegex(self, colName):
"""
Selects column based on the column name specified as a regex and return it
as :class:`Column`.
Member

Shall we add a doctest and :param too while we are here?

Selects column based on the column name specified as a regex and return it
as :class:`Column`.
"""
jc = self._jdf.colRegex(colName)
Member

Could we add a type check here too?

Contributor Author

@HyukjinKwon Thank you very much for your comments. I will submit changes soon.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86611 has finished for PR 20390 at commit d08ed6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86617 has finished for PR 20390 at commit d1b4761.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

HyukjinKwon left a comment

LGTM otherwise.

@@ -1881,6 +1881,28 @@ def toDF(self, *cols):
jdf = self._jdf.toDF(self._jseq(cols))
return DataFrame(jdf, self.sql_ctx)

@since(2.4)
Member

Could we put this API between def columns(self): and def alias(self, alias):?

@@ -2855,6 +2855,10 @@ def test_create_dataframe_from_old_pandas(self):
with self.assertRaisesRegexp(ImportError, 'Pandas >= .* must be installed'):
self.spark.createDataFrame(pdf)

def test_colRegex(self):
df = self.spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
self.assertEqual(df.select(df.colRegex("`(_1)?+.+`")).collect(), df.select("_2").collect())
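The `` `(_1)?+.+` `` pattern in this test uses a Java possessive quantifier: `(_1)?+` consumes a literal `_1` without ever backtracking, so when the column name is exactly `_1` the trailing `.+` has nothing left to match and the whole pattern fails; for every other name it matches. The net effect is "every column except `_1`". Python's `re` module lacks possessive quantifiers (before 3.11), but for these names a negative lookahead gives an equivalent filter (illustrative only, not how PySpark evaluates it):

```python
import re

# Approximate Python equivalent of Java's possessive "(_1)?+.+":
# match every column name except the exact name "_1".
pattern = re.compile(r"(?!_1$).+")

columns = ["_1", "_2"]
selected = [c for c in columns if pattern.fullmatch(c)]
print(selected)  # ['_2']
```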
Member

I think this is actually being tested in the doctest. Seems we can remove it.

Contributor Author

@HyukjinKwon Thanks! I will make the changes.

| 3|
+---+
"""
assert isinstance(colName, basestring), "colName should be a string"
Member

I think TypeError with an if could be more correct.
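The suggestion replaces the `assert` with an explicit check, since `assert` statements are skipped entirely when Python runs with the `-O` flag. A minimal sketch of the suggested pattern (hypothetical standalone function; the PR at the time used `basestring` for Python 2/3 compatibility, while this sketch assumes Python 3 `str`):

```python
def col_regex_check(colName):
    # Raise TypeError for non-string input instead of using assert,
    # so the check survives python -O (which strips asserts).
    if not isinstance(colName, str):
        raise TypeError("colName should be provided as string")
    return colName
```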

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86626 has finished for PR 20390 at commit 92ee53a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

:param colName: string, column name specified as a regex.

>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
>>> df.select(df.colRegex("`(_1)?+.+`")).show()
Member

nit: perhaps a bit obscure to pick the default column name of _1?
how about we name the columns in the line above?

Contributor Author

@felixcheung Thanks for your comment! I will make changes.

@since(2.4)
def colRegex(self, colName):
"""
Selects column based on the column name specified as a regex and return it
Member

Nit: -> returns

Member

Unfortunately, we have the same issue in Dataset.colRegex. Please correct that too.

Contributor Author

@gatorsmile Thanks for your comments. I will make the changes.

@@ -819,6 +819,29 @@ def columns(self):
"""
return [f.name for f in self.schema.fields]

@since(2.4)
Member

-> 2.3

@gatorsmile
Member

Since our Spark 2.3 RC2 will fail, we can target it to 2.3

@gatorsmile
Member

LGTM except the above two comments.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86649 has finished for PR 20390 at commit 54a26ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Jan 25, 2018 via email

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86653 has finished for PR 20390 at commit 4a58e95.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master and branch-2.3.

asfgit pushed a commit that referenced this pull request Jan 25, 2018
## What changes were proposed in this pull request?

Add colRegex API to PySpark

## How was this patch tested?

add a test in sql/tests.py

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #20390 from huaxingao/spark-23081.

(cherry picked from commit 8480c0c)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
asfgit closed this in 8480c0c Jan 25, 2018
@huaxingao
Contributor Author

Thank you all for your help! @HyukjinKwon @gatorsmile @felixcheung
