Skip to content

[SPARK-20098][PYSPARK] dataType's typeName fix #17435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

szalai1
Copy link
Contributor

@szalai1 szalai1 commented Mar 26, 2017

What changes were proposed in this pull request?

typeName classmethod has been fixed by using type -> typeName map.

How was this patch tested?

local build

"ShortType": "short",
"ArrayType": "array",
"MapType": "map",
"StructField": "struct",
Copy link
Member

@HyukjinKwon HyukjinKwon Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this problem only applies to StructField. Could we just override typeName with simply throwing an exception? I think users are not supposed to call typeName against StructField but simpleString against the instance.

BTW, It apparently seems a bit odd that it extends DataType though.. I guess probably some tests are broken if we change the parent as it seems it is dependent on the parent assuming from #1598. So, I guess minimised fix would be just to override it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @holdenk and @viirya WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think it is valid to call typeName against a StructField. Actually, StructField is not a data type, strictly speaking...

I don't know why StructField inherits DataType in pyspark. In scala, it is not.

Throwing an exception when calling typeName on StructField seems good enough, instead of a map like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya The way I found this bug:
I wanted to figure out the schema of a dataset. I loaded it into a data frame and asked its schema. Then I called the typeName on each column. I do not know this is/was best way to do this, but I think it is valid to call typeName against a dataType to get its real type.

Copy link
Member

@HyukjinKwon HyukjinKwon Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is valid call for DataTypes but I don't think it is against StructField. If you look at the Scala-side codes, StructField is not a DataType (which is "The base type of all Spark SQL data types") but it seems the parent became DataType in Python for some reason in the PR I pointed out. In any way, It seems "A field inside a StructType".

If you want to know the schema, you could simply

>>> spark.range(1).schema[0].simpleString()
'id:bigint'
>>> spark.range(1).schema.simpleString()
'struct<id:bigint>'
>>> type(spark.range(1).schema[0])
<class 'pyspark.sql.types.StructField'>
>>> str(spark.range(1).schema[0])
'StructField(id,LongType,false)'

Could you elaborate your use-case and why simpleString is not enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szalai1 I think @HyukjinKwon 's code snippets should address your request. Doesn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I don't think i.typeName() is a valid usage. We better let it throw an exception when calling typeName on StructField.

i.dataType.typeName() is more reasonable call to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya You are right. I will use it that way. Thanks @HyukjinKwon !

But my bug report is still valid, I think. Can I override the typeName function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I think we should still fix and overriding it is good enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@HyukjinKwon
Copy link
Member

@szalai1, could you update this PR please? Otherwise, we should close this as a stale PR.

@szalai1
Copy link
Contributor Author

szalai1 commented May 8, 2017

@HyukjinKwon sure, I will do it this week. I totally forgot this. Sorry.

@srowen srowen mentioned this pull request May 17, 2017
@HyukjinKwon
Copy link
Member

I think we need a test and @holdenk's review.

@holdenk
Copy link
Contributor

holdenk commented May 20, 2017

Thanks for working on this. I feel like the error message could maybe be improved to suggest what the user should be doing? It would be nicer to eventually not have this depend on DataType since we don't have this in the Scala version as @HyukjinKwon pointed out, but I think this could be a good improvement for now.

@szalai1
Copy link
Contributor Author

szalai1 commented May 20, 2017

@holdenk I am happy to contribute to this project. I changed the error message and added a test case.

@holdenk
Copy link
Contributor

holdenk commented Jun 15, 2017

Sorry for my delay finishing the review. This looks good pending Jenkins and if it passes I'll try and merge it when I get home. (so Jenkins looks good to test. Jenkins test this please. ) :)

@SparkQA
Copy link

SparkQA commented Jun 15, 2017

Test build #78097 has finished for PR 17435 at commit 6bf11f0.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Jun 17, 2017

Hey @szalai1 if you've got time to address the style issues its looking good otherwise :)

@szalai1
Copy link
Contributor Author

szalai1 commented Jun 25, 2017

@holdenk Thanks for reviewing it. I fixed the style issues.

@ueshin
Copy link
Member

ueshin commented Jul 28, 2017

ok to test

@SparkQA
Copy link

SparkQA commented Jul 28, 2017

Test build #80013 has finished for PR 17435 at commit 8872e19.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Jul 28, 2017

@szalai1 Could you fix tests if you're still working on this please?

@szalai1
Copy link
Contributor Author

szalai1 commented Aug 10, 2017

@ueshin Sure, I'll do it next week.

@holdenk
Copy link
Contributor

holdenk commented Sep 6, 2017

Gentle ping (going through old PySpark PRs) :)

@SparkQA
Copy link

SparkQA commented Sep 7, 2017

Test build #81511 has finished for PR 17435 at commit a18436d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@szalai1
Copy link
Contributor Author

szalai1 commented Sep 7, 2017

@holdenk thanks for reminder. I fixed the problem.

@@ -438,6 +438,11 @@ def toInternal(self, obj):
def fromInternal(self, obj):
return self.dataType.fromInternal(obj)

def typeName(self):
raise TypeError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do like ...

raise TypeError(
    "..."
    "...")

if it doesn't bother you much?

def typeName(self):
raise TypeError(
"StructField does not have typename. \
You can use self.dataType.simpleString() instead.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove self here and just say something like use typeName() on its type explicitly ....

Copy link
Contributor Author

@szalai1 szalai1 Sep 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean simpleString() and not typeName(). right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be typeName on its datatype because typeName was called.

@@ -438,6 +438,11 @@ def toInternal(self, obj):
def fromInternal(self, obj):
return self.dataType.fromInternal(obj)

def typeName(self):
raise TypeError(
"StructField does not have typename. \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little nit: looks a typo, typename -> typeName.

@SparkQA
Copy link

SparkQA commented Sep 8, 2017

Test build #81556 has finished for PR 17435 at commit a451901.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -438,6 +438,11 @@ def toInternal(self, obj):
def fromInternal(self, obj):
return self.dataType.fromInternal(obj)

def typeName(self):
raise TypeError(
"StructField does not have typeName."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry. Could you add a space at the end ..eName." -> ...eName. "?

@SparkQA
Copy link

SparkQA commented Sep 8, 2017

Test build #81559 has finished for PR 17435 at commit 98bac12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Sep 8, 2017

@szalai1 I think @HyukjinKwon wanted the space on the previous line/sentence.

@SparkQA
Copy link

SparkQA commented Sep 9, 2017

Test build #81584 has finished for PR 17435 at commit 28f142c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

LGTM

asfgit pushed a commit that referenced this pull request Sep 10, 2017
## What changes were proposed in this pull request?
`typeName`  classmethod has been fixed by using type -> typeName map.

## How was this patch tested?
local build

Author: Peter Szalai <szalaipeti.vagyok@gmail.com>

Closes #17435 from szalai1/datatype-gettype-fix.

(cherry picked from commit 520d92a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
@asfgit asfgit closed this in 520d92a Sep 10, 2017
@HyukjinKwon
Copy link
Member

Merged to master and branch-2.2.

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
## What changes were proposed in this pull request?
`typeName`  classmethod has been fixed by using type -> typeName map.

## How was this patch tested?
local build

Author: Peter Szalai <szalaipeti.vagyok@gmail.com>

Closes apache#17435 from szalai1/datatype-gettype-fix.

(cherry picked from commit 520d92a)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants