-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-20098][PYSPARK] dataType's typeName fix #17435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
python/pyspark/sql/types.py
Outdated
"ShortType": "short", | ||
"ArrayType": "array", | ||
"MapType": "map", | ||
"StructField": "struct", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this problem only applies to StructField
. Could we just override typeName
with simply throwing an exception? I think users are not supposed to call typeName
against StructField
but simpleString
against the instance.
BTW, It apparently seems a bit odd that it extends DataType
though.. I guess probably some tests are broken if we change the parent as it seems it is dependent on the parent assuming from #1598. So, I guess minimised fix would be just to override it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think it is valid to call typeName
against a StructField
. Actually, StructField
is not a data type, strictly speaking...
I don't know why StructField
inherits DataType
in pyspark. In scala, it is not.
Throwing an exception when calling typeName
on StructField
seems good enough, instead of a map like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya The way I found this bug:
I wanted to figure out the schema of a dataset. I loaded it into a data frame and asked its schema. Then I called the typeName
on each column. I do not know this is/was best way to do this, but I think it is valid to call typeName
against a dataType
to get its real type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is valid call for DataType
s but I don't think it is against StructField
. If you look at the Scala-side codes, StructField
is not a DataType
(which is "The base type of all Spark SQL data types
") but it seems the parent became DataType
in Python for some reason in the PR I pointed out. In any way, It seems "A field inside a StructType
".
If you want to know the schema, you could simply
>>> spark.range(1).schema[0].simpleString()
'id:bigint'
>>> spark.range(1).schema.simpleString()
'struct<id:bigint>'
>>> type(spark.range(1).schema[0])
<class 'pyspark.sql.types.StructField'>
>>> str(spark.range(1).schema[0])
'StructField(id,LongType,false)'
Could you elaborate your use-case and why simpleString
is not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@szalai1 I think @HyukjinKwon 's code snippets should address your request. Doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I don't think i.typeName()
is a valid usage. We better let it throw an exception when calling typeName
on StructField
.
i.dataType.typeName()
is more reasonable call to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya You are right. I will use it that way. Thanks @HyukjinKwon !
But my bug report is still valid, I think. Can I override the typeName
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I think we should still fix and overriding it is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@szalai1, could you update this PR please? Otherwise, we should close this as a stale PR. |
@HyukjinKwon sure, I will do it this week. I totally forgot this. Sorry. |
I think we need a test and @holdenk's review. |
Thanks for working on this. I feel like the error message could maybe be improved to suggest what the user should be doing? It would be nicer to eventually not have this depend on DataType since we don't have this in the Scala version as @HyukjinKwon pointed out, but I think this could be a good improvement for now. |
@holdenk I am happy to contribute to this project. I changed the error message and added a test case. |
Sorry for my delay finishing the review. This looks good pending Jenkins and if it passes I'll try and merge it when I get home. (so Jenkins looks good to test. Jenkins test this please. ) :) |
Test build #78097 has finished for PR 17435 at commit
|
Hey @szalai1 if you've got time to address the style issues its looking good otherwise :) |
@holdenk Thanks for reviewing it. I fixed the style issues. |
ok to test |
Test build #80013 has finished for PR 17435 at commit
|
@szalai1 Could you fix tests if you're still working on this please? |
@ueshin Sure, I'll do it next week. |
Gentle ping (going through old PySpark PRs) :) |
Test build #81511 has finished for PR 17435 at commit
|
@holdenk thanks for reminder. I fixed the problem. |
@@ -438,6 +438,11 @@ def toInternal(self, obj): | |||
def fromInternal(self, obj): | |||
return self.dataType.fromInternal(obj) | |||
|
|||
def typeName(self): | |||
raise TypeError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do like ...
raise TypeError(
"..."
"...")
if it doesn't bother you much?
python/pyspark/sql/types.py
Outdated
def typeName(self): | ||
raise TypeError( | ||
"StructField does not have typename. \ | ||
You can use self.dataType.simpleString() instead.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove self
here and just say something like use typeName() on its type explicitly ...
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean simpleString()
and not typeName()
. right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be typeName
on its datatype because typeName
was called.
python/pyspark/sql/types.py
Outdated
@@ -438,6 +438,11 @@ def toInternal(self, obj): | |||
def fromInternal(self, obj): | |||
return self.dataType.fromInternal(obj) | |||
|
|||
def typeName(self): | |||
raise TypeError( | |||
"StructField does not have typename. \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Little nit: looks a typo, typename -> typeName.
Test build #81556 has finished for PR 17435 at commit
|
python/pyspark/sql/types.py
Outdated
@@ -438,6 +438,11 @@ def toInternal(self, obj): | |||
def fromInternal(self, obj): | |||
return self.dataType.fromInternal(obj) | |||
|
|||
def typeName(self): | |||
raise TypeError( | |||
"StructField does not have typeName." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry. Could you add a space at the end ..eName."
-> ...eName. "
?
Test build #81559 has finished for PR 17435 at commit
|
@szalai1 I think @HyukjinKwon wanted the space on the previous line/sentence. |
Test build #81584 has finished for PR 17435 at commit
|
LGTM |
## What changes were proposed in this pull request? `typeName` classmethod has been fixed by using type -> typeName map. ## How was this patch tested? local build Author: Peter Szalai <szalaipeti.vagyok@gmail.com> Closes #17435 from szalai1/datatype-gettype-fix. (cherry picked from commit 520d92a) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
Merged to master and branch-2.2. |
## What changes were proposed in this pull request? `typeName` classmethod has been fixed by using type -> typeName map. ## How was this patch tested? local build Author: Peter Szalai <szalaipeti.vagyok@gmail.com> Closes apache#17435 from szalai1/datatype-gettype-fix. (cherry picked from commit 520d92a) Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
What changes were proposed in this pull request?
typeName
classmethod has been fixed by using type -> typeName map.How was this patch tested?
local build