-
Notifications
You must be signed in to change notification settings - Fork 1.8k
add a describe method on DataFrame like Polars #5226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9d0fafe to
7a622c6
Compare
c0500d9 to
1309267
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jiangzhx -- this looks like a very nice addition to DataFusion ❤️
I left some comments -- let me know if anything isn't clear.
And thanks again!
| if field.data_type().is_numeric() { | ||
| Field::new(field.name(), DataType::Float64, true) | ||
| } else { | ||
| Field::new(field.name(), DataType::Utf8, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect that the schema for count and null_count were always Int64 and the schema for min/max were always Utf8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the describe method return schema like this.

the each column should have same datatype .
for example :
bool_coloncount/null_countreturn Int64 ; error onmin/max, so makebool_coldatatypeUTF8;float_coloncount/null_countreturn Int64 ; onmin/maxreturn float, so makefloat_coldatatypeFloat64
datafusion/core/src/dataframe.rs
Outdated
| vec![], | ||
| fields_iter | ||
| .clone() | ||
| .filter(|f| matches!(f.data_type().is_numeric(), true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why restrict the min/max aggregation to numeric fields?
In order to get the min/max values in all columns to work, you could call cast to cast them to the same datatype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boolean and binary not work with min/max.
filter out DataType::Binary , DataType::Boolean will be better.
!matches!(f.data_type(), DataType::Binary | DataType::Boolean)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
date_string_col, string_col ’s datatype also Binary.
called Result::unwrap() on an Err value: Internal("Min/Max accumulator not implemented for type Binary")
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @jiangzhx
cc @andygrove (as I think this is a neat thing to expose in datafusion-python)
| )), | ||
| ); | ||
|
|
||
| let describe_record_batch = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
Benchmark runs are scheduled for baseline = ea3b965 and contender = 96aa2a6. 96aa2a6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |


Which issue does this PR close?
Closes #4974 .