add a describe method on DataFrame like Polars #5226

Merged

alamb merged 4 commits into apache:main from jiangzhx:issues/4974

Feb 28, 2023

Contributor

jiangzhx commented Feb 9, 2023

Which issue does this PR close?

Closes #4974 .

jiangzhx marked this pull request as draft

February 9, 2023 13:46

github-actions bot added the core label

Contributor Author

jiangzhx commented Feb 9, 2023

This is a first attempt at completing this work; there are still many places that can be optimized.

jiangzhx force-pushed the issues/4974 branch 2 times, most recently from 9d0fafe to 7a622c6 Compare

February 13, 2023 08:06

Contributor Author

jiangzhx commented Feb 14, 2023

Make the describe method return the right data type for the record batch.

github-actions bot added development-process documentation logical-expr optimizer physical-expr sql sqllogictest substrait and removed substrait logical-expr physical-expr development-process sqllogictest optimizer sql documentation labels

jiangzhx closed this

jiangzhx force-pushed the issues/4974 branch from c0500d9 to 1309267 Compare

February 24, 2023 10:56

github-actions bot removed the core label


          add describe method like polars

d4730bf

jiangzhx reopened this

github-actions bot added the core label


          clippy fix

c7e6091

jiangzhx marked this pull request as ready for review

February 24, 2023 14:39

alamb reviewed

Contributor

alamb left a comment

Thank you @jiangzhx -- this looks like a very nice addition to DataFusion ❤️

I left some comments -- let me know if anything isn't clear.

And thanks again!

datafusion/core/src/dataframe.rs Outdated

datafusion/core/src/dataframe.rs Outdated

datafusion/core/src/dataframe.rs

    
    if field.data_type().is_numeric() {
        Field::new(field.name(), DataType::Float64, true)
    } else {
        Field::new(field.name(), DataType::Utf8, true)

Contributor

alamb Feb 26, 2023

I would expect that the schema for count and null_count were always Int64 and the schema for min/max were always Utf8

Contributor Author

jiangzhx Feb 27, 2023

The describe method returns a schema in which each column has a single data type, shared by every statistic row.
For example:

bool_col: count/null_count return Int64, but min/max return an error, so the data type of bool_col is Utf8.
float_col: count/null_count return Int64 and min/max return a float, so the data type of float_col is Float64.
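
For illustration, the per-column rule described in this reply (and visible in the diff hunk under review) can be sketched without the Arrow crates. This is a minimal model, not the PR's code: `ColType`, `is_numeric`, and `describe_column_type` are hypothetical stand-ins for Arrow's `DataType` and the logic in `dataframe.rs`.

```rust
/// Hypothetical stand-in for a few Arrow data types (not the real `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int64,
    Float64,
    Boolean,
    Binary,
    Utf8,
}

/// Assumed numeric check for this model; Arrow's own check covers more types.
fn is_numeric(t: ColType) -> bool {
    matches!(t, ColType::Int64 | ColType::Float64)
}

/// Type of a column in the describe output. Every statistic row
/// (count, null_count, min, max, ...) for that column has to share it,
/// so numeric inputs map to Float64 and everything else to Utf8.
fn describe_column_type(input: ColType) -> ColType {
    if is_numeric(input) {
        ColType::Float64
    } else {
        ColType::Utf8
    }
}
```

Under this model, `describe_column_type(ColType::Boolean)` yields `ColType::Utf8`, matching the bool_col case, while a float column stays `ColType::Float64`.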

datafusion/core/src/dataframe.rs Outdated

    
    vec![],
    fields_iter
        .clone()
        .filter(|f| matches!(f.data_type().is_numeric(), true))

Contributor

alamb Feb 26, 2023

I wonder why restrict the min/max aggregation to numeric fields?

In order to get the min/max values in all columns to work, you could call cast to cast them to the same datatype
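
As a rough illustration of the cast suggestion, the sketch below converts every value to a string before taking min/max. It uses only the standard library; `Column`, `cast_to_strings`, and `min_max` are hypothetical names, and choosing strings as the common type is an assumption rather than something the PR prescribes.

```rust
/// Hypothetical stand-in for a column holding values of one type.
enum Column {
    Int(Vec<i64>),
    Bool(Vec<bool>),
    Text(Vec<String>),
}

/// "Cast" every value to its string form, the common type assumed here.
fn cast_to_strings(col: &Column) -> Vec<String> {
    match col {
        Column::Int(v) => v.iter().map(|x| x.to_string()).collect(),
        Column::Bool(v) => v.iter().map(|x| x.to_string()).collect(),
        Column::Text(v) => v.clone(),
    }
}

/// Min and max over the string form; `None` for an empty column.
fn min_max(col: &Column) -> Option<(String, String)> {
    let values = cast_to_strings(col);
    let min = values.iter().min()?.clone();
    let max = values.iter().max()?.clone();
    Some((min, max))
}
```

One thing to keep in mind with a string target type is that the comparison becomes lexicographic: in this sketch the integers 10, 9, 2 give a minimum of "10" and a maximum of "9".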

Contributor Author

jiangzhx Feb 27, 2023 (edited)

Boolean and Binary columns do not work with min/max.

Filtering out DataType::Binary and DataType::Boolean would be better:
!matches!(f.data_type(), DataType::Binary | DataType::Boolean)

Contributor Author

jiangzhx Feb 27, 2023

The data type of date_string_col and string_col is also Binary, which fails with:
called Result::unwrap() on an Err value: Internal("Min/Max accumulator not implemented for type Binary")
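
The filter proposed above can be tried out in isolation with a small standard-library model. `ColType` and `min_max_columns` are hypothetical names, and the types given to bool_col and float_col in the usage note are illustrative; only the Binary type of string_col and date_string_col comes from this thread.

```rust
/// Hypothetical stand-in for a few Arrow data types (not the real `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Float64,
    Boolean,
    Binary,
}

/// Names of the columns min/max would be applied to, skipping the types
/// reported in this thread as unsupported by the Min/Max accumulator.
fn min_max_columns<'a>(fields: &[(&'a str, ColType)]) -> Vec<&'a str> {
    fields
        .iter()
        .filter(|(_, ty)| !matches!(ty, ColType::Binary | ColType::Boolean))
        .map(|(name, _)| *name)
        .collect()
}
```

Given bool_col (Boolean), float_col (Float64), date_string_col (Binary), and string_col (Binary), only float_col passes the filter in this model.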

datafusion/core/src/dataframe.rs Outdated

jiangzhx added 2 commits

February 27, 2023 13:22


          commit suggestion

aeac881


          fix typos

ae848b1

This was referenced Feb 27, 2023

Add expr_fn::stddev #5409

Merged

Is there a describe method on DataFrame like Polars? #4974

Closed

jiangzhx requested a review from alamb

February 27, 2023 13:44

alamb approved these changes

Contributor

alamb left a comment

Looks good to me -- thank you @jiangzhx

cc @andygrove (as I think this is a neat thing to expose in datafusion-python)

datafusion/core/src/dataframe.rs

    
        )),
    );
    let describe_record_batch =

Contributor

alamb Feb 28, 2023

👍

alamb merged commit 96aa2a6 into apache:main

ursabot commented Feb 28, 2023

Benchmark runs are scheduled for baseline = ea3b965 and contender = 96aa2a6. 96aa2a6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jiangzhx mentioned this pull request

add expr_fn::median #5437

Merged

simicd mentioned this pull request

Implement describe() method in datafusion-python apache/datafusion-python#292

Closed

Labels

core