Bug Report: Null columns not supported for Spark dataframe #1305
Description
Current Behaviour
When attempting to profile a Spark dataframe that contains an entirely null column, profiling fails with an uncaught exception.
When the null column is of type integer, the error is KeyError: '50%', raised at ydata_profiling/model/spark/describe_numeric_spark.py:102 in describe_numeric_1d_spark(config, df, summary).
When the null column is a string, the error is ZeroDivisionError: division by zero, raised at ydata_profiling/model/spark/describe_supported_spark.py:31 in describe_supported_spark(config, series, summary).
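Both tracebacks are consistent with summary statistics being computed over zero non-null rows. A minimal pure-Python sketch of the two suspected failure modes — simplified and hypothetical, not the library's actual code — assuming the numeric path builds a percentile dict from Spark's (empty) approxQuantile result and the string path divides a distinct count by the non-null count:

```python
# Hypothetical sketch of the two failure modes, simplified from the
# ydata-profiling Spark backend's behavior on an all-null column.

def numeric_summary(quantile_values, percentiles=("25%", "50%", "75%")):
    # Spark's DataFrame.approxQuantile returns an empty list when the
    # column has no non-null values, so zip() yields an empty dict.
    stats = dict(zip(percentiles, quantile_values))
    return stats["50%"]  # KeyError: '50%' when quantile_values == []

def string_summary(n_distinct, n_non_null):
    # For an all-null string column both counts are 0.
    return n_distinct / n_non_null  # ZeroDivisionError: division by zero

try:
    numeric_summary([])
except KeyError as e:
    print("numeric path:", repr(e))   # numeric path: KeyError('50%')

try:
    string_summary(0, 0)
except ZeroDivisionError as e:
    print("string path:", e)          # string path: division by zero
```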
Expected Behaviour
A profile should be produced for the Spark dataframe even when it contains entirely null columns. The profiler works as expected when the same data is passed as a Pandas dataframe.
Data Description
Any Spark dataframe with an entirely null column:
df.withColumn('empty1', lit(None).cast('string')).withColumn('empty2', lit(None).cast('integer'))
Code that reproduces the bug
# Follow the Spark Databricks example code: https://github.com/ydataai/ydata-profiling/blob/master/examples/integrations/databricks/ydata-profiling%20in%20Databricks.ipynb
# Add the following lines to df before running ProfileReport
from pyspark.sql.functions import lit

df = (
    df
    .withColumn('empty1', lit(None).cast('string'))
    .withColumn('empty2', lit(None).cast('integer'))
)
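Until the Spark backend handles empty columns, one possible workaround is to drop entirely-null columns before profiling. A sketch under stated assumptions — the helper below is hypothetical and not part of ydata-profiling; the commented-out usage relies only on standard pyspark calls (filter, isNull, count, select):

```python
def non_null_columns(null_counts, total_rows):
    """Return names of columns that hold at least one non-null value.

    null_counts: mapping of column name -> number of null rows
    total_rows:  total row count of the dataframe
    """
    return [c for c, n in null_counts.items() if n < total_rows]

# Possible usage with a Spark dataframe (untested sketch):
# null_counts = {c: df.filter(df[c].isNull()).count() for c in df.columns}
# df = df.select(*non_null_columns(null_counts, df.count()))

print(non_null_columns({"a": 0, "empty1": 3, "b": 1}, 3))  # ['a', 'b']
```

Note this changes the report's contents (the dropped columns are simply absent), so it is a stopgap rather than a fix.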
pandas-profiling version
v4.1.2
Dependencies
numpy==1.21.5
pandas==1.4.2
ydata-profiling==4.1.2
OS
No response
Checklist
- There is not yet another bug report for this issue in the issue tracker
- The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- The issue has not been resolved by the entries listed under Common Issues.
Metadata
Status: Selected for next release