Bug Report: Null columns not supported for Spark dataframe #1305

Open
@danaack

Description

Current Behaviour

Profiling a Spark dataframe that contains an entirely null column raises an exception instead of producing a report.

When the null column is of type integer, the error message is KeyError: '50%' as thrown by ydata_profiling/model/spark/describe_numeric_spark.py:102, in describe_numeric_1d_spark(config, df, summary).

When the null column is a string, the error message is ZeroDivisionError: division by zero as thrown by ydata_profiling/model/spark/describe_supported_spark.py:31, in describe_supported_spark(config, series, summary).
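One plausible reading of the two tracebacks, sketched in plain Python. This is an illustrative assumption, not the code in ydata_profiling: the function names below are hypothetical, and the premise that the quantile computation returns no values for an all-null column is assumed rather than confirmed from the source.

```python
from typing import Dict, List


def p_distinct(n_distinct: int, n_non_null: int) -> float:
    """Hypothetical ratio statistic: distinct values over non-null values."""
    # For an entirely null column n_non_null is 0, so this division fails.
    return n_distinct / n_non_null


def median_from_quantiles(labels: List[str], values: List[float]) -> float:
    """Hypothetical lookup of the median from labelled quantile results."""
    # zip() stops at the shorter input, so an empty `values` list
    # (assumed result for an all-null column) yields an empty dict.
    quantiles: Dict[str, float] = dict(zip(labels, values))
    return quantiles["50%"]


labels = ["25%", "50%", "75%"]

# A populated column works as expected.
print(p_distinct(2, 4))
print(median_from_quantiles(labels, [1.0, 2.0, 3.0]))

# An entirely null column reproduces the same exception types as the report.
for call in (lambda: p_distinct(0, 0), lambda: median_from_quantiles(labels, [])):
    try:
        call()
    except (ZeroDivisionError, KeyError) as exc:
        print(type(exc).__name__, exc)
```

If this reading is right, a fix would need to guard both the division and the quantile lookup for columns with zero non-null values.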

Expected Behaviour

A profile should be produced for the Spark dataframe even when it contains entirely null columns. The profiler works as expected for the same data when passed as a Pandas dataframe.

Data Description

Any Spark dataframe with an entirely null column:

df.withColumn('empty1', lit(None).cast('string')).withColumn('empty2', lit(None).cast('integer'))

Code that reproduces the bug

# Follow the Spark Databricks example code: https://github.com/ydataai/ydata-profiling/blob/master/examples/integrations/databricks/ydata-profiling%20in%20Databricks.ipynb

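# Add the following lines to df before running ProfileReport
from pyspark.sql.functions import lit

# Append one entirely null string column and one entirely null integer column
df = (
    df
    .withColumn('empty1', lit(None).cast('string'))
    .withColumn('empty2', lit(None).cast('integer'))
)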
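Until this is fixed, a possible workaround is to drop entirely null columns before profiling. The sketch below is untested against ydata-profiling itself; `all_null_columns` and `drop_all_null_columns` are hypothetical helper names, not part of any library. It relies on `pyspark.sql.functions.count`, which counts only non-null values.

```python
from typing import Dict, List


def all_null_columns(non_null_counts: Dict[str, int]) -> List[str]:
    """Return the names of columns whose non-null count is zero."""
    return [name for name, n in non_null_counts.items() if n == 0]


def drop_all_null_columns(df):
    """Drop entirely null columns from a PySpark DataFrame (hypothetical helper)."""
    # Imported lazily so the pure-Python helper above works without PySpark.
    from pyspark.sql import functions as F

    # F.count(column) ignores nulls, so a result of 0 means the column is all null.
    # Note: column names containing dots would need backtick quoting here.
    counts = (
        df.select([F.count(F.col(c)).alias(c) for c in df.columns])
        .first()
        .asDict()
    )
    empty = all_null_columns(counts)
    return df.drop(*empty) if empty else df


# The selection logic on its own, using counts like those expected for the repro:
print(all_null_columns({"value": 3, "empty1": 0, "empty2": 0}))
```

Calling `df = drop_all_null_columns(df)` before constructing the report would sidestep both errors, at the cost of omitting the empty columns from the profile.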

pandas-profiling version

v4.1.2

Dependencies

numpy==1.21.5
pandas==1.4.2
ydata-profiling==4.1.2

OS

No response

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.

Metadata

Assignees: No one assigned
Labels: bug, getting started, spark
Status: Selected for next release