ProfileReport.to_json() returns invalid JSON #983

LukasBoersma · 2022-05-17T09:40:26Z

Current Behaviour

Python's standard json module has a known issue: by default it produces invalid JSON. The JSON specification does not support NaN and Infinity values, but Python serializes them anyway.

Because of that, when using ProfileReport.to_json(), the JSON is often not valid. For example, the example code below produces this output that other JSON implementations will fail to parse:

...
"variance": 0.0,
"kurtosis": NaN,
"skewness": 0,
...

I ran this on the spark-branch, commit 9017c4a5e26e22152ed3f24c5ec628f70859fa14.
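The behaviour can be reproduced with the standard library alone, without pandas-profiling. The following is a minimal sketch: the helper name `reject_constant` is made up for illustration, and the data only mimics the report excerpt above. It relies on the `allow_nan` argument of `json.dumps` and the `parse_constant` hook of `json.loads`.

```python
import json
from typing import NoReturn

stats = {"variance": 0.0, "kurtosis": float("nan"), "skewness": 0}

# Default settings (allow_nan=True) emit the bare token NaN, which is not valid JSON.
lenient_text = json.dumps(stats)

# With allow_nan=False the encoder raises instead of writing NaN.
try:
    json.dumps(stats, allow_nan=False)
    strict_encode_failed = False
except ValueError:
    strict_encode_failed = True


def reject_constant(name: str) -> NoReturn:
    """Hypothetical hook: fail on NaN, Infinity and -Infinity while parsing."""
    raise ValueError(f"invalid JSON constant: {name}")


# Python's own parser accepts NaN unless parse_constant is used to reject it.
try:
    json.loads(lenient_text, parse_constant=reject_constant)
    strict_decode_failed = False
except ValueError:
    strict_decode_failed = True

print(lenient_text)
```

Passing `parse_constant` this way is one option for checking in Python whether a document would be rejected by a stricter parser.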

Expected Behaviour

This is not really the fault of pandas-profiling, but it would be really nice if to_json could return valid JSON.

I see several ways to solve this:

Never actually return any NaNs in the produced statistics; always set such values to None. The problem with this approach is that some parts of the report contain actual input data, which could still contain NaNs.
Use a different JSON library that produces valid output.
(Maybe the easiest solution) Check for non-finite numbers in the existing encode_it function and replace them with None, so that the JSON will contain null values.

The changed encode_it function could look like this (I would be happy to send a pull request if that is the accepted solution):

import math
from typing import Any

def encode_it(o: Any) -> Any:
    if isinstance(o, dict):
        return {encode_it(k): encode_it(v) for k, v in o.items()}
    else:
        if isinstance(o, (bool, int, str)):
            return o
        if isinstance(o, float):
            if not math.isfinite(o):
                # Encode non-finite floats as None.
                # This is necessary because JSON does not support NaN/Infinity values.
                return None
            else:
                return o
        elif isinstance(o, list):
            [...]
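Since the snippet above elides its remaining branches, here is a separate, self-contained sketch of the same idea. It is not the library's encode_it: the name `replace_non_finite` and the sample data are assumptions made for illustration, and only built-in containers and Python floats are handled.

```python
import json
import math
from typing import Any


def replace_non_finite(o: Any) -> Any:
    """Hypothetical helper: recursively replace NaN/Infinity floats with None."""
    if isinstance(o, dict):
        return {k: replace_non_finite(v) for k, v in o.items()}
    if isinstance(o, (list, tuple)):
        return [replace_non_finite(v) for v in o]
    if isinstance(o, float) and not math.isfinite(o):
        return None
    return o


stats = {"variance": 0.0, "kurtosis": float("nan"), "range": [float("-inf"), 1.5]}

# allow_nan=False makes json.dumps raise if any non-finite float slipped through.
clean_text = json.dumps(replace_non_finite(stats), allow_nan=False)
print(clean_text)
```

Calling `json.dumps` with `allow_nan=False` afterwards acts as a safety net: any non-finite float the helper missed would raise a `ValueError` instead of silently producing invalid output.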

Data Description

DataFrame([1, 1, 1], columns=["a"])

Code that reproduces the bug

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame([1, 1, 1], columns=["a"])

profile = ProfileReport(df, title="Pandas Profiling Report", minimal=True)

print(profile.to_json())

pandas-profiling version

spark-branch @ 9017c4a

Dependencies

joblib~=1.1.0
scipy>=1.4.1
matplotlib>=3.2.0
pydantic>=1.8.1
PyYAML>=5.0.0
jinja2>=2.11.1
markupsafe~=2.0.1
visions[type_image_path]==0.7.4
numpy>=1.16.0
htmlmin>=0.1.12
missingno>=0.4.2
phik>=0.11.1
tangled-up-in-unicode==0.2.0
requests>=2.24.0
tqdm>=4.48.2
seaborn>=0.10.1
multimethod>=1.4

OS

Windows 10

Checklist

There is not yet another bug report for this issue in the issue tracker
The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
The issue has not been resolved by the entries listed under Frequent Issues.

sbrugman · 2022-05-17T13:16:37Z

Well spotted @LukasBoersma! In addition to the proposals in your bug report, we could also consider writing non-finite floats as strings (which would result in a mixed-type field). This way there is no information loss.

At this moment my preference is your proposed solution above (perhaps with a parameter to toggle between the two ways of handling). A PR is much appreciated, let's take it from there.
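To make the two handling modes concrete, here is a rough sketch of what such a toggle might look like. The function name `encode_non_finite`, the `as_string` parameter and the chosen string spellings are all assumptions for illustration, not an existing pandas-profiling API.

```python
import json
import math
from typing import Any


def encode_non_finite(o: Any, as_string: bool = False) -> Any:
    """Hypothetical helper: map non-finite floats to None or to a string."""
    if isinstance(o, dict):
        return {k: encode_non_finite(v, as_string) for k, v in o.items()}
    if isinstance(o, list):
        return [encode_non_finite(v, as_string) for v in o]
    if isinstance(o, float) and not math.isfinite(o):
        if not as_string:
            return None  # lossy: NaN and both infinities all become null
        if math.isnan(o):
            return "NaN"
        return "Infinity" if o > 0 else "-Infinity"
    return o


stats = {"kurtosis": float("nan"), "max": float("inf"), "variance": 0.0}

as_null = json.dumps(encode_non_finite(stats), allow_nan=False)
as_text = json.dumps(encode_non_finite(stats, as_string=True), allow_nan=False)
print(as_null)
print(as_text)
```

The string mode keeps the distinction between NaN, Infinity and -Infinity, at the cost of a field that is sometimes a number and sometimes a string; the null mode keeps the field numeric-or-null but loses that distinction.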

LukasBoersma · 2022-05-19T09:09:08Z

Okay, cool! Then I will prepare a pull request in the next few days with a solution where you can configure the behavior.

Vamp1899 · 2022-07-12T19:55:09Z

Hi Lukas, I would like to mention some refinements to this code.
Make the global class variables private and access them through a function.
Rather than defining one value, passing it through the config, and checking it in the profile report, make different instances for the same.

github-actions bot added the needs-triage label May 17, 2022

sbrugman added bug 🐛 Something isn't working and removed needs-triage labels May 17, 2022

sbrugman added the help wanted 🙋 Contributions are welcome! label May 17, 2022

LukasBoersma mentioned this issue Jun 8, 2022: JSON encoding of non-finite floats #996 (Open)

This was referenced Oct 9, 2022: Report comparisons #1069 (Merged); Schema for dataset summary (JSON) #1102 (Open)

fabclmnt added spark ⚡ spark ⚡ PySpark features! and removed spark ⚡ labels Jan 27, 2023
