ProfileReport.to_json() returns invalid JSON #983

Open
3 tasks done
LukasBoersma opened this issue May 17, 2022 · 3 comments
Labels
bug 🐛 Something isn't working help wanted 🙋 Contributions are welcome! spark ⚡ PySpark features!

Comments

@LukasBoersma

Current Behaviour

Python's standard json module is known to produce invalid JSON by default: the JSON specification does not support NaN and Infinity values, but Python serializes them anyway.

Because of that, the output of ProfileReport.to_json() is often not valid JSON. For example, the code below produces the following output, which other JSON implementations fail to parse:

...
"variance": 0.0,
"kurtosis": NaN,
"skewness": 0,
...
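
The behaviour can be reproduced with the standard library alone. The snippet below is a minimal sketch that does not touch pandas-profiling; the `stats` dictionary is a made-up stand-in for the report excerpt above.

```python
import json

# Hypothetical stand-in for the statistics shown in the excerpt above.
stats = {"variance": 0.0, "kurtosis": float("nan"), "skewness": 0}

# Default behaviour: NaN is written as a bare token, which is not valid JSON.
lenient = json.dumps(stats)
print(lenient)

# With allow_nan=False the encoder raises instead of emitting the token.
try:
    json.dumps(stats, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc
print(type(strict_error).__name__)
```

Passing `allow_nan=False` only turns the problem into an exception; it does not decide what should be written in place of the non-finite value.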

I observed this on the spark-branch, commit 9017c4a5e26e22152ed3f24c5ec628f70859fa14.

Expected Behaviour

This is not really the fault of pandas-profiling, but it would be really nice if to_json could return valid JSON.

I see several ways to solve this:

  • Never actually return any NaNs in the produced statistics; always set such values to None. The drawback is that some parts of the report contain actual input data, which could still contain NaNs.
  • Use a different JSON library that produces valid output.
  • (Maybe the easiest solution) Check for non-finite numbers in the existing encode_it function and replace them with None, so that the JSON will contain null values.

The changed encode_it function could look like this (I would be happy to send a pull request if that is the accepted solution):

import math
from typing import Any

def encode_it(o: Any) -> Any:
    if isinstance(o, dict):
        return {encode_it(k): encode_it(v) for k, v in o.items()}
    else:
        if isinstance(o, (bool, int, str)):
            return o
        if isinstance(o, float):
            if not math.isfinite(o):
                # Encode non-finite floats as None.
                # This is necessary because JSON does not support NaN/Infinity values.
                return None
            else:
                return o
        elif isinstance(o, list):
            [...]
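
Since the excerpt above is elided, here is a separate, self-contained sketch of the same idea. `replace_non_finite` is a hypothetical helper written for illustration, not the actual encode_it from pandas-profiling, and it only handles dicts, lists, tuples and floats.

```python
import json
import math
from typing import Any


def replace_non_finite(value: Any) -> Any:
    """Return a copy of `value` with NaN/Infinity floats replaced by None."""
    if isinstance(value, dict):
        return {key: replace_non_finite(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(item) for item in value]
    if isinstance(value, float) and not math.isfinite(value):
        # JSON has no NaN/Infinity, so these become null on serialization.
        return None
    return value


cleaned = replace_non_finite({"a": float("nan"), "b": [1.0, float("inf")]})
# allow_nan=False makes the encoder fail loudly if anything slipped through.
encoded = json.dumps(cleaned, allow_nan=False)
print(encoded)
```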

Data Description

DataFrame([1, 1, 1], columns=["a"])

Code that reproduces the bug

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame([1, 1, 1], columns=["a"])

profile = ProfileReport(df, title="Pandas Profiling Report", minimal=True)

print(profile.to_json())
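
To check whether a given output is strictly valid, one option is to parse it with Python's own decoder while rejecting the non-standard constants. This is only a sketch; `is_strict_json` is a hypothetical helper name, and it relies on the `parse_constant` hook of `json.loads`, which is called for NaN, Infinity and -Infinity.

```python
import json


def is_strict_json(text: str) -> bool:
    """Return True if `text` parses as JSON without NaN/Infinity constants."""

    def reject_constant(name: str):
        # Called by the decoder for 'NaN', 'Infinity' and '-Infinity'.
        raise ValueError(f"non-standard JSON constant: {name}")

    try:
        json.loads(text, parse_constant=reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_strict_json('{"kurtosis": null}'))
print(is_strict_json('{"kurtosis": NaN}'))
```

Applied to the string returned by profile.to_json(), a check like this would be expected to fail whenever the report contains a non-finite statistic.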

pandas-profiling version

spark-branch @ 9017c4a

Dependencies

joblib~=1.1.0
scipy>=1.4.1
matplotlib>=3.2.0
pydantic>=1.8.1
PyYAML>=5.0.0
jinja2>=2.11.1
markupsafe~=2.0.1
visions[type_image_path]==0.7.4
numpy>=1.16.0
htmlmin>=0.1.12
missingno>=0.4.2
phik>=0.11.1
tangled-up-in-unicode==0.2.0
requests>=2.24.0
tqdm>=4.48.2
seaborn>=0.10.1
multimethod>=1.4

OS

Windows 10

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Frequent Issues.

@sbrugman added the bug 🐛 label and removed the needs-triage label on May 17, 2022

@sbrugman
Collaborator

Well spotted @LukasBoersma! In addition to the proposals in your bug report, we could also consider writing non-finite floats as strings (which would result in a mixed-type field). This way there is no information loss.

At this moment my preference is your proposed solution above (perhaps with a parameter to toggle between the two ways of handling). A PR would be much appreciated; let's take it from there.
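
A toggle between the two ways of handling could look roughly like the sketch below. The `sanitize` function and its `non_finite` parameter are hypothetical names chosen for illustration; nothing here reflects an existing pandas-profiling setting.

```python
import math
from typing import Any


def sanitize(value: Any, non_finite: str = "null") -> Any:
    """Replace non-finite floats, either with None or with their string form.

    `non_finite` is a hypothetical option: "null" or "string".
    """
    if non_finite not in ("null", "string"):
        raise ValueError(f"unknown non_finite mode: {non_finite!r}")
    if isinstance(value, dict):
        return {k: sanitize(v, non_finite) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(v, non_finite) for v in value]
    if isinstance(value, float) and not math.isfinite(value):
        # "string" keeps the information ('nan', 'inf', '-inf') at the
        # cost of a mixed-type field; "null" drops it.
        return str(value) if non_finite == "string" else None
    return value


as_null = sanitize({"kurtosis": float("nan")})
as_string = sanitize({"kurtosis": float("nan"), "max": float("-inf")}, "string")
print(as_null, as_string)
```

The string mode preserves which non-finite value occurred, but consumers then have to handle a field that may hold either a number or a string.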

@sbrugman added the help wanted 🙋 label on May 17, 2022

@LukasBoersma
Author

Okay, cool! Then I will prepare a pull request in the next few days with a solution where the behavior is configurable.

@Vamp1899

Hi Lukas, I would like to suggest some refinements to this code:

  • Make the global class variables private and access them through functions.
  • Rather than defining a single value, passing it through the config, and checking it in the profile report, create separate instances for each case.

@fabclmnt added the spark ⚡ (PySpark features!) label and removed the spark ⚡ label on Jan 27, 2023