fix: Multiple Spark Enhancements #1800
This is a single PR that addresses multiple issues at once:
I'm closing my other PRs (see the list below) because, once I started resolving each issue individually, other rendering issues cropped up, so it's easier to test all of the fixes together.
Main Goal
Taken together, the previous Spark issues all boil down to the fact that the Spark reports were inaccurate when compared with pandas. My main goal here was to bring the Spark backend closer to parity with the pandas output.
Example of pandas output on a toy dataset:

Example of broken (pre-fix) Spark output on the same toy dataset:

Fixed Spark output on the toy dataset:

Issues and Root Causes
There are a couple of commits in here that address the specific root causes of these discrepancies. Here are the issues, summarized, along with their solutions:
Issue 1: By default, pandas counts NaN values as null in summary stats, but Spark SQL does not, so one of the commits addresses that explicitly.
Issue 2: Missing values were not being calculated correctly because NaN is not null in Spark, so NaN values weren't counted as missing when they should have been (see the first sketch after this list).
Issue 3: The histogram and Common Values counts built from the summary["value_counts_without_nan"] Series were not summing counts correctly.
Resolution: Summing the counts during aggregation and removing the limit(200) brings everything into parity with the pandas output (see the second sketch after this list).
NOTE: Since we're pre-aggregating the data for value_counts, I don't think the limit(200) is necessary even with Spark. We're pulling the result down into a pandas Series anyway, so if the data were too big the process would fail explicitly instead of producing a misleading report. If you're running this in Spark, you're likely on a machine with a good amount of memory anyway.
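For context on Issues 1 and 2, here's a minimal sketch (not the actual ydata-profiling code; the column name `value` is a toy assumption) of the idea: Spark's `isNull` does not treat NaN as missing, so the null and NaN checks have to be combined to match pandas' notion of "missing":

```python
# Minimal sketch (not the exact PR code): count missing values the way pandas
# does, i.e. treat NaN in a float column as missing alongside null.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with both a null and a NaN in a numeric column.
df = spark.createDataFrame(
    [(1.0,), (float("nan"),), (None,), (4.0,)],
    ["value"],
)

# isNull alone would only find 1 missing value; combining it with isnan
# finds 2, which matches pandas' isna()/describe() behavior.
n_missing = df.filter(F.col("value").isNull() | F.isnan("value")).count()
print(n_missing)  # 2
```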
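And a similar sketch of the Issue 3 resolution (again not the exact PR code, same toy column name): pre-aggregate the value counts in Spark, filter out null/NaN to mirror value_counts_without_nan, and skip any limit(200) so the counts sum correctly before the aggregated result is pulled into a pandas Series:

```python
# Minimal sketch: Spark-side aggregation of value counts, no limit(200).
# Only one row per distinct value is collected to the driver.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0,), (1.0,), (2.0,), (float("nan"),), (None,), (4.0,)],
    ["value"],
)

value_counts = (
    df.filter(~(F.col("value").isNull() | F.isnan("value")))  # "without_nan"
      .groupBy("value")
      .agg(F.count("*").alias("count"))   # sum the counts per distinct value
      .orderBy(F.desc("count"))
      .toPandas()                         # only #distinct-values rows come back
      .set_index("value")["count"]
)
print(value_counts)  # counts sum to 4, matching pandas' value_counts(dropna=True)
```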
Concluding Thoughts
While there is still some very slight variation in the computed stats because Spark handles nulls/NaNs differently than pandas, I think this new output is acceptably close to the pandas version, and any remaining differences are negligible. That's especially true compared with the initial output, where the differences were misleading without these fixes, or where reports would not even complete/render for some edge cases (e.g., all-null numeric columns).
@fabclmnt - Apologies for all of the tags; I'm still open to any feedback on this approach! I'm happy to discuss further, and I hope this is helpful to anyone using the Spark backend.
Misc. Notes: