Feature Proposal: clean_json functionality in the clean module #902

Open
@yixuy

Summary

Implement clean_json() functionality to parse and clean JSON data stored in a DataFrame column.

Design-level Explanation

  • Investigate approaches for matching and validating JSON files.
  • Discuss output formats to be supported.

Design-level Explanation Actions

from typing import Union

import dask.dataframe as dd
import pandas as pd


def clean_json(
    df: Union[pd.DataFrame, dd.DataFrame],
    col: str,
    fix_missing: str = "empty",
    split: bool = False,
    inplace: bool = False,
    report: bool = True,
    errors: str = "coerce",
) -> pd.DataFrame:
    """
    Clean a DataFrame column containing JSON strings.

    Parameters
    ----------
    df
        Pandas or Dask DataFrame.
    col
        Column name containing JSON.
    split
        If True, split a column containing a JSON into different
        columns containing individual components.
    inplace
        If True, delete the given column with dirty data. Else, create a new
        column with cleaned data.
    report
        If True, output the summary report. Else, no report is outputted.
    errors
        One of {'ignore', 'raise', 'coerce'}, default 'coerce'.
        * If 'raise', then invalid parsing will raise an exception.
        * If 'coerce', then invalid parsing will be set as NaN.
        * If 'ignore', then invalid parsing will return the input.
    """
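
To make the proposed `errors`, `split`, and `inplace` semantics concrete, here is a minimal pandas-only sketch. It is not the proposed implementation: the name `clean_json_sketch`, the helper `_parse`, the `{col}_clean` output column, and the `{col}_` prefix for split columns are all assumptions made for illustration, and `fix_missing`, `report`, and Dask support are left out.

```python
import json
from typing import Any

import numpy as np
import pandas as pd


def _parse(value: Any, errors: str) -> Any:
    """Hypothetical helper: parse one cell according to the `errors` policy."""
    try:
        return json.loads(value)
    except (TypeError, ValueError) as exc:  # JSONDecodeError subclasses ValueError
        if errors == "raise":
            raise ValueError(f"unable to parse value {value!r}") from exc
        # 'ignore' returns the input unchanged, 'coerce' replaces it with NaN.
        return value if errors == "ignore" else np.nan


def clean_json_sketch(
    df: pd.DataFrame,
    col: str,
    split: bool = False,
    inplace: bool = False,
    errors: str = "coerce",
) -> pd.DataFrame:
    if errors not in {"ignore", "raise", "coerce"}:
        raise ValueError(f"unknown errors value: {errors!r}")

    out = df.copy()
    parsed = out[col].apply(lambda v: _parse(v, errors))
    out[f"{col}_clean"] = parsed  # assumed naming for the cleaned column

    if split:
        # Expand the top-level keys of parsed objects into separate columns.
        # Rows that did not parse to an object contribute no keys.
        records = [v if isinstance(v, dict) else {} for v in parsed]
        parts = pd.DataFrame(records, index=out.index)
        out = pd.concat([out, parts.add_prefix(f"{col}_")], axis=1)

    if inplace:
        # Following the docstring: drop the original column with dirty data.
        out = out.drop(columns=[col])
    return out
```

With this sketch, `clean_json_sketch(df, "payload", split=True)` would keep `payload`, add `payload_clean`, and add one `payload_<key>` column per top-level key found in the parsed objects.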

Design-level Explanation

def validate_json(x: Union[str, pd.Series]) -> Union[bool, pd.Series]:
    """
    Function to validate JSON format.

    Parameters
    ----------
    x
        String or Pandas Series of JSON to be validated.
    """
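
One possible approach to validation is to attempt a parse with the standard-library `json` module and report whether it succeeds. The sketch below assumes that approach; the helper `_is_valid_json` is hypothetical, and treating non-string values as invalid is an assumption rather than something the proposal specifies.

```python
import json
from typing import Union

import pandas as pd


def _is_valid_json(value: object) -> bool:
    """Hypothetical helper: True if `value` is a string that parses as JSON."""
    if not isinstance(value, str):
        return False  # assumption: None, NaN, and other non-strings are invalid
    try:
        json.loads(value)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def validate_json(x: Union[str, pd.Series]) -> Union[bool, pd.Series]:
    """Validate a single string, or each element of a Series."""
    if isinstance(x, pd.Series):
        return x.apply(_is_valid_json)
    return _is_valid_json(x)
```

Under this sketch a bare scalar such as `"1"` counts as valid, since `json.loads` accepts any JSON value and not only objects or arrays. Whether `validate_json` should be stricter is a design question for the proposal.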

Implementation-level Explanation

Rationale and Alternatives

Prior Art

Python's built-in json library can encode and decode JSON, and its encoders and decoders preserve input and output order by default.
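
A short illustration of that ordering behaviour, using only the standard library (the sample string is made up for the example):

```python
import json

text = '{"b": 1, "a": 2, "c": 3}'

# Decoding keeps keys in the order they appear in the input.
decoded = json.loads(text)
key_order = list(decoded)

# Encoding keeps the dict's insertion order unless sort_keys=True is passed.
encoded = json.dumps(decoded)
encoded_sorted = json.dumps(decoded, sort_keys=True)

print(key_order)       # ['b', 'a', 'c']
print(encoded)         # {"b": 1, "a": 2, "c": 3}
print(encoded_sorted)  # {"a": 2, "b": 1, "c": 3}
```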

Future Possibilities

Implementation-level Actions

Additional Tasks

  • This task is put into the correct pipeline (Development Backlog or In Progress).
  • The label of this task is set correctly.
  • The issue is assigned to the correct person.
  • The issue is linked to the related Epic.
  • The documentation is changed accordingly.
  • Tests are added accordingly.
