ENH: DataFrame.struct.explode(column, *, separator=".") method to pull struct subfields into the parent DataFrame #59585

Open
1 of 3 tasks
tswast opened this issue Aug 23, 2024 · 2 comments
Labels
Arrow (pyarrow functionality), Enhancement, Needs Discussion (Requires discussion from core team before further action), Needs Info (Clarification about behavior needed to assess issue)

Comments

@tswast
Contributor

tswast commented Aug 23, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, I can use Series.struct.explode() to create a DataFrame out of the subfields of an ArrowDtype(pa.struct(...)) column. Joining these back to the original DataFrame is a little awkward. It'd be nice to have a top-level explode() similar to how DataFrame.explode() works on lists.

Feature Description

Add a new StructFrameAccessor to pandas/core/arrays/arrow/accessors.py. I think implementation could be almost identical to what I did here: googleapis/python-bigquery-dataframes#916

class StructFrameAccessor:
    """
    Accessor object for structured data properties of the DataFrame values.
    """

    def __init__(self, data: DataFrame) -> None:
        self._parent = data


    def explode(self, column, *, separator: str = "."):
        """
        Extract all child fields of struct column(s) and add to the DataFrame.

        **Examples:**

            >>> countries = pd.Series(["cn", "es", "us"])
            >>> files = pd.Series(
            ...     [
            ...         {"version": 1, "project": "pandas"},
            ...         {"version": 2, "project": "pandas"},
            ...         {"version": 1, "project": "numpy"},
            ...     ],
            ...     dtype=pd.ArrowDtype(pa.struct(
            ...         [("version", pa.int64()), ("project", pa.string())]
            ...     ))
            ... )
            >>> downloads = pd.Series([100, 200, 300])
            >>> df = pd.DataFrame({"country": countries, "file": files, "download_count": downloads})
            >>> df.struct.explode("file")
              country  file.version file.project  download_count
            0      cn             1       pandas             100
            1      es             2       pandas             200
            2      us             1        numpy             300
            [3 rows x 4 columns]

        Args:
            column:
                Column(s) to explode. For multiple columns, specify a
                non-empty list in which each element is a str or tuple.
                Each specified column must have an ArrowDtype struct dtype.
            separator:
                Separator/delimiter to use to separate the original column name
                from the sub-field column name.


        Returns:
            DataFrame:
                Original DataFrame with exploded struct column(s).
        """
        df = self._parent
        column_labels = check_column(column)

        for label in column_labels:
            position = df.columns.to_list().index(label)
            df = df.drop(columns=label)
            subfields = self._parent[label].struct.explode()
            for subfield in reversed(subfields.columns):
                df.insert(
                    position, f"{label}{separator}{subfield}", subfields[subfield]
                )

        return df


from typing import Hashable, Sequence, Union, cast

from pandas.api.types import is_list_like


def check_column(
    column: Union[Hashable, Sequence[Hashable]],
) -> Sequence[Hashable]:
    # Normalize a single label or a list-like of labels to a tuple of labels.
    if not is_list_like(column):
        column_labels = cast(Sequence[Hashable], (column,))
    else:
        column_labels = cast(Sequence[Hashable], tuple(column))

    # Reject empty or duplicated selections.
    if not column_labels:
        raise ValueError("column must be nonempty")
    if len(column_labels) > len(set(column_labels)):
        raise ValueError("column must be unique")

    return column_labels

Alternative Solutions

An alternative could be to modify DataFrame.explode to support exploding a struct into columns, potentially with an axis parameter to explode into columns instead of rows.

Additional Context

See also, the Series.struct accessor added last year: #54977

@tswast added the Enhancement and Needs Triage labels Aug 23, 2024
@rhshadrach
Member

Thanks for the request. From an API design perspective, having a dtype-specific accessor on DataFrame seems like a bad approach.

For the feature itself, it seems to me you can use Series.struct.explode along with pd.concat([df, ...,], axis=1). Does this not work in your use-case?

@rhshadrach added the Needs Discussion, Needs Info, and Arrow labels and removed the Needs Triage label Aug 25, 2024
@tswast
Contributor Author

tswast commented Aug 26, 2024

> From an API design perspective, having a dtype-specific accessor on DataFrame seems like a bad approach.

Gotcha. I did see we have some already. e.g. SparseFrameAccessor, which does seem SparseDtype specific.

> For the feature itself, it seems to me you can use Series.struct.explode along with pd.concat([df, ...,], axis=1). Does this not work in your use-case?

Some nice things about this feature not covered by pd.concat([df, ...,], axis=1):

  1. The columns appear in the same order as originally, with the sub-fields replacing the original column.
  2. The column names include the original column name (separated by separator="."), whereas Series.struct.explode() only returns the subfield names as the column names.
