Skip to content

[FEATURE REQUEST]: Would like to use mapInPandas interface for "vector/pandas" -style udf. #925

Open
@dbeavon

Description

@dbeavon

Is your feature request related to a problem? Please describe.

I really like using the Group/Apply (RelationalGroupedDataset) interface that allows the rapid exchange of complex data with executors. I use it regularly, in conjunction with the Apache Arrow recordbatch (not FxDataFrame). The only problem is that most of the time when I'm using it, the logic that is executed isn't taking advantage of the "Groupings" (nor does it require the creation of the "RelationalGroupedDataset"). My path that takes me thru Group/Apply is fairly "artificial". I only go down that road for the benefit of the "vector/pandas" udfs. It is a means to an end.

Describe the solution you'd like

Would like it if the project would support this interface?
https://spark.apache.org/docs/3.0.0-preview/api/python/pyspark.sql.html#pyspark.sql.DataFrame.mapInPandas

Describe alternatives you've considered

I can use GroupBy/Apply in the meantime. But my code that calls GroupBy/Apply is not very readable and will cause confusion for others since I'm not actually leveraging the groups in any reasonable way. The groups are just a means to an end.
Also from a performance standpoint my approach typically causes all my data to be shuffled to a single executor, prior to the execution of my recordbatch-UDF . There is also a risk that the worker node of that executor will run out of resources since the resulting partition becomes enormous.

I've also investigated the use of the normal vector udf's (either the Arrow or DataAnalysis variety). The problem is that we don't currently seem allow any complex types for results (ie. no structs are allowed). So if/when I need to return more than one piece of information (like a tuple), then I must revert to a row-by-row-styled UDF which runs slower.

Additional context

It would be very convenient to have "mapInPandas" since this appears to be a fairly common and popular mapping feature on the Python side of things. I think it gives us a tool to solve some performance problems in a fairly natural and intuitive way. I'm not sure if I like the "pandas" terminology but everything else about this interface is appealing. It would be nice to prioritize the Apache Arrow (recordbatch) implementation of this first, and perhaps the FxDataFrame variation can follow along at a later time.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions