[FEATURE REQUEST]: Would like to use mapInPandas interface for "vector/pandas" -style udf.

**Is your feature request related to a problem? Please describe.**

I really like using the **Group/Apply** (RelationalGroupedDataset) interface that allows the rapid exchange of complex data with executors.  I use it regularly, in conjunction with the Apache Arrow recordbatch (not FxDataFrame).  The only problem is that most of the time when I'm using it, the logic that is executed isn't taking advantage of the "Groupings" (nor does it require the creation of the "RelationalGroupedDataset").  My path that takes me thru Group/Apply is fairly "artificial".  I only go down that road for the benefit of the "vector/pandas" udfs.  It is a means to an end.


**Describe the solution you'd like**

Would like it if the project would support this interface?
https://spark.apache.org/docs/3.0.0-preview/api/python/pyspark.sql.html#pyspark.sql.DataFrame.mapInPandas


**Describe alternatives you've considered**

I can use **GroupBy/Apply** in the meantime. But my code that calls GroupBy/Apply is not very readable and will cause confusion for others since I'm not actually leveraging the groups in any reasonable way.  The groups are just a means to an end. 
 Also from a performance standpoint my approach typically causes all my data to be shuffled to a single executor, prior to the execution of my recordbatch-UDF .  There is also a risk that the worker node of that executor will run out of resources since the resulting partition becomes enormous.

I've also investigated the use of the normal vector udf's (either the Arrow or DataAnalysis variety).   The problem is that we don't currently seem allow any complex types for results (ie. no structs are allowed). So if/when I need to return more than one piece of information (like a tuple), then I must revert to a row-by-row-styled UDF which runs slower.



**Additional context**

It would be very convenient to have "**mapInPandas**" since this appears to be a fairly common and popular mapping feature on the Python side of things.  I think it gives us a tool to solve some performance problems in a fairly natural and intuitive way.  I'm not sure if I like the "pandas" terminology but everything else about this interface is appealing.  It would be nice to prioritize the Apache Arrow (recordbatch) implementation of this first, and perhaps the FxDataFrame variation can follow along at a later time.




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[FEATURE REQUEST]: Would like to use mapInPandas interface for "vector/pandas" -style udf. #925

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEATURE REQUEST]: Would like to use mapInPandas interface for "vector/pandas" -style udf. #925

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions