[C++] split kernels for strings/binary #26016

@asfimport

Description

Similar to Python str.split and bytes.split, we'd like to have a way to convert str into list[str] (and similarly for bytes).

When the separator is given, the algorithms for both types are the same. Python, however, overloads split: when given no separator, it splits on runs of whitespace (Unicode whitespace for str, ASCII whitespace for bytes).
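For reference, a small sketch of the Python behavior being described. The sample string is only an illustration; it contains a double ASCII space and a Unicode EM SPACE (U+2003) so that the str and bytes variants visibly diverge when no separator is given.

```python
# Sample input: two ASCII spaces, then an EM SPACE (U+2003), which is
# Unicode whitespace but not ASCII whitespace.
text = "a  b\u2003c"
data = text.encode("utf-8")  # EM SPACE becomes the bytes e2 80 83

# With an explicit separator, str and bytes behave the same way:
# every occurrence splits, so adjacent separators yield empty items.
with_sep_str = text.split(" ")      # ['a', '', 'b\u2003c']
with_sep_bytes = data.split(b" ")   # [b'a', b'', b'b\xe2\x80\x83c']

# Without a separator, runs of whitespace are collapsed, and the
# definition of whitespace depends on the type.
no_sep_str = text.split()           # ['a', 'b', 'c'] (Unicode whitespace)
no_sep_bytes = data.split()         # [b'a', b'b\xe2\x80\x83c'] (ASCII only)
```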

I'd rather not have heavily overloaded kernels. Instead, there could be separate kernels, e.g.

binary_split (takes string/binary separator, and maxsplit arg, no special utf8 version needed)

utf8_split_whitespace (similar to Python's version given no separator)

ascii_split_whitespace (similar to Python's version given no separator, but considering ascii, although this could work on any binary data)

There could also be rsplit versions of these, or the split direction could be an argument.
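The proposal above can be sketched as a pure-Python reference model. This is not the Arrow C++ API: arrays are modeled as plain lists, `None` stands in for a null entry (passing nulls through is an assumption), and the `reverse` flag is a hypothetical way to express the "rsplit as an argument" option.

```python
from typing import List, Optional


def _split_each(values, separator, maxsplit, reverse):
    # Shared helper: apply split/rsplit per element, passing nulls through.
    out = []
    for v in values:
        if v is None:
            out.append(None)
        elif reverse:
            out.append(v.rsplit(separator, maxsplit))
        else:
            out.append(v.split(separator, maxsplit))
    return out


def binary_split(values: List[Optional[bytes]], separator: bytes,
                 maxsplit: int = -1, reverse: bool = False):
    # Explicit separator: byte-wise matching, so no utf8-specific version is needed.
    return _split_each(values, separator, maxsplit, reverse)


def utf8_split_whitespace(values: List[Optional[str]],
                          maxsplit: int = -1, reverse: bool = False):
    # No separator: split on runs of Unicode whitespace, like str.split().
    return _split_each(values, None, maxsplit, reverse)


def ascii_split_whitespace(values: List[Optional[bytes]],
                           maxsplit: int = -1, reverse: bool = False):
    # No separator: split on runs of ASCII whitespace, like bytes.split().
    return _split_each(values, None, maxsplit, reverse)


left = binary_split([b"a,b,c", None], b",", maxsplit=1)
right = binary_split([b"a,b,c", None], b",", maxsplit=1, reverse=True)
uni = utf8_split_whitespace(["a\u2003b c"])
asc = ascii_split_whitespace(["a\u2003b c".encode("utf-8")])
```

With `maxsplit=1`, `left` keeps the remainder on the right-hand side and `right` keeps it on the left-hand side, which is the only difference between the split and rsplit variants.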

 

Reporter: Maarten Breddels / @maartenbreddels
Assignee: Maarten Breddels / @maartenbreddels

Note: This issue was originally created as ARROW-9991. Please see the migration documentation for further details.
