Skip to content

Conversation

@squeek502
Copy link
Member

@squeek502 squeek502 commented May 5, 2023

  • This allows users to choose which version they need for their particular use case, as the previous default (now the 'sequence' version for split, and the 'any' version for tokenize) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.
  • std.mem.split is now an alias for std.mem.splitSequence and std.mem.tokenize is now an alias for std.mem.tokenizeAny with a deprecation doc comment on each of split and tokenize
  • Everywhere that can now use the 'scalar' version should get a nice little performance boost. EDIT: The performance boost might only apply to split, didn't check the performance difference of tokenizeAny versus tokenizeScalar

Any suggestions for a better name for the Full version are welcome. EDIT: Maybe Sequence? EDIT#2: Went with Sequence for now

@squeek502 squeek502 requested a review from kristoff-it as a code owner May 5, 2023 01:23
@ghost
Copy link

ghost commented May 5, 2023

I was curious myself so I investigated in Godbolt https://zig.godbolt.org/z/84qYfvejq (see export fn xy at top), and indeed, the code size is a lot less. The compiler is able to figure out splitAny and splitScalar with constant values though and will generate the same code.

How about splitAll instead of splitFull? I think "all" complements the "any" well.
And then instead of "scalar" we could say "one". It sounds very simple... but that's kind of the point:
splitOne, splitAny, splitAll. Now the names are also the same length. Not sure though.

Alternatively, or in addition, what if we have a sensible default which would be simply called split? Maybe we can make splitFull the default so it becomes split as before... I do think that would be okay because that behavior is what most other split functions from other languages use too, such as JavaScript or Python. And from a logical perspective it makes the most sense too.

And then for any advanced uses one can use one of the two new functions you added here.

We already do this for other functions too. There is std.mem.lastIndexOf, and std.mem.lastIndexOfAny and std.mem.lastIndexOfScalar.

@xdBronch
Copy link
Contributor

xdBronch commented May 5, 2023

I agree with keeping the name of the new splitFull as just split. The name makes enough sense and imo its best to avoid any type of breaking change if its not necessary, even something as small as a name change.

@squeek502
Copy link
Member Author

squeek502 commented May 5, 2023

The reason for deprecating split and tokenize in favor of only the explicitly suffixed versions is that the default behavior is not consistent between them: the default for split is what-is-now splitFull, but the default for tokenize is what-is-now tokenizeAny. Not sure how to square that circle without changing the default behavior of tokenize (doesn't sound good) or forcing at least tokenize to be explicit (i.e. deprecrating the un-suffixed version). I just went with deprecating both since at some point it will cause people to need to explicitly choose which to update their code to use (similar to the ensureCapacity -> ensureTotalCapacity/ensureUnusedCapacity split that happened previously).

Other than that,

  • I think All as a suffix is a bit too ambiguous since it sounds like it could apply to the buffer param or the delimiter param (EDIT: Full suffers from this too)
  • Scalar matches indexOfScalar and other std.mem functions, and same with Any (indexOfAny)

EDIT: Also, just for clarity, in this PR std.mem.split is an alias for splitFull and std.mem.tokenize is an alias for tokenizeAny with a deprecation doc comment on each. Added this bit to the OP since it's easy to miss it in the diff.

@squeek502 squeek502 changed the title Split std.mem.split and tokenize into full, any, and scalar versions Split std.mem.split and tokenize into sequence, any, and scalar versions May 7, 2023
squeek502 added 6 commits May 13, 2023 13:43
…ter type: full, any, and scalar

This allows users to choose which version they need for their particular use case, as the previous default (now the 'full' version) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.
…y, and scalar

This allows users to choose which version they need for their particular use case, as the previous default (now the 'any' version) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.
Everywhere that can now use `tokenizeScalar` should get a nice little performance boost.
Everywhere that can now use `splitScalar` should get a nice little performance boost.
I think this makes the name less ambiguous and more obvious that the suffix applies to the `delimiter`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants