feat: add string trimming and padding functions #248

richtia · 2022-07-15T23:16:07Z

PR to add definitions for string trimming and padding functions.

richtia · 2022-07-15T23:19:34Z

extensions/functions_string.yaml

+          - value: "varchar<L1>"
+          - value: i32
+          - value: "varchar<L2>"
+        return: "string"


The return for these padding functions kind of confused me. I guess the return would be a varchar, where the length is the i32 input?

You can't make a return type depend on an argument value. Imagine for instance a project relation that applies this function to three separate columns; what would the type of the resulting column be? You could do this if Substrait supported something akin to non-type template arguments in C++, but currently it doesn't.

jvanstraten · 2022-07-18T09:51:38Z

extensions/functions_string.yaml

+      Remove any occurrence of the characters from the left side of the string.
+      If no characters are specified, spaces are removed.
+    impls:
+      - args:
+          - value: "varchar<L1>"
+          - value: "varchar<L2>"


I know most (all?) other function definitions don't specify argument names either, but for these functions and descriptions I would really have to guess which argument does what (I can kinda tell based on the varchar sizes, but that's a pretty bad thing to have to rely on). In case you're unaware, according to the schema you should be able to do something like

    impls:
      - args:
          - value: "varchar<L1>"
            name: "input"
            description: "The string to remove characters from."
          - value: "varchar<L2>"
            name: "characters"
            description: "The set of characters to remove."

Oh yeah, I wasn't aware that I could do this. I'll make these changes. I came across a few other functions that probably could be more descriptive like this as well. I can submit a separate PR for those later.

Would it make more sense to do something like this directly in the function description instead of repeating it for all implementations of the function?

    name: ltrim
    description: >-
      Remove any occurrence of the characters from the left side of the string.
      If no characters are specified, spaces are removed.
      arg0: input - The string to remove characters from.
      arg1: characters - The set of characters to remove.

Good question... that would be a lot of repetition indeed. Generally speaking though I suppose a function can have implementations with different argument counts as well, since AFAICT those "implementations" are basically just function overloads. Comparing it to a random C++ function for instance, the constructor of std::vector has several overloads with very different semantics that would need separate argument descriptions to be captured. That being said... they also have separate overarching descriptions for each implementation. I'm curious what @jacques-n thinks about this, because I might also be mischaracterizing these implementations as generalized function overloads.

I'd still err toward repetition until something like what you described is standardized though (assuming it will be, I'm on the fence about it). Otherwise different committers are bound to come up with their own formats, and things will quickly become a mess.

thisisnic · 2022-07-20T22:20:32Z

extensions/functions_string.yaml

@@ -163,3 +165,236 @@ scalar_functions:
          - value: "fixedchar<L1>"
          - value: "varchar<L2>"
        return: "BOOLEAN"
+  -
+    name: ltrim


With these trimming functions where the user can specify characters to remove, is there anything to clarify whether the characters are interpreted as-is or as a regex?

My assumption was as-is, since that seems to be how PostgreSQL and DuckDB do it. For regex, they use different functions. I was planning on looking into those functions in another PR.
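For illustration, a minimal Python sketch of the as-is reading assumed above: the second argument is a literal set of characters, never a pattern. The name `ltrim` and the single-space default mirror the description in this PR; the implementation is an assumption and is not taken from any consumer.

```python
def ltrim(input: str, characters: str = " ") -> str:
    """Strip leading characters that appear in `characters` (a literal set)."""
    # str.lstrip treats its argument as a set of literal characters,
    # so regex metacharacters such as "." or "*" carry no special meaning.
    return input.lstrip(characters)


print(ltrim("   abc"))         # abc
print(ltrim("xyxabcx", "xy"))  # abcx
print(ltrim(".*abc", "."))     # *abc
```

Under a regex interpretation the last call would strip everything, because `.` matches any character; with the as-is reading only the literal dot is removed.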

thisisnic · 2022-07-20T22:23:22Z

extensions/functions_string.yaml

+        return: "string"
+  -
+    name: center
+    description: Pad the string with characters from each side until the specified length of the string has been reached.


Would this need an option to determine which side to add more items to if the number of characters to add was an odd number? Or should the consumer just decide that?

I'm not actually sure either. From what I've seen, the default is that for uneven padding, the lesser amount of padding goes on the left. I can at least specify that in the description.

So the reason I brought it up is that when I was writing R bindings for Arrow, this came up and I think we ended up requesting it being added as an option in Acero as the R library handled it differently to Acero. Not sure how common this is though - definitely worth checking as it's a pain to work around.

I guess the Substrait way would be to add an optional enum argument for that. A producer can leave that unset if it doesn't care what convention the consumer uses. I do think preferring more spacing on the right is the default though, because it just tends to look slightly more aesthetically pleasing (to me, anyway).
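To make the odd-padding question concrete, here is a small Python sketch of a single-character `center` with an explicit side option. The `extra` parameter and its values are hypothetical stand-ins for the optional enum argument discussed above; nothing here reflects an actual Substrait or Acero signature.

```python
def center(input: str, length: int, pad: str = " ", extra: str = "right") -> str:
    """Center `input` in a field of `length` using a single pad character.

    `extra` is a hypothetical option naming the side that receives the
    additional pad character when the total padding is odd.
    """
    if len(pad) != 1:
        raise ValueError("pad must be exactly one character")
    if extra not in ("left", "right"):
        raise ValueError("extra must be 'left' or 'right'")
    # Inputs that are already long enough are returned unchanged (an assumption).
    total = max(length - len(input), 0)
    small, large = total // 2, total - total // 2
    left, right = (large, small) if extra == "left" else (small, large)
    return pad * left + input + pad * right


print(center("ab", 5, "*", extra="right"))  # *ab**
print(center("ab", 5, "*", extra="left"))   # **ab*
```

With an even amount of padding both settings give the same result, so the option only matters in the odd case.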

jvanstraten · 2022-07-21T09:20:09Z

See #251; it's mostly inspired by this PR.

jvanstraten · 2022-07-27T08:34:21Z

extensions/functions_string.yaml

+            description: "The length of the output string."
+          - value: "varchar<L2>"
+            name: "characters"
+            description: "The set of characters to use for padding."


I understand the "set of characters" concept for trim functions (trim("foobar", "of") -> "bar"), but not for padding. "Set" implies unordered. I don't think I've ever seen these functions in forms where you can specify more than one character either, but I guess Substrait doesn't really have a character type except maybe fixedchar<1>. Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven, especially for the center functions, where I can think of at least two similarly efficient algorithms that would have different results: center("x", 10, "abc") -> replace_slice(substring(repeat("abc", 4), 1, 10), 5, 1, "x") -> "abcaxcabca", and center("x", 10, "abc") -> concat(substring(repeat("abc", 2), 1, 4), "x", substring(repeat("abc", 2), 1, 5)) -> "abcaxabcab".
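The two algorithms described above can be sketched in Python as follows. The function names `center_overlay` and `center_concat` and the helper `_fill` are made up for this illustration, and the sketch assumes a non-empty padding string and a target length at least as long as the input.

```python
def _fill(pad: str, n: int) -> str:
    """Repeat `pad` and cut the result to exactly `n` characters."""
    return (pad * (n // len(pad) + 1))[:n]


def center_overlay(input: str, length: int, pad: str) -> str:
    """Build one continuous pad pattern, then overwrite its middle."""
    start = (length - len(input)) // 2
    pattern = _fill(pad, length)
    return pattern[:start] + input + pattern[start + len(input):]


def center_concat(input: str, length: int, pad: str) -> str:
    """Pad each side independently, restarting the pattern on the right."""
    total = length - len(input)
    left = total // 2
    return _fill(pad, left) + input + _fill(pad, total - left)


print(center_overlay("x", 10, "abc"))  # abcaxcabca
print(center_concat("x", 10, "abc"))   # abcaxabcab
```

Both results are ten characters long and place the input at the same position; they differ only in whether the padding pattern continues across the input or restarts after it.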

Good point. I'll update the wording to 'string'. DuckDB only allows one character, but PostgreSQL lets you use a string for the padding functions. I figured we would use whatever is more flexible.

> Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven

I'll update the lpad and rpad descriptions to address these types of scenarios. Not too sure about center though.
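As a reference point for those scenarios, here is a Python sketch of `lpad` and `rpad` following the behavior PostgreSQL documents for its functions of the same name: the fill string is repeated, cut to the number of characters still needed, and an input longer than the target length is truncated on the right. Whether the Substrait descriptions end up matching this exactly is an assumption, as is the handling of an empty fill string.

```python
def _fill(pad: str, n: int) -> str:
    """Repeat `pad` and cut the result to exactly `n` characters."""
    return (pad * (n // len(pad) + 1))[:n]


def lpad(input: str, length: int, pad: str = " ") -> str:
    """Left-pad `input` to `length`, truncating on the right if it is too long."""
    length = max(length, 0)
    if len(input) >= length:
        return input[:length]
    if not pad:
        return input  # assumption: an empty fill string pads nothing
    return _fill(pad, length - len(input)) + input


def rpad(input: str, length: int, pad: str = " ") -> str:
    """Right-pad `input` to `length`, truncating on the right if it is too long."""
    length = max(length, 0)
    if len(input) >= length:
        return input[:length]
    if not pad:
        return input  # assumption: an empty fill string pads nothing
    return input + _fill(pad, length - len(input))


print(lpad("hi", 5, "xy"))  # xyxhi
print(rpad("hi", 5, "xy"))  # hixyx
print(lpad("hello", 3))     # hel
```

Here three fill characters are needed but the fill string has two, so the repeated pattern is simply cut off partway through.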

To be honest, I also wasn't too convinced on the center function. I couldn't really find many places that implement it. I imagine the expectation is that someone could easily use a combination of lpad and rpad. I wonder if we should omit it here as well?

cc: @ianmcook

Another thought is that maybe we have center only take single-character padding. Multi-character padding strings can be just for lpad/rpad.

@jvanstraten in the interest of getting a bunch of these other more common functions in, I've removed center for now. I'll create a separate task to look more into center and if/how other SQL variants deal with it.

jvanstraten

LGTM

jvanstraten mentioned this pull request Jul 21, 2022

Implicit casts/promotions in functions within Substrait's core extensions #251

Closed

richtia added 2 commits July 25, 2022 13:55

feat: add string trimming and padding functions

f510b7d

refactor: add names and descriptions for args

3bfdb96

richtia force-pushed the string_trimming_padding branch from 8f0d7e3 to 4b994a6 Compare July 25, 2022 20:56

richtia added 2 commits July 25, 2022 20:22

refactor: remove promotions and function definitions for fixedchar

2b0e096

Merge branch 'main' into string_trimming_padding

2ab528d

richtia force-pushed the string_trimming_padding branch from 42077f8 to 2ab528d Compare July 26, 2022 03:24

refactor: update arg lengths

cf41bdf

richtia requested a review from jvanstraten July 26, 2022 21:56

jvanstraten requested changes Jul 27, 2022

richtia added 2 commits July 27, 2022 10:59

refactor: update descriptions for lpad and rpad

bdcba3e

refactor: remove center function

0b47bd5

richtia requested a review from jvanstraten July 28, 2022 15:16

jvanstraten approved these changes Jul 28, 2022

jacques-n merged commit 8a8f65d into substrait-io:main Jul 28, 2022

richtia mentioned this pull request Aug 8, 2022

feat: add center function #282

Merged
