-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add string trimming and padding functions #248
Conversation
extensions/functions_string.yaml
Outdated
- value: "varchar<L1>" | ||
- value: i32 | ||
- value: "varchar<L2>" | ||
return: "string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return for these padding functions kind of confused me. I guess the return would be a varchar, where the length is the i32 input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't make a return type depend on an argument value. Imagine for instance a project relation that applies this function to three separate columns; what would the type of the resulting column be? You could do this if Substrait would support something akin to non-type template arguments in C++, but currently it doesn't.
extensions/functions_string.yaml
Outdated
Remove any occurrence of the characters from the left side of the string. | ||
If no characters are specified, spaces are removed. | ||
impls: | ||
- args: | ||
- value: "varchar<L1>" | ||
- value: "varchar<L2>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know most (all?) other function definitions don't specify argument names either, but for these functions and descriptions I would really have to guess which argument does what (I can kinda tell based on the varchar sizes, but that's a pretty bad thing to have to rely on). In case you're unaware, according to the schema you should be able to do something like
impls:
- args:
- value: "varchar<L1>"
name: "input"
description: "The string to remove characters from."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to remove."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, I wasn't aware that I could do this. I'll make these changes. I came across a few other functions that probably could be more descriptive like this as well. I can submit a separate PR for those later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to do something like this directly in the function description instead of repeating it for all implementations of the function?
name: ltrim
description: >-
Remove any occurrence of the characters from the left side of the string.
If no characters are specified, spaces are removed.
arg0: input - The string to remove characters from.
arg1: characters - The set of characters to remove.```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question... that would be a lot of repetition indeed. Generally speaking though I suppose a function can have implementations with different argument counts as well, since AFAICT those "implementations" are basically just function overloads. Comparing it to a random C++ function for instance, the constructor of std::vector
has several overloads with very different semantics that would need separate argument descriptions to be captured. That being said... they also have separate overarching descriptions for each implementation. I'm curious what @jacques-n thinks about this, because I might also be mischaracterizing these implementations as generalized function overloads.
I'd still err toward repetition until something like what you described is standardized though (assuming it will be, I'm on the fence about it). Otherwise different committers are bound to come up with their own formats, and things will quickly become a mess.
@@ -163,3 +165,236 @@ scalar_functions: | |||
- value: "fixedchar<L1>" | |||
- value: "varchar<L2>" | |||
return: "BOOLEAN" | |||
- | |||
name: ltrim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With these trimming functions where the user can specify characters to remove, is there anything to clarify whether the characters are interpreted as-is or as a regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption was as-is, since that seems to be how Postgresql and DuckDB do it. For regex, they use different functions. I was planning on looking into those functions in another PR.
extensions/functions_string.yaml
Outdated
return: "string" | ||
- | ||
name: center | ||
description: Pad the string with characters from each side until the specified length of the string has been reached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this need an option to determine which side to add more items to if the number of character to add was an odd number? Or should the consumer just decide that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure either. From what i've seen, the default is that for uneven padding, the lesser amount of padding goes on the left. I can at least specify that in the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the reason I brought it up is that when I was writing R bindings for Arrow, this came up and I think we ended up requesting it being added as an option in Acero as the R library handled it differently to Acero. Not sure how common this is though - definitely worth checking as it's a pain to work around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the Substrait way would be to add an optional enum argument for that. A producer can leave that unset if it doesn't care what convention the consumer uses. I do think preferring more spacing on the right is the default though, because it just tends to look slightly more aesthetically pleasing (to me, anyway).
See #251 (comment), it's mostly inspired by this PR. |
8f0d7e3
to
4b994a6
Compare
42077f8
to
2ab528d
Compare
extensions/functions_string.yaml
Outdated
description: "The length of the output string." | ||
- value: "varchar<L2>" | ||
name: "characters" | ||
description: "The set of characters to use for padding." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the "set of characters" concept for trim functions (trim("foobar", "of") -> "bar"
), but not for padding. "Set" implies unordered. I don't think I've ever seen these functions in forms where you can specify more than one character either, but I guess Substrait doesn't really have a character type except maybe fixedchar<1>
. Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven, especially for the center
functions, where I can think of at least two similarly efficient algorithms that would have different results: center("x", 10, "abc")
-> replace_slice(substring(repeat("abc", 4), 1, 10), 5, 1, "x")
-> "abcaxcabca"
, and center("x", 10, "abc")
-> concat(substring(repeat("abc", 2), 1, 4), "x", substring(repeat("abc", 2), 1, 5))
-> "abcaxabcab"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll update the wording to 'string'. DuckDB only allows one character, but postgresql let's you use a string for the padding functions. I figured we would use whatever is more flexible.
Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven
I'll update the lpad and rpad descriptions to address these types of scenarios. Not too sure about center though.
To be honest, I also wasn't too convinced on the center function. I couldn't really find many places that implement it. I imagine the expectation is that someone could easily use a combination of lpad and rpad. I wonder if we should omit here as well?
cc: @ianmcook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought is that maybe we have center only take single character padding. Multiple character strings used for padding can be just for lpad/rpad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jvanstraten in the interest of getting a bunch of these other more common functions in, i've removed center for now. I'll create a separate task to look more into center and if/how other SQL variants deal with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR to add definitions for string trimming and padding functions.