Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add string trimming and padding functions #248

Merged
merged 7 commits into from
Jul 28, 2022

Conversation

richtia
Copy link
Member

@richtia richtia commented Jul 15, 2022

PR to add definitions for string trimming and padding functions.

- value: "varchar<L1>"
- value: i32
- value: "varchar<L2>"
return: "string"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return for these padding functions kind of confused me. I guess the return would be a varchar, where the length is the i32 input?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't make a return type depend on an argument value. Imagine for instance a project relation that applies this function to three separate columns; what would the type of the resulting column be? You could do this if Substrait would support something akin to non-type template arguments in C++, but currently it doesn't.

Comment on lines 171 to 264
Remove any occurrence of the characters from the left side of the string.
If no characters are specified, spaces are removed.
impls:
- args:
- value: "varchar<L1>"
- value: "varchar<L2>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know most (all?) other function definitions don't specify argument names either, but for these functions and descriptions I would really have to guess which argument does what (I can kinda tell based on the varchar sizes, but that's a pretty bad thing to have to rely on). In case you're unaware, according to the schema you should be able to do something like

impls:
  - args:
      - value: "varchar<L1>"
        name: "input"
        description: "The string to remove characters from."
      - value: "varchar<L2>"
        name: "characters"
        description: "The set of characters to remove."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I wasn't aware that I could do this. I'll make these changes. I came across a few other functions that probably could be more descriptive like this as well. I can submit a separate PR for those later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to do something like this directly in the function description instead of repeating it for all implementations of the function?

    name: ltrim
    description: >-
      Remove any occurrence of the characters from the left side of the string.
      If no characters are specified, spaces are removed.

      arg0: input - The string to remove characters from.
      arg1: characters - The set of characters to remove.```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question... that would be a lot of repetition indeed. Generally speaking though I suppose a function can have implementations with different argument counts as well, since AFAICT those "implementations" are basically just function overloads. Comparing it to a random C++ function for instance, the constructor of std::vector has several overloads with very different semantics that would need separate argument descriptions to be captured. That being said... they also have separate overarching descriptions for each implementation. I'm curious what @jacques-n thinks about this, because I might also be mischaracterizing these implementations as generalized function overloads.

I'd still err toward repetition until something like what you described is standardized though (assuming it will be, I'm on the fence about it). Otherwise different committers are bound to come up with their own formats, and things will quickly become a mess.

@@ -163,3 +165,236 @@ scalar_functions:
- value: "fixedchar<L1>"
- value: "varchar<L2>"
return: "BOOLEAN"
-
name: ltrim
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With these trimming functions where the user can specify characters to remove, is there anything to clarify whether the characters are interpreted as-is or as a regex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption was as-is, since that seems to be how Postgresql and DuckDB do it. For regex, they use different functions. I was planning on looking into those functions in another PR.

return: "string"
-
name: center
description: Pad the string with characters from each side until the specified length of the string has been reached.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this need an option to determine which side to add more items to if the number of character to add was an odd number? Or should the consumer just decide that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure either. From what i've seen, the default is that for uneven padding, the lesser amount of padding goes on the left. I can at least specify that in the description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason I brought it up is that when I was writing R bindings for Arrow, this came up and I think we ended up requesting it being added as an option in Acero as the R library handled it differently to Acero. Not sure how common this is though - definitely worth checking as it's a pain to work around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the Substrait way would be to add an optional enum argument for that. A producer can leave that unset if it doesn't care what convention the consumer uses. I do think preferring more spacing on the right is the default though, because it just tends to look slightly more aesthetically pleasing (to me, anyway).

@jvanstraten
Copy link
Contributor

See #251 (comment), it's mostly inspired by this PR.

@richtia richtia requested a review from jvanstraten July 26, 2022 21:56
description: "The length of the output string."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to use for padding."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the "set of characters" concept for trim functions (trim("foobar", "of") -> "bar"), but not for padding. "Set" implies unordered. I don't think I've ever seen these functions in forms where you can specify more than one character either, but I guess Substrait doesn't really have a character type except maybe fixedchar<1>. Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven, especially for the center functions, where I can think of at least two similarly efficient algorithms that would have different results: center("x", 10, "abc") -> replace_slice(substring(repeat("abc", 4), 1, 10), 5, 1, "x") -> "abcaxcabca", and center("x", 10, "abc") -> concat(substring(repeat("abc", 2), 1, 4), "x", substring(repeat("abc", 2), 1, 5)) -> "abcaxabcab".

Copy link
Member Author

@richtia richtia Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll update the wording to 'string'. DuckDB only allows one character, but postgresql let's you use a string for the padding functions. I figured we would use whatever is more flexible.

Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven

I'll update the lpad and rpad descriptions to address these types of scenarios. Not too sure about center though.

To be honest, I also wasn't too convinced on the center function. I couldn't really find many places that implement it. I imagine the expectation is that someone could easily use a combination of lpad and rpad. I wonder if we should omit here as well?

cc: @ianmcook

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought is that maybe we have center only take single character padding. Multiple character strings used for padding can be just for lpad/rpad

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvanstraten in the interest of getting a bunch of these other more common functions in, i've removed center for now. I'll create a separate task to look more into center and if/how other SQL variants deal with it.

@richtia richtia requested a review from jvanstraten July 28, 2022 15:16
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jacques-n jacques-n merged commit 8a8f65d into substrait-io:main Jul 28, 2022
@richtia richtia mentioned this pull request Aug 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants