Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add string trimming and padding functions #248

Merged
merged 7 commits into from
Jul 28, 2022
Merged
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
183 changes: 180 additions & 3 deletions extensions/functions_string.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,9 @@ scalar_functions:
return: "BOOLEAN"
-
name: substring
description: Extract a portion of a string from another string.
description: >-
Extract a substring of a specified length starting from position start.
A start value of 1 refers to the first characters of the string.
impls:
- args:
- value: "varchar<L1>"
Expand All @@ -44,7 +46,7 @@ scalar_functions:
- value: i32
- value: i32
return: "string"
-
-
name: starts_with
description: Whether this string starts with another string.
impls:
Expand Down Expand Up @@ -222,7 +224,8 @@ scalar_functions:
name: "substring"
description: The substring to count.
return: i64
- name: replace
-
name: replace
description: >-
Replace all occurrences of the substring with the replacement string.
impls:
Expand All @@ -248,3 +251,177 @@ scalar_functions:
name: "replacement"
description: The replacement string.
return: "varchar<L1>"
-
name: ltrim
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With these trimming functions where the user can specify characters to remove, is there anything to clarify whether the characters are interpreted as-is or as a regex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption was as-is, since that seems to be how Postgresql and DuckDB do it. For regex, they use different functions. I was planning on looking into those functions in another PR.

description: >-
Remove any occurrence of the characters from the left side of the string.
If no characters are specified, spaces are removed.
impls:
- args:
- value: "varchar<L1>"
name: "input"
description: "The string to remove characters from."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to remove."
return: "varchar<L1>"
- args:
- value: "string"
name: "input"
description: "The string to remove characters from."
- value: "string"
name: "characters"
description: "The set of characters to remove."
return: "string"
-
name: rtrim
description: >-
Remove any occurrence of the characters from the right side of the string.
If no characters are specified, spaces are removed.
impls:
- args:
- value: "varchar<L1>"
name: "input"
description: "The string to remove characters from."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to remove."
return: "varchar<L1>"
- args:
- value: "string"
name: "input"
description: "The string to remove characters from."
- value: "string"
name: "characters"
description: "The set of characters to remove."
return: "string"
-
name: trim
description: >-
Remove any occurrence of the characters from the left and right sides of
the string. If no characters are specified, spaces are removed.
impls:
- args:
- value: "varchar<L1>"
name: "input"
description: "The string to remove characters from."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to remove."
return: "varchar<L1>"
- args:
- value: "string"
name: "input"
description: "The string to remove characters from."
- value: "string"
name: "characters"
description: "The set of characters to remove."
return: "string"
-
name: lpad
description: Left-pad the string with the characters until the specified length of the string has been reached.
impls:
- args:
- value: "varchar<L1>"
name: "input"
description: "The string to pad."
- value: i32
name: "length"
description: "The length of the output string."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to use for padding."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the "set of characters" concept for trim functions (trim("foobar", "of") -> "bar"), but not for padding. "Set" implies unordered. I don't think I've ever seen these functions in forms where you can specify more than one character either, but I guess Substrait doesn't really have a character type except maybe fixedchar<1>. Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven, especially for the center functions, where I can think of at least two similarly efficient algorithms that would have different results: center("x", 10, "abc") -> replace_slice(substring(repeat("abc", 4), 1, 10), 5, 1, "x") -> "abcaxcabca", and center("x", 10, "abc") -> concat(substring(repeat("abc", 2), 1, 4), "x", substring(repeat("abc", 2), 1, 5)) -> "abcaxabcab".

Copy link
Member Author

@richtia richtia Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll update the wording to 'string'. DuckDB only allows one character, but postgresql let's you use a string for the padding functions. I figured we would use whatever is more flexible.

Do these functions just repeat the padding string as many times as needed? The question then becomes what they do if the number of characters needed is uneven

I'll update the lpad and rpad descriptions to address these types of scenarios. Not too sure about center though.

To be honest, I also wasn't too convinced on the center function. I couldn't really find many places that implement it. I imagine the expectation is that someone could easily use a combination of lpad and rpad. I wonder if we should omit here as well?

cc: @ianmcook

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought is that maybe we have center only take single character padding. Multiple character strings used for padding can be just for lpad/rpad

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvanstraten in the interest of getting a bunch of these other more common functions in, i've removed center for now. I'll create a separate task to look more into center and if/how other SQL variants deal with it.

return: "varchar<L1>"
- args:
- value: "string"
name: "input"
description: "The string to pad."
- value: i32
name: "length"
description: "The length of the output string."
- value: "string"
name: "characters"
description: "The set of characters to use for padding."
return: "string"
-
name: rpad
description: Right-pad the string with the characters until the specified length of the string has been reached.
impls:
- args:
- value: "varchar<L1>"
name: "input"
description: "The string to pad."
- value: i32
name: "length"
description: "The length of the output string."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to use for padding."
return: "varchar<L1>"
- args:
- value: "string"
name: "input"
description: "The string to pad."
- value: i32
name: "length"
description: "The length of the output string."
- value: "string"
name: "characters"
description: "The set of characters to use for padding."
return: "string"
-
name: center
description: >-
Pad the string with characters from each side until the specified length of the string has
been reached. If padding is uneven, the default option is to have extra padding on the right.
impls:
- args:
- options: [ RIGHT, LEFT ]
required: false
- value: "varchar<L1>"
name: "input"
description: "The string to pad."
- value: i32
name: "length"
description: "The length of the output string."
- value: "varchar<L2>"
name: "characters"
description: "The set of characters to use for padding."
return: "varchar<L1>"
- args:
- options: [ RIGHT, LEFT ]
required: false
- value: "string"
name: "input"
description: "The string to pad."
- value: i32
name: "length"
description: "The length of the output string."
- value: "string"
name: "characters"
description: "The set of characters to use for padding."
return: "string"
-
name: left
description: Extract count characters starting from the left of the string.
impls:
- args:
- value: "varchar<L1>"
- value: i32
return: "varchar<L1>"
- args:
- value: "string"
- value: i32
return: "string"
-
name: right
description: Extract count characters starting from the right of the string.
impls:
- args:
- value: "varchar<L1>"
- value: i32
return: "varchar<L1>"
- args:
- value: "string"
- value: i32
return: "string"