-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a blog post for LIKE optimizations #8576
Conversation
✅ Deploy Preview for meta-velox ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
cc @mbasmanova |
relaxed patterns that are not so straightforward: | ||
|
||
- `hello_velox%`: matches inputs that start with 'hello', followed by any character, then followed by 'velox'. | ||
- `hello_velox%`: matches inputs that end with 'hello', followed by any character, then followed by 'velox'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: hello_velox%
-> %hello_velox
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Fixed.
91cbba7
to
b18effe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xumingming James, thank you for writing this nice. It reads very well.
@pedroerp Pedro, would you also take a look?
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xumingming for this blog post. It is very useful.
Had some nits in the writing.
|
||
## What is LIKE? | ||
|
||
<a href="https://prestodb.io/docs/current/functions/comparison.html#like">LIKE</a> is a very useful operation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Use multiple short sentences instead of commas.
"LIKE is a very useful SQL operator. It is used to do string pattern matching. The following examples for LIKE usage are from the Presto doc:"
|
||
- Use `%` to match zero or more characters. | ||
- Use `_` to match exactly one character. | ||
- If we need to match `%` and `_` literally, we can specify escape char to escape them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : specify "an" escape character
into Velox's function call, e.g. `name LIKE '%b%'` is translated to | ||
`like(name, '%b%')`. Internally Velox converts the pattern string into a regular | ||
expression and then uses regular expression library <a href="https://github.com/google/re2">RE2</a> | ||
to do the pattern matching. RE2 is a very good regular expression library, it is fast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here : Use full stop between the sentences "RE2 is a very good regular expression library. It is fast and safe, which gives Velox LIKE function a good performance."
expression and then uses regular expression library <a href="https://github.com/google/re2">RE2</a> | ||
to do the pattern matching. RE2 is a very good regular expression library, it is fast | ||
and safe which gives Velox LIKE a good performance. But some popularly used simple patterns | ||
can be optimized to use simple C++ string functions to implement directly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : "can be optimized using direct simple C++ string functions instead of regex."
and safe which gives Velox LIKE a good performance. But some popularly used simple patterns | ||
can be optimized to use simple C++ string functions to implement directly, | ||
e.g. Pattern `hello%` matches inputs that start with `hello`, which can be implemented by | ||
memory comparing the prefix bytes of inputs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : " can be implemented by direct memory comparison of prefix ('hello' in this case) bytes of input"
Although these patterns look similar to previous ones, but they are not so straightforward | ||
to optimize, `_` here matches any single character, we can not simply use memory comparison to | ||
do the matching. And if user's input is not pure ASCII, `_` might match more than one byte which | ||
makes the implementation even more complex. And also note that the patterns above are just for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : "Also note that the above patterns are just for illustrative purposes. Actual patterns in practice can be more complex."
} | ||
``` | ||
|
||
Here `cursor` is the index in the input we are trying to match, `unicodeCharLength` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe format this as follows
Here :
- 'cursor' is the index in the input we are trying to match.
- 'unicodeCharLength' ....
So the logic is basically repeatedly....
a function which wraps utf8proc function to determine how many bytes current character consists of, | ||
so the logic is basically repeatedly calculate size of current character and skip it. | ||
|
||
It seems not that complex, but we should note that this logic is not effective for pure ASCII input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End sentence here.
so the logic is basically repeatedly calculate size of current character and skip it. | ||
|
||
It seems not that complex, but we should note that this logic is not effective for pure ASCII input, | ||
for pure ASCII input, every character is one byte, to match a sequence of `_`, we don't need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit sentence :
"Every character is one byte in pure ASCII input. So to match a sequence of '', we don't need to calculate the size of each character and compare in a for-loop. Infact, we don't need to explicitly match '' for pure ASCII input as all. We can use the following logic instead:"
} | ||
``` | ||
|
||
It only matches the kLiteralString pattern at the right position of the inputs, `_` is automatically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End this sentence with full-stop.
4a3769d
to
9d2dfd8
Compare
@aditi-pandit Thanks for the review, made corresponding changes to all the comments. |
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova merged this pull request in 3004e34. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Summary: Pull Request resolved: facebookincubator#8576 Reviewed By: Yuhta, kgpai Differential Revision: D53308906 Pulled By: mbasmanova fbshipit-source-id: 31a1efe0d5472ccc9f2a1c81602e402d2f8c8e8a
No description provided.