Add a blog post for LIKE optimizations #8576

Closed
wants to merge 3 commits into from
146 changes: 146 additions & 0 deletions website/blog/2024-01-27-like-optimization.mdx
@@ -0,0 +1,146 @@
---
slug: like
title: "Improve LIKE's performance"
authors: [xumingming]
tags: [tech-blog,performance]
---

## What is LIKE?

<a href="https://prestodb.io/docs/current/functions/comparison.html#like">LIKE</a> is a very useful operation,
Collaborator

Nit: Use multiple short sentences instead of commas.

"LIKE is a very useful SQL operator. It is used to do string pattern matching. The following examples for LIKE usage are from the Presto doc:"

it is used to do string pattern matching, the following examples are from Presto doc:

```
SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name)
WHERE name LIKE '%b%'
--returns 'abc' and 'bcd'

SELECT * FROM (VALUES ('abc'), ('bcd'), ('cde')) AS t (name)
WHERE name LIKE '_b%'
--returns 'abc'

SELECT * FROM (VALUES ('a_c'), ('_cd'), ('cde')) AS t (name)
WHERE name LIKE '%#_%' ESCAPE '#'
--returns 'a_c' and '_cd'
```

These examples show the basic usage of LIKE:

- Use `%` to match zero or more characters.
- Use `_` to match exactly one character.
- If we need to match `%` and `_` literally, we can specify escape char to escape them.
Collaborator

Nit : specify "an" escape character


When Velox is used as the backend to evaluate Presto queries, the LIKE operation is translated
into a Velox function call, e.g. `name LIKE '%b%'` is translated to
`like(name, '%b%')`. Internally, Velox converts the pattern string into a regular
expression and then uses the regular expression library <a href="https://github.com/google/re2">RE2</a>
to do the pattern matching. RE2 is a very good regular expression library, it is fast
Collaborator

Same here : Use full stop between the sentences "RE2 is a very good regular expression library. It is fast and safe, which gives Velox LIKE function a good performance."

and safe which gives Velox LIKE a good performance. But some popularly used simple patterns
can be optimized to use simple C++ string functions to implement directly,
Collaborator

Nit : "can be optimized using direct simple C++ string functions instead of regex."

e.g. Pattern `hello%` matches inputs that start with `hello`, which can be implemented by
memory comparing the prefix bytes of inputs:
Collaborator

Nit : " can be implemented by direct memory comparison of prefix ('hello' in this case) bytes of input"


```
// Match the first 'length' characters of string 'input' and prefix pattern.
bool matchPrefixPattern(
    StringView input,
    const std::string& pattern,
    size_t length) {
  return input.size() >= length &&
      std::memcmp(input.data(), pattern.data(), length) == 0;
}
```

It is much faster than using RE2, benchmark shows it gives us a 750x speedup. We can do similar
Collaborator

Nit : Use full-stop instead of ',' between the 2 parts of the sentence.

optimizations for some other patterns:

- `%hello`: matches inputs that end with `hello`. It can be optimized by memory comparing the suffix bytes of the inputs.
- `%hello%`: matches inputs that contain `hello`. It can be optimized by using `std::string_view::find` to check whether inputs contain `hello`.
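
The two cases in the list above can be sketched roughly as follows. This is a minimal illustration built on `std::string_view`, not the Velox implementation; the helper names `matchSuffix` and `matchSubstring` are hypothetical.

```cpp
#include <string_view>

// Hypothetical helper names; they are not taken from the Velox source.

// `%hello`: byte-wise comparison of the trailing bytes of the input.
bool matchSuffix(std::string_view input, std::string_view suffix) {
  return input.size() >= suffix.size() &&
      input.substr(input.size() - suffix.size()) == suffix;
}

// `%hello%`: search for the literal anywhere in the input.
bool matchSubstring(std::string_view input, std::string_view literal) {
  return input.find(literal) != std::string_view::npos;
}
```

For example, `matchSuffix("say hello", "hello")` and `matchSubstring("say hello world", "hello")` both return `true`. Both helpers compare raw bytes, so they should behave the same for ASCII and UTF-8 inputs.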

These simple patterns are straightforward to optimize, there are some more
Collaborator

Nit : Use full-stop between these sentences as well.

relaxed patterns that are not so straightforward:

- `hello_velox%`: matches inputs that start with 'hello', followed by any single character, followed by 'velox'.
- `%hello_velox`: matches inputs that end with 'hello', any single character, and then 'velox'.
- `%hello_velox%`: matches inputs that contain 'hello' and 'velox' with exactly one character separating them.

Although these patterns look similar to the previous ones, they are not so straightforward
to optimize. Because `_` matches any single character, we cannot simply use memory comparison to
do the matching. And if the user's input is not pure ASCII, `_` might match more than one byte, which
makes the implementation even more complex. And also note that the patterns above are just for
Collaborator

Nit : "Also note that the above patterns are just for illustrative purposes. Actual patterns in practice can be more complex."

illustrative purpose, actual patterns can be more complex, e.g. `h_e_l_l_o`, so trivial algorithm
will not work.

## Optimizing Relaxed Patterns

We optimized these patterns as follows. First, we split the pattern into a list of sub-patterns, e.g.
`hello_velox%` is split into the sub-patterns `hello`, `_`, `velox`, `%`. Because there is
a `%` at the end, we classify it as a `kRelaxedPrefix` pattern, which means we need to do prefix
matching. It is not a trivial prefix match, though. We need to match three sub-patterns:

- kLiteralString: hello
- kSingleCharWildcard: _
- kLiteralString: velox
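
The splitting step can be sketched as below. This is a simplified illustration under stated assumptions: the `SubPattern` struct, the `splitPattern` function and the `kAnyCharsWildcard` kind are hypothetical names, and escape characters are ignored.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical types for illustration; Velox's real definitions may differ.
enum class SubPatternKind {
  kLiteralString,
  kSingleCharWildcard,
  kAnyCharsWildcard, // assumed name for a run of '%'
};

struct SubPattern {
  SubPatternKind kind;
  std::string text; // the literal text, or the run of wildcard characters
};

// Splits a LIKE pattern into runs of literals, '_' and '%'.
// Escape characters are not handled in this sketch.
std::vector<SubPattern> splitPattern(std::string_view pattern) {
  auto kindOf = [](char c) {
    if (c == '_') {
      return SubPatternKind::kSingleCharWildcard;
    }
    if (c == '%') {
      return SubPatternKind::kAnyCharsWildcard;
    }
    return SubPatternKind::kLiteralString;
  };

  std::vector<SubPattern> subPatterns;
  for (char c : pattern) {
    const auto kind = kindOf(c);
    if (!subPatterns.empty() && subPatterns.back().kind == kind) {
      // Same kind as the previous character: extend the current run.
      subPatterns.back().text.push_back(c);
    } else {
      subPatterns.push_back({kind, std::string(1, c)});
    }
  }
  return subPatterns;
}
```

With this sketch, `splitPattern("hello_velox%")` yields four entries: the literal `hello`, one `_`, the literal `velox`, and a trailing `%`. Consecutive wildcards of the same kind are merged into one entry, so a run such as `__` becomes a single sub-pattern of length two.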

For `kLiteralString` we simply do a memory comparison:

```
if (subPattern.kind == SubPatternKind::kLiteralString &&
    std::memcmp(
        input.data() + start + subPattern.start,
        patternMetadata.fixedPattern().data() + subPattern.start,
        subPattern.length) != 0) {
  return false;
}
```

Note that since it is a memory comparison, it handles both pure ASCII inputs and inputs that
contain Unicode characters.

Matching `_` is more complex because Unicode inputs contain variable-length multi-byte characters.
Fortunately, there is an existing library that provides Unicode-related operations:
<a href="https://juliastrings.github.io/utf8proc/">utf8proc</a>. It provides functions that tell
us whether a byte in the input is the start of a character, how many bytes the current character
consists of, etc. So to match a sequence of `_`, our algorithm is:

```
if (subPattern.kind == SubPatternKind::kSingleCharWildcard) {
  // Match every single char wildcard.
  for (auto i = 0; i < subPattern.length; i++) {
    if (cursor >= input.size()) {
      return false;
    }

    auto numBytes = unicodeCharLength(input.data() + cursor);
    cursor += numBytes;
  }
}
```

Here `cursor` is the index in the input we are trying to match, `unicodeCharLength` is
Collaborator

Maybe format this as follows

Here :

  • 'cursor' is the index in the input we are trying to match.
  • 'unicodeCharLength' ....

So the logic is basically repeatedly....

a function which wraps utf8proc function to determine how many bytes current character consists of,
so the logic is basically repeatedly calculate size of current character and skip it.
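
To make the character-skipping step concrete, here is a self-contained sketch that does not depend on utf8proc. The function `utf8CharLength` is a hypothetical stand-in for the utf8proc-based `unicodeCharLength` wrapper: it derives the character length from the lead byte only and does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the utf8proc-based helper: number of bytes in the
// UTF-8 character that starts with `lead`. Malformed lead bytes count as 1.
inline std::size_t utf8CharLength(unsigned char lead) {
  if (lead < 0x80) {
    return 1; // 0xxxxxxx: ASCII
  }
  if ((lead & 0xE0) == 0xC0) {
    return 2; // 110xxxxx
  }
  if ((lead & 0xF0) == 0xE0) {
    return 3; // 1110xxxx
  }
  if ((lead & 0xF8) == 0xF0) {
    return 4; // 11110xxx
  }
  return 1; // stray continuation byte or invalid lead byte
}

// Advances `cursor` past `count` characters of `input`.
// Returns false if the input ends before `count` characters were skipped.
bool skipChars(std::string_view input, std::size_t count, std::size_t& cursor) {
  for (std::size_t i = 0; i < count; ++i) {
    if (cursor >= input.size()) {
      return false;
    }
    cursor += utf8CharLength(static_cast<unsigned char>(input[cursor]));
  }
  return true;
}
```

For the UTF-8 input `héllo`, where `é` is encoded as two bytes, skipping one character starting at byte offset 1 moves the cursor to offset 3. A truncated trailing character can push the cursor past the end of the input, so a caller should check the cursor against `input.size()` before reading further.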

It seems not that complex, but we should note that this logic is not effective for pure ASCII input,
Collaborator

End sentence here.

for pure ASCII input, every character is one byte, to match a sequence of `_`, we don't need to
Collaborator

Nit sentence :
"Every character is one byte in pure ASCII input. So to match a sequence of '_', we don't need to calculate the size of each character and compare in a for-loop. In fact, we don't need to explicitly match '_' for pure ASCII input at all. We can use the following logic instead:"

calculate the size of each character, don't need the for loop, actually we don't need to explicitly
match `_` for pure ASCII input at all, following is the whole logic for ASCII input:

```
for (const auto& subPattern : patternMetadata.subPatterns()) {
  if (subPattern.kind == SubPatternKind::kLiteralString &&
      std::memcmp(
          input.data() + start + subPattern.start,
          patternMetadata.fixedPattern().data() + subPattern.start,
          subPattern.length) != 0) {
    return false;
  }
}
```

It only matches the kLiteralString pattern at the right position of the inputs, `_` is automatically
Collaborator

End this sentence with full-stop.

matched (actually skipped), no need to match it explicitly. With this optimization we get a 40x speedup
for kRelaxedPrefix patterns and a 100x speedup for kRelaxedSuffix patterns.
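
As a rough, self-contained illustration of the ASCII fast path for a relaxed prefix pattern, consider the sketch below. The `LiteralRange` struct and the `matchRelaxedPrefixAscii` function are hypothetical simplifications of the pattern metadata, and the explicit length check is an assumption about how out-of-bounds reads would be avoided.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical, simplified metadata for a relaxed prefix pattern.
struct LiteralRange {
  std::size_t start;  // offset of the literal inside the fixed pattern
  std::size_t length; // number of bytes in the literal
};

// Matches an ASCII input against the fixed part of a pattern such as
// `hello_velox%` (fixedPattern = "hello_velox"). Only the literal ranges are
// compared; positions covered by '_' are never inspected.
bool matchRelaxedPrefixAscii(
    std::string_view input,
    std::string_view fixedPattern,
    const std::vector<LiteralRange>& literals) {
  // Every '_' consumes exactly one byte in ASCII input, so the input must be
  // at least as long as the fixed pattern.
  if (input.size() < fixedPattern.size()) {
    return false;
  }
  for (const auto& literal : literals) {
    if (std::memcmp(
            input.data() + literal.start,
            fixedPattern.data() + literal.start,
            literal.length) != 0) {
      return false;
    }
  }
  return true;
}
```

For `hello_velox%` the literal ranges would be `{0, 5}` and `{6, 5}`, so `hello velox rocks` matches while `hello` alone does not. A relaxed suffix pattern could presumably reuse the same loop by adding `input.size() - fixedPattern.size()` to each input offset.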

Thank you <a href="https://github.com/mbasmanova">Maria Basmanova</a> for spending a lot of time
reviewing the code.
5 changes: 5 additions & 0 deletions website/blog/authors.yml
@@ -54,3 +54,8 @@ raulcd:
  url: https://github.com/raulcd
  image_url: https://github.com/raulcd.png

xumingming:
  name: James Xu
  title: Software Engineer @ Alibaba
  url: https://github.com/xumingming
  image_url: https://github.com/xumingming.png