- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.9k
ARROW-9131: [C++] Faster ascii_lower and ascii_upper. #7434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Following up on apache#7418 I tried and benchmarked a different way for * ascii_lower * ascii_upper Before (lower is similar): ``` -------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------- AsciiUpper_median 4922843 ns 4918961 ns 10 bytes_per_second=3.1457G/s items_per_second=213.17M/s ``` After: ``` -------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------- AsciiUpper_median 1391272 ns 1390014 ns 10 bytes_per_second=11.132G/s items_per_second=754.363M/s ``` This is a 3.7x speedup (on a AMD machine). Using http://quick-bench.com/JaDErmVCY23Z1tu6YZns_KBt0qU I found 4.6x speedup for clang 9, 6.4x for GCC 9.2. Also, the test is expanded a bit to include a non-ascii codepoint, to make explicit it is fine to upper or lower case a utf8 string. The non-overlap encoding of utf8 make this ok (see section 2.5 of Unicode Standard Core Specification v13.0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is neat. It appears that both gcc and clang auto-vectorize the conditional expression.
| const uint8_t utf8_code_unit = *input++; | ||
| // Code units in the range [a-z] can only be an encoding of an ascii | ||
| // character/codepoint, not the 2nd, 3rd or 4th code unit (byte) of an different | ||
| // codepoint. This guaranteed by non-overal design of the unicode standard. (see | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"non-overlap"
| I also thought that we could do a bit check instead of the range check, e.g.  The generated code looks vectorized indeed. I didn't look into the details of the generated code by clang and GCC, it seems their performance is a bit different, so we might be able to squeeze out a bit more if we want. Happy to look into that later (create a new issue), but I rather spend my time on other functions now. | 
| It's ok, there's no need to further optimize those functions. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| Travis-CI failure is unrelated. | 
| Thank you @maartenbreddels ! | 
| Thanks, let me know if my workflow is ok, or if I can make things go smoother. PS: I am looking for a document describing the kernel design. I see these two cases ( PS2: Are there alternative channels for quick/small questions, or is this fine? | 
| I think the current approach (implement kernels one-by-one) is reasonable and manageable for us (and for you as well I hope). I don't think there's much documentation for now around the kernel design. Basically, kernels should usually be able to process two kinds of inputs (represented as  For quick development questions, we have a public chat instance at https://ursalabs.zulipchat.com/ - just register though and you can chat with the team. The main channel there is the "dev" channel, and Zulip allows you to create subtopics - don't hesitate to use those! | 
| 
 No, this is fine. 
 👍 | 
| 
 Can we use dev@arrow.apache.org so everyone can see the discussion and it's searchable via Google later? 
 Keep in mind that the new kernels framework is only a few weeks old (7ad49ee), so developer documentation is a bit behind, so don't hesitate to ask questions. | 
| 
 I'm ok with that if not considered too noisy (like small cmake questions etc). | 
| It shouldn't be a problem. | 
Following up on #7418 I tried and benchmarked a different way for
Before (lower is similar):
After:
This is a 3.7x speedup (on a AMD machine).
Using http://quick-bench.com/JaDErmVCY23Z1tu6YZns_KBt0qU I found 4.6x speedup for clang 9, 6.4x for GCC 9.2.
Also, the test is expanded a bit to include a non-ascii codepoint, to make explicit it is fine to upper
or lower case a utf8 string. The non-overlap encoding of utf8 make this ok (see section 2.5 of Unicode
Standard Core Specification v13.0).