DRILL-5450: Fix initcap function to convert upper case characters cor…#821
DRILL-5450: Fix initcap function to convert upper case characters cor…#821arina-ielchiieva wants to merge 1 commit intoapache:masterfrom
Conversation
paul-rogers
left a comment
There was a problem hiding this comment.
Do we need to roll our own toLower when Java already provides one that is fully Unicode aware?
| // noop | ||
| } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z | ||
| currentByte -= 0x20; // Lowercase this character | ||
| currentByte += 0x20; // Lowercase this character |
There was a problem hiding this comment.
currentByte = Character.toLowerCase( currentByte )
The above handles all the Unicode complexity -- no need for us to reimplement it here.
A concern might be performance. Try calling the above 10K times in a loop and this function 10K times. Is there a difference in cost?
There was a problem hiding this comment.
toLowerCase() is implemented as a big switch statement for the Unicode, so very little cost ....
There was a problem hiding this comment.
I did not notice any significant difference in performance, so will replace to Character methods.
22f5a8a to
4d97811
Compare
|
@paul-rogers |
paul-rogers
left a comment
There was a problem hiding this comment.
Giving this a +1 only because it is a bit less broken than before. See comments for how to handle the fact that the function still doesn't support our claimed character encoding of UTF-8.
| } else { // whitespace | ||
| capNext = true; | ||
| } | ||
| int currentByte = inBuf.getByte(id); |
There was a problem hiding this comment.
This code works only for ASCII, but not for UTF-8. UTF-8 is a multi-byte code that requires special encoding/decoding to convert to Unicode characters. Without that encoding, this method won't work for Cyrillic, Greek or any other character set with upper/lower distinctions.
Since this method never worked, it is probably OK to make it a bit less broken than before: at least now it works for ASCII. Please add unit tests below, then file a JIRA, for the fact that this function does not work with UTF-8 despite the fact that Drill claims it supports UTF-8.
| testBuilder() | ||
| .sqlQuery("select\n" + | ||
| "lower('ABC') col_upper,\n" + | ||
| "lower('abc') col_lower,\n" + |
There was a problem hiding this comment.
Please add tests for Greek and Cyrillic. Our source encoding is UTF-8, so you can enter the characters directly. Or, if that does not work, you can instead use the Java Unicode encoding: U1234.
If the tests fail because of parsing of SQL, please file a bug. If they fail because the function above does not support UTF-8, please file a different bug.
In either case, you can then comment out the test cases and add a comment that says that they fail due to DRILL-xxxx, whatever your bug number turns out to be.
There was a problem hiding this comment.
Created Jira DRILL-5477 and added appropriate unit test which is ignored for now.
4d97811 to
ad31119
Compare
|
+1 so we can commit this. But Paul is right. This could be a lot better. |
…rectly