DRILL-5450: Fix initcap function to convert upper case characters cor… by arina-ielchiieva · Pull Request #821 · apache/drill

arina-ielchiieva · 2017-04-27T12:29:33Z

…rectly

paul-rogers

Do we need to roll our own toLower when Java already provides one that is fully Unicode aware?

paul-rogers · 2017-04-27T20:12:38Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java

          // noop
        } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
-          currentByte -= 0x20; // Lowercase this character
+          currentByte += 0x20; // Lowercase this character


currentByte = Character.toLowerCase( currentByte )

The above handles all the Unicode complexity -- no need for us to reimplement it here.

A concern might be performance. Try calling the above 10K times in a loop and this function 10K times. Is there a difference in cost?

toLowerCase() is implemented as a big switch statement for Unicode, so there is very little cost.

I did not notice any significant difference in performance, so I will switch to the Character methods.
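For readers following the thread, here is a minimal sketch of the comparison discussed above. It is not the Drill code: the class and method names (`AsciiCaseSketch`, `asciiToLower`, `buggyToLower`) are hypothetical, and the timing loop is only a crude illustration of the "10K calls" suggestion, not a rigorous benchmark.

```java
public class AsciiCaseSketch {

    // Hand-rolled ASCII lowercasing as in the corrected diff:
    // A-Z (0x41-0x5A) is shifted up by 0x20 to reach a-z.
    static int asciiToLower(int b) {
        return (b >= 0x41 && b <= 0x5A) ? b + 0x20 : b;
    }

    // The pre-fix behavior: subtracting 0x20 moves 'A' (0x41) to '!' (0x21).
    static int buggyToLower(int b) {
        return (b >= 0x41 && b <= 0x5A) ? b - 0x20 : b;
    }

    public static void main(String[] args) {
        // Within 7-bit ASCII the hand-rolled version and the JDK method agree.
        for (int b = 0; b < 0x80; b++) {
            if (asciiToLower(b) != Character.toLowerCase(b)) {
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(b));
            }
        }
        System.out.println("buggy 'A' -> " + (char) buggyToLower('A'));

        // Crude timing of 10K calls each; results vary by JVM and warm-up.
        int iterations = 10_000;
        long sink = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += asciiToLower('A' + (i % 26));
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += Character.toLowerCase('A' + (i % 26));
        }
        long t2 = System.nanoTime();
        System.out.println("hand-rolled ns: " + (t1 - t0)
            + ", Character.toLowerCase ns: " + (t2 - t1)
            + " (sink=" + sink + ")");
    }
}
```

Note that the two only agree for values below 0x80; `Character.toLowerCase(int)` also maps code points outside ASCII, which the arithmetic version ignores.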

arina-ielchiieva · 2017-04-28T11:21:29Z

@paul-rogers
I have changed the custom lower / upper implementation in favor of Character methods.
Made changes in lower, upper and initcap functions.
Please review when possible.

paul-rogers

Giving this a +1 only because it is a bit less broken than before. See comments for how to handle the fact that the function still doesn't support our claimed character encoding of UTF-8.

paul-rogers · 2017-05-02T05:07:33Z

exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java

-        } else { // whitespace
-          capNext = true;
-        }
+      int currentByte = inBuf.getByte(id);


This code works only for ASCII, but not for UTF-8. UTF-8 is a multi-byte code that requires special encoding/decoding to convert to Unicode characters. Without that encoding, this method won't work for Cyrillic, Greek or any other character set with upper/lower distinctions.

Since this method never worked, it is probably OK to make it a bit less broken than before: at least now it works for ASCII. Please add unit tests below, then file a JIRA, for the fact that this function does not work with UTF-8 despite the fact that Drill claims it supports UTF-8.
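To illustrate the point about UTF-8, the following is a minimal sketch, not the Drill implementation. The class and method names (`Utf8LowerSketch`, `lowerBytewise`, `lowerDecoded`) are hypothetical, and plain `byte[]` arrays stand in for Drill's buffers as an assumption. It contrasts lowercasing byte by byte with decoding to code points first, using a Cyrillic word written with Unicode escapes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8LowerSketch {

    // Byte-wise lowercasing: each byte is treated as a character,
    // so only the single-byte ASCII letters A-Z are ever changed.
    static byte[] lowerBytewise(byte[] utf8) {
        byte[] out = utf8.clone();
        for (int i = 0; i < out.length; i++) {
            int b = out[i];
            if (b >= 0x41 && b <= 0x5A) {
                out[i] = (byte) (b + 0x20);
            }
        }
        return out;
    }

    // Decode the UTF-8 bytes, lowercase per code point, then re-encode.
    static byte[] lowerDecoded(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Cyrillic capitals Pe, Er, I, Ve, Ie, Te; each takes two bytes in UTF-8.
        byte[] input = "\u041F\u0420\u0418\u0412\u0415\u0422".getBytes(StandardCharsets.UTF_8);

        // The byte-wise pass leaves the multi-byte sequences untouched.
        System.out.println("byte-wise changed input: "
            + !Arrays.equals(lowerBytewise(input), input));

        // The decoded pass produces the lowercase Cyrillic letters.
        String lowered = new String(lowerDecoded(input), StandardCharsets.UTF_8);
        System.out.println("decoded result is lowercase: "
            + lowered.equals("\u043F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

Decoding to a `String` allocates, which may matter in a per-row function; this sketch only shows why the byte-wise loop cannot handle non-ASCII letters, not how Drill should fix it.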

paul-rogers · 2017-05-02T05:10:41Z

exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestStringFunctions.java

+    testBuilder()
+        .sqlQuery("select\n" +
+            "lower('ABC') col_upper,\n" +
+            "lower('abc') col_lower,\n" +


Please add tests for Greek and Cyrillic. Our source encoding is UTF-8, so you can enter the characters directly. Or, if that does not work, you can instead use the Java Unicode escape syntax: \u1234.

If the tests fail because of parsing of SQL, please file a bug. If they fail because the function above does not support UTF-8, please file a different bug.

In either case, you can then comment out the test cases and add a comment that says that they fail due to DRILL-xxxx, whatever your bug number turns out to be.

Created Jira DRILL-5477 and added an appropriate unit test, which is ignored for now.
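As a rough illustration of the escape-based approach suggested above, here is a sketch in plain Java. It does not use Drill's `testBuilder()`; the class name `UnicodeEscapeSketch` and its constants are hypothetical. It only shows how Greek and Cyrillic inputs can be written with `\uXXXX` escapes and what the JDK considers their lower and upper case forms, which is the kind of reference value a test for DRILL-5477 might compare against.

```java
import java.util.Locale;

public class UnicodeEscapeSketch {

    // Greek capital Alpha, Beta, Gamma and their small forms.
    static final String GREEK_UPPER = "\u0391\u0392\u0393";
    static final String GREEK_LOWER = "\u03B1\u03B2\u03B3";

    // Cyrillic capital A, Be, Ve and their small forms.
    static final String CYRILLIC_UPPER = "\u0410\u0411\u0412";
    static final String CYRILLIC_LOWER = "\u0430\u0431\u0432";

    static void check(boolean ok, String what) {
        if (!ok) {
            throw new AssertionError(what);
        }
    }

    public static void main(String[] args) {
        // Locale.ROOT avoids locale-specific case rules.
        check(GREEK_UPPER.toLowerCase(Locale.ROOT).equals(GREEK_LOWER), "Greek lower");
        check(GREEK_LOWER.toUpperCase(Locale.ROOT).equals(GREEK_UPPER), "Greek upper");
        check(CYRILLIC_UPPER.toLowerCase(Locale.ROOT).equals(CYRILLIC_LOWER), "Cyrillic lower");
        check(CYRILLIC_LOWER.toUpperCase(Locale.ROOT).equals(CYRILLIC_UPPER), "Cyrillic upper");
        System.out.println("all reference values agree");
    }
}
```

Escapes keep the test source pure ASCII, so the outcome does not depend on the encoding the compiler assumes for the file.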

parthchandra · 2017-05-12T23:52:46Z

+1 so we can commit this. But Paul is right. This could be a lot better.

paul-rogers requested changes Apr 27, 2017

arina-ielchiieva force-pushed the DRILL-5450 branch from 22f5a8a to 4d97811 Compare April 28, 2017 10:35

paul-rogers approved these changes May 2, 2017

DRILL-5450: Fix initcap function to convert upper case characters cor…

ad31119

…rectly

arina-ielchiieva force-pushed the DRILL-5450 branch from 4d97811 to ad31119 Compare May 5, 2017 15:35

asfgit closed this in cb9547a May 13, 2017

arina-ielchiieva wants to merge 1 commit into apache:master from arina-ielchiieva:DRILL-5450

arina-ielchiieva commented Apr 27, 2017
paul-rogers left a comment
paul-rogers Apr 27, 2017
Ben-Zvi Apr 27, 2017
arina-ielchiieva Apr 28, 2017
arina-ielchiieva commented Apr 28, 2017
paul-rogers left a comment
paul-rogers May 2, 2017
paul-rogers May 2, 2017
arina-ielchiieva May 5, 2017
parthchandra commented May 12, 2017

4 participants
