Skip to content

DRILL-5450: Fix initcap function to convert upper case characters cor…#821

Closed
arina-ielchiieva wants to merge 1 commit intoapache:masterfrom
arina-ielchiieva:DRILL-5450
Closed

DRILL-5450: Fix initcap function to convert upper case characters cor…#821
arina-ielchiieva wants to merge 1 commit intoapache:masterfrom
arina-ielchiieva:DRILL-5450

Conversation

@arina-ielchiieva
Copy link
Member

…rectly

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to roll our own toLower when Java already provides one that is fully Unicode aware?

// noop
} else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
currentByte -= 0x20; // Lowercase this character
currentByte += 0x20; // Lowercase this character
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currentByte = Character.toLowerCase( currentByte )

The above handles all the Unicode complexity -- no need for us to reimplement it here.

A concern might be performance. Try calling the above 10K times in a loop and this function 10K times. Is there a difference in cost?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toLowerCase() is implemented as a big switch statement for the Unicode, so very little cost ....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not notice any significant difference in performance, so will replace to Character methods.

@arina-ielchiieva
Copy link
Member Author

@paul-rogers
I have changed customer lower / upper implementation in favor of Character methods.
Made changes in lower, upper and initcap functions.
Please review when possible.

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving this a +1 only because it is a bit less broken than before. See comments for how to handle the fact that the function still doesn't support our claimed character encoding of UTF-8.

} else { // whitespace
capNext = true;
}
int currentByte = inBuf.getByte(id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code works only for ASCII, but not for UTF-8. UTF-8 is a multi-byte code that requires special encoding/decoding to convert to Unicode characters. Without that encoding, this method won't work for Cyrillic, Greek or any other character set with upper/lower distinctions.

Since this method never worked, it is probably OK to make it a bit less broken than before: at least now it works for ASCII. Please add unit tests below, then file a JIRA, for the fact that this function does not work with UTF-8 despite the fact that Drill claims it supports UTF-8.

testBuilder()
.sqlQuery("select\n" +
"lower('ABC') col_upper,\n" +
"lower('abc') col_lower,\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests for Greek and Cyrillic. Our source encoding is UTF-8, so you can enter the characters directly. Or, if that does not work, you can instead use the Java Unicode encoding: U1234.

If the tests fail because of parsing of SQL, please file a bug. If they fail because the function above does not support UTF-8, please file a different bug.

In either case, you can then comment out the test cases and add a comment that says that they fail due to DRILL-xxxx, whatever your bug number turns out to be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created Jira DRILL-5477 and added appropriate unit test which is ignored for now.

@parthchandra
Copy link
Contributor

+1 so we can commit this. But Paul is right. This could be a lot better.

@asfgit asfgit closed this in cb9547a May 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants