-
Notifications
You must be signed in to change notification settings - Fork 27
RFC for supporting mixed case identifiers #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
f46e9a3
to
121529c
Compare
121529c
to
9b2de30
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @agrawalreetika , thanks for the detailed write up. Got minor comments!
## Test Plan | ||
|
||
* Ensure that existing CI tests pass for connectors where no specific implementation is added. | ||
* Add support for mixed-case identifiers in at least one JDBC connector (e.g., MySQL, PostgreSQL) and create relevant unit tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Add support for mixed-case identifiers in at least one JDBC connector (e.g., MySQL, PostgreSQL) and create relevant unit tests. | |
* Add unit tests for testing mixed-case identifiers support in a JDBC connector (e.g., MySQL, PostgreSQL). |
|
||
## Background | ||
|
||
Presto treats all identifiers as case-insensitive, normalizing them (typically to lowercase). This creates issues when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presto treats all identifiers as case-insensitive, normalizing them (typically to lowercase). This creates issues when | |
Presto treats all identifiers as case-insensitive, normalizing them to lowercase. This creates issues when |
querying databases that are case-sensitive (e.g., MySQL, PostgreSQL) or case-normalizing to uppercase (e.g., Oracle, | ||
DB2). Without a standard approach, identifiers might not match the actual names in the underlying data sources, leading | ||
to unexpected query failures or incorrect results. Additionally, inconsistent handling of delimited and non-delimited | ||
identifiers across different connectors further complicates cross-engine compatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate more on inconsistent handling of delimited and non-delimited identifiers
|
||
The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards | ||
while ensuring a seamless user experience. Ideally, the change should be implemented in a way that minimizes | ||
backward-compatibility-breaking changes to the SPI, allowing connectors to adopt the new approach without significant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backward-compatibility-breaking changes to the SPI, allowing connectors to adopt the new approach without significant | |
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant |
to unexpected query failures or incorrect results. Additionally, inconsistent handling of delimited and non-delimited | ||
identifiers across different connectors further complicates cross-engine compatibility. | ||
|
||
The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Goals can be listed as bullet points:
-
Improve interoperability with storage engines i.e standardizing identifier handling (SQL Standard ref: ANSI?).
-
Backwards compatibility with existing system. Minimal or no breaking changes to SPI.
|
||
#### Core Changes | ||
|
||
* In the common code path, make changes to pass original identifier (Schema, table and column names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* In the common code path, make changes to pass original identifier (Schema, table and column names) | |
* In the presto-spi, add new API to pass original identifier (Schema, table and column names) |
bfd9603
to
cc9704d
Compare
Thanks for your review, @ScrapCodes . I've updated the document based on your comments. Please check it at your convenience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this detailed proposal. I do have one question about performance. Is this something you have tested in any prototype? Will delegating out to the connectors for normalization impose any significant overhead during the analysis phase?
Presto's behavior includes: | ||
|
||
- Delimited identifiers ("Identifier") and non-delimited identifiers (identifier) are converted to lowercase by default | ||
unless a connector enforces a specific behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless a connector enforces a specific behavior.
Where do we enforce this currently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its going to be in ConnectorMetadata API implementation here - https://github.com/prestodb/rfcs/pull/36/files#diff-217aa5922002c9dabc8df724c3930f3fa969a98881495ad2648be620fdb50407R105
- Retrieving metadata from connectors. | ||
- Displaying entity names in metadata introspection commands like SHOW TABLES and DESCRIBE. | ||
|
||
Presto handles identifiers in several ways: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presto handles identifiers in several ways: | |
Presto uses identifiers in several ways: |
/** | ||
* Normalize the provided SQL identifier according to connector-specific rules | ||
*/ | ||
default String normalizeIdentifier(ConnectorSession session, String identifier, boolean delimited) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider expanding the javadoc here to include the parameters and their descriptions (specifically I am concerned about documenting the meaning of delimited
. I actually think we may want to rename this to "escaped" or maybe "quoted" as discussed in the conversation with Anant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we can do that. This is going to be a boolean for double quotes
. And if in any connector we want to handle with and without quote differently https://github.com/prestodb/rfcs/pull/36/files#diff-217aa5922002c9dabc8df724c3930f3fa969a98881495ad2648be620fdb50407R30
I chose delimited
currently since Identifier here defines it as delimited
https://github.com/prestodb/presto/blob/master/presto-parser/src/main/java/com/facebook/presto/sql/tree/Identifier.java#L57 But I am open for suggestions for naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delimited is widely used in other databases too e.g. Netezza
But, Agree with Zac about documenting the meaning of delimited, and how it is a better alternative to escaped. Where each special character needs to be escaped and in case of delimited only the delimiter needs to be escaped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shoun't call it escaped since this is going to be only for double quotes
boolean. May be we can rename boolean to quoted
if that makes more clear? Open to hear suggestions here?
} | ||
``` | ||
|
||
Example - Connector specific implementation - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding an example for the use of the delimited
(escaped) argument
cc9704d
to
b591ab3
Compare
Thanks for your review @ZacBlanco, I have addressed your comments. Please check. Also about performance, I don't have the numbers yet but here its gonna add additional API call i.e. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for quick turn around!,
Do you think we can call this RFC: "Standardize the handling of delimited identifiers" or "Support case sensitive identifiers in Presto". To me case sensitive is a better term than Mixed case.
Also adding clarification about what delimited identifier means, and how it is a better option over escaped identifiers. Where delimited identifier needs only delimiter to be escaped, whereas an escaped identifier needs every special character escaped.
|
||
The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards | ||
while ensuring a seamless user experience. Ideally, the change should be implemented in a way that minimizes | ||
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant. | |
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant impact. |
----------- | ||
Test | ||
TestTable | ||
testtable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like, we are making presto case sensitive while remaining backwards compatible. To me the term "Mixed case" across the doc sounds less specific, not sure what others think, but case sensitive
is more widely used and precise.
Thank for your suggestion @ScrapCodes I chose |
- `presto-spi` | ||
- `presto-parser` | ||
- `presto-base-jdbc` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding the link to the POC implementation, if you have one.
b591ab3
to
5dd2e27
Compare
* Cover cases such as: | ||
- Queries with mixed-case identifiers. | ||
- Metadata retrieval commands (SHOW SCHEMAS, SHOW TABLES, DESCRIBE). | ||
- Joins, subqueries, and alias usage with mixed-case identifiers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can also add Queries between two connectors with different normalization rules, e.g MySQL and Hive
|
||
### Proposed Plan | ||
|
||
Presto's behavior includes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Presto's behavior includes: | |
Presto's default behavior is - |
- Align Presto’s identifier handling with SQL standards to improve interoperability with case-sensitive and | ||
case-normalizing databases. | ||
- Minimize SPI-breaking changes to maintain backward compatibility for existing connectors. | ||
- Introduce a mechanism for connectors to define their own identifier normalization behavior. | ||
- Allow identifiers to retain their original case where necessary, preventing unexpected query failures. | ||
- Ensure Access Control SPI can correctly normalize identifiers. | ||
- Preserve a seamless user experience while making these changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika thanks for this very helpful proposal. Bring a few questions for discussing. Please let me know if there is anything I didn't understand correctly.
It seems that when setting case-sensitive-name-matching=true
, the effect of delimited identifier on catalog/schema/table
is the same as that of undelimited identifier. This means that unlimited identifiers also retain the case, which seems a little different from the SQL spec. So should we add some explanations about the behavior of delimited and undelimited identifiers?
Besides, for databases like PostgreSQL, it's case sensitive for delimited column names, which means that the table can contain both column "ABCol" and column "AbCol" at the same time. Have we considered this situation? If not, should we clearly state that this situation is not supported in presto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd Thank you for the thoughtful review!
You're absolutely right about the behavior around delimited identifiers. As part of the initial proposal, the goal was to remove the default lowercasing of identifiers and delegate normalization to the connector level — this was introduced in PR #24551.
The current PR focuses specifically on handling schema and table names. Column names aren't included yet, as they are still lowercased at the SPI level (ColumnMetadata.java#L45). Supporting case-sensitive column names will require removing this default behavior and updating each connector to handle normalization through the metadata API. This work can be planned for a follow-up PR. LMK, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika I saw the PR first, and then came to read the proposal 😄. The column issue is OK as long as we're explicitly aware of the current limitations.
I'm a little concerned about how we support Postgresql
. Basically, when it comes to case sensitivity, the default behavior of Postgresql
follows the SQL specification most faithfully, that is, identifiers in quotation marks are treated as case-sensitive in SQL statement, whereas unquoted identifiers are automatically converted to lowercase.
So, do you think it makes sense to pass the state of delimited
as a parameter when define the SPI methods? For example:
Metadata.java
String normalizeIdentifier(Session session, String catalogName, String identifier, boolean delimited);
ConnectorMetadata.java
/**
* Normalize the provided SQL identifier according to connector-specific rules
*/
default String normalizeIdentifier(ConnectorSession session, String identifier, boolean delimited)
{
return identifier.toLowerCase(ENGLISH);
}
As I understand, if some connectors (such as Postgresql) is unaware of this delimited
flag, it cannot make the right case-handling logic. Other connectors (such as Mysql) may optionally ignore this flag.
Haven't dug deep into the code, so not very sure about the implementation complexity. Please let me know if this is a feasible way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the insightful follow-up, @hantangwangd!
You're right about the importance of the delimited flag, especially for connectors like PostgreSQL that handle identifier casing differently based on whether the identifier is quoted.
In fact, during the initial proposal, I had a similar thought and included the delimited flag as a parameter in the proposed SPI method for exactly this reason—to give connectors enough context to apply the correct case-handling logic based on whether the identifier was quoted or not.
However, in the current state of Presto, the information about whether a schema or table name was quoted in the original SQL isn't preserved by the time it reaches the SPI layer. Adding support for retaining that quoting information would require changes. So I scoped that out for now and focused on getting the base support in place, with the intention to handle quoting and delimited identifiers more fully in a follow-up iteration.
You're absolutely right that for PostgreSQL and similar connectors, this context is important.
Let me know if that sounds reasonable, or if you'd prefer we try to tackle quoting preservation sooner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika Thanks for your explanation.
As you said, the delimited flag is very important for handling the case sensitivity of identifier, so IMO we should at least take it into account in the proposal, since the solution designed in the proposal should be as comprehensive as possible.
For actual implementation, I agree with you that we can accomplish the whole proposal progressively though several PRs ---- especially if retaining the quoting information for schema/table name need a lot of changes. Do you think this makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hantangwangd The SQL standard is interepreted differently by different vendors, e.g DB2 and Oracle choose to UPPER-case by default 1, Postgres chooses to lower-case
IMO, whatever way we slice it, given that we federate to connectors for storage, there is going to be ambiguity and disagreement on the 'right' behavior for DDL statements run via Presto
We could call this out in the RFC and point to these examples as what behavior to expect
Do you see any gotchas for SELECT / DML statements though ?
I think this RFC is primarliy enabling reading case-sensitive schemas/tables/columns and we should shake out those issues first
Footnotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could call this out in the RFC and point to these examples as what behavior to expect
@aaneja 100% agree with this, it's always helpful to demonstrate the specific expected behavior through concrete examples.
Do you see any gotchas for SELECT / DML statements though ?
I think this RFC is primarliy enabling reading case-sensitive schemas/tables/columns and we should shake out those issues first
Seems that the core of our discussion revolves around how we intend to handle unquoted identifiers. If the goal is to quickly enable reading case-sensitive schemas/tables/columns, I agree that — with case-sensitive-name-matching
flag be enabled — treating unquoted identifiers as strictly case-sensitive is a reasonable way, and the approach is OK for now since the parsing and maintaining of the delimited flag for schema/table identifiers would likely require a lot of efforts.
In future, if we implement maintaining and handling of the delimited flag for all identifiers, it seems that we no longer need this dedicated case-sensitive-name-matching
configuration. At that point, it could be more align with SQL standard behavior: when users use unquoted identifiers, it implies case insensitivity (with specific case conversion handled by each connector); whereas when users use quoted identifiers, they meant to enforce strict case sensitivity. FYI, this behavior has been discussed for the case sensitivity of RowType
field names and JSON
column names, see here and here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your pointers and review @aaneja & @hantangwangd. I will add the example as well in the RFC for clarity.
In future, if we implement maintaining and handling of the delimited flag for all identifiers, it seems that we no longer need this dedicated case-sensitive-name-matching configuration.
My understanding here is, once we start sending delimiter
flag as well for normalizeIdentifier
API then in case of,
delimiter = false
: it could be legacy (converting it to lower case) anddelimiter = true
: it would be converted to connector specific case.
We don't have to havecase-sensitive-name-matching
configuration check in thenormalizeIdentifier
API.
Current behaviour is -
If case-sensitive-name-matching = false
- converts schema/table names to lower case with or w/o delimiter (Existing behaviour)
case-sensitive-name-matching = true
- converts schema/table names to connector-specific casing with or w/o delimiter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agrawalreetika, thanks for your summarizing, great understanding overall, just on little point to clarify when we start sending delimiter flag as well for normalizeIdentifier
API.
As I understand, we send the original-case identifiers along with their delimiter flags to the normalizeIdentifier
API, and then in case of,
delimiter = false
: the identifiers need to be normalized. Presto's default behavior is lowercase conversion, but connectors can customize their own normalization logic ---- like Oracle uppercasing them, PostgreSQL lowercasing them, and MySQL keep the case as-is.delimiter = true
: the identifiers always need to preserve their original case.
Do you think this behavior makes sense? If there's anything I misunderstood, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds fair to me.
We will also have flexibility in here, since we have normalizeIdentifier
API layout already. In case we still want to have a flag to keep the current legacy behavior (lower-case), we would have that option as well.
@aaneja @hantangwangd I’ve added detailed Behavioral Examples here to the RFC based on our earlier discussion. Please take a look at your convenience and share any feedback.
Also, when you get a chance, kindly drop a review on the corresponding PR: prestodb/presto#24551. Thanks a lot!
5dd2e27
to
1d5a0eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying the behaviors of both quoted and unquoted identifiers with case-sensitive-name-matching
flag.
RFC for supporting mixed case identifiers