Skip to content

RFC for supporting mixed case identifiers #36

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

agrawalreetika
Copy link
Member

RFC for supporting mixed case identifiers

@prestodb-ci prestodb-ci added the from:IBM PRs from IBM label Feb 17, 2025
@prestodb-ci prestodb-ci requested review from a team, bibith4 and pratyakshsharma and removed request for a team February 17, 2025 13:02
Copy link
Contributor

@ScrapCodes ScrapCodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @agrawalreetika , thanks for the detailed write up. Got minor comments!

## Test Plan

* Ensure that existing CI tests pass for connectors where no specific implementation is added.
* Add support for mixed-case identifiers in at least one JDBC connector (e.g., MySQL, PostgreSQL) and create relevant unit tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Add support for mixed-case identifiers in at least one JDBC connector (e.g., MySQL, PostgreSQL) and create relevant unit tests.
* Add unit tests for testing mixed-case identifiers support in a JDBC connector (e.g., MySQL, PostgreSQL).


## Background

Presto treats all identifiers as case-insensitive, normalizing them (typically to lowercase). This creates issues when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Presto treats all identifiers as case-insensitive, normalizing them (typically to lowercase). This creates issues when
Presto treats all identifiers as case-insensitive, normalizing them to lowercase. This creates issues when

querying databases that are case-sensitive (e.g., MySQL, PostgreSQL) or case-normalizing to uppercase (e.g., Oracle,
DB2). Without a standard approach, identifiers might not match the actual names in the underlying data sources, leading
to unexpected query failures or incorrect results. Additionally, inconsistent handling of delimited and non-delimited
identifiers across different connectors further complicates cross-engine compatibility.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more on inconsistent handling of delimited and non-delimited identifiers


The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards
while ensuring a seamless user experience. Ideally, the change should be implemented in a way that minimizes
backward-compatibility-breaking changes to the SPI, allowing connectors to adopt the new approach without significant
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
backward-compatibility-breaking changes to the SPI, allowing connectors to adopt the new approach without significant
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant

to unexpected query failures or incorrect results. Additionally, inconsistent handling of delimited and non-delimited
identifiers across different connectors further complicates cross-engine compatibility.

The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goals can be listed as bullet points:

  • Improve interoperability with storage engines i.e standardizing identifier handling (SQL Standard ref: ANSI?).

  • Backwards compatibility with existing system. Minimal or no breaking changes to SPI.


#### Core Changes

* In the common code path, make changes to pass original identifier (Schema, table and column names)
Copy link
Contributor

@ScrapCodes ScrapCodes Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* In the common code path, make changes to pass original identifier (Schema, table and column names)
* In the presto-spi, add new API to pass original identifier (Schema, table and column names)

@agrawalreetika agrawalreetika force-pushed the rfc-mixedcase-010 branch 2 times, most recently from bfd9603 to cc9704d Compare March 7, 2025 14:07
@agrawalreetika
Copy link
Member Author

Thanks for your review, @ScrapCodes . I've updated the document based on your comments. Please check it at your convenience.

Copy link
Contributor

@ZacBlanco ZacBlanco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this detailed proposal. I do have one question about performance. Is this something you have tested in any prototype? Will delegating out to the connectors for normalization impose any significant overhead during the analysis phase?

Presto's behavior includes:

- Delimited identifiers ("Identifier") and non-delimited identifiers (identifier) are converted to lowercase by default
unless a connector enforces a specific behavior.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless a connector enforces a specific behavior.

Where do we enforce this currently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- Retrieving metadata from connectors.
- Displaying entity names in metadata introspection commands like SHOW TABLES and DESCRIBE.

Presto handles identifiers in several ways:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Presto handles identifiers in several ways:
Presto uses identifiers in several ways:

/**
* Normalize the provided SQL identifier according to connector-specific rules
*/
default String normalizeIdentifier(ConnectorSession session, String identifier, boolean delimited)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider expanding the javadoc here to include the parameters and their descriptions (specifically I am concerned about documenting the meaning of delimited. I actually think we may want to rename this to "escaped" or maybe "quoted" as discussed in the conversation with Anant

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can do that. This is going to be a boolean for double quotes. And if in any connector we want to handle with and without quote differently https://github.com/prestodb/rfcs/pull/36/files#diff-217aa5922002c9dabc8df724c3930f3fa969a98881495ad2648be620fdb50407R30

I chose delimited currently since Identifier here defines it as delimited https://github.com/prestodb/presto/blob/master/presto-parser/src/main/java/com/facebook/presto/sql/tree/Identifier.java#L57 But I am open for suggestions for naming.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delimited is widely used in other databases too e.g. Netezza
But, Agree with Zac about documenting the meaning of delimited, and how it is a better alternative to escaped. Where each special character needs to be escaped and in case of delimited only the delimiter needs to be escaped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shoun't call it escaped since this is going to be only for double quotes boolean. May be we can rename boolean to quoted if that makes more clear? Open to hear suggestions here?

}
```

Example - Connector specific implementation -
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding an example for the use of the delimited (escaped) argument

@agrawalreetika
Copy link
Member Author

Thanks for your review @ZacBlanco, I have addressed your comments. Please check.

Also about performance, I don't have the numbers yet but here its gonna add additional API call i.e. normalizeIdentifier

Copy link
Contributor

@ScrapCodes ScrapCodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for quick turn around!,

Do you think we can call this RFC: "Standardize the handling of delimited identifiers" or "Support case sensitive identifiers in Presto". To me case sensitive is a better term than Mixed case.
Also adding clarification about what delimited identifier means, and how it is a better option over escaped identifiers. Where delimited identifier needs only delimiter to be escaped, whereas an escaped identifier needs every special character escaped.


The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards
while ensuring a seamless user experience. Ideally, the change should be implemented in a way that minimizes
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant.
breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant impact.

-----------
Test
TestTable
testtable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like, we are making presto case sensitive while remaining backwards compatible. To me the term "Mixed case" across the doc sounds less specific, not sure what others think, but case sensitive is more widely used and precise.

@agrawalreetika
Copy link
Member Author

agrawalreetika commented Mar 8, 2025

Thank for your suggestion @ScrapCodes
If it makes it more clear, than we can refactor RFC header from Mixed case identifiers to Support case-sensitive identifiers in Presto?

I chose Mixed case identifiers since there are some connector which supports some limited case of case-sentivitive behaviour so I thought since its a generic changes calling it Mixed case identifiers support would be better.

- `presto-spi`
- `presto-parser`
- `presto-base-jdbc`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding the link to the POC implementation, if you have one.

* Cover cases such as:
- Queries with mixed-case identifiers.
- Metadata retrieval commands (SHOW SCHEMAS, SHOW TABLES, DESCRIBE).
- Joins, subqueries, and alias usage with mixed-case identifiers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also add Queries between two connectors with different normalization rules, e.g MySQL and Hive


### Proposed Plan

Presto's behavior includes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
Presto's behavior includes:
Presto's default behavior is -

Comment on lines +27 to +33
- Align Presto’s identifier handling with SQL standards to improve interoperability with case-sensitive and
case-normalizing databases.
- Minimize SPI-breaking changes to maintain backward compatibility for existing connectors.
- Introduce a mechanism for connectors to define their own identifier normalization behavior.
- Allow identifiers to retain their original case where necessary, preventing unexpected query failures.
- Ensure Access Control SPI can correctly normalize identifiers.
- Preserve a seamless user experience while making these changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agrawalreetika thanks for this very helpful proposal. Bring a few questions for discussing. Please let me know if there is anything I didn't understand correctly.

It seems that when setting case-sensitive-name-matching=true, the effect of delimited identifier on catalog/schema/table is the same as that of undelimited identifier. This means that unlimited identifiers also retain the case, which seems a little different from the SQL spec. So should we add some explanations about the behavior of delimited and undelimited identifiers?

Besides, for databases like PostgreSQL, it's case sensitive for delimited column names, which means that the table can contain both column "ABCol" and column "AbCol" at the same time. Have we considered this situation? If not, should we clearly state that this situation is not supported in presto?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hantangwangd Thank you for the thoughtful review!

You're absolutely right about the behavior around delimited identifiers. As part of the initial proposal, the goal was to remove the default lowercasing of identifiers and delegate normalization to the connector level — this was introduced in PR #24551.

The current PR focuses specifically on handling schema and table names. Column names aren't included yet, as they are still lowercased at the SPI level (ColumnMetadata.java#L45). Supporting case-sensitive column names will require removing this default behavior and updating each connector to handle normalization through the metadata API. This work can be planned for a follow-up PR. LMK, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agrawalreetika I saw the PR first, and then came to read the proposal 😄. The column issue is OK as long as we're explicitly aware of the current limitations.

I'm a little concerned about how we support Postgresql. Basically, when it comes to case sensitivity, the default behavior of Postgresql follows the SQL specification most faithfully, that is, identifiers in quotation marks are treated as case-sensitive in SQL statement, whereas unquoted identifiers are automatically converted to lowercase.

So, do you think it makes sense to pass the state of delimited as a parameter when define the SPI methods? For example:

Metadata.java

    String normalizeIdentifier(Session session, String catalogName, String identifier, boolean delimited);

ConnectorMetadata.java

    /**
     * Normalize the provided SQL identifier according to connector-specific rules
    */
    default String normalizeIdentifier(ConnectorSession session, String identifier, boolean delimited)
    {
        return identifier.toLowerCase(ENGLISH);
    }

As I understand, if some connectors (such as Postgresql) is unaware of this delimited flag, it cannot make the right case-handling logic. Other connectors (such as Mysql) may optionally ignore this flag.

Haven't dug deep into the code, so not very sure about the implementation complexity. Please let me know if this is a feasible way.

Copy link
Member Author

@agrawalreetika agrawalreetika Apr 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the insightful follow-up, @hantangwangd!

You're right about the importance of the delimited flag, especially for connectors like PostgreSQL that handle identifier casing differently based on whether the identifier is quoted.

In fact, during the initial proposal, I had a similar thought and included the delimited flag as a parameter in the proposed SPI method for exactly this reason—to give connectors enough context to apply the correct case-handling logic based on whether the identifier was quoted or not.

However, in the current state of Presto, the information about whether a schema or table name was quoted in the original SQL isn't preserved by the time it reaches the SPI layer. Adding support for retaining that quoting information would require changes. So I scoped that out for now and focused on getting the base support in place, with the intention to handle quoting and delimited identifiers more fully in a follow-up iteration.
You're absolutely right that for PostgreSQL and similar connectors, this context is important.

Let me know if that sounds reasonable, or if you'd prefer we try to tackle quoting preservation sooner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agrawalreetika Thanks for your explanation.

As you said, the delimited flag is very important for handling the case sensitivity of identifier, so IMO we should at least take it into account in the proposal, since the solution designed in the proposal should be as comprehensive as possible.

For actual implementation, I agree with you that we can accomplish the whole proposal progressively though several PRs ---- especially if retaining the quoting information for schema/table name need a lot of changes. Do you think this makes sense?

Copy link
Contributor

@aaneja aaneja Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hantangwangd The SQL standard is interepreted differently by different vendors, e.g DB2 and Oracle choose to UPPER-case by default 1, Postgres chooses to lower-case

IMO, whatever way we slice it, given that we federate to connectors for storage, there is going to be ambiguity and disagreement on the 'right' behavior for DDL statements run via Presto
We could call this out in the RFC and point to these examples as what behavior to expect

Do you see any gotchas for SELECT / DML statements though ?
I think this RFC is primarliy enabling reading case-sensitive schemas/tables/columns and we should shake out those issues first

Footnotes

  1. https://dba.stackexchange.com/questions/321413/why-are-unquoted-identifiers-upper-cased-per-sql-92

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could call this out in the RFC and point to these examples as what behavior to expect

@aaneja 100% agree with this, it's always helpful to demonstrate the specific expected behavior through concrete examples.

Do you see any gotchas for SELECT / DML statements though ?
I think this RFC is primarliy enabling reading case-sensitive schemas/tables/columns and we should shake out those issues first

Seems that the core of our discussion revolves around how we intend to handle unquoted identifiers. If the goal is to quickly enable reading case-sensitive schemas/tables/columns, I agree that — with case-sensitive-name-matching flag be enabled — treating unquoted identifiers as strictly case-sensitive is a reasonable way, and the approach is OK for now since the parsing and maintaining of the delimited flag for schema/table identifiers would likely require a lot of efforts.

In future, if we implement maintaining and handling of the delimited flag for all identifiers, it seems that we no longer need this dedicated case-sensitive-name-matching configuration. At that point, it could be more align with SQL standard behavior: when users use unquoted identifiers, it implies case insensitivity (with specific case conversion handled by each connector); whereas when users use quoted identifiers, they meant to enforce strict case sensitivity. FYI, this behavior has been discussed for the case sensitivity of RowType field names and JSON column names, see here and here.

Copy link
Member Author

@agrawalreetika agrawalreetika Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your pointers and review @aaneja & @hantangwangd. I will add the example as well in the RFC for clarity.

In future, if we implement maintaining and handling of the delimited flag for all identifiers, it seems that we no longer need this dedicated case-sensitive-name-matching configuration.

My understanding here is, once we start sending delimiter flag as well for normalizeIdentifier API then in case of,

  • delimiter = false : it could be legacy (converting it to lower case) and
  • delimiter = true : it would be converted to connector specific case.
    We don't have to have case-sensitive-name-matching configuration check in the normalizeIdentifier API.

Current behaviour is -
If case-sensitive-name-matching = false

  • converts schema/table names to lower case with or w/o delimiter (Existing behaviour)

case-sensitive-name-matching = true

  • converts schema/table names to connector-specific casing with or w/o delimiter

Copy link
Member

@hantangwangd hantangwangd Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agrawalreetika, thanks for your summarizing, great understanding overall, just on little point to clarify when we start sending delimiter flag as well for normalizeIdentifier API.

As I understand, we send the original-case identifiers along with their delimiter flags to the normalizeIdentifier API, and then in case of,

  • delimiter = false: the identifiers need to be normalized. Presto's default behavior is lowercase conversion, but connectors can customize their own normalization logic ---- like Oracle uppercasing them, PostgreSQL lowercasing them, and MySQL keep the case as-is.
  • delimiter = true: the identifiers always need to preserve their original case.

Do you think this behavior makes sense? If there's anything I misunderstood, please let me know.

Copy link
Member Author

@agrawalreetika agrawalreetika Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fair to me.
We will also have flexibility in here, since we have normalizeIdentifier API layout already. In case we still want to have a flag to keep the current legacy behavior (lower-case), we would have that option as well.

@aaneja @hantangwangd I’ve added detailed Behavioral Examples here to the RFC based on our earlier discussion. Please take a look at your convenience and share any feedback.

Also, when you get a chance, kindly drop a review on the corresponding PR: prestodb/presto#24551. Thanks a lot!

Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying the behaviors of both quoted and unquoted identifiers with case-sensitive-name-matching flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PRs from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants