RFC for supporting mixed case identifiers #36

agrawalreetika · 2025-02-17T13:02:18Z

RFC for supporting mixed case identifiers

ScrapCodes

Hi @agrawalreetika, thanks for the detailed write-up. I have a few minor comments.

ScrapCodes · 2025-03-07T06:06:42Z

RFC-0010-mixed-case-identifier-support.md

+## Test Plan
+
+* Ensure that existing CI tests pass for connectors where no specific implementation is added.
+* Add support for mixed-case identifiers in at least one JDBC connector (e.g., MySQL, PostgreSQL) and create relevant unit tests.


Suggested change

-* Add support for mixed-case identifiers in at least one JDBC connector (e.g., MySQL, PostgreSQL) and create relevant unit tests.
+* Add unit tests for testing mixed-case identifiers support in a JDBC connector (e.g., MySQL, PostgreSQL).

ScrapCodes · 2025-03-07T06:08:08Z

RFC-0010-mixed-case-identifier-support.md

+
+## Background
+
+Presto treats all identifiers as case-insensitive, normalizing them (typically to lowercase). This creates issues when


Suggested change

-Presto treats all identifiers as case-insensitive, normalizing them (typically to lowercase). This creates issues when
+Presto treats all identifiers as case-insensitive, normalizing them to lowercase. This creates issues when

ScrapCodes · 2025-03-07T06:13:36Z

RFC-0010-mixed-case-identifier-support.md

+querying databases that are case-sensitive (e.g., MySQL, PostgreSQL) or case-normalizing to uppercase (e.g., Oracle,
+DB2). Without a standard approach, identifiers might not match the actual names in the underlying data sources, leading
+to unexpected query failures or incorrect results. Additionally, inconsistent handling of delimited and non-delimited
+identifiers across different connectors further complicates cross-engine compatibility.


Can you elaborate more on the inconsistent handling of delimited and non-delimited identifiers?

ScrapCodes · 2025-03-07T06:19:45Z

RFC-0010-mixed-case-identifier-support.md

+
+The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards
+while ensuring a seamless user experience. Ideally, the change should be implemented in a way that minimizes
+backward-compatibility-breaking changes to the SPI, allowing connectors to adopt the new approach without significant


Suggested change

-backward-compatibility-breaking changes to the SPI, allowing connectors to adopt the new approach without significant
+breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant

ScrapCodes · 2025-03-07T06:24:27Z

RFC-0010-mixed-case-identifier-support.md

+to unexpected query failures or incorrect results. Additionally, inconsistent handling of delimited and non-delimited
+identifiers across different connectors further complicates cross-engine compatibility.
+
+The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards


Goals can be listed as bullet points:

- Improve interoperability with storage engines, i.e. by standardizing identifier handling (SQL standard reference: ANSI?).
- Maintain backward compatibility with the existing system: minimal or no breaking changes to the SPI.

ScrapCodes · 2025-03-07T06:42:04Z

RFC-0010-mixed-case-identifier-support.md

+
+#### Core Changes
+
+* In the common code path, make changes to pass original identifier (Schema, table and column names)


Suggested change

-* In the common code path, make changes to pass original identifier (Schema, table and column names)
+* In the presto-spi, add new API to pass original identifier (Schema, table and column names)

agrawalreetika · 2025-03-07T14:09:19Z

Thanks for your review, @ScrapCodes . I've updated the document based on your comments. Please check it at your convenience.

ZacBlanco

Thanks for this detailed proposal. I do have one question about performance. Is this something you have tested in any prototype? Will delegating out to the connectors for normalization impose any significant overhead during the analysis phase?

ZacBlanco · 2025-03-07T22:28:40Z

RFC-0011-mixed-case-identifier-support.md

+Presto's behavior includes:
+
+- Delimited identifiers ("Identifier") and non-delimited identifiers (identifier) are converted to lowercase by default
+  unless a connector enforces a specific behavior.


> unless a connector enforces a specific behavior.

Where do we enforce this currently?

It's going to be in the ConnectorMetadata API implementation here: https://github.com/prestodb/rfcs/pull/36/files#diff-217aa5922002c9dabc8df724c3930f3fa969a98881495ad2648be620fdb50407R105

ZacBlanco · 2025-03-07T22:29:38Z

RFC-0011-mixed-case-identifier-support.md

+  - Retrieving metadata from connectors.
+  - Displaying entity names in metadata introspection commands like SHOW TABLES and DESCRIBE.
+
+Presto handles identifiers in several ways:


Suggested change

-Presto handles identifiers in several ways:
+Presto uses identifiers in several ways:

ZacBlanco · 2025-03-07T22:33:57Z

RFC-0011-mixed-case-identifier-support.md

+    /**
+     * Normalize the provided SQL identifier according to connector-specific rules
+    */
+    default String normalizeIdentifier(ConnectorSession session, String identifier, boolean delimited)


Consider expanding the javadoc here to include the parameters and their descriptions (specifically, I am concerned about documenting the meaning of delimited). I actually think we may want to rename this to "escaped" or maybe "quoted", as discussed in the conversation with Anant.

Ok, we can do that. This is going to be a boolean indicating whether the identifier was double-quoted, so that any connector that wants to handle quoted and unquoted identifiers differently can do so: https://github.com/prestodb/rfcs/pull/36/files#diff-217aa5922002c9dabc8df724c3930f3fa969a98881495ad2648be620fdb50407R30

I chose delimited for now because Identifier defines it as delimited: https://github.com/prestodb/presto/blob/master/presto-parser/src/main/java/com/facebook/presto/sql/tree/Identifier.java#L57 But I am open to naming suggestions.

Delimited is widely used in other databases too, e.g. Netezza.
But I agree with Zac about documenting the meaning of delimited and why it is a better alternative to escaped: with escaping, each special character needs to be escaped, whereas with a delimited identifier only the delimiter itself needs to be escaped.

I think we shouldn't call it escaped, since this boolean only covers double quotes. Maybe we can rename the boolean to quoted if that makes it clearer? Open to suggestions here.
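To make the javadoc request concrete, here is a minimal sketch of what a documented signature could look like. `SketchSession` and `SketchConnectorMetadata` are hypothetical stand-ins for `ConnectorSession` and `ConnectorMetadata` so the snippet compiles on its own. The parameter descriptions are one possible wording, not the text from the RFC, and the lowercasing default is the one quoted later in this thread.

```java
import java.util.Locale;

// Hypothetical stand-in for Presto's ConnectorSession, only here so the sketch compiles on its own.
interface SketchSession {}

// Hypothetical stand-in for the ConnectorMetadata SPI interface discussed in this thread.
interface SketchConnectorMetadata
{
    /**
     * Normalize the provided SQL identifier according to connector-specific rules.
     *
     * @param session the session in which the identifier is being resolved
     * @param identifier the identifier as written in the query, without surrounding quotes
     * @param delimited true if the identifier was enclosed in double quotes in the SQL text
     *        (for example "MyTable"), false if it was written bare (for example MyTable)
     * @return the identifier in the form the connector uses for metadata lookups
     */
    default String normalizeIdentifier(SketchSession session, String identifier, boolean delimited)
    {
        // Default behavior: lowercase regardless of quoting.
        return identifier.toLowerCase(Locale.ENGLISH);
    }
}

class NormalizeIdentifierJavadocSketch
{
    public static void main(String[] args)
    {
        SketchConnectorMetadata metadata = new SketchConnectorMetadata() {};
        SketchSession session = new SketchSession() {};
        // Both calls go through the default, so quoting makes no difference here.
        System.out.println(metadata.normalizeIdentifier(session, "MyTable", true));
        System.out.println(metadata.normalizeIdentifier(session, "MyTable", false));
    }
}
```

A connector that cares about quoting would override the default and branch on `delimited`; connectors that do not care can leave the default untouched.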

ZacBlanco · 2025-03-07T22:34:31Z

RFC-0011-mixed-case-identifier-support.md

+    }
+```
+
+Example - Connector specific implementation -


Consider adding an example for the use of the delimited (escaped) argument

agrawalreetika · 2025-03-08T04:36:12Z

Thanks for your review @ZacBlanco, I have addressed your comments. Please check.

Also, about performance: I don't have numbers yet, but this change adds one additional API call, i.e. normalizeIdentifier.

ScrapCodes

Thanks for the quick turnaround!

Do you think we can call this RFC "Standardize the handling of delimited identifiers" or "Support case sensitive identifiers in Presto"? To me, case sensitive is a better term than mixed case.
Also consider adding a clarification about what a delimited identifier means and why it is a better option than an escaped identifier: a delimited identifier needs only the delimiter to be escaped, whereas an escaped identifier needs every special character escaped.

ScrapCodes · 2025-03-08T06:02:22Z

RFC-0011-mixed-case-identifier-support.md

+
+The goal here is to improve interoperability with storage engines by aligning identifier handling with SQL standards
+while ensuring a seamless user experience. Ideally, the change should be implemented in a way that minimizes
+breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant.


Suggested change

-breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant.
+breaking changes to the SPI, i.e. allowing connectors to adopt the new approach without significant impact.

ScrapCodes · 2025-03-08T06:11:44Z

RFC-0011-mixed-case-identifier-support.md

+-----------
+ Test      
+ TestTable 
+ testtable 


It sounds like we are making Presto case sensitive while remaining backwards compatible. To me, the term "Mixed case" used across the doc sounds less specific. I'm not sure what others think, but "case sensitive" is more widely used and more precise.

agrawalreetika · 2025-03-08T06:40:44Z

Thanks for your suggestion, @ScrapCodes.
If it makes things clearer, we can change the RFC title from "Mixed case identifiers" to "Support case-sensitive identifiers in Presto".

I chose "Mixed case identifiers" because some connectors support only a limited form of case-sensitive behaviour, and since this is a generic change, I thought calling it mixed-case identifier support would be better.

ScrapCodes · 2025-03-10T06:00:26Z

RFC-0011-mixed-case-identifier-support.md

+- `presto-spi`
+- `presto-parser`
+- `presto-base-jdbc`
+


Consider adding the link to the POC implementation, if you have one.

aaneja · 2025-03-12T08:36:23Z

RFC-0011-mixed-case-identifier-support.md

+* Cover cases such as:
+  - Queries with mixed-case identifiers.
+  - Metadata retrieval commands (SHOW SCHEMAS, SHOW TABLES, DESCRIBE).
+  - Joins, subqueries, and alias usage with mixed-case identifiers.


Can also add queries between two connectors with different normalization rules, e.g. MySQL and Hive.

aaneja · 2025-03-12T08:36:59Z

RFC-0011-mixed-case-identifier-support.md

+
+### Proposed Plan
+
+Presto's behavior includes:


nit:

Suggested change

-Presto's behavior includes:
+Presto's default behavior is -

hantangwangd · 2025-04-12T12:56:10Z

RFC-0011-mixed-case-identifier-support.md

+- Align Presto’s identifier handling with SQL standards to improve interoperability with case-sensitive and
+  case-normalizing databases.
+- Minimize SPI-breaking changes to maintain backward compatibility for existing connectors.
+- Introduce a mechanism for connectors to define their own identifier normalization behavior.
+- Allow identifiers to retain their original case where necessary, preventing unexpected query failures.
+- Ensure Access Control SPI can correctly normalize identifiers.
+- Preserve a seamless user experience while making these changes.


@agrawalreetika thanks for this very helpful proposal. I have a few questions to discuss. Please let me know if there is anything I didn't understand correctly.

It seems that when setting case-sensitive-name-matching=true, a delimited identifier for a catalog/schema/table behaves the same as an undelimited one. This means that undelimited identifiers also retain their case, which seems a little different from the SQL spec. So should we add some explanation of the behavior of delimited and undelimited identifiers?

Besides, in databases like PostgreSQL, delimited column names are case sensitive, which means a table can contain both column "ABCol" and column "AbCol" at the same time. Have we considered this situation? If not, should we clearly state that it is not supported in Presto?
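The column collision concern can be illustrated with a small standalone sketch. It assumes column names are lowercased with `Locale.ENGLISH` before matching; `ColumnCollisionSketch` and `findCollisions` are hypothetical names used only for illustration and are not part of Presto.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ColumnCollisionSketch
{
    // Groups column names by their lowercased form and keeps only the groups
    // that contain more than one distinct source name.
    static Map<String, List<String>> findCollisions(List<String> columnNames)
    {
        Map<String, List<String>> byLowercase = new LinkedHashMap<>();
        for (String name : columnNames) {
            byLowercase.computeIfAbsent(name.toLowerCase(Locale.ENGLISH), key -> new ArrayList<>()).add(name);
        }
        byLowercase.values().removeIf(names -> names.size() < 2);
        return byLowercase;
    }

    public static void main(String[] args)
    {
        // "ABCol" and "AbCol" are distinct delimited names in the source database,
        // but they map to the same key once lowercased.
        System.out.println(findCollisions(List.of("ABCol", "AbCol", "id")));
    }
}
```

Any non-empty result means the two source columns cannot be told apart after lowercasing, which is the case the question asks the RFC to either support or explicitly rule out.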

@hantangwangd Thank you for the thoughtful review!

You're absolutely right about the behavior around delimited identifiers. As part of the initial proposal, the goal was to remove the default lowercasing of identifiers and delegate normalization to the connector level — this was introduced in PR #24551.

The current PR focuses specifically on handling schema and table names. Column names aren't included yet, as they are still lowercased at the SPI level (ColumnMetadata.java#L45). Supporting case-sensitive column names will require removing this default behavior and updating each connector to handle normalization through the metadata API. This work can be planned for a follow-up PR. LMK, what do you think?

@agrawalreetika I saw the PR first, and then came to read the proposal 😄. The column issue is OK as long as we're explicitly aware of the current limitations.

I'm a little concerned about how we support PostgreSQL. When it comes to case sensitivity, the default behavior of PostgreSQL follows the SQL specification most faithfully: identifiers in quotation marks are treated as case-sensitive in SQL statements, whereas unquoted identifiers are automatically converted to lowercase.

So, do you think it makes sense to pass the delimited state as a parameter when defining the SPI methods? For example:

Metadata.java

String normalizeIdentifier(Session session, String catalogName, String identifier, boolean delimited);

ConnectorMetadata.java

/**
 * Normalize the provided SQL identifier according to connector-specific rules
 */
default String normalizeIdentifier(ConnectorSession session, String identifier, boolean delimited)
{
    return identifier.toLowerCase(ENGLISH);
}

As I understand it, if a connector (such as PostgreSQL) is unaware of this delimited flag, it cannot apply the right case-handling logic. Other connectors (such as MySQL) may optionally ignore the flag.

I haven't dug deep into the code, so I'm not sure about the implementation complexity. Please let me know if this is feasible.

Thanks again for the insightful follow-up, @hantangwangd!

You're right about the importance of the delimited flag, especially for connectors like PostgreSQL that handle identifier casing differently based on whether the identifier is quoted.

In fact, during the initial proposal, I had a similar thought and included the delimited flag as a parameter in the proposed SPI method for exactly this reason—to give connectors enough context to apply the correct case-handling logic based on whether the identifier was quoted or not.

However, in the current state of Presto, the information about whether a schema or table name was quoted in the original SQL isn't preserved by the time it reaches the SPI layer. Adding support for retaining that quoting information would require changes. So I scoped that out for now and focused on getting the base support in place, with the intention to handle quoting and delimited identifiers more fully in a follow-up iteration.
You're absolutely right that for PostgreSQL and similar connectors, this context is important.

Let me know if that sounds reasonable, or if you'd prefer we try to tackle quoting preservation sooner.

@agrawalreetika Thanks for your explanation.

As you said, the delimited flag is very important for handling the case sensitivity of identifiers, so IMO we should at least take it into account in the proposal, since the solution designed in the proposal should be as comprehensive as possible.

For the actual implementation, I agree with you that we can accomplish the whole proposal progressively through several PRs, especially if retaining the quoting information for schema/table names needs a lot of changes. Do you think this makes sense?

@hantangwangd The SQL standard is interpreted differently by different vendors, e.g. DB2 and Oracle choose to upper-case by default ¹, while Postgres chooses to lower-case.

IMO, whatever way we slice it, given that we federate to connectors for storage, there is going to be ambiguity and disagreement on the 'right' behavior for DDL statements run via Presto.
We could call this out in the RFC and point to these examples as what behavior to expect.

Do you see any gotchas for SELECT / DML statements though?
I think this RFC is primarily enabling reading case-sensitive schemas/tables/columns, and we should shake out those issues first.

Footnotes

1. https://dba.stackexchange.com/questions/321413/why-are-unquoted-identifiers-upper-cased-per-sql-92

> We could call this out in the RFC and point to these examples as what behavior to expect

@aaneja 100% agree with this; it's always helpful to demonstrate the specific expected behavior through concrete examples.

> Do you see any gotchas for SELECT / DML statements though?
> I think this RFC is primarily enabling reading case-sensitive schemas/tables/columns and we should shake out those issues first

It seems the core of our discussion is how we intend to handle unquoted identifiers. If the goal is to quickly enable reading case-sensitive schemas/tables/columns, I agree that, with the case-sensitive-name-matching flag enabled, treating unquoted identifiers as strictly case-sensitive is reasonable. The approach is OK for now, since parsing and maintaining the delimited flag for schema/table identifiers would likely require a lot of effort.

In the future, if we implement maintaining and handling of the delimited flag for all identifiers, it seems we would no longer need the dedicated case-sensitive-name-matching configuration. At that point, the behavior could be more aligned with the SQL standard: unquoted identifiers imply case insensitivity (with the specific case conversion handled by each connector), whereas quoted identifiers enforce strict case sensitivity. FYI, this behavior has been discussed for the case sensitivity of RowType field names and JSON column names, see here and here.

Thanks for your pointers and review, @aaneja & @hantangwangd. I will add the examples to the RFC for clarity.

> In future, if we implement maintaining and handling of the delimited flag for all identifiers, it seems that we no longer need this dedicated case-sensitive-name-matching configuration.

My understanding is that once we also send the delimiter flag to the normalizeIdentifier API:

- delimiter = false: it could follow the legacy behaviour (converting to lower case), and
- delimiter = true: it would be converted to the connector-specific case.

We would then not need a case-sensitive-name-matching configuration check in the normalizeIdentifier API.

Current behaviour:

- case-sensitive-name-matching = false: converts schema/table names to lower case, with or without delimiters (existing behaviour).
- case-sensitive-name-matching = true: converts schema/table names to connector-specific casing, with or without delimiters.
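The current behaviour summarized above can be sketched roughly as follows. `CurrentBehaviorSketch`, `normalize`, and the `connectorCasing` hook are hypothetical names for illustration only; the boolean parameter mirrors the case-sensitive-name-matching setting. The method deliberately has no delimited parameter, since quoting does not affect the result at this stage.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

class CurrentBehaviorSketch
{
    // caseSensitiveNameMatching mirrors the case-sensitive-name-matching setting.
    // connectorCasing is a hypothetical hook standing in for whatever casing a connector applies.
    static String normalize(String identifier, boolean caseSensitiveNameMatching, UnaryOperator<String> connectorCasing)
    {
        if (!caseSensitiveNameMatching) {
            // Existing behaviour: always lowercase, quoted or not.
            return identifier.toLowerCase(Locale.ENGLISH);
        }
        // Flag enabled: defer to the connector, again regardless of quoting.
        return connectorCasing.apply(identifier);
    }

    public static void main(String[] args)
    {
        // A connector that keeps names as-is is modelled with the identity function.
        System.out.println(normalize("TestTable", false, UnaryOperator.identity()));
        System.out.println(normalize("TestTable", true, UnaryOperator.identity()));
    }
}
```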

@agrawalreetika, thanks for the summary, great understanding overall. Just one small point to clarify for when we start sending the delimiter flag to the normalizeIdentifier API.

As I understand it, we send the original-case identifiers along with their delimiter flags to the normalizeIdentifier API, and then:

- delimiter = false: the identifiers need to be normalized. Presto's default behavior is lowercase conversion, but connectors can customize their own normalization logic, e.g. Oracle uppercasing them, PostgreSQL lowercasing them, and MySQL keeping the case as-is.
- delimiter = true: the identifiers always preserve their original case.

Do you think this behavior makes sense? If there's anything I misunderstood, please let me know.
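As a rough sketch of the delimiter-aware behaviour described in this comment: `UnquotedCasing` and `DelimitedAwareSketch` are hypothetical names, and the three casing modes simply encode the Oracle, PostgreSQL, and MySQL examples mentioned above. This is an illustration of the proposed semantics, not code from the RFC or the PR.

```java
import java.util.Locale;

// Hypothetical casing rules a connector might apply to unquoted identifiers.
enum UnquotedCasing
{
    LOWER,    // e.g. PostgreSQL, and Presto's default
    UPPER,    // e.g. Oracle
    PRESERVE  // e.g. MySQL
}

class DelimitedAwareSketch
{
    static String normalize(String identifier, boolean delimited, UnquotedCasing casing)
    {
        if (delimited) {
            // Quoted identifiers always keep their original case.
            return identifier;
        }
        switch (casing) {
            case UPPER:
                return identifier.toUpperCase(Locale.ENGLISH);
            case PRESERVE:
                return identifier;
            default:
                return identifier.toLowerCase(Locale.ENGLISH);
        }
    }

    public static void main(String[] args)
    {
        for (UnquotedCasing casing : UnquotedCasing.values()) {
            System.out.println(casing + " unquoted: " + normalize("TestTable", false, casing));
            System.out.println(casing + " quoted:   " + normalize("TestTable", true, casing));
        }
    }
}
```

Under this model the quoted form is identical for every connector, and only the unquoted form varies, which is why a separate case-sensitivity switch would no longer be needed.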

Sounds fair to me.
We will also have flexibility here, since the normalizeIdentifier API is already in place. If we still want a flag to keep the current legacy behavior (lower-case), we would have that option as well.

@aaneja @hantangwangd I’ve added detailed Behavioral Examples here to the RFC based on our earlier discussion. Please take a look at your convenience and share any feedback.

Also, when you get a chance, kindly drop a review on the corresponding PR: prestodb/presto#24551. Thanks a lot!

hantangwangd

Thanks for clarifying the behaviors of both quoted and unquoted identifiers with case-sensitive-name-matching flag.

prestodb-ci added the from:IBM PRs from IBM label Feb 17, 2025

prestodb-ci requested review from a team, bibith4 and pratyakshsharma and removed request for a team February 17, 2025 13:02

agrawalreetika force-pushed the rfc-mixedcase-010 branch from f46e9a3 to 121529c Compare February 18, 2025 10:16

agrawalreetika force-pushed the rfc-mixedcase-010 branch from 121529c to 9b2de30 Compare March 3, 2025 10:26

ScrapCodes suggested changes Mar 7, 2025

ScrapCodes reviewed Mar 7, 2025

agrawalreetika force-pushed the rfc-mixedcase-010 branch 2 times, most recently from bfd9603 to cc9704d Compare March 7, 2025 14:07

ZacBlanco requested changes Mar 7, 2025

agrawalreetika force-pushed the rfc-mixedcase-010 branch from cc9704d to b591ab3 Compare March 8, 2025 04:27

ScrapCodes reviewed Mar 8, 2025

ScrapCodes reviewed Mar 10, 2025

agrawalreetika force-pushed the rfc-mixedcase-010 branch from b591ab3 to 5dd2e27 Compare March 10, 2025 17:25

rschlussel requested a review from gggrace14 March 11, 2025 14:54

aaneja reviewed Mar 12, 2025

aaneja approved these changes Mar 12, 2025

ScrapCodes approved these changes Mar 13, 2025

agrawalreetika mentioned this pull request Apr 8, 2025

Mixed case identifier support prestodb/presto#24551 (Merged)

hantangwangd reviewed Apr 12, 2025

Add mixed case identifiers RFC 1d5a0eb

agrawalreetika force-pushed the rfc-mixedcase-010 branch from 5dd2e27 to 1d5a0eb Compare April 25, 2025 05:56

hantangwangd approved these changes Apr 29, 2025

agrawalreetika mentioned this pull request Jun 26, 2025

Normalize ColumnMetadata to support case-sensitive column names prestodb/presto#24983 (Merged)
