Skip to content

Conversation

@nastra
Copy link
Contributor

@nastra nastra commented Aug 5, 2024

The REST spec currently uses %1F as the UTF-8 encoded namespace separator for multi-part namespaces.
This causes issues, since it's a control character and the Servlet spec can reject such characters.

This PR makes the hard-coded namespace separator configurable by giving servers an option to send an optional namespace separator instead of %1F. The configuration part is entirely optional for REST server implementers and there's no behavioral change for existing installations.

fixes ##10338.

@github-actions github-actions bot added the core label Aug 5, 2024
@nastra nastra force-pushed the configurable-namespace-separator branch 3 times, most recently from ffb244e to a4313e4 Compare August 5, 2024 12:35
@nastra nastra closed this Aug 5, 2024
@nastra nastra reopened this Aug 5, 2024
@nastra nastra force-pushed the configurable-namespace-separator branch from a4313e4 to c599527 Compare August 6, 2024 08:59
@nastra nastra force-pushed the configurable-namespace-separator branch 2 times, most recently from 06fbde2 to 92b8b8b Compare August 6, 2024 10:05
@nastra nastra requested a review from amogh-jahagirdar August 6, 2024 14:31
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach 👍

If not provided or empty, all top-level namespaces should be listed.
If parent is a multipart namespace, the parts must be separated by the unit separator (`0x1F`) byte.
If parent is a multipart namespace, the parts must be separated by the namespace separator as
indicated via the /config override `rest-namespace-separator`, which defaults to the unit separator (`0x1F`) byte.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach will break old client in environments where the server is able to accept the non-printable separator, but chooses to use an alternative separator.

So, the upgrade path for users would be to first upgrade clients, then use new servers. As for me, it can be a significant burden on users. Also, this will require that new clients be API-compatible with older servers during the transition period... WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't break old clients and there's even a test that makes sure old clients send %1F while the server chose %2E. See TestRestUtil#encodeAsOldClientAndDecodeAsNewServer()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add that to the spec. I do not think we can assume that all REST server implementation reuse Iceberg java code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it stands now "must" in line 226 disallows clients to use the old separator.

@nastra nastra force-pushed the configurable-namespace-separator branch 2 times, most recently from f0c8ad2 to 1b00cd5 Compare August 8, 2024 14:49
Comment on lines 227 to 228
To be compatible with older clients, servers have to use `0x1F` as a fallback even when advertising a different
namespace separator to clients.
Copy link
Contributor

@dimas-b dimas-b Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @nastra . Sorry to be nit-picky, but I believe this text is still not clear enough. I'd suggest To be compatible with older clients, servers should use both the advertised separator and 0x1F as valid separators when parsing namespaces. (feel free to edit)

Copy link
Contributor

@dimas-b dimas-b Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that it is not clear what is "fallback" in this case. I propose to treat both old and new separators equally (always respect both, even when intermixed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx - LGTM 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I would disagree that a server should respect both intermixed. Only a single namespace separator should be used during encoding so the server should respect both during decoding, but not intermixed (a namespace shouldn't be encoded using %1F and %2E at the same time for example)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the server know what (single) separator is effective for a particular request?

@nastra nastra force-pushed the configurable-namespace-separator branch from 1b00cd5 to 4b5ed11 Compare August 8, 2024 14:52
!Strings.isNullOrEmpty(namespaceSeparator), "Invalid namespace separator: null or empty");
String[] levels;

// for backwards compatibility
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary if the client sends a specific separator in query params, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be still necessary to be compatible with an old client that doesn't send send the query param and doesn't respect the new rest-namespace-separator that the server sends. There's also a specific test that verifies this scenario in TestRestUtil.encodeAsOldClientAndDecodeAsNewServer()

/** Adaptor class to translate REST requests into {@link Catalog} API calls. */
public class RESTCatalogAdapter implements RESTClient {
private static final Splitter SLASH = Splitter.on('/');
private static final String NAMESPACE_SEPARATOR = "%2E";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not .? It's not a character that needs to be escaped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CatalogTests#testNamespaceWithDot would fail when using . here, so in this case it needs to be the UTF-8 encoded string.

@Test
public void testRoundTripUrlEncodeDecodeNamespace() {
@ParameterizedTest
@ValueSource(strings = {"%1F", "%2D", "%2E"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't escaping be handled automatically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but the test also uses the non-UTF-8 strings within the namespace, so we need to use the UTF-8 encoded string. I've added some non-UTF-8 encoded strings to the parameters list to indicate that such strings can be used as well (as long as they aren't allowed in the namespace name itself)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a slight distinction that's getting blurred in the code - as I understand it, UTF-8 and URLEncoding are two separate/distinct concepts, and what we seem to mean in the Iceberg code is that these are the URLEncoded strings after apply UTF-8 encoding for the underlying bytes.

My guess is that the URLEncoder javadoc makes it easy to mix this up since it says:

Note: The [World Wide Web Consortium Recommendation](http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars) states that UTF-8 should be used. Not doing so may introduce incompatibilities.

This statement seems to imply that the URLEncoding convention is itself fundamental to UTF-8, when it seems to be intending to just mean the underlying bytes encoding scheme should first be done as UTF-8 and then URL-encoding via "percent-escaping" is applied afterwards. This is more clear in its linked doc where it calls out the two separate steps:

https://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars

Although URIs do not contain non-ASCII values (see [[URI]](https://www.w3.org/TR/html40/references.html#ref-URI), section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with [%URI;](https://www.w3.org/TR/html40/sgml/dtd.html#URI) in the [DTD](https://www.w3.org/TR/html40/sgml/dtd.html)). For instance, the following [href](https://www.w3.org/TR/html40/struct/links.html#adef-href) value is illegal:


<A href="http://foo.org/Håkon">...</A>
We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:


1. Represent each character in UTF-8 (see [[RFC2279]](https://www.w3.org/TR/html40/references.html#ref-RFC2279)) as one or more bytes.
2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

@ZacBlanco
Copy link

ZacBlanco commented Apr 22, 2025

@nastra Do you intend to continue working on this? I am seeing issues due to this in our test environments and would like to get a fix in upstream. I would be happy to take over the PR if you are preoccupied

@nastra
Copy link
Contributor Author

nastra commented Apr 24, 2025

@nastra Do you intend to continue working on this? I am seeing issues due to this in our test environments and would like to get a fix in upstream. I would be happy to take over the PR if you are preoccupied

@ZacBlanco this change requires a Spec change where the VOTE hasn't been closed/finalized yet

@jbonofre jbonofre self-requested a review May 5, 2025 14:38
Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, I just wonder if the client will actually get an error (500 for instance) if client and server don't use the same namespace separator (client send a separator different from the one expected by the server).
Also, related but not in the scope of this PR, if the server impl changes the namespace separator, it has to decide if it uses an IR or update the existing namespace in the storage.

Copy link
Member

@snazy snazy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear this approach can easily cause issues. A couple concerns:

A client configures a non-default separator (e.g. _) but uses an older server, which is not aware of this. The client sends a "create namespace" for a.b, but the server would parse it as a top-level namespace a_b. I think that this is not the intended behavior.

Also the other way around: the server's configured with a non-default separator and pushed the config to the client, but it's an older client that cannot handle it and silently ignores is.

Another situation: what if an Iceberg REST service is re-configured to use a different separator but a client already got its config - so it continues to use the "old" separator char.

@nastra nastra force-pushed the configurable-namespace-separator branch 3 times, most recently from 406844d to f4f6e2f Compare October 15, 2025 10:09
@stevenzwu
Copy link
Contributor

@nastra I assume we are ready revive this work and this is ready for review?

@nastra
Copy link
Contributor Author

nastra commented Oct 30, 2025

@nastra I assume we are ready revive this work and this is ready for review?

@stevenzwu yes I revived this work and it's ready for review. Note that I extracted the spec change into #14448

Copy link
Contributor

@dennishuo dennishuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on the overall concept and functionality

On the details of "encodings" though, I think some imprecision/incorrectness has gradually leaked into terminology in this part of the code over time relating to confusion between:

  1. The human-readable representation of underlying bytes (e.g. the string literals that appear in code and comments here)
  2. The character set (e.g. "everything defined by Unicode")
  3. The binary encoding (e.g. UTF-8,, UTF-16, etc)
  4. The URL-encoding (e.g. representing non-URL-safe characters with percent-escaping)

I believe sometimes "Unicode" is incorrectly used to mean UTF-16, and sometimes Java char is incorrectly assumed to be UTF-16.

Here's a good stackoverflow example to help consider the differences (I haven't personally verified correctness of the exact binary values, but at least the concept is worth trying to disambiguate): https://stackoverflow.com/a/27939161/3777211

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001
encode 6C49 as UTF-8:     11100110 10110001 10001001

If I plug that character into https://meyerweb.com/eric/tools/dencoder/ I get:

%E6%B1%89

Converting, I see 0xE6 == 230 dec == 11100110 binary, 0xB1 == 177 == 10110001 and 0x89 == 137 == 10001001.

So the website indeed used default UTF-8 encoding and then URL-escaped it.

Looking at the code it looks like we're asking for the server to give the already-URL-escaped version of the string, e.g. the server would give the string-literal "%E6%B1%89" in its config response, right?

If so I'd suggest we should document and name variables and config keys to make it clear that it's expected to already be URL-encoded.

@nastra nastra force-pushed the configurable-namespace-separator branch from 4915f0f to 3c61adc Compare December 4, 2025 14:59
@nastra nastra force-pushed the configurable-namespace-separator branch from 3c61adc to 817cc21 Compare December 8, 2025 14:46
Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @nastra !

* @param ns namespace to encode
* @return UTF-8 encoded string representing the namespace, suitable for use as a URL parameter
*/
public static String encodeNamespace(Namespace ns) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this util method any more ? i wonder if we should mark it deprecated or explicity mention this will encoded with a certain seperator in the public doc ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, we should deprecate this one

@nastra
Copy link
Contributor Author

nastra commented Dec 9, 2025

thanks everyone for the reviews

@nastra nastra merged commit 344adf9 into apache:main Dec 9, 2025
44 checks passed
@nastra nastra deleted the configurable-namespace-separator branch December 9, 2025 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.