Skip to content

Interface changes for pagination #1528

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 53 commits into from
May 9, 2025

Conversation

eric-maynard
Copy link
Contributor

The Apache Iceberg REST catalog spec describes how APIs like listNamespaces can implement pagination, but Polaris currently does not implement pagination.

This PR introduces a framework that will allow us to enable pagination for these APIs and that can also be extended to other APIs like listCatalogs.


This is a modified version of #273 that removes the actual pagination implementation at the persistence level.

@eric-maynard eric-maynard requested a review from HonahX as a code owner May 5, 2025 17:19

/**
* This is needed until real pagination is implemented just to capture limit / page-size without any
* particular start position TODO: Remove this type
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire type can get removed once the pageTokenBuilder implementations are updated to return a more useful PageToken implementation.

It's needed for now because we are updating interfaces to remove int limit in lieu of a PageToken, but ReadEverythingPageToken (which will stay for testing and for calls that want no pagination) doesn't support a limit.

* left off. If the client provides a `pageToken` or `pageSize` but `next-page-token` is null in the
* response, that means there is no more data to read.
*/
public abstract class PageToken {
Copy link
Contributor Author

@eric-maynard eric-maynard May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviving the discussion on the name below...

It's currently called PageToken because it encapsulates the data in page-token sent by the client. It also encapsulates page-size and so there has been some objection to the naming.

At its core, this PageToken is used to describe how persistence should build a page of data. It is, by itself, not a sufficient "page request" in that it doesn't contain information like the type of entity to search for. But it is used in conjunction to with a request to filter results down.

Because the Iceberg spec is vague about the page-token implementation -- calling it "opaque" -- I feel that a page token which contains information about the size of the page is still consistent with the spec. page-size is still respected when it's present in a request alongside page-token.

To be clear, it's not just the Java type that contains the page size. The string page-token itself contains the page size information and the Java type is reflecting that.

@eric-maynard eric-maynard requested a review from pingtimeout as a code owner May 7, 2025 17:36
Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early feedback... I'll read/comment more presently

Comment on lines 74 to 78
if (tokenString == null) {
throw new IllegalArgumentException("Cannot build page token from null string");
} else if (tokenString.isEmpty()) {
if (this instanceof ReadEverythingPageToken.ReadEverythingPageTokenBuilder) {
return ReadEverythingPageToken.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is null that much different from an empty string?

Since the intention is to convey both the "page pointer" and the "page size" in PageToken, why have have a from(@Nullable String token, int pageSize) method?

Copy link
Contributor Author

@eric-maynard eric-maynard May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question, let me briefly overview how it's intended to behave.

The user may provide a token, a page-size, both, or neither.

  1. If they provide neither, there is no pagination.
  2. If they provide just a page size, we start a new listing operation with the given page size
  3. If they provide just a page token, we start from the location specified in the token, with the page size specified in the token.
  4. If they provide both, we start from the location specified in the token, with the page size specified in the request.

So this method is useful for situations (2) and (4). Consider, too, the case where the user does (4), but then follows up with (3) -- I think they would expect the same page size to continue to apply in the second call, which is why we stick the page size in the token itself.

Also, good point re: null

Copy link
Contributor

@dimas-b dimas-b May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1 I'd say we should use some server default page size for backends that support pagination. It may be large, but not unlimited.

For backends that do not support pagination, unlimited result set is ok.

Copy link
Contributor

@dimas-b dimas-b May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a more general perspective, I believe the server cannot be expected to always comply with the client's page size.

  1. To safeguard against DOS attacks
  2. For technical reasons:
    • A persistence query may return less data than expected. It might be preferable to finish the request with a smaller page and let the client re-request, than keep polling the backend.
    • A communication protocol may have restrictions based on the message size, not the number of items encoded in the message.

All in all, I think clients should not expect responses to have exactly as many entries as requested even when the result set is not exhausted. Polaris may be able to satisfy that expectation in most cases, but we should not make it a guaranteed feature.

Copy link
Contributor Author

@eric-maynard eric-maynard May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly governed by the Iceberg spec and our goal should be to comply with that.

What you wrote is true, that the server may return more or less records than the requested page size (whether embedded in the token or explicitly requested by the client). But we still need a way for the client to request a given page size as per the spec. The four scenarios outlined above are still valid and are still supported by the APIs defined here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specs are not infallible :)

Quoting from the REST API spec

Servers that support pagination must return all results in a single response with the value
of `next-page-token` set to `null` if the query parameter `pageToken` is not set in the
request.

IMHO, this requirement is a recipe for disaster if, for example, a catalog has millions of entries.

I would not want to force Polaris into following this statement by the letter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but now I did notice that the REST API spec indicates that the presence of an empty PageToken should be interpreted as the client's intention to use pagination.

That statement in itself is fine. Given my comment above, I think it might be worth interpreting null and empty page tokens differently. I'd still suggest to use a possibly large, but limited page size for the null token. We could use a smaller default page size for the empty token.

Copy link
Contributor Author

@eric-maynard eric-maynard May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For at least the first iteration, I want to take a hard line on complying with the spec. Per the quote you cited that means that if there's no page token, we return everything.

Remember that today we just return all results no matter what the user asks for. I think it's okay to keep returning all results if the user asks for it, but at least now we will give users the option to ask for something else. If nothing else it's an incremental improvement.

wrt. null vs empty string, I'm a bit paranoid about us accidentally paginating results when the user doesn't expect it. So there too I think it's better to err on the side of the current (and still spec-compliant) behavior of just returning everything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point about minimizing disturbance to existing Polaris behaviour in this PR.

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments. I'm done with my first review.

@eric-maynard
Copy link
Contributor Author

Thanks @dimas-b for the thorough review! Based on your advice above I've refactored things quite a bit and seemingly simplified the interfaces greatly. Several types are gone and the way that we build PageTokens is much simpler now. Let me know what you think

.map(ModelEntity::toEntity)
.filter(entityFilter);

if (pageToken instanceof HasPageSize hasPageSize) {
Copy link
Contributor

@dimas-b dimas-b May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is functionally correct. If the token is a sub-class of HasPageSize but has other information that the client provided, but EclipseLink does not recognise, I do not think it would be correct to just use the page size from the token.

I believe we should check for a specific type here and fail on all types, whose details cannot be fully processed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct for the current code, where HasPageSize is just used as a limit. When page tokens are implemented this logic will definitely change!

Comment on lines +185 to +186
return CatalogHandlers.listNamespaces(
namespaceCatalog, parent, pageToken, String.valueOf(pageSize));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can do better in Polaris by using the REST API of baseCatalog directly.

This can be deferred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be a REST catalog :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way, I think we have to delegate pagination to the federated catalog if it supports pagination. CatalogHandlers.listNamespaces has a trivial implementation that is sub-optimal when the catalog is a REST Catalog.

Copy link
Contributor Author

@eric-maynard eric-maynard May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, you're right. In most cases we should stop using CatalogHandlers methods as a fallback during federation or at least add our own logic, I think. Probably we can do this as a followup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup SGTM

Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thanks, @eric-maynard !

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 8, 2025
@eric-maynard eric-maynard merged commit 44064cb into apache:main May 9, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board May 9, 2025
snazy added a commit to snazy/polaris that referenced this pull request May 22, 2025
* Doc: Add release guide on the website (apache#1539)

* main: Update actions/stale digest to f78de97 (apache#1547)

* main: Update dependency boto3 to v1.38.12 (apache#1548)

* main: Update postgres Docker tag to v17.5 (apache#1549)

* main: Update dependency com.adobe.testing:s3mock-testcontainers to v4.2.0 (apache#1551)

* main: Update dependency com.nimbusds:nimbus-jose-jwt to v10.3 (apache#1552)

* Interface changes for pagination (apache#1528)

* add missing apis

* more tests, fixes

* clean up drop

* autolint

* changes per review

* revert iceberg messages to comply with oss tests

* another revert

* more iceberg catalog changes

* autolint

* dependency issues

* more wiring

* continuing rebase

* remaining issues are related to task loading

* re-add tests

* debugging

* fix failing tests

* fix another test

* changes per review

* autolint

* some fixes

* stable

* updates for new persistence

* fix

* continuing work

* more reverts

* continue reverts

* more reverts

* yank tests

* autolint

* test reverts

* try to support limit without real page tokens

* autolint

* Stable

* change comment

* autolint

* remove catalog config for now

* changes per review

* more tweaks

* simplify types per review

* Stable, about to refactor more

* re-stable

* polish

* autolint

* more changes per review

* stable

* Introduce reserved-properties setting; reserve "polaris." by default (apache#1417)

* initial commit

* initial commit

* try to test

* quarkus fixes

* fix a bunch of callsites

* Start applying changes

* autolint

* chase todos

* autolint

* bugfix

* stable

* add one test

* stable with more tests

* autolint

* more tests

* autolint

* stable tests

* clean up

* oops

* stabilize on main

* autolint

* more changes per review

* Add cleanup support for partition-level statistics files when `DROP TABLE PURGE` (apache#1508)

* cleaning up partition stats

* update partition stat file extension

* update test partition stat write impl

* Implement federation to HadoopCatalog (apache#1466)

* wip

* quarkus fixes

* autolint

* hadoop impl

* autolint

* Refactors

* refactored

* autolint

* add config

* autolint

* stable

* Remove breakpoint anchor

* add line to application.properties

* yank HADOOP

* autolint

* Spark: Use builder for CreateGenericTableRequest instead of constructor for easier API spec update (apache#1546)

* main: Update docker.io/jaegertracing/all-in-one Docker tag to v1.69.0 (apache#1559)

* main: Update dependency io.opentelemetry:opentelemetry-bom to v1.50.0 (apache#1558)

* main: Update dependency software.amazon.awssdk:bom to v2.31.40 (apache#1567)

* main: Update dependency boto3 to v1.38.13 (apache#1556)

* Include DISCLAIMER in binary distributions (apache#1568)

* NoSQL: merge/rebase-fixes

* NoSQL: bump dependencies

* fix markdown

---------

Co-authored-by: JB Onofré <jbonofre@apache.org>
Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com>
Co-authored-by: danielhumanmod <danieltu.life@gmail.com>
Co-authored-by: gh-yzou <167037035+gh-yzou@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants