-
Notifications
You must be signed in to change notification settings - Fork 250
Interface changes for pagination #1528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…isGenericTableCatalog-2
|
||
/** | ||
* This is needed until real pagination is implemented just to capture limit / page-size without any | ||
* particular start position TODO: Remove this type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire type can get removed once the pageTokenBuilder
implementations are updated to return a more useful PageToken
implementation.
It's needed for now because we are updating interfaces to remove int limit
in lieu of a PageToken
, but ReadEverythingPageToken
(which will stay for testing and for calls that want no pagination) doesn't support a limit.
* left off. If the client provides a `pageToken` or `pageSize` but `next-page-token` is null in the | ||
* response, that means there is no more data to read. | ||
*/ | ||
public abstract class PageToken { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviving the discussion on the name below...
It's currently called PageToken
because it encapsulates the data in page-token
sent by the client. It also encapsulates page-size
and so there has been some objection to the naming.
At its core, this PageToken
is used to describe how persistence should build a page of data. It is, by itself, not a sufficient "page request" in that it doesn't contain information like the type of entity to search for. But it is used in conjunction to with a request to filter results down.
Because the Iceberg spec is vague about the page-token
implementation -- calling it "opaque" -- I feel that a page token which contains information about the size of the page is still consistent with the spec. page-size
is still respected when it's present in a request alongside page-token
.
To be clear, it's not just the Java type that contains the page size. The string page-token
itself contains the page size information and the Java type is reflecting that.
...e/polaris/extension/persistence/impl/eclipselink/PolarisEclipseLinkMetaStoreSessionImpl.java
Outdated
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/persistence/dao/entity/EntitiesResult.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some early feedback... I'll read/comment more presently
...re/src/main/java/org/apache/polaris/core/persistence/pagination/ReadEverythingPageToken.java
Outdated
Show resolved
Hide resolved
if (tokenString == null) { | ||
throw new IllegalArgumentException("Cannot build page token from null string"); | ||
} else if (tokenString.isEmpty()) { | ||
if (this instanceof ReadEverythingPageToken.ReadEverythingPageTokenBuilder) { | ||
return ReadEverythingPageToken.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is null
that much different from an empty string?
Since the intention is to convey both the "page pointer" and the "page size" in PageToken
, why have have a from(@Nullable String token, int pageSize)
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good question, let me briefly overview how it's intended to behave.
The user may provide a token, a page-size, both, or neither.
- If they provide neither, there is no pagination.
- If they provide just a page size, we start a new listing operation with the given page size
- If they provide just a page token, we start from the location specified in the token, with the page size specified in the token.
- If they provide both, we start from the location specified in the token, with the page size specified in the request.
So this method is useful for situations (2) and (4). Consider, too, the case where the user does (4), but then follows up with (3) -- I think they would expect the same page size to continue to apply in the second call, which is why we stick the page size in the token itself.
Also, good point re: null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 1 I'd say we should use some server default page size for backends that support pagination. It may be large, but not unlimited.
For backends that do not support pagination, unlimited result set is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a more general perspective, I believe the server cannot be expected to always comply with the client's page size.
- To safeguard against DOS attacks
- For technical reasons:
- A persistence query may return less data than expected. It might be preferable to finish the request with a smaller page and let the client re-request, than keep polling the backend.
- A communication protocol may have restrictions based on the message size, not the number of items encoded in the message.
All in all, I think clients should not expect responses to have exactly as many entries as requested even when the result set is not exhausted. Polaris may be able to satisfy that expectation in most cases, but we should not make it a guaranteed feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly governed by the Iceberg spec and our goal should be to comply with that.
What you wrote is true, that the server may return more or less records than the requested page size (whether embedded in the token or explicitly requested by the client). But we still need a way for the client to request a given page size as per the spec. The four scenarios outlined above are still valid and are still supported by the APIs defined here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specs are not infallible :)
Quoting from the REST API spec
Servers that support pagination must return all results in a single response with the value
of `next-page-token` set to `null` if the query parameter `pageToken` is not set in the
request.
IMHO, this requirement is a recipe for disaster if, for example, a catalog has millions of entries.
I would not want to force Polaris into following this statement by the letter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but now I did notice that the REST API spec indicates that the presence of an empty PageToken
should be interpreted as the client's intention to use pagination.
That statement in itself is fine. Given my comment above, I think it might be worth interpreting null
and empty page tokens differently. I'd still suggest to use a possibly large, but limited page size for the null
token. We could use a smaller default page size for the empty token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For at least the first iteration, I want to take a hard line on complying with the spec. Per the quote you cited that means that if there's no page token, we return everything.
Remember that today we just return all results no matter what the user asks for. I think it's okay to keep returning all results if the user asks for it, but at least now we will give users the option to ask for something else. If nothing else it's an incremental improvement.
wrt. null vs empty string, I'm a bit paranoid about us accidentally paginating results when the user doesn't expect it. So there too I think it's better to err on the side of the current (and still spec-compliant) behavior of just returning everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point about minimizing disturbance to existing Polaris behaviour in this PR.
polaris-core/src/main/java/org/apache/polaris/core/persistence/pagination/PageToken.java
Outdated
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/persistence/BasePersistence.java
Outdated
Show resolved
Hide resolved
...re/src/main/java/org/apache/polaris/core/persistence/pagination/ReadEverythingPageToken.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments. I'm done with my first review.
polaris-core/src/main/java/org/apache/polaris/core/persistence/pagination/PageToken.java
Outdated
Show resolved
Hide resolved
...e/common/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalogHandler.java
Outdated
Show resolved
Hide resolved
...mmon/src/main/java/org/apache/polaris/service/types/ListNamespacesResponseWithPageToken.java
Outdated
Show resolved
Hide resolved
Thanks @dimas-b for the thorough review! Based on your advice above I've refactored things quite a bit and seemingly simplified the interfaces greatly. Several types are gone and the way that we build |
.map(ModelEntity::toEntity) | ||
.filter(entityFilter); | ||
|
||
if (pageToken instanceof HasPageSize hasPageSize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is functionally correct. If the token is a sub-class of HasPageSize
but has other information that the client provided, but EclipseLink does not recognise, I do not think it would be correct to just use the page size from the token.
I believe we should check for a specific type here and fail on all types, whose details cannot be fully processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct for the current code, where HasPageSize
is just used as a limit. When page tokens are implemented this logic will definitely change!
polaris-core/src/main/java/org/apache/polaris/core/persistence/pagination/PageToken.java
Outdated
Show resolved
Hide resolved
...ore/src/main/java/org/apache/polaris/core/persistence/pagination/ReadFromStartPageToken.java
Outdated
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/persistence/pagination/PageToken.java
Outdated
Show resolved
Hide resolved
...ore/src/main/java/org/apache/polaris/core/persistence/pagination/ReadFromStartPageToken.java
Outdated
Show resolved
Hide resolved
return CatalogHandlers.listNamespaces( | ||
namespaceCatalog, parent, pageToken, String.valueOf(pageSize)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we can do better in Polaris by using the REST API of baseCatalog
directly.
This can be deferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not be a REST catalog :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way, I think we have to delegate pagination to the federated catalog if it supports pagination. CatalogHandlers.listNamespaces
has a trivial implementation that is sub-optimal when the catalog is a REST Catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, you're right. In most cases we should stop using CatalogHandlers
methods as a fallback during federation or at least add our own logic, I think. Probably we can do this as a followup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followup SGTM
polaris-core/src/main/java/org/apache/polaris/core/persistence/pagination/DonePageToken.java
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/persistence/pagination/DonePageToken.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍 Thanks, @eric-maynard !
* Doc: Add release guide on the website (apache#1539) * main: Update actions/stale digest to f78de97 (apache#1547) * main: Update dependency boto3 to v1.38.12 (apache#1548) * main: Update postgres Docker tag to v17.5 (apache#1549) * main: Update dependency com.adobe.testing:s3mock-testcontainers to v4.2.0 (apache#1551) * main: Update dependency com.nimbusds:nimbus-jose-jwt to v10.3 (apache#1552) * Interface changes for pagination (apache#1528) * add missing apis * more tests, fixes * clean up drop * autolint * changes per review * revert iceberg messages to comply with oss tests * another revert * more iceberg catalog changes * autolint * dependency issues * more wiring * continuing rebase * remaining issues are related to task loading * re-add tests * debugging * fix failing tests * fix another test * changes per review * autolint * some fixes * stable * updates for new persistence * fix * continuing work * more reverts * continue reverts * more reverts * yank tests * autolint * test reverts * try to support limit without real page tokens * autolint * Stable * change comment * autolint * remove catalog config for now * changes per review * more tweaks * simplify types per review * Stable, about to refactor more * re-stable * polish * autolint * more changes per review * stable * Introduce reserved-properties setting; reserve "polaris." by default (apache#1417) * initial commit * initial commit * try to test * quarkus fixes * fix a bunch of callsites * Start applying changes * autolint * chase todos * autolint * bugfix * stable * add one test * stable with more tests * autolint * more tests * autolint * stable tests * clean up * oops * stabilize on main * autolint * more changes per review * Add cleanup support for partition-level statistics files when `DROP TABLE PURGE` (apache#1508) * cleaning up partition stats * update partition stat file extension * update test partition stat write impl * Implement federation to HadoopCatalog (apache#1466) * wip * quarkus fixes * autolint * hadoop impl * autolint * Refactors * refactored * autolint * add config * autolint * stable * Remove breakpoint anchor * add line to application.properties * yank HADOOP * autolint * Spark: Use builder for CreateGenericTableRequest instead of constructor for easier API spec update (apache#1546) * main: Update docker.io/jaegertracing/all-in-one Docker tag to v1.69.0 (apache#1559) * main: Update dependency io.opentelemetry:opentelemetry-bom to v1.50.0 (apache#1558) * main: Update dependency software.amazon.awssdk:bom to v2.31.40 (apache#1567) * main: Update dependency boto3 to v1.38.13 (apache#1556) * Include DISCLAIMER in binary distributions (apache#1568) * NoSQL: merge/rebase-fixes * NoSQL: bump dependencies * fix markdown --------- Co-authored-by: JB Onofré <jbonofre@apache.org> Co-authored-by: Mend Renovate <bot@renovateapp.com> Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com> Co-authored-by: danielhumanmod <danieltu.life@gmail.com> Co-authored-by: gh-yzou <167037035+gh-yzou@users.noreply.github.com>
The Apache Iceberg REST catalog spec describes how APIs like
listNamespaces
can implement pagination, but Polaris currently does not implement pagination.This PR introduces a framework that will allow us to enable pagination for these APIs and that can also be extended to other APIs like
listCatalogs
.This is a modified version of #273 that removes the actual pagination implementation at the persistence level.