Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rest Catalog: Add RESTful AppendFiles data operation #9292

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

geruh
Copy link
Contributor

@geruh geruh commented Dec 13, 2023

Hi All,

I have created this smaller PR focused more on the AppendFiles operations. These changes are stemming from this PR: #9237.

@jackye1995 @amogh-jahagirdar @rdblue

@geruh geruh changed the title Rest append files Rest Catalog: Add RESTful AppendFiles data operation Dec 13, 2023
action:
type: string
enum: [ "append-files" ]
appended-manifests:
Copy link
Contributor

@jackye1995 jackye1995 Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we did not really finalize in the discussion threads if this is a good idea. I think it's an easy integration with existing JVM-based engines since they already know how to produce manifests, but if our goal is to make it easy for other non-JVM engines to integrate, seems like this should be a list of DataFiles, or data file locations.

Copy link
Contributor Author

@geruh geruh Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Jack, you make a good point. The intentions behind the appended-manifest was to showcase the need for compression, and it was a middle ground between a List of DataFiles and a ManifestList. In the case of a large scale Append operation this list can become very large therefore, we can trade some overhead for these operations. However, at the end of the day this is a list of string locations. How the service appends these files is depending on it's ability to read DataFiles into their service from the Request body.

Some other potential approaches we could take are:

  • Pagination/batching
  • Compression (GZIP, Brotli)
  • File Upload of a different type other than avro.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a similar comment about what content to add below, but I think it's best to choose the lowest common denominator (ContentFile/data_file) since that would reduce the need to rewrite manifests to properly define things like the sequence number and alleviates the need for a client to know how to construct a manifest.

@geruh geruh closed this Jan 26, 2024
@geruh geruh restored the rest-append-files branch January 26, 2024 01:32
@geruh geruh reopened this Jan 26, 2024
@@ -490,4 +491,21 @@ public void applyTo(ViewMetadata.Builder viewMetadataBuilder) {
viewMetadataBuilder.setCurrentVersionId(versionId);
}
}

class AppendFilesUpdate implements MetadataUpdate {
Copy link
Contributor

@danielcweeks danielcweeks Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to rename this to AppendManifestsUpdate AppendContentFileUpdate as appending files would be more of a DataFile type object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it makes more sense to pass DataFile entries through this interface since that will actually handle both cases. There are issues with passing manifests because they will need to be rewritten in certain cases. ContentFileParser already supports a json representation, so I think we should pass the data file records through this interface directly (as opposed to just pointing to files).

Copy link
Contributor

@jackye1995 jackye1995 Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to pass DataFile entries through this interface since that will actually handle both cases

+1. I think using manifest was more for scalability considerations because there might be many data files. But that might be an over-optimization, and start with passing data file entries is a better starting point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the naming, should this be more like AddContentFileUpdate instead of AppendContentFileUpdate in this case? @danielcweeks given this can be used to add both data and delete files which are both content files, and "append delete files" does not sound like the right terminology.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to initially disallow delete files from being added because there are more concerns around achieving correct semantics. We could split this into AppendDataFileUpdate and if we later decide to support deletes, add AppendDeleteFileUpdate.

@@ -112,6 +111,7 @@ public class RESTSessionCatalog extends BaseViewSessionCatalog
private static final Logger LOG = LoggerFactory.getLogger(RESTSessionCatalog.class);
private static final String DEFAULT_FILE_IO_IMPL = "org.apache.iceberg.io.ResolvingFileIO";
private static final String REST_METRICS_REPORTING_ENABLED = "rest-metrics-reporting-enabled";
private static final String REST_DATA_COMMIT_ENABLED = "rest-data-commit-enabled";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few places we probably need to update the naming.

@jackye1995
Copy link
Contributor

Based on our offline conversations, this is a good starting point, but please add more tests so we can properly merge this PR after reviews.

Copy link

github-actions bot commented Oct 8, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants