Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Add RESTScanReporter to send scan report to REST endpoint #5407

Merged
merged 2 commits into from
Sep 30, 2022

Conversation

nastra
Copy link
Contributor

@nastra nastra commented Aug 1, 2022

depends on #5427

@nastra nastra marked this pull request as draft August 1, 2022 17:07
@nastra nastra force-pushed the scan-reporting-rest branch 2 times, most recently from b69475b to 0bda731 Compare August 3, 2022 15:39
@danielcweeks danielcweeks self-requested a review August 3, 2022 16:00
@nastra nastra force-pushed the scan-reporting-rest branch from 0bda731 to 228e7be Compare August 5, 2022 10:15
@nastra nastra force-pushed the scan-reporting-rest branch from 8a49731 to 98a9be3 Compare August 16, 2022 18:18
@nastra nastra marked this pull request as ready for review August 16, 2022 18:19
@github-actions github-actions bot removed the API label Aug 16, 2022
@nastra nastra force-pushed the scan-reporting-rest branch 4 times, most recently from 2505013 to 2c43833 Compare August 19, 2022 06:12
@nastra nastra requested a review from rdblue August 19, 2022 15:45
"tables",
RESTUtil.encodeString(identifier.name()),
"metrics",
"scan");
Copy link
Contributor

@rdblue rdblue Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to make multiple endpoints? Why not just post to metrics with a type included in the serialized metrics report? That way there's only one metrics endpoint needed and we don't need to specify this later.

We can also make the update to the REST spec generic using anyOf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment in #5407 (comment) about the initial reasoning for a /metrics/scan endpoint (rather than just /metrics).
Obviously we could post simply to /metrics with a type but eventually you might want to also do a GET and retrieve metrics of a particular type, so using /metrics/scan seems more in-line with how REST would model post/get requests.

Example: We could have a /metrics/commit endpoint where we'd POST commit metrics to and could also perform a GET to retrieve them

I understand the reasoning for this, since we'd like to keep changes to the REST spec to a minimum. I don't have a strong opinion on this, but I'd like us to at least consider this. IMO this is a bit cleaner and aligns better with REST to have it this way. However, I'm interested in what others think about this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielcweeks, what do you think about having one metrics endpoint vs. separate endpoints for different metrics types?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this for a while and I think we should go with the generic metrics endpoint rather than specific endpoints for scan, commit, etc. That keeps the REST spec simpler. I also think that it fits with the potential for future GET requests because we could provide a listing of metric reports and filter them using query params for the .../metrics endpoint. We could also return individual reports using .../metrics/<report-id>.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good. The only concerns I have are:

  • Can we make the metrics endpoint and requests generic so that we don't have to modify this to send other metrics?
  • Do we need a request object that adds another layer to the JSON? This is minor but it seems odd to have a record that just wraps another record to make the Java types work out (i.e. ScanReportRequest is a RESTRequest).

@nastra
Copy link
Contributor Author

nastra commented Sep 5, 2022

  • Do we need a request object that adds another layer to the JSON? This is minor but it seems odd to have a record that just wraps another record to make the Java types work out (i.e. ScanReportRequest is a RESTRequest).

Not sure if there's an easier way to do it, but essentially our RESTClient requires the body to be a RESTRequest and we can't make ScanReport (part of iceberg-api) just implement RESTRequest (part of iceberg-core) unfortunately.

@nastra nastra force-pushed the scan-reporting-rest branch from 2c43833 to 1f459d2 Compare September 5, 2022 09:46
@github-actions github-actions bot added the API label Sep 5, 2022
@nastra nastra force-pushed the scan-reporting-rest branch 2 times, most recently from 24f6882 to 698d22a Compare September 5, 2022 16:46
@github-actions github-actions bot added the API label Sep 26, 2022
@nastra nastra force-pushed the scan-reporting-rest branch 2 times, most recently from 31d9d93 to 199c8ba Compare September 26, 2022 07:52
*/
package org.apache.iceberg.metrics;

public interface MetricsReport {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also expose the metrics from the underlying report. That way we only need parsers to serialize the scan- or commit-specific fields and can then delegate to a metrics parser. In addition, we can also still handle metrics from an unknown report type (minor).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might make sense, but I would probably defer this until we have a better picture of how commit reports look like

@nastra nastra force-pushed the scan-reporting-rest branch 3 times, most recently from 44f0738 to 3d66379 Compare September 27, 2022 13:05
Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, other than the serialization format for the metrics report request.

We can also break this into smaller pieces if that's going to be a blocker. All of the API refactoring looks ready to go in to me.

@nastra nastra force-pushed the scan-reporting-rest branch 3 times, most recently from f738731 to 1ee9511 Compare September 30, 2022 10:11
.addDeserializer(OAuthTokenResponse.class, new OAuthTokenResponseDeserializer())
.addSerializer(ReportMetricsRequest.class, new ReportMetricsRequestSerializer<>())
.addDeserializer(ReportMetricsRequest.class, new ReportMetricsRequestDeserializer<>())
.addSerializer(ImmutableReportMetricsRequest.class, new ReportMetricsRequestSerializer<>())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into this in a follow-up. I was expecting that registering a custom serializer/deserializer for ReportMetricsRequest would also work out-of-the-box for ImmutableReportMetricsRequest but apparently that's not the case and it requires the subclass to be registered explicitly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #5911 to address this

@nastra nastra force-pushed the scan-reporting-rest branch from 1ee9511 to 99d4539 Compare September 30, 2022 11:12
@nastra
Copy link
Contributor Author

nastra commented Sep 30, 2022

@rdblue thanks for the deep review. I've simplified/improved the things you mentioned and a few other things

@nastra nastra requested a review from rdblue September 30, 2022 11:23
() -> ReportMetricsRequestParser.fromJson("{\"report-type\":\"invalid\"}"))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage(
"No enum constant org.apache.iceberg.rest.requests.ReportMetricsRequest.ReportType.INVALID");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: we could replace this with a better error message for the caller. No enum constant doesn't really tell me that I was trying to deserialize an unsupposed report type, invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a similar pattern in a few other parsers as well. I'll follow up with that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #5910 to address this


Assertions.assertThat(ReportMetricsRequestParser.fromJson(json).report())
.usingRecursiveComparison()
.ignoringFields("projection")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? It seems strange to me that the projection doesn't match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no equals() method defined on Schema, so it will always perform reference equality check

@rdblue rdblue merged commit 133b255 into apache:master Sep 30, 2022
@rdblue
Copy link
Contributor

rdblue commented Sep 30, 2022

Nice work, @nastra! This looks ready so I merged it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants