-
Notifications
You must be signed in to change notification settings - Fork 250
Docs/getting-started: Switch to use iceberg-aws-bundle jar #1609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
That's need a Check and update on LICENSE in distribution as iceberg-aws-bundle shade dependencies. |
Okay, let me take a look tomorrow to see if they include anything extra which we may not already included. |
@jbonofre I am not seeing license refs to AWS jar as well from latest https://github.com/apache/polaris/blob/main/LICENSE. Can you confirm if this is the right place to check. Quick peak on the dependencies used by iceberg aws bundle, here are the ones:
From above, the only one that is not apache license is following:
I am not sure if this is okay. Let me know what you think. |
@MonkeyCanCode this is the license for source distribution. This one is ok. I mean the license in quarkus/server and quarkus/admin. Here we are using AWS dependencies. We have to check if already ok regarding iceberg-AWS-bundle. |
Got it. Thanks for the clarification. I will take a look tomorrow. |
@jbonofre actually, do we really need to check the Quarkus server/admin license for this change as the change is only for the client side (quickstart and notebook examples) and there is no change of dependencies used by core services? |
Good point. But if the client (including Spark plugin) bundles it we have to add license/notice/disclaimer there. It's something that I plan to check after 0.10 release (before 1.0). |
TBH, I don't understand the point of "vulnerabilities" and how the bundle-jar would solve this. We already use very recent versions of the awssdk - so the versions in the "bundle-jar" are already older. The right way IMHO would be to not use the gcp-bundle but instead use up-to-date versions of it directly. I see more issues with using the bundle-jar, especially duplicate classes on the class path, because uber-jars silently bring these duplicate classes and all the "joy" of having duplicate classes from different versions on the classpath. These "bundle-jars" are probably nice in some cases, but I do not think that those are a good fit for Polaris. |
@snazy - not sure if we're on the same page here but these bundle JARs aren't the ones from AWS/Azure/GCP but rather ones that are compiled by Apache Iceberg to solve this exact problem of duplicated classpaths. I had the same concerns when I was introducing the GCP and Azure bundle JARs when rewriting the Quickstart but found them incredibly helpful and rather solving my class path issues. |
@adnanhemani Exactly. @snazy So I ran into this issue long time back when I was trying to hook spark with iceberg and AWS jars...it took me a while to figure out all of the needed pieces and class path issues mentioned above. For an end-user who is not familiar with those version dependencies, blindly bump the version to the next one can introduce class path issues as certain needed classes got removed even between minor version. Also, keeping 2-3 AWS JARs with arbitrary versioning seems very hard to manage in the long run as oppose to use the one compiled by iceberg for the needed pieces when integrated with various cloud vendors. The bundle jar doesn't necessary solve the problem of newly discovered vulns, but an end-users may expect the next version (e.g. 1.9.0 to 1.9.1) to have those issues resolved as oppose to try to change |
Wait a sec - maybe there's some confusion on my side. You want to change only the docs - aka the packages passed on the command line to Spark et all? That's a different thing. I'm against changing the dependencies of Polaris itself. |
The PR summary and description let me assume it's changed everywhere. |
only the doc and getting start examples/notebooks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving since I don't have much context on the license issues that are also being discussed - I'm sure that the author and @jbonofre will figure this part out. All the diff LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good hint about the LICENSE/NOTICE file. I've created #1627 to track this. But I don't think it's "caused" or "regressed" by this PR.
+1
Thanks @MonkeyCanCode ! Please just validate the merge commit subject + message before hitting "merge".
Thanks @MonkeyCanCode for the change. Thanks @jbonofre @snazy @adnanhemani for the review! |
* main: Update docker.io/prom/prometheus Docker tag to v3.4.0 (apache#1602) * Site: Update production configuration page (apache#1606) * main: Update dependency com.google.cloud:google-cloud-storage-bom to v2.52.3 (apache#1623) * main: Update dependency boto3 to v1.38.19 (apache#1622) * Remove Bouncy Castle dependency usage from PemUtils (apache#1318) - Added PEM format parsing in PemUtils - Added unit test for PemUtils for empty file and multiple PEM objects - Removed Bouncy Castle Provider dependency from service common module - Removed Bouncy Castle Provider dependency from quarkus service module * Site: Add a page for policy management (apache#1600) * [Policy Store | Management Spec] Add policy privileges to spec and update admin service impl (apache#1529) This PR adds new policy related privileges to polaris-management-api.yml and update PolarisAdminService to allow granting new privileges * Spec: Add SigV4 Auth Support for Catalog Federation (apache#1506) * Spec changes for SigV4 Auth Support for Catalog Federation * Extract service identity info as a nested object * nit: fix admin tool log level and comments (apache#1626) The previous WARNING log levels seems to work, but WARN aligns better with standard Quarkus log levels. Fixes apache#1612 * Doc: switch to use iceberg-aws-bundle jar (apache#1609) * main: Update dependency org.mockito:mockito-core to v5.18.0 (apache#1630) * main: Update dependency boto3 to v1.38.20 (apache#1631) * Require explicit user-consent to enable HadoopFileIO (apache#1532) Using `HadoopFileIO` in Polaris can enable "hidden features" that users are likely not aware of. This change requires users to manually update the configuration to be able to use `HadoopFileIO` in way that highlights the consequences of enabling it. This PR updates Polaris in multiple ways: * The default of `SUPPORTED_CATALOG_STORAGE_TYPES` is changed to not include the `FILE` storage type. * Respect the `ALLOW_SPECIFYING_FILE_IO_IMPL` configuration on namespaces, tables and views to prevent setting an `io-impl` value for anything but one of the configured, supported storage-types. * Unify validation code in a new class `IcebergPropertiesValidation`. * Using `FILE` or `HadoopFileIO` now _also_ requires the explicit configuration `ALLOW_INSECURE_STORAGE_TYPES_ACCEPTING_SECURITY_RISKS=true`. * Added production readiness checks that trigger when `ALLOW_INSECURE_STORAGE_TYPES_ACCEPTING_SECURITY_RISKS` is `true` or `SUPPORTED_CATALOG_STORAGE_TYPES` contains `FILE` (defaults and per-realm). * The two new readiness checks are considered _severe_. Severe readiness-errors prevent the server from starting up - unless the user explicitly configured `polaris.readiness.ignore-security-issues=true`. Log messages and configuration options explicitly use "clear" phrases highlighting the consequences. With these changes it is intentionally extremely difficult to start Polaris with HadoopFileIO. People who work around all these safety nets must have realized that what they are doing. A lot of the test code relies on `FILE`/`HadoopFileIO`, those tests got all the configurations to let those tests continue to work as they are, bypassing the added security safeguards. --------- Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@dremio.com> --------- Co-authored-by: Mend Renovate <bot@renovateapp.com> Co-authored-by: Yufei Gu <yufei@apache.org> Co-authored-by: David Handermann <exceptionfactory@apache.org> Co-authored-by: Honah (Jonas) J. <honahx@apache.org> Co-authored-by: Rulin Xing <xjdkcsq3@gmail.com> Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@dremio.com> Co-authored-by: MonkeyCanCode <yongzheng0809@gmail.com>
We are using a combination of following 3 jars throughout this repo when connecting to AWS:
Those are great and well documented in https://iceberg.apache.org/docs/1.4.0/aws/#enabling-aws-integration. However, it can be pretty confusing in my option for people who are not familiar with these 3 jars and tried to mitigate a vulnerability by just bump the jar version (there are many vulnerabilities for those 3 jars as they are pretty old and just change to latest will break as some needed classes got removed in the newer version).
Then for the new examples, we are using
,org.apache.iceberg:iceberg-gcp-bundle:1.9.0,org.apache.iceberg:iceberg-azure-bundle:1.9.0
already, thus, I think it makes sense to switch from these 3 individual jars toorg.apache.iceberg:iceberg-aws-bundle:1.9.0
.