[25.2] Iceberg - AWS Glue #1208
ab721c0 to
737d011
Compare
737d011 to
d833f11
Compare
wdberkeley
left a comment
LGTM. My only nit is that the base location is written as s3://<bucket path> in the example config but it can be more than just the name of a bucket and include a path. You mention that just below though.
d7c3f11 to
b387538
Compare
* If you want to configure authentication to AWS Glue separately from authentication to S3, there are equivalent credential configuration properties named `iceberg_rest_catalog_*` that override the object storage credentials. These properties only apply to REST catalog authentication, and never to S3 authentication.
** `iceberg_rest_catalog_aws_access_key` overrides `cloud_storage_access_key`
** `iceberg_rest_catalog_aws_secret_key` overrides `cloud_storage_secret_key`
** `iceberg_rest_catalog_aws_region` overrides `cloud_storage_region`
@Lazin here we have a separate region configuration than the main one used by TS. So it should be easy enough to do this for RRR topics as well I guess.
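The override rule in the diff above can be sketched as a small lookup. This is only an illustration: `effective_glue_credentials` is a hypothetical helper, not Redpanda code, and it assumes that an unset `iceberg_rest_catalog_aws_*` property falls back to the matching `cloud_storage_*` value.

```python
# Pairs taken from the diff above:
# (short name, REST catalog property, object storage property it overrides).
OVERRIDES = [
    ("access_key", "iceberg_rest_catalog_aws_access_key", "cloud_storage_access_key"),
    ("secret_key", "iceberg_rest_catalog_aws_secret_key", "cloud_storage_secret_key"),
    ("region", "iceberg_rest_catalog_aws_region", "cloud_storage_region"),
]


def effective_glue_credentials(config: dict) -> dict:
    """Resolve the values used for REST catalog (Glue) authentication.

    Hypothetical helper. Assumption: when an override is unset, the
    object storage value is used instead.
    """
    resolved = {}
    for name, rest_key, storage_key in OVERRIDES:
        value = config.get(rest_key)
        if value is None:
            value = config.get(storage_key)
        resolved[name] = value
    return resolved


if __name__ == "__main__":
    cluster_config = {
        "cloud_storage_access_key": "AK",
        "cloud_storage_secret_key": "SK",
        "cloud_storage_region": "us-east-1",
        "iceberg_rest_catalog_aws_region": "us-east-2",
    }
    print(effective_glue_credentials(cluster_config))
```

The lookup runs in one direction only, which matches the statement that these properties never affect S3 authentication.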
iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://glue.<aws-region>.amazonaws.com/iceberg
iceberg_rest_catalog_authentication_mode: aws_sigv4
iceberg_rest_catalog_base_location: 's3://<bucket-name>'
This is not a good example of a base location. It would be a horrible mistake for a user to set this at the bucket level, because it could overwrite all of tiered storage if Glue itself had write permissions on the entire bucket. We need:
- a proper base location example here, NOT at the bucket root!
- guidance for configuring Glue's own access to just the Iceberg Parquet data and catalog folders within our TS bucket, if possible
cc @wdberkeley
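A minimal sketch of the check this comment asks for: reject a base location that points at the bucket root. `check_base_location` is a hypothetical name, and the rule (at least one path segment after the bucket) is an assumption drawn from this thread, not documented Redpanda behavior.

```python
from urllib.parse import urlparse


def check_base_location(location: str) -> str:
    """Return a normalized s3:// base location, or raise if it is unsafe.

    Hypothetical validator: it only enforces that the location names a
    prefix inside the bucket rather than the bucket root.
    """
    parsed = urlparse(location)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3://<bucket>/<prefix> URI, got {location!r}")
    prefix = parsed.path.strip("/")
    if not prefix:
        raise ValueError(
            f"{location!r} is the bucket root; use a dedicated prefix "
            "so Iceberg data stays separate from tiered storage data"
        )
    return f"s3://{parsed.netloc}/{prefix}"


if __name__ == "__main__":
    print(check_base_location("s3://<bucket-name>/iceberg/"))
    try:
        check_base_location("s3://<bucket-name>")
    except ValueError as err:
        print("rejected:", err)
```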
@mattschumpert @wdberkeley can you choose any subpath/name for iceberg_rest_catalog_base_location? For example, if I change my config value to s3://kbatuigas-iceberg-glue-test/iceberg, this is where the Parquet files get written to:
The TS segments are on the same level as iceberg. Is there any sort of naming convention we should recommend? cc @rpdevmp
Also, regarding Glue access, I'm not sure how standard this is but my policy looks like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:PutObjectTagging",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::kbatuigas-iceberg-glue-test/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::kbatuigas-iceberg-glue-test"
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:*"
      ],
      "Resource": [
        "arn:aws:glue:us-east-2:992382373299:catalog",
        "arn:aws:glue:us-east-2:992382373299:database/*",
        "arn:aws:glue:us-east-2:992382373299:table/*/*"
      ]
    }
  ]
}
The Glue permissions only act on what appears to be Glue-specific resources (based on the ARN), and I'm not sure if this also means Glue has write permissions to the entire bucket. Would like someone to provide guidance on whether this should be more specific or restrictive. cc @rpdevmp
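One possible way to make the S3 statements more restrictive is to scope them to the Iceberg prefix instead of the whole bucket. The sketch below is not reviewed guidance: `scoped_s3_statements` is a hypothetical helper, the `iceberg` prefix comes from the example earlier in this comment, and it assumes the principal is used only for Iceberg data. Credentials that are shared with tiered storage would still need access to the rest of the bucket.

```python
import json


def scoped_s3_statements(bucket: str, prefix: str) -> list:
    """Build S3 policy statements limited to one prefix in a bucket.

    Hypothetical helper. Object actions are restricted through the
    resource ARN, and ListBucket through the s3:prefix condition key.
    """
    prefix = prefix.strip("/")
    if not prefix:
        raise ValueError("prefix must not be empty (that would be the bucket root)")
    return [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:GetObject",
                "s3:DeleteObject",
            ],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
        },
    ]


if __name__ == "__main__":
    statements = scoped_s3_statements("kbatuigas-iceberg-glue-test", "iceberg")
    print(json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=2))
```

This narrows only the S3 side of the policy. It does not settle whether Glue needs its own role for S3 access, which is the open question in this thread.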
What did we document for the other catalogs? IIUC Redpanda has a default, and with Glue you're forced to explicitly specify it.
I would guess @andrwng can also chime in here on what we recommend setting it to (maybe the binary default) and whether it impacts only the Iceberg metadata files or the Parquet files (I think both).
@mattschumpert We don't provide specific guidance on iceberg_catalog_base_location for the other catalogs; afaiu there isn't a need to modify the default (redpanda-iceberg-catalog) for that property. iceberg_rest_catalog_base_location is the one you have to explicitly set with Glue (default is null/unset).
I've added some tweaks to this section: https://deploy-preview-1208--redpanda-docs-preview.netlify.app/25.2/manage/iceberg/iceberg-topics-aws-glue/#authorize-access-to-aws-glue but still not sure about specific guidance for Glue accessing Iceberg topics. In addition to allowing Redpanda access to Glue as described, it sounds like there should be an additional role/policy for Glue on S3? I think this is the relevant AWS doc: https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html
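The requirement described above (Glue needs `iceberg_rest_catalog_base_location` set explicitly because its default is unset) can be sketched as a pre-flight check. `missing_glue_settings` is a hypothetical helper, and detecting Glue by the endpoint pattern from the example config is an assumption made for illustration.

```python
def missing_glue_settings(config: dict) -> list:
    """List problems with a REST catalog config that targets AWS Glue.

    Hypothetical helper. Assumption: an endpoint of the form
    https://glue.<aws-region>.amazonaws.com/iceberg means Glue is in use.
    """
    endpoint = config.get("iceberg_rest_catalog_endpoint") or ""
    is_glue = endpoint.startswith("https://glue.") and endpoint.endswith(
        ".amazonaws.com/iceberg"
    )
    problems = []
    if is_glue and not config.get("iceberg_rest_catalog_base_location"):
        problems.append(
            "iceberg_rest_catalog_base_location must be set explicitly "
            "for AWS Glue (its default is unset)"
        )
    return problems


if __name__ == "__main__":
    config = {
        "iceberg_catalog_type": "rest",
        "iceberg_rest_catalog_endpoint": "https://glue.us-east-2.amazonaws.com/iceberg",
        "iceberg_rest_catalog_authentication_mode": "aws_sigv4",
    }
    for problem in missing_glue_settings(config):
        print(problem)
```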
I think a reasonable recommendation is to say to have it be s3:///, though cc @wdberkeley for a spot check that that's reasonable. Glue is the only catalog that explicitly requires iceberg_rest_catalog_base_location to be set -- I think under the hood other catalogs have a default location that gets used, or base it off the warehouse. But Glue doesn't and spits an error without a base location supplied to the table IIUC.
@andrwng huh? s3:/// what would that do? See above. If it's set at the bucket root, that's really dangerous with respect to TS data. Does this mean Glue can overwrite things at the bucket root? Don't we want it in an actual canonical catalog location to avoid interfering with TS? We need to think this through carefully and be very prescriptive here.
Ah sorry, github ate my formating. I meant s3://{bucket}/{warehouse}
Are you saying this one doesn't actually get used? Are we sure?
Right, right. OK. And with Glue, does the warehouse need to be created first, and do we need to match it (like Databricks), @wdberkeley?
cc @kbatuigas
iceberg_catalog_type=rest \
iceberg_rest_catalog_endpoint=https://glue.<aws-region>.amazonaws.com/iceberg \
iceberg_rest_catalog_authentication_mode=aws_sigv4 \
iceberg_rest_catalog_base_location='s3://<bucket-name>'
Same problem as above here. Root bucket as the base location is wrong/dangerous
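A sketch of what a corrected example could look like, following the `s3://{bucket}/{warehouse}` shape suggested earlier in this thread. `glue_catalog_properties` and `as_key_value_lines` are hypothetical helpers. The property names and fixed values are the ones in the diff, while the region, bucket, and prefix are placeholders supplied by the caller.

```python
def glue_catalog_properties(region: str, bucket: str, prefix: str) -> dict:
    """Assemble the four cluster properties from the diff for AWS Glue.

    Hypothetical helper: it refuses an empty prefix so the base location
    can never be the bucket root.
    """
    prefix = prefix.strip("/")
    if not prefix:
        raise ValueError("prefix must not be empty; do not use the bucket root")
    return {
        "iceberg_catalog_type": "rest",
        "iceberg_rest_catalog_endpoint": f"https://glue.{region}.amazonaws.com/iceberg",
        "iceberg_rest_catalog_authentication_mode": "aws_sigv4",
        "iceberg_rest_catalog_base_location": f"s3://{bucket}/{prefix}",
    }


def as_key_value_lines(properties: dict) -> list:
    """Render properties in the key=value form used in the snippet above."""
    return [f"{key}={value}" for key, value in properties.items()]


if __name__ == "__main__":
    props = glue_catalog_properties("us-east-2", "<bucket-name>", "iceberg")
    print(" \\\n".join(as_key_value_lines(props)))
```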
=== Iceberg integration

* `iceberg_rest_catalog_base_location`: Specifies the base location for the Iceberg REST catalog. Required for AWS Glue Data Catalog.
Can you link this to the property in Cluster properties?
This property will become available when #1238 is merged.
Feediver1
left a comment
See comments throughout.
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
…ic catalog resources
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
Description
Related: redpanda-data/cloud-docs#363 to add docs to Cloud
This pull request introduces documentation updates to support querying Redpanda topics as Iceberg tables using AWS Glue. It includes a new guide detailing prerequisites, configuration steps, and querying methods, as well as a file rename and updates to reflect schema integration with Iceberg topics.
New AWS Glue Integration Guide:
modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc: Added a comprehensive guide for integrating Redpanda topics with AWS Glue. This includes prerequisites, limitations (e.g., nested partition spec support), IAM policy configuration, cluster and topic setup, and querying Iceberg tables via Glue and Athena.
File Rename and Schema Integration:
modules/manage/pages/iceberg/specify-iceberg-schema.adoc (renamed from choose-iceberg-mode.adoc): Updated the file title and description to focus on specifying Iceberg schemas and integrating them with topics, reflecting broader schema-related content.
Resolves https://redpandadata.atlassian.net/browse/
Review deadline: 23 July
Page previews
Query Iceberg Topics using AWS Glue
What's New