Conversation

Contributor

@kbatuigas kbatuigas commented Jul 14, 2025

Description

Related: redpanda-data/cloud-docs#363 to add docs to Cloud

This pull request introduces documentation updates to support querying Redpanda topics as Iceberg tables using AWS Glue. It includes a new guide detailing prerequisites, configuration steps, and querying methods, as well as a file rename and updates to reflect schema integration with Iceberg topics.

New AWS Glue Integration Guide:

  • modules/manage/pages/iceberg/iceberg-topics-aws-glue.adoc: Added a comprehensive guide for integrating Redpanda topics with AWS Glue. This includes prerequisites, limitations (e.g., nested partition spec support), IAM policy configuration, cluster and topic setup, and querying Iceberg tables via Glue and Athena.

File Rename and Schema Integration:

  • modules/manage/pages/iceberg/specify-iceberg-schema.adoc (renamed from choose-iceberg-mode.adoc): Updated the file title and description to focus on specifying Iceberg schemas and integrating them with topics, reflecting broader schema-related content.

Resolves https://redpandadata.atlassian.net/browse/
Review deadline: 23 July

Page previews

Query Iceberg Topics using AWS Glue
What's New

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)


netlify bot commented Jul 14, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 76380e2
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/68894019cc78a70008aaff30
😎 Deploy Preview https://deploy-preview-1208--redpanda-docs-preview.netlify.app

Contributor

coderabbitai bot commented Jul 14, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@kbatuigas kbatuigas force-pushed the DOC-1377-document-feature-aws-glue-support-for-iceberg branch from ab721c0 to 737d011 Compare July 14, 2025 22:49
@paulohtb6 paulohtb6 force-pushed the DOC-1377-document-feature-aws-glue-support-for-iceberg branch from 737d011 to d833f11 Compare July 15, 2025 15:45
@kbatuigas kbatuigas marked this pull request as ready for review July 21, 2025 18:03
@kbatuigas kbatuigas requested a review from a team as a code owner July 21, 2025 18:03
@wdberkeley wdberkeley left a comment

LGTM. My only nit is that the base location is written as s3://<bucket path> in the example config but it can be more than just the name of a bucket and include a path. You mention that just below though.

@kbatuigas kbatuigas force-pushed the DOC-1377-document-feature-aws-glue-support-for-iceberg branch from d7c3f11 to b387538 Compare July 22, 2025 21:40
* If you want to configure authentication to AWS Glue separately from authentication to S3, there are equivalent credential configuration properties named `iceberg_rest_catalog_*` that override the object storage credentials. These properties only apply to REST catalog authentication, and never to S3 authentication.
** `iceberg_rest_catalog_aws_access_key` overrides `cloud_storage_access_key`
** `iceberg_rest_catalog_aws_secret_key` overrides `cloud_storage_secret_key`
** `iceberg_rest_catalog_aws_region` overrides `cloud_storage_region`

@Lazin here we have a region configuration that is separate from the main one used by TS, so it should be easy enough to do this for RRR topics as well, I guess.
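
For reference, the override behavior described in the documentation excerpt quoted above can be sketched as a small lookup. This is only an illustration of the stated precedence (each `iceberg_rest_catalog_aws_*` property overrides its `cloud_storage_*` counterpart for REST catalog authentication), not Redpanda's implementation; the helper name `effective_catalog_credentials` and the sample values are hypothetical.

```python
from typing import Mapping, Optional

# Property pairs taken from the doc excerpt: REST catalog property -> object storage property.
OVERRIDES = {
    "iceberg_rest_catalog_aws_access_key": "cloud_storage_access_key",
    "iceberg_rest_catalog_aws_secret_key": "cloud_storage_secret_key",
    "iceberg_rest_catalog_aws_region": "cloud_storage_region",
}


def effective_catalog_credentials(config: Mapping[str, Optional[str]]) -> dict:
    """Hypothetical helper: resolve the values used for REST catalog (Glue) authentication."""
    resolved = {}
    for catalog_key, storage_key in OVERRIDES.items():
        value = config.get(catalog_key)
        # Fall back to the object storage value only when no override is set.
        resolved[catalog_key] = value if value is not None else config.get(storage_key)
    return resolved


# Sample values are made up for illustration.
config = {
    "cloud_storage_access_key": "storage-key",
    "cloud_storage_secret_key": "storage-secret",
    "cloud_storage_region": "us-east-2",
    "iceberg_rest_catalog_aws_region": "us-west-2",
}
creds = effective_catalog_credentials(config)
```

In this sketch only the region is overridden, so the catalog would be addressed in `us-west-2` while the access key and secret key fall back to the object storage values. S3 authentication is not modeled here, since the excerpt states the overrides never apply to it.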

iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://glue.<aws-region>.amazonaws.com/iceberg
iceberg_rest_catalog_authentication_mode: aws_sigv4
iceberg_rest_catalog_base_location: 's3://<bucket-name>'

This is not a good example of a base location, and it would be a horrible mistake if the user set this at the bucket level, as it could overwrite all of tiered storage if Glue itself had write permissions on the entire bucket. We need:

  1. a proper base location example here. NOT at the bucket root!
  2. guidance for configuring Glue's own access to just the Iceberg Parquet data and catalog folders within our TS bucket, if possible
    cc @wdberkeley
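
To make the first point concrete, here is a minimal sketch of a check that rejects a base location set at the bucket root. It only illustrates the concern raised in this comment; the function name `base_location_prefix` and the example bucket and prefix are hypothetical, and Redpanda is not known to perform this validation itself.

```python
from urllib.parse import urlparse


def base_location_prefix(base_location: str) -> str:
    """Return the key prefix of an s3:// base location, or raise if it is the bucket root."""
    parsed = urlparse(base_location)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://<bucket>/<prefix>, got {base_location!r}")
    # Strip slashes so that "s3://bucket/" is treated the same as "s3://bucket".
    prefix = parsed.path.strip("/")
    if not prefix:
        raise ValueError(f"{base_location!r} points at the bucket root; use a dedicated prefix")
    return prefix


# Hypothetical values for illustration.
ok_prefix = base_location_prefix("s3://example-bucket/iceberg")

try:
    base_location_prefix("s3://example-bucket")
    root_rejected = False
except ValueError:
    root_rejected = True
```

A check like this could run in a deployment script before the cluster property is set, so that a root-level value is caught before anything is written next to the tiered storage segments.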

Contributor Author

@mattschumpert @wdberkeley can you choose any subpath/name for iceberg_rest_catalog_base_location? For example, if I change my config value to s3://kbatuigas-iceberg-glue-test/iceberg, this is where the Parquet files get written to:

image

The TS segments are on the same level as iceberg. Is there any sort of naming convention we should recommend? cc @rpdevmp

Also, regarding Glue access, I'm not sure how standard this is but my policy looks like:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::kbatuigas-iceberg-glue-test/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::kbatuigas-iceberg-glue-test"
        },
        {
            "Sid": "GlueCatalogAccess",
            "Effect": "Allow",
            "Action": [
                "glue:*"
            ],
            "Resource": [
                "arn:aws:glue:us-east-2:992382373299:catalog",
                "arn:aws:glue:us-east-2:992382373299:database/*",
                "arn:aws:glue:us-east-2:992382373299:table/*/*"
            ]
        }
    ]
}

The Glue permissions only act on what appear to be Glue-specific resources (based on the ARNs), and I'm not sure whether this also means Glue has write permissions to the entire bucket. I would like someone to provide guidance on whether this should be more specific or restrictive. cc @rpdevmp
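
As a starting point for that discussion, the following sketch generates S3 statements limited to a single prefix instead of the whole bucket. It is an assumption-laden illustration, not reviewed guidance: the helper name, bucket, and prefix are hypothetical, and it would only suit a principal that needs the Iceberg data alone. Credentials that Redpanda also uses for Tiered Storage still need access to the TS objects, so this narrowing should not be applied to them as-is.

```python
import json


def prefix_scoped_s3_statements(bucket: str, prefix: str) -> list:
    """Hypothetical helper: S3 statements restricted to one key prefix of a bucket."""
    prefix = prefix.strip("/")
    if not prefix:
        raise ValueError("prefix must not be empty")
    return [
        {
            # Object-level actions apply only to keys under the prefix.
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:PutObjectTagging", "s3:GetObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        },
        {
            # ListBucket is a bucket-level action, so it is narrowed with the s3:prefix condition key.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
        },
    ]


# Hypothetical bucket and prefix.
statements = prefix_scoped_s3_statements("example-bucket", "iceberg")
print(json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=4))
```

The Glue statement from the policy above would be appended unchanged. Whether Glue itself needs a separate role with S3 access is the open question in this thread and is not answered by this sketch.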

What did we document for the other catalogs? IIUC Redpanda has a default, and with Glue you're forced to explicitly specify it.

I would guess @andrwng can also chime in here on what we recommend setting it to (maybe the binary default) and whether it impacts only the Iceberg metadata files or also the Parquet files (I think both).

Contributor Author

@kbatuigas kbatuigas Jul 25, 2025

@mattschumpert We don't provide specific guidance on iceberg_catalog_base_location for the other catalogs; afaiu there isn't a need to modify the default (redpanda-iceberg-catalog) for that property. iceberg_rest_catalog_base_location is the one you have to explicitly set with Glue (default is null/unset).

I've added some tweaks to this section: https://deploy-preview-1208--redpanda-docs-preview.netlify.app/25.2/manage/iceberg/iceberg-topics-aws-glue/#authorize-access-to-aws-glue but still not sure about specific guidance for Glue accessing Iceberg topics. In addition to allowing Redpanda access to Glue as described, it sounds like there should be an additional role/policy for Glue on S3? I think this is the relevant AWS doc: https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html

I think a reasonable recommendation is to say to have it be s3:///, though cc @wdberkeley for a spot check that that's reasonable. Glue is the only catalog that explicitly requires iceberg_rest_catalog_base_location to be set -- I think under the hood other catalogs have a default location that gets used, or base it off the warehouse. But Glue doesn't and spits an error without a base location supplied to the table IIUC.

@andrwng huh? s3:/// what would that do? See above. If it's set at the bucket root, that's really dangerous with respect to TS data. Does this mean Glue can overwrite things at the bucket root? Don't we want it in an actual canonical catalog location to avoid interfering with TS? We need to think this through carefully and be very prescriptive here.

Ah sorry, GitHub ate my formatting. I meant s3://{bucket}/{warehouse}

Are you saying this one doesn't actually get used? are we sure?

Right, right, OK. And with Glue, does the warehouse need to be created first, and do we need to match it (like Databricks), @wdberkeley?

cc @kbatuigas

iceberg_catalog_type=rest \
iceberg_rest_catalog_endpoint=https://glue.<aws-region>.amazonaws.com/iceberg \
iceberg_rest_catalog_authentication_mode=aws_sigv4 \
iceberg_rest_catalog_base_location='s3://<bucket-name>'

Same problem as above here. Using the bucket root as the base location is wrong and dangerous.


=== Iceberg integration

* `iceberg_rest_catalog_base_location`: Specifies the base location for the Iceberg REST catalog. Required for AWS Glue Data Catalog.
Contributor

Can you link this to the property in Cluster properties?

Contributor Author

This property will become available when #1238 is merged.

Contributor

@Feediver1 Feediver1 left a comment

See comments throughout.

@kbatuigas kbatuigas merged commit 7c038df into beta Jul 29, 2025
7 checks passed
@kbatuigas kbatuigas deleted the DOC-1377-document-feature-aws-glue-support-for-iceberg branch July 29, 2025 21:49
paulohtb6 pushed a commit that referenced this pull request Jul 30, 2025
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>

8 participants