Skip to content

hdfs: add Kerberos authentication support#4419

Open
Khalid-Nowaf wants to merge 1 commit into
redpanda-data:mainfrom
Khalid-Nowaf:hdfs/upgrade-v2
Open

hdfs: add Kerberos authentication support#4419
Khalid-Nowaf wants to merge 1 commit into
redpanda-data:mainfrom
Khalid-Nowaf:hdfs/upgrade-v2

Conversation

@Khalid-Nowaf
Copy link
Copy Markdown

Adds Kerberos authentication support for the HDFS connector by upgrading to github.com/colinmarc/hdfs/v2 and introducing shared HDFS auth configuration for input and output.

Changes

  • Adds auth.kerberos config for HDFS output.
  • Adds limited Kerberos support for HDFS input.
  • Supports keytab-based Kerberos auth with:
    • krb5_conf
    • keytab
    • principal
    • service_principal
    • data_transfer_protection

Related Issues
#1347

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 11, 2026

CLA assistant check
All committers have signed the CLA.

@Khalid-Nowaf
Copy link
Copy Markdown
Author

I would like to know how I trigger the review process.

Copy link
Copy Markdown
Contributor

@Jeffail Jeffail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, @Khalid-Nowaf — Kerberos support here has been long awaited. I spent some time tracing through the diff and the upstream libraries; the notes below are intended as collaborative feedback.

Kerberos client lifecycle (internal/impl/hdfs/config.go + input.go/output.go Close)

Login() inside hdfsKerberosConfig.client() spawns a background renewal goroutine via gokrb5's enableAutoSessionRenewal (client/session.go). Because the client is constructed with NewWithKeytab, the TGT will refresh indefinitely for the lifetime of the *krbclient.Client — so long-running pipelines should authenticate without issue, which is great.

However, neither hdfsReader.Close nor hdfsWriter.Close currently calls Destroy() on the *krbclient.Client (which stops the renewal goroutine, drops the session, and zeroes credential material), nor closes the underlying *hdfs.Client. For a process that brings the component up once and runs forever this is harmless, but on component rebuild (config reload, fatal-error retry) the previous instance's renewal goroutine retains a reference to the Client and the in-memory keytab, preventing GC. Stashing the Kerberos client on the reader/writer struct and calling Destroy() (plus closing the HDFS client) in Close would resolve both.

Dead branch in validateInput (config.go)

hdfsConfigFromParsed already invokes c.kerberos.validate() before validateInput runs, and validate() rejects any data_transfer_protection value outside {\"\", \"authentication\", \"integrity\", \"privacy\"}. The default: arm of the validateInput switch is therefore unreachable and could be removed for clarity.

`user` field semantics under Kerberos

In hdfs/v2, when KerberosClient is set, User becomes the effective/proxy user rather than the authenticating identity (which is derived from the principal). The current field description — "A user ID to connect as." — is accurate for the non-Kerberos case but potentially misleading once Kerberos is enabled. A brief note in the description would help operators understand the interaction.

Origin of the input integrity/privacy restriction

Would you mind clarifying the source of the "integrity"/"privacy" limitation for the HDFS input? Knowing whether this stems from colinmarc/hdfs/v2, a deliberate Connect-side decision, or a gap to be closed later would be useful for operators — and a short reference in the validation error message itself would aid debugging in the field.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants