hdfs: add Kerberos authentication support#4419
Conversation
|
I would like to know how I trigger the review process. |
Jeffail
left a comment
There was a problem hiding this comment.
Thank you for the contribution, @Khalid-Nowaf — Kerberos support here has been long awaited. I spent some time tracing through the diff and the upstream libraries; the notes below are intended as collaborative feedback.
Kerberos client lifecycle (internal/impl/hdfs/config.go + input.go/output.go Close)
Login() inside hdfsKerberosConfig.client() spawns a background renewal goroutine via gokrb5's enableAutoSessionRenewal (client/session.go). Because the client is constructed with NewWithKeytab, the TGT will refresh indefinitely for the lifetime of the *krbclient.Client — so long-running pipelines should authenticate without issue, which is great.
However, neither hdfsReader.Close nor hdfsWriter.Close currently calls Destroy() on the *krbclient.Client (which stops the renewal goroutine, drops the session, and zeroes credential material), nor closes the underlying *hdfs.Client. For a process that brings the component up once and runs forever this is harmless, but on component rebuild (config reload, fatal-error retry) the previous instance's renewal goroutine retains a reference to the Client and the in-memory keytab, preventing GC. Stashing the Kerberos client on the reader/writer struct and calling Destroy() (plus closing the HDFS client) in Close would resolve both.
Dead branch in validateInput (config.go)
hdfsConfigFromParsed already invokes c.kerberos.validate() before validateInput runs, and validate() rejects any data_transfer_protection value outside {\"\", \"authentication\", \"integrity\", \"privacy\"}. The default: arm of the validateInput switch is therefore unreachable and could be removed for clarity.
`user` field semantics under Kerberos
In hdfs/v2, when KerberosClient is set, User becomes the effective/proxy user rather than the authenticating identity (which is derived from the principal). The current field description — "A user ID to connect as." — is accurate for the non-Kerberos case but potentially misleading once Kerberos is enabled. A brief note in the description would help operators understand the interaction.
Origin of the input integrity/privacy restriction
Would you mind clarifying the source of the "integrity"/"privacy" limitation for the HDFS input? Knowing whether this stems from colinmarc/hdfs/v2, a deliberate Connect-side decision, or a gap to be closed later would be useful for operators — and a short reference in the validation error message itself would aid debugging in the field.
Adds Kerberos authentication support for the HDFS connector by upgrading to github.com/colinmarc/hdfs/v2 and introducing shared HDFS auth configuration for input and output.
Changes
Related Issues
#1347