fix: remote_base path in HdfsRemoteLogIO (remove schema from path) #51207

Open: wants to merge 4 commits into base: main

@joaopamaral (Contributor) commented May 29, 2025

WebHDFSHook fails when the path includes the hdfs:// scheme. It raises the following error, which causes Celery tasks to fail:

hdfs.util.HdfsError: Pathname /user/USER/hdfs:/var/airflow/logs/DAG/run_id=scheduled__2025-05-28T21_30_00_00_00/task_id=TASK_ID/attempt=1.log from /user/USER/hdfs:/var/airflow/logs/DAG/run_id=scheduled__2025-05-28T21_30_00_00_00/task_id=TASK_ID/attempt=1.log is not a valid DFS filename.

This can be reproduced with a simple test in which WebHDFSHook reads an existing file:

> hook.read_file('hdfs:///folder1/folder2/file')

hdfs.util.HdfsError: Pathname /user/current-user/hdfs:/folder1/folder2/file from /user/current-user/hdfs:/folder1/folder2/file is not a valid DFS filename.

> hook.read_file('/folder1/folder2/file')

FILE_OUTPUT
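
The doubled prefix in the error is consistent with the scheme-prefixed string being treated as a relative path and joined onto the user's home directory. The sketch below reproduces that string using only posixpath. The helper name resolve_like_relative_path and its home default are hypothetical, and the join-then-normalize behavior is an assumption about the client, not a copy of its code.

```python
import posixpath


def resolve_like_relative_path(path: str, home: str = "/user/current-user") -> str:
    """Hypothetical helper: resolve a path the way a client might (assumption).

    A path that does not start with "/" is treated as relative and joined
    onto the home directory; the result is then normalized.
    """
    if not posixpath.isabs(path):
        # "hdfs:///..." does not start with "/", so it counts as relative.
        path = posixpath.join(home, path)
    # normpath collapses the repeated slashes after "hdfs:".
    return posixpath.normpath(path)


print(resolve_like_relative_path("hdfs:///folder1/folder2/file"))
# /user/current-user/hdfs:/folder1/folder2/file
print(resolve_like_relative_path("/folder1/folder2/file"))
# /folder1/folder2/file
```

Under that assumption, only the absolute path without a scheme reaches HDFS unchanged, which matches the two calls above.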

This PR ensures that HdfsRemoteLogIO always receives the HDFS path without the scheme, which avoids these failures.
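
One way to drop the scheme is to parse the configured base with urllib.parse.urlsplit and keep only the path component. This is a minimal sketch of that idea, not the code in this PR's diff; strip_hdfs_scheme is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def strip_hdfs_scheme(remote_base: str) -> str:
    """Hypothetical helper: return the path part of an hdfs:// URL.

    Strings without an hdfs scheme are returned unchanged.
    """
    parts = urlsplit(remote_base)
    if parts.scheme == "hdfs":
        # Drop the scheme and any host:port, keep only the filesystem path.
        return parts.path
    return remote_base


print(strip_hdfs_scheme("hdfs:///var/airflow/logs"))               # /var/airflow/logs
print(strip_hdfs_scheme("hdfs://namenode:8020/var/airflow/logs"))  # /var/airflow/logs
print(strip_hdfs_scheme("/var/airflow/logs"))                      # /var/airflow/logs
```

Note that with only two slashes and no host, as in hdfs://var/airflow/logs, urlsplit treats var as the network location and returns /airflow/logs, so a base written that way would need separate handling.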


@joaopamaral joaopamaral changed the title Fix remote_base path in HdfsRemoteLogIO (remove schema from path) fix: remote_base path in HdfsRemoteLogIO (remove schema from path) May 29, 2025
@joaopamaral joaopamaral marked this pull request as ready for review May 29, 2025 17:04