Bug fix: Add HDFS hostname to protocol prefix #7935
li-yi-dong wants to merge 1 commit into huggingface:main from
Hi! Is it related to #7934? It's not clear to me why the protocol would need this, given that the hostname should be present in resolve_pattern("hdfs://hostname/user/xxx", ...)
It's related to #7934 in a subtle way. In my use case, I need to specify the HDFS hostname. In theory, I can do it with

```python
ds = load_dataset(
    "parquet",
    data_files={
        "train": "hdfs://hostname/xxx*.parquet",
    },
    streaming=True,
)
```

or

```python
ds = load_dataset(
    "parquet",
    data_files={
        "train": "hdfs:///xxx*.parquet",
    },
    streaming=True,
    storage_options={
        "host": "hostname"
    }
)
```

Neither of them works. Yes,
@lhoestq
I see. I think the path forward is to fix #7934, which sounds like an actual xPath bug, while resolve_pattern dropping the hostname comes from the fsspec HDFS implementation, which we should probably try to follow.
Fixing #7934 alone can solve my problem. But I don't think fsspec intends to drop the hostname. Function From another point of view, in
For an HDFS URL with a hostname, like hdfs://hostname/user/xxx, the function resolve_pattern drops the hostname and outputs hdfs:///user/xxx. This may break later file operations, which would then try to connect to the wrong HDFS cluster.
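
To make the described behavior concrete, here is a minimal standalone sketch of keeping the hostname in the protocol prefix. It is not the diff from this PR and does not call `datasets` or `fsspec`; the helper names `split_hdfs_url` and `reattach_host` are hypothetical, and it assumes the filesystem's glob step returns bare paths such as `/user/xxx-0.parquet`.

```python
def split_hdfs_url(url: str) -> tuple[str, str, str]:
    """Split 'protocol://host/path' into (protocol, host, path).

    Hypothetical helper for illustration only. The host is an empty
    string when the URL has the form 'hdfs:///path'.
    """
    protocol, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"expected a URL with a protocol, got {url!r}")
    host, _, path = rest.partition("/")
    return protocol, host, "/" + path


def reattach_host(original_pattern: str, resolved_paths: list[str]) -> list[str]:
    """Prefix bare paths with the protocol AND host of the original pattern.

    Assumption: resolved_paths are absolute paths without protocol or host.
    """
    protocol, host, _ = split_hdfs_url(original_pattern)
    prefix = f"{protocol}://{host}"  # 'hdfs://hostname', or 'hdfs://' if no host
    return [prefix + "/" + p.lstrip("/") for p in resolved_paths]


if __name__ == "__main__":
    # With a hostname in the pattern, the hostname survives in the outputs.
    print(reattach_host("hdfs://hostname/user/xxx*.parquet", ["/user/xxx-0.parquet"]))
    # Without a hostname, the result is the usual 'hdfs:///...' form.
    print(reattach_host("hdfs:///user/xxx*.parquet", ["/user/xxx-0.parquet"]))
```

The point of the sketch is only that the prefix is rebuilt from the original pattern instead of from the bare protocol name, so a pattern that names a cluster keeps pointing at that cluster.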