Open
Description
Describe the bug
_as_str('hdfs:///xxxx') returns hdfs://xxxx, dropping one / and making the path invalid.
For example, a call such as

    from datasets import load_dataset

    ds = load_dataset(
        "parquet",
        data_files={
            "train": "hdfs:///user/path/to/data/train*.parquet",
        },
        streaming=True,
        storage_options={
            "host": "hostname",
        },
    )

fails with the following traceback:
  File "/usr/local/lib/python3.11/site-packages/datasets/load.py", line 1511, in load_dataset
    return builder_instance.as_streaming_dataset(split=split)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/builder.py", line 1193, in as_streaming_dataset
    splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/packaged_modules/parquet/parquet.py", line 123, in _split_generators
    with open(file, "rb") as f:
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/streaming.py", line 73, in wrapper
    return function(*args, download_config=download_config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 963, in xopen
    file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/core.py", line 508, in open
    out = open_files(
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/core.py", line 295, in open_files
    fs, fs_token, paths = get_fs_token_paths(
                          ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/core.py", line 672, in get_fs_token_paths
    chain = _un_chain(urlpath0, storage_options or {})
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/fsspec/core.py", line 365, in _un_chain
    kw = dict(
         ^^^^^
TypeError: dict() got multiple values for keyword argument 'host'
This happens because the path passed to fsspec.open is hdfs://user/path/to/data/trainxxx.parquet. With only two slashes, fsspec treats user as the hostname, which collides with the host already given in storage_options.
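The effect of the missing slash can be illustrated with a minimal sketch. It uses the standard library's urlsplit as a stand-in, on the assumption that fsspec's HDFS backend follows the same generic URL rules. The merge_kwargs helper is hypothetical and only imitates the kind of dict(...) merge seen in the traceback; it is not fsspec code.

```python
from urllib.parse import urlsplit

# Three slashes: the authority (host) part is empty, the path is absolute.
good = urlsplit("hdfs:///user/path/to/data/train.parquet")
# Two slashes: the first path segment is parsed as the host instead.
bad = urlsplit("hdfs://user/path/to/data/train.parquet")

print(repr(good.hostname), good.path)
print(repr(bad.hostname), bad.path)


def merge_kwargs(url_kwargs, storage_options):
    """Hypothetical stand-in for a keyword merge of URL-derived and user options."""
    # Unpacking two mappings that share a key into dict() raises TypeError.
    return dict(**url_kwargs, **storage_options)


try:
    # "host" comes once from the mangled URL and once from storage_options.
    merge_kwargs({"host": bad.hostname}, {"host": "hostname"})
except TypeError as err:
    print(f"TypeError: {err}")
```

With the three-slash form the URL contributes no host, so the host from storage_options would be the only one in the merge.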
Steps to reproduce the bug
Expected behavior
_as_str should keep all three /, so that hdfs:///xxxx is returned unchanged.
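One possible shape for slash-preserving behavior is sketched below. The function as_str_keep_slashes and its regular expression are hypothetical names introduced for illustration; this is not the datasets implementation of _as_str, only an example of normalizing the part after the protocol separately so that an empty host is not lost.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: a URL scheme, then "://", then everything else.
_PROTOCOL_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*)://(.*)$", re.DOTALL)


def as_str_keep_slashes(path: str) -> str:
    """Normalize a path or URL without dropping the slash after an empty host."""
    match = _PROTOCOL_RE.match(path)
    if match is None:
        # Plain local path: ordinary POSIX normalization is enough.
        return PurePosixPath(path).as_posix()
    protocol, rest = match.groups()
    if not rest:
        return f"{protocol}://"
    # For "hdfs:///user/..." the remainder starts with "/", which
    # PurePosixPath keeps as the root, so the third slash survives.
    return f"{protocol}://{PurePosixPath(rest).as_posix()}"


print(as_str_keep_slashes("hdfs:///user/path/to/data/train*.parquet"))
print(as_str_keep_slashes("hdfs://hostname/user/path//train.parquet"))
```

Splitting off the protocol before normalizing means the remainder is handled as an ordinary POSIX path, so duplicate slashes inside the path are still collapsed while the leading one is kept.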
Environment info
datasets 4.4.2