[Data] read_parquet doesn't work with multiple input directories #46049

Open
@bveeramani

Description

What happened + What you expected to happen

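See title: passing a list of multiple directories to ray.data.read_parquet raises a FileNotFoundError. I expected the directories to be read into a single dataset.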

Versions / Dependencies

d844d63

Reproduction script

import ray

ray.data.read_parquet(["s3://anonymous@air-example-data-2/10G-image-data-synthetic-raw-parquet"] * 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/read_api.py", line 772, in read_parquet
    datasource = ParquetDatasource(
                 ^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/datasource/parquet_datasource.py", line 238, in __init__
    _handle_read_os_error(e, paths)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/datasource/file_meta_provider.py", line 250, in _handle_read_os_error
    raise error
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/datasource/parquet_datasource.py", line 225, in __init__
    pq_ds = pq.ParquetDataset(
            ^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1354, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pyarrow/dataset.py", line 785, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/balaji/anaconda3/envs/ray/lib/python3.11/site-packages/pyarrow/dataset.py", line 475, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3025, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Error creating dataset. Could not read schema from 'air-example-data-2/10G-image-data-synthetic-raw-parquet'. Is this a 'parquet' file?: Path does not exist 'air-example-data-2/10G-image-data-synthetic-raw-parquet'. Detail: [errno 2] No such file or directory

Issue Severity

Medium: It is a significant difficulty but I can work around it.
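The issue does not say which workaround is meant. One possible approach, sketched here under the assumption that reading a single directory still works at this commit, is to call ray.data.read_parquet once per directory and combine the results with Dataset.union. The helper name read_parquet_dirs and its read_fn parameter are hypothetical and exist only for illustration; they are not part of Ray.

```python
from typing import Callable, Optional, Sequence


def read_parquet_dirs(paths: Sequence[str], read_fn: Optional[Callable] = None):
    """Read each directory separately, then union the resulting datasets.

    Hypothetical helper: assumes single-directory reads are unaffected by
    the bug. `read_fn` defaults to ray.data.read_parquet and is a parameter
    only so the helper can be exercised without a Ray cluster.
    """
    if not paths:
        raise ValueError("paths must contain at least one directory")
    if read_fn is None:
        import ray  # imported lazily so the helper itself has no hard dependency

        read_fn = ray.data.read_parquet
    # One read per directory avoids handing a list of directories to the reader.
    first, *rest = [read_fn(path) for path in paths]
    # Dataset.union accepts any number of other datasets.
    return first.union(*rest) if rest else first


# Example call mirroring the reproduction script (needs Ray and S3 access):
# ds = read_parquet_dirs(
#     ["s3://anonymous@air-example-data-2/10G-image-data-synthetic-raw-parquet"] * 2
# )
```

This is untested against the commit above and may change how the read is split into tasks compared with a single read_parquet call.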

Metadata

Labels

P2 (important issue, but not time-critical), bug (something that is supposed to be working, but isn't), data (Ray Data-related issues)