BUG: read_parquet fails for hdfs:// files with latest fsspec #50639

Closed
3 tasks done
f4hy opened this issue Jan 9, 2023 · 5 comments
Labels
IO Parquet parquet, feather Upstream issue Issue related to pandas dependency

Comments

@f4hy

f4hy commented Jan 9, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# With fsspec==2022.8.2 installed, this works:
df = pd.read_parquet("hdfs:///path/to/myfile.parquet")

# With fsspec==2022.11.0 installed, the same call fails:
df = pd.read_parquet("hdfs:///path/to/myfile.parquet")
# OSError: only valid on seekable files

Issue Description

In 2022.10.0, fsspec changed its hdfs backend to use the new filesystem implementation in pyarrow. This seems to break compatibility with pandas: the new backend apparently returns a non-seekable file, whereas the pandas read path expects a seekable one.

One solution could be to have pandas require fsspec<=2022.8.2, the last version that worked.
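As an illustration of the pinning idea, the check below tests whether a given fsspec version falls within the proposed fsspec<=2022.8.2 bound. It is only a sketch: the helper names (parse_release, within_pin, installed_fsspec_within_pin) are hypothetical, and the ceiling is simply the last version reported as working in this issue.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Last fsspec release reported as working in this issue (assumption taken from the report).
LAST_KNOWN_GOOD = (2022, 8, 2)


def parse_release(version_string):
    """Return the leading 'YYYY.M.P' part of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def within_pin(version_string, ceiling=LAST_KNOWN_GOOD):
    """Return True if the version satisfies the proposed fsspec<=2022.8.2 pin."""
    return parse_release(version_string) <= ceiling


def installed_fsspec_within_pin():
    """Check the installed fsspec; return None if fsspec is not installed."""
    try:
        return within_pin(version("fsspec"))
    except PackageNotFoundError:
        return None
```

A pin like this only avoids the regression; it does not address the non-seekable handle itself.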

Another option would be to look upstream to fsspec and have them guarantee a seekable filehandle.

A third would be to modify the pandas reader to detect a non seekable filehandle and buffer the file.
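The third option could look roughly like the sketch below. This is not how pandas handles files today; ensure_seekable is a hypothetical helper, and it assumes the handle exposes the usual seekable() and read() methods of a Python binary file object.

```python
import io


def ensure_seekable(handle):
    """Return `handle` unchanged if it is seekable.

    Otherwise read its remaining contents into memory and return an
    in-memory, seekable io.BytesIO holding the same bytes.
    """
    if handle.seekable():
        return handle
    return io.BytesIO(handle.read())
```

The trade-off is memory: a non-seekable file is read in full before parsing starts, which may be costly for large parquet files.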

Expected Behavior

read_parquet should continue to work with hdfs remote files, as it did with earlier versions of the fsspec dependency.

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-77-generic
Version : #86~18.04.1-Ubuntu SMP Fri Jun 18 01:23:22 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.5.2
numpy : 1.24.1
pytz : 2022.7
dateutil : 2.8.2
setuptools : 51.3.3
pip : 20.3.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@f4hy f4hy added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 9, 2023
@j3doucet

j3doucet commented Jan 9, 2023

+1, this has a significant impact on us as users of pandas + HDFS.

@jbrockmendel jbrockmendel added IO Parquet parquet, feather and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 12, 2023
@f4hy
Author

f4hy commented Feb 24, 2023

Adding more details. The full error message is the following:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-10-51b2586c3c25> in <module>
----> 1 pd.read_parquet("hdfs:///path/to/myfile.parquet")

~/.local/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    501     impl = get_engine(engine)
    502 
--> 503     return impl.read(
    504         path,
    505         columns=columns,

~/.local/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    249         )
    250         try:
--> 251             result = self.api.parquet.read_table(
    252                 path_or_handle, columns=columns, **kwargs
    253             ).to_pandas(**to_pandas_kwargs)

~/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2924             )
   2925         try:
-> 2926             dataset = _ParquetDatasetV2(
   2927                 source,
   2928                 schema=schema,

~/.local/lib/python3.8/site-packages/pyarrow/parquet/core.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2464 
   2465             self._dataset = ds.FileSystemDataset(
-> 2466                 [fragment], schema=schema or fragment.physical_schema,
   2467                 format=parquet_format,
   2468                 filesystem=fragment.filesystem

~/.local/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

~/.local/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.tell()

~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.get_random_access_file()

~/.local/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile._assert_seekable()

OSError: only valid on seekable files

@f4hy
Author

f4hy commented Feb 24, 2023

It looks like fsspec updated this in 2022.1.0 (PR#1154) so that a read of a pyarrow file can be made seekable, but this requires passing the seekable=True option, e.g.:

import fsspec

fs = fsspec.filesystem("hdfs")
path = "hdfs:///path/to/myfile.parquet"

# Explicitly requesting a seekable handle works
with fs.open(path, 'rb', seekable=True) as f:
    print(f.seekable()) # True

# The default handle is not seekable
with fs.open(path, 'rb') as f:
    print(f.seekable()) # False

It is unclear to me how to pass the seekable option through pandas to the underlying fsspec open call.
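One possible workaround, sketched here under assumptions rather than as a confirmed fix, is to open the file yourself with seekable=True and hand the open handle to the reader instead of the URL. pandas.read_parquet accepts a binary file-like object as well as a path. The helper name read_with_seekable_handle is hypothetical, and the seekable keyword is the one described in this comment, so it depends on an fsspec version that supports it.

```python
def read_with_seekable_handle(fs, path, reader, **reader_kwargs):
    """Open `path` on filesystem `fs` with seekable=True and pass the handle to `reader`.

    `fs` is any object with an fsspec-style open(path, mode, **kwargs) method,
    and `reader` is a callable such as pandas.read_parquet that accepts an
    open binary file object as its first argument.
    """
    with fs.open(path, "rb", seekable=True) as handle:
        return reader(handle, **reader_kwargs)


# Intended usage (not run here, requires pandas, fsspec and an HDFS cluster):
#   import fsspec
#   import pandas as pd
#   fs = fsspec.filesystem("hdfs")
#   df = read_with_seekable_handle(fs, "hdfs:///path/to/myfile.parquet", pd.read_parquet)
```

This bypasses the URL handling inside pandas, so any storage_options would have to be given to fsspec.filesystem directly.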

@martindurant
Contributor

This is fixed on master since fsspec/filesystem_spec#1186.

@mroeschke
Member

Sounds like this was an upstream issue that has been fixed, so closing.

@mroeschke mroeschke added Upstream issue Issue related to pandas dependency and removed Bug labels Feb 24, 2023
5 participants