Description
Describe the bug, including details regarding any error messages, version, and platform.
requirements.txt:
requests==2.32.2
dataclasses-json==0.6.6
readerwriterlock==1.0.9
fsspec==2024.9.0
pyarrow==16.1.0
cachetools==5.3.3
google-auth==2.35.0
from pyarrow.fs import GcsFileSystem
from fsspec.implementations.arrow import ArrowFSWrapper
import os
import pandas
import pyarrow.dataset as dt
fileset_storage_location = "gs://xxxx/catalog/schema/fileset3"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "xxxxx.json"
selffs = ArrowFSWrapper(GcsFileSystem())
data = pandas.DataFrame({"Name": ["A", "B", "C", "D"], "ID": [20, 21, 19, 18]})
parquet_file = fileset_storage_location + "/test.parquet"
data.to_parquet(parquet_file, filesystem=selffs)
arrow_dataset = dt.dataset(parquet_file, filesystem=selffs)
Running this script fails with the following error:
Traceback (most recent call last):
File "", line 1, in
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 794, in dataset
return _filesystem_dataset(source, **kwargs)
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
return factory.finish(schema)
File "pyarrow/_dataset.pyx", line 3089, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
File "pyarrow/io.pxi", line 341, in pyarrow.lib.NativeFile.seek
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: google::cloud::Status(OUT_OF_RANGE: Permanent error, with a last message of Request range not satisfiable error_info={reason=, domain=, metadata={gcloud-cpp.retry.function=ReadObjectNotWrapped, gcloud-cpp.retry.reason=permanent-error, gcloud-cpp.retry.original-message=Request range not satisfiable}})
If we switch the fsspec and pyarrow versions to:
fsspec==2024.3.1
pyarrow==15.0.2
then the error message will be:
Traceback (most recent call last):
File "", line 1, in
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 782, in dataset
return _filesystem_dataset(source, **kwargs)
File "/home/ec2-user/gravitino/clients/client-python/venv/lib64/python3.9/site-packages/pyarrow/dataset.py", line 475, in _filesystem_dataset
return factory.finish(schema)
File "pyarrow/_dataset.pyx", line 3025, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
File "pyarrow/io.pxi", line 328, in pyarrow.lib.NativeFile.seek
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: google::cloud::Status(OUT_OF_RANGE: Permanent error ReadObjectNotWrapped: Request range not satisfiable)
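Both tracebacks fail in NativeFile.seek with "Request range not satisfiable", which is the wording of an HTTP 416 response. A server usually returns it when a read starts past the end of the object. One thing that may be worth checking (an assumption, not a confirmed cause) is whether the object written by to_parquet has the expected size and Parquet footer. The helper below is a minimal sketch: `inspect_parquet_object` is a hypothetical name, and it only relies on the fsspec-style `info()` and `open()` methods of the filesystem passed in.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # Parquet files start and end with these 4 bytes


def inspect_parquet_object(fs, path):
    """Report the size of `path` and whether it carries the Parquet magic bytes.

    `fs` is assumed to be an fsspec-style filesystem exposing `info(path)`
    and `open(path, "rb")`, such as the ArrowFSWrapper in the script above.
    """
    size = fs.info(path)["size"]
    report = {"size": size, "has_magic": False, "footer_length": None}
    # Anything shorter than magic + 4-byte footer length + magic
    # cannot be a Parquet file, so there is nothing further to read.
    if size is None or size < 12:
        return report
    with fs.open(path, "rb") as f:
        head = f.read(4)
        f.seek(size - 8)
        tail = f.read(8)
    report["has_magic"] = head == PARQUET_MAGIC and tail[4:] == PARQUET_MAGIC
    if report["has_magic"]:
        # The 4 bytes before the trailing magic hold the little-endian footer length.
        report["footer_length"] = struct.unpack("<I", tail[:4])[0]
    return report
```

With the objects from the reproduction script the call would be `inspect_parquet_object(selffs, parquet_file)`. A reported size of 0, or a missing magic, would suggest the write step rather than the dataset read is where things go wrong.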
OS & Python
(venv) [ec2-user@ip-111- client-python]$ python --version
Python 3.9.16
(venv) [ec2-user@ip-111-client-python]$ uname -a
Linux ip-xxxxx.ap-northeast-1.compute.internal 6.1.102-111.182.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Aug 13 22:23:09 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
(venv) [ec2-user@ip-172-31-10-123 client-python
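For anyone trying to reproduce this, the environment details above can also be collected in one step. The snippet below is a small sketch using only the standard library. `installed_versions` is a hypothetical helper name, and the package list is taken from the requirements and imports in this report.

```python
from importlib import metadata
import platform


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", platform.python_version())
    print("platform", platform.platform())
    names = ["pyarrow", "fsspec", "pandas", "google-auth"]
    for name, version in installed_versions(names).items():
        print(name, version)
```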
Component(s)
Python