to_pyarrow_table() on a table in S3 kept getting "Generic S3 error: error decoding response body" #2595

Open
k-ye opened this issue Jun 13, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@k-ye

k-ye commented Jun 13, 2024

Environment

Delta-rs version: deltalake==0.18.1

Binding: Python

Environment:

  • Cloud provider: AWS S3
  • OS: Linux
  • Other:
    • Python 3.10.4
    • pyarrow==16.1.0
    • pyarrow-hotfix==0.6

Bug

What happened:

I'm trying to do a simple table load from S3, but I keep getting this OSError: Generic S3 error: error decoding response body

```python
import time

from deltalake import DeltaTable

# table_uri and storage_options are defined earlier in the script (not shown).
table = DeltaTable(table_uri, storage_options=storage_options)
print(f"version: {table.version()}")
print(f"schema: {table.schema()}")
print(table.files())

ts = time.time()
df = table.to_pyarrow_table()
```

```
version: 0
schema: Schema([Field(id, PrimitiveType("string"), nullable=True), Field(path, PrimitiveType("string"), nullable=True)])
['0-e03dac34-16a0-4b6e-82c8-fd1098d1bf45-0.parquet']
Traceback (most recent call last):
  File "test.py", line 32, in <module>
    df = table.to_pyarrow_table()
  File "***/lib/python3.10/site-packages/deltalake/table.py", line 1161, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 88, in pyarrow.lib.check_status
OSError: Generic S3 error: error decoding response body
```

The stack trace shows that this is actually in pyarrow. Not sure if it is possible to tweak pyarrow's S3 behavior from deltalake.

What you expected to happen:

I can get the pyarrow table.

How to reproduce it:

More details:

I have verified the integrity of this table with these methods:

  1. Cloning the table locally, then loading it from there. to_pyarrow_table() runs fine.
  2. Reading the S3 table with duckdb (and its delta extension). Worked fine, too.
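
The two checks above might look roughly like the sketch below. The function names and arguments are hypothetical, and the DuckDB part assumes the `delta` extension with its `delta_scan` table function is installable; S3 credentials for DuckDB would have to be configured separately.

```python
def check_local_clone(local_path: str) -> int:
    """Load a locally cloned copy of the table and return its row count."""
    # Imported lazily so this sketch can be loaded without deltalake installed.
    from deltalake import DeltaTable

    table = DeltaTable(local_path)
    return table.to_pyarrow_table().num_rows


def check_with_duckdb(table_uri: str) -> int:
    """Count rows through DuckDB's delta extension (assumed to be available)."""
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL delta")
    con.execute("LOAD delta")
    # table_uri is interpolated directly into SQL; only use trusted input here.
    row = con.execute(f"SELECT count(*) FROM delta_scan('{table_uri}')").fetchone()
    return row[0]
```

If both functions return the same row count, the table data itself is unlikely to be the cause of the error.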
@k-ye k-ye added the bug Something isn't working label Jun 13, 2024
@k-ye
Author

k-ye commented Jun 13, 2024

Seems related to #2301 and #2592

@ion-elgreco
Collaborator

@k-ye can you please try deltalake 0.19.1 and report back if you see improvements?

@k-ye
Author

k-ye commented Aug 20, 2024

> @k-ye can you please try deltalake 0.19.1 and report back if you see improvements?

I've done a few tests:

  • On a U.S. machine (but not on AWS), this generally works now. Occasionally it was able to finish the script execution, but the process ended with
terminate called without an active exception
Aborted
  • On a CN machine behind a VPN, I could still bump into this error, though.

@ion-elgreco
Collaborator

> > @k-ye can you please try deltalake 0.19.1 and report back if you see improvements?
>
> I've done a few tests:
>
>   • On a U.S. machine (but not on AWS), this generally works now. Occasionally it was able to finish the script execution, but the process ended with
> terminate called without an active exception
> Aborted
>   • On a CN machine behind a VPN, I could still bump into this error, though.

Hmm, that's a C++ error.

Regarding VPNs, I would defer to what Tustvold said: apache/arrow-rs#5882 (comment)

@shriram-louisa

I have the same issue.
@k-ye were you able to solve this?

@k-ye
Author

k-ye commented Sep 10, 2024

> I have the same issue. @k-ye were you able to solve this?

Hi @shriram-louisa ,

Sorry I didn't follow up on this. We were just testing the waters when I filed this issue, and have since moved on to other solutions...

@shriram-louisa

Thank you for the response 🙌 @k-ye

@ion-elgreco
Collaborator

> Thank you for the response 🙌 @k-ye

What delta-rs version are you using?

@shriram-louisa

0.19.2
I am able to avoid this error by using an older version, i.e. 0.15.3.

@sim-san

sim-san commented Sep 10, 2024

I am getting the same error. Do you know what causes the error?

delta-rs 0.19.1

@ion-elgreco
Collaborator

ion-elgreco commented Sep 10, 2024

@shriram-louisa @sim-san how stable is your connection? Are you running this behind a VPN? What's the throughput, and how big is your table?

I need more info, guys.

@ion-elgreco
Collaborator

Have any of you tried increasing the timeout to 60s or 120s? storage_options = {"timeout":"120s"}
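
A minimal sketch of applying that suggestion follows. The helper names `with_timeout` and `load_table` are hypothetical; only the `"timeout"` key and the `storage_options` argument come from this thread.

```python
from typing import Dict, Optional


def with_timeout(storage_options: Optional[Dict[str, str]] = None,
                 timeout: str = "120s") -> Dict[str, str]:
    """Return a copy of storage_options with the request timeout set."""
    merged = dict(storage_options or {})
    merged["timeout"] = timeout
    return merged


def load_table(table_uri: str, storage_options: Optional[Dict[str, str]] = None):
    """Open the table with a longer timeout and read it into a pyarrow table."""
    # Imported lazily so with_timeout() can be used without deltalake installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri, storage_options=with_timeout(storage_options))
    return table.to_pyarrow_table()
```

Copying the dictionary leaves the caller's original options untouched, which makes it easy to compare runs with and without the longer timeout.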

@sim-san

sim-san commented Sep 10, 2024

> Have any of you tried increasing the timeout to 60s or 120s? storage_options = {"timeout":"120s"}

The Delta table has a file size of around 700 MB. Increasing the timeout solved the problem.
But where can I find further documentation about the possible storage_options?

@ion-elgreco
Collaborator

> > Have any of you tried increasing the timeout to 60s or 120s? storage_options = {"timeout":"120s"}
>
> The Delta table has a file size of around 700 MB. Increasing the timeout solved the problem.
> But where can I find further documentation about the possible storage_options?

You will have to check the object_store crate documentation and pick your storage backend there (AWS S3 in this case); it lists the config keys.
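
As a rough illustration of what such config keys look like when passed from Python: the keys below (`aws_region`, `aws_access_key_id`, `aws_secret_access_key`, `timeout`, `connect_timeout`) are believed to be accepted by object_store for S3, but treat the exact set as an assumption and confirm it against the crate documentation for your version. The function name and default values are hypothetical.

```python
import os
from typing import Dict


def build_s3_storage_options(region: str,
                             timeout: str = "120s",
                             connect_timeout: str = "30s") -> Dict[str, str]:
    """Assemble a storage_options dict for an S3-backed table."""
    options = {
        "aws_region": region,
        "timeout": timeout,                  # overall request timeout
        "connect_timeout": connect_timeout,  # time allowed to establish a connection
    }
    # Copy credentials from the environment only when they are actually set.
    for env_name, key in (
        ("AWS_ACCESS_KEY_ID", "aws_access_key_id"),
        ("AWS_SECRET_ACCESS_KEY", "aws_secret_access_key"),
    ):
        value = os.environ.get(env_name)
        if value:
            options[key] = value
    return options
```

The resulting dictionary can be passed as `DeltaTable(table_uri, storage_options=...)`; all values are strings, matching the `{"timeout": "120s"}` form used earlier in the thread.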

@junhl

junhl commented Sep 18, 2024

This also seems to happen when querying a column with longer values, probably for the same reason: the network connection is interrupted during the operation.

```python
from deltalake import DeltaTable

table = DeltaTable("...")  # WITHOUT the timeout option
# Works fine, in about 1 second.
df = table.to_pandas(columns=["some_id"], filters=[("some_id", "in", ["id1", "id2"])])
# Gives the decoding error after a few minutes; json_data is a JSON string of a few KB per row.
df = table.to_pandas(columns=["some_id", "json_data"], filters=[("some_id", "in", ["id1", "id2"])])

table = DeltaTable("...", storage_options={"timeout": "120s"})  # WITH the timeout option
# Works fine, in about 1 second.
df = table.to_pandas(columns=["some_id"], filters=[("some_id", "in", ["id1", "id2"])])
# Works, after a few minutes.
df = table.to_pandas(columns=["some_id", "json_data"], filters=[("some_id", "in", ["id1", "id2"])])
```

@Tom-Newton
Contributor

Tom-Newton commented Nov 19, 2024

We've been having some similar-ish problems recently on 0.19.0 with Azure. I thought it might be interesting to mention because in our case we only use the object-store-based filesystem for parsing the Delta transaction log, and we use the native Azure filesystem built into pyarrow (apache/arrow#39968) for the big data reads. Maybe switching to the pyarrow filesystem could help some people here? I've also found it to be significantly faster.

It looks like the original report ran into problems when pyarrow tried to make use of the object-store-based filesystem implementation.
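
A sketch of that approach, assuming the `filesystem` argument of `to_pyarrow_dataset` and pyarrow's `FileSystem.from_uri` and `SubTreeFileSystem`. The function name is hypothetical. With this setup the pyarrow filesystem resolves credentials on its own (for example from the environment), not from `storage_options`.

```python
def read_with_pyarrow_filesystem(table_uri: str, storage_options=None):
    """Parse the log via deltalake, but read data files via a pyarrow filesystem."""
    # Imported lazily so this sketch can be loaded without the optional packages.
    import pyarrow.fs as pa_fs
    from deltalake import DeltaTable

    # The transaction log is still read through the object_store-based backend.
    table = DeltaTable(table_uri, storage_options=storage_options)

    # Build a pyarrow-native filesystem rooted at the table location.
    raw_fs, normalized_path = pa_fs.FileSystem.from_uri(table_uri)
    filesystem = pa_fs.SubTreeFileSystem(normalized_path, raw_fs)

    # The data files are then read through the pyarrow filesystem.
    return table.to_pyarrow_dataset(filesystem=filesystem).to_table()
```

Whether this avoids the decoding error depends on the setup; it only changes which client performs the large reads.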

I'm going to try upgrading us to the latest delta-rs and hopefully #2789 will solve it.

@Tom-Newton
Contributor

I tried the latest version of delta-rs (python-v0.22.0) and I even tried modifying it to make use of apache/arrow-rs#6612, but I can still reproduce the problem. I'll try to investigate further.

@ion-elgreco
Collaborator

> I tried the latest version of delta-rs (python-v0.22.0) and I even tried modifying it to make use of apache/arrow-rs#6612, but I can still reproduce the problem. I'll try to investigate further.

It already uses a separate runtime, so that probably didn't have much effect.
