Getting "error sending request for url" AzureError when writing very large deltatable to Azure Gen 2 #2639
After further investigation, whether this error occurs actually depends on the column I choose to partition the table by. Why is that? One partition scheme I tried produced 267 partitions in total; the next produced over 30k. Why does my choice of partition_by affect this? The write should be error-free regardless of the number of partitions or how long it takes.
You can pass in storage_options, e.g. {"timeout": "120s"}.
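A minimal sketch of passing that option through, assuming placeholder account details and a stand-in dataframe `df`:

```python
import pandas as pd
from deltalake import write_deltalake

# Stand-in dataframe; the real one in this thread has millions of rows.
df = pd.DataFrame({"part": ["a", "b"], "value": [1, 2]})

write_deltalake(
    "abfss://<container>@<account>.dfs.core.windows.net/my_table",  # placeholder URI
    df,
    partition_by=["part"],
    storage_options={
        "account_name": "<account>",  # placeholder credentials
        "account_key": "<key>",
        "timeout": "120s",            # raise the per-request HTTP client timeout
    },
)
```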
The error still persists, depending on the partition column chosen.
After looking further into the data, I do not believe the data itself has anything to do with this error @ion-elgreco; the number of partitions, however, might be the issue. Assuming 30k+ partitions, can deltalake even handle that? If so, is it possible that write_deltalake is trying to write to partitions before even creating the partition folder?
@ion-elgreco Would it be an issue if I made partitions from a timestamp column?
calling […]
Yes, making the delta table is fine.
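For context, one common way to partition on a timestamp without exploding the partition count is to derive a coarser key first; a sketch, with placeholder column names:

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-06-01 10:00", "2024-06-01 11:30", "2024-06-02 09:15"]),
    "value": [1, 2, 3],
})

# Partitioning directly on a high-resolution timestamp yields one partition per
# distinct value; deriving a coarser key (here, the calendar date as a string)
# keeps the partition count bounded.
df["ts_date"] = df["ts"].dt.date.astype(str)
# ...then pass partition_by=["ts_date"] to write_deltalake instead of ["ts"].
```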
@rtyler I have come up with a workaround in the meantime. It seems that (at least with my data) deltalake simply can't handle writing all the partitions at once, so I write in batches: I take a few thousand dataframes at a time (4096 is what I used for testing), pd.concat them, write the result as one batch, then move on to the next batch, and so on; see the sketch below. This is fast enough for my purposes. I am not sure whether deltalake can handle writing 10k+ partitions at a time, since writing 31k resulted in the error I was originally getting.
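A minimal sketch of this batching workaround, assuming `frames` holds the full list of small dataframes and using placeholder table details:

```python
import pandas as pd
from deltalake import write_deltalake

BATCH_SIZE = 4096  # frames concatenated per write, as described above
uri = "abfss://<container>@<account>.dfs.core.windows.net/my_table"  # placeholder

# `frames` is assumed to be the full list of per-partition dataframes.
for start in range(0, len(frames), BATCH_SIZE):
    batch = pd.concat(frames[start:start + BATCH_SIZE], ignore_index=True)
    write_deltalake(
        uri,
        batch,
        mode="append",                       # each batch appends to the same table
        partition_by=["my_partition_col"],   # placeholder partition column
        storage_options={"timeout": "120s"},
    )
```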
We need to take a deeper look at this. Can you provide a sample dataframe with similar characteristics, such as those 30k partitions?
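Something along these lines could serve as a synthetic stand-in with the characteristics described (millions of rows, roughly 30k distinct partition values); the column names are placeholders:

```python
import numpy as np
import pandas as pd

N_ROWS = 3_000_000
N_PARTS = 30_000

rng = np.random.default_rng(0)
df = pd.DataFrame({
    # ~30k distinct partition keys, stored as strings
    "part": rng.integers(0, N_PARTS, N_ROWS).astype(str),
    "value": rng.random(N_ROWS),
})
print(df["part"].nunique())  # should be close to 30,000
```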
Environment
Delta-rs version:
This happens on both 0.18.1 and 0.16.1; I haven't tested other versions.
Environment: Python 3.11
Bug
What happened:
When writing an extremely large delta table (30,000 partition folders in total) to Azure Gen 2, I keep getting the "error sending request for url" AzureError in the title.
This error happens regardless of engine (I tested both Rust and PyArrow) and regardless of deltalake version (I tried 0.18.1 and 0.16.1). I run the following call after creating an extremely large dataframe via pd.concat. My delta table contains millions of rows, but that should not be a problem to write, so I am not sure why I am getting this error at all. Writing very small delta tables (thousands of rows) works fine. What could be the cause of, and solution to, this?
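The exact snippet was not captured in this thread; a representative call of the shape described, with placeholder names throughout, would look like:

```python
import pandas as pd
from deltalake import write_deltalake

# `list_of_dataframes` stands in for the many frames being concatenated.
list_of_dataframes = [pd.DataFrame({"part": ["a"], "value": [1]})]
big_df = pd.concat(list_of_dataframes, ignore_index=True)

write_deltalake(
    "abfss://<container>@<account>.dfs.core.windows.net/my_table",  # placeholder URI
    big_df,
    mode="overwrite",
    partition_by=["part"],  # the column whose choice affects the error
    engine="rust",          # both "rust" and "pyarrow" were tested
)
```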
Everything else works, including the concatenation; it only errors when I try writing the dataframe.
What you expected to happen:
For the write to be successful regardless of how long it takes.
How to reproduce it:
Most likely you need an extremely large delta table of millions of rows (gigabytes of data) and to attempt a write to Azure Gen 2.
It is important to note that the error message I gave is from the PyArrow engine; Rust is similar, except it performs 10 retries. I don't know why the Azure client won't even retry with PyArrow. I am using abfss in my URL when writing to Azure Gen 2.