Description
Describe the enhancement requested
Hi, I have a use case where thousands of jobs write Hive-partitioned Parquet files daily to the same bucket via the S3FS filesystem. Each job may generate anywhere from a single-digit number to a few thousand Parquet files, depending on the volume from its data source.
After abstraction, these jobs follow path patterns like s3fs://my-S3-bucket/<vendor-name>/<fruit-type>/<color>/<origination>/<creation_date>/<data-center-location>/date=YYYY-MM-DD/.... The gist is that a lot of keys are being created at the same time, hence the jobs frequently hit the following error throughout the day: "AWS Error SLOW_DOWN. during Put Object operation: The object exceeded the rate limit for object mutation operations(create, update, and delete). Please reduce your rate request".
After investigating, I realized that the jobs create too many objects, one by one, in the S3FileSystem::CreateDir(..) function. My local experiments show that if the implementation checks whether each path exists first and calls impl_->CreateEmptyDir(...) only when necessary, the issue goes away in my production environment.
(I understand that cloud vendors impose various IO limits on a single bucket; completely fixing the issue is a separate matter for my daily work.)
I'm proposing a code change like the one below. Hi @pitrou, I see you are the main author of s3fs.cc; could you please share your insights when you have time?
Also, even with a vanilla build, quite a few S3FS tests fail on my Mac. Could you advise how to make them run successfully?
Many thanks.
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 640888e1c..782d5f75d 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2871,7 +2871,10 @@ Status S3FileSystem::CreateDir(const std::string& s, bool recursive) {
       for (const auto& part : path.key_parts) {
         parent_key += part;
         parent_key += kSep;
-        RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
+        ARROW_ASSIGN_OR_RAISE(FileInfo parent_key_info, this->GetFileInfo(parent_key));
+        if (parent_key_info.type() == FileType::NotFound) {
+          RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
+        }
       }
       return Status::OK();
     } else {
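To show the effect of the existence check in isolation, here is a minimal, self-contained sketch. It does not use Arrow or the AWS SDK: FakeBucket and CreateDirRecursive are hypothetical names made up for this illustration, and the in-memory set only stands in for a real bucket so that the number of PUT requests can be counted.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a bucket (not an Arrow or AWS type).
// It only records which keys exist and how many PUT requests were issued.
struct FakeBucket {
  std::set<std::string> keys;
  std::size_t put_count = 0;

  bool Exists(const std::string& key) const { return keys.count(key) > 0; }

  void PutEmptyDir(const std::string& key) {
    ++put_count;  // each call models one object mutation against the bucket
    keys.insert(key);
  }
};

// Creates a directory marker for every ancestor of the given key parts.
// With check_first set, markers that already exist are skipped, which
// mirrors the idea of the proposed change to the recursive CreateDir loop.
inline void CreateDirRecursive(FakeBucket& bucket,
                               const std::vector<std::string>& key_parts,
                               bool check_first) {
  std::string parent_key;
  for (const auto& part : key_parts) {
    parent_key += part;
    parent_key += '/';
    if (check_first && bucket.Exists(parent_key)) {
      continue;  // marker already present: no PUT issued
    }
    bucket.PutEmptyDir(parent_key);
  }
}
```

In this sketch, creating vendor/apple/red/ and then vendor/apple/green/ issues six PUTs without the check and four with it, because the two shared ancestors are written only once. The trade-off is one extra existence lookup per path component, and the check-then-create sequence is not atomic, so concurrent jobs may still occasionally write the same marker.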
TestS3FS.CreateDir fails even with a clean build (sigh):
➜ build ninja && ./debug/arrow-s3fs-test --gtest_filter="TestS3FS.CreateDir"
ninja: no work to do.
Running main() from /Users/haochengliu/Documents/projects/Arrow/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = TestS3FS.CreateDir
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestS3FS
[ RUN ] TestS3FS.CreateDir
/Users/haochengliu/Documents/projects/Arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:934: Failure
Failed
Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
[ FAILED ] TestS3FS.CreateDir (219 ms)
[----------] 1 test from TestS3FS (219 ms total)
Component(s)
C++