
AWS: Introduce opt-in S3LocationProvider which is optimized for S3 performance #11112

Open
ookumuso wants to merge 2 commits into main
Conversation

ookumuso
Contributor

The S3 team is introducing S3LocationProvider, a replacement for ObjectStoreLocationProvider that is better suited to the performance of Iceberg workloads running on Amazon S3. This configuration applies to Iceberg workloads using General Purpose and Directory Buckets in Amazon S3, and we expect it to further improve the throughput of Iceberg workloads. Although this change can benefit all Iceberg workloads, the degree of improvement may vary based on specific workload characteristics.

This implementation changes the hash scheme from 32-bit base64 to 24-bit base2. The reduced character range allows S3 to automatically scale request capacity more quickly to match the demands of the target workload, which reduces the amount of time that workloads observe throttle responses. A 24-bit hash allows for 2^24 possible prefixes.
This implementation also changes the path structure to reduce the number of directories created. The partition directories are eliminated, and the hash is moved to immediately before the data filename.
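
For illustration, a minimal sketch of the hashing described above (the Guava Murmur3 function mirrors what Iceberg's location providers already use; the class and method names are illustrative, not the PR's exact code):

    import java.nio.charset.StandardCharsets;

    import com.google.common.base.Strings;
    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    class Base2HashSketch {
      // Same Murmur3 hash family Iceberg's location providers use via Guava.
      private static final HashFunction HASH_FUNC = Hashing.murmur3_32_fixed();

      // Keep the low 24 bits of the 32-bit hash and render them as a
      // zero-padded binary string, e.g. "011101101010001111101000".
      static String computeHash(String fileName) {
        int hash = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8).asInt();
        int lower24Bits = hash & ((1 << 24) - 1);
        return Strings.padStart(Integer.toBinaryString(lower24Bits), 24, '0');
      }
    }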

Contributor

@amogh-jahagirdar left a comment


Left some comments, but I had a possibly naive question to just check my understanding:

In the past for object storage provider, we've used a wider character set in the hash portion of the file as a means to maximize entropy and ultimately improve heat distribution (#7128). With this new approach are we saying we can get a good enough heat distribution and at the same time enable s3 to scale capacity more quickly?


private String computeHash(String fileName) {
HashCode hashCode = HASH_FUNC.hashString(fileName, StandardCharsets.UTF_8);
int hash = hashCode.asInt();
Contributor

Nit: I think we could just inline hashCode.asInt() below when computing the binaryString.

Contributor Author

Works, updating

@ookumuso
Contributor Author

Left some comments, but I had a possibly naive question to just check my understanding:

In the past for object storage provider, we've used a wider character set in the hash portion of the file as a means to maximize entropy and ultimately improve heat distribution (#7128). With this new approach are we saying we can get a good enough heat distribution and at the same time enable s3 to scale capacity more quickly?

@amogh-jahagirdar
Yes, essentially both base64 and base2 distribute the traffic evenly, which is the most crucial thing. With auto-scaling they can both support high-TPS workloads. The main difference between them is how fast they auto-scale to the point where no more throttles are observed. Base2 is more effective there because each successive character in the hash carries ~50% of the remaining traffic, so each split roughly halves the load on a hot prefix.

* s3://my-bucket/my-table/data/011101101010001111101000-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
* </code>.
*/
public class S3LocationProvider implements LocationProvider {
Contributor

Why not just include this in the ObjectStoreLocationProvider as an option to choose between base2 and base32? It seems like we could include most of this in just the computeHash function. That might also allow for choosing whether to include partition context.

Contributor Author

That was our original plan, but with the partition values and other trailing directories removed, it turns into a very S3-specific location provider for the reasons mentioned in the comment above. I agree that if it were just the base2 optimization, it would probably fit into the ObjectStoreLocationProvider with one extra table property.

@danielcweeks
Contributor

@ookumuso Overall, this looks like a great feature if this is better for S3 to repartition and distribute data, but it also seems like it would fit cleanly into the existing ObjectStoreLocationProvider as opposed to a separate provider.

* </ol>
*
* The data file is placed immediately under the data location. Partition names are <b>not</b>
* included. The data filename is prefixed with a 24-character binary hash, which ensures that files
Contributor

@danielcweeks Sep 13, 2024

The 24-character prefix and the lack of a directory marker (/) are a little problematic for older clients (like HadoopFS) and procedures like orphan files, because anything that needs to list that directory will run into problems with the key space. We already have an issue with this with the current provider, but this would compound that problem since there's no hierarchy.

This would make it a little easier to break up prefix-separated ranges for listing (especially due to the low character limit), but that's something we would need to look into on the orphan files action.

Contributor Author

Is the problem coming from the delimited list, or is it some type of parallelization issue due to the lack of directories? I was thinking that not having the directories essentially removes the requirement for discovering them, and it can just be a flat list to discover everything under the write.data.path.

Contributor

The issue is having too much cardinality for FS implementations that think in terms of directory structure. By breaking this up we can address both scenarios (file-system based and object store listing) at the same time.

@ookumuso
Contributor Author

ookumuso commented Sep 17, 2024

@ookumuso Overall, this looks like a great feature if this is better for S3 to repartition and distribute data, but it also seems like it would fit cleanly into the existing ObjectStoreLocationProvider as opposed to a separate provider.

Thanks for the review @danielcweeks! See my comment here regarding why we chose to go with a separate provider. It was essentially done to capture optimizations for all bucket types in one place without having an impact on the existing default ObjectStoreLocationProvider.

@ookumuso
Contributor Author

ookumuso commented Sep 23, 2024

@ookumuso Overall, this looks like a great feature if this is better for S3 to repartition and distribute data, but it also seems like it would fit cleanly into the existing ObjectStoreLocationProvider as opposed to a separate provider.

Thanks for the review Daniel! See my comment here regarding why we chose to go with a separate provider. It was essentially done to capture optimizations for all bucket types in one place without having an impact on the existing default ObjectStoreLocationProvider.

@danielcweeks let me know what you think about this. One alternative is to maybe provide both as in having S3LocationProvider as is and a base2 option for the ObjectStoreLocationProvider to keep the option for partition values along with directories

@danielcweeks
Contributor

@danielcweeks let me know what you think about this. One alternative is to maybe provide both as in having S3LocationProvider as is and a base2 option for the ObjectStoreLocationProvider to keep the option for partition values along with directories

@ookumuso I'm still of the opinion that we should just incorporate this into the existing object store location provider. Ultimately, we can change the hashing behavior as the existing behavior is based on earlier discussions with S3, so replacing it or providing alternative options is fine.

I do think partition values in the path still need to be an option.

I think we also need to consider adding a path separator to help with some of the maintenance routines. For example s3://bucket/<prefix>/<db>/<table>/data/00101001/0010100111011.... The / in the hash portion allows for 2^8 prefixes to allow listing under those paths for clients that list. If we omit that, we get 2^24 prefixes that need to be listed under the data directory, which makes maintenance difficult.

@ookumuso
Contributor Author

@danielcweeks let me know what you think about this. One alternative is to maybe provide both as in having S3LocationProvider as is and a base2 option for the ObjectStoreLocationProvider to keep the option for partition values along with directories

@ookumuso I'm still of the opinion that we should just incorporate this into the existing object store location provider. Ultimately, we can change the hashing behavior as the existing behavior is based on earlier discussions with S3, so replacing it or providing alternative options is fine.

I do think partition values in the path still need to be an option.

I think we also need to consider adding a path separator to help with some of the maintenance routines. For example s3://bucket/<prefix>/<db>/<table>/data/00101001/0010100111011.... The / in the hash portion allows for 2^8 prefixes to allow listing under those paths for clients that list. If we omit that, we get 2^24 prefixes that need to be listed under the data directory, which makes maintenance difficult.

@danielcweeks Understood, thanks for the feedback Daniel. It looks like it is not going to be feasible for us to remove the / delimiter. With that, I agree that it doesn't make sense to offer a separate provider. I think it would be better if we update the existing provider to use base2 directly instead of making it opt-in, which can simplify onboarding for users since they see the auto-scaling improvement right away. @jackye1995 Are you okay with switching ObjectStoreLocationProvider to use base2 as the default as well? I can potentially cut the 24 bits down to 20 or even 16 to reduce the length a bit.

For partition values, I can probably send a separate follow-up adding an option to remove them from the file name via a new table property so callers can decide, but I am planning to exclude that for now.

@jackye1995
Contributor

Sorry for the late review, was busy with some internal work...

it also seems like it would fit cleanly into the existing ObjectStoreLocationProvider as opposed to a separate provider.

+1 for that. I think the main concern was that this seems to be too S3-specific, but if there is no major concern from others I am good with changing the ObjectStoreLocationProvider directly; that actually helps adoption.

I think we also need to consider adding a path separator to help with some of the maintenance routines.

I reviewed the internal experiments related to this, and I believe that both having no path separators and having at least 24 bits are important based on load testing. If the concern is compatibility with other systems, would it make sense to only use this approach for locations that begin with s3:// or s3a://?

@ookumuso
Contributor Author

ookumuso commented Oct 1, 2024

@danielcweeks @amogh-jahagirdar @jackye1995 As discussed above, I made the change to update the existing ObjectStoreLocationProvider to use base2 entropy and also added a new config to omit the partition values. I set it to false by default to make it backwards compatible, but let me know whether you prefer that to be true to prevent customers from relying on partition values in the file path.

I kept the "/" delimiter in place due to the concerns raised by @danielcweeks, but I would like to understand, as a separate thread to potentially follow up on in another PR, what it would take to remove it. The orphan cleanup path was one thing identified, but I am curious whether there is anything else we need to consider to make it a viable change.

Contributor

@jackye1995 left a comment

Looks good to me. I think there are 2 pending topics, and I am approving since I am good with the current way it is written:

  1. should we do it for all paths, or just limit to s3:// and s3a:// paths: I think given that even the documentation of this is in AWS, it is probably fine to just apply this strategy for all users of the object storage location provider.
  2. should we add / in the middle of the binary hash values: that would mitigate the directory listing issue, but will impact S3's ability to scale up due to the additional slashes in the middle. My understanding is that the directory listing concern could be resolved if we move all HDFS-based listing to use prefix-based listing, and the team can take that as a follow-up item.

Any further thoughts on the latest approach and the 2 points above? @danielcweeks @amogh-jahagirdar @nastra

@jackye1995
Contributor

Any additional comments? @amogh-jahagirdar @nastra @danielcweeks

@danielcweeks
Contributor

Looks good to me. I think there are 2 pending topics, and I am approving since I am good with the current way it is written:

  1. should we do it for all paths, or just limit to s3:// and s3a:// paths: I think given that even the documentation of this is in AWS, it is probably fine to just apply this strategy for all users of the object storage location provider.

I think we should treat all of the prefixes the same (e.g. s3, s3n, s3a). These are all just aliases to map different implementations to, but the behavior should be the same.

  2. should we add / in the middle of the binary hash values: that would mitigate the directory listing issue, but will impact S3's ability to scale up due to the additional slashes in the middle. My understanding is that the directory listing concern could be resolved if we move all HDFS-based listing to use prefix-based listing, and the team can take that as a follow-up item.

I think we should add a single / to break up the prefix as I noted in my comment above. I doubt that a single slash is going to have a major impact on S3's ability to partition effectively. My understanding is that S3 doesn't treat the / character any differently than any other character in determining heat/partitioning, and since it's a single static value, it would probably be skipped over.

That slash is important for directory-based file system implementations, as omitting it would result in effectively listing the full contents of the table and would likely break clients because there would be too many results to hold in memory.

@jackye1995
Contributor

That slash is important for directory-based file system implementations, as omitting it would result in effectively listing the full contents of the table and would likely break clients because there would be too many results to hold in memory.

when you say "directory based file system implementations", do you mean implementation of catalog, or FileIO?

And what is the use case for this? I thought the only place we are doing something like this is in orphan file removal, but for that we are just iterating through the directories to find matching files; we hold all the matching files in memory, not everything.

@danielcweeks
Contributor

when you say "directory based file system implementations", do you mean implementation of catalog, or FileIO?

This is for FileIO, not catalog related.

And what is the use case for this? I thought the only place we are doing something like this is in orphan file removal, but for that we are just iterating through the directories to find matching files; we hold all the matching files in memory, not everything.

This is related to orphan file removal. The hashing is all performed at the top level of the data directory. That means that listing that directory using an FS-based FileIO could involve 2^24 subdirs. Adding a slash within the hash allows that to be broken up and listed in a distributed way to build the dataset used by Spark for the full listing.

@jackye1995
Contributor

jackye1995 commented Oct 7, 2024

The hashing is all performed at the top level of the data directory. That means that listing that directory using an FS-based FileIO could involve 2^24 subdirs.

I see. But I think this is already a problem today with the 64^6 (2^36) subdirs, and it's actually reducing the number of sub-directories with the new strategy?

I looked at the orphan files removal code more closely; it looks like the issue is here, where it forces listing the entire directory and then checks whether the sub-directory count exceeds the maximum.

And to properly fix this, my understanding is that we should switch to prefix-based listing, so that if the FileIO supports prefix listing (SupportsPrefixOperations), that API is used rather than the HDFS FileSystem. Would you agree this is the right way to fix this specific issue? I think it's better than trying to fit S3 into the HDFS usage pattern. @danielcweeks
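
As a rough sketch of that direction (assuming Iceberg's SupportsPrefixOperations interface; the helper below and its wiring into the orphan files action are illustrative only, not the current code path):

    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.FileInfo;
    import org.apache.iceberg.io.SupportsPrefixOperations;

    class PrefixListingSketch {
      // Flat listing under the data location: no directory recursion and no
      // driver-side sub-directory limits involved.
      static Iterable<FileInfo> listDataFiles(FileIO io, String dataLocation) {
        if (io instanceof SupportsPrefixOperations) {
          return ((SupportsPrefixOperations) io).listPrefix(dataLocation);
        }
        // Otherwise the caller would fall back to FileSystem-based recursive listing.
        throw new UnsupportedOperationException("FileIO does not support prefix listing");
      }
    }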

@jackye1995
Contributor

jackye1995 commented Oct 7, 2024

Adding a slash within the hash allows that to be broken up and listed in a distributed way to build the dataset used by Spark for the full listing.

Btw, another way to solve the problem, just to brainstorm here:

My understanding from the S3 side (@ookumuso correct me if I am wrong) is that the ideal case, which also works better with the new S3 directory buckets, is that all the files are in the same directory; we don't even want a / at the end of the hash prefix. So files would be stored in places like /data/0101010101010101-file-name.parquet

Would this be a better strategy? If we store files like this, it also does not create the too-many-nested-directories problem during orphan file cleanup that you described. For the code pointer above:

      for (FileStatus file : fs.listStatus(path, pathFilter)) {
        if (file.isDirectory()) {
          subDirs.add(file.getPath().toString());
        } else if (file.isFile() && predicate.test(file)) {
          matchingFiles.add(file.getPath().toString());
        }
      }

      // stop listing if the number of direct sub dirs is bigger than maxDirectSubDirs
      if (subDirs.size() > maxDirectSubDirs) {
        remainingSubDirs.addAll(subDirs);
        return;
      }

This will just go through all files and store the matching files in a single call because there are no nested directories under /data/, and I don't see any risk of a memory issue (assuming the matching files don't blow up memory).

@danielcweeks
Contributor

danielcweeks commented Oct 7, 2024

The hashing is all performed at the top level of the data directory. That means that listing that directory using an FS-based FileIO could involve 2^24 subdirs.

I see. But I think this is already a problem today with the 64^6 (2^36) subdirs, and it's actually reducing the number of sub-directories with the new strategy?

It's a problem with both strategies. Even if we weren't changing the hashing, we would still want to introduce a path separator (though likely at a different level) for the current strategy. Since we're considering changing the hash, we should fix this as well.

I looked at the orphan files removal code more closely; it looks like the issue is here, where it forces listing the entire directory and then checks whether the sub-directory count exceeds the maximum.

My understanding is that the current approach does an initial top-level listing of the data dir and then farms subsequent listings out to executors. This would allow for parallel listing in a more reasonable way without running into limits. I believe this might already work with the default depths (though we should double-check that).

And to properly fix this, my understanding is that we should switch to prefix-based listing, so that if the FileIO supports prefix listing (SupportsPrefixOperations), that API is used rather than the HDFS FileSystem. Would you agree this is the right way to fix this specific issue? I think it's better than trying to fit S3 into the HDFS usage pattern. @danielcweeks

I think we need to try to accommodate both since there's a fair amount of usage of S3A. I believe we can do a much better job by splitting the prefix range up for parallel listings with any FileIO that supports prefix listing, but the current implementation relies on S3A and is built around that behavior. My point is we can accommodate both reasonably, but I agree there's a better solution with prefix listing.

@jackye1995
Contributor

My understanding is that the current approach does an initial top-level listing of the data dir and then farms subsequent listings out to executors. This would allow for parallel listing in a more reasonable way without running into limits. I believe this might already work with the default depths (though we should double-check that).

I think the code does not do what you describe; please correct me if I am wrong or looking at the wrong place:

      for (FileStatus file : fs.listStatus(path, pathFilter)) {
        if (file.isDirectory()) {
          subDirs.add(file.getPath().toString());
        } else if (file.isFile() && predicate.test(file)) {
          matchingFiles.add(file.getPath().toString());
        }
      }

      // stop listing if the number of direct sub dirs is bigger than maxDirectSubDirs
      if (subDirs.size() > maxDirectSubDirs) {
        remainingSubDirs.addAll(subDirs);
        return;
      }

Because it must first finish the fs.listStatus call to produce the full list of subDirs before checking against maxDirectSubDirs, you will end up listing all the random prefixes before being able to terminate the call, and this happens before farming out subsequent calls to any executor.

And the files in a table are listed based on this code:

    // list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have
    // less than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver
    listDirRecursively(
        location,
        predicate,
        hadoopConf.value(),
        MAX_DRIVER_LISTING_DEPTH,
        MAX_DRIVER_LISTING_DIRECT_SUB_DIRS,
        subDirs,
        pathFilter,
        matchingFiles);

which starts listing at the table root location and lists up to depth 3; that will cover listing the /data/ directory on the driver since it is only 1 level under the root location. And I don't think adjusting the listing depth or the direct sub-dirs limit would fix the issue. But overall I think we all agree that we have a problem today with doing orphan file removal using S3A with the object storage layout.

Since we all agree that there is a better path forward using prefix-listing, and the S3 team is planning to work on that, should we still add the slash in the middle of the hash? It feels like a pretty hacky fix. I will let @ookumuso and @drewschleit chime in here, what I heard was that having the middle slash would impact throttling handling, and is also a bad practice for directory buckets as it creates more sparse directories. Even if we want to do the middle slash fix, I think we should add it with a config which is disabled by default, and focus on the right way to fix it.

@ookumuso
Contributor Author

ookumuso commented Oct 8, 2024

Since we all agree that there is a better path forward using prefix-listing, and the S3 team is planning to work on that, should we still add the slash in the middle of the hash? It feels like a pretty hacky fix. I will let @ookumuso and @drewschleit chime in here, what I heard was that having the middle slash would impact throttling handling, and is also a bad practice for directory buckets as it creates more sparse directories. Even if we want to do the middle slash fix, I think we should add it with a config which is disabled by default, and focus on the right way to fix it.

For S3 General Purpose bucket scaling, we mainly care about the entropy and not the directories so where "/" is doesn't really matter that much. This is a concern for directory buckets due to sparse directories, as in too many directories with very few files under each, since directories are first-class citizens there and there is no prefix-based scaling concern.

Moving forward with the base2 change even as is with the "/" in the end is crucial today for S3 performance so I would like to tackle these concerns separately. It looks like with the base2 change we are removing a bunch of work from the driver, since we are reducing the directory count from 64^6 (2^36) to 2^20, which is a significant drop.

Prefix listing is a good idea, but I think it might be limited in its effectiveness against directory buckets since there is no native API to list an arbitrary prefix there. See the comment here, particularly: "Directory buckets - For directory buckets, only prefixes that end in a delimiter (/) are supported."

Given there are complications there, I would like to get your take on moving forward with this, unless you think it makes things worse, since this is important for S3 auto-scaling in general.
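
For reference, a minimal sketch of what prefix listing looks like at the S3 API level (bucket and key prefix below are placeholders); the delimiter restriction quoted above is what limits this for directory buckets:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

    class S3PrefixListSketch {
      static void listUnderDataLocation(S3Client s3) {
        // General purpose buckets accept any key prefix; directory buckets only
        // accept prefixes that end in the "/" delimiter.
        ListObjectsV2Request request =
            ListObjectsV2Request.builder()
                .bucket("my-bucket")
                .prefix("my-table/data/")
                .build();
        s3.listObjectsV2Paginator(request)
            .contents()
            .forEach(obj -> System.out.println(obj.key()));
      }
    }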

@jackye1995
Contributor

we mainly care about the entropy and not the directories so where "/" is doesn't really matter that much.

Moving forward with the base2 change even as is with the "/" in the end is crucial today for S3 performance so I would like to tackle these concerns separately.

@ookumuso thanks for the quick response. Just to confirm we are on the same page with these comments: what @danielcweeks is suggesting is to have the "/" also in the middle, e.g. "/data/01010101/01010101/file-name.parquet", not just at the end as you said later, e.g. "/data/0101010101010101/file-name.parquet".

For the specific suggestion to have the "/" also in the middle, does it matter or not matter in the entropy design?

@ookumuso
Contributor Author

ookumuso commented Oct 8, 2024

we mainly care about the entropy and not the directories so where "/" is doesn't really matter that much.

Moving forward with the base2 change even as is with the "/" in the end is crucial today for S3 performance so I would like to tackle these concerns separately.

@ookumuso thanks for the quick response. Just to confirm we are on the same page with these comments: what @danielcweeks is suggesting is to have the "/" also in the middle, e.g. "/data/01010101/01010101/file-name.parquet", not just at the end as you said later, e.g. "/data/0101010101010101/file-name.parquet".

For the specific suggestion to have the "/" also in the middle, does it matter or not matter in the entropy design?

Yeah, that does not matter for the entropy design; auto-scaling behavior will be the same on general purpose buckets. For directory buckets, I guess it is the same amount of sparseness since it does not change the total directory count. I just agree that it looks a bit hacky overall.

@jackye1995
Contributor

Okay in that case at least I feel much more comfortable with adding slashes in the middle, thanks for confirming.

Looking at the orphan file delete logic, if we really want to fully take advantage of how the recursive + distributed file listing is implemented today, it seems like the right way to do the hashing path is to add not just 1 slash but multiple slashes, something like "/data/0101/0101/0101/0101/..../file-name.parquet", so that it gives enough directory depth out of the box and everything further nested is sent to executors for further listing. What do you think? @danielcweeks

And would that still be fine with S3 scaling, or is a single slash better? @ookumuso

@ookumuso
Contributor Author

ookumuso commented Oct 8, 2024

And would that still be fine with S3 scaling, or is a single slash better? @ookumuso

Probably not, but I will dig in a bit more and confirm this tomorrow just to be sure.

Looking at the orphan file delete logic, if we really want to fully take advantage of how the recursive + distributed file listing is implemented today, it seems like the right way to do the hashing path is to add not just 1 slash but multiple slashes, something like "/data/0101/0101/0101/0101/..../file-name.parquet", so that it gives enough directory depth out of the box and everything further nested is sent to executors for further listing. What do you think? @danielcweeks

Is there an optimal number of directories and depth? Maybe we can just create those and put the rest of the entropy into the fileName. For example: /data/010/001/100/01010101001-file-name.parquet. This can both reduce the sparse directory problem and help with orphan clean-up?

@jackye1995
Contributor

Is there an optimal number of directories and depth?

The hard-coded default is 3. Usually, there is a /data/ folder which takes 1 level of depth, but that might not always be the case. So if we stick to that default, then we should ensure 3 levels of depth.

put the rest of the entropy into the fileName

When write.object-storage.partitioned-path=true, that will create paths like /data/010/001/100/01010101001-key=val/file-name.parquet, which will be a bit awkward.

We should potentially introduce a separator concept, write.object-storage.separator, that defaults to slash and can be set to other values like -, producing paths like /data/010/001/100/01010101001-key=val-file-name.parquet, which seems like the most ideal solution.

@ookumuso
Contributor Author

ookumuso commented Oct 8, 2024

What do you think about this approach @danielcweeks:

Is there an optimal number of directories and depth? Maybe we can just create those and put the rest of the entropy into the fileName. For example: /data/010/001/100/01010101001-file-name.parquet. This can both reduce the sparse directory problem and help with orphan clean-up?

It might be a nice win-win, both solving the sparse directory problem and helping orphan clean-up. As @jackye1995 mentioned, it is a bit weird for partitioned paths, but we can do it for non-partitioned paths only, so it would look like:

  • partitioned-path=true: /data/010/001/100/01010101001/key=val/file-name.parquet
  • partitioned-path=false: /data/010/001/100/01010101001-file-name.parquet

@danielcweeks
Contributor

What do you think about this approach @danielcweeks:

Is there an optimal number of directories and depth? Maybe we can just create those and put the rest of the entropy into the fileName. For example: /data/010/001/100/01010101001-file-name.parquet. This can both reduce the sparse directory problem and help with orphan clean-up?

It might be a nice win-win, both solving the sparse directory problem and helping orphan clean-up. As @jackye1995 mentioned, it is a bit weird for partitioned paths, but we can do it for non-partitioned paths only, so it would look like:

  • partitioned-path=true: /data/010/001/100/01010101001/key=val/file-name.parquet
  • partitioned-path=false: /data/010/001/100/01010101001-file-name.parquet

I like removing the additional path when partitioned-path=false (that's a great idea).

I'm just wondering about where we want to put the slashes in the bit field. Breaking it up by three bits means that each recursive listing only operates on 8 subpaths, which feels too small. I feel like we might want to move to four bits (i.e. /data/0100/0110/0010/101010010110). This leaves the leaf directories with a reasonable 2^12 paths and each parent with 16, which feels like enough to parallelize at some level.

@ookumuso
Contributor Author

Sounds good @danielcweeks, I will work on this change and update the PR!

@ookumuso
Contributor Author

@danielcweeks @jackye1995 Updated the change to divide the entropy into dirs, so we now follow this format:

partitioned-path=true: /data/0100/0110/0010/10101001/key=val/file-name.parquet
partitioned-path=false: /data/0100/0110/0010/10101001-file-name.parquet

Set the dir depth to 3 and dir length to 4 as discussed.
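
For readers skimming the thread, a rough sketch of how the entropy string gets split under that layout (the constants and method name are illustrative, not necessarily the PR's exact code):

    class EntropyDirSketch {
      private static final int ENTROPY_DIR_DEPTH = 3;
      private static final int ENTROPY_DIR_LENGTH = 4;

      // e.g. "01000110001010101001" -> "0100/0110/0010/10101001"
      static String dividedHash(String entropy) {
        StringBuilder path = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < ENTROPY_DIR_DEPTH; i++) {
          path.append(entropy, pos, pos + ENTROPY_DIR_LENGTH).append('/');
          pos += ENTROPY_DIR_LENGTH;
        }
        // The remaining bits stay attached to the file name
        // (or become the last directory when partitioned-path=true).
        path.append(entropy.substring(pos));
        return path.toString();
      }
    }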
