High performance cloud object storage (for reading chunked multi-dimensional arrays)

Some notes (in no particular order) about speeding up reading many chunks of data from cloud storage.

## General info about cloud storage buckets

In general, cloud storage buckets are highly distributed storage systems. They can be capable of very high bandwidth. But - crucially - they also have _very_ high latencies of 100-200 milliseconds for first-byte-out.

Contrast this with read latencies of locally attached storage: HDD: ~5 ms; SSD: ~80 µs; DDR5 RAM: 90 ns. In other words: cloud storage latencies are two orders of magnitude higher than a local HDD; and cloud storage latencies are _four_ orders of magnitude higher than a local SSD!

## How many IOPS can we get from cloud storage to a single machine?

[AWS docs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html) say:

> Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates [although the scaling takes time]. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second... 

> Some data lake applications on Amazon S3 scan millions or billions of objects for queries that run over petabytes of data. These data lake applications achieve single-instance transfer rates that maximize the network interface use for their [Amazon EC2](https://docs.aws.amazon.com/ec2/index.html) instance, which can be up to 100 Gb/s on a single instance. These applications then aggregate throughput across multiple instances to get multiple terabits per second.

(Also see [this Stack Overflow discusion](https://stackoverflow.com/questions/52443839/s3-what-exactly-is-a-prefix-and-what-ratelimits-apply/52445252))

So... it _sounds_ like we could achieve arbitrarily high IOPS by using lots of prefixes.

How fast can a single VM go? A 100 Gbps NIC could submit 20 _million_ GET requests per second (assuming each GET request is 500 bytes). And a 100 Gbps NIC could receive 2 million 4 kB chunks per second (assuming a 20% overhead, so each request is 5 kB); or 2,000 x 4 MB chunks per second.

To hit 2 million IOPS, you'd need to evenly distribute your reads across 364 prefixes.

2 million requests per second per machine is a _lot_! It wasn't that long ago that it was considered very impressive if a single HTTP server could handle 10,000 RPS. I'm not even sure if 2 million RPS is actually possible! Here's [a benchmark](https://www.haproxy.com/blog/haproxy-forwards-over-2-million-http-requests-per-second-on-a-single-aws-arm-instance) of a proxy server handling 2 million RPS, on a single machine.

I'd assume that, in order for the operating system to handle anything close to 2 million requests per second, we'd have to submit our GET requests using io_uring.

## Some "tricks" that LSIO could use to improve read bandwidth from cloud storage buckets to a single VM:

* Submit many GET requests concurrently (already planned)
* GET large files in multiple "parts" (using the HTTP `Range` header, or `partNumber`), and download the parts in parallel (already planned)
* [Stripe data](https://en.wikipedia.org/wiki/Data_striping) across many prefixes (like [RAID 0](https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_0)). (LSIO could act just like a RAID controller: LSIO's user would read and write single files. Under the hood, LSIO would stripe that data over a use-configurable number of prefixes. The [RAID chunk size](https://raid.wiki.kernel.org/index.php/RAID_setup#Chunk_sizes) would be user-configurable. Or, perhaps, when the user submits writes as a sequence of arbitrary-sized chunks (like Zarr) then LSIO could split those chunks across multiple S3 prefixes. And the docs could tell users that chunks which are often read together should be close together in the submitted sequence of chunks. That works for striping. And works for merging reads into sequential reads for HDDs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

High performance cloud object storage (for reading chunked multi-dimensional arrays) #10

General info about cloud storage buckets

How many IOPS can we get from cloud storage to a single machine?

Some "tricks" that LSIO could use to improve read bandwidth from cloud storage buckets to a single VM:

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

High performance cloud object storage (for reading chunked multi-dimensional arrays) #10

Description

General info about cloud storage buckets

How many IOPS can we get from cloud storage to a single machine?

Some "tricks" that LSIO could use to improve read bandwidth from cloud storage buckets to a single VM:

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions