-
**Quick fix: Split the dataset & have one manifest file per split.** For example, have one manifest file per NWP initialisation datetime, per variable, or per forecast step.

This is easy to do. And it "works". So perhaps this is the right approach for an MVP? (Although I'd love to hear opinions!) But this approach is a little unsatisfactory: it fails to achieve the "dream" of lazily opening the entirety of a multi-year NWP dataset with a single line of Python.
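As a rough illustration of the per-split idea, here is a minimal sketch that assumes the dataset is split by initialisation datetime and that each manifest is a Parquet file (both are assumptions for illustration, not hypergrib's actual layout):

```python
import datetime

import pandas as pd  # assumed; any table reader would do


def manifest_path(root: str, init: datetime.datetime) -> str:
    """One manifest file per NWP initialisation datetime (hypothetical layout)."""
    return f"{root}/manifests/{init:%Y%m%dT%H}.parquet"


def load_manifest(root: str, init: datetime.datetime) -> pd.DataFrame:
    # The user only downloads the manifest covering the init time they care
    # about, not the 100-GB-per-year manifest for the whole dataset.
    return pd.read_parquet(manifest_path(root, init))
```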
-
**Lazily open the manifest.** Instead of expecting the end-user to download a many-gigabyte "manifest file", the user could request just the parts of the manifest that they want, when they need them. This introduces complexity and latency, but it would allow us to get back to the "dream" of enabling users to lazily open the entirety of a multi-year NWP dataset (with hundreds of billions of chunks) with a single line of Python.

@emfdavid has had great success using Google Cloud BigQuery to store a manifest (although at a smaller scale). Ideally the manifest would also store information such as when the NWP's horizontal resolution changes, etc.
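That single line of Python might look something like the sketch below. This is purely illustrative: the `engine="hypergrib"` entrypoint and the dataset URL are assumptions, not an existing API.

```python
import xarray as xr

# Hypothetical: lazily open a multi-year NWP dataset with hundreds of
# billions of chunks. Only the manifest entries (and GRIB bytes) that are
# actually accessed would be fetched from the cloud.
ds = xr.open_dataset("s3://some-bucket/ecmwf-ens/", engine="hypergrib")
```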
-
**Algorithmically compute the path, offset, and size of each GRIB message.**

TL;DR: For the hypergrib MVP, we could algorithmically compute the paths of the GRIB files, and load the relevant `.idx` files to find the byte offsets of each GRIB message.

All the ideas above assume that the manifest is a mapping from a key to a value. The key in our case is a 5-tuple consisting of the NWP initialisation datetime, forecast step, variable, vertical level, and ensemble member. The value is a 3-tuple consisting of the path of the GRIB file, the byte offset to the GRIB message, and the message's length. This mapping could be stored in memory as a hashmap, or in a cloud database. Each GRIB message is a single "row" in this mapping. A big problem is that we could easily end up with trillions of rows for a single NWP dataset. But what if we could compute the path, offset, and message length algorithmically?

**Compute the path.** It's easy to see how this is possible for the path, because GRIB filenames usually follow a well-behaved naming pattern.

**Compute the byte offset and length of each GRIB message.** We could probably compute the offset and message length for uncompressed GRIB files. But most GRIBs are compressed, so GRIB messages are variable-length, which means we can't compute the offset and length: we have to store them somewhere.

**A solution.** A slow but cheap solution could be that we:

- algorithmically compute the path of each GRIB file (and of its accompanying `.idx` file);
- download the relevant `.idx` files on demand and parse them to find the byte offset and length of each GRIB message;
- issue byte-range requests for just the GRIB messages we need (see the sketch below).

This solution might not even be that slow if we're careful to keep hundreds of network requests in flight at any moment, and have a sensible async processing pipeline. I think this might actually be my preferred approach for the MVP. Thoughts?
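To make this concrete, here's a rough sketch of those steps. The GEFS-like path template and the `.idx` line format shown here are assumptions for illustration (each dataset needs its own template), and `fsspec` is assumed to be available for remote byte-range reads:

```python
import datetime

import fsspec  # assumed available for reading from object storage


def grib_path(base: str, init: datetime.datetime, member: int, step_hours: int) -> str:
    """Build a GRIB path from the key, using a made-up GEFS-like naming pattern."""
    return (
        f"{base}/gefs.{init:%Y%m%d}/{init:%H}/"
        f"gep{member:02d}.t{init:%H}z.pgrb2a.0p50.f{step_hours:03d}"
    )


def parse_idx(idx_text: str) -> list[dict]:
    """Parse a wgrib2-style .idx file into one row per GRIB message.

    Each line looks like "1:0:d=2017010100:HGT:10 mb:6 hour fcst:ENS=+2".
    A message's length is the next message's offset minus its own offset;
    the final message's length stays None unless we also know the file size.
    """
    rows = []
    for line in idx_text.strip().splitlines():
        _num, offset, _date, var, level, *_rest = line.split(":")
        rows.append({"var": var, "level": level, "offset": int(offset), "length": None})
    for row, next_row in zip(rows, rows[1:]):
        row["length"] = next_row["offset"] - row["offset"]
    return rows


def read_message(grib_url: str, offset: int, length: int) -> bytes:
    """Fetch a single GRIB message with a byte-range read."""
    with fsspec.open(grib_url, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

In practice the `.idx` fetches and GRIB byte-range reads would be issued concurrently (hundreds in flight at once, as described above), rather than one at a time as in this sketch.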
-
A GRIB dataset of the ECMWF ENS forecast consists of about 10 billion GRIB messages per year¹! (See calculations.)

So, if the manifest were stored in a file, and if the manifest used only 10 bytes per GRIB message, the manifest file would be 100 gigabytes per year of NWP data! That is far too big! (`hypergrib`'s ultimate aim is that users should be able to lazily open a huge NWP dataset from a laptop.)

(This discussion is a continuation of a conversation that started over at mpiannucci/gribberish#41 (comment).)
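For reference, the back-of-envelope arithmetic behind that 100-gigabyte figure:

```python
messages_per_year = 10_000_000_000  # ~10 billion GRIB messages per year
bytes_per_manifest_row = 10         # an optimistic lower bound per message
print(messages_per_year * bytes_per_manifest_row / 1e9)  # 100.0 GB per year
```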
Footnotes
1. As pointed out by @emfdavid in this comment.