-
**Quick fix: Split the dataset & have one manifest file per split.** For example, have one manifest file per NWP initialisation datetime, per variable, or per forecast step.

This is easy to do. And it "works". So perhaps this is the right approach for an MVP? (Although I'd love to hear opinions!) But this approach is a little unsatisfactory: it fails to achieve the "dream" of lazily opening the entirety of a multi-year NWP dataset with a single line of Python.
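As a rough illustration of the per-split idea, here is a minimal sketch that assumes the dataset is split by initialisation datetime and that each manifest is a Parquet file (both are assumptions for illustration, not hypergrib's actual layout):

```python
import datetime

import pandas as pd  # assumed; any table reader would do


def manifest_path(root: str, init: datetime.datetime) -> str:
    """One manifest file per NWP initialisation datetime (hypothetical layout)."""
    return f"{root}/manifests/{init:%Y%m%dT%H}.parquet"


def load_manifest(root: str, init: datetime.datetime) -> pd.DataFrame:
    # The user only downloads the manifest covering the init time they care
    # about, not the 100-GB-per-year manifest for the whole dataset.
    return pd.read_parquet(manifest_path(root, init))
```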
-
**Lazily open the manifest.** Instead of expecting the end-user to download a many-gigabyte "manifest file", the user could request just the parts of the manifest that they want, when they need them. This introduces complexity and latency, but it would allow us to get back to the "dream" of enabling users to lazily open the entirety of a multi-year NWP dataset (with hundreds of billions of chunks) with a single line of Python.

@emfdavid has had great success using Google Cloud BigQuery to store a manifest (although at a smaller scale). Ideally the manifest would also store information such as when the NWP's horizontal resolution changes, etc.
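That single line of Python might look something like the sketch below. This is purely illustrative: the `engine="hypergrib"` entrypoint and the dataset URL are assumptions, not an existing API.

```python
import xarray as xr

# Hypothetical: lazily open a multi-year NWP dataset with hundreds of
# billions of chunks. Only the manifest entries (and GRIB bytes) that are
# actually accessed would be fetched from the cloud.
ds = xr.open_dataset("s3://some-bucket/ecmwf-ens/", engine="hypergrib")
```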
-
**Algorithmically compute the path, offset, and size of each GRIB message.**

TL;DR: For the hypergrib MVP, we could algorithmically compute the paths of the GRIB files, and load the relevant `.idx` files to find the byte offsets of each GRIB message.

All the ideas above assume that the manifest is a mapping from a key to a value. The key in our case is a 5-tuple consisting of the NWP initialisation datetime, forecast step, variable, vertical level, and ensemble member. The value is a 3-tuple consisting of the path of the GRIB file, the byte offset to the GRIB message, and the message's length. This mapping could be stored in memory as a hashmap, or in a cloud database. Each GRIB message is a single "row" in this mapping. A big problem is that we could easily end up with trillions of rows for a single NWP dataset. But what if we could compute the path, offset, and message length algorithmically?

**Compute the path.** It's easy to see how this is possible for the path, because GRIB filenames usually follow a well-behaved naming pattern.

**Compute the byte offset and length of each GRIB message.** We could probably compute the offset and message length for uncompressed GRIB files. But most GRIBs are compressed, so GRIB messages are variable-length, which means we can't compute the offset and length: we have to store them somewhere.

**A solution.** A slow but cheap solution could be that we:

- algorithmically compute the path of each GRIB file (and of its accompanying `.idx` file);
- download the relevant `.idx` files on demand and parse them to find the byte offset and length of each GRIB message;
- issue byte-range requests for just the GRIB messages we need (see the sketch below).

This solution might not even be that slow if we're careful to keep hundreds of network requests in flight at any moment, and have a sensible async processing pipeline. I think this might actually be my preferred approach for the MVP. Thoughts?
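To make this concrete, here's a rough sketch of those steps. The GEFS-like path template and the `.idx` line format shown here are assumptions for illustration (each dataset needs its own template), and `fsspec` is assumed to be available for remote byte-range reads:

```python
import datetime

import fsspec  # assumed available for reading from object storage


def grib_path(base: str, init: datetime.datetime, member: int, step_hours: int) -> str:
    """Build a GRIB path from the key, using a made-up GEFS-like naming pattern."""
    return (
        f"{base}/gefs.{init:%Y%m%d}/{init:%H}/"
        f"gep{member:02d}.t{init:%H}z.pgrb2a.0p50.f{step_hours:03d}"
    )


def parse_idx(idx_text: str) -> list[dict]:
    """Parse a wgrib2-style .idx file into one row per GRIB message.

    Each line looks like "1:0:d=2017010100:HGT:10 mb:6 hour fcst:ENS=+2".
    A message's length is the next message's offset minus its own offset;
    the final message's length stays None unless we also know the file size.
    """
    rows = []
    for line in idx_text.strip().splitlines():
        _num, offset, _date, var, level, *_rest = line.split(":")
        rows.append({"var": var, "level": level, "offset": int(offset), "length": None})
    for row, next_row in zip(rows, rows[1:]):
        row["length"] = next_row["offset"] - row["offset"]
    return rows


def read_message(grib_url: str, offset: int, length: int) -> bytes:
    """Fetch a single GRIB message with a byte-range read."""
    with fsspec.open(grib_url, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

In practice the `.idx` fetches and GRIB byte-range reads would be issued concurrently (hundreds in flight at once, as described above), rather than one at a time as in this sketch.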
-
A GRIB dataset of the ECMWF ENS forecast consists of about 10 billion GRIB messages per year¹! (See calculations.)

So, if the manifest were stored in a file, and if the manifest used only 10 bytes per GRIB message, the manifest file would be 100 gigabytes per year of NWP data! That is far too big! (`hypergrib`'s ultimate aim is that users should be able to lazily open a huge NWP dataset from a laptop.)

(This discussion is a continuation of a conversation that started over at mpiannucci/gribberish#41 (comment).)
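For reference, the back-of-envelope arithmetic behind that 100-gigabyte figure:

```python
messages_per_year = 10_000_000_000  # ~10 billion GRIB messages per year
bytes_per_manifest_row = 10         # an optimistic lower bound per message
print(messages_per_year * bytes_per_manifest_row / 1e9)  # 100.0 GB per year
```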
Footnotes
1. As pointed out by @emfdavid in this comment.