Manifest format #1
Comments
There are nuances with some providers. Unfortunately they don't follow a consistent technique for surfacing their GRIB files, so it's something to bear in mind when you're thinking about this manifest. The simplest thing is obviously one big folder with a GRIB file per timestep, e.g.
but you'll also come across some that have a few files per step, for example:
or even some that are in parameter subfolders grouped by init hour etc:
I guess just pick one provider and layout you're aiming for at the start, and then you can always expand to others once it's proven? But it's worth bearing in mind, when you're thinking about how a concise representation could look, how the actual data could look!
That's super-useful, thank you Sol! Maybe, for each new dataset, a human would manually specify a template to say "this is the broad pattern for how the data is structured". And the code would populate the specifics from the actual data. Humans definitely wouldn't have to manually specify the details of each GRIB file! And, yeah, as you say, let's start by hard-coding the template for a single NWP dataset, to get a proof-of-concept up-and-running.
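To make the "human writes the template, code populates the specifics" idea concrete, here is a minimal Python sketch. The directory layout, the `TEMPLATE` regex, and the `populate` function are all hypothetical and invented for illustration; they do not describe any real provider or any actual hypergrib code.

```python
import re
from datetime import datetime

# Hypothetical, hand-written template for ONE dataset (an assumption, not a real
# provider layout): a folder per init time, and one GRIB file per forecast step.
TEMPLATE = re.compile(r"(?P<init>\d{10})/step_(?P<step>\d{3})\.grib2$")


def populate(paths):
    """Collect the init times and forecast steps actually present in `paths`."""
    inits, steps = set(), set()
    for path in paths:
        match = TEMPLATE.search(path)
        if match is None:
            continue  # Path does not follow this dataset's layout; skip it.
        inits.add(datetime.strptime(match["init"], "%Y%m%d%H"))
        steps.add(int(match["step"]))
    return sorted(inits), sorted(steps)


paths = [
    "2024010100/step_000.grib2",
    "2024010100/step_003.grib2",
    "2024010106/step_000.grib2",
    "README.txt",
]
inits, steps = populate(paths)
```

The human only supplies the pattern; the set of init times and steps comes from listing the bucket. A provider with several files per step, or with parameter subfolders, would need a different template with extra named groups.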
I think I'll start by implementing code which can search through a bunch of GRIB files and output a manifest using Kerchunk's Parquet reference spec. It should be easy to convert that to an even more concise format if necessary. Or to convert it to an alternative manifest format.
We're no longer planning to use a manifest file. Instead, we're planning to use the approach outlined in this comment: #14 (comment)
A core part of hypergrib is a "manifest", which is basically a table of contents to tell hypergrib which GRIB files to load for a given datetime etc.
hypergrib is all about performance and scale. The hope is that users will be able to lazily access petabytes of GRIB files on cloud object storage, as quickly as possible. Which I think means that the manifest will have to be very concise (ideally so concise that even a manifest covering, say, a 20-year dataset of ensemble NWPs would download in a few milliseconds and fit into CPU L2 cache).
How to make the manifest concise? The basic idea is to exploit the fact that NWP datasets are usually regular in all dimensions (e.g. hourly). And the file names usually follow a consistent naming convention. So you could have a manifest which just says
(These formats etc would be specific to each dataset)
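As a rough illustration of how little such a manifest would need to store, here is a Python sketch. The `ConciseManifest` class, its field names, and the path template are hypothetical and invented for this example; they are not the format proposed in this issue, only one way the "regular dimensions plus naming convention" idea could be expressed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ConciseManifest:
    """Hypothetical manifest: a few numbers and a template, not a row per file."""

    first_init: datetime       # First forecast initialisation time.
    init_interval: timedelta   # Regular spacing between initialisations.
    n_inits: int               # Number of initialisations covered.
    step_hours: range          # Regular forecast steps, in hours.
    path_template: str         # str.format template with `init` and `step` fields.

    def path(self, init: datetime, step: int) -> str:
        """Compute the object path for one (init, step) pair; nothing is looked up."""
        index, remainder = divmod(init - self.first_init, self.init_interval)
        if remainder or not 0 <= index < self.n_inits:
            raise KeyError(f"no forecast initialised at {init}")
        if step not in self.step_hours:
            raise KeyError(f"no forecast step of {step} hours")
        return self.path_template.format(init=init, step=step)


manifest = ConciseManifest(
    first_init=datetime(2024, 1, 1),
    init_interval=timedelta(hours=6),
    n_inits=4,
    step_hours=range(0, 49, 3),
    path_template="{init:%Y%m%d%H}/step_{step:03d}.grib2",  # Assumed layout.
)
example_path = manifest.path(datetime(2024, 1, 1, 6), 3)
```

Because every path is computed rather than stored, the size of the manifest is independent of the number of files. The trade-off is that it cannot describe irregularities, such as missing runs, without some extra list of exceptions.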
There would also be tools to convert to and from Kerchunk's manifest format.
Existing manifests