Manifest format #1

Closed
JackKelly opened this issue Jul 26, 2024 · 4 comments

@JackKelly
Owner

JackKelly commented Jul 26, 2024

A core part of hypergrib is a "manifest" which is basically a table of contents to tell hypergrib which GRIB files to load for a given datetime etc.

hypergrib is all about performance and scale. The hope is that users will be able to lazily access petabytes of GRIB files on cloud object storage, as quickly as possible. I think that means the manifest will have to be very concise (ideally so concise that even a manifest covering, say, a 20-year dataset of ensemble NWPs would download in a few milliseconds and fit into CPU L2 cache).

How to make the manifest concise? The basic idea is to exploit the fact that NWP datasets are usually regular in all dimensions (e.g. hourly). And the file names usually follow a consistent naming convention. So you could have a manifest which just says

"the time coords start at 2020-01-01 and end at 2024-01-01; with hourly cadence. The missing timesteps are 2021-01-01 and 2022-03-04. And all filenames prior to 2022 are of the form YYYY-DD-MM.grib; filenames after 2022 are of the form YYYY-DD-MM-foo.grib".

(These formats etc would be specific to each dataset)
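A minimal Python sketch of what such a manifest could look like in memory, using the example values quoted above. The `Manifest` and `NamingRule` names, their fields, and the use of `strftime` patterns are all assumptions made for illustration; this is not hypergrib's actual format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class NamingRule:
    """Hypothetical: a filename pattern used for timesteps before `valid_until`."""
    valid_until: datetime  # exclusive upper bound
    pattern: str           # strftime pattern


@dataclass(frozen=True)
class Manifest:
    """Hypothetical concise manifest: a regular time axis plus its exceptions."""
    start: datetime
    end: datetime          # inclusive
    cadence: timedelta
    missing: frozenset     # timesteps known to be absent
    naming_rules: tuple    # NamingRule objects, ordered by valid_until

    def timesteps(self):
        """Yield every timestep on the regular grid that is not missing."""
        t = self.start
        while t <= self.end:
            if t not in self.missing:
                yield t
            t += self.cadence

    def filename(self, t):
        """Compute the filename for timestep `t` without storing one entry per file."""
        on_grid = (t - self.start) % self.cadence == timedelta(0)
        if not (self.start <= t <= self.end) or not on_grid or t in self.missing:
            raise KeyError(t)
        for rule in self.naming_rules:
            if t < rule.valid_until:
                return t.strftime(rule.pattern)
        raise KeyError(t)


manifest = Manifest(
    start=datetime(2020, 1, 1),
    end=datetime(2024, 1, 1),
    cadence=timedelta(hours=1),
    missing=frozenset({datetime(2021, 1, 1), datetime(2022, 3, 4)}),
    naming_rules=(
        NamingRule(datetime(2022, 1, 1), "%Y-%d-%m.grib"),
        NamingRule(datetime.max, "%Y-%d-%m-foo.grib"),
    ),
)

print(manifest.filename(datetime(2021, 6, 15, 3)))
print(manifest.filename(datetime(2023, 2, 3)))
```

The whole four-year hourly axis is described by a handful of values, and filenames are computed on demand rather than listed.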

There would also be tools to convert to and from Kerchunk's manifest format.

Existing manifests

@devsjc

devsjc commented Jul 26, 2024

There are nuances with some providers: unfortunately they don't follow a consistent layout when surfacing their GRIB files, so it's something to bear in mind when you're thinking about this manifest.

The simplest thing is obviously one big folder with a grib file per timestep, e.g.

data
 - 2021-01-01T00:00Z.grib2 <- all parameters, all horizons, one init time
 - 2021-01-01T06:00Z.grib2 <- all parameters, all horizons, other init time
 - ...

but you'll also come across some that have a few files per step, for example:

data
- 2021-01-01
	- Wholesale1.grib <- some parameters, horizon up to 54 hours
	- Wholesale1T54.grib <- some parameters, horizon 54-120 hours
	- Wholesale2.grib <- other parameters, horizon up to 54 hours
	- ...
- ...

or even some that are in parameter subfolders grouped by init hour etc:

data
- 00
	- dswrf
		- 2021-01-01T00:00.grib2 <- one parameter, one init time, all horizons
		- 2021-01-02T00:00.grib2 <- same parameter, other init time, all horizons
	- vis
		- ...
	- ...
- 06
	- dswrf
		- 2021-01-01T06:00.grib2
		- ...

I guess just pick one provider and layout to aim for at the start, and then you can always expand to others once it's proven? But when you're thinking about how a concise representation could look, it's worth bearing in mind how the actual data could look!
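To make the comparison concrete, here is a rough Python sketch that expresses the three layouts above as path templates. The layout names, the `LAYOUTS` dict, the `render` helper, and the field names `init`, `group` and `param` are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical path templates for the three layouts described above.
# `init` is the init time, `group` is a provider-specific file stem,
# and `param` is the parameter name.
LAYOUTS = {
    "one_file_per_init": "data/{init:%Y-%m-%dT%H:%MZ}.grib2",
    "several_files_per_init": "data/{init:%Y-%m-%d}/{group}.grib",
    "param_subfolders": "data/{init:%H}/{param}/{init:%Y-%m-%dT%H:%M}.grib2",
}


def render(layout, **fields):
    """Fill in the named layout's template with the given fields."""
    return LAYOUTS[layout].format(**fields)


print(render("one_file_per_init", init=datetime(2021, 1, 1, 0, 0)))
print(render("several_files_per_init", init=datetime(2021, 1, 1), group="Wholesale1T54"))
print(render("param_subfolders", init=datetime(2021, 1, 1, 6, 0), param="dswrf"))
```

Each layout needs a different set of fields, which suggests a concise manifest would have to record which dimensions are encoded in the path and which live inside each file.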

@JackKelly
Owner Author

That's super-useful, thank you Sol!

Maybe, for each new dataset, a human would manually specify a template to say "this is the broad pattern for how the data is structured". And the code would populate the specifics from the actual data. Humans definitely wouldn't have to manually specify the details of each GRIB file!

And, yeah, as you say, let's start by hard-coding the template for a single NWP dataset, to get a proof-of-concept up-and-running.
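A minimal sketch of that split, assuming the parameter-subfolder layout from the comment above: the regular expression is the hand-written "template", and `populate` fills in the specifics from a listing of paths. The `TEMPLATE` and `populate` names are hypothetical, and treating the smallest gap between init times as the cadence is an assumption that would not hold for every dataset.

```python
import re
from datetime import datetime

# Hand-written template (hypothetical) for paths such as
# data/00/dswrf/2021-01-01T00:00.grib2
TEMPLATE = re.compile(
    r"data/(?P<hour>\d{2})/(?P<param>\w+)/"
    r"(?P<init>\d{4}-\d{2}-\d{2}T\d{2}:\d{2})\.grib2"
)


def populate(paths):
    """Derive parameters, time range, cadence and gaps from a path listing."""
    inits, params = set(), set()
    for path in paths:
        match = TEMPLATE.fullmatch(path)
        if match is None:
            continue  # ignore anything that does not fit the template
        inits.add(datetime.strptime(match["init"], "%Y-%m-%dT%H:%M"))
        params.add(match["param"])
    if len(inits) < 2:
        raise ValueError("need at least two init times to infer a cadence")
    ordered = sorted(inits)
    # Assumption: the smallest gap between consecutive init times is the cadence.
    cadence = min(b - a for a, b in zip(ordered, ordered[1:]))
    missing = []
    t = ordered[0]
    while t <= ordered[-1]:
        if t not in inits:
            missing.append(t)
        t += cadence
    return {
        "params": sorted(params),
        "start": ordered[0],
        "end": ordered[-1],
        "cadence": cadence,
        "missing": missing,
    }


listing = [
    "data/00/dswrf/2021-01-01T00:00.grib2",
    "data/00/dswrf/2021-01-02T00:00.grib2",
    "data/00/dswrf/2021-01-04T00:00.grib2",
    "data/00/vis/2021-01-01T00:00.grib2",
]
summary = populate(listing)
print(summary)
```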

@JackKelly
Owner Author

I think I'll start by implementing code which can search through a bunch of GRIB files and output a manifest using Kerchunk's Parquet reference spec. It should be easy to convert that to an even more concise format if necessary. Or to convert it to an alternative manifest format.
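As a rough illustration of the scanning step only, the sketch below finds the byte offset and length of each GRIB2 message in a buffer, which is the kind of information a byte-range reference needs. It relies on the GRIB2 indicator section (the `GRIB` magic, the edition number in octet 8, and the total message length in octets 9-16) and the `7777` end marker. It does not use Kerchunk or write Parquet, it reads a whole buffer rather than streaming, and the `scan_grib2_messages` and `fake_message` names are hypothetical.

```python
import struct


def scan_grib2_messages(buf):
    """Return a list of (offset, length) pairs, one per GRIB2 message in `buf`."""
    refs = []
    pos = 0
    while True:
        pos = buf.find(b"GRIB", pos)
        if pos == -1 or pos + 16 > len(buf):
            break
        edition = buf[pos + 7]
        (length,) = struct.unpack(">Q", buf[pos + 8 : pos + 16])
        end = pos + length
        # A valid message is at least 20 bytes and ends with the "7777" marker.
        if edition == 2 and length >= 20 and buf[end - 4 : end] == b"7777":
            refs.append((pos, length))
            pos = end
        else:
            pos += 4  # false positive: keep searching after this "GRIB"
    return refs


def fake_message(payload):
    """Build a synthetic, non-decodable GRIB2-shaped message for the demo."""
    total = 16 + len(payload) + 4
    header = b"GRIB" + b"\x00\x00" + bytes([0, 2]) + struct.pack(">Q", total)
    return header + payload + b"7777"


buf = fake_message(b"abc") + b"xx" + fake_message(b"")
print(scan_grib2_messages(buf))
```

Each pair could then be combined with the file's path to form one reference entry in whichever manifest format is chosen.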

@JackKelly
Owner Author

We're no longer planning to use a manifest file. Instead, we're planning to use the approach outlined in this comment: #14 (comment)


2 participants