Manifest format #1

JackKelly · 2024-07-26T12:17:07Z

A core part of hypergrib is a "manifest" which is basically a table of contents to tell hypergrib which GRIB files to load for a given datetime etc.

hypergrib is all about performance and scale. The hope is that users will be able to lazily access petabytes of GRIB files on cloud object storage, as quickly as possible. I think that means the manifest will have to be very concise (ideally so concise that even a manifest covering, say, a 20-year dataset of ensemble NWPs would download in a few milliseconds and fit into CPU L2 cache).

How to make the manifest concise? The basic idea is to exploit the fact that NWP datasets are usually regular in all dimensions (e.g. hourly), and that the file names usually follow a consistent naming convention. So you could have a manifest which just says:

"the time coords start at 2020-01-01 and end at 2024-01-01; with hourly cadence. The missing timesteps are 2021-01-01 and 2022-03-04. And all filenames prior to 2022 are of the form YYYY-DD-MM.grib; filenames after 2022 are of the form YYYY-DD-MM-foo.grib".

(These formats etc would be specific to each dataset)
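
To make the idea concrete, here is a minimal Python sketch of such a manifest, not hypergrib's actual design. The names `Manifest`, `NamingRule`, `timesteps` and `filename` are hypothetical, and it assumes the example's filename patterns can be written as `strftime` patterns that switch at the start of 2022.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class NamingRule:
    """Hypothetical: a filename pattern for timesteps before `valid_before`."""
    valid_before: datetime  # exclusive upper bound for this rule
    pattern: str            # strftime pattern used to build the filename


@dataclass(frozen=True)
class Manifest:
    """Hypothetical concise manifest: a regular time axis plus exceptions."""
    start: datetime
    end: datetime                    # inclusive
    cadence: timedelta
    missing: frozenset = frozenset()
    naming_rules: tuple = ()         # ordered by ascending `valid_before`

    def timesteps(self):
        """Yield every timestep on the regular grid that is not missing."""
        t = self.start
        while t <= self.end:
            if t not in self.missing:
                yield t
            t += self.cadence

    def filename(self, t: datetime) -> str:
        """Return the filename for `t` using the first rule that covers it."""
        for rule in self.naming_rules:
            if t < rule.valid_before:
                return t.strftime(rule.pattern)
        raise ValueError(f"no naming rule covers {t}")


# The example from the text, using the YYYY-DD-MM patterns as written there.
EXAMPLE = Manifest(
    start=datetime(2020, 1, 1),
    end=datetime(2024, 1, 1),
    cadence=timedelta(hours=1),
    missing=frozenset({datetime(2021, 1, 1), datetime(2022, 3, 4)}),
    naming_rules=(
        NamingRule(datetime(2022, 1, 1), "%Y-%d-%m.grib"),
        NamingRule(datetime.max, "%Y-%d-%m-foo.grib"),
    ),
)
```

The whole four-year hourly axis is described by a handful of values, and `EXAMPLE.filename(some_datetime)` recovers the file name without storing one entry per file.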

There would also exist tools to convert to and from Kerchunk's manifest.

Existing manifests


devsjc · 2024-07-26T12:25:41Z

There are nuances with some providers. Unfortunately they don't follow a consistent layout when publishing their GRIB files, so it's something to bear in mind when you're thinking about this manifest.

The simplest thing is obviously one big folder with a grib file per timestep, e.g.

data
 - 2021-01-01T00:00Z.grib2 <- all parameters, all horizons, one init time
 - 2021-01-01T06:00Z.grib2 <- all parameters, all horizons, other init time
 - ...

but you'll also come across some that have a few files per step, for example:

data
- 2021-01-01
	- Wholesale1.grib <- some parameters, horizon up to 54 hours
	- Wholesale1T54.grib <- some parameters, horizon 54-120 hours
	- Wholesale2.grib <- other parameters, horizon up to 54 hours
	- ...
- ...

or even some that are in parameter subfolders grouped by init hour etc:

data
- 00
	- dswrf
		- 2021-01-01T00:00.grib2 <- one parameter, one init time, all horizons
		- 2021-01-02T00:00.grib2 <- same parameter, other init time, all horizons
	- vis
		- ...
	- ...
- 06
	- dswrf
		- 2021-01-01T06:00.grib2
		- ...

I guess just pick one provider and layout to aim for at the start, and then you can always expand to others once it's proven? But when you're thinking about how a concise representation could look, it's worth bearing in mind how the actual data could look!

JackKelly · 2024-07-26T12:29:45Z

That's super-useful, thank you Sol!

Maybe, for each new dataset, a human would manually specify a template to say "this is the broad pattern for how the data is structured". And the code would populate the specifics from the actual data. Humans definitely wouldn't have to manually specify the details of each GRIB file!
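
A rough sketch of that split between a hand-written template and code that fills in the specifics, in Python. Everything here is an assumption for illustration: `TEMPLATE` is a hypothetical regular expression for the "parameter subfolders grouped by init hour" layout shown above, and `populate` is a hypothetical helper that takes a plain list of object paths (however they were listed).

```python
import re
from collections import defaultdict

# Hypothetical human-written template for one dataset layout:
#   data/<init_hour>/<param>/<init_time>.grib2
TEMPLATE = re.compile(
    r"data/(?P<init_hour>\d{2})/(?P<param>[^/]+)/"
    r"(?P<init_time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2})\.grib2$"
)


def populate(paths):
    """Collect the coordinate values actually present in `paths`.

    Returns (coords, unmatched): `coords` maps each template field to its
    sorted unique values; `unmatched` lists paths the template did not fit.
    """
    coords = defaultdict(set)
    unmatched = []
    for path in paths:
        match = TEMPLATE.match(path)
        if match is None:
            unmatched.append(path)
            continue
        for name, value in match.groupdict().items():
            coords[name].add(value)
    return {name: sorted(values) for name, values in coords.items()}, unmatched
```

The human only writes the pattern once per dataset; the list of parameters, init hours and init times comes from the data, and anything in `unmatched` flags a place where the template needs refining.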

And, yeah, as you say, let's start by hard-coding the template for a single NWP dataset, to get a proof-of-concept up-and-running.

JackKelly · 2024-08-06T13:17:44Z

I think I'll start by implementing code which can search through a bunch of GRIB files and output a manifest using Kerchunk's Parquet reference spec. It should be easy to convert that to an even more concise format if necessary. Or to convert it to an alternative manifest format.
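
As a sketch of the scanning step only (not of Kerchunk's Parquet layout, which isn't reproduced here), the Python below walks a buffer of GRIB edition 2 messages and records where each one starts and how long it is. It relies on the GRIB2 indicator section: the bytes `GRIB`, the edition number in octet 8, and the total message length as a big-endian 64-bit integer in octets 9-16. The function name `scan_grib2_messages` and the `path`/`offset`/`size` keys are my own labels, and the code assumes messages are stored back-to-back with no padding between them.

```python
import struct


def scan_grib2_messages(buf: bytes, path: str):
    """Return one {"path", "offset", "size"} record per GRIB2 message in `buf`.

    Assumes messages are contiguous. Fewer than 16 trailing bytes are ignored.
    """
    refs = []
    offset = 0
    while offset + 16 <= len(buf):
        if buf[offset:offset + 4] != b"GRIB":
            raise ValueError(f"expected 'GRIB' at byte {offset}")
        edition = buf[offset + 7]
        if edition != 2:
            raise ValueError(f"unsupported GRIB edition {edition} at byte {offset}")
        # Octets 9-16 of the indicator section: total message length in bytes.
        (size,) = struct.unpack(">Q", buf[offset + 8:offset + 16])
        end = offset + size
        # Smallest possible message: 16-byte indicator + 4-byte "7777" end section.
        if size < 20 or end > len(buf) or buf[end - 4:end] != b"7777":
            raise ValueError(f"corrupt or truncated message at byte {offset}")
        refs.append({"path": path, "offset": offset, "size": size})
        offset = end
    return refs
```

Each record is enough to fetch one message with a single byte-range request. A real implementation would also decode each message's metadata (parameter, level, forecast step) to decide which array chunk the record belongs to.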

JackKelly · 2024-09-30T15:21:39Z

We're no longer planning to use a manifest file. Instead, we're planning to use the approach outlined in this comment: #14 (comment)

JackKelly self-assigned this Aug 6, 2024

This was referenced Aug 6, 2024

FEATURE: Retrieve .idx files from cloud object storage, given a path and search pattern #3

Closed

FEATURE: Check grib files when creating manifest #4

Closed

JackKelly closed this as completed Sep 30, 2024
