Adding design notes on core data setup and configuration #65

Merged: 4 commits, Sep 2, 2022
127 changes: 127 additions & 0 deletions docs/source/development/design/core.md
@@ -46,3 +46,130 @@ The system should have:
- A `config_loader` function to read a particular file, optionally validating it
against a matching config template.
- A central `Config` class, which can be built up using `ConfigLoader`.

## Data

We need a system for providing forcing data to the simulation. Although some forcing
variables are likely to be module specific, it seems better to avoid scattering them
across arbitrary locations and instead collect everything under the `core.data`
configuration.

For the moment, let's assume a common NetCDF format for data inputs.

### Spatial indices

Data is (always?) associated with a grid cell, so the system needs to be able to match
data values to the cell id of grid cells. The following spatial indexing options seem
useful (and all of these could add a time dimension):

#### Two-dimensional spatial indexing

- Simple indices (`idx_x` and `idx_y`) giving the index of a square grid. The data
should match the configured grid size, and this approach only really works for square
grids (it could just about work for a hex grid with alternate offsets!).
- Standard `x` and `y` coordinates (or `easting` and `northing`). This would require
matching the coordinates of the NetCDF data to the grid definition (origin and
resolution) as well as the grid size. This again only really works for square grids.
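
As a minimal sketch of the coordinate-matching option, assuming a square grid whose `origin` is the lower edge of the first cell and data supplied at cell centres, the conversion along one axis might look like this. The function name `coords_to_indices` and its arguments are hypothetical and not part of the design.

```python
import numpy as np


def coords_to_indices(coords, origin, resolution, n_cells):
    """Convert coordinates along one axis to integer grid indices.

    Hypothetical helper: assumes a square grid with `origin` at the
    lower edge of the first cell and `resolution` as the cell width.
    """
    coords = np.asarray(coords, dtype=float)
    # Position of each coordinate in whole cells from the origin
    idx = np.floor((coords - origin) / resolution).astype(int)
    # Reject data that does not fit inside the configured grid
    if np.any(idx < 0) or np.any(idx >= n_cells):
        raise ValueError("Coordinates fall outside the configured grid")
    return idx


# A grid of 3 cells per axis with 100 m cells and an origin at 0
x_idx = coords_to_indices([50.0, 150.0, 250.0], origin=0.0, resolution=100.0, n_cells=3)
y_idx = coords_to_indices([250.0, 50.0], origin=0.0, resolution=100.0, n_cells=3)
```

The same check also covers the grid size requirement, since any coordinate outside the configured extent raises an error.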

#### One-dimensional spatial indexing

- Simply using a `cell_id` dimension in the NetCDF file to match data to grid cells.
This is entirely agnostic about the grid shape - users just need to provide a
dimension that covers all of the configured cell ids.

- A **mapping** of particular attributes to sets of cells. The obvious use case here is
habitat type style data - for example different soil bulk densities or plant
communities. To unpack that a bit:

### Defining mappings

A user can optionally configure mappings, which are loaded and validated before any
other data. These use one of the first three approaches above, which index _individual_
cells, to provide the spatial layout of a categorical variable. For example, it could
use `x` and `y` coordinates to map `habitat` with values `forest`, `logged_forest` and
`matrix`.

When loading **data**, a variable could now use that mapping variable to unpack values
for each of the three categories into cells.

These are configured as an array of tables to support multiple mappings:

```toml
[[core.mappings]]
file = "/path/to/netcdf.nc"
var = "habitat"
```
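
The unpacking step described above, where a data variable provides one value per category and the mapping spreads those values across cells, could be sketched roughly as follows. The array contents, the `bulk_density` values and the `unpack_mapping` helper are all illustrative assumptions.

```python
import numpy as np

# Hypothetical mapping: the habitat category of each of six grid cells
habitat = np.array(
    ["forest", "forest", "logged_forest", "matrix", "matrix", "forest"]
)

# Hypothetical data variable indexed by category rather than by cell
bulk_density = {"forest": 0.9, "logged_forest": 1.1, "matrix": 1.3}


def unpack_mapping(mapping, values):
    """Expand one value per category into one value per cell."""
    # Every category present in the mapping must have a value
    missing = set(np.unique(mapping)) - set(values)
    if missing:
        raise ValueError(f"No values provided for categories: {sorted(missing)}")
    return np.array([values[category] for category in mapping])


cell_bulk_density = unpack_mapping(habitat, bulk_density)
```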

### Loading data

Once any mappings have been established, then the configuration defines the source file
and variable name for required forcing variables. The dimension names are used to infer
how to spatially index the data and we have a restricted set of dimension names that
must be used to avoid ambiguity. The names `time` and `depth` also need to be reserved
dimension names, but the spatial indexing uses:

- `x`, `y`: use coordinates to match data to cells,
- `idx_x`, `idx_y`: just use indices to match data to cells,
- `cell_id`: use cell ids,
- Otherwise, the dimension should be a previously defined mapping.

If a variable only has a `time` or `depth` dimension, it is assumed to be spatially
constant.
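
One possible way to express these rules in code is sketched below. The function name `infer_spatial_index` and the returned labels are assumptions for illustration only; the dimension names are the ones listed above.

```python
RESERVED_NON_SPATIAL = {"time", "depth"}


def infer_spatial_index(dims, mappings):
    """Return a label describing how a variable is spatially indexed.

    `dims` holds the dimension names of the variable and `mappings` the
    names of previously defined mappings. The labels are hypothetical.
    """
    # Ignore the reserved non-spatial dimensions
    spatial = set(dims) - RESERVED_NON_SPATIAL
    if not spatial:
        return "constant"  # only time and/or depth: spatially constant
    if spatial == {"x", "y"}:
        return "coordinates"
    if spatial == {"idx_x", "idx_y"}:
        return "indices"
    if spatial == {"cell_id"}:
        return "cell_id"
    # Otherwise the single remaining dimension must be a known mapping
    if len(spatial) == 1 and spatial <= set(mappings):
        return "mapping"
    raise ValueError(f"Cannot infer spatial indexing from dimensions: {dims}")
```

For example, `infer_spatial_index(("time", "habitat"), {"habitat"})` would be treated as a mapped variable, while an unrecognised dimension name raises an error rather than being guessed at.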

Variables are configured directly against defined input slots. The config should also
accept values directly to avoid having to create NetCDF files with trivial variables.

```toml
[core.data.precipitation]
file = "/path/to/netcdf.nc"
var = "prec"

[core.data.air_temperature]
file = "/path/to/netcdf.nc"
var = "temp"

[core.data.ambient_co2]
values = 400

[core.data.elevation]
values = [[1,2,3], [2,3,4], [3,4,5]]
```
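
A rough sketch of how the loader might tell file-backed entries from directly provided values is given below. It works on plain dictionaries, as a TOML parser would return them, and the `resolve_data_source` helper and its return tuples are hypothetical.

```python
def resolve_data_source(name, entry):
    """Classify one `core.data` entry as file-backed or directly valued."""
    has_file = "file" in entry and "var" in entry
    has_values = "values" in entry
    # Exactly one of the two styles must be used for each input slot
    if has_file == has_values:
        raise ValueError(f"{name}: provide either 'file' and 'var', or 'values'")
    if has_values:
        return ("values", entry["values"])
    return ("file", entry["file"], entry["var"])


# Parsed form of some of the example entries
core_data = {
    "precipitation": {"file": "/path/to/netcdf.nc", "var": "prec"},
    "ambient_co2": {"values": 400},
    "elevation": {"values": [[1, 2, 3], [2, 3, 4], [3, 4, 5]]},
}

sources = {
    name: resolve_data_source(name, entry) for name, entry in core_data.items()
}
```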

### Data Generator

It seems useful to have a `DataGenerator` class that can be used via the configuration
to provide random or constant data.

The basic idea would be something that defines:

- a spatial structure,
- a range or central value,
- optionally some kind of variation,
- optionally a time axis,
- optionally some kind of cycle,
- optionally some kind of probability of different states.

It would be good if these could be set via configuration, with the same functionality
also usable to create a NetCDF output. That _should_ effectively be the same as
configuring the data generator with a fixed random number generator seed.

These could get arbitrarily complex - so at some point we should just say, if you want
sufficiently complex generated data, just roll your own NetCDF files!
Collaborator

Could you please explain this a bit more? When would this apply?

Collaborator Author

Hard to define! But if someone wanted to simulate a scenario of 5 drought years followed by 3 years of elevated rainfall then an El Niño and then a plague of frogs - that's probably outside the scope 😄


I think this can basically just use all the options of `numpy.random`, possibly with the
inclusion of interpolation along a time dimension at a given interval if a time axis is
present.

```python
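from typing import Any, Optional

import numpy as np

class DataGenerator:
    def __init__(
        self,
        spatial_axis: str,
        temporal_axis: str,
        temporal_interpolation: np.timedelta64,
        seed: Optional[int],
        method: str,  # one of the numpy.random.Generator methods
        **kwargs: Any,
    ) -> None:
        self.spatial_axis = spatial_axis
        self.temporal_axis = temporal_axis
        self.temporal_interpolation = temporal_interpolation
        self.rng = np.random.default_rng(seed)  # seeded for reproducibility
        self.method = method
        self.kwargs = kwargs

    def generate(self) -> np.ndarray:
        # Call the named Generator method, e.g. "normal" or "uniform"
        return np.asarray(getattr(self.rng, self.method)(**self.kwargs))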
```
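
To illustrate the point about seeds and temporal interpolation, here is a small sketch, assuming values are drawn at a coarse set of time points and linearly interpolated to a finer time axis. The `generate_series` function and its parameters are hypothetical; only `numpy.random.default_rng`, the `Generator` sampling methods and `numpy.interp` are real NumPy calls.

```python
import numpy as np


def generate_series(method, seed, n_coarse, n_fine, **kwargs):
    """Draw `n_coarse` random values, then interpolate to `n_fine` steps."""
    rng = np.random.default_rng(seed)
    # Look up the requested numpy.random.Generator method by name
    coarse = getattr(rng, method)(size=n_coarse, **kwargs)
    # Place both series on a shared, normalised time axis
    coarse_time = np.linspace(0.0, 1.0, n_coarse)
    fine_time = np.linspace(0.0, 1.0, n_fine)
    # Linear interpolation along the time axis
    return np.interp(fine_time, coarse_time, coarse)


# The same seed and settings give identical data on every run
a = generate_series("normal", seed=42, n_coarse=4, n_fine=10, loc=25.0, scale=2.0)
b = generate_series("normal", seed=42, n_coarse=4, n_fine=10, loc=25.0, scale=2.0)
```

Because the output depends only on the seed and the settings, writing it to a NetCDF file or regenerating it from configuration should give the same values.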