
Commit d879d33

Merge pull request #65 from ImperialCollegeLondon/feature/core_data_notes

Adding design notes on core data setup and configuration

davidorme authored Sep 2, 2022 · 2 parents 0276ee8 + f7a91f9

Showing 1 changed file: docs/source/development/design/core.md (189 additions, 0 deletions)
The system should have:
- A `config_loader` function to read a particular file, optionally validating it
against a matching config template.
- A central `Config` class, which can be built up using `ConfigLoader`.

## Data

We need a system for providing forcing data to the simulation. Although some forcing
variables are likely to be module specific, it seems better to avoid scattering data
across arbitrary locations and instead to collect everything under the configuration
of `core.data`.

For the moment, let's assume a common NetCDF format for data inputs.
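As a minimal sketch of what that assumption looks like in practice - using `xarray` as
the reader, which is an obvious candidate but not a settled choice - a forcing variable
would be pulled from a dataset by name:

```python
import xarray

# Open a NetCDF input file and extract a named forcing variable. The file
# path and variable name here are purely illustrative.
dataset = xarray.open_dataset("/path/to/netcdf.nc")
precipitation = dataset["prec"]

# The dimension names attached to the variable drive the spatial indexing
# logic described below.
print(precipitation.dims)
```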

### Spatial indices

Data is (always?) associated with a grid cell, so the system needs to be able to match
data values to the cell id of grid cells. The following spatial indexing options seem
useful (and all of these could add a time dimension):

#### Two dimensional spatial indexing

- Simple indices (`idx_x` and `idx_y`) giving the position of a cell within a square
grid. The data must match the configured grid size, and this only really works for
square grids (it could just about work for a hex grid with alternate offsets!).
- Standard `x` and `y` coordinates (or `easting` and `northing`). This would require
matching the coordinates of the NetCDF data to the grid definition (origin and
resolution) as well as the grid size (see the sketch after this list). Again, this
only really works for square grids.
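
To make the coordinate option concrete, here is a minimal sketch of that check,
assuming cell centre coordinates and a purely hypothetical 3 x 3 grid definition:

```python
import numpy as np

# Hypothetical grid definition: origin, resolution and a 3 x 3 cell layout.
origin_x, origin_y, resolution = 0.0, 0.0, 100.0
n_cells_x, n_cells_y = 3, 3

# Cell centre coordinates implied by that grid definition.
expected_x = origin_x + resolution * (np.arange(n_cells_x) + 0.5)
expected_y = origin_y + resolution * (np.arange(n_cells_y) + 0.5)

# Coordinates read from the NetCDF file (illustrative values) are checked
# against the grid definition before the data are accepted.
data_x = np.array([50.0, 150.0, 250.0])
data_y = np.array([50.0, 150.0, 250.0])

if not (np.allclose(data_x, expected_x) and np.allclose(data_y, expected_y)):
    raise ValueError("Data coordinates do not match the grid definition")
```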

#### One dimensional spatial indexing

- Simply using a `cell_id` dimension in the NetCDF file to match data to grid cells.
This is entirely agnostic about the grid shape - users just need to provide a
dimension that covers all of the configured cell ids.

- A **mapping** of particular attributes to sets of cells. The obvious use case here is
habitat-type data - for example, different soil bulk densities or plant communities.
To unpack that a bit:

### Defining mappings

A user can optionally configure mappings, which are loaded and validated before any
other data. These _have_ to use one of the methods that provides values to every cell,
so either of the 2D approaches or `cell_id`. For example, the array below might define
a gradient between different levels of forest cover: matrix (`M`) to logged (`L`) to
forest (`F`).

```toml
[core.data.forest_cover]
values = [['M', 'M', 'L'],
          ['L', 'L', 'F'],
          ['L', 'F', 'F']]
```

Then you could have a separate variable that uses that mapping to assign values per
class onto the spatial grid. The config structure here is in _no way fixed_ - it is
the concept that matters!

```toml
[core.data.soil_depth]
values = { forest_cover.M = 0.1, forest_cover.L = 0.5, forest_cover.F = 1.0 }
```

A mapping uses one of the first three approaches above, which index _individual_
cells, to provide the spatial layout of a categorical variable. For example, it could
use `x` and `y` coordinates to map `habitat` with values `forest`, `logged_forest` and
`matrix`.

When loading **data**, a variable could now use that mapping variable to unpack values
for each of the three categories into cells.
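
To make that unpacking concrete, a short numpy sketch - the array and per-class values
echo the TOML examples above, and nothing about the implementation is fixed:

```python
import numpy as np

# The configured mapping: a categorical layout of cells and per-class
# values for the mapped variable, as in the forest_cover example above.
forest_cover = np.array([["M", "M", "L"],
                         ["L", "L", "F"],
                         ["L", "F", "F"]])
soil_depth_by_class = {"M": 0.1, "L": 0.5, "F": 1.0}

# Unpack the per-class values onto the spatial grid.
soil_depth = np.vectorize(soil_depth_by_class.get)(forest_cover)
# array([[0.1, 0.1, 0.5],
#        [0.5, 0.5, 1. ],
#        [0.5, 1. , 1. ]])
```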

Mappings themselves are configured as an array of tables, so that multiple mappings
can be defined:

```toml
[[core.mappings]]
file = "/path/to/netcdf.nc"
var = "habitat"
```

### Loading data

Once any mappings have been established, the configuration defines the source file and
variable name for each required forcing variable. The dimension names are used to
infer how to spatially index the data, so a restricted set of dimension names must be
used to avoid ambiguity. We also need other reserved dimension names (for example,
`time`, `depth`, `height`), but spatial indexing expects the following dimension names
to define particular indexing approaches:

- `x`, `y`: use coordinates to match data to cells,
- `idx_x`, `idx_y`: just use indices to match data to cells,
- `cell_id`: use cell ids,
- Otherwise, the dimension should be a previously defined mapping.

If a variable only has a `time` or `depth` dimension, it is assumed to be spatially
constant.
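
A sketch of that dispatch on dimension names - the reserved names follow the list
above, but the function itself and its return labels are illustrative only:

```python
def infer_spatial_indexing(dims: tuple[str, ...]) -> str:
    """Infer the spatial indexing approach from NetCDF dimension names."""

    # Strip the non-spatial reserved dimensions before dispatching.
    spatial = set(dims) - {"time", "depth", "height"}

    if {"x", "y"} <= spatial:
        return "coordinates"
    if {"idx_x", "idx_y"} <= spatial:
        return "indices"
    if "cell_id" in spatial:
        return "cell_id"
    if not spatial:
        return "spatially_constant"

    # Any remaining dimension should name a previously defined mapping.
    return "mapping"
```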

Variables are configured directly against defined input slots. The config should also
accept values directly, to avoid having to create NetCDF files with trivial variables.

```toml
[core.data.precipitation]
file = "/path/to/netcdf.nc"
var = "prec"

[core.data.air_temperature]
file = "/path/to/netcdf.nc"
var = "temp"

[core.data.ambient_co2]
values = 400

[core.data.elevation]
values = [[1,2,3], [2,3,4], [3,4,5]]
```
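
A minimal sketch of consuming that configuration, using the standard library `tomllib`
parser - the point is the dispatch on `file` versus `values`, not the API:

```python
import tomllib  # standard library parser, Python 3.11+

CONFIG_TEXT = """
[core.data.precipitation]
file = "/path/to/netcdf.nc"
var = "prec"

[core.data.ambient_co2]
values = 400
"""

# Separate file-backed variables from directly configured values.
config = tomllib.loads(CONFIG_TEXT)

for name, entry in config["core"]["data"].items():
    if "file" in entry:
        print(f"{name}: load {entry['var']!r} from {entry['file']!r}")
    else:
        print(f"{name}: use configured value(s) {entry['values']!r}")
```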

### Data Generator

It seems useful to have a `DataGenerator` class that can be used via the configuration
to provide random or constant data.

The basic idea would be something that defines:

- a spatial structure,
- a range or central value,
- optionally some kind of variation,
- optionally a time axis,
- optionally some kind of cycle,
- optionally some kind of probability of different states.

It would be good if these could be set via configuration, but the same functionality
should also be usable to create NetCDF outputs. That _should_ be effectively the same
as configuring the data generator with a fixed random number generator seed.

I think this can basically just use all the options of `numpy.random`, possibly with the
inclusion of interpolation along a time dimension at a given interval if a time axis is
present.

```python
from typing import Optional

import numpy as np


class DataGenerator:
    """Generate random or constant data over a spatial grid."""

    def __init__(
        self,
        spatial_axis: str,
        temporal_axis: str,
        temporal_interpolation: np.timedelta64,
        seed: Optional[int],
        method: str,  # one of the numpy.random.Generator methods
        **kwargs,
    ) -> None:
        ...
```

The model I have in my head is based around the
[`numpy.random` generator methods](https://numpy.org/doc/stable/reference/random/generator.html).

A user could provide a scalar (so, a global value) or an array (matching a spatial
grid or mapping), along with a method name and its keyword arguments. So a
`DataGenerator` might be:

```python
# Global value varying as a normal distribution around 5
ex1 = DataGenerator(loc=5, scale=2, method='normal')
# A 2x2 grid of lognormal values, with the mean varying by cell but constant variation
ex2 = DataGenerator(mean=[[5, 6], [7, 8]], sigma=2, method='lognormal')
```
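
A sketch of how such a specification might be realised: look the method up on a seeded
`numpy.random.Generator` and let numpy broadcast array arguments over the grid. The
`generate` helper here is hypothetical; the method and keyword names are the standard
`numpy.random.Generator` ones:

```python
from typing import Optional

import numpy as np


def generate(method: str, seed: Optional[int] = None, **kwargs) -> np.ndarray:
    """Draw values using a named numpy.random.Generator method."""
    rng = np.random.default_rng(seed)
    return np.asarray(getattr(rng, method)(**kwargs))


# The two examples above, expressed through this sketch.
ex1 = generate("normal", loc=5, scale=2)
ex2 = generate("lognormal", mean=[[5, 6], [7, 8]], sigma=2)
```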

A more advanced use would be providing a time series of values with variation. Here,
you'd need a time axis giving the temporal location of the sampling points, which
could be interpolated if necessary. So, for example: a 2 x 2 grid of normally
distributed values that increase in location and scale over a year.

```python
loc = [[[1, 2],
        [3, 4]],
       [[2, 3],
        [4, 5]]]

scale = [[[1, 1],
          [1, 1]],
         [[1.2, 1.2],
          [1.2, 1.2]]]

time = ['2020-01-01', '2020-12-31']

# Time varies along the first axis of loc and scale
ex3 = DataGenerator(loc=loc, scale=scale, method='normal', time=time, time_axis=0)
```
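
To illustrate the interpolation step, a pure numpy sketch that expands the two
sampling points in `loc` above to monthly values along the time axis - illustrative
only, and the real implementation would work from the configured time coordinates:

```python
import numpy as np

loc = np.array([[[1, 2], [3, 4]],
                [[2, 3], [4, 5]]], dtype=float)

# The two sampling points and the monthly targets as day offsets.
times = np.array(["2020-01-01", "2020-12-31"], dtype="datetime64[D]")
months = np.arange("2020-01", "2021-01", dtype="datetime64[M]").astype("datetime64[D]")
t = (times - times[0]).astype(float)
t_new = (months - times[0]).astype(float)

# Interpolate each of the four cells independently along time (axis 0).
monthly_loc = np.stack(
    [np.interp(t_new, t, loc[:, i, j]) for i in range(2) for j in range(2)],
    axis=-1,
).reshape(len(months), 2, 2)
```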

We could support sequences of generators, or probabilistic switching between
generators (El Niño years?), to build more complex scenarios. However, this could end
up being a deep rabbit hole, so at some point we should just say: if you want
sufficiently complex scenarios, just roll your own NetCDF files! We could provide
examples of that.
