Commit 638a399

bouweandela, schlunma, valeriupredoi, and Rémi Kazeroni authored
Support wildcards in the recipe and improve support for ancillary variables and dataset versioning (#1609)
Co-authored-by: Manuel Schlund <manuel.schlund@dlr.de> Co-authored-by: Valeriu Predoi <valeriu.predoi@gmail.com> Co-authored-by: Rémi Kazeroni <remi.kazeroni@dlr.de>
1 parent bebc637 commit 638a399

69 files changed: +5283 -3722 lines changed

doc/recipe/overview.rst

Lines changed: 167 additions & 7 deletions
@@ -73,8 +73,8 @@ the following:
7373
Recipe section: ``datasets``
7474
============================
7575

76-
The ``datasets`` section includes dictionaries that, via key-value pairs, define standardized
77-
data specifications:
76+
The ``datasets`` section includes dictionaries that, via key-value pairs or
77+
"facets", define standardized data specifications:
7878

7979
- dataset name (key ``dataset``, value e.g. ``MPI-ESM-LR`` or ``UKESM1-0-LL``).
8080
- project (key ``project``, value ``CMIP5`` or ``CMIP6`` for CMIP data,
@@ -114,6 +114,162 @@ For example, a datasets section could be:
114114
- {dataset: HadGEM3-GC31-MM, project: CMIP6, exp: dcppA-hindcast, ensemble: r1i1p1f1, sub_experiment: s2000, grid: gn, start_year: 2000, end_year: 2002}
115115
- {dataset: BCC-CSM2-MR, project: CMIP6, exp: dcppA-hindcast, ensemble: r1i1p1f1, sub_experiment: s2000, grid: gn, timerange: '*'}
116116
117+
.. _dataset_wildcards:
118+
119+
Automatically populating a recipe with all available datasets
120+
-------------------------------------------------------------
121+
122+
It is possible to use :obj:`glob` patterns or wildcards for certain facet
123+
values, to make it easy to find all available datasets locally and/or on ESGF.
124+
Note that ``project`` cannot be a wildcard.
125+
126+
The facet values for local files are retrieved from the directory tree where the
127+
directories represent the facet values.
128+
Reading facet values from file names is not yet supported.
129+
See :ref:`CMOR-DRS` for more information on this kind of file organization.
130+
131+
When (some) files are available locally, the tool will not automatically look
132+
for more files on ESGF. To populate a recipe with all available datasets from
133+
ESGF, ``offline`` should be set to ``false`` and ``always_search_esgf`` should
134+
be set to ``true`` in the
135+
:ref:`user configuration file<user configuration file>`.
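
As a minimal sketch, the two settings named above would appear in the user
configuration file roughly as follows (all other settings are omitted here):

.. code-block:: yaml

  # Allow searching ESGF, even when some files are already available locally.
  offline: false
  always_search_esgf: true
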
136+
137+
For more control over which datasets are selected, it is recommended to use
138+
a Python script or `Jupyter notebook <https://jupyter.org/>`_ to compose
139+
the recipe.
140+
See :ref:`/notebooks/composing-recipes.ipynb` for an example.
141+
This is particularly useful when specific relations are required between
142+
datasets, e.g. when a dataset needs to be available for multiple variables
143+
or experiments.
144+
145+
An example recipe that will use all CMIP6 datasets and all ensemble members
146+
which have a ``'historical'`` experiment could look like this:
147+
148+
.. code-block:: yaml
149+
150+
  datasets:
151+
    - project: CMIP6
152+
      exp: historical
153+
      dataset: '*'
154+
      institute: '*'
155+
      ensemble: '*'
156+
      grid: '*'
157+
158+
After running the recipe, a copy specifying exactly which datasets were used
159+
is available in the output directory in the ``run`` subdirectory.
160+
The filename of this recipe will end with ``_filled.yml``.
161+
162+
For the ``timerange`` facet, special syntax is available.
163+
See :ref:`timerange_examples` for more information.
164+
165+
If populating a recipe using wildcards does not work, this is because there
166+
were either no files found that match those facets, or the facets could not be
167+
read from the directory name or ESGF.
168+
169+
.. _supplementary_variables:
170+
171+
Defining supplementary variables (ancillary variables and cell measures)
172+
------------------------------------------------------------------------
173+
174+
It is common practice to store ancillary variables (e.g. land/sea/ice masks)
175+
and cell measures (e.g. cell area, cell volume) in separate datasets that are
176+
described by slightly different facets.
177+
In ESMValCore, we call ancillary variables and cell measures "supplementary
178+
variables".
179+
Some :ref:`preprocessor functions <Preprocessors>` need this information to
180+
work.
181+
For example, the :ref:`area_statistics<area_statistics>` preprocessor function
182+
needs to know the area of each grid cell in order to compute a correctly weighted
183+
statistic.
184+
185+
To attach these variables to a dataset, the ``supplementary_variables`` keyword
186+
can be used.
187+
For example, to add cell area to a dataset, it can be specified as follows:
188+
189+
.. code-block:: yaml
190+
191+
  datasets:
192+
    - dataset: BCC-ESM1
193+
      project: CMIP6
194+
      exp: historical
195+
      ensemble: r1i1p1f1
196+
      grid: gn
197+
      supplementary_variables:
198+
        - short_name: areacella
199+
          mip: fx
200+
          exp: 1pctCO2
201+
202+
Note that the supplementary variable will inherit the facet values from the main
203+
dataset, so only those facet values that differ need to be specified.
204+
205+
.. _supplementary_dataset_wildcards:
206+
207+
Automatically selecting the supplementary dataset
208+
-------------------------------------------------
209+
210+
When using many datasets, it may be quite a bit of work to find out which facet
211+
values are required to find the corresponding supplementary data.
212+
The tool can automatically guess the best matching supplementary dataset.
213+
To use this feature, the supplementary dataset can be specified as:
214+
215+
.. code-block:: yaml
216+
217+
  datasets:
218+
    - dataset: BCC-ESM1
219+
      project: CMIP6
220+
      exp: historical
221+
      ensemble: r1i1p1f1
222+
      grid: gn
223+
      supplementary_variables:
224+
        - short_name: areacella
225+
          mip: fx
226+
          exp: '*'
227+
          activity: '*'
228+
          ensemble: '*'
229+
230+
With this syntax, the tool will search all available values of ``exp``,
231+
``activity``, and ``ensemble`` and use the supplementary dataset that shares the
232+
most facet values with the main dataset.
233+
Note that this behaviour is different from
234+
:ref:`using wildcards in the main dataset <dataset_wildcards>`,
235+
where they will be expanded to generate all matching datasets.
236+
The available datasets are shown in the debug log messages when running a recipe
237+
with wildcards, so if a different supplementary dataset is preferred, these
238+
messages can be used to see what facet values are available.
239+
The facet values for local files are retrieved from the directory tree where the
240+
directories represent the facet values.
241+
Reading facet values from file names is not yet supported.
242+
If wildcard expansion fails, this is because there were either no files found
243+
that match those facets, or the facets could not be read from the directory
244+
name or ESGF.
245+
246+
Automatic definition of supplementary variables
247+
-----------------------------------------------
248+
249+
If an ancillary variable or cell measure is
250+
:ref:`needed by a preprocessor function <preprocessors_using_supplementary_variables>`,
251+
but it is not specified in the recipe, the tool will automatically make a best
252+
guess using the syntax above.
253+
Usually this will work fine, but if it does not, it is recommended to explicitly
254+
define the supplementary variables in the recipe.
255+
256+
To disable this automatic addition, define the supplementary variable as usual,
257+
but add the special facet ``skip`` with value ``true``.
258+
See :ref:`preprocessors_using_supplementary_variables` for an example recipe.
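
As an illustration only, assuming the main dataset from the earlier examples
and ``areacella`` as the variable that would otherwise be added automatically,
such an entry could look like this:

.. code-block:: yaml

  datasets:
    - dataset: BCC-ESM1
      project: CMIP6
      exp: historical
      ensemble: r1i1p1f1
      grid: gn
      supplementary_variables:
        # The special facet ``skip`` disables the automatic addition.
        - short_name: areacella
          skip: true
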
259+
260+
Saving ancillary variables and cell measures
261+
--------------------------------------------
262+
263+
By default, ancillary variables and cell measures will be removed
264+
from the main variable before saving it to file because they can be as big as
265+
the main variable.
266+
To keep the supplementary variables, disable the preprocessor function that
267+
removes them by setting ``remove_supplementary_variables: false`` in the
268+
preprocessor profile in the recipe.
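
A minimal sketch of such a preprocessor profile is given below. The profile
name ``keep_supplementaries`` is a hypothetical choice made for this
illustration; any other preprocessor functions in the profile are omitted:

.. code-block:: yaml

  preprocessors:
    keep_supplementaries:  # hypothetical profile name
      remove_supplementary_variables: false
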
269+
270+
Concatenating data corresponding to multiple facets
271+
---------------------------------------------------
272+
117273
It is possible to define the experiment as a list to concatenate two experiments.
118274
Here is an example concatenating the `historical` experiment with `rcp85`:
119275

@@ -130,6 +286,9 @@ In this case, the specified datasets are concatenated into a single cube:
130286
datasets:
131287
- {dataset: CanESM2, project: CMIP5, exp: [historical, rcp85], ensemble: [r1i1p1, r1i2p1], start_year: 2001, end_year: 2004}
132288
289+
Short notation of ensemble members and sub-experiments
290+
------------------------------------------------------
291+
133292
ESMValTool also supports a simplified syntax to add multiple ensemble members from the same dataset.
134293
In the ensemble key, any element in the form `(x:y)` will be replaced with all numbers from x to y (both inclusive),
135294
adding a dataset entry for each replacement. For example, to add ensemble members r1i1p1 to r10i1p1
@@ -152,7 +311,7 @@ Please, bear in mind that this syntax can only be used in the ensemble tag.
152311
Also, note that the combination of multiple experiments and ensembles, like
153312
exp: [historical, rcp85], ensemble: [r1i1p1, "r(2:3)i1p1"] is not supported and will raise an error.
154313

155-
The same simplified syntax can be used to add multiple sub-experiment ids:
314+
The same simplified syntax can be used to add multiple sub-experiments:
156315

157316
.. code-block:: yaml
158317
@@ -161,6 +320,9 @@ The same simplified syntax can be used to add multiple sub-experiment ids:
161320
162321
.. _timerange_examples:
163322

323+
Time ranges
324+
-----------
325+
164326
When using the ``timerange`` tag to specify the start and end points, possible values can be as follows:
165327

166328

@@ -278,17 +440,15 @@ section will include:
278440
- a description of the diagnostic and lists of themes and realms that it applies to;
279441
- an optional ``additional_datasets`` section.
280442
- an optional ``title`` and ``description``, used to generate the title and description
281-
of the ``index.html`` output file.
443+
in the ``index.html`` output file.
282444

283445
.. _tasks:
284446

285447
The diagnostics section defines tasks
286448
-------------------------------------
287449
The diagnostic section(s) define the tasks that will be executed when running the recipe.
288450
For each variable a preprocessing task will be defined and for each diagnostic script a
289-
diagnostic task will be defined. If variables need to be derived
290-
from other variables, a preprocessing task for each of the variables
291-
needed to derive that variable will be defined as well. These tasks can be viewed
451+
diagnostic task will be defined. These tasks can be viewed
292452
in the main_log_debug.txt file that is produced every run. Each task has a unique
293453
name that defines the subdirectory where the results of that task are stored. Task
294454
names start with the name of the diagnostic section followed by a '/' and then
