Conventions for placing CF constructs in a group hierarchy #533
Comments
Hello Patrick, Thanks for putting this together. It is very clear, but I'm afraid that I'm not yet convinced by some of this. I like the new terminology definitions, but would like to see some more reasons why the new recommendations are as they are, as I currently think that they may not be necessary. A few points/questions:
Even if there were some a priori intel, the full search algorithm defined by CF would still need to be applied. I suppose you could write library software that employed a search algorithm that only looked in places according to these recommendations, but that would fail on CF-compliant files that didn't adhere to them, so I doubt anyone would do that!
I would say that there is no de facto reason for using groups. For instance:
I don't think we should restrict the data writer when they are creating a structure for a dataset, happy in the knowledge that the well-defined search algorithm will be applied by the data reader. For instance - where should you put an orography data variable (with its own referenced variables) that is also used as a formula term to parametric vertical coordinate variable? I don't think there is any right answer to that ...
Why is this preferred? Thanks for your patience,
Hello David @davidhassell, I see your points on the language in the Requirements Summary, and that can definitely use some tidying up. None of that is intended to make it into the conventions document, though. On the "convincing" part of your post: section 2.7 currently has no conventions whatsoever that would guide data writers on how to distribute CF constructs and their defining netCDF dimensions and variables over a data set using groups. (I say "guide", not "restrict" as you mention: the four conventions are stated as recommendations, using "should" rather than "shall".) Those four recommendations are meant to support the development of new data collections, not to invalidate existing ones or coerce data writers into a specific approach. While it is certainly possible (and not particularly difficult) to implement the scoping rules in reader software, there is also beauty in the simplicity that comes from following conventions. Your question on relative versus absolute paths is a case in point: with relative paths the structure between the CF constructs and netCDF elements in the data set becomes apparent. A long relative path is an indication of a potential logical design flaw in the data organisation, even if it is fully compliant with the conventions. Absolute paths are a "lazy" and "blind" approach from that perspective.
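To make the point about relative paths a bit more concrete, here is a minimal sketch in plain Python (no netCDF library involved). The group and variable names are made up for illustration, and the helper names `relative_reference` and `group_hops` are hypothetical, not part of any CF tooling. It only shows how the length of a relative path reflects how far apart two elements sit in the group hierarchy.

```python
import posixpath

def relative_reference(referring_group, target_variable):
    """Relative path from a referring group to a variable.

    Both arguments are assumed to be absolute, '/'-separated paths.
    """
    return posixpath.relpath(target_variable, start=referring_group)

def group_hops(relative_path):
    """Number of group steps in a relative path (the last part is the variable name)."""
    parts = [p for p in relative_path.split("/") if p]
    return len(parts) - 1

# Hypothetical layout: a coordinate in the parent group versus one far away.
near = relative_reference("/products/temperature", "/products/lat")
far = relative_reference("/products/temperature", "/grids/arctic/lat")
print(near, group_hops(near))
print(far, group_hops(far))
```

A short path such as the first one suggests the coordinate sits close to the data variable that uses it; a long one such as the second is the kind of signal I mean, even though both are valid references.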
Conventions for placing CF constructs in a group hierarchy
Moderator
To be decided
Moderator Status Review [last updated: 2024-07-29]
New issue created 2024-07-29, based on a discussion in https://github.com/orgs/cf-convention/discussions/333
Requirement Summary
NetCDF-4 introduced the concept of groups: a directory-like structure that distributes the contents of a data set over multiple groups. "They can be used to organize large numbers of variables" (netCDF User's Guide). The CF Conventions allow the use of groups, elaborating on scoping rules and introducing some restrictions on placement of coordinate variables and attributes. There is no guidance on how to organize large numbers of variables. This absence of guidance creates ambiguity and leaves open the possibility of creating unnecessarily complex data sets, complicating the task of software readers to correctly interpret the data set.
A CF-compliant data set may consist of a few netCDF variables (say, a data variable and three or four coordinate variables) that do not require specific guidance beyond what is currently in the convention text. There are also data sets, however, that are composed of many more netCDF variables (say, an atmosphere product of the Arctic region with XY axes with auxiliary lat-lon coordinate variables, a parametric Z axis with two terms, and a T axis, all with their bounds variables, and grid mapping - that's 13 netCDF variables associated with 1 data variable). Such more complex data variables can be organized in groups in different ways. Currently, data producers have no guidance from the conventions in terms of a preferred, recommended or required design pattern. Likewise, data readers have no a priori intel to interpret a data set and thus have to implement comprehensive search algorithms to locate CF constructs distributed over multiple groups.
As multiple data variables are added to a data set (the raison d'être for using groups), the need for structuring the contents of the data set quickly becomes apparent for two principal reasons:
To fully unlock the potential of groups, while avoiding a proliferation of different approaches to organizing and sharing constructs, it is suggested to add guidance to the convention document to aid data producers in using a design pattern that is easily understood by software readers and end users.
Technical Proposal Summary
The intent of this proposal is to define a (small set of) general principle(s) on the distribution of CF constructs encompassing one or more data variables over multiple groups in a single data set. Based on the general principle(s), define a handful of conventions that provide practical guidance to data producers and readers. The agreed text of this proposal, if and when that stage is reached, is then to be integrated into the current section 2.7 of the conventions document, with clarifications and cross-references added to other sections of the conventions document.
This proposal centers on two main CF constructs: the data variable (DV) and the coordinate variable (CV). The logic behind this is that other CF constructs are typically related to one or the other in a dependent manner while the use of the additional constructs is (mostly?) mutually exclusive between the DV and the CV. Notable exceptions are grid mapping variables (these could be labeled naturally general as their definition does not depend on any other CF construct) and scalar coordinate variables (which for purposes of this discussion can be grouped with CVs).
General principle
Relative to a DV or a CV, elements that are general should be placed in the group of the DV or CV or an ancestor group thereof, while elements that are specific to the DV or CV should be placed in the group of the DV or CV or a child group thereof.
"General" are those elements that could (potentially) be shared between DVs or CVs. This includes grid mapping variables, CVs (for DVs), and terms for parametric vertical coordinates (for CVs). When general elements are shared between DVs or CVs in a data set and those DVs or CVs are located in multiple groups, the general elements should be located in an ancestor group of all affected DVs or CVs.
"Specific" are those elements that are applicable to a single DV or CV within the data set.
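As an illustration of the principle for shared "general" elements, the following is a minimal sketch in plain Python, assuming groups are identified by absolute '/'-separated paths. The group names and the helper names `ancestors`, `common_ancestor` and `placement_ok` are hypothetical and only serve the example; this is not a CF validation tool.

```python
import posixpath

def ancestors(group):
    """The group itself followed by its ancestors, nearest first, ending at root '/'."""
    group = posixpath.normpath(group)
    chain = [group]
    while group != "/":
        group = posixpath.dirname(group)
        chain.append(group)
    return chain

def common_ancestor(groups):
    """Deepest group that is equal to, or an ancestor of, every group in `groups`."""
    return posixpath.commonpath([posixpath.normpath(g) for g in groups])

def placement_ok(element_group, user_groups):
    """True if a general element's group is the group of, or an ancestor of,
    every DV or CV group that uses the element."""
    element_group = posixpath.normpath(element_group)
    return all(element_group in ancestors(g) for g in user_groups)

# Hypothetical data set: two data variables in sibling groups share a grid mapping.
dv_groups = ["/model/atmosphere", "/model/ocean"]
print(common_ancestor(dv_groups))                 # deepest group that suits both
print(placement_ok("/model", dv_groups))          # shared element in a common ancestor
print(placement_ok("/model/ocean", dv_groups))    # shared element in only one DV's group
```

Any group returned by `ancestors` for all affected DVs or CVs would satisfy the principle; `common_ancestor` simply picks the deepest such group, which keeps the shared element as close as possible to its users.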
Conventions
Terminology
The following changes to terminology in section 1.3 of the conventions document are proposed:
Cross-references and edits to other sections
Cross-references from various sections to this text. Appendix I.
To be completed
Examples
More to be added.
Benefits
The proposed conventions will aid data producers in defining a design pattern for placing multiple data variables in a single data set. Data readers will benefit from a compact yet complete set of guidelines that makes interpreting a data set with groups not very different from reading a "flat" data set.
Status Quo
The conventions currently provide the option of using groups in netCDF-4 files, with scoping rules and a few restrictions on coordinate variables and attributes, but there is no guidance on how to distribute the netCDF variables that make up a single CF data variable over groups.
Associated pull request
A pull request has not yet been created.