Skip to content

Ocean/develop: add CF-compliant _FillValue attribute #533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

wenshanw
Copy link

Modified registry XML files and registry parser to add the _FillValue attribute to both state and tracer variables.

@mgduda
Copy link
Contributor

mgduda commented Apr 22, 2020

Although this is a PR to ocean/develop, these changes might eventually make their way into develop, so it may be worth considering that _FillValue and missing_value can have subtly different meaning.

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

@mgduda, thanks for the feedback.

these changes might eventually make their way into develop

Indeed, we would like these changes to affect other cores in E3SM, so the changes to the registry parser would almost certainly make their way to develop.

it may be worth considering that _FillValue and missing_value can have subtly different meaning.

Could you elaborate or point us to documentation about this? To me, they seem entirely redundant and I'm not aware of NetCDF supporting anything other than _FillValue. Would missing_value have a meaning at runtime that would differ form _FillValue in I/O?

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

@wenshanw, thanks very much for making this pull request. As you can see, this will be a good place to get some discussion going about the changes you are working on an make sure we avoid any unintended consequences.

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

This work is a first step at addressing #437

@mark-petersen
Copy link
Contributor

This looks like a simple enough change. If the testing shows that the output files have a CF-compliant fill value, then the changes should be fine.

// Uncomment to add _FillValue to match missing_value
// fortprintf(fd, " call mpas_add_att(%s(%d) %% attLists(const_index) %% attList, '_FillValue', %s)\n", pointer_name, time_lev, missing_value);
fortprintf(fd, " call mpas_add_att(%s(%d) %% attLists(const_index) %% attList, '_FillValue', %s)\n", pointer_name, time_lev, missing_value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a framework file, we need to merge this into the develop branch instead, and then I will merge develop into the LANL branches: ocean/develop, seaice/develop, landice/develop. @wenshanw, don't worry about that right now. This PR is convenient for testing, and I can help with the different branches after testing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I was thinking that would be the right approach, too, and that @wenshanw shouldn't have to worry about that. We need to do more testing anyway.

@@ -61,7 +61,7 @@

<var_struct name="state" time_levs="2">
<var_struct name="tracers" time_levs="2">
<var_array name="activeTracers" dimensions="nVertLevels nCells Time" type="real" packages="activeTracersPKG" >
<var_array name="activeTracers" dimensions="nVertLevels nCells Time" type="real" packages="activeTracersPKG" missing_value="FILLVAL">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we need to add missing_value="FILLVAL" to every var_array and var(not within a var_array) in the registry? It would be nice to set missing_value="FILLVAL" as the default so no changes need to be made to the Registry.

@mgduda would a default setting of the missing_value="FILLVAL" cause problems for the atmospheric core?

An alternative is to have a global variable or attribute at the top of the registry file so each core can specify the missing_value for all variables with a single line.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@czender seemed to think that we would want to manually mark variables with _FillValue and that the default should be that variable don't have a fill value. It isn't good practice to mark variable that are valid everywhere as having a _FillValue, for example. But I'll let @czender further clarify.

I don't know what missing_value is for so I won't comment on that until I get clarity from @mgduda.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. Variables that are valid everywhere (e.g., most of EAM) will never use a _FillValue and therefore should not a the _FillValue attribute. The only reason why is efficiency: _FillValue slows-down code. In-line code and generic analysis tools must check every single value of a variable with _FillValue before operating on that value. Lots of if's. When variables do not have _FillValue we can assume all values are valid then the analysis of the variable needs no if's. That said, it is perfect legal to have a _FillValue attribute that is never used. It is just more efficient not to.

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

This looks like a simple enough change. If the testing shows that the output files have a CF-compliant fill value, then the changes should be fine.

To be clear, all this change does so far is add the attribute to NetCDF output with the correct value. Changes (likely many of them) still need to be made in the code so the missing value within data arrays actually gets filled in with the proper value. We have been exceedingly bad about this in MPAS development so far. So this will be a big pain for @wenshanw to fix, with my help and hopefully help from others from the various MPAS teams.

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

We thought it was a good idea to break the process into smaller PRs that do no harm, rather than a mega PR that affects a large number of files at once.

@matthewhoffman
Copy link
Member

@wenshanw and @xylar , thank you for your work on this! This is long overdue and will be useful.
@xylar , that seems wise to break things into pieces. Is there an outline for the all the changes required for the complete implementation? That would help to envision how it all works. I'm also open to an old-school MPAS Developers telecon to hash out the details.

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

@matthewhoffman, the work is on an E3SM confluence page. That may end up being a bit of an unfortunate place for the discussion only because @mgduda and others not in E3SM can't access it. At the same time, this is work being supported by the E3SM infrastructure team so that does see like an appropriate place to keep track of the work. Suggestions?

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

I'm also open to an old-school MPAS Developers telecon to hash out the details.

I think we need a developer's telecon anyway to discuss #472, so I would be very eager to have one scheduled. I agree, this would be another issue for the agenda.

@matthewhoffman
Copy link
Member

I think that Confluence page makes the most sense for detailed discussion, unless @mgduda prefers something else. We can email @mgduda a PDF export of the page, and assuming his feedback doesn't become too involved, we can manually add his input to the page.

@czender
Copy link

czender commented Apr 23, 2020

it may be worth considering that _FillValue and missing_value can have subtly different meaning.

Could you elaborate or point us to documentation about this? To me, they seem entirely redundant and I'm not aware of NetCDF supporting anything other than _FillValue. Would missing_value have a meaning at runtime that would differ form _FillValue in I/O?

Indeed there is a slight difference between _FillValue and missing_value as described in the NUG https://www.unidata.ucar.edu/software/netcdf/docs/attribute_conventions.html and in the CF-Conventions http://cfconventions.org/cf-conventions/cf-conventions.html#missing-data. My takeaway is that _FillValue is almost always what should be used. Moreover, have distinct _FillValue and missing_value attributes will lead to a world of hurt during analysis because that means tools would have to separately check values against both! Highly inefficient, and few if any tools do that.

@mgduda
Copy link
Contributor

mgduda commented Apr 23, 2020

It may be that my recollection is fuzzy. I had been thinking that our intent back c. 2016 was that _FillValue should indicate values that were not actually written to disk by MPAS (since _FillValue is used to pre-fill variables allocated on disk), and missing_value should indicate values that were written but that do not contain valid values. However, I'm wondering now whether the decision to go with missing_value was simply one of performance.

For reference, commit 88bff9c seems to have been the original commit that added both missing_value and _FillValue attributes. Commit bc9bbb8 commented out the code to write the _FillValue attribute, leaving just the missing_value attribute.

@mgduda
Copy link
Contributor

mgduda commented Apr 23, 2020

Adding missing_value="FILLVAL" everywhere wouldn't cause problems for MPAS-Atmosphere from a pre- or post-processing perspective -- our tools should work without problem (not really through brilliant design, but simply because they ignore most attributes).

However, I'm concerned that there may be a performance penalty to pre-writing all variables with _FillValue, then rewriting those variables with valid data at the netCDF library level. And, as @czender says, it probably isn't good practice to add a _FillValue to fields that are valid everywhere, which is the case for all MPAS-Atmosphere fields that I can think of.

@mgduda
Copy link
Contributor

mgduda commented Apr 23, 2020

I'm definitely in favor of the suggestion by @xylar to break development into small PRs that can be easily reviewed rather than creating one large PR spanning several parts of the code. As @matthewhoffman said it can be really helpful to have an outline of the smaller steps that are planned to accomplish the larger objective.

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

@mgduda, it sounds like the difference between missing_value and _FillValue is to do with how cores would choose to fill in invalid values. It seems to me that this distinction is irrelevant in file output and should not propagate through to file I/O. All invalid values should be have the value specified by _FillValue by the output stage.

In terms of implementation, it is hard for me to see how first filling the array with _FillValue and later filling in certain invalid fields with missing_value would lead to something useful. Instead, I would think one would simply not assign a _FillValue or missing_value to fields where it is know that all entries will be valid. One would assign a _FillValue to fields that need it, and would explicitly fill that field at some point (either before other operations or during model run) with that value where it is invalid. If cores wanted to do this by setting the full array to _FillValue at the beginning of the run and only altering valid values, that would be one approach. Explicitly filling in invalid values with _FillValue would be another. But I don't see how there can be a useful distinction between missing_values explicitly filled in and _FillValue as the default value for the array before it is altered. Perhaps I've misunderstood, in which case, please set me straight.

@mgduda
Copy link
Contributor

mgduda commented Apr 23, 2020

@xylar Perhaps it's true from the perspective of certain post-processing tools that there's no distinction between missing_value and _FillValue, but I would disagree that there isn't a distinction in a broader sense. Consider the difference between these two statements:

  1. MPAS-Atmosphere did not process cell 42
  2. MPAS-Atmosphere processed cell 42 but the processing resulted in an invalid value

For a specific example, consider computing the geopotential height of the 700 hPa surface for a limited-area domain: below high mountains, the field is ill-defined, and outside of the domain, the field is neither computed nor written to an output file.

I agree that a post-processing tool that simply aimed, e.g., to average all valid values wouldn't care about the distinction between these statements; but it doesn't mean that other tools, or a model developer, would see no distinction.

Regardless, as I mentioned, it may have been that we decided to go with missing_value for purely performance reasons -- _FillValue pre-fills values in a field allocated on disk before that field is written. If I'm understanding correctly, this means a field is effectively written twice: once to pre-fill on disk, and a second time to write the user-provided field from memory. Note that the _FillValue attribute has special meaning to the netCDF library.

@xylar
Copy link
Collaborator

xylar commented Apr 23, 2020

@mgduda, thanks for clarifying further.

For the current project, we are only interested in _FillValue in the sense that it has meaning for the NetCDF library. It is pretty severely problematic for us that the MPAS components I work with (Ocean, Land and Seaice) produce files that set missing or invalid values to a bunch of arbitrary things and do not set the _FillValue attribute to anything at all. We want to get to the point where both _FillValue takes on a useful value that propagates to files as a NetCDF attribute (the purpose of this PR) and where invalid values are, in fact, filled in with that _FillValue.

Having missing_value as a netCDF attribute, particularly if it is a different value from _FillValue seems both confusing and problematic to me. It isn't clear to me why an MPAS components need to have a missing_value that is different from _FillValue, but I don't think it's a good idea to have that missing_value as an attribute even if it does because this is simply not compatible with post-processing and isn't CF-compliant.

A quick grep for missing_value and _FillValue suggests that no components (at least not versions in develop) are using either of these attributes. So how these attributes could be used in components as opposed to NetCDF output seems like a purely academic argument, since they're not being used.

Regardless, as I mentioned, it may have been that we decided to go with missing_value for purely performance reasons -- _FillValue pre-fills values in a field allocated on disk before that field is written. If I'm understanding correctly, this means a field is effectively written twice: once to pre-fill on disk, and a second time to write the user-provided field from memory. Note that the _FillValue attribute has special meaning to the netCDF library.

I can't pretend that I have a very thorough understand either NetCDF Fortran output or MPAS's wrapper around it. Are you saying that if we set the _FillValue attribute as part of NetCDF output, the array gets written twice and that was the reason for avoiding that particular attribute name? If so, that's quite problematic, since only that attribute name produces the desired post-processing results, namely that these invalid values get treated as such and not as valid numbers, precisely the giant headache we're trying to alleviate here. At the same time, I agree that we would like very much to avoid doubling the I/O because presumably we are writing out the entire array (both valid and invalid values) in any case.

If that's not what you're saying, could you try to clarify further?

@czender
Copy link

czender commented Apr 23, 2020

I can shed some light here. First, know that when CAM/EAM and CLM/ELM define a _FillValue (which is rare for CAM/EAM) they always define a missing_value attribute identical to the _FillValue. I think this is for back-compatibility to the 1990s when missing_value was what most people used. Then netCDF3 introduced library-dependent behavior for _FillValue and most applications switched either entirely to that or to duplicating _FillValue with missing_value. I do not think there need be a performance hit to using _FillValue for at least one or two reasons: First, NCO sets (and CAM/EAM may set, I'm unsure) NC_NOFILL when creating datasets: https://www.unidata.ucar.edu/software/netcdf/docs/group__datasets.html#ga610e6fadb14a51f294b322a1b8ac1bec. This prevents the duplicate filling @mgduda warns about. Second, it may be the case that NC_NOFILL is only needed for netCDF3 datasets, not netCDF4 datasets (though I cannot remember for sure, it may be worth asking the Unidata folks). In summary, 1. Defining _FillValue need not impart a disk performance hit if NC_NOFILL is set. 2. Writing a redundant missing_value attribute is fine unless its value differs from the _FillValue. That would not play well with any software that does not explicitly check each field value against both _FillValue and missing_value values, and most software I am aware of is not that pedantic because it makes code to complex and slow.

@@ -2031,6 +2031,7 @@
<var_struct name="state" time_levs="2">
<var name="normalVelocity" type="real" dimensions="nVertLevels nEdges Time" units="m s^{-1}"
description="horizonal velocity, normal component to an edge"
missing_value="FILLVAL"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenshanw, this spacing needs to be done with tabs rather than spaces to be compatible with the rest of the XML in MPAS. (I know it's hard to see the difference in a lot of editors.)

@xylar
Copy link
Collaborator

xylar commented Apr 29, 2020

@wenshanw, @czender and @mark-petersen, could we come up with a check-list of things that need to be done before this pull request (PR) is ready to merge? I mostly want to make sure we define what the scope of this PR and keep it simple.

  • Do we want to include adding missing values to other variables? Maybe layerThickness?
  • I assume we will leave the timeSeriesStatsMonthly and timeSeriesStatsDaily analysis members for later on?
  • Is there a need to rebase onto the latest changes? (I think this wouldn't hurt.)
  • Do we need to make the changes to the registry parser in develop before we proceed further?
  • On a related note, do we need to have an MPAS developers meeting to discuss the overall plan first?

@mgduda
Copy link
Contributor

mgduda commented May 7, 2020

@xylar @wenshanw @mark-petersen If I understood correctly, one of the items that came out of our discussion on 4 May was the desire to use the missing_value in the registry entry for a variable as the default value if no default_value attribute is specified for the variable. In order to track progress on this, I've opened up issue #551 . If you see anything inaccurate in the description of the issue, please feel free to comment on #551 . I think this is something that we should be able to implement in short order.

@xylar
Copy link
Collaborator

xylar commented Aug 26, 2020

@wenshanw, I think the changes from this branch are either in #571 or they have been brought in to ocean/develop via #616. I'm going to close this for now, but please re-open if that is not the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants