Ocean/develop: add CF-compliant _FillValue attribute #533

wenshanw · 2020-04-22T22:13:26Z

Modified registry XML files and registry parser to add the _FillValue attribute to both state and tracer variables.

mgduda · 2020-04-22T22:24:40Z

Although this is a PR to ocean/develop, these changes might eventually make their way into develop, so it may be worth considering that _FillValue and missing_value can have subtly different meaning.

xylar · 2020-04-23T07:22:21Z

@mgduda, thanks for the feedback.

these changes might eventually make their way into develop

Indeed, we would like these changes to affect other cores in E3SM, so the changes to the registry parser would almost certainly make their way to develop.

it may be worth considering that _FillValue and missing_value can have subtly different meaning.

Could you elaborate or point us to documentation about this? To me, they seem entirely redundant and I'm not aware of NetCDF supporting anything other than _FillValue. Would missing_value have a meaning at runtime that would differ form _FillValue in I/O?

xylar · 2020-04-23T08:44:12Z

@wenshanw, thanks very much for making this pull request. As you can see, this will be a good place to get some discussion going about the changes you are working on an make sure we avoid any unintended consequences.

xylar · 2020-04-23T08:45:14Z

This work is a first step at addressing #437

mark-petersen · 2020-04-23T12:45:49Z

This looks like a simple enough change. If the testing shows that the output files have a CF-compliant fill value, then the changes should be fine.

mark-petersen · 2020-04-23T12:52:05Z

src/tools/registry/gen_inc.c

 				// Uncomment to add _FillValue to match missing_value
-				// fortprintf(fd, "         call mpas_add_att(%s(%d) %% attLists(const_index) %% attList, '_FillValue', %s)\n", pointer_name, time_lev, missing_value);
+				fortprintf(fd, "         call mpas_add_att(%s(%d) %% attLists(const_index) %% attList, '_FillValue', %s)\n", pointer_name, time_lev, missing_value);


Since this is a framework file, we need to merge this into the develop branch instead, and then I will merge develop into the LANL branches: ocean/develop, seaice/develop, landice/develop. @wenshanw, don't worry about that right now. This PR is convenient for testing, and I can help with the different branches after testing.

Right, I was thinking that would be the right approach, too, and that @wenshanw shouldn't have to worry about that. We need to do more testing anyway.

mark-petersen · 2020-04-23T13:09:32Z

src/core_ocean/tracer_groups/Registry_activeTracers.xml

@@ -61,7 +61,7 @@

 	<var_struct name="state" time_levs="2">
 		<var_struct name="tracers" time_levs="2">
-			<var_array name="activeTracers" dimensions="nVertLevels nCells Time" type="real" packages="activeTracersPKG" >
+			<var_array name="activeTracers" dimensions="nVertLevels nCells Time" type="real" packages="activeTracersPKG" missing_value="FILLVAL">


Will we need to add missing_value="FILLVAL" to every var_array and var(not within a var_array) in the registry? It would be nice to set missing_value="FILLVAL" as the default so no changes need to be made to the Registry.

@mgduda would a default setting of the missing_value="FILLVAL" cause problems for the atmospheric core?

An alternative is to have a global variable or attribute at the top of the registry file so each core can specify the missing_value for all variables with a single line.

@czender seemed to think that we would want to manually mark variables with _FillValue and that the default should be that variable don't have a fill value. It isn't good practice to mark variable that are valid everywhere as having a _FillValue, for example. But I'll let @czender further clarify.

I don't know what missing_value is for so I won't comment on that until I get clarity from @mgduda.

That is correct. Variables that are valid everywhere (e.g., most of EAM) will never use a _FillValue and therefore should not a the _FillValue attribute. The only reason why is efficiency: _FillValue slows-down code. In-line code and generic analysis tools must check every single value of a variable with _FillValue before operating on that value. Lots of if's. When variables do not have _FillValue we can assume all values are valid then the analysis of the variable needs no if's. That said, it is perfect legal to have a _FillValue attribute that is never used. It is just more efficient not to.

xylar · 2020-04-23T14:49:27Z

This looks like a simple enough change. If the testing shows that the output files have a CF-compliant fill value, then the changes should be fine.

To be clear, all this change does so far is add the attribute to NetCDF output with the correct value. Changes (likely many of them) still need to be made in the code so the missing value within data arrays actually gets filled in with the proper value. We have been exceedingly bad about this in MPAS development so far. So this will be a big pain for @wenshanw to fix, with my help and hopefully help from others from the various MPAS teams.

xylar · 2020-04-23T14:50:34Z

We thought it was a good idea to break the process into smaller PRs that do no harm, rather than a mega PR that affects a large number of files at once.

matthewhoffman · 2020-04-23T14:59:58Z

@wenshanw and @xylar , thank you for your work on this! This is long overdue and will be useful.
@xylar , that seems wise to break things into pieces. Is there an outline for the all the changes required for the complete implementation? That would help to envision how it all works. I'm also open to an old-school MPAS Developers telecon to hash out the details.

xylar · 2020-04-23T15:15:12Z

@matthewhoffman, the work is on an E3SM confluence page. That may end up being a bit of an unfortunate place for the discussion only because @mgduda and others not in E3SM can't access it. At the same time, this is work being supported by the E3SM infrastructure team so that does see like an appropriate place to keep track of the work. Suggestions?

xylar · 2020-04-23T15:18:12Z

I'm also open to an old-school MPAS Developers telecon to hash out the details.

I think we need a developer's telecon anyway to discuss #472, so I would be very eager to have one scheduled. I agree, this would be another issue for the agenda.

matthewhoffman · 2020-04-23T15:20:22Z

I think that Confluence page makes the most sense for detailed discussion, unless @mgduda prefers something else. We can email @mgduda a PDF export of the page, and assuming his feedback doesn't become too involved, we can manually add his input to the page.

czender · 2020-04-23T15:42:51Z

it may be worth considering that _FillValue and missing_value can have subtly different meaning.

Could you elaborate or point us to documentation about this? To me, they seem entirely redundant and I'm not aware of NetCDF supporting anything other than _FillValue. Would missing_value have a meaning at runtime that would differ form _FillValue in I/O?

Indeed there is a slight difference between _FillValue and missing_value as described in the NUG https://www.unidata.ucar.edu/software/netcdf/docs/attribute_conventions.html and in the CF-Conventions http://cfconventions.org/cf-conventions/cf-conventions.html#missing-data. My takeaway is that _FillValue is almost always what should be used. Moreover, have distinct _FillValue and missing_value attributes will lead to a world of hurt during analysis because that means tools would have to separately check values against both! Highly inefficient, and few if any tools do that.

mgduda · 2020-04-23T16:09:13Z

It may be that my recollection is fuzzy. I had been thinking that our intent back c. 2016 was that _FillValue should indicate values that were not actually written to disk by MPAS (since _FillValue is used to pre-fill variables allocated on disk), and missing_value should indicate values that were written but that do not contain valid values. However, I'm wondering now whether the decision to go with missing_value was simply one of performance.

For reference, commit 88bff9c seems to have been the original commit that added both missing_value and _FillValue attributes. Commit bc9bbb8 commented out the code to write the _FillValue attribute, leaving just the missing_value attribute.

mgduda · 2020-04-23T16:12:19Z

Adding missing_value="FILLVAL" everywhere wouldn't cause problems for MPAS-Atmosphere from a pre- or post-processing perspective -- our tools should work without problem (not really through brilliant design, but simply because they ignore most attributes).

However, I'm concerned that there may be a performance penalty to pre-writing all variables with _FillValue, then rewriting those variables with valid data at the netCDF library level. And, as @czender says, it probably isn't good practice to add a _FillValue to fields that are valid everywhere, which is the case for all MPAS-Atmosphere fields that I can think of.

mgduda · 2020-04-23T16:15:55Z

I'm definitely in favor of the suggestion by @xylar to break development into small PRs that can be easily reviewed rather than creating one large PR spanning several parts of the code. As @matthewhoffman said it can be really helpful to have an outline of the smaller steps that are planned to accomplish the larger objective.

xylar · 2020-04-23T18:11:14Z

@mgduda, it sounds like the difference between missing_value and _FillValue is to do with how cores would choose to fill in invalid values. It seems to me that this distinction is irrelevant in file output and should not propagate through to file I/O. All invalid values should be have the value specified by _FillValue by the output stage.

In terms of implementation, it is hard for me to see how first filling the array with _FillValue and later filling in certain invalid fields with missing_value would lead to something useful. Instead, I would think one would simply not assign a _FillValue or missing_value to fields where it is know that all entries will be valid. One would assign a _FillValue to fields that need it, and would explicitly fill that field at some point (either before other operations or during model run) with that value where it is invalid. If cores wanted to do this by setting the full array to _FillValue at the beginning of the run and only altering valid values, that would be one approach. Explicitly filling in invalid values with _FillValue would be another. But I don't see how there can be a useful distinction between missing_values explicitly filled in and _FillValue as the default value for the array before it is altered. Perhaps I've misunderstood, in which case, please set me straight.

mgduda · 2020-04-23T20:29:42Z

@xylar Perhaps it's true from the perspective of certain post-processing tools that there's no distinction between missing_value and _FillValue, but I would disagree that there isn't a distinction in a broader sense. Consider the difference between these two statements:

MPAS-Atmosphere did not process cell 42
MPAS-Atmosphere processed cell 42 but the processing resulted in an invalid value

For a specific example, consider computing the geopotential height of the 700 hPa surface for a limited-area domain: below high mountains, the field is ill-defined, and outside of the domain, the field is neither computed nor written to an output file.

I agree that a post-processing tool that simply aimed, e.g., to average all valid values wouldn't care about the distinction between these statements; but it doesn't mean that other tools, or a model developer, would see no distinction.

Regardless, as I mentioned, it may have been that we decided to go with missing_value for purely performance reasons -- _FillValue pre-fills values in a field allocated on disk before that field is written. If I'm understanding correctly, this means a field is effectively written twice: once to pre-fill on disk, and a second time to write the user-provided field from memory. Note that the _FillValue attribute has special meaning to the netCDF library.

xylar · 2020-04-23T20:58:22Z

@mgduda, thanks for clarifying further.

For the current project, we are only interested in _FillValue in the sense that it has meaning for the NetCDF library. It is pretty severely problematic for us that the MPAS components I work with (Ocean, Land and Seaice) produce files that set missing or invalid values to a bunch of arbitrary things and do not set the _FillValue attribute to anything at all. We want to get to the point where both _FillValue takes on a useful value that propagates to files as a NetCDF attribute (the purpose of this PR) and where invalid values are, in fact, filled in with that _FillValue.

Having missing_value as a netCDF attribute, particularly if it is a different value from _FillValue seems both confusing and problematic to me. It isn't clear to me why an MPAS components need to have a missing_value that is different from _FillValue, but I don't think it's a good idea to have that missing_value as an attribute even if it does because this is simply not compatible with post-processing and isn't CF-compliant.

A quick grep for missing_value and _FillValue suggests that no components (at least not versions in develop) are using either of these attributes. So how these attributes could be used in components as opposed to NetCDF output seems like a purely academic argument, since they're not being used.

Regardless, as I mentioned, it may have been that we decided to go with missing_value for purely performance reasons -- _FillValue pre-fills values in a field allocated on disk before that field is written. If I'm understanding correctly, this means a field is effectively written twice: once to pre-fill on disk, and a second time to write the user-provided field from memory. Note that the _FillValue attribute has special meaning to the netCDF library.

I can't pretend that I have a very thorough understand either NetCDF Fortran output or MPAS's wrapper around it. Are you saying that if we set the _FillValue attribute as part of NetCDF output, the array gets written twice and that was the reason for avoiding that particular attribute name? If so, that's quite problematic, since only that attribute name produces the desired post-processing results, namely that these invalid values get treated as such and not as valid numbers, precisely the giant headache we're trying to alleviate here. At the same time, I agree that we would like very much to avoid doubling the I/O because presumably we are writing out the entire array (both valid and invalid values) in any case.

If that's not what you're saying, could you try to clarify further?

czender · 2020-04-23T21:30:11Z

I can shed some light here. First, know that when CAM/EAM and CLM/ELM define a _FillValue (which is rare for CAM/EAM) they always define a missing_value attribute identical to the _FillValue. I think this is for back-compatibility to the 1990s when missing_value was what most people used. Then netCDF3 introduced library-dependent behavior for _FillValue and most applications switched either entirely to that or to duplicating _FillValue with missing_value. I do not think there need be a performance hit to using _FillValue for at least one or two reasons: First, NCO sets (and CAM/EAM may set, I'm unsure) NC_NOFILL when creating datasets: https://www.unidata.ucar.edu/software/netcdf/docs/group__datasets.html#ga610e6fadb14a51f294b322a1b8ac1bec. This prevents the duplicate filling @mgduda warns about. Second, it may be the case that NC_NOFILL is only needed for netCDF3 datasets, not netCDF4 datasets (though I cannot remember for sure, it may be worth asking the Unidata folks). In summary, 1. Defining _FillValue need not impart a disk performance hit if NC_NOFILL is set. 2. Writing a redundant missing_value attribute is fine unless its value differs from the _FillValue. That would not play well with any software that does not explicitly check each field value against both _FillValue and missing_value values, and most software I am aware of is not that pedantic because it makes code to complex and slow.

xylar · 2020-04-29T08:31:33Z

src/core_ocean/Registry.xml

@@ -2031,6 +2031,7 @@
 	<var_struct name="state" time_levs="2">
 		<var name="normalVelocity" type="real" dimensions="nVertLevels nEdges Time" units="m s^{-1}"
 			 description="horizonal velocity, normal component to an edge"
+       missing_value="FILLVAL"


@wenshanw, this spacing needs to be done with tabs rather than spaces to be compatible with the rest of the XML in MPAS. (I know it's hard to see the difference in a lot of editors.)

xylar · 2020-04-29T08:36:56Z

@wenshanw, @czender and @mark-petersen, could we come up with a check-list of things that need to be done before this pull request (PR) is ready to merge? I mostly want to make sure we define what the scope of this PR and keep it simple.

Do we want to include adding missing values to other variables? Maybe layerThickness?
I assume we will leave the timeSeriesStatsMonthly and timeSeriesStatsDaily analysis members for later on?
Is there a need to rebase onto the latest changes? (I think this wouldn't hurt.)
Do we need to make the changes to the registry parser in develop before we proceed further?
On a related note, do we need to have an MPAS developers meeting to discuss the overall plan first?

mgduda · 2020-05-07T21:21:44Z

@xylar @wenshanw @mark-petersen If I understood correctly, one of the items that came out of our discussion on 4 May was the desire to use the missing_value in the registry entry for a variable as the default value if no default_value attribute is specified for the variable. In order to track progress on this, I've opened up issue #551 . If you see anything inaccurate in the description of the issue, please feel free to comment on #551 . I think this is something that we should be able to implement in short order.

xylar · 2020-08-26T11:25:02Z

@wenshanw, I think the changes from this branch are either in #571 or they have been brought in to ocean/develop via #616. I'm going to close this for now, but please re-open if that is not the case.

wenshanw added 2 commits March 29, 2020 11:55

test push

e307647

Modify Registry parser to add _FillValue attribute

1f1d752

xylar added in progress Ocean labels Apr 23, 2020

xylar requested review from mark-petersen, mgduda, matthewhoffman and akturner April 23, 2020 08:45

xylar assigned mark-petersen Apr 23, 2020

xylar mentioned this pull request Apr 23, 2020

Add a daily output test case to QU240 #535

Merged

mark-petersen reviewed Apr 23, 2020

View reviewed changes

xylar reviewed Apr 29, 2020

View reviewed changes

xylar mentioned this pull request May 24, 2020

Add _FillValue to all active and debug tracer fields #571

Closed

xylar closed this Aug 26, 2020

matthewhoffman added in progress Ocean labels Mar 17, 2021

Ocean/develop: add CF-compliant _FillValue attribute #533

Ocean/develop: add CF-compliant _FillValue attribute #533

Uh oh!

Conversation

wenshanw commented Apr 22, 2020

Uh oh!

mgduda commented Apr 22, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

mark-petersen commented Apr 23, 2020

Uh oh!

mark-petersen Apr 23, 2020

Choose a reason for hiding this comment

Uh oh!

xylar Apr 23, 2020

Choose a reason for hiding this comment

Uh oh!

mark-petersen Apr 23, 2020

Choose a reason for hiding this comment

Uh oh!

xylar Apr 23, 2020

Choose a reason for hiding this comment

Uh oh!

czender Apr 23, 2020

Choose a reason for hiding this comment

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

matthewhoffman commented Apr 23, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

matthewhoffman commented Apr 23, 2020

Uh oh!

czender commented Apr 23, 2020

Uh oh!

mgduda commented Apr 23, 2020

Uh oh!

mgduda commented Apr 23, 2020

Uh oh!

mgduda commented Apr 23, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

mgduda commented Apr 23, 2020

Uh oh!

xylar commented Apr 23, 2020

Uh oh!

czender commented Apr 23, 2020

Uh oh!

xylar Apr 29, 2020

Choose a reason for hiding this comment

Uh oh!

xylar commented Apr 29, 2020

Uh oh!

mgduda commented May 7, 2020

Uh oh!

xylar commented Aug 26, 2020

Uh oh!

Uh oh!