File metadata change when subsetting data #198

Open
atmodatcode opened this issue Oct 27, 2021 · 5 comments

@atmodatcode

Description

I retrieved data (variable tas) from https://cds.climate.copernicus.eu/cdsapp#!/dataset/projections-cmip6?tab=form in two ways:

  1. original (unsubsetted) CMIP6 historical data
  2. subsetted CMIP6 historical data (subset in time and region).

I noticed that the file metadata differs depending on this choice.
For example, unlike the original data, the subsetted data contain
time:_FillValue = NaN ;
lat:_FillValue = NaN ;
time_bnds:_FillValue = NaN ;
time_bnds:coordinates = "height" ;
and so forth, which are not present in the original data.
Giving coordinate variables a NaN _FillValue is unusual and does not follow the CF recommendations (http://cfconventions.org/cf-conventions/cf-conventions.html#missing-data), where it is stated that the _FillValue should have the same units as the variable itself.
Also, some software cannot handle missing values defined as NaN (e.g. CDO produces weird results in such cases).
The NaN issue does not affect the data variables, only the coordinate variables and the bounds variables.
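
To check which variables in a file are affected, a sketch along the following lines may help. It assumes the file is opened with decode_cf=False so that the raw _FillValue stays in .attrs; the helper name nan_fill_variables and the small in-memory dataset are made up for illustration and stand in for the real subsetted file.

```python
import numpy as np
import xarray as xr


def nan_fill_variables(ds):
    """Return the sorted names of variables whose raw _FillValue is NaN.

    Assumes `ds` was opened with decode_cf=False, so _FillValue is still
    an ordinary attribute rather than part of .encoding.
    """
    flagged = []
    for name, var in ds.variables.items():
        fill = var.attrs.get("_FillValue")
        if fill is None:
            continue
        try:
            is_nan = bool(np.all(np.isnan(fill)))
        except TypeError:
            # Non-numeric fill values (e.g. strings) cannot be NaN.
            is_nan = False
        if is_nan:
            flagged.append(str(name))
    return sorted(flagged)


# Hypothetical stand-in for xr.open_dataset("subset.nc", decode_cf=False).
demo = xr.Dataset(
    {"tas": ("lat", np.zeros(3, dtype="float32"), {"_FillValue": np.float32(1e20)})},
    coords={"lat": ("lat", [0.0, 1.0, 2.0], {"_FillValue": np.nan})},
)
print(nan_fill_variables(demo))  # ['lat']
```

Only lat is reported here, because the fill value of tas is a regular number.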

Furthermore, associating the auxiliary coordinate variable "height" with the time_bnds variable makes no sense.
Here is what diff reports when comparing the ncdump -h outputs of the original and the subsetted files:

diff tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912_v20190710.ncd_header ../subset/*ncd_header
1c1
< netcdf tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912_v20190710 {
---
> netcdf tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_19000116-19011216_v20190710 {
3,5c3
< 	time = UNLIMITED ; // (240 currently)
< 	lat = 96 ;
< 	lon = 192 ;
---
> 	time = UNLIMITED ; // (24 currently)
6a5,6
> 	lat = 10 ;
> 	lon = 22 ;
8a9
> 		time:_FillValue = NaN ;
10,11d10
< 		time:units = "days since 1850-1-1 00:00:00" ;
< 		time:calendar = "proleptic_gregorian" ;
14a14,15
> 		time:units = "days since 1850-01-01" ;
> 		time:calendar = "proleptic_gregorian" ;
15a17,18
> 		time_bnds:_FillValue = NaN ;
> 		time_bnds:coordinates = "height" ;
16a20
> 		lat:_FillValue = NaN ;
22a27,28
> 		lat_bnds:_FillValue = NaN ;
> 		lat_bnds:coordinates = "height" ;
23a30
> 		lon:_FillValue = NaN ;
29a37,38
> 		lon_bnds:_FillValue = NaN ;
> 		lon_bnds:coordinates = "height" ;
30a40
> 		height:_FillValue = NaN ;
36a47
> 		tas:_FillValue = 1.e+20f ;
43c54
< 		tas:history = "2019-09-11T14:13:17Z altered by CMOR: Treated scalar dimension: \'height\'. 2019-09-11T14:13:17Z altered by CMOR: replaced missing value flag (-9e+33) and corresponding data with standard missing value (1e+20). 2019-09-11T14:13:18Z altered by CMOR: Inverted axis: lat." ;
---
> 		tas:history = "2019-09-04T13:21:25Z altered by CMOR: Treated scalar dimension: \'height\'. 2019-09-04T13:21:25Z altered by CMOR: replaced missing value flag (-9e+33) and corresponding data with standard missing value (1e+20). 2019-09-04T13:21:25Z altered by CMOR: Inverted axis: lat." ;
46d56
< 		tas:_FillValue = 1.e+20f ;
55c65
< 		:creation_date = "2019-09-11T14:13:17Z" ;
---
> 		:creation_date = "2019-09-04T13:21:25Z" ;
65c75
< 		:history = "2019-09-11T14:13:17Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards." ;
---
> 		:history = "2019-09-04T13:21:25Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards." ;
82,92c92,93
< 		:references = "MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI‐M Earth System Model version 1.2 (MPI‐ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,\n",
< 			"Mueller, W.A. et al. (2018): A high‐resolution version of the Max Planck Institute Earth System Model MPI‐ESM1.2‐HR. J. Adv. Model. EarthSyst.,10,1383–1413, doi:10.1029/2017MS001217" ;
< 		:source = "MPI-ESM1.2-LR (2017): \n",
< 			"aerosol: none, prescribed MACv2-SP\n",
< 			"atmos: ECHAM6.3 (spectral T63; 192 x 96 longitude/latitude; 47 levels; top level 0.01 hPa)\n",
< 			"atmosChem: none\n",
< 			"land: JSBACH3.20\n",
< 			"landIce: none/prescribed\n",
< 			"ocean: MPIOM1.63 (bipolar GR1.5, approximately 1.5deg; 256 x 220 longitude/latitude; 40 levels; top grid cell 0-12 m)\n",
< 			"ocnBgchem: HAMOCC6\n",
< 			"seaIce: unnamed (thermodynamic (Semtner zero-layer) dynamic (Hibler 79) sea ice model)" ;
---
> 		string :references = "MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI‐M Earth System Model version 1.2 (MPI‐ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,\nMueller, W.A. et al. (2018): A high‐resolution version of the Max Planck Institute Earth System Model MPI‐ESM1.2‐HR. J. Adv. Model. EarthSyst.,10,1383–1413, doi:10.1029/2017MS001217" ;
> 		:source = "MPI-ESM1.2-LR (2017): \naerosol: none, prescribed MACv2-SP\natmos: ECHAM6.3 (spectral T63; 192 x 96 longitude/latitude; 47 levels; top level 0.01 hPa)\natmosChem: none\nland: JSBACH3.20\nlandIce: none/prescribed\nocean: MPIOM1.63 (bipolar GR1.5, approximately 1.5deg; 256 x 220 longitude/latitude; 40 levels; top grid cell 0-12 m)\nocnBgchem: HAMOCC6\nseaIce: unnamed (thermodynamic (Semtner zero-layer) dynamic (Hibler 79) sea ice model)" ;
104c105
< 		:tracking_id = "hdl:21.14100/6b679cba-17b8-45eb-90dc-23d170c1998c" ;
---
> 		:tracking_id = "hdl:21.14100/5026a155-3fbd-4232-b32f-37f5576f86eb" ;

Maybe you can have a look at this when you find time.
Thanks and cheers
Angelika

@agstephens
Collaborator

@atmodatcode We spotted this a while back and I thought that the new version of xarray fixed it. However, I can reproduce it with version 0.19.0, as follows:

import xarray as xr

ds = xr.open_dataset("/badc/cmip6/data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc", use_cftime=True)

ds = ds.sel(time=slice("1860-01-01", "1861-01-01"))

ds.to_netcdf("a.nc")

And:

$ ncdump -h a.nc
netcdf a {
dimensions:
        time = UNLIMITED ; // (12 currently)
        bnds = 2 ;
        lat = 96 ;
        lon = 192 ;
variables:
        double time(time) ;
                time:_FillValue = NaN ;
                time:bounds = "time_bnds" ;
                time:axis = "T" ;
                time:long_name = "time" ;
                time:standard_name = "time" ;
                time:units = "days since 1850-01-01" ;
                time:calendar = "proleptic_gregorian" ;
        double time_bnds(time, bnds) ;
                time_bnds:_FillValue = NaN ;
                time_bnds:coordinates = "height" ;
        double lat(lat) ;
                lat:_FillValue = NaN ;
                lat:bounds = "lat_bnds" ;
                lat:units = "degrees_north" ;
                lat:axis = "Y" ;
                lat:long_name = "Latitude" ;
                lat:standard_name = "latitude" ;

We will look into it further.

@ellesmith88 can you remember if we looked into this problem with xarray attaching NaN as the _FillValue to coordinate variables that should not have a missing value at all?

@ellesmith88
Collaborator

@agstephens We have written fixes to solve these issues (as part of the decadal work), so they're not implemented for all files yet. Each dataset would need to have a fix registered for these attributes to be removed. See the functions remove_fill_values and remove_coord_attr on the decadal_fixes branch (https://github.com/roocs/daops/blob/decadal_fixes/daops/data_utils/attr_utils.py). Both are currently included as fixes for each of the decadal datasets.

The fix I wrote in xarray allows the coordinate attribute to be removed (https://github.com/pydata/xarray/pull/5514/files).

Removing the _FillValue can also be done with xarray, using ds[coord_id].encoding["_FillValue"] = None

The fill value fix could be used for all files during the processing but removing the coordinate attribute requires us to know which variables have had this added by xarray.
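
As a rough illustration of the fill value part, the encoding trick above could be applied to every coordinate variable and its bounds variable before writing. This is only a sketch: drop_coord_fill_values is a hypothetical helper (not the daops remove_fill_values function), and it assumes bounds variables are named by a "bounds" attribute on the coordinate, as in the ncdump output earlier in this thread.

```python
import xarray as xr


def drop_coord_fill_values(ds):
    """Set encoding["_FillValue"] = None on coordinates and their bounds.

    Hypothetical helper for illustration. Data variables are left alone
    so that fill values such as 1e20 on tas are kept.
    """
    targets = set(ds.coords)
    for name in ds.coords:
        # Bounds variables are found via the CF "bounds" attribute.
        bounds = ds[name].attrs.get("bounds")
        if bounds in ds.variables:
            targets.add(bounds)
    for name in targets:
        ds[name].encoding["_FillValue"] = None
    return sorted(str(name) for name in targets)


# Small in-memory dataset standing in for a subsetted CMIP6 file.
demo = xr.Dataset(
    {
        "tas": ("lat", [280.0, 281.0, 282.0]),
        "lat_bnds": (("lat", "bnds"), [[-0.5, 0.5], [0.5, 1.5], [1.5, 2.5]]),
    },
    coords={"lat": ("lat", [0.0, 1.0, 2.0], {"bounds": "lat_bnds"})},
)
demo["tas"].encoding["_FillValue"] = 1e20

changed = drop_coord_fill_values(demo)
print(changed)  # ['lat', 'lat_bnds']
```

A later ds.to_netcdf(...) call should then omit _FillValue for those variables. This does nothing about the spurious coordinates attribute.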

@agstephens
Collaborator

Thanks @ellesmith88

I think the correct fix for this is in xarray itself. But we might want to patch it in clisops first. I would imagine we can (i) detect if the coordinate variable has no NaNs in it; (ii) if so, then remove the _FillValue attr.
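
Steps (i) and (ii) could look roughly like the sketch below. The function name remove_unneeded_fill_values and the demo dataset are invented for illustration, and this is not the eventual clisops patch. It only clears the _FillValue encoding on coordinates that contain no missing values.

```python
import numpy as np
import xarray as xr


def remove_unneeded_fill_values(ds):
    """Clear the _FillValue encoding on coordinates that hold no missing data.

    Hypothetical helper: (i) test each coordinate for missing values,
    (ii) if there are none, ask xarray not to write a _FillValue for it.
    """
    cleaned = []
    for name in ds.coords:
        # isnull() also covers NaT and object arrays, not just float NaN.
        has_missing = bool(ds[name].isnull().any())
        if not has_missing:
            ds[name].encoding["_FillValue"] = None
            cleaned.append(str(name))
    return sorted(cleaned)


# "lat" is complete; the auxiliary coordinate "depth" has a real gap.
demo = xr.Dataset(
    {"tas": ("lat", [280.0, 281.0, 282.0])},
    coords={
        "lat": ("lat", [0.0, 1.0, 2.0]),
        "depth": ("lat", [1.0, np.nan, 3.0]),
    },
)

cleaned = remove_unneeded_fill_values(demo)
print(cleaned)  # ['lat']
```

Coordinates that really do contain gaps keep whatever fill value xarray would otherwise write, so no information is lost.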

@ellesmith88
Collaborator

@agstephens Agreed. They have an issue for it, pydata/xarray#2037, which has been open since 2018, so it's not a priority. I had a go at it in June but didn't solve it. I can put the patch into clisops.

@agstephens
Collaborator

Thanks @ellesmith88, please do that in a new PR. That would be great.
