Label TIF optimizations #3

moradology · 2020-08-05T19:25:54Z

While attempting to use the labels described within this repo, it became apparent that a couple of optimizations are advisable:

1. Because the data has only 3 possible values (-1, 0, 1), storing the labels as int16 tifs is significant overkill. A single-byte tif (int8) would save considerable space and transfer time.
2. At the moment, these tifs have a NoData value of -32768. A NoData value of -1 is likely more appropriate, since it tracks the advertised semantics more closely, and incorrectly set NoData values can cause problems for downstream processes.
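To put a rough number on the first point, here is a minimal sketch of the uncompressed per-chip footprint for the two data types. It uses only the Python standard library; the 512x512 chip size and the helper name `raw_size_bytes` are assumptions for illustration, not values taken from this dataset. Compression will change the on-disk ratio.

```python
from array import array


def raw_size_bytes(typecode, width, height):
    """Uncompressed size of a single-band raster for a given array typecode."""
    return array(typecode).itemsize * width * height


# Hypothetical chip dimensions; the issue does not state the real ones.
WIDTH, HEIGHT = 512, 512

# "h" is a signed short (16-bit on common platforms), "b" is a signed 8-bit char.
int16_bytes = raw_size_bytes("h", WIDTH, HEIGHT)
int8_bytes = raw_size_bytes("b", WIDTH, HEIGHT)

print(f"int16: {int16_bytes} bytes, int8: {int8_bytes} bytes")
```

Before compression, the single-byte type halves the raw pixel payload regardless of chip size.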


tyler-c2s · 2020-08-05T20:48:04Z

Thanks for the feedback

1. This is a good point. With our initial use case, we had not considered transfer time between GCS / AWS and clients. I would be happy to reprocess the labels with this update. S2 and S1 imagery will stay as uint16 and float32, respectively.
2. This is a great point, mostly an oversight on our end when we initially converted formats.

Additionally, this might be a good time to switch the compression from LZW to DEFLATE if space and transfer time are a concern.
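As a rough illustration of how DEFLATE behaves on label-like data, the sketch below compresses a synthetic buffer with Python's `zlib` module, which implements DEFLATE. The label pattern is made up for the example (long runs of -1, 0 and 1), so the sizes it prints say nothing about the real chips, and it does not compare against LZW.

```python
import zlib
from array import array

# Synthetic labels (assumption): long runs of each class, as in a segmentation mask.
labels = [-1] * 1000 + [0] * 2000 + [1] * 1000

raw_int16 = array("h", labels).tobytes()  # two bytes per pixel on common platforms
raw_int8 = array("b", labels).tobytes()   # one byte per pixel

# Level 9 asks zlib for its strongest (slowest) compression setting.
packed_int16 = zlib.compress(raw_int16, 9)
packed_int8 = zlib.compress(raw_int8, 9)

print(len(raw_int16), len(packed_int16))
print(len(raw_int8), len(packed_int8))

# DEFLATE is lossless: decompressing returns the original bytes.
restored = zlib.decompress(packed_int8)
```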

moradology · 2020-08-05T23:25:12Z

If reprocessing the dataset, it might also be sensible to map all labels up by one (-1 -> 0, 0 -> 1, 1 -> 2). PyTorch classification losses generally expect class labels >= 0, so that might provide some slight convenience in ML use cases. The GDAL command would be something like gdal_calc.py -A labels.tif --outfile="/path/to/output.tif" --calc="A+1"
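For anyone who would rather shift the labels at load time than reprocess the files, here is a minimal pure-Python sketch of the same mapping. The function name `shift_labels` and the sample chip are hypothetical, and the sketch simply rejects any value outside (-1, 0, 1), so a real pipeline would still need to decide how to treat NoData pixels such as -32768.

```python
def shift_labels(rows, offset=1, valid=(-1, 0, 1)):
    """Return a copy of a 2D label grid with every value shifted by `offset`."""
    shifted = []
    for row in rows:
        for value in row:
            if value not in valid:
                # Fail loudly instead of silently shifting an unexpected value.
                raise ValueError(f"unexpected label value: {value}")
        shifted.append([value + offset for value in row])
    return shifted


# Hypothetical 2x3 chip using the original label values.
chip = [[-1, 0, 1], [1, 1, -1]]
shifted_chip = shift_labels(chip)
print(shifted_chip)  # [[0, 1, 2], [2, 2, 0]]
```

The input grid is left untouched, so the original values remain available for comparison with the paper.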

tyler-c2s · 2020-08-06T00:02:34Z

I think this is something I would support. However, it would likely mark a new version of the data, so that we can keep an older version whose values are consistent with those reported in the accompanying paper.

tyler-c2s · 2020-08-14T22:20:16Z

@moradology I have reprocessed the dataset (not in the bucket yet) to correctly flag nodata values as outlined in the paper, and to use DEFLATE compression and BAND interleaving as suggested in #5. While reprocessing, a few more things came up that are worth mentioning:

Sentinel-1 is stored as float32 with nan as the nodata value. I am planning on leaving it as such, with the nodata value properly set in the image metadata, but since you mentioned PyTorch friendliness I wanted to discuss before committing. There does not seem to be a common standard for S1; however, this discussion suggests -9999 could be an acceptable value.
rio-cogeo, which we use for writing the chips, does not seem to support writing int8, only uint8, so for now the original-style values of (-1, 0, 1) will continue to be stored as int16.
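The int8 versus uint8 constraint can be illustrated with the standard-library `struct` module: -1 has no unsigned 8-bit representation, whereas the shifted values (0, 1, 2) proposed earlier in the thread would fit. This is only a sketch of the numeric ranges; the helper `fits_uint8` is hypothetical and nothing here calls rio-cogeo.

```python
import struct


def fits_uint8(value):
    """Return True if `value` can be packed as an unsigned 8-bit integer."""
    try:
        struct.pack("B", value)
        return True
    except struct.error:
        return False


original = (-1, 0, 1)
shifted = tuple(v + 1 for v in original)

print([fits_uint8(v) for v in original])  # [False, True, True]
print([fits_uint8(v) for v in shifted])   # [True, True, True]

# Reinterpreting the signed byte for -1 as unsigned yields 255, not -1.
reinterpreted = struct.unpack("B", struct.pack("b", -1))[0]
print(reinterpreted)  # 255
```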

tyler-c2s self-assigned this Aug 5, 2020
