Label TIF optimizations #3

Open
moradology opened this issue Aug 5, 2020 · 4 comments

@moradology

While attempting to use the labels described within this repo, it became apparent that a couple of optimizations are advisable:

  1. Because the data takes only three values (-1, 0, 1), int16 tifs are significant overkill. A single-byte tif (int8) would save considerable space and transfer time.
  2. At the moment, these tifs have a NoData value of -32768. A NoData value of -1 is likely more appropriate: it tracks the advertised semantics more closely, and experience teaches that incorrectly set NoData values are sometimes problematic for downstream processes.
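
As a rough illustration of both points, here is a minimal NumPy sketch. It assumes the labels have already been read into an in-memory int16 array; the sample values and the helper name `to_int8_labels` are made up for illustration and are not taken from the repo. It rewrites the -32768 NoData sentinel to -1 and narrows the dtype to int8, which halves the uncompressed size.

```python
import numpy as np

OLD_NODATA = -32768  # NoData value currently set on the label tifs
NEW_NODATA = -1      # proposed NoData value, matching the advertised semantics


def to_int8_labels(labels: np.ndarray) -> np.ndarray:
    """Map the old NoData sentinel to -1 and narrow int16 labels to int8."""
    remapped = np.where(labels == OLD_NODATA, NEW_NODATA, labels)
    # Guard against silently wrapping values that do not fit in int8.
    if remapped.min() < -128 or remapped.max() > 127:
        raise ValueError("label values do not fit in int8")
    return remapped.astype(np.int8)


# Hypothetical 2x3 label chip; real chips would be read from the tifs.
chip = np.array([[-32768, 0, 1], [-1, 1, 0]], dtype=np.int16)
small = to_int8_labels(chip)
print(small.dtype, chip.nbytes, small.nbytes)
```

Writing the result back out would additionally require setting the NoData tag to -1 in the tif metadata, which this sketch does not cover.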
@tyler-c2s
Contributor

Thanks for the feedback

  1. This is a good point. With our initial use case, we had not considered transfer time between GCS / AWS and clients. I would be happy to reprocess the labels with this update. S2 and S1 imagery will stay as uint16 and float32, respectively.
  2. This is a great point; it was mostly an oversight on our end when we initially converted formats.

Additionally, this might be a good time to switch the compression from LZW to DEFLATE if space and transfer time are of concern.
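
To get a rough feel for how dtype and DEFLATE interact, the sketch below compresses the same synthetic label values as int16 and int8 buffers with Python's `zlib` module, which implements DEFLATE. This is only a stand-in under stated assumptions: it uses random synthetic data and raw buffers rather than GeoTIFF strips or tiles, and it does not compare against LZW, so real savings should be measured on the actual chips.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 512x512 label chip with values in {-1, 0, 1}; not real data.
labels16 = rng.integers(-1, 2, size=(512, 512), dtype=np.int16)
labels8 = labels16.astype(np.int8)

for name, arr in (("int16", labels16), ("int8", labels8)):
    raw = arr.tobytes()
    packed = zlib.compress(raw, 9)  # level 9 = strongest zlib setting
    # Round-trip check: decompression must return the original bytes.
    assert zlib.decompress(packed) == raw
    print(f"{name}: raw={len(raw)} bytes, deflate={len(packed)} bytes")
```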

@tyler-c2s tyler-c2s self-assigned this Aug 5, 2020
@moradology
Author

moradology commented Aug 5, 2020

If reprocessing the dataset, it might also be sensible to map all labels up one (-1 -> 0, 0 -> 1, 1 -> 2). Torch models can only work with labels >= 0, so that might provide some slight convenience in ML use cases. The GDAL command would be something like gdal_calc.py -A labels.tif --outfile="/path/to/output.tif" --calc="A+1"
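
For anyone who already has the labels in memory, the same shift can be sketched in NumPy. This is only an illustration, assuming the input values are limited to -1, 0 and 1 and that the shifted values should be stored as uint8; the function name `shift_labels` is hypothetical.

```python
import numpy as np


def shift_labels(labels: np.ndarray) -> np.ndarray:
    """Shift labels up by one (-1 -> 0, 0 -> 1, 1 -> 2) and store as uint8."""
    if labels.min() < -1 or labels.max() > 1:
        raise ValueError("expected label values in {-1, 0, 1}")
    # Add in the original signed dtype first, then narrow to unsigned.
    return (labels + 1).astype(np.uint8)


# Hypothetical 2x2 label chip for illustration.
chip = np.array([[-1, 0], [1, 0]], dtype=np.int16)
shifted = shift_labels(chip)
```

After the shift, 0 takes over the role that -1 had before, so any NoData tag on the output would need to change to match.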

@tyler-c2s
Contributor

I think this is something I would support; however, it would likely mark a new version of the data, so that we can keep an older version whose values are consistent with those reported in the accompanying paper.

@tyler-c2s
Contributor

tyler-c2s commented Aug 14, 2020

@moradology I have reprocessed the dataset (not in the bucket yet) to correctly flag nodata values as outlined in the paper, and to use DEFLATE compression and BAND interleaving as suggested in #5. While reprocessing, a few more things came up that are worth mentioning.

  1. Sentinel-1 is stored as float32 with nan as the nodata value. I am planning on leaving it as such, with nodata properly set in the image metadata, but since you mentioned PyTorch friendliness I wanted to discuss before committing. There does not seem to be a common standard for S1; however, this discussion suggests -9999 could be an acceptable value.

  2. rio-cogeo, which we use for writing the chips, does not seem to support writing int8, only uint8, so for now the original style values of (-1, 0, 1) will continue to be int16.
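
On the Sentinel-1 point, either convention can be handled at load time. Below is a minimal sketch, assuming a float32 array in which NaN marks nodata; the -9999 sentinel is the candidate value from the discussion referenced above, and the helper name `nan_to_sentinel` is hypothetical.

```python
import numpy as np

SENTINEL = np.float32(-9999.0)  # candidate nodata value, not a settled standard


def nan_to_sentinel(s1: np.ndarray):
    """Return (filled array, validity mask) for a NaN-flagged float32 chip."""
    valid = ~np.isnan(s1)
    filled = np.where(valid, s1, SENTINEL).astype(np.float32)
    return filled, valid


# Hypothetical 2x2 Sentinel-1 chip with one nodata pixel.
chip = np.array([[0.5, np.nan], [-12.25, 3.0]], dtype=np.float32)
filled, valid = nan_to_sentinel(chip)
```

Keeping the boolean mask alongside the filled array lets downstream code exclude nodata pixels regardless of which sentinel is eventually chosen.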
