Identifier function false positives #12

Open
astrofrog opened this issue May 23, 2016 · 5 comments

Comments

@astrofrog
Member

At the moment, the is_geospatial function is recognizing pretty much any RGB image file:

In [1]: from glue_geospatial.data_factory import is_geospatial

In [2]: is_geospatial('random_image.png')
Out[2]: True

@robintw - I wonder whether there is some kind of meta-data we can look for that would identify files as being specifically geospatial data?

@robintw
Collaborator

robintw commented May 23, 2016

We could check whether a Coordinate Reference System (CRS) is defined; that should show whether or not it is spatial data.

Something like:

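import rasterio

with rasterio.open(filename) as r:
    if not r.crs:
        # No CRS defined (empty or None): no spatial data
        has_spatial_data = False
    else:
        # A CRS is defined: we have spatial data
        has_spatial_data = True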

@robintw
Collaborator

robintw commented May 25, 2016

Actually, the problem with this is that various 'geospatial' datasets have no CRS defined. This can be for various reasons, including lazy programmers (e.g. some of my test algorithms don't propagate the CRS info between files properly), processing errors, or a deliberate choice not to provide georeferencing information in the file itself (sometimes it is provided in a separate metadata file, for some unknown reason).

Is it a particular issue if random_image.png is picked up by this DataFactory? Would you prefer that it wasn't picked up at all, or that it was picked up by another factory?

@astrofrog
Member Author

@robintw - if you have a random PNG file (say of a cat), then the main difference between the current RGB data factory and the geospatial one is that the names of the components will be Red, Green, Blue, and Band 1, Band 2, and Band 3 respectively.

Is there a limited number of extensions used for geospatial data, or are JPEG and PNG also used, for instance?

I guess we just need to decide on the priority of the data factories - we could for instance give the generic RGB reader priority if and only if no metadata is present in the RGB file. But in this case, would you still want the components named Band 1, Band 2, Band 3?

@robintw
Collaborator

robintw commented May 26, 2016

Yes, we can probably do this based on extensions: satellite data are never (to my knowledge) in JPG or PNG. Some are, however, in JPEG2000 (extension .jp2).
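A minimal sketch of the extension-based check suggested here, assuming that PNG and JPEG files should be left to the generic RGB reader while JPEG2000 stays eligible. The function name `looks_geospatial_by_extension` and the set of excluded extensions are hypothetical and not part of glue_geospatial:

```python
import os

# Assumption: extensions that the generic RGB reader should handle instead.
# JPEG2000 (.jp2) is deliberately not listed, since satellite data can use it.
NON_GEOSPATIAL_EXTENSIONS = {'.png', '.jpg', '.jpeg'}


def looks_geospatial_by_extension(filename):
    """Return False for plain-image extensions, True otherwise (hypothetical helper)."""
    extension = os.path.splitext(filename)[1].lower()
    return extension not in NON_GEOSPATIAL_EXTENSIONS
```

With this, `looks_geospatial_by_extension('random_image.png')` is False, while a `.jp2` or `.tif` file would still go on to whatever rasterio-based check follows.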

I have no particular preferences about standard RGB data: probably Red, Green and Blue are better as names of components for them.

How do we set the priorities for DataFactories? Is it a single static constant for each factory, or can it change as we get more information (e.g. we try getting metadata using rasterio, and if we can't find any we decrease the relative priority of the geospatial reader)?

@astrofrog
Member Author

@robintw - the priority is set by an argument in the @data_factory decorator:

https://github.com/glue-viz/glue/blob/master/glue/core/data_factories/hdf5.py#L41

I hadn't thought of having the identifier return the priority - that would be even better, since it would allow more fine tuning as you suggest.
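A rough sketch of what an identifier that returns a priority might look like. Nothing here is existing glue API: the function name, the priority values, and the `has_crs` argument (standing in for the result of a rasterio metadata lookup) are all assumptions for illustration:

```python
import os

# Assumed priority levels; real values would need to be chosen relative to
# the priorities of the other registered data factories.
NOT_SUPPORTED = 0
LOW_PRIORITY = 100
HIGH_PRIORITY = 1000

PLAIN_IMAGE_EXTENSIONS = {'.png', '.jpg', '.jpeg'}


def geospatial_priority(filename, has_crs):
    """Hypothetical identifier returning a priority instead of True/False.

    has_crs stands in for whether georeferencing metadata was found.
    """
    extension = os.path.splitext(filename)[1].lower()
    if has_crs:
        # Georeferencing metadata present: prefer the geospatial reader.
        return HIGH_PRIORITY
    if extension in PLAIN_IMAGE_EXTENSIONS:
        # Plain image without metadata: leave it to the generic RGB reader.
        return NOT_SUPPORTED
    # Other raster formats may be geospatial data whose CRS was lost.
    return LOW_PRIORITY
```

Under these assumptions a cat PNG without metadata would fall through to the generic RGB reader, while a GeoTIFF that lost its CRS would still be offered to the geospatial reader at a lower priority.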

2 participants