Add BalancedRandomGeoSampler balancing positives and negatives #1883

Description

Summary

Lets the user balance positive and negative samples drawn from an IntersectionDataset of a RasterDataset and a VectorDataset.

Rationale

Balancing may help reduce bias and improve generalisation. RandomGeoSampler does not currently support this, so you may end up with heavily unbalanced datasets.

Implementation

def __iter__(self):

  1. Choose a random element (hit) from the index.
  2. Create a regular grid covering the bounds of the hit, similar to GridGeoSampler.
  3. Find the intersection (vector area) between the raster footprint (valid pixels) and the VectorDataset (original vector mask).
  4. Using this intersection, split the grid cells into two sets: positives and negatives.
  5. Yield random grid cells, balancing the picks between positive and negative cells.
  6. Start over from step 1 until the sampler's length is reached.
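
Steps 2, 4 and 5 could look roughly like the sketch below. It is a simplification in plain Python, not the proposed sampler: footprints and labels are reduced to axis-aligned boxes instead of real vector geometries, and strict alternation is only one possible way to balance. All names (`make_grid`, `split_cells`, `balanced_cells`) are hypothetical.

```python
import random
from typing import Iterator, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def make_grid(bounds: Box, size: float) -> List[Box]:
    """Tile `bounds` with square cells of side `size`; partial edge cells are dropped."""
    minx, miny, maxx, maxy = bounds
    nx = int((maxx - minx) // size)
    ny = int((maxy - miny) // size)
    return [
        (minx + i * size, miny + j * size, minx + (i + 1) * size, miny + (j + 1) * size)
        for j in range(ny)
        for i in range(nx)
    ]


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def split_cells(cells: List[Box], labels: List[Box]) -> Tuple[List[Box], List[Box]]:
    """Split cells into positives (overlap some label box) and negatives (overlap none)."""
    positives, negatives = [], []
    for cell in cells:
        target = positives if any(overlaps(cell, lab) for lab in labels) else negatives
        target.append(cell)
    return positives, negatives


def balanced_cells(
    bounds: Box, labels: List[Box], size: float, length: int, seed: Optional[int] = None
) -> Iterator[Box]:
    """Yield `length` random cells, alternating between the positive and negative sets."""
    rng = random.Random(seed)
    positives, negatives = split_cells(make_grid(bounds, size), labels)
    pools = [pool for pool in (positives, negatives) if pool]  # skip an empty class
    if not pools:
        return
    for k in range(length):
        yield rng.choice(pools[k % len(pools)])
```

With a 4x4 hit, a patch size of 1 and a single label box covering the lower-left 2x2 corner, the grid has 16 cells, 4 of them positive, and `balanced_cells` draws from each set half of the time.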

The resulting grids should be cached for the next time the same hit is randomly chosen, or pre-computed.
One thought is to extend the rtree index with these grid cells, and even extend the index dimension with the label (class = positive or negative). The query could then become (minx, maxx, miny, maxy, mint, maxt, class) in some order. (But this would require the GeoDataset to know the desired patch_size.)
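
The caching idea could be sketched as a small lookup keyed by the hit bounds and the patch size. This is a plain-Python illustration, not an extension of the rtree index; `GridCache` and its `compute` callback are hypothetical names. It also shows why the patch size has to be known: it is part of the key.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)
Split = Tuple[List[Box], List[Box]]  # (positive cells, negative cells)


class GridCache:
    """Remember the positive/negative split of each hit so it is computed only once."""

    def __init__(self, compute: Callable[[Box, float], Split]) -> None:
        self._compute = compute  # function that builds the grid and splits it
        self._store: Dict[Tuple[Box, float], Split] = {}
        self.misses = 0  # number of times the split had to be computed

    def get(self, hit_bounds: Box, patch_size: float) -> Split:
        # The same hit gives a different grid for a different patch size,
        # so both values are needed to identify a cached entry.
        key = (hit_bounds, patch_size)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute(hit_bounds, patch_size)
        return self._store[key]
```

Pre-computing would amount to calling `get` once for every hit in the index before sampling starts.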

Alternatives

One good alternative/evolution that I see is to replace the rtree index created in IntersectionDataset. Instead of the current rtree index, add the vectorized raster footprints and/or vector data (actual features/shapes from e.g. shapefiles) to GeoPandas GeoDataFrames, and use their rtree implementation to rapidly find areas where they intersect. The temporal dimension would not be part of the index (not supported), but can be applied as a filter once spatial matches are found. At the end of this article they split their polygons [raster footprints] into small grid cells [of patch size] and can rapidly retrieve cells [samples] where there is overlap with the other dataset [labels], or no overlap. The underlying rtree indices in the RasterDatasets and VectorDatasets would stay the same, and still be used to read the data upon sampling.
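
The "spatial match first, temporal filter second" order can be illustrated without any geospatial library. In the sketch below a linear scan over bounding boxes stands in for the GeoDataFrame spatial index, so it shows only the query order, not the performance; `Footprint`, `spatial_matches`, `temporal_filter` and `query` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


class Footprint(NamedTuple):
    name: str
    bounds: Box
    mint: float  # start of the time range covered by this footprint
    maxt: float  # end of the time range covered by this footprint


def spatial_matches(items: List[Footprint], query_bounds: Box) -> List[Footprint]:
    """Stage 1: keep footprints whose bounding box overlaps the query box."""
    qminx, qminy, qmaxx, qmaxy = query_bounds
    return [
        f for f in items
        if f.bounds[0] < qmaxx and qminx < f.bounds[2]
        and f.bounds[1] < qmaxy and qminy < f.bounds[3]
    ]


def temporal_filter(items: List[Footprint], mint: float, maxt: float) -> List[Footprint]:
    """Stage 2: keep footprints whose time range overlaps [mint, maxt]."""
    return [f for f in items if f.mint <= maxt and mint <= f.maxt]


def query(items: List[Footprint], bounds: Box, mint: float, maxt: float) -> List[Footprint]:
    """Spatial lookup first, then the temporal filter on the (smaller) result."""
    return temporal_filter(spatial_matches(items, bounds), mint, maxt)
```

In the GeoPandas version, stage 1 would be handled by the spatial index of the GeoDataFrame and stage 2 by an ordinary column filter on the matched rows.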

Additional information

Probably depends on #1881 being merged.

EDIT: add alternative

