Skip to content

Conversation

@favyen2
Copy link
Collaborator

@favyen2 favyen2 commented Nov 18, 2025

  • Switch from manual parsing to pydantic models for the bulk of the dataset config.
  • Switch from manual parsing to jsonargparse for initializing data sources. Now, the dataset config has class_path and init_args fields for the data source, where the latter is a dict handled by jsonargparse. Some backwards compatibility is maintained.
  • Add DataSourceContext to pass the dataset path and LayerConfig to the data source. Most data sources use the context to do things like adjust file paths that are relative to the dataset root directory. The context is passed to the data source by injecting it into the init_args. Not the cleanest solution, but it seemed better than passing it via a method after instantiation.
  • Remove the RasterFormats and VectorFormats class registries, and the manual argument parsing. Instead, these are now also initialized via jsonargparse, and the class_path is directly set from the dataset config. Some backwards compatibility for RasterFormats is maintained.
  • Remove the Materializers registry. We have been only using RasterMaterializer and VectorMaterializer for quite some time, so now we just directly initialize them depending on the layer type.
  • Remove rslearn.data_sources.raster_source. This module provided is_raster_needed, but that was only still used in gcp_public_data, and I have changed gcp_public_data now to determine the needed assets upon initialization (similar to other data sources).
  • Remove TileStore configuration backwards compatibility. Now only the jsonargparse format is accepted. This shouldn't break much because we rarely specify the tile_store in the dataset config, and the old format has been deprecated for a while (although it looks like it was still copied and pasted recently into a few places, mostly in tests which I have updated).

This change will require modifications to all existing dataset configs.

- Switch from manual parsing to pydantic models for the bulk of the dataset config.
- Use jsonargparse for initializing data sources. The dataset config has a class_path
  and init_args for the data source, where the latter is arbitrary dict that is to be
  handled by jsonargparse.
- Add DataSourceContext to pass the dataset and LayerConfig to the data source. Most
  data sources use the context to do things like adjust file paths that are relative
  to the dataset root directory. The context is passed to the data source by injecting
  it into the init_args.
- Remove the RasterFormats and VectorFormats class registries. Instead, these are now
  also initialized via jsonargparse, and the class_path is directly set from the
  dataset config.
- Remove the Materializers registry. We have been only using RasterMaterializer and
  VectorMaterializer for quite some time, so now we just directly initialize them
  depending on the layer type.
- Remove rslearn.data_sources.raster_source, it provides is_raster_needed but this was
  only used in gcp_public_data and I changed that now to determine the needed assets
  upon initialization (similar to other data sources).
- Remove tile store backwards compatibility, now only the jsonargparse format is
  accepted. This shouldn't break much because we rarely specify the tile_store in the
  dataset config.
@favyen2 favyen2 requested a review from APatrickJ November 18, 2025 21:36
@cmwilhelm
Copy link
Contributor

@favyen2 can you elaborate on the scope of the changes you're referring to here?

This change will require modifications to all existing dataset configs.

I haven't looked at this PR deeply; the summary notes seem positive. Still, I'm wondering if changes that break core API contracts should be presented in design docs to the broader group at this point.

@favyen2
Copy link
Collaborator Author

favyen2 commented Nov 18, 2025

Here is an example.

Old version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "cache_dir": "cache/planetary_computer",
        "duration": "270d",
        "harmonize": true,
        "ingest": false,
        "name": "rslearn.data_sources.planetary_computer.Sentinel2",
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        },
        "sort_by": "eo:cloud_cover",
        "time_offset": "-90d"
      },
      "type": "raster"
    }

New version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "class_path": "rslearn.data_sources.planetary_computer.Sentinel2",
        "init_args": {
          "cache_dir": "cache/planetary_computer",
          "harmonize": true,
          "sort_by": "eo:cloud_cover",
        },
        "duration": "270d",
        "time_offset": "-90d",
        "ingest": false,
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        }
      },
      "type": "raster"
    }

The main change is the separation of generic data source configuration options (like duration, time_offset, and query_config) from source-specific ones (like cache_dir, harmonize, and sort_by that are arguments to rslearn.data_sources.planetary_computer.Sentinel2). It is hard to avoid since in some ways that is the point of this change, otherwise there isn't a good way to e.g. throw an error if an unknown key is passed, because different parts of the system won't know if there are extra config options that will be read from the same config section by other parts of the system.

@favyen2 favyen2 closed this Nov 18, 2025
@favyen2 favyen2 reopened this Nov 18, 2025
Copy link
Collaborator

@APatrickJ APatrickJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on!

@favyen2 favyen2 requested a review from APatrickJ November 21, 2025 17:18
Copy link
Collaborator

@APatrickJ APatrickJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Made a suggestion about testing the backward compatibility converter

@favyen2 favyen2 merged commit cec4819 into master Nov 24, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants