Improve dataset configuration #371

favyen2 · 2025-11-18T21:36:08Z

Switch from manual parsing to pydantic models for the bulk of the dataset config.
Switch from manual parsing to jsonargparse for initializing data sources. Now, the dataset config has class_path and init_args fields for the data source, where the latter is a dict handled by jsonargparse. Some backwards compatibility is maintained.
Add DataSourceContext to pass the dataset path and LayerConfig to the data source. Most data sources use the context to do things like adjust file paths that are relative to the dataset root directory. The context is passed to the data source by injecting it into the init_args. Not the cleanest solution, but it seemed better than passing it via a method after instantiation.
Remove the RasterFormats and VectorFormats class registries, and the manual argument parsing. Instead, these are now also initialized via jsonargparse, and the class_path is directly set from the dataset config. Some backwards compatibility for RasterFormats is maintained.
Remove the Materializers registry. We have been only using RasterMaterializer and VectorMaterializer for quite some time, so now we just directly initialize them depending on the layer type.
Remove rslearn.data_sources.raster_source. This module provided is_raster_needed, but that was only still used in gcp_public_data, and I have changed gcp_public_data now to determine the needed assets upon initialization (similar to other data sources).
Remove TileStore configuration backwards compatibility. Now only the jsonargparse format is accepted. This shouldn't break much because we rarely specify the tile_store in the dataset config, and the old format has been deprecated for a while (although it looks like it was still copied and pasted recently into a few places, mostly in tests which I have updated).
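
As a rough illustration of the class_path / init_args pattern and the context injection described above, here is a minimal standard-library sketch. It is not how jsonargparse or rslearn implement this: `instantiate_from_config`, the `context` key name, and the `DataSourceContext` field names are assumptions, and `types.SimpleNamespace` stands in for a real data source class only because it accepts arbitrary keyword arguments.

```python
import importlib
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any, Optional


@dataclass
class DataSourceContext:
    """Hypothetical stand-in; the field names are assumptions."""

    ds_path: Optional[PurePosixPath] = None
    layer_config: Any = None


def instantiate_from_config(config: dict, context: DataSourceContext) -> Any:
    """Import class_path, inject the context into init_args, then construct."""
    module_name, _, class_name = config["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    init_args = dict(config.get("init_args", {}))  # copy so the config is not mutated
    init_args["context"] = context  # "context" as the key name is an assumption
    return cls(**init_args)


# types.SimpleNamespace accepts arbitrary keyword arguments, so it serves as a
# placeholder class here; a real data source would declare these parameters.
config = {
    "class_path": "types.SimpleNamespace",
    "init_args": {"cache_dir": "cache/planetary_computer"},
}
context = DataSourceContext(ds_path=PurePosixPath("/data/my_dataset"))
source = instantiate_from_config(config, context)

# A data source can use the injected context to resolve dataset-relative paths.
resolved_cache_dir = source.context.ds_path / source.cache_dir
```

Injecting the context before construction means the object is fully usable as soon as `__init__` returns, which is the trade-off against a separate setter method called after instantiation.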

This change will require modifications to all existing dataset configs.

- Switch from manual parsing to pydantic models for the bulk of the dataset config.
- Use jsonargparse for initializing data sources. The dataset config has a class_path and init_args for the data source, where the latter is an arbitrary dict that is handled by jsonargparse.
- Add DataSourceContext to pass the dataset and LayerConfig to the data source. Most data sources use the context to do things like adjust file paths that are relative to the dataset root directory. The context is passed to the data source by injecting it into the init_args.
- Remove the RasterFormats and VectorFormats class registries. Instead, these are now also initialized via jsonargparse, and the class_path is directly set from the dataset config.
- Remove the Materializers registry. We have been only using RasterMaterializer and VectorMaterializer for quite some time, so now we just directly initialize them depending on the layer type.
- Remove rslearn.data_sources.raster_source. It provides is_raster_needed, but this was only used in gcp_public_data, and I changed that to determine the needed assets upon initialization (similar to other data sources).
- Remove tile store backwards compatibility. Now only the jsonargparse format is accepted. This shouldn't break much because we rarely specify the tile_store in the dataset config.

cmwilhelm · 2025-11-18T21:45:54Z

@favyen2 can you elaborate on the scope of the changes you're referring to here?

> This change will require modifications to all existing dataset configs.

I haven't looked at this PR deeply; the summary notes seem positive. Still, I'm wondering if changes that break core API contracts should be presented in design docs to the broader group at this point.

favyen2 · 2025-11-18T22:11:27Z

Here is an example.

Old version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "cache_dir": "cache/planetary_computer",
        "duration": "270d",
        "harmonize": true,
        "ingest": false,
        "name": "rslearn.data_sources.planetary_computer.Sentinel2",
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        },
        "sort_by": "eo:cloud_cover",
        "time_offset": "-90d"
      },
      "type": "raster"
    }

New version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "class_path": "rslearn.data_sources.planetary_computer.Sentinel2",
        "init_args": {
          "cache_dir": "cache/planetary_computer",
          "harmonize": true,
          "sort_by": "eo:cloud_cover"
        },
        "duration": "270d",
        "time_offset": "-90d",
        "ingest": false,
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        }
      },
      "type": "raster"
    }

The main change is the separation of generic data source configuration options (like duration, time_offset, and query_config) from source-specific ones (like cache_dir, harmonize, and sort_by, which are arguments to rslearn.data_sources.planetary_computer.Sentinel2). This is hard to avoid, since in some ways it is the point of the change: without the separation there isn't a good way to, for example, throw an error if an unknown key is passed, because each part of the system can't know whether other parts will read extra config options from the same config section.
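
To make the unknown-key point concrete, here is a minimal sketch of strict validation once the two groups of options are separated. It uses only the standard library rather than pydantic or jsonargparse. `GENERIC_KEYS` is taken from the example above and may not be the complete list, and `validate_data_source_config`, `DemoSentinel2`, and the `demo.DemoSentinel2` class path are hypothetical names.

```python
import inspect
from typing import Optional

# Keys owned by the generic data source config, taken from the example above.
# Assumption: the real list may be longer.
GENERIC_KEYS = {
    "class_path", "init_args", "duration", "time_offset", "ingest", "query_config",
}


class DemoSentinel2:
    """Hypothetical source-specific class with three constructor arguments."""

    def __init__(
        self, cache_dir: str, harmonize: bool = False, sort_by: Optional[str] = None
    ) -> None:
        self.cache_dir = cache_dir
        self.harmonize = harmonize
        self.sort_by = sort_by


def validate_data_source_config(config: dict, cls: type) -> None:
    """Raise ValueError for keys that neither the generic config nor cls accepts.

    Sketch only: a constructor that takes **kwargs is not handled.
    """
    unknown = set(config) - GENERIC_KEYS
    if unknown:
        raise ValueError(f"unknown data source options: {sorted(unknown)}")
    allowed = set(inspect.signature(cls.__init__).parameters) - {"self"}
    unknown_args = set(config.get("init_args", {})) - allowed
    if unknown_args:
        raise ValueError(f"unknown init_args for {cls.__name__}: {sorted(unknown_args)}")


good = {
    "class_path": "demo.DemoSentinel2",
    "init_args": {"cache_dir": "cache/planetary_computer", "harmonize": True},
    "duration": "270d",
}
validate_data_source_config(good, DemoSentinel2)  # no error

# A misspelled generic option is now caught instead of being silently ignored.
typo = {**good, "time_ofset": "-90d"}
try:
    validate_data_source_config(typo, DemoSentinel2)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```

With a flat config section, the first check would be impossible: any unrecognized key might legitimately belong to the source-specific class.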

APatrickJ

Thanks for taking this on!

APatrickJ

LGTM!

Made a suggestion about testing the backward compatibility converter
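
A converter of that kind, and a test for it, could look roughly like the following. This is a sketch based only on the old and new examples in this thread, not the converter in the PR: `convert_legacy_data_source`, the `GENERIC_KEYS` set, and the rule that every other key moves into `init_args` are assumptions.

```python
# Generic options seen outside init_args in the new-format example.
# Assumption: the real converter may recognize more keys.
GENERIC_KEYS = {"duration", "time_offset", "ingest", "query_config"}


def convert_legacy_data_source(config: dict) -> dict:
    """Convert an old-style data_source dict (keyed by "name") to the new layout."""
    if "class_path" in config:
        return config  # already in the new format
    converted: dict = {"class_path": config["name"], "init_args": {}}
    for key, value in config.items():
        if key == "name":
            continue
        if key in GENERIC_KEYS:
            converted[key] = value  # stays at the top level
        else:
            converted["init_args"][key] = value  # source-specific argument
    return converted


def test_convert_legacy_data_source() -> None:
    old = {
        "name": "rslearn.data_sources.planetary_computer.Sentinel2",
        "cache_dir": "cache/planetary_computer",
        "harmonize": True,
        "ingest": False,
        "duration": "270d",
    }
    new = convert_legacy_data_source(old)
    assert new == {
        "class_path": "rslearn.data_sources.planetary_computer.Sentinel2",
        "init_args": {"cache_dir": "cache/planetary_computer", "harmonize": True},
        "ingest": False,
        "duration": "270d",
    }
    # Converting an already-converted config leaves it unchanged.
    assert convert_legacy_data_source(new) == new
```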

favyen2 requested a review from APatrickJ November 18, 2025 21:36

favyen2 closed this Nov 18, 2025

favyen2 reopened this Nov 18, 2025

favyen2 added 5 commits November 18, 2025 14:26

rename layer_type back to type and capitalize the enum values so they are parsed in same as before

0f25817

fix test_s3_dataset test

ca78f83

actually fix test

101950c

Merge remote-tracking branch 'origin/master' into favyen/20251118-dataset-config

59afec7

various test fixes

0d65f6d

This was referenced Nov 19, 2025

max_cloud_cover is ignored by Planetary Computer Sentinel‑2 data source #361

Open

Configuration overhaul #2

Closed

APatrickJ reviewed Nov 20, 2025

rslearn/data_sources/planetary_computer.py

rslearn/data_sources/planetary_computer.py

rslearn/data_sources/usgs_landsat.py

rslearn/data_sources/planet.py

favyen2 added 3 commits November 20, 2025 10:46

address feedback

0638496

fix missing commit

2a13157

Add backwards compatibility.

0d2a40d

favyen2 requested a review from APatrickJ November 21, 2025 17:18

APatrickJ reviewed Nov 21, 2025

rslearn/config/dataset.py

APatrickJ approved these changes Nov 21, 2025

favyen2 added 4 commits November 24, 2025 06:40

Merge remote-tracking branch 'origin/master' into favyen/20251118-dataset-config

9a1b427

Add tests for config backwards compatibility.

aa51921

bump version

9703cec

add test for raster format config compatibility

06d0fc5

favyen2 merged commit cec4819 into master Nov 24, 2025
4 checks passed
