Feature description
Current State
The pipeline has matured over the past months and can now run yield studies on large regions. For reference, some runtimes on an HPC node using 80 cores: downloading RS imagery and creating daily LAI data for all of Ukraine for one year takes about 4 hours, and running a yield study on a country the size of Ukraine at Admin Level 2 (~600 regions) for one year takes less than 5 hours.
Limitations & Refactoring
The current pipeline design has worked well for our studies so far, but we are now reaching a point at which it is hitting limitations or should be refactored for the sake of maintainability. The following areas should be revisited for refactoring and improvement:
Input & Setup
- The processing from shapefile to GeoJSONs is decoupled from the pipeline. While I understand the original intention behind this, it currently makes pipeline execution more tedious and also makes it hard to backtrack which shapefile was used in the common scenario where numerous shapefiles are floating around for the same country.
- It is currently also very tedious to switch out APSIM files. E.g. the user has updated the APSIM file that was used for 50% of the regions and now has to create a completely new study through the helper script.
- While the current helper script facilitates this process, it is not ideal for running numerous experiments.
=> We might want to set the shapefile as the input to the Snakemake pipeline (in the `config`) and extract the GeoJSONs within the pipeline.
=> We might want to have a parameter `APSIM_files` in the Snakemake config. Here the user can specify APSIM files and assign them to:
   - specific years,
   - regions,
   - regions through a column in the attribute table, to avoid specifying an APSIM file for each region.
This could be automatically resolved as a first step in the pipeline: create the base directory, assign APSIM files to regions, and then replace the Clock parameter to match the simulation timepoint settings.
This is very similar to what the helper script is currently doing.
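A minimal sketch of how such a resolution step could look as a first pipeline rule, assuming hypothetical config keys `shapefile`, `APSIM_files`, and `region_name_column` (none of these names are final, and per-year assignment is omitted for brevity):

```python
# Hypothetical sketch of the resolution step: split the shapefile into
# per-region GeoJSONs and assign an APSIM file to each region. The config
# keys, the region-name column and the attribute column are placeholders.
from pathlib import Path
import geopandas as gpd

def setup_regions(config: dict, out_dir: str = "regions") -> dict:
    gdf = gpd.read_file(config["shapefile"])
    apsim_cfg = config["APSIM_files"]
    name_col = config.get("region_name_column", "NAME_2")
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    assignments = {}
    for _, row in gdf.iterrows():
        region = row[name_col]
        geojson_path = Path(out_dir) / f"{region}.geojson"
        gdf[gdf[name_col] == region].to_file(geojson_path, driver="GeoJSON")

        # An explicit per-region entry wins; otherwise fall back to the
        # group defined in a column of the attribute table.
        apsim_file = apsim_cfg.get("per_region", {}).get(region)
        group_col = apsim_cfg.get("by_column")
        if apsim_file is None and group_col is not None:
            apsim_file = apsim_cfg["groups"][row[group_col]]

        assignments[region] = {"geojson": str(geojson_path), "apsim": apsim_file}
    return assignments
```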
Collecting CHIRPS data
- Currently the CHIRPS data for all regions is extracted in a single job and saved to parquet. This avoids having to open tens of thousands of global TIFFs (historical data) in each job and also only needs to rasterize the polygons once. However, this is also not yet super fast, especially for many regions, as a single job is computing stats for every polygon and every day.
=> We might want to download and store CHIRPS data directly in a format that allows us to read time series per pixel from disk, e.g. in a ZARR store. However, this might not yet be worth the implementation overhead, as it is fine to have a single job for extracting the CHIRPS data even if it takes an hour, since other jobs (e.g. LAI extraction) can run in parallel. The current runtime is approx. 20 minutes for 600 regions.
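For illustration, a minimal sketch of reading a per-pixel time series from such a store with xarray; the store path `chirps.zarr` and the variable name `precip` are assumptions:

```python
# Minimal sketch, assuming daily CHIRPS precipitation has been written to a
# Zarr store chunked so that full time series per pixel are cheap to read,
# e.g. chunks={"time": -1, "lat": 256, "lon": 256} at write time.
import xarray as xr

ds = xr.open_zarr("chirps.zarr")  # dims: (time, lat, lon)

# Read the full daily time series for a single pixel without touching
# thousands of individual global TIFFs.
series = ds["precip"].sel(lat=48.5, lon=32.3, method="nearest").load()
print(series.to_series().head())
```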
Result Aggregation & Evaluation
- Currently the evaluation process requires a CSV file with the reported yield and the region names for each year.
- The aggregation is limited to aggregating the simulation regions by specific columns in their shapefile, so no custom shapes can currently be used to aggregate the results.
=> Provide a set of shapefiles for aggregation. We can accumulate the estimated yield pixels within the polygons from the provided shapefiles to produce aggregated results at custom levels (see the sketch after this list).
=> Provided shapefiles should contain a column with the reported yield. This also keeps the geometry and reference data consistent, as it is often unclear whether reported data matches the inference geometries.
- The aggregation scripts accumulate results by iterating over subdirectories in the timepoint base directory. Sometimes there are remaining artifacts from a previous run, or some regions are actually not part of the yield study (as defined in `config -> regions`), yet they will still be included in the statistics. While this is not a big problem if the pipeline is run as intended, it is likely that users will create inconsistent data here.
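A rough sketch of how yield pixels could be accumulated within user-provided aggregation polygons, assuming a yield raster per timepoint and a shapefile with a hypothetical `reported_yield` column:

```python
# Sketch only: aggregate estimated yield pixels over user-provided polygons.
# File names and the `reported_yield` column are hypothetical.
import geopandas as gpd
import numpy as np
import rasterio
from rasterio.features import geometry_mask

def aggregate_yield(raster_path: str, shapefile_path: str) -> gpd.GeoDataFrame:
    polygons = gpd.read_file(shapefile_path)
    with rasterio.open(raster_path) as src:
        polygons = polygons.to_crs(src.crs)
        data = src.read(1)
        nodata = src.nodata
        means = []
        for geom in polygons.geometry:
            mask = geometry_mask([geom], out_shape=src.shape,
                                 transform=src.transform, invert=True)
            values = data[mask]
            if nodata is not None:
                values = values[values != nodata]
            means.append(float(np.nanmean(values)) if values.size else np.nan)
    polygons["estimated_yield"] = means
    # Reported yield comes from the same shapefile, so geometry and
    # reference data stay consistent, as proposed above.
    return polygons[["geometry", "reported_yield", "estimated_yield"]]
```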
LAI Matching
- Currently LAI matching is performed for each simulation region. However, in the future we might want to run simulations at the level with the highest information density, which would be the met-data pixel size; to get the best possible matching, only the matching step could then be performed per field.
=> See [FEATURE] Optimize for running at field level scale #35 for details on a possible caching implementation.
LAI processing
- Currently we have two pipelines in place to produce LAI: one for GEE that is based on Snakemake, and one producing from a STAC catalog based on a simple Python runner. For the sake of consistency the new version should also be moved to Snakemake.
- The new version uses a script that parallelizes the LAI inference internally, instead of having multiple processes spawned externally (e.g. from Snakemake).
- Both pipelines share some parts of the code (e.g. the model), however this code is duplicated.
- Also, the LAI inference code exists in two versions, one per pipeline, and should likely be unified. This avoids having to make changes in multiple places, for example when adding a new model. For this we should ensure that the imagery output from both pipelines has exactly the same format, instead of having two LAI inference scripts handling these differences.
=> Refactor to share more elements like the model and use Snakemake for both pipelines. Standardize the outputs from both pipelines so a single LAI prediction script can exist; it should be independent from the pipeline and simply predict on the input if it matches the expected band order.
=> @nikhilsrajan proposed to save the LAI data as uint16, which I agree would make a lot of sense to reduce our storage requirements for multi-year national-scale studies. A small sketch of such an encoding is given below.
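As a rough sketch of what the uint16 encoding could look like (the scale factor and nodata value are placeholders, not a decided convention):

```python
# Sketch: store LAI as uint16 with a fixed scale factor instead of float32.
# The scale factor of 1000 and nodata value of 65535 are assumptions.
import numpy as np
import rasterio

SCALE = 1000               # LAI of 3.217 -> 3217
NODATA = np.uint16(65535)

def encode_lai(lai: np.ndarray) -> np.ndarray:
    """Convert float LAI (possibly containing NaNs) to scaled uint16."""
    encoded = np.full(lai.shape, NODATA, dtype=np.uint16)
    valid = np.isfinite(lai)
    encoded[valid] = np.clip(np.round(lai[valid] * SCALE), 0, NODATA - 1).astype(np.uint16)
    return encoded

def write_lai(path: str, lai: np.ndarray, profile: dict) -> None:
    """Write encoded LAI; the scale is recorded as a tag so readers can decode it."""
    profile = {**profile, "dtype": "uint16", "nodata": int(NODATA), "count": 1}
    with rasterio.open(path, "w", **profile) as dst:
        dst.write(encode_lai(lai), 1)
        dst.update_tags(scale_factor=str(1 / SCALE))
```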
Unit- & Integration Tests
- Currently I've not really been following best practice here and have run a number of manual tests to validate the correctness of new features and consistency with previous results.
=> To avoid accidentally introducing errors at some point, we should set up a CI pipeline running unit and integration tests on each new push. We can add this step by step from now on.
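As a starting point, even a small regression test comparing pipeline output against a frozen reference would help; a minimal pytest sketch, with hypothetical file paths and tolerances:

```python
# Minimal pytest sketch for a regression test; paths, the marker name, and
# the tolerance are placeholders and would point at a small fixture dataset.
import pandas as pd
import pytest

REFERENCE = "tests/data/reference_yield.csv"   # frozen output from a known-good run
CURRENT = "tests/data/current_yield.csv"       # output produced by the test run

@pytest.mark.integration
def test_yield_results_match_reference():
    ref = pd.read_csv(REFERENCE).set_index("region").sort_index()
    cur = pd.read_csv(CURRENT).set_index("region").sort_index()
    # Yield estimates should stay within a small relative tolerance of the
    # previously validated results.
    pd.testing.assert_frame_equal(cur, ref, rtol=1e-3)
```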
Suggested solution
See above.
Additional context
These are just some things I noticed, and it is open for discussion whether we really want to implement them. I just collected them here instead of opening an issue for each, but we can do that after aligning.