Pipeline for generating monthly Sentinel-2 satellite imagery basemaps over large areas with state-of-the-art cloud masking.
This pipeline is set up with GPU processing on Vast AI in mind. Use create_env.sh to quickly create an environment called mosaics, then activate it. If you are running on your own GPU and already have conda, just create the environment from environment.yml.
I use an A100 or a similar GPU. In my setup (a monthly mosaic for Arkansas), the whole run takes around 3 hours and costs around $5.
For the state of Arkansas, a single month in 2025, when Sentinel-2A, 2B and 2C are all operational, requires around 300GB of disk space. You can roughly estimate how many GB you will need from your area of interest and the number of days in your mosaic:
Required_GB = square_miles * days * 0.00019
Note that this is a rough estimate, so you may want to round it up to be safe. The pipeline is designed to use as little storage as possible: the clipping script (step 2) deletes the downloaded files once they have been clipped to the AOI.
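As a quick sanity check, the formula can be wrapped in a small helper. This is only a sketch: the function name and the `safety_factor` argument are illustrative and not part of the pipeline, and the Arkansas area of roughly 53,180 square miles is an approximation that ignores the 5000 m buffer.

```python
def required_gb(square_miles, days, gb_per_sq_mile_day=0.00019, safety_factor=1.0):
    """Rough disk estimate: Required_GB = square_miles * days * 0.00019.

    safety_factor is a hypothetical extra margin (e.g. 1.2 for +20%).
    """
    return square_miles * days * gb_per_sq_mile_day * safety_factor


# Arkansas (approx. 53,180 sq mi, an assumed figure) for a 30-day month
estimate = required_gb(53_180, 30)
print(f"{estimate:.0f} GB")                          # about 300 GB, matching the figure above
print(f"{required_gb(53_180, 30, safety_factor=1.2):.0f} GB with a 20% margin")
```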
You can create a Slack webhook and add it to the config.json to get updates at each step. The pipeline will send notifications after key processing stages.
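For reference, a Slack incoming webhook accepts a JSON body with a `text` field sent via HTTP POST. The sketch below shows one way such a notification could be sent using only the standard library. The function names and the placeholder URL are assumptions for illustration, and the pipeline's own notification code may differ.

```python
import json
import urllib.request


def build_slack_request(webhook_url, message):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )


def notify(webhook_url, message, timeout=10):
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder URL: replace it with the webhook stored in config.json.
request = build_slack_request("https://hooks.slack.com/services/EXAMPLE", "Clipping finished")
print(request.get_method(), request.data.decode("utf-8"))
```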
OmniCloudMask: Deep learning model processes RGB+NIR bands, outputs classification (0=clear, 1=thick cloud, 2=thin cloud, 3=shadow)
SCL Filter: Sentinel-2 Scene Classification Layer masks clouds, shadows, saturated pixels, and snow
Combined Logic: Pixel is invalid if (OmniCloudMask > 0) OR (SCL in invalid_classes) OR (nodata)
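The combined logic above can be sketched with NumPy boolean arrays. This is an illustration rather than the code in omnimask.py or generate_mosaic.py: the SCL class list (1 saturated/defective, 3 cloud shadow, 8 and 9 cloud, 10 thin cirrus, 11 snow) and the nodata value of 0 are assumptions, so check them against your own configuration.

```python
import numpy as np

# Assumed SCL classes treated as invalid; the pipeline's actual list may differ.
ASSUMED_INVALID_SCL = (1, 3, 8, 9, 10, 11)


def invalid_mask(ocm, scl, band, nodata=0, invalid_scl=ASSUMED_INVALID_SCL):
    """Return True where a pixel should be excluded from the composite.

    ocm  : OmniCloudMask classes (0=clear, 1=thick cloud, 2=thin cloud, 3=shadow)
    scl  : Sentinel-2 Scene Classification Layer values
    band : any reflectance band, used here only to detect nodata (assumed to be 0)
    """
    return (ocm > 0) | np.isin(scl, invalid_scl) | (band == nodata)


ocm = np.array([[0, 1], [0, 0]])
scl = np.array([[4, 4], [9, 5]])
red = np.array([[500, 500], [500, 0]])
print(invalid_mask(ocm, scl, red))
```

In this toy example only the top-left pixel stays valid: the others are rejected by OmniCloudMask, by the SCL filter, and by the nodata test respectively.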
For each pixel location, the algorithm:
- Collects all valid observations across scenes
- Computes the 25th percentile per band (configurable in config.json)
- This percentile approach selects relatively darker/clearer pixel values while avoiding extreme outliers
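A minimal sketch of the per-pixel percentile step follows, assuming all scenes fit in memory as one array. generate_mosaic.py works in GPU chunks, so this CPU version with `np.nanpercentile` only illustrates the idea. The function name and array layout are hypothetical.

```python
import warnings

import numpy as np


def percentile_composite(stack, valid, q=25.0):
    """Per-pixel, per-band percentile over valid observations only.

    stack : (scenes, bands, rows, cols) reflectance values
    valid : (scenes, rows, cols) boolean, True where the observation is usable
    Pixels with no valid observation come back as NaN.
    """
    data = stack.astype("float32", copy=True)
    # Broadcast the per-scene mask across bands and blank out invalid pixels.
    data[~np.broadcast_to(valid[:, None, :, :], data.shape)] = np.nan
    with warnings.catch_warnings():
        # nanpercentile warns on all-NaN pixels; NaN is the result we want there.
        warnings.simplefilter("ignore", RuntimeWarning)
        return np.nanpercentile(data, q, axis=0)


# Five scenes, one band, two pixels. The bright 9000 value is a masked cloud,
# and the second pixel has no valid observations at all.
stack = np.zeros((5, 1, 1, 2), dtype="float32")
stack[:, 0, 0, 0] = [100, 200, 300, 400, 9000]
valid = np.zeros((5, 1, 2), dtype=bool)
valid[:4, 0, 0] = True
print(percentile_composite(stack, valid))
```

The first pixel takes the 25th percentile of 100, 200, 300 and 400, and the masked 9000 value never enters the calculation.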
The final mosaic contains 4 bands, RGB and NIR. It is clipped to the provided GeoPackage.
An example mosaic for the state of Arkansas using the Arkansas_5000mbuffer.gpkg file for the month of November, 2025
bash pipeline.sh
1. Download & Discovery: downloadS2.py queries STAC API and downloads from AWS
2. Clipping: clipData_parallel.py clips scenes to AOI, removes originals
3. Cloud Masking: omnimask.py generates OmniCloudMask per scene
4. Mosaic Generation: generate_mosaic.py creates percentile composite
5. Finalization: finalize_and_upload.py uploads to cloud storage, sends notifications
Edit config.json:
Essential Parameters:
- target_crs: Output coordinate system (e.g., "EPSG:32615")
- bbox: AOI geometry file (e.g., "Arkansas_5000mbuffer.gpkg")
- percentileValue: Compositing percentile (default: 25)
- output_filename_base: Base name for output files (e.g., "Arkansas_S2_RGBNIR")
Processing:
- download_dir, cleaned_dir, output_dir: Data directories
- remove_original_files_after_clipping: Save disk space (default: true)
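To make the parameters concrete, here is a sketch of a configuration expressed as a Python dict, with a small check for the essential keys. The directory paths and the date format are placeholders chosen for illustration, and `missing_keys` is a hypothetical helper, not something the pipeline provides.

```python
import json

ESSENTIAL_KEYS = ("target_crs", "bbox", "percentileValue", "output_filename_base")

example_config = {
    "target_crs": "EPSG:32615",
    "bbox": "Arkansas_5000mbuffer.gpkg",
    "percentileValue": 25,
    "output_filename_base": "Arkansas_S2_RGBNIR",
    "start_date": "2025-11-01",     # assumed date format
    "end_date": "2025-11-30",       # assumed date format
    "download_dir": "data/raw",     # placeholder path
    "cleaned_dir": "data/clipped",  # placeholder path
    "output_dir": "data/output",    # placeholder path
    "remove_original_files_after_clipping": True,
}


def missing_keys(config, required=ESSENTIAL_KEYS):
    """Return the essential parameters that are absent from a config dict."""
    return [key for key in required if key not in config]


print(json.dumps(example_config, indent=2))
print("missing:", missing_keys(example_config))
```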
- {output_filename_base}_{YYYY_MM_DD}_to_{YYYY_MM_DD}.tif: 4-band RGBNIR composite (COG format)
- metadata.json: Source URLs, timestamps, Copernicus attribution
Final files are named according to the actual date range from start_date to end_date in config.json.
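The naming pattern can be reproduced with a short helper, shown here as a sketch. It assumes start_date and end_date are ISO strings (YYYY-MM-DD), which may not match the format config.json actually uses. The function name is hypothetical.

```python
from datetime import date


def mosaic_filename(base, start_date, end_date):
    """Build {base}_{YYYY_MM_DD}_to_{YYYY_MM_DD}.tif from ISO date strings."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return f"{base}_{start:%Y_%m_%d}_to_{end:%Y_%m_%d}.tif"


print(mosaic_filename("Arkansas_S2_RGBNIR", "2025-11-01", "2025-11-30"))
# Arkansas_S2_RGBNIR_2025_11_01_to_2025_11_30.tif
```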
GPU Memory Issues:
Reduce chunk_size in omnimask.py if you get an out-of-memory (OOM) error.
Reduce GPU_CHUNK_SIZE in generate_mosaic.py
generate_mosaic.py fails if target_crs is not a UTM CRS, because TARGET_RES_X and TARGET_RES_Y are specified in meters.
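A quick pre-flight check can catch this before a long run. The sketch below recognizes only WGS 84 / UTM codes (EPSG:32601 to 32660 for northern zones, EPSG:32701 to 32760 for southern zones). Other meter-based CRSs are deliberately not covered, and the function name is hypothetical.

```python
def is_wgs84_utm(crs):
    """Return True if crs is an EPSG code for a WGS 84 / UTM zone."""
    prefix, _, code = crs.strip().upper().partition(":")
    if prefix != "EPSG" or not code.isdigit():
        return False
    number = int(code)
    return 32601 <= number <= 32660 or 32701 <= number <= 32760


for crs in ("EPSG:32615", "EPSG:4326", "EPSG:3857"):
    print(crs, is_wgs84_utm(crs))
```

Here EPSG:32615 (UTM zone 15N, the example used for Arkansas) passes, while EPSG:4326 (degrees) and EPSG:3857 (Web Mercator) are rejected.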