
EOPF Explorer Samples Data Pipeline

Kubernetes pipeline: Sentinel Zarr → Cloud-Optimized GeoZarr + STAC Registration

Automated pipeline for converting Sentinel-1/2 Zarr datasets to cloud-optimized GeoZarr format with STAC catalog integration and interactive visualization.


What It Does

Transforms Sentinel-1/2 satellite data into web-ready visualizations:

Input: STAC item URL → Output: Interactive web map (~15-20 min)

Pipeline: Convert → Register

Supported Missions:

  • Sentinel-2 L2A (Multi-spectral optical)
  • Sentinel-1 GRD (SAR backscatter)

Setup

Environments

The data pipeline is deployed in two Kubernetes namespaces:

  • devseed-staging - Testing and validation environment
  • devseed - Production data pipeline

This documentation uses devseed-staging in examples. For production, replace with devseed.
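
For example, the same workflow watch command in each environment:

kubectl get wf -n devseed-staging --watch   # staging
kubectl get wf -n devseed --watch           # production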

Prerequisites

  • Kubernetes cluster with platform-deploy (Argo Workflows, STAC API, TiTiler)
  • Python 3.13+ with uv
  • GDAL installed (on macOS: brew install gdal)
  • kubectl installed

If needed, configure kubectl

Download kubeconfig from OVH Manager → Kubernetes (Access and Security tab).

mkdir -p .work   # ensure the local work directory exists
mv ~/Downloads/kubeconfig.yml .work/kubeconfig
export KUBECONFIG=$(pwd)/.work/kubeconfig
kubectl get nodes  # Verify: should list several nodes

Quick verification:

kubectl get wf -n devseed-staging

Add Harbor Registry credentials to .env file

Make sure a HARBOR_USERNAME and HARBOR_PASSWORD for the OVH container registry are set in the .env file.
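
A minimal .env sketch (both values are placeholders):

# .env - credentials for the OVH container registry (Harbor)
HARBOR_USERNAME=<your-harbor-username>
HARBOR_PASSWORD=<your-harbor-password>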

Set up port forwarding for webhook access

See operator-tools/README.md for webhook port forwarding setup.
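
A minimal port-forward sketch, assuming the webhook is exposed as a Kubernetes service; the actual service name and ports are documented in operator-tools/README.md:

# <webhook-service> and both ports are placeholders; see operator-tools/README.md
kubectl port-forward -n devseed-staging svc/<webhook-service> 8080:80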

For development

Make sure all dependencies are installed by running

make setup

To test new code

  1. Authenticate with the Harbor registry:

source .env
echo $HARBOR_PASSWORD | docker login w9mllyot.c1.de1.container-registry.ovh.net -u $HARBOR_USERNAME --password-stdin

  2. Build the new version of the image. On macOS, the Linux architecture must be specified with the --platform linux/amd64 flag:

docker build -f docker/Dockerfile --network host -t w9mllyot.c1.de1.container-registry.ovh.net/eopf-sentinel-zarr-explorer/data-pipeline:v1-staging --platform linux/amd64 .

On Linux:

docker build -f docker/Dockerfile --network host -t w9mllyot.c1.de1.container-registry.ovh.net/eopf-sentinel-zarr-explorer/data-pipeline:v1-staging .

  3. Push to the container registry:

docker push w9mllyot.c1.de1.container-registry.ovh.net/eopf-sentinel-zarr-explorer/data-pipeline:v1-staging

  4. Once the new image is pushed, run the example notebook and verify that workflows are running in Argo Workflows.

Submit Workflow

Method 1: HTTP Webhook (Recommended)

Use the operator tools to submit STAC items via HTTP webhook. See operator-tools/README.md for:

  • Interactive notebook for batch submissions
  • Python script for single item testing
  • Port forwarding setup
  • Common actions and target collections
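
With the port-forward running, a submission is a single HTTP POST. This is a sketch only, reusing the parameter names from the workflow template below; the actual endpoint path and payload schema are documented in operator-tools/README.md:

# Endpoint path and payload field names are assumptions; check operator-tools/README.md
curl -X POST http://localhost:8080/webhook \
  -H "Content-Type: application/json" \
  -d '{
        "source_url": "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2A_MSIL2A_20251022T094121_N0511_R036_T34TDT_20251022T114817",
        "register_collection": "sentinel-2-l2a-dp-test"
      }'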

Method 2: kubectl (Testing - Direct Workflow Submission)

Submit a workflow directly:

kubectl create -n devseed-staging -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: geozarr-
spec:
  workflowTemplateRef:
    name: geozarr-pipeline
  arguments:
    parameters:
    - name: source_url
      value: "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2A_MSIL2A_20251022T094121_N0511_R036_T34TDT_20251022T114817"
    - name: register_collection
      value: "sentinel-2-l2a-dp-test"
EOF

kubectl get wf -n devseed-staging --watch

Monitor: Argo Workflows UI

View Results:

💡 Tip: Log in to the EOxHub workspace for seamless authentication.


Pipeline

Flow: STAC item URL → Extract zarr → Convert to GeoZarr → Upload S3 → Register STAC item → Optimize Storage → Add visualization links

Processing Steps:

V0 Pipeline (2 steps):

  1. Convert - Fetch STAC item, extract zarr URL, convert to cloud-optimized GeoZarr, upload to S3
  2. Register - Create STAC item with asset hrefs, add projection metadata and TiTiler links, register to catalog

V1 Pipeline (3 steps):

  1. Convert - S2-optimized conversion with enhanced performance
  2. Register - Enhanced registration with alternate extension and consolidated assets
  3. Change Storage Tier - Optimize storage costs by moving data to appropriate S3 storage class (default: EXPRESS_ONEZONE)
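
For reference, step 3's storage-class change can be reproduced manually with the AWS CLI against an S3-compatible endpoint. A sketch under stated assumptions (bucket, prefix, and endpoint are placeholders; in the pipeline this is handled by scripts/change_storage_tier.py):

# Re-copy objects in place with a new storage class; all names below are placeholders
aws s3 cp s3://<bucket>/<item>.zarr/ s3://<bucket>/<item>.zarr/ \
  --recursive --storage-class EXPRESS_ONEZONE \
  --endpoint-url https://<ovh-s3-endpoint>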

Runtime: ~15-20 minutes per item

Stack:

  • Processing: eopf-geozarr, Dask, Python 3.13
  • Storage: S3 (OVH)
  • Catalog: pgSTAC, TiTiler

Infrastructure & Workflow Details: For complete workflow architecture, event flow, and deployment configuration, see platform-deploy data-pipeline README


Payload Format

✅ CORRECT

# Sentinel-2
source_url: "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2A_MSIL2A_..."

# Sentinel-1
source_url: "https://stac.core.eopf.eodc.eu/collections/sentinel-1-l1-grd/items/S1A_IW_GRDH_..."

❌ WRONG

source_url: "https://objectstore.eodc.eu/.../product.zarr"  # Direct zarr URLs not supported

Why? The pipeline extracts the zarr URL from the STAC item's assets automatically.
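
To see what the pipeline extracts, list an item's asset hrefs yourself (assumes curl and jq are installed; the zarr href appears among the assets):

curl -s "https://stac.core.eopf.eodc.eu/collections/sentinel-2-l2a/items/S2A_MSIL2A_20251022T094121_N0511_R036_T34TDT_20251022T114817" \
  | jq -r '.assets | to_entries[] | "\(.key): \(.value.href)"'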

Find valid URLs:

kubectl get wf -n devseed-staging --sort-by=.metadata.creationTimestamp \
  -o jsonpath='{range .items[?(@.status.phase=="Succeeded")]}{.spec.arguments.parameters[?(@.name=="source_url")].value}{"\n"}{end}' \
  | tail -n 5

Repository Structure

scripts/
├── convert_v0.py              # Generic Zarr → GeoZarr converter (V0 pipeline)
├── convert_v1_s2.py           # S2-optimized GeoZarr converter (V1 pipeline)
├── register_v0.py             # Basic STAC registration (V0 pipeline)
├── register_v1.py             # Enhanced STAC registration (V1 pipeline)
├── change_storage_tier.py     # S3 storage tier optimization (V1 pipeline step 3)
├── test_complete_workflow.py  # Workflow testing script
├── test_gateway_format.py     # Gateway format testing
└── README_storage_tier.md     # Storage tier management documentation

operator-tools/
├── manage_collections.py            # STAC collection management (create/clean/update)
├── submit_test_workflow_wh.py       # HTTP webhook submission script
├── submit_stac_items_notebook.ipynb # Batch submission notebook
├── README.md                        # Operator tools documentation
└── README_collections.md            # Collection management guide

docker/Dockerfile     # Container image
tests/                # Unit and integration tests

Deployment Configuration: Kubernetes manifests and infrastructure are maintained in platform-deploy


Monitor

# Watch workflows
kubectl get wf -n devseed-staging --watch

# View workflow logs
kubectl logs -n devseed-staging -l workflows.argoproj.io/workflow=<name> --tail=100

# Running workflows only
kubectl get wf -n devseed-staging --field-selector status.phase=Running

Web UI: Argo Workflows


Troubleshoot

Problem                       Solution
"No group found in store"     Using a direct zarr URL instead of a STAC item URL
"Webhook not responding"      See operator-tools troubleshooting
Workflow not starting         Check that the webhook submission returned success; verify the port-forward
S3 access denied              Contact the infrastructure team to verify S3 credentials
Workflow stuck/failed         Check workflow logs: kubectl logs -n devseed-staging -l workflows.argoproj.io/workflow=<name>

For infrastructure issues, see platform-deploy troubleshooting: staging | production


Related Projects

Documentation

License

MIT
