Fix STAC Geoparquet export #328

Open · wants to merge 7 commits into `main`
26 changes: 24 additions & 2 deletions datasets/hls2/collection/hls2-l30/template.json
@@ -4,8 +4,14 @@
"id": "hls2-l30",
"title": "Harmonized Landsat Sentinel-2 (HLS) Version 2.0, Landsat Data",
"description": "{{ collection.description }}",
"license": "Data Citation Guidance: https://lpdaac.usgs.gov/data/data-citations-and-guidelines",
"links": [],
"license": "proprietary",
"links": [
{
"rel": "license",
"href": "https://lpdaac.usgs.gov/data/data-citation-and-policies/",
"title": "LP DAAC - Data Citation and Policies"
}
],
"stac_extensions": [
"https://stac-extensions.github.io/item-assets/v1.0.0/schema.json",
"https://stac-extensions.github.io/table/v1.2.0/schema.json",
@@ -49,6 +55,22 @@
"type": "image/webp",
"href": "https://ai4edatasetspublicassets.blob.core.windows.net/assets/pc_thumbnails/hls2-l30.webp",
"title": "HLS2 Landsat Collection Thumbnail"
},
"geoparquet-items": {
"href": "abfs://items/hls2-l30.parquet",
"type": "application/x-parquet",
"roles": [
"stac-items"
],
"title": "GeoParquet STAC items",
"description": "Snapshot of the collection's STAC items exported to GeoParquet format.",
"msft:partition_info": {
"is_partitioned": true,
"partition_frequency": "W-MON"
},
"table:storage_options": {
"account_name": "pcstacitems"
}
}
},
"summaries": {
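
Once this template change is ingested, the new `geoparquet-items` asset should be visible on the public collection. A minimal check, assuming the standard Planetary Computer STAC API endpoint and that `curl` and `jq` are available:

```shell
# Hedged check: fetch the hls2-l30 collection and print the new geoparquet-items asset.
curl -s "https://planetarycomputer.microsoft.com/api/stac/v1/collections/hls2-l30" \
  | jq '.assets["geoparquet-items"]'
```
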
26 changes: 24 additions & 2 deletions datasets/hls2/collection/hls2-s30/template.json
@@ -4,8 +4,14 @@
"id": "hls2-s30",
"title": "Harmonized Landsat Sentinel-2 (HLS) Version 2.0, Sentinel-2 Data",
"description": "{{ collection.description }}",
"license": "Data Citation Guidance: https://lpdaac.usgs.gov/data/data-citations-and-guidelines",
"links": [],
"license": "proprietary",
"links": [
{
"rel": "license",
"href": "https://lpdaac.usgs.gov/data/data-citation-and-policies/",
"title": "LP DAAC - Data Citation and Policies"
}
],
"stac_extensions": [
"https://stac-extensions.github.io/item-assets/v1.0.0/schema.json",
"https://stac-extensions.github.io/table/v1.2.0/schema.json",
@@ -56,6 +62,22 @@
"type": "image/webp",
"href": "https://ai4edatasetspublicassets.blob.core.windows.net/assets/pc_thumbnails/hls2-s30.webp",
"title": "HLS2 Sentinel Collection Thumbnail"
},
"geoparquet-items": {
"href": "abfs://items/hls2-s30.parquet",
"type": "application/x-parquet",
"roles": [
"stac-items"
],
"title": "GeoParquet STAC items",
"description": "Snapshot of the collection's STAC items exported to GeoParquet format.",
"msft:partition_info": {
"is_partitioned": true,
"partition_frequency": "W-MON"
},
"table:storage_options": {
"account_name": "pcstacitems"
}
}
},
"summaries": {
46 changes: 14 additions & 32 deletions datasets/stac-geoparquet/Dockerfile
@@ -1,72 +1,54 @@
FROM ubuntu:20.04
FROM mcr.microsoft.com/azurelinux/base/python:3.12

# Setup timezone info
ENV TZ=UTC

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV UV_SYSTEM_PYTHON=TRUE

RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

RUN apt-get update && apt-get install -y software-properties-common
RUN tdnf install build-essential jq unzip ca-certificates awk wget curl git azure-cli -y \
&& tdnf clean all

RUN add-apt-repository ppa:ubuntugis/ppa && \
apt-get update && \
apt-get install -y build-essential python3-dev python3-pip \
jq unzip ca-certificates wget curl git && \
apt-get autoremove && apt-get autoclean && apt-get clean

RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 10

# See https://github.com/mapbox/rasterio/issues/1289
ENV CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

# Install Python 3.11
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh" \
&& bash "Mambaforge-$(uname)-$(uname -m).sh" -b -p /opt/conda \
&& rm -rf "Mambaforge-$(uname)-$(uname -m).sh"

ENV PATH /opt/conda/bin:$PATH
ENV LD_LIBRARY_PATH /opt/conda/lib/:$LD_LIBRARY_PATH

RUN mamba install -y -c conda-forge python=3.11 gdal pip setuptools cython numpy

RUN python -m pip install --upgrade pip
# RUN python3 -m pip install --upgrade pip
RUN pip install --upgrade uv

# Install common packages
COPY requirements-task-base.txt /tmp/requirements.txt
RUN python -m pip install --no-build-isolation -r /tmp/requirements.txt
RUN uv pip install --no-build-isolation -r /tmp/requirements.txt

#
# Copy and install packages
#

COPY pctasks/core /opt/src/pctasks/core
RUN cd /opt/src/pctasks/core && \
pip install .
uv pip install .

COPY pctasks/cli /opt/src/pctasks/cli
RUN cd /opt/src/pctasks/cli && \
pip install .
uv pip install .

COPY pctasks/task /opt/src/pctasks/task
RUN cd /opt/src/pctasks/task && \
pip install .
uv pip install .

COPY pctasks/client /opt/src/pctasks/client
RUN cd /opt/src/pctasks/client && \
pip install .
uv pip install .

# COPY pctasks/ingest /opt/src/pctasks/ingest
# RUN cd /opt/src/pctasks/ingest && \
# pip install .
# uv pip install .

# COPY pctasks/dataset /opt/src/pctasks/dataset
# RUN cd /opt/src/pctasks/dataset && \
# pip install .
# uv pip install .

COPY datasets/stac-geoparquet /opt/src/datasets/stac-geoparquet
RUN python3 -m pip install -r /opt/src/datasets/stac-geoparquet/requirements.txt
RUN uv pip install -r /opt/src/datasets/stac-geoparquet/requirements.txt

# Setup Python Path to allow import of test modules
ENV PYTHONPATH=/opt/src:$PYTHONPATH
39 changes: 35 additions & 4 deletions datasets/stac-geoparquet/README.md
@@ -4,27 +4,58 @@ Generates the `stac-geoparquet` collection-level assets for the [Planetary Compu

## Container Images

Test the build with:
```shell
$ az acr build -r pccomponents -t pctasks-stac-geoparquet:latest -t pctasks-stac-geoparquet:2023.7.10.0 -f datasets/stac-geoparquet/Dockerfile .
docker build -t stac-geoparquet -f datasets/stac-geoparquet/Dockerfile .
```

Then publish to the ACR with:

```shell
az acr build -r pccomponents -t pctasks-stac-geoparquet:latest -t pctasks-stac-geoparquet:2023.7.10.0 -f datasets/stac-geoparquet/Dockerfile .
```
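
Optionally, smoke-test the locally built image before pushing. This is a hedged sketch: the `stac-geoparquet` tag comes from the `docker build` command above, and it assumes the `pctasks.core` and `pctasks.task` packages installed in the Dockerfile import cleanly.

```shell
# Hedged smoke test: verify the pctasks packages installed in the image can be imported.
docker run --rm stac-geoparquet python -c "import pctasks.core, pctasks.task; print('imports ok')"
```
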

## Permissions

This requires the following permissions:

* Storage Data Table Reader on the config tables (`pcapi/bluecollectoinconfig`, `pcapi/greencollectionconfig`)
* Storage Data Table Reader on the config tables (`pcapi/bluecollectionconfig`, `pcapi/greencollectionconfig`)
* Storage Blob Data Contributor on the `pcstacitems` container.

## Arguments

By default, this workflow generates geoparquet assets for all collections.
To run it against a subset of collections, use either:

1. `extra_skip`: skips the specified collections.
1. `collections`: generates geoparquet only for the specified collection(s); see the example after this list.
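
For example, a hedged invocation that restricts the export to a single collection; the `--arg` flag syntax is an assumption, so check `pctasks workflow submit --help` for the exact form:

```shell
# Hedged example (flag syntax is an assumption): export geoparquet only for hls2-l30.
pctasks workflow submit stac-geoparquet --arg collections hls2-l30
```
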

## Updates

The workflow used for updates was registered with:

```shell
pctasks workflow update datasets/stac-geoparquet/workflow.yaml
```

It can be manually invoked with:

```shell
pctasks workflow submit stac-geoparquet
```
pctasks workflow update datasets/workflows/stac-geoparquet.yaml

## Run Locally

You can debug the geoparquet export locally like this:

```shell
export STAC_GEOPARQUET_CONNECTION_INFO="secret"
export STAC_GEOPARQUET_TABLE_NAME="greencollectionconfig"
export STAC_GEOPARQUET_TABLE_ACCOUNT_URL="https://pcapi.table.core.windows.net"
export STAC_GEOPARQUET_STORAGE_OPTIONS_ACCOUNT_NAME="pcstacitems"

python3 pc_stac_geoparquet.py --collection hls2-l30
```

In addition to the Postgres connection string, you will need a PIM activation for
`Storage Blob Data Contributor` to write to the production storage account.
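
After a local run you can verify that the export landed in blob storage. A hedged check with the Azure CLI, assuming the target container is `items` (per the `abfs://items/...` asset href) and that your PIM activation also grants read access:

```shell
# Hedged check: list the exported parquet partitions for hls2-l30 in the pcstacitems account.
az storage blob list \
  --account-name pcstacitems \
  --container-name items \
  --prefix "hls2-l30.parquet" \
  --auth-mode login \
  --output table
```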