48 changes: 28 additions & 20 deletions README.md
@@ -2,34 +2,34 @@

[![Pull Request, Any Branch](https://github.com/sparkgeo/STAC-API-Serverless/actions/workflows/pull-request-all.yml/badge.svg)](https://github.com/sparkgeo/STAC-API-Serverless/actions/workflows/pull-request-all.yml)

-A [stac-fastapi](https://github.com/stac-utils/stac-fastapi) backend that indexes a catalog to Parquet to make it searchable. The ability to work with static files, and the lack of a need for another persistent data store such as a database, mean this backend can run in a serverless environment.
+A [stac-fastapi](https://github.com/stac-utils/stac-fastapi) backend that indexes a STAC catalog to Parquet to make it searchable. Because it works with static files and needs no additional persistent data store, such as a database, this backend can run in a serverless environment. See [Overview](#overview) for more information on this approach.

-This backend does not support transactions.
+Is this project suitable for your use-case? See [Suitability](./docs/suitability.md) for a review of advantages, disadvantages, and limitations.

## Known Issues

See [Known Issues](./docs/known-issues.md) for more information.

## Development

-See [Development](./docs/development.md) for guidance on how to work on this project.
+See [Development](./docs/development.md) for detailed guidance on how to work on this project.

-## Quickstart
+### Quickstart

To get a quick demo up and running, execute any of the following scripts and navigate to http://localhost:8123/api.html

```sh
scripts/run-with-local-s3.sh # loads a sample dataset into minio, indexes it, loads the index into minio, and runs the API
scripts/run-with-local-file.sh # indexes a sample dataset on the filesystem and runs the API
scripts/run-with-local-http.sh # loads a sample dataset into a HTTP fileserver, indexes it, and runs the API
-scripts/run-with-remote-http.sh https://capella-open-data.s3.us-west-2.amazonaws.com/stac/catalog.json # indexes a public static STAC catalog over HTTPS and runs the API
+scripts/run-with-remote-source.sh https://capella-open-data.s3.us-west-2.amazonaws.com/stac/catalog.json # indexes a public static STAC catalog over HTTPS and runs the API
```

## Overview

-This repository supports two related but distinct behaviours. The `stac-index.*` packages defined in [./stac_index/](./stac_index/) manage indexing of a STAC catalog. The `stac-fastapi.indexed` package defined in [./stac_fastapi/](./stac_fastapi/) implements a stac-fastapi backend that works with the index created by `stac-index.*` packages.
+This repository supports two related but distinct behaviours. The `stac-index.*` packages defined in [stac_index/](./packages/stac-index/) manage indexing of a STAC catalog. The `stac-fastapi.indexed` package defined in [stac_fastapi/](./src/stac_fastapi/) implements a stac-fastapi backend that works with the index created by the `stac-index.*` packages.

-The indexer might only be run once on a stable dataset, or may be run repeatedly for a changing dataset (see [Limitations / Incremental Updates](#incremental-updates) for multiple-run concerns). The stac-fastapi backend will consult the index once for every request the API receives.
+The indexer might be run only once on a stable dataset, or repeatedly for a changing dataset. The stac-fastapi backend will consult the index for every request the API receives.

There is no requirement for the API to run in a serverless environment - in some cases the associated performance concerns may be unacceptable - though its ability to run in a serverless environment is a key factor in its design.

@@ -40,9 +40,9 @@ There is no requirement for the API to run in a serverless environment - in some
Source and Index Readers are the design's primary extension points and determine where and how this design can be used.

STAC objects (catalog, collections, items) are identified by a unique URI. Each URI's format determines which reader must be used to read its content.
-- The [S3 Reader](./stac_index/reader/s3/) is used to read any content identified by a URI with a `s3://` prefix.
-- The [Filesystem Reader](./stac_index/reader/filesystem/) is used to read any content identified by a URI with a `/` prefix.
-- The [HTTPS Reader](./stac_index/reader/https/) is used to read any content identified by a URI with a `http(s)://` prefix.
+- The [S3 Reader](./packages/stac-index/src/stac_index/io/readers/s3/) is used to read any content identified by a URI with an `s3://` prefix.
+- The [Filesystem Reader](./packages/stac-index/src/stac_index/io/readers/filesystem/) is used to read any content identified by a URI with a `/` prefix.
+- The [HTTPS Reader](./packages/stac-index/src/stac_index/io/readers/https/) is used to read any content identified by a URI with an `http(s)://` prefix.

An error will be thrown if an unsupported URI prefix is encountered.
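The following is a minimal sketch of this prefix-based dispatch in Python. The function and exception names are illustrative assumptions and do not reflect this project's actual reader interfaces.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


class UnsupportedUriError(Exception):
    """Raised when no reader supports a URI's prefix."""


def read_filesystem(uri: str) -> bytes:
    # Filesystem Reader: URIs with a `/` prefix are treated as local paths.
    return Path(uri).read_bytes()


def read_https(uri: str) -> bytes:
    # HTTPS Reader: a real implementation would add retries, timeouts, etc.
    with urlopen(uri) as response:
        return response.read()


def read_s3(uri: str) -> bytes:
    # S3 Reader: stubbed here; a real implementation might use boto3.
    raise NotImplementedError("requires an S3 client such as boto3")


# Prefixes are checked in order; the first match selects the reader.
READERS: list[tuple[tuple[str, ...], Callable[[str], bytes]]] = [
    (("s3://",), read_s3),
    (("http://", "https://"), read_https),
    (("/",), read_filesystem),
]


def read_uri(uri: str) -> bytes:
    for prefixes, reader in READERS:
        if uri.startswith(prefixes):
            return reader(uri)
    raise UnsupportedUriError(f"no reader supports URI '{uri}'")
```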

@@ -54,6 +54,8 @@ The indexer selects a suitable reader for each URI as described in [Readers](#re

The indexer creates an index entry for every STAC collection and item in the catalog. By default each item's index entry includes the item's ID and collection, which together form its unique identifier, and values of the minimum set of properties required to support the API's base requirements. This minimum set of properties includes datetimes, BBOX, and the URI of the item's full content.

+If the root catalog contains sub-catalogs, these are read recursively to ensure all collections are discovered. Information about which sub-catalog(s) provided which collection(s) is not recorded by the indexer, and all collections are referenced as direct children of the root catalog.

Only necessary properties from the STAC catalog are indexed, rather than copying the entire catalog to Parquet, to maximise query performance and minimise data duplication.
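As an illustration of the kind of per-item index row described above, the following DuckDB sketch builds a small index table and exports it to Parquet. The column names and layout are assumptions for illustration only, not this project's actual Parquet schema.

```python
import duckdb

con = duckdb.connect()
# One row per STAC item: its unique identifier (id + collection), the
# minimum searchable properties, and the URI of its full content.
con.execute("""
    CREATE TABLE items_index (
        id         VARCHAR,    -- item ID, unique together with collection
        collection VARCHAR,
        datetime   TIMESTAMP,  -- item datetime
        bbox_xmin  DOUBLE,     -- bounding box, flattened for range queries
        bbox_ymin  DOUBLE,
        bbox_xmax  DOUBLE,
        bbox_ymax  DOUBLE,
        stac_uri   VARCHAR     -- URI of the item's full STAC content
    )
""")
con.execute("""
    INSERT INTO items_index VALUES (
        'item-001', 'demo-collection', TIMESTAMP '2024-01-01 00:00:00',
        -123.4, 48.4, -123.3, 48.5, 's3://stac-bucket/items/item-001.json'
    )
""")
# The indexer publishes Parquet files; DuckDB can write them directly.
con.execute("COPY items_index TO 'items.parquet' (FORMAT PARQUET)")
```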

The indexer can be configured with any number of queryable and / or sortable fields. Any fields identified as queryable or sortable will also be indexed by the indexer. See [Index Configuration](./docs/index-config.md) for more information on indexer configuration.
@@ -64,22 +66,28 @@ The indexer produces a number of Parquet files, which must be made accessible in

### API

-The API is confiugured with an index source URI which identifies the location of Parquet index files. Index files are enumerated using that URI via a reader as described in [Readers](#readers). During API operation Parquet files are read by DuckDB and therefore can only be accessed via a method supported by DuckDB. While there is flexibility in how STAC content can be accessed by both the indexer and the API, Parquet index files should only be accessed via either the S3 or Filesystem readers. The API accesses Parquet index files in read-only mode and cannot modify them.
+The API is configured with an index manifest URI which identifies the location of Parquet index files. During API operation Parquet files are read by DuckDB and therefore can only be accessed via a method supported by DuckDB. While there is flexibility in how STAC content can be accessed by both the indexer and the API, Parquet index files should only be accessed via either the S3 or Filesystem readers. The API accesses Parquet index files in read-only mode and cannot modify them.

For each request the API constructs an SQL query to run against the STAC catalog's index. DuckDB is responsible for satisfying SQL queries and manages interaction with Parquet data. The SQL query returns zero or more URIs referencing collections or items that satisfy the request. The API retrieves the full STAC content from those URIs, using the appropriate reader, and returns a response.
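The sketch below illustrates that flow under the same assumed schema as the indexer sketch above: DuckDB resolves the query against the Parquet index, and the matching URIs are then dereferenced with the appropriate reader (the `read_uri` helper from the Readers sketch). This is an illustration, not this project's actual query builder.

```python
import json

import duckdb


def search_items(bbox: tuple[float, float, float, float], limit: int = 10) -> list[dict]:
    xmin, ymin, xmax, ymax = bbox
    # DuckDB reads the Parquet index directly; parameters are bound safely.
    rows = duckdb.execute(
        """
        SELECT stac_uri
        FROM read_parquet('items.parquet')
        WHERE bbox_xmax >= ? AND bbox_xmin <= ?
          AND bbox_ymax >= ? AND bbox_ymin <= ?
        LIMIT ?
        """,
        [xmin, xmax, ymin, ymax, limit],
    ).fetchall()
    # Dereference each URI to the item's full STAC content.
    return [json.loads(read_uri(uri)) for (uri,) in rows]
```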

![Diagram showing the process of handling an API request](./docs/diagrams/exports/Query%20Process.png "API Request Process")

## Limitations

This project is currently subject to a number of limitations. These limitations are not believed to be insurmountable; they have simply not yet received sufficient attention.

-### Incremental Updates
+#### Pagination

-Development and testing has so far assumed a one-time indexing process and no need to support rolling incremental updates to the STAC catalog and its index.
+As defined by the [STAC API specification](https://github.com/radiantearth/stac-api-spec/tree/v1.0.0/item-search#pagination), STAC items returned by the `/search` and `/collections/{collection_id}/items` endpoints support page limits and pagination behaviour. The specification does not currently describe appropriate paging behaviour around data changes.

-For STAC catalogs with infrequent and atomic updates, and a reasonably-sized catalog, it may be sufficient to re-index the catalog following each update and entirely replace the Parquet index files as required.
+Consider the following scenario:
+- An API client issues a search request and receives the first page of a multi-page response, which is sorted according to the request's `sortby` property (if supported) or a default sort field, with a link to the next page.
+- A STAC data update occurs which:
+  - removes one or more STAC items included in the first page of results,
+  - adds one or more STAC items that would have been matched by the search request, and / or
+  - alters one or more STAC items such that whether they match the search request changes.
+- The API client requests the second page of search results using the "next" link from the first page.

-New development work will be required to support STAC catalogs with frequent, incremental, or unpredictable updates. In principle there is no reason that the indexer could not support change detection and the ability to add, update, or remove only affected data. This process would likely be complicated by the addition or removal of queryable or sortable fields after the initial indexer run.
+In this scenario the API client receives the wrong data. Possible outcomes include (this list is not exhaustive):
+- One or more items from the first page may be repeated.
+- One or more items that would - with the updated data - have been included in the first page are not provided to the client.
+- The second page contains zero items because the result set now fits within a single page.
+- An internal error occurs because the result set now fits within a single page and the API implementation does not handle this scenario gracefully.

-There is an open question regarding how the API should handle data updates during normal operation. There is no limit on how much time can pass between a client receiving a multi-page response with a `next` or `previous` paging link and when it accesses those links. It is possible for the set matching the initial request to change during a client's paging. The size of the set or its order could change between pages, creating inconsistency. If a request includes queryable or sortable properties it is possible that one or more fields are no longer queryable or sortable following an update, likely resulting in an error. The STAC API specification does not define expected behaviour in data update scenarios and [this question](https://github.com/radiantearth/stac-api-spec/issues/453) on the subject is unanswered at the time of writing.
+This project prohibits pagination across data changes. If the API determines that a data update has occurred between page requests, a `409 Conflict` is returned with a message instructing the caller to reissue their request.
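One way such a check could work (an assumption for illustration, not necessarily this project's mechanism) is to embed a fingerprint of the index in each paging token and reject tokens whose fingerprint no longer matches:

```python
import base64
import json


class PagingConflictError(Exception):
    """Mapped to a 409 Conflict response by the API layer."""


def make_next_token(offset: int, index_fingerprint: str) -> str:
    # The token records where the next page starts and which index
    # version produced the first page.
    payload = {"offset": offset, "index": index_fingerprint}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()


def resolve_token(token: str, current_fingerprint: str) -> int:
    payload = json.loads(base64.urlsafe_b64decode(token))
    if payload["index"] != current_fingerprint:
        # The index changed since the token was issued; tell the caller
        # to reissue their original request.
        raise PagingConflictError("index updated; please reissue your request")
    return payload["offset"]
```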
5 changes: 3 additions & 2 deletions docs/development.md
@@ -1,6 +1,6 @@
# Development

-Development requires Python >= 3.12, Docker, and Bash.
+Development requires Python >= 3.12, Docker, Bash, and uv.

## Configure Local Environment

@@ -35,7 +35,8 @@ scripts/tests/integration-test.sh --dump-log

This project does not currently support Continuous Deployment. Deployments are automated via AWS CDK but must be initiated manually.

-The indexer's "first run" below refers to an indexer execution with no `manifest.json` at the default publish URI. The default publish URI is `s3://<deployment-data-bucket>/index/manifest.json`. The first execution of a newly-deployed indexer will be its first run, and you can recreate the first run by deleting `manifest.json`.
+> [!NOTE]
+> The indexer's "first run" below refers to an indexer execution with no `manifest.json` at the default publish URI. The default publish URI is `s3://<deployment-data-bucket>/index/manifest.json`. The first execution of a newly-deployed indexer will be its first run, and you can recreate first run behaviour by deleting `manifest.json` before a run.
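
For example, first run behaviour could be recreated with a command like the following, where the bucket name is a placeholder for your deployment's data bucket:

```sh
aws s3 rm s3://<deployment-data-bucket>/index/manifest.json
```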

```sh
# --aws-account and --aws-region are always required
```
3 changes: 1 addition & 2 deletions docs/index-config.md
@@ -16,7 +16,7 @@ The indexer requires knowledge of the DuckDB data type that can be used to store

Entries in `queryables` and `sortables` must have a corresponding entry in `indexables`.

-Each queryable and sortable property must include a list of collections for which the property is queryable or sortable. The `*` wildcard value can be used to indicate all collections.
+Each queryable and sortable property must include a list of collections for which the property is queryable or sortable. The `*` wildcard value can be used to indicate all collections. It is **not** currently possible to wildcard partial collection IDs, such as `collection-*`.
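
As a hypothetical illustration of collection scoping (field layout inferred from this document; collection IDs are invented), a queryable available for all collections and a sortable restricted to specific collections might look like:

```json
{
  "queryables": {
    "gsd": {
      "collections": ["*"],
      "json_schema": { "type": "number" }
    }
  },
  "sortables": {
    "gsd": {
      "collections": ["collection-1", "collection-2"]
    }
  }
}
```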

### Queryables

@@ -26,7 +26,6 @@ Queryables require a `json_schema` property containing a schema that could be us

```json
{
-  "root_catalog_uri": "/data/catalog.json",
  "indexables": {
    "gsd": {
      "storage_type": "DOUBLE",
```
67 changes: 67 additions & 0 deletions docs/suitability.md
@@ -0,0 +1,67 @@
# Suitability

This document explores questions that may determine whether this project is suitable for a given use-case. This project assumes that a number of limitations are acceptable trade-offs for a solution that has lower operating costs and can add STAC API behaviour to existing static STAC catalogs.

## Known Limitations

### Transactions

This stac-fastapi backend does not support transactions and never will. If you need the ability to modify data through the API, consider [stac-fastapi-pgstac](https://github.com/stac-utils/stac-fastapi-pgstac) or one of several other projects that support transactions.

### Data Duplication

The indexing approach requires that some STAC data be duplicated in Parquet index files, and these files are the API's source of truth about that STAC data. If the STAC data changes and the Parquet index files are not updated before an API request is received, it is possible for the API to return incorrect data or an error.

This risk can be mitigated somewhat by a shorter indexer repeat cycle, or by event-driven item updates as intended by [#157](https://github.com/sparkgeo/STAC-API-Serverless/issues/157), but the risk cannot be eliminated entirely.

[#160](https://github.com/sparkgeo/STAC-API-Serverless/issues/160) could also help to address some data duplication risks.

### Performance

Efforts have been made to optimise performance in this project; however, not all performance risks can be addressed.

The indexer can be used to index STAC data hosted on third-party infrastructure. Both the indexer's **and** the API's runtime performance can be impacted by performance issues affecting that infrastructure. Similarly, any availability issues affecting third-party infrastructure will affect the indexer and the API.

The need for the API to fetch full STAC content from the data source may also impact performance, as network requests to S3 storage, HTTPS URIs, or other non-local data stores may add latency to every response.

#### Metrics

Benchmarking during development shows reasonable performance.

For a small STAC catalog (1 collection, <100 items) where both STAC data and Parquet index files are stored in S3, deployed using AWS API Gateway and Lambda with a "warm" instance available, basic search requests have completed in 600-700 milliseconds.

For a large STAC catalog (>20 collections, >1,700,000 items) in the same deployment configuration basic search requests have completed in 1.5-2 seconds.

#### Index Location

If necessary, performance can be improved by accessing Parquet index files locally on disk. This could be achieved by building the files directly in the container image or with a filesystem mount.

## Alternatives

A number of alternative STAC API strategies are available that might avoid some or all of this project's [known limitations](#known-limitations).

### Data Management Platforms

Several solutions, such as [stac-fastapi-pgstac](https://github.com/stac-utils/stac-fastapi-pgstac) or [stac-fastapi-elasticsearch-opensearch](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch), rely on data management platforms. These solutions treat their data store as the sole source of truth, so the API _should_ guarantee accurate responses.

### STAC-GeoParquet

STAC-GeoParquet is a specification for storing STAC data in GeoParquet files. Its primary advantage is that large amounts of STAC data can be queried quickly thanks to the analysis-friendly nature of Parquet and some excellent tooling that is compatible with the format. See [the docs](https://stac-utils.github.io/stac-geoparquet/latest/) for more information.

Because Parquet files are cloud-optimised and can be queried remotely, e.g. using DuckDB and its [httpfs extension](https://duckdb.org/docs/stable/core_extensions/httpfs/overview.html) or [rustac](https://stac-utils.github.io/rustac/), it is not necessary for a data consumer to download potentially large files prior to query.
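For example, a remote STAC-GeoParquet file could be queried in place with a sketch like the following (the URL is a placeholder, not a real dataset):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables http(s):// reads
rows = con.execute(
    """
    SELECT id, collection
    FROM read_parquet('https://example.com/stac/items.parquet')
    WHERE datetime >= TIMESTAMP '2024-01-01'
    LIMIT 10
    """
).fetchall()
print(rows)
```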

STAC-GeoParquet does not provide a REST API interface and is instead accessed programmatically or via SQL. If a REST API is not required, and the primary goal is to query STAC data without the overhead of a data management platform, then STAC-GeoParquet may be a suitable solution.

#### Limitations

STAC-GeoParquet is not typically the primary storage format for STAC data. In some cases, such as some [Planetary Computer STAC collections](https://planetarycomputer.microsoft.com/api/stac/v1/collections/3dep-seamless), Parquet files are provided as collection-level assets so that API clients can access STAC item data without having to page through items via the STAC API. This approach requires data duplication and may experience some of the same limitations identified for this project. In cases where data duplication is required, the entire STAC item must be duplicated, not just certain indexed properties.

All STAC item properties, which would normally reside in a nested object within a STAC item object, must be "promoted" to top-level properties such that they can be stored and queried as columns in a Parquet file. As a result, property names must not duplicate the names of other top-level properties. Any properties that are not promoted do not exist in the resulting data.

STAC collections are stored as Parquet metadata, rather than as columnar data, and therefore are not searchable. Prior to the introduction of the [collection-search STAC API extension](https://github.com/stac-api-extensions/collection-search) this would not have been considered a limitation, as collection search was not previously standardised.

Since 1.1.0, STAC-GeoParquet does not _require_ each collection to be stored in a separate Parquet file, but that approach is still encouraged. Where a STAC data provider has followed this convention it may be more difficult to search effectively across multiple collections.

#### stac-fastapi-geoparquet

The [stac-fastapi-geoparquet](https://pypi.org/project/stac-fastapi-geoparquet/) project aims to augment STAC-GeoParquet with a STAC API interface; however, it does not currently appear to offer a production-ready solution.