PixETL reads raster and vector source data and converts it into Cloud Optimized GeoTIFFs (without overviews), clipped to specified grid sizes.
It then uploads all tiles to the GFW data lake, following the GFW naming convention.
```bash
./scripts/setup
```
- GDAL 2.4.x or 3.x
- libpq-dev
```
██████╗ ███████╗██╗ ██╗ ██████╗ ██╗██╗ ██╗███████╗████████╗██╗
██╔════╝ ██╔════╝██║ ██║ ██╔══██╗██║╚██╗██╔╝██╔════╝╚══██╔══╝██║
██║ ███╗█████╗ ██║ █╗ ██║ ██████╔╝██║ ╚███╔╝ █████╗ ██║ ██║
██║ ██║██╔══╝ ██║███╗██║ ██╔═══╝ ██║ ██╔██╗ ██╔══╝ ██║ ██║
╚██████╔╝██║ ╚███╔███╔╝ ██║ ██║██╔╝ ██╗███████╗ ██║ ███████╗
╚═════╝ ╚═╝ ╚══╝╚══╝ ╚═╝ ╚═╝╚═╝ ╚═╝╚══════╝ ╚═╝ ╚══════╝

Usage: pixetl [OPTIONS] LAYER_JSON

  LAYER_JSON: Layer specification in JSON

Options:
  -d, --dataset TEXT  Name of dataset to process  [required]
  -v, --version TEXT  Version of dataset to process  [required]
  --subset TEXT       Subset of tiles to process
  -o, --overwrite     Overwrite existing tile in output location
  --help              Show this message and exit.
```
```bash
pixetl -d umd_tree_cover_density_2000 -v v1.6 '{"source_type": "raster", "pixel_meaning": "percent", "data_type": "uint8", "nbits": 7, "grid": "10/40000", "source_uri": "s3://gfw-files/2018_update/tcd_2000/tiles.geojson", "resampling": "average"}'
```
You define layer sources in JSON, passed as the one required argument.
Layer source definitions follow this pattern:
```json
{
  "option": "value",
  ...
}
```
Supported Options:
Option | Mandatory | Description |
---|---|---|
source_type | yes | Always "raster" |
pixel_meaning | yes | A string indicating the value represented by each pixel. This can be either a field name or a unit. Always use lowercase, unless specifying a unit that uses capital letters |
data_type | yes | Data type of output file (boolean, uint8, int8, uint16, int16, uint32, int32, float32, float64) |
grid | yes | Grid size of output dataset |
no_data | no | Integer value; for float data types use NAN. If left out or set to null, the output dataset will have no no-data value |
nbits | no | Max number of bits used for the given data type |
source_uri | yes | List of URIs of source folders or tiles.geojson file(s) |
resampling | no | Resampling method (nearest, mode, average, etc.), default `nearest` |
calc | no | NumPy expression to transform the array. Use the namespace `np`, not `numpy`, when calling NumPy functions. When using multiple input bands, reference each band with an uppercase letter in alphabetic order (A, B, C, ...). To output a multi-band raster, wrap the list of bands in a masked array, i.e. `np.ma.array([A, B, C])` |
symbology | no | Add optional symbology to the output raster |
compute_stats | no | Compute band statistics and add them to tiles.geojson |
compute_histogram | no | Compute band histograms and add them to tiles.geojson |
process_locally | no | When set to True, forces PixETL to download all source files prior to processing. Default False |
photometric | no | Color interpretation of bands |
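To illustrate how a `calc` expression behaves, here is a minimal NumPy sketch of the kind of transformation described above. The band values, mask, and threshold are made-up illustrative assumptions, not part of PixETL's API; only the `A`/`B` naming and the `np.ma.array` multi-band wrapping follow the convention documented in the table.

```python
import numpy as np

# Two hypothetical input bands, as a calc expression would see them:
# A is the first band, B the second (masked arrays; True = no data).
A = np.ma.masked_array([0, 15, 40, 90], mask=[True, False, False, False])
B = np.ma.masked_array([1, 2, 3, 4])

# A single-band calc expression, e.g. "(A > 30) * A":
single = (A > 30) * A

# A multi-band output wraps the list of bands in a masked array,
# e.g. "np.ma.array([A, B])":
multi = np.ma.array([A, B])

print(single.tolist())  # masked (no-data) elements stay masked: [None, 0, 40, 90]
print(multi.shape)      # (2, 4): two bands of four pixels
```

Note that the mask propagates through the expression, so no-data pixels in the input remain no-data in the output.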
NOTE:

- Files listed in `source_uri` must be stored on S3 and accessible to PixETL. The file path must use the s3 protocol (`s3://`).
- The file content must be GeoJSON, containing a `FeatureCollection` in which each feature represents one GeoTIFF file. The feature geometry describes the extent of the GeoTIFF; the property `name` holds the path to the GeoTIFF in GDAL `vsi` notation. You can reference files hosted on S3 (`/vsis3/`), on GCS (`/vsigs/`), or anywhere else accessible over HTTP (`/vsicurl/`).
- You can use the `pixetl_prep` script to generate the `tiles.geojson` file.
- GeoTIFFs hosted on S3 must be accessible by the AWS profile used by PixETL.
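For reference, a minimal `tiles.geojson` might look like the following. The bucket, path, and coordinates are made-up placeholders; only the `FeatureCollection` structure, the extent geometry, and the `name` property with a `/vsis3/` path reflect what is described above.

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "name": "/vsis3/my-bucket/my-prefix/10N_020E.tif"
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[20, 0], [30, 0], [30, 10], [20, 10], [20, 0]]]
      }
    }
  ]
}
```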
When referencing GeoTIFFs hosted on GCS, you must set the ENV variable `GOOGLE_APPLICATION_CREDENTIALS`, pointing to a JSON file in the file system which holds the GCS private key of the Google service account you will use to access the data.
You can store the private key in AWS Secrets Manager. In that case, set `AWS_GCS_KEY_SECRET_ARN` to specify the secret id together with `GOOGLE_APPLICATION_CREDENTIALS`. PixETL will then attempt to download the private key from AWS Secrets Manager and store it in the JSON file specified.
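Put together, the GCS setup described above could look like this in the environment of the shell or container. The key path and the ARN are placeholders, not real values.

```shell
# Path where the service-account key is read from (or written to,
# when PixETL downloads it from AWS Secrets Manager):
export GOOGLE_APPLICATION_CREDENTIALS=/root/.gcs/private_key.json

# Optional: have PixETL fetch the key from AWS Secrets Manager
# (placeholder ARN):
export AWS_GCS_KEY_SECRET_ARN=arn:aws:secretsmanager:us-east-1:123456789012:secret:gcs-key
```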
Google Cloud Storage support is experimental only. It should work as documented, but we don't have the tools in place to fully test this feature locally. The only way we can test it right now is with integration tests after deploying code to staging. For local tests to pass, `AWS_GCS_KEY_SECRET_ARN` must NOT be set, as we currently have issues running tests with a second moto server on GitHub Actions.
For example, here is a pretty-printed sample raster layer definition, followed by the command one would issue to process it:

```json
{
  "source_type": "raster",
  "pixel_meaning": "percent",
  "data_type": "uint8",
  "nbits": 7,
  "grid": "10/40000",
  "source_uri": "s3://gfw-files/2018_update/tcd_2000/tiles.geojson",
  "resampling": "average"
}
```

```bash
pixetl -d umd_tree_cover_density_2000 -v v1.6 '{"source_type": "raster", "pixel_meaning": "percent", "data_type": "uint8", "nbits": 7, "grid": "10/40000", "source_uri": "s3://gfw-files/2018_update/tcd_2000/tiles.geojson", "resampling": "average"}'
```
Option | Mandatory | Description |
---|---|---|
source_type | yes | Always "vector" |
pixel_meaning | yes | Field in source table used for pixel value |
data_type | yes | Data type of output file (boolean, uint8, int8, uint16, int16, uint32, int32, float32, float64) |
grid | yes | Grid size of output dataset |
no_data | no | Integer value to use as the no-data value |
nbits | no | Max number of bits used for the given data type |
order | no | How to order field values of source table (asc, desc) |
rasterize_method | no | How to rasterize the tile (value or count). value uses the value from the table, count counts the number of features intersecting with a pixel |
calc | no | PostgreSQL expression (i.e. CASE) used to reformat input values |
symbology | no | Add optional symbology to the output raster |
compute_stats | no | Compute band statistics and add them to tiles.geojson |
compute_histogram | no | Compute band histograms and add them to tiles.geojson |
NOTE:

- Source files must be loaded into a PostgreSQL database prior to running this pipeline. PixETL will look for a PostgreSQL schema named after the `dataset` and a table named after the `version`.
- Make sure geometries are of type `Polygon` or `MultiPolygon` and are valid before running PixETL.
- PixETL will look for a field named after the `pixel_meaning` parameter. This field must be of type integer. If you need to reference a non-integer field, make use of the `calc` parameter: use a PostgreSQL `CASE` expression to map the desired field values to integer values.
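As an illustration, a vector layer definition that uses `calc` to map a text field to integer pixel values might look like the following. The field name (`class_name`), class labels, and pixel meaning are hypothetical; only the option names come from the table above.

```json
{
  "source_type": "vector",
  "pixel_meaning": "class",
  "data_type": "uint8",
  "grid": "10/40000",
  "rasterize_method": "value",
  "order": "desc",
  "calc": "CASE WHEN class_name = 'low' THEN 1 WHEN class_name = 'high' THEN 2 ELSE 0 END"
}
```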
When using vector sources, PixETL will need access to the PostgreSQL database. Use the standard PostgreSQL environment variables to configure the connection.
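The standard PostgreSQL (libpq) environment variables are shown below; all values are placeholders for your own connection details.

```shell
export PGHOST=localhost
export PGPORT=5432
export PGDATABASE=gfw
export PGUSER=postgres
export PGPASSWORD=changeme  # pragma: allowlist secret
```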
Running PixETL with Docker is probably the easiest way to run it locally, since you won't need to install any of the required dependencies.
The master branch of this repo is linked to the Dockerhub image `globalforestwatch/pixetl:latest`.
You can either pull from there, or build your own local image using the provided Dockerfile:
```bash
docker build . -t globalforestwatch/pixetl
```
Make sure you map a local directory to the container's /tmp directory if you want to monitor the temporary files created. Make sure you set all required ENV vars (see above). Also make sure that your docker container has all the required AWS privileges.
```bash
docker run -it -v /tmp:/tmp -v $HOME/.aws:/root/.aws:ro -e AWS_PROFILE=gfw-dev globalforestwatch/pixetl [OPTIONS] NAME # pragma: allowlist secret
```
The Terraform module in this repo will add a PixETL-owned compute environment to AWS Batch.
The compute environment will make use of EC2 instances that come with ephemeral storage (ie instances of the instance families `r5d`, `r5ad`, `r5nd`).
A bootstrap script on the instance will mount one of the ephemeral storage devices as the folder `/tmp`.
A second ephemeral storage device (if available) will be mounted as swap space.
Any other available ephemeral storage devices of the instance will be ignored.

The swap space is only a safety net.
AWS Batch kills a task without further notice if it uses more than the allocated memory.
The swap space allows the batch job to use more memory than allocated; however, the job will become VERY slow.
It is hence always the best strategy to keep the memory/CPU ratio high, in particular when working with `float` data types.
Terraform will also create a PixETL job definition, which references the docker image (hosted on AWS ECR) and sets the required ENV variables for the container. In case you want to run jobs using a vector source, you will have to set the PostgreSQL ENV vars manually.

The job definition also maps the `/tmp` volume of the host EC2 instance to the `/tmp` folder of the docker container.
Hence, any data written to `/tmp` inside the docker container will persist on the EC2 instance.
PixETL will create a subfolder in `/tmp` using the Batch job ID to namespace the data of a given job, and will write all temporary files for that job into the subfolder.
It will also clean up temporary data during runtime to avoid filling up the disk space.

This strategy should avoid running out of disk space. However, in some scenarios you might still experience issues,
for example if multiple jobs were killed by Batch (due to using too much memory) before PixETL was able to clean up.
The EC2 instance will stay available for other scheduled jobs, and if this happens multiple times the disk fills up
and eventually you run out of space.
The AWS IAM role used for the docker container should have all the required permissions to run PixETL.
When creating a new PixETL job, it will be easiest to specify the job parameters using the JSON format,
not the default space-separated format. The entrypoint of the docker image used is `pixetl`, so you only need to
specify the CLI options and arguments, not the binary itself.
When using the JSON format, you will have to escape the quotes inside the `LAYER_JSON` object:

```json
["-d", "umd_tree_cover_density_2000", "-v", "v1.6", "{\"source_type\": \"raster\", \"pixel_meaning\": \"percent\", \"data_type\": \"uint8\", \"nbits\": 7, \"grid\": \"10/40000\", \"source_uri\": \"s3://gfw-files/2018_update/tcd_2000/tiles.geojson\", \"resampling\": \"average\"}"]
```