
3867 terraform #65

Open

wants to merge 35 commits into base: 3742-create-data-backend

Changes from all commits

35 commits
dbf5dae
added terraform to start vm, disks and install deps for user
jdhayhurst Apr 17, 2025
d85b252
terraform: templating the config, running the full pipeline
jdhayhurst Apr 24, 2025
ccad7fb
suppress gsutil rsync stdout and stderr from subprocess
jdhayhurst Apr 24, 2025
69b6efb
organise terraform, parallel run for ch and os, avoid conflicts with …
jdhayhurst Apr 24, 2025
ce7e9c7
gsutil to gcloud storage
jdhayhurst May 2, 2025
155f6de
don't pass through the uid and gid to dockerfile
jdhayhurst May 8, 2025
bc3091e
POS working except for croissant
jdhayhurst May 13, 2025
17f188b
WIP: added justfile with most recipes
jdhayhurst May 16, 2025
cabd548
enable BigQuery to run locally
jdhayhurst May 19, 2025
03a6050
fix bigquery recipes in justfile
jdhayhurst May 20, 2025
9f3137b
fix paths for croissant
jdhayhurst May 20, 2025
6a624a8
update startup script
jdhayhurst May 21, 2025
03aaa13
update ot_croissant dep
jdhayhurst May 22, 2025
fe13503
templating for bigquery from hcl config
jdhayhurst May 23, 2025
1645b97
justfile recipe for gcs sync, gcs sync script, update default vars.
jdhayhurst May 23, 2025
c49d564
added FTP sync scripts and justfile recipe
jdhayhurst May 23, 2025
a5da037
format justfile
jdhayhurst May 23, 2025
d183902
Update README, clean up HCL configuration, fix python types
jdhayhurst May 28, 2025
8a4eb7d
minor fixes
jdhayhurst Jun 2, 2025
6302d6d
update ot-croissant
jdhayhurst Jun 2, 2025
a0f4283
added protein coding coords dataset
jdhayhurst Jun 2, 2025
d212a0f
add protein coding coords to data prep step
jdhayhurst Jun 2, 2025
a24344c
added croissant step
jdhayhurst Jun 2, 2025
b590a30
bump ot croissant version
jdhayhurst Jun 3, 2025
a0bce1f
enable prometheus metrics in clickhouse
remo87 Jun 2, 2025
eb8013c
add metrics port
remo87 Jun 2, 2025
d89ea2f
update data version
remo87 Jun 2, 2025
b6ed67c
update branch
remo87 Jun 3, 2025
a797d29
bump croissant, add clickhouse summary
jdhayhurst Jun 5, 2025
facf639
add sync before snapshot, clean up
jdhayhurst Jun 5, 2025
7c49a9c
load croissant in opensearch
jdhayhurst Jun 11, 2025
2b8f1b4
minor fix to croissant
jdhayhurst Jun 11, 2025
34bf26a
write as ndjson
jdhayhurst Jun 11, 2025
146b560
update ot_croissant conf
jdhayhurst Jun 11, 2025
80118b2
revert to gsutil
jdhayhurst Jun 13, 2025
5 changes: 4 additions & 1 deletion .gitignore
@@ -17,4 +17,7 @@ htmlcov
.DS_Store

# output working directory
work/
work/

# credentials
.credentials/
98 changes: 68 additions & 30 deletions README.md
@@ -13,29 +13,85 @@ data to BigQuery.
Check out the [config.yaml](config/config.yaml) file to see the steps and the tasks that
make them up.

TODO:
- [X] croissant
- [X] prep data for loading
- [X] load Clickhouse
- [X] load OpenSearch
- [X] create google disk snapshots for ch and os
- [X] create data tarballs
- [X] load BigQuery
- [ ] GCS release
- [ ] FTP release

## Installation and running

### Dependencies

- [uv](https://docs.astral.sh/uv/) is the package manager for POS. It is compatible with pip, so you can also fall back to pip if you feel more comfortable.
- [just](https://just.systems/) is the POS command interface, a make-like tool better suited to this purpose than GNU make.
- [terraform](https://developer.hashicorp.com/terraform) is the IaC tool used to assemble and destroy the necessary infrastructure.
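
A quick sanity check that the tooling is installed:

```bash
uv --version
just --version
terraform -version
```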


### Recipes

```bash
$ just
Platform Output Support
Set the profile with `just profile=foo <RECIPE>` to use `profiles/foo.tfvars`. Defaults to `profiles/default.tfvars` if no profile is set.
help
snapshots # Create Google cloud disk snapshots (Clickhouse and OpenSearch).
clean # Clean the credentials and the infrastructure used for creating the Google cloud disk snapshots
clean_all
bigquerydev # Big Query Dev
bigqueryprod # Big Query Prod
gcssync # Sync data to production Google Cloud Storage
ftpsync # Sync data to FTP
```

Private recipes are prefixed with '_' in the [justfile](justfile).

#### Configuring the profile for any of the recipes
All the configuration you need can be done by modifying a profile such as the [default one](profiles/default.tfvars).
This file is symlinked to [terraform.tfvars](deployment/terraform.tfvars) when the recipes are executed.
If you want to use a different profile, copy the default to `profiles/foo.tfvars` and pass it whenever you run `just` (the profile parameter must come _before_ the recipe):
```bash
just profile=foo <RECIPE>
```
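
For example, to work from a copy of the default profile:

```bash
cp profiles/default.tfvars profiles/foo.tfvars
# edit profiles/foo.tfvars as needed, then pass the profile before the recipe
just profile=foo snapshots
```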

#### Create the data backend for the platform
```bash
just snapshots
```
- starts a Google Compute Engine instance with external disks (one for ClickHouse, one for OpenSearch)
- runs the otter steps for croissant, ClickHouse and OpenSearch - see [startup script](deployment/startup.sh)
- _optional_: create tarballs (see [Configuration](#configuration))
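
Once the run finishes, the resulting snapshots can be checked with `gcloud`; the name filter below is illustrative, since the actual snapshot names come from the profile:

```bash
gcloud compute snapshots list --filter="name:clickhouse OR name:opensearch"
```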

#### Release data to BigQuery
```bash
# dev
just bigquerydev

# prod
just bigqueryprod
```
- creates a local otter config based on the terraform.tfvars profile.
- runs the otter step for releasing to Google BigQuery.
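
To confirm the datasets landed, the `bq` CLI can list them; the project and dataset names below are illustrative, not taken from this repository:

```bash
# list datasets in the target project, then tables in one dataset
bq ls --project_id=open-targets-dev
bq ls open-targets-dev:platform
```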

#### Release data to FTP
```bash
just ftpsync
```
- uses the terraform.tfvars profile as configuration.
- submits a shell script ([bin/ftp_sync.sh](bin/ftp_sync.sh)) as a SLURM job on the EBI HPC, which runs the Google Cloud SDK container via Singularity.
- from the container, it syncs the data from GCS to the EBI FTP area.
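
A roughly equivalent manual submission of that batch script, with illustrative values for the environment variables it expects, would be:

```bash
sbatch \
  --export=ALL,RELEASE_ID_PROD=25.06,DATA_LOCATION_SOURCE=open-targets-pre-data-releases/25.06,PATH_OPS_ROOT_FOLDER=/nfs/ftp/private/otftpuser/ops,PATH_OPS_CREDENTIALS=/nfs/ftp/private/otftpuser/ops/gcp-credentials.json \
  bin/ftp_sync.sh
```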

#### Release data to GCS
```bash
just gcssync
```
- uses the terraform.tfvars profile as configuration.
- runs a `gcloud storage rsync` command ([bin/gcs_sync.sh](bin/gcs_sync.sh)) to sync one GCS bucket with another.
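
The script can also be invoked directly; the bucket paths and release id here are illustrative, and note that the script currently passes `--dry-run` to `gcloud storage rsync` (see the TODO in the script):

```bash
bash bin/gcs_sync.sh \
  gs://open-targets-pre-data-releases/25.06 \
  gs://open-targets-data-releases/25.06 \
  false  # is_partner_instance: "true" skips the sync entirely
```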


### Configuration

You should only ever need to configure the terraform profile. This is used as the point of configuration even where terraform is not actually used.
See [here](#configuring-the-profile-for-any-of-the-recipes) for details.

Terraform will apply this configuration, or in the cases where terraform is not used, an HCL library will read and apply the configuration as needed.

All the configuration lives in the [config](config) folder, which contains the following:

- Main config for otter: [config.yaml](config/config.yaml)
@@ -45,24 +45,6 @@

It's configured by default to load all the necessary datasets, but it can be modified. Make sure that every dataset name in config.yaml has a corresponding entry in datasets.yaml, and so on.

### Create the OT Platform backend
1. start a google vm and clone this repo, see installation.
    1. ideally something like n2-highmem-96 - reserve half the mem for the JVM
    2. external disk for opensearch
    3. external disk for clickhouse
2. opensearch (each step needs to be completed before starting the next)
    1. `uv run pos -p 300 -c config/config.yaml -s open_search_prep_all`
    2. `uv run pos -p 100 -c config/config.yaml -s open_search_load_all`
    3. `uv run pos -c config/config.yaml -s open_search_stop`
    4. `uv run pos -c config/config.yaml -s open_search_disk_snapshot`
    5. `uv run pos -c config/config.yaml -s open_search_tarball`
3. clickhouse (each step needs to be completed before starting the next)
    1. `uv run pos -c config/config.yaml -s clickhouse_load_all`
    2. `uv run pos -c config/config.yaml -s clickhouse_stop`
    3. `uv run pos -c config/config.yaml -s clickhouse_disk_snapshot`
    4. `uv run pos -c config/config.yaml -s clickhouse_tarball`



## Copyright

161 changes: 161 additions & 0 deletions bin/ftp_sync.sh
@@ -0,0 +1,161 @@
#!/bin/bash

#SBATCH -J ot_platform_ebi_ftp_sync
#SBATCH -t 14:00:00
#SBATCH --mem=10G
#SBATCH -e /nfs/ftp/private/otftpuser/slurm/logs/ot_platform_ebi_ftp_sync-%J.err
#SBATCH -o /nfs/ftp/private/otftpuser/slurm/logs/ot_platform_ebi_ftp_sync-%J.out
#SBATCH --mail-type=BEGIN,END,FAIL

# This is an SLURM job that uploads Open Targets Platform release data to EBI FTP Service

# Defaults
[ -z "${RELEASE_ID_PROD}" ] && export RELEASE_ID_PROD='dev.default_release_id'
[ -z "${DATA_LOCATION_SOURCE}" ] && export DATA_LOCATION_SOURCE="open-targets-pre-data-releases/${RELEASE_ID_PROD}"
[ -z "${PATH_OPS_ROOT_FOLDER}" ] && echo "PATH to operations root folder is required" && exit 1
[ -z "${PATH_OPS_CREDENTIALS}" ] && echo "PATH to operations credentials folder is required" && exit 1

# TODO - Credentials file default

# Helpers and environment
export job_name="${SLURM_JOB_NAME}-${SLURM_JOB_ID}"
export path_private_base='/nfs/ftp/private/otftpuser'
export path_private_base_ftp_upload="${path_private_base}/opentargets_ebi_ftp_upload"
export path_ebi_ftp_base='/nfs/ftp/public/databases/opentargets/platform'
export path_ebi_ftp_destination="${path_ebi_ftp_base}/${RELEASE_ID_PROD}"
export path_ebi_ftp_destination_latest="${path_ebi_ftp_base}/latest"
export path_slurm_base="${path_private_base}/slurm"
export path_slurm_logs="${path_slurm_base}/logs"
export path_slurm_job_workdir="${path_slurm_base}/${job_name}"
export path_slurm_job_logs="${path_slurm_logs}/${job_name}"
export path_slurm_job_stderr="${path_slurm_job_logs}/output.err"
export path_slurm_job_stdout="${path_slurm_job_logs}/output.out"
export path_slurm_job_sbatch_stderr="${path_slurm_logs}/${job_name}.err"
export path_slurm_job_sbatch_stdout="${path_slurm_logs}/${job_name}.out"
export path_data_source="gs://${DATA_LOCATION_SOURCE}/"
export filename_release_checksum="release_data_integrity"

# Logging functions
function log_heading {
tag=$1
shift
echo -e "[=[$tag]= ---| $@ |--- ]"
}

function log_body {
tag=$1
shift
echo -e "\t[$tag]---> $@"
}

function log_error {
echo -e "[ERROR] $@"
}

# Environment summary
function print_summary {
echo -e "[=================================== JOB DATASHEET =====================================]"
echo -e "\t- Release Number : ${RELEASE_ID_PROD}"
echo -e "\t- Job Name : ${job_name}"
echo -e "\t- PATH Private base : ${path_private_base}"
echo -e "\t- PATH EBI FTP base destination : ${path_ebi_ftp_base}"
echo -e "\t- PATH EBI FTP destination folder : ${path_ebi_ftp_destination}"
echo -e "\t- PATH EBI FTP destination latest : ${path_ebi_ftp_destination_latest}"
echo -e "\t- PATH SLURM base : ${path_slurm_base}"
echo -e "\t- PATH SLURM logs : ${path_slurm_logs}"
echo -e "\t- PATH SLURM Job workdir : ${path_slurm_job_workdir}"
echo -e "\t- PATH SLURM Job logs stderr : ${path_slurm_job_stderr}"
echo -e "\t- PATH SLURM Job logs stdout : ${path_slurm_job_stdout}"
echo -e "\t- PATH SLURM SBATCH Job logs stderr : ${path_slurm_job_sbatch_stderr}"
echo -e "\t- PATH SLURM SBATCH Job logs stdout : ${path_slurm_job_sbatch_stdout}"
echo -e "\t- PATH Data Source : ${path_data_source}"
echo -e "\t- PATH Operations root folder : ${PATH_OPS_ROOT_FOLDER}"
echo -e "\t- PATH Operations credentials folder : ${PATH_OPS_CREDENTIALS}"
echo -e "[===================================|==============|====================================]"
}

# Prepare destination folders
function make_dirs {
log_body "MKDIR" "Check/Create ${path_slurm_base}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_base} && chmod 770 ${path_slurm_base}"
log_body "MKDIR" "Check/Create ${path_slurm_logs}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_logs} && chmod 770 ${path_slurm_logs}"
log_body "MKDIR" "Check/Create ${path_slurm_job_workdir}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_job_workdir} && chmod 770 ${path_slurm_job_workdir}"
log_body "MKDIR" "Check/Create ${path_slurm_job_logs}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_job_logs} && chmod 770 ${path_slurm_job_logs}"
log_body "MKDIR" "Check/Create ${path_ebi_ftp_destination}"
sudo -u otftpuser -- bash -c "mkdir ${path_ebi_ftp_destination} && chmod 775 ${path_ebi_ftp_destination}"
}

# GCP functions
function activate_service_account {
log_heading "GCP" "Activating service account at '${PATH_OPS_CREDENTIALS}'"
singularity exec docker://google/cloud-sdk:latest gcloud auth activate-service-account --key-file=${PATH_OPS_CREDENTIALS}
}

function deactivate_service_account {
active_account=$(singularity exec docker://google/cloud-sdk:latest gcloud auth list --filter=status:ACTIVE --format="value(account)")
log_heading "GCP" "Deactivating service account '${active_account}'"
singularity exec docker://google/cloud-sdk:latest gcloud auth revoke ${active_account}
}

function pull_data_from_gcp {
log_heading "GCP" "Pulling data from GCP, '${path_data_source}' ---> to ---> '${path_ebi_ftp_destination}"
singularity exec --bind /nfs/ftp:/nfs/ftp docker://google/cloud-sdk:latest gcloud storage rsync -r -x ^input/fda-inputs/* -x ^output/etl/parquet/failedMatches/* -x ^output/etl/json/failedMatches/* ${path_data_source} ${path_ebi_ftp_destination}/
log_heading "PERMISSIONS" "Adjusting file tree permissions at '${path_ebi_ftp_destination}'"
# We don't really need to do this for the production folder, but it's nice to have the permissions set correctly (although you'd need to be 'otftpuser' to do it)
find ${path_ebi_ftp_destination} -type d -exec chmod 775 {} \;
find ${path_ebi_ftp_destination} -type f -exec chmod 644 {} \;
log_heading "GCP" "Done pulling data from GCP"
}

# Helper functions
function compute_checksums {
log_heading "CHECKSUM" "Compute SHA1 checksum for all the files in this release"
current_dir=`pwd`
cd ${path_ebi_ftp_destination}
find . -type f ! -iname "${filename_release_checksum}*" -exec sha1sum \{} \; > ${filename_release_checksum}
sha1sum ${filename_release_checksum} > ${filename_release_checksum}.sha1
log_heading "DATA" "Add the data integrity information back to the source bucket"
singularity exec --bind /nfs/ftp:/nfs/ftp docker://google/cloud-sdk:latest gsutil cp ${filename_release_checksum}* ${path_data_source}
cd ${current_dir}
log_heading "CHECKSUM" "Done computing SHA1 checksum for all the files in this release"
}

function ftp_update_latest_symlink {
log_heading "FTP" "Update latest symlink"
log_heading "LATEST" "Update 'latest' link at '${path_ebi_ftp_destination_latest}' to point to '${path_ebi_ftp_destination}'"
ln -nsf $( basename ${path_ebi_ftp_destination} ) ${path_ebi_ftp_destination_latest}
}

# Bootstrap
function bootstrap {
log_heading "BOOTSTRAP" "Bootstrapping"
activate_service_account
log_heading "FILESYSTEM" "Preparing destination folders"
make_dirs
log_heading "BOOTSTRAP" "Done"
}

# Cleanup
function cleanup {
log_heading "CLEAN" "Cleaning up"
deactivate_service_account
log_body "CLEAN" "Remove operations folder at '${PATH_OPS_ROOT_FOLDER}'"
rm -rf ${PATH_OPS_ROOT_FOLDER}
log_heading "CLEAN" "Done"
}




# Main
print_summary
log_heading "JOB" "Starting job '${job_name}'"
bootstrap
pull_data_from_gcp
compute_checksums
ftp_update_latest_symlink
cleanup
log_heading "JOB" "END OF JOB ${job_name}"
36 changes: 36 additions & 0 deletions bin/gcs_sync.sh
@@ -0,0 +1,36 @@
#!/bin/bash

# Arguments
DATA_LOCATION_SOURCE=$1
DATA_LOCATION_TARGET=$2
IS_PARTNER_INSTANCE=$3

# Check if the required arguments are provided
if [ -z "${DATA_LOCATION_SOURCE}" ] || [ -z "${DATA_LOCATION_TARGET}" ] || [ -z "${IS_PARTNER_INSTANCE}" ]; then
echo "Usage: $0 <data_location_source> <data_location_target> <is_partner_instance>"
exit 1
fi


function add_trailing_slash {
local path=$1
if [[ "${path}" != */ ]]; then
path="${path}/"
fi
echo "${path}"
}

DATA_LOCATION_SOURCE=$(add_trailing_slash "${DATA_LOCATION_SOURCE}")
DATA_LOCATION_TARGET=$(add_trailing_slash "${DATA_LOCATION_TARGET}")

# Check if it's partner instance
if [ "${IS_PARTNER_INSTANCE}" = true ]; then
echo "This is a PARTNER INSTANCE, SKIPPING the sync process"
exit 0
fi

# TODO: check which patterns to exclude
# TODO: remove dry-run flag
echo "=== Syncing data from: ${DATA_LOCATION_SOURCE} --> ${DATA_LOCATION_TARGET}"
gcloud storage rsync -r --dry-run -x '^input/fda-inputs/*' -x '^output/etl/parquet/failedMatches/*' -x '^output/etl/json/failedMatches/*' "${DATA_LOCATION_SOURCE}" "${DATA_LOCATION_TARGET}"
echo "=== Sync complete."