
3867 terraform #65

Open

wants to merge 35 commits into base: 3742-create-data-backend

Changes from all commits

35 commits
dbf5dae
added terraform to start vm, disks and install deps for user
jdhayhurst Apr 17, 2025
d85b252
terraform: templating the config, running the full pipeline
jdhayhurst Apr 24, 2025
ccad7fb
suppress gsutil rsync stdout and stderr from subprocess
jdhayhurst Apr 24, 2025
69b6efb
organise terraform, parallel run for ch and os, avoid conflicts with …
jdhayhurst Apr 24, 2025
ce7e9c7
gsutil to gcloud storage
jdhayhurst May 2, 2025
155f6de
don't pass through the uid and gid to dockerfile
jdhayhurst May 8, 2025
bc3091e
POS working except for croissant
jdhayhurst May 13, 2025
17f188b
WIP: added justfile with most recipes
jdhayhurst May 16, 2025
cabd548
enable BigQuery to run locally
jdhayhurst May 19, 2025
03a6050
fix bigquery recipes in justfile
jdhayhurst May 20, 2025
9f3137b
fix paths for croissant
jdhayhurst May 20, 2025
6a624a8
update startup script
jdhayhurst May 21, 2025
03aaa13
update ot_croissant dep
jdhayhurst May 22, 2025
fe13503
templating for bigquery from hcl config
jdhayhurst May 23, 2025
1645b97
justfile recipe for gcs sync, gcs sync script, update default vars.
jdhayhurst May 23, 2025
c49d564
added FTP sync scripts and justfile recipe
jdhayhurst May 23, 2025
a5da037
format justfile
jdhayhurst May 23, 2025
d183902
Update README, clean up HCL configuration, fix python types
jdhayhurst May 28, 2025
8a4eb7d
minor fixes
jdhayhurst Jun 2, 2025
6302d6d
update ot-croissant
jdhayhurst Jun 2, 2025
a0f4283
added protein coding coords dataset
jdhayhurst Jun 2, 2025
d212a0f
add protein coding coords to data prep step
jdhayhurst Jun 2, 2025
a24344c
added croissant step
jdhayhurst Jun 2, 2025
b590a30
bump ot croissant version
jdhayhurst Jun 3, 2025
a0bce1f
enable prometheus metrics in clickhouse
remo87 Jun 2, 2025
eb8013c
add metrics port
remo87 Jun 2, 2025
d89ea2f
update data version
remo87 Jun 2, 2025
b6ed67c
update branch
remo87 Jun 3, 2025
a797d29
bump croissant, add clickhouse summary
jdhayhurst Jun 5, 2025
facf639
add sync before snapshot, clean up
jdhayhurst Jun 5, 2025
7c49a9c
load croissant in opensearch
jdhayhurst Jun 11, 2025
2b8f1b4
minor fix to croissant
jdhayhurst Jun 11, 2025
34bf26a
write as ndjson
jdhayhurst Jun 11, 2025
146b560
update ot_croissant conf
jdhayhurst Jun 11, 2025
80118b2
revert to gsutil
jdhayhurst Jun 13, 2025
5 changes: 4 additions & 1 deletion .gitignore
@@ -17,4 +17,7 @@ htmlcov
.DS_Store

# output working directory
work/
work/

# credentials
.credentials/
98 changes: 68 additions & 30 deletions README.md
@@ -13,29 +13,85 @@ data to BigQuery.
Check out the [config.yaml](config/config.yaml) file to see the steps and the tasks that
make them up.

TODO:
- [X] croissant
- [X] prep data for loading
- [X] load Clickhouse
- [X] load OpenSearch
- [X] create google disk snapshots for ch and os
- [X] create data tarballs
- [X] load BigQuery
- [ ] GCS release
- [ ] FTP release

## Installation and running

### Dependencies

- [uv](https://docs.astral.sh/uv/) is the package manager for POS. It is compatible with pip, so you can also fall back to pip if you feel more comfortable.
- [just](https://just.systems/) is the POS command interface, a make-like tool better suited to this purpose than GNU make.
- [terraform](https://developer.hashicorp.com/terraform) is the IaC tool used to assemble and destroy the necessary infrastructure.
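
A quick sanity check that the tooling is installed:

```bash
uv --version
just --version
terraform -version
```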


### Recipes

```bash
$ just
Platform Output Support
Set the profile with `just profile=foo <RECIPE>` to use `profiles/foo.tfvars`. Defaults to `profiles/default.tfvars` if no profile is set.
help
snapshots # Create Google cloud disk snapshots (Clickhouse and OpenSearch).
clean # Clean the credentials and the infrastructure used for creating the Google cloud disk snapshots
clean_all
bigquerydev # Big Query Dev
bigqueryprod # Big Query Prod
gcssync # Sync data to production Google Cloud Storage
ftpsync # Sync data to FTP
```

Private recipes are prefixed with '_' in the [justfile](justfile).

#### Configuring the profile for any of the recipes
All the configuration you need can be done by modifying a profile such as the [default one](profiles/default.tfvars).
This file is symlinked to [terraform.tfvars](deployment/terraform.tfvars) when the recipes are executed.
If you want to use a different profile, copy the default to `profiles/foo.tfvars` and pass it whenever you run `just` (the profile parameter must come _before_ the recipe):
```bash
just profile=foo <RECIPE>
```
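
For example, to work from a copy of the default profile:

```bash
cp profiles/default.tfvars profiles/foo.tfvars
# edit profiles/foo.tfvars as needed, then pass the profile before the recipe
just profile=foo snapshots
```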

#### Create the data backend for the platform
```bash
just snapshots
```
- starts a Google Compute Engine instance with external disks (one for ClickHouse, one for OpenSearch)
- runs the otter steps for croissant, ClickHouse and OpenSearch - see [startup script](deployment/startup.sh)
- _optional_: create tarballs (see [Configuration](#configuration))
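
Once the run finishes, the resulting snapshots can be checked with `gcloud`; the name filter below is illustrative, since the actual snapshot names come from the profile:

```bash
gcloud compute snapshots list --filter="name:clickhouse OR name:opensearch"
```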

#### Release data to BigQuery
```bash
# dev
just bigquerydev

# prod
just bigqueryprod
```
- creates a local otter config based on the terraform.tfvars profile.
- runs the otter step for releasing to Google BigQuery.
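
To confirm the datasets landed, the `bq` CLI can list them; the project and dataset names below are illustrative, not taken from this repository:

```bash
# list datasets in the target project, then tables in one dataset
bq ls --project_id=open-targets-dev
bq ls open-targets-dev:platform
```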

#### Release data to FTP
```bash
just ftpsync
```
- uses the terraform.tfvars profile as configuration.
- submits a shell script ([bin/ftp_sync.sh](bin/ftp_sync.sh)) as a SLURM job on the EBI HPC, which runs the Google Cloud SDK container via Singularity.
- from the container, it syncs the data from GCS to the EBI FTP area.
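
A roughly equivalent manual submission of that batch script, with illustrative values for the environment variables it expects, would be:

```bash
sbatch \
  --export=ALL,RELEASE_ID_PROD=25.06,DATA_LOCATION_SOURCE=open-targets-pre-data-releases/25.06,PATH_OPS_ROOT_FOLDER=/nfs/ftp/private/otftpuser/ops,PATH_OPS_CREDENTIALS=/nfs/ftp/private/otftpuser/ops/gcp-credentials.json \
  bin/ftp_sync.sh
```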

#### Release data to GCS
```bash
just gcssync
```
- uses the terraform.tfvars profile as configuration.
- runs a `gcloud storage rsync` command ([bin/gcs_sync.sh](bin/gcs_sync.sh)) to sync one GCS bucket with another.
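
The script can also be invoked directly; the bucket paths and release id here are illustrative, and note that the script currently passes `--dry-run` to `gcloud storage rsync` (see the TODO in the script):

```bash
bash bin/gcs_sync.sh \
  gs://open-targets-pre-data-releases/25.06 \
  gs://open-targets-data-releases/25.06 \
  false  # is_partner_instance: "true" skips the sync entirely
```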


### Configuration

You should only ever need to configure the terraform profile. This is used as the point of configuration even where terraform is not actually used.
See [here](#configuring-the-profile-for-any-of-the-recipes) for details.

Terraform will apply this configuration, or in the cases where terraform is not used, an HCL library will read and apply the configuration as needed.

All the configuration lives in the [config](config) folder, which contains the following:

- Main config for otter: [config.yaml](config/config.yaml)
@@ -45,24 +45,6 @@

It's configured by default to load all the necessary datasets, but it can be modified. Make sure that every dataset name in config.yaml has a corresponding entry in datasets.yaml, and so on.

### Create the OT Platform backend
1. start a google vm and clone this repo, see installation.
    1. ideally something like n2-highmem-96 - reserve half the mem for the JVM
    2. external disk for opensearch
    3. external disk for clickhouse
2. opensearch (each step needs to be completed before starting the next)
    1. `uv run pos -p 300 -c config/config.yaml -s open_search_prep_all`
    2. `uv run pos -p 100 -c config/config.yaml -s open_search_load_all`
    3. `uv run pos -c config/config.yaml -s open_search_stop`
    4. `uv run pos -c config/config.yaml -s open_search_disk_snapshot`
    5. `uv run pos -c config/config.yaml -s open_search_tarball`
3. clickhouse (each step needs to be completed before starting the next)
    1. `uv run pos -c config/config.yaml -s clickhouse_load_all`
    2. `uv run pos -c config/config.yaml -s clickhouse_stop`
    3. `uv run pos -c config/config.yaml -s clickhouse_disk_snapshot`
    4. `uv run pos -c config/config.yaml -s clickhouse_tarball`



## Copyright

161 changes: 161 additions & 0 deletions bin/ftp_sync.sh
@@ -0,0 +1,161 @@
#!/bin/bash

#SBATCH -J ot_platform_ebi_ftp_sync
#SBATCH -t 14:00:00
#SBATCH --mem=10G
#SBATCH -e /nfs/ftp/private/otftpuser/slurm/logs/ot_platform_ebi_ftp_sync-%J.err
#SBATCH -o /nfs/ftp/private/otftpuser/slurm/logs/ot_platform_ebi_ftp_sync-%J.out
#SBATCH --mail-type=BEGIN,END,FAIL

# This is an SLURM job that uploads Open Targets Platform release data to EBI FTP Service

# Defaults
[ -z "${RELEASE_ID_PROD}" ] && export RELEASE_ID_PROD='dev.default_release_id'
[ -z "${DATA_LOCATION_SOURCE}" ] && export DATA_LOCATION_SOURCE="open-targets-pre-data-releases/${RELEASE_ID_PROD}"
[ -z "${PATH_OPS_ROOT_FOLDER}" ] && echo "PATH to operations root folder is required" && exit 1
[ -z "${PATH_OPS_CREDENTIALS}" ] && echo "PATH to operations credentials folder is required" && exit 1

# TODO - Credentials file default

# Helpers and environment
export job_name="${SLURM_JOB_NAME}-${SLURM_JOB_ID}"
export path_private_base='/nfs/ftp/private/otftpuser'
export path_private_base_ftp_upload="${path_private_base}/opentargets_ebi_ftp_upload"
export path_ebi_ftp_base='/nfs/ftp/public/databases/opentargets/platform'
export path_ebi_ftp_destination="${path_ebi_ftp_base}/${RELEASE_ID_PROD}"
export path_ebi_ftp_destination_latest="${path_ebi_ftp_base}/latest"
export path_slurm_base="${path_private_base}/slurm"
export path_slurm_logs="${path_slurm_base}/logs"
export path_slurm_job_workdir="${path_slurm_base}/${job_name}"
export path_slurm_job_logs="${path_slurm_logs}/${job_name}"
export path_slurm_job_stderr="${path_slurm_job_logs}/output.err"
export path_slurm_job_stdout="${path_slurm_job_logs}/output.out"
export path_slurm_job_sbatch_stderr="${path_slurm_logs}/${job_name}.err"
export path_slurm_job_sbatch_stdout="${path_slurm_logs}/${job_name}.out"
export path_data_source="gs://${DATA_LOCATION_SOURCE}/"
export filename_release_checksum="release_data_integrity"

# Logging functions
function log_heading {
tag=$1
shift
echo -e "[=[$tag]= ---| $@ |--- ]"
}

function log_body {
tag=$1
shift
echo -e "\t[$tag]---> $@"
}

function log_error {
echo -e "[ERROR] $@"
}

# Environment summary
function print_summary {
echo -e "[=================================== JOB DATASHEET =====================================]"
echo -e "\t- Release Number : ${RELEASE_ID_PROD}"
echo -e "\t- Job Name : ${job_name}"
echo -e "\t- PATH Private base : ${path_private_base}"
echo -e "\t- PATH EBI FTP base destination : ${path_ebi_ftp_base}"
echo -e "\t- PATH EBI FTP destination folder : ${path_ebi_ftp_destination}"
echo -e "\t- PATH EBI FTP destination latest : ${path_ebi_ftp_destination_latest}"
echo -e "\t- PATH SLURM base : ${path_slurm_base}"
echo -e "\t- PATH SLURM logs : ${path_slurm_logs}"
echo -e "\t- PATH SLURM Job workdir : ${path_slurm_job_workdir}"
echo -e "\t- PATH SLURM Job logs stderr : ${path_slurm_job_stderr}"
echo -e "\t- PATH SLURM Job logs stdout : ${path_slurm_job_stdout}"
echo -e "\t- PATH SLURM SBATCH Job logs stderr : ${path_slurm_job_sbatch_stderr}"
echo -e "\t- PATH SLURM SBATCH Job logs stdout : ${path_slurm_job_sbatch_stdout}"
echo -e "\t- PATH Data Source : ${path_data_source}"
echo -e "\t- PATH Operations root folder : ${PATH_OPS_ROOT_FOLDER}"
echo -e "\t- PATH Operations credentials folder : ${PATH_OPS_CREDENTIALS}"
echo -e "[===================================|==============|====================================]"
}

# Prepare destination folders
function make_dirs {
log_body "MKDIR" "Check/Create ${path_slurm_base}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_base} && chmod 770 ${path_slurm_base}"
log_body "MKDIR" "Check/Create ${path_slurm_logs}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_logs} && chmod 770 ${path_slurm_logs}"
log_body "MKDIR" "Check/Create ${path_slurm_job_workdir}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_job_workdir} && chmod 770 ${path_slurm_job_workdir}"
log_body "MKDIR" "Check/Create ${path_slurm_job_logs}"
sudo -u otftpuser -- bash -c "mkdir ${path_slurm_job_logs} && chmod 770 ${path_slurm_job_logs}"
log_body "MKDIR" "Check/Create ${path_ebi_ftp_destination}"
sudo -u otftpuser -- bash -c "mkdir ${path_ebi_ftp_destination} && chmod 775 ${path_ebi_ftp_destination}"
}

# GCP functions
function activate_service_account {
log_heading "GCP" "Activating service account at '${PATH_OPS_CREDENTIALS}'"
singularity exec docker://google/cloud-sdk:latest gcloud auth activate-service-account --key-file=${PATH_OPS_CREDENTIALS}
}

function deactivate_service_account {
active_account=$(singularity exec docker://google/cloud-sdk:latest gcloud auth list --filter=status:ACTIVE --format="value(account)")
log_heading "GCP" "Deactivating service account '${active_account}'"
singularity exec docker://google/cloud-sdk:latest gcloud auth revoke ${active_account}
}

function pull_data_from_gcp {
log_heading "GCP" "Pulling data from GCP, '${path_data_source}' ---> to ---> '${path_ebi_ftp_destination}"
singularity exec --bind /nfs/ftp:/nfs/ftp docker://google/cloud-sdk:latest gcloud storage rsync -r -x ^input/fda-inputs/* -x ^output/etl/parquet/failedMatches/* -x ^output/etl/json/failedMatches/* ${path_data_source} ${path_ebi_ftp_destination}/
log_heading "PERMISSIONS" "Adjusting file tree permissions at '${path_ebi_ftp_destination}'"
# We don't really need to do this for the production folder, but it's nice to have the permissions set correctly (although you'd need to be 'otftpuser' to do it)
find ${path_ebi_ftp_destination} -type d -exec chmod 775 {} \;
find ${path_ebi_ftp_destination} -type f -exec chmod 644 {} \;
log_heading "GCP" "Done pulling data from GCP"
}

# Helper functions
function compute_checksums {
log_heading "CHECKSUM" "Compute SHA1 checksum for all the files in this release"
current_dir=`pwd`
cd ${path_ebi_ftp_destination}
find . -type f ! -iname "${filename_release_checksum}*" -exec sha1sum \{} \; > ${filename_release_checksum}
sha1sum ${filename_release_checksum} > ${filename_release_checksum}.sha1
log_heading "DATA" "Add the data integrity information back to the source bucket"
singularity exec --bind /nfs/ftp:/nfs/ftp docker://google/cloud-sdk:latest gsutil cp ${filename_release_checksum}* ${path_data_source}
cd ${current_dir}
log_heading "CHECKSUM" "Done computing SHA1 checksum for all the files in this release"
}

function ftp_update_latest_symlink {
log_heading "FTP" "Update latest symlink"
log_heading "LATEST" "Update 'latest' link at '${path_ebi_ftp_destination_latest}' to point to '${path_ebi_ftp_destination}'"
ln -nsf $( basename ${path_ebi_ftp_destination} ) ${path_ebi_ftp_destination_latest}
}

# Bootstrap
function bootstrap {
log_heading "BOOTSTRAP" "Bootstrapping"
activate_service_account
log_heading "FILESYSTEM" "Preparing destination folders"
make_dirs
log_heading "BOOTSTRAP" "Done"
}

# Cleanup
function cleanup {
log_heading "CLEAN" "Cleaning up"
deactivate_service_account
log_body "CLEAN" "Remove operations folder at '${PATH_OPS_ROOT_FOLDER}'"
rm -rf ${PATH_OPS_ROOT_FOLDER}
log_heading "CLEAN" "Done"
}




# Main
print_summary
log_heading "JOB" "Starting job '${job_name}'"
bootstrap
pull_data_from_gcp
compute_checksums
ftp_update_latest_symlink
cleanup
log_heading "JOB" "END OF JOB ${job_name}"
36 changes: 36 additions & 0 deletions bin/gcs_sync.sh
@@ -0,0 +1,36 @@
#!/bin/bash

# Arguments
DATA_LOCATION_SOURCE=$1
DATA_LOCATION_TARGET=$2
IS_PARTNER_INSTANCE=$3

# Check if the required arguments are provided
if [ -z "${DATA_LOCATION_SOURCE}" ] || [ -z "${DATA_LOCATION_TARGET}" ] || [ -z "${IS_PARTNER_INSTANCE}" ]; then
echo "Usage: $0 <data_location_source> <data_location_target> <is_partner_instance>"
exit 1
fi


function add_trailing_slash {
local path=$1
if [[ "${path}" != */ ]]; then
path="${path}/"
fi
echo "${path}"
}

DATA_LOCATION_SOURCE=$(add_trailing_slash "${DATA_LOCATION_SOURCE}")
DATA_LOCATION_TARGET=$(add_trailing_slash "${DATA_LOCATION_TARGET}")

# Check if it's partner instance
if [ "${IS_PARTNER_INSTANCE}" = true ]; then
echo "This is a PARTNER INSTANCE, SKIPPING the sync process"
exit 0
fi

# TODO: check which patterns to exclude
# TODO: remove dry-run flag
echo "=== Syncing data from: ${DATA_LOCATION_SOURCE} --> ${DATA_LOCATION_TARGET}"
gcloud storage rsync -r --dry-run -x '^input/fda-inputs/*' -x '^output/etl/parquet/failedMatches/*' -x '^output/etl/json/failedMatches/*' "${DATA_LOCATION_SOURCE}" "${DATA_LOCATION_TARGET}"
echo "=== Sync complete."